Enhancing spatial detection accuracy for syndromic surveillance with street level incidence data
© Savory et al; licensee BioMed Central Ltd. 2010
Received: 6 November 2009
Accepted: 18 January 2010
Published: 18 January 2010
The Department of Defense Military Health System operates a syndromic surveillance system that monitors medical records at more than 450 non-combat Military Treatment Facilities (MTF) worldwide. The Electronic Surveillance System for Early Notification of Community-based Epidemics (ESSENCE) uses both temporal and spatial algorithms to detect disease outbreaks. This study focuses on spatial detection and attempts to improve the effectiveness of the ESSENCE implementation of the spatial scan statistic by increasing the spatial resolution of incidence data from zip codes to street address level.
Influenza-Like Illness (ILI) was used as a test syndrome to develop methods to improve the spatial accuracy of detected alerts. Simulated incident clusters of various sizes were superimposed on real ILI incidents from the 2008/2009 influenza season. Clusters were detected using the spatial scan statistic and their displacement from simulated loci was measured. Detected cluster size distributions were also evaluated for compliance with simulated cluster sizes.
Relative to the ESSENCE zip code based method, clusters detected using street level incidents were displaced on average 65% less for 2 and 5 mile radius clusters and 31% less for 10 mile radius clusters. Detected cluster size distributions for the street address method were quasi normal and sizes tended to slightly exceed simulated radii. ESSENCE methods yielded fragmented distributions and had high rates of zero radius and oversized clusters.
Spatial detection accuracy improved notably with regard to both location and size when incidents were geocoded to street addresses rather than zip code centroids. Since street address geocoding success rates were only 73.5%, zip codes were still used for more than one quarter of ILI cases. Thus, further advances in spatial detection accuracy are dependent on systematic improvements in the collection of individual address information.
In the wake of the recent H1N1 pandemic, interest in medical surveillance for early outbreak detection and medical situational awareness continues to grow. Central to this trend are syndromic surveillance systems that employ near real-time monitoring of local or regional clinical records to detect the occurrence of unusual patterns of disease syndromes. Detection methods include both temporal and spatial algorithms, with the former typically receiving more attention. This study focuses on spatial detection and attempts to improve the effectiveness of a commonly used statistic by increasing the spatial resolution of patient location data. The syndromic surveillance system behind this investigation is the Electronic Surveillance System for Early Notification of Community-based Epidemics (ESSENCE), which is administered by the Military Health System (MHS) within the Department of Defense (DoD). ESSENCE monitors outpatient visits at non-combat clinics in more than 450 Military Treatment Facilities (MTF) worldwide.
ESSENCE performs daily spatial detection analysis to search for irregular clustering of cases in each of 10 disease syndromes. This study compares the zip code based spatial detection method currently used by ESSENCE with alternative scenarios that vary both the spatial resolution of patient data and statistical nature of the analysis. Using Influenza-Like Illness (ILI) as a test syndrome, the Bernoulli statistical model, street address level patient data, and an alternative background population estimate have been explored as a means of improving the authenticity and spatial accuracy of detected alerts. Accuracy was assessed by superimposing simulated disease clusters on ILI case data and measuring the displacement of detected clusters. Detected cluster size distributions were also evaluated for compliance with simulated cluster sizes.
In a similar study, Olson et al [1] examined the effect of varying levels of address precision on cluster detection by integrating simulated case clusters with actual syndromic surveillance data. Detection accuracy was assessed by considering the proportion of simulated points in identified clusters. In contrast, this study uses the exact location and geometry of identified clusters as a measure of accuracy. Perhaps more significant is the difference in scale: Olson examines a single medical community whereas this study considers regional locales within a global surveillance system. Ozonoff et al [2] investigated spatial detection of simulated data at 12 different levels of aggregation. Again, the proportion of simulated data points in identified disease clusters was used as a measure of accuracy. The proportion of points correctly and incorrectly included in detected clusters was calculated to measure false negative and false positive rates. In both studies, detection accuracy was greatest when exact locations were used and decreased with increasing spatial aggregation. This study attempts to quantify this supposition through spatial analysis. Regarding spatial aggregation, Grubesic and Matisziw [3] provide a detailed analysis of the pitfalls associated with using zip codes for epidemiological analysis.
Spatial Scan Statistic
The spatial detection software used by ESSENCE is adapted from SaTScan, a program developed by Kulldorff [4], which is widely accepted as the de facto standard for spatial-temporal detection of disease clusters. Kulldorff's scan statistics are typically used to detect clusters of disease incidents in both time and space. With ESSENCE, purely spatial methods are used and a non-mathematical description of that statistic is given here. In short, a circular window is scanned across geographic space, evaluating the number of observed and expected incidents inside the window at each location. Multiple window sizes are assessed at each location and adjustments are made for the variable density of the background population and the number of cases observed. A cluster is recorded if the null hypothesis, that the spatial distribution of incidents is a random sample from an expected distribution, is rejected. Ultimately the overall maximum likelihood cluster is determined, i.e., the cluster least likely to be due to chance. A probability value is assigned to this and any additional clusters detected. Statistics are based on one of several models, which include the Poisson and Bernoulli models employed in this study. Details of the statistical theory behind the scan statistic are described further by Kulldorff [5].
ESSENCE spatial detection is based on the Poisson model. Here, the cases at each location are considered to be Poisson distributed, and the expected number of cases is proportional to the population size. This model requires case and population counts for each data location. In ESSENCE, syndrome cases are aggregated by zip code and the zip code centroids are used as the geographic locations. Obtaining actual population data typically presents a challenge when relying on medical records since treatment facilities do not serve the entire regional population. This is especially true with MTFs, where only military personnel and their families make up the population. One conventional solution, known as the Baseline-mean approach, utilizes recent historical records to determine expected cases. For example, in ESSENCE, data from the 4 week period prior to the analysis date are used to calculate the mean daily syndrome incidents for each zip code. These are adjusted for the day of week and holidays and used as the background population. Note that using street addresses with the Poisson model is problematic given that the statistic requires population data for each case location and households do not have a background population per se.
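The window scan and Poisson scoring described above can be illustrated with a minimal sketch. It is not the ESSENCE or SaTScan implementation: the planar coordinates, the function names (`poisson_llr`, `most_likely_cluster`) and the tuple layout are assumptions made for illustration, and the score is the Poisson log likelihood ratio from Kulldorff's 1997 formulation. Significance testing, which SaTScan performs by Monte Carlo replication, is omitted.

```python
import math
from typing import List, Optional, Tuple

# Hypothetical layout: (x, y, cases, population), planar coordinates in miles.
Location = Tuple[float, float, int, float]


def poisson_llr(cases_in: float, expected_in: float, total_cases: float) -> float:
    """Poisson log likelihood ratio for one window (0 if no excess of cases)."""
    c, e, C = cases_in, expected_in, total_cases
    if e <= 0 or c <= e:
        return 0.0
    outside = (C - c) * math.log((C - c) / (C - e)) if C > c else 0.0
    return c * math.log(c / e) + outside


def most_likely_cluster(
    locs: List[Location], max_pop_frac: float = 0.5
) -> Tuple[float, Optional[int], float]:
    """Return (score, centre index, radius) of the highest scoring window."""
    total_cases = sum(l[2] for l in locs)
    total_pop = sum(l[3] for l in locs)
    best: Tuple[float, Optional[int], float] = (0.0, None, 0.0)
    for i, (cx, cy, centre_cases, _) in enumerate(locs):
        if centre_cases == 0:
            continue  # search only from case locations
        # Grow the window by adding locations in order of distance from the centre.
        ordered = sorted(locs, key=lambda l: math.hypot(l[0] - cx, l[1] - cy))
        cases = pop = 0.0
        for x, y, c, p in ordered:
            cases += c
            pop += p
            if pop > max_pop_frac * total_pop:
                break  # window exceeds the maximum cluster size
            # Expected cases are proportional to the population in the window.
            expected = total_cases * pop / total_pop
            score = poisson_llr(cases, expected, total_cases)
            if score > best[0]:
                best = (score, i, math.hypot(x - cx, y - cy))
    return best
```

A window that contains a single location has a radius of zero, which is how aggregation to zip code centroids can give rise to zero radius clusters.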
The Bernoulli model is an alternative scan statistic wherein cases and "non-cases" are analyzed, e.g., patients with ILI symptoms and those without ILI symptoms. These variables are referred to as cases and controls respectively, and their sum is considered the population. Thus, controls can be obtained from records on the date of analysis in contrast to the historical baseline data used in the Poisson implementation. An additional advantage of this model is that it provides for the use of street addresses since cases and non-cases are input at their respective geographic locations.
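For comparison, the Bernoulli score for a single window can be sketched as follows, with cases and controls counted inside the window and over the whole study area. The function name and argument layout are hypothetical; the likelihood ratio follows Kulldorff's published Bernoulli formulation rather than any ESSENCE-specific code.

```python
import math


def _xlogy(x: float, y: float) -> float:
    """x * log(y), with the convention that 0 * log(0) = 0."""
    return 0.0 if x == 0 else x * math.log(y)


def bernoulli_llr(cases_in: int, controls_in: int,
                  total_cases: int, total_controls: int) -> float:
    """Bernoulli log likelihood ratio for one window (0 if no excess risk)."""
    c = cases_in
    n = cases_in + controls_in          # "population" inside the window
    C = total_cases
    N = total_cases + total_controls    # "population" of the study area
    if n == 0 or n == N:
        return 0.0
    p_in = c / n
    p_out = (C - c) / (N - n)
    if p_in <= p_out:
        return 0.0  # only windows with elevated case rates are of interest
    log_alt = (_xlogy(c, p_in) + _xlogy(n - c, 1 - p_in)
               + _xlogy(C - c, p_out)
               + _xlogy((N - n) - (C - c), 1 - p_out))
    p_null = C / N
    log_null = _xlogy(C, p_null) + _xlogy(N - C, 1 - p_null)
    return log_alt - log_null
```

Because each case and control enters at its own coordinates, the same function applies whether the locations are zip code centroids or geocoded street addresses.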
ESSENCE Influenza-Like Illness Syndrome: ICD-9 diagnoses
• VIRAL INFECTION NOS
• OTITIS MEDIA NOS
• ACUTE SINUSITIS NOS
• ACUTE URI NOS
• PNEUMONIA, ORGANISM NOS
• CHILLS (WITHOUT FEVER)
Spatial Detection Scenarios
• ESSENCE - Zip
• Bernoulli - Zip
• Bernoulli - Street
Although patient confidentiality is recognized as an important issue linked to the use of patient addresses, consideration of this issue is beyond the scope of this study. There are numerous studies that address this issue exclusively [8, 9]. The authors believe the protection of patient identities to be of utmost importance and that such protection is fully achievable in systems modelled as part of this study.
Resolving Patient Addresses
Ideally, medical surveillance seeks to determine the source of disease outbreaks where it occurs, be it residence or workplace. However, due to the nature of available datasets, only home addresses are generally provided, including many incomplete or erroneous entries. In this study, most MTF patients are active duty personnel living in close proximity to military installations. However, a considerable number of beneficiaries reside in outlying areas within a couple of hours' drive. The situation is further complicated by patients away from their primary residence on temporary duty who submit non-local addresses. To resolve this issue, we applied a version of the "100-mile rule" as outlined by Xing et al [6]:
1) Determine the distance of the patient address from the MTF address.
2) If the distance is within 100 miles, use the patient zip code or street address.
3) If the distance is greater than 100 miles, use the MTF zip code or street address.
4) If the home zip code or street address field is empty or cannot be geocoded, use the MTF zip code or street address.
This method was applied to both zip code and street address based scenarios. The main assumption is that an address located more than a short drive from the MTF is a distant permanent address submitted by a patient visiting the local installation. Of the patients in this study, roughly 13% were visiting patients as defined by the 100-mile rule.
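The rule above can be expressed as a short function. This is a sketch under stated assumptions: locations are already geocoded to latitude/longitude pairs, a failed or empty geocode is represented by `None`, and distance is measured along a great circle with an assumed mean Earth radius of 3958.8 miles. The names `resolve_location` and `distance_miles` are hypothetical.

```python
import math
from typing import Optional, Tuple

LatLon = Tuple[float, float]
EARTH_RADIUS_MILES = 3958.8  # assumed mean Earth radius


def distance_miles(a: LatLon, b: LatLon) -> float:
    """Great-circle distance between two (lat, lon) points in miles."""
    phi1, phi2 = math.radians(a[0]), math.radians(b[0])
    dphi = phi2 - phi1
    dlmb = math.radians(b[1] - a[1])
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(min(1.0, math.sqrt(h)))


def resolve_location(patient: Optional[LatLon], mtf: LatLon,
                     limit_miles: float = 100.0) -> Tuple[LatLon, str]:
    """Apply the 100-mile rule; return the location to map and its source."""
    if patient is None:
        # Step 4: empty or non-geocodable address falls back to the MTF.
        return mtf, "mtf_missing_address"
    if distance_miles(patient, mtf) > limit_miles:
        # Step 3: treated as a visiting patient with a distant permanent address.
        return mtf, "mtf_visiting_patient"
    # Step 2: local patient, keep the patient's own location.
    return patient, "patient"
```

Returning the source label alongside the coordinates makes it straightforward to count how many incidents were mapped to the MTF rather than to a patient address.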
Calculation of Background Populations
The scan statistic requires the calculation of expected incidents at each analysis location and this factor is partially based on the density of the background population. Data streams that reliably provide military and dependent population data are not currently available for use in ESSENCE. The conventional solution used by ESSENCE has been referred to as the Baseline-mean approach. An alternative solution was possible through use of the Bernoulli statistical model and its case/non-case population representation. Methods for making this estimate are detailed by scenario below.
• ESSENCE - Zip Code: Population is derived from ILI case data from a 28 day baseline period prior to the analysis date. That is, the mean expected cases for each zip code are calculated from the baseline days of the same day of week as the analysis date. A two day buffer separates the analysis date and baseline period. The purpose of the buffer period is to diminish the effect a current outbreak might have on the baseline statistics. Federal holidays are grouped with Sundays to model patient behavior. Zip codes with a "population" of zero are set to 1 to comply with scan statistic requirements.
• Bernoulli - Zip Code: With the Bernoulli model, the sum of case and control (non-case) counts represents the population. Here, ILI and non-ILI visits are aggregated by zip code for a given analysis date and used as case and control counts, respectively. Non-cases consist of all patient visits that do not contain any of the ICD-9-CM codes mapped to the ILI syndrome, i.e., all other visits, including injuries and well visits.
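The ESSENCE - Zip Code population estimate described above can be sketched as follows. The exact placement of the baseline window is an assumption (here, the 28 days ending just before the two day buffer), as are the function names and the nested dictionary layout of the daily counts.

```python
from datetime import date, timedelta
from typing import Dict, FrozenSet, Iterable


def day_class(d: date, holidays: FrozenSet[date]) -> int:
    """Weekday index (Mon=0 .. Sun=6); federal holidays are grouped with Sundays."""
    return 6 if d in holidays else d.weekday()


def baseline_population(
    daily_counts: Dict[date, Dict[str, int]],
    analysis_date: date,
    zip_codes: Iterable[str],
    holidays: FrozenSet[date] = frozenset(),
    baseline_days: int = 28,
    buffer_days: int = 2,
) -> Dict[str, float]:
    """Mean ILI cases per zip code on baseline days matching the analysis weekday."""
    target = day_class(analysis_date, holidays)
    start = analysis_date - timedelta(days=buffer_days + baseline_days)
    window = [start + timedelta(days=i) for i in range(baseline_days)]
    matched = [d for d in window if day_class(d, holidays) == target]

    population: Dict[str, float] = {}
    for z in zip_codes:
        if matched:
            # Days with no record for a zip code count as zero cases.
            total = sum(daily_counts.get(d, {}).get(z, 0) for d in matched)
            mean = total / len(matched)
        else:
            mean = 0.0
        # A zero "population" is set to 1 to satisfy the scan statistic.
        population[z] = mean if mean > 0 else 1.0
    return population
```

For the Bernoulli - Zip Code scenario no baseline is needed: the case and control counts for the analysis date are simply summed per zip code.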
Table: Geocoding Success Rates (headers: Success Rate (%); Success Rate w/100 Mile Rule; MTF - 100 Mile)
Detection Accuracy Analysis
Table: Simulated Cluster Specifications (header: Cluster Radius (mi))
Subsequently, the displacement of detected clusters from the loci was measured. For this measure, detection accuracy is inversely proportional to displacement distance. Detected cluster size distributions were also analyzed for comparison with simulated radii. A close match between the size of original and detected clusters is employed as a second measure of accuracy. As part of the size analysis the rate of zero radius clusters was also recorded. Zero radius clusters, otherwise known as 'singlets', have dubious worth since they represent only a single generalized location.
The majority of the data processing for this project was accomplished with the relational database and development tools provided by Microsoft Access 2003. Numerous applications were developed using Visual Basic for Applications (VBA) to process the clinical data, including resolving patient addresses, extracting ILI cases by location (zip/street address), generating population and control data by location, and formatting and exporting these data for input to SaTScan. Geocoding to the street address level was accomplished with ArcView Geographic Information System software (ESRI, Inc.). Key SaTScan analysis specifications are listed below.
Type of Analysis: Purely Spatial
Probability Model: Poisson or Bernoulli
Search Locations: Search only from case locations
Maximum Cluster Size: 50% of population at risk/40 miles radius
Criteria for Secondary Clusters: No Geographical Overlap
Post-processing applications were developed to measure the displacement of detected clusters from simulated clusters and to produce cluster size distributions. The Haversine formula [12], an equation for measuring spherical distances on the Earth's surface, was used to measure displacement distances.
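A minimal sketch of the Haversine calculation is given below. The mean Earth radius of 3958.8 miles and the function name are assumptions; the original post-processing applications are not reproduced here.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # assumed mean Earth radius


def haversine_miles(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in miles between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    # Haversine of the central angle between the two points.
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    # Clamp to 1.0 to guard against floating point overshoot near antipodal points.
    return 2 * EARTH_RADIUS_MILES * math.asin(min(1.0, math.sqrt(h)))
```

Displacement is then the distance between the centre of a detected cluster and the centre of the simulated cluster it corresponds to.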
To measure accuracy, detection was applied to simulated disease clusters superimposed on ILI case data for 35 randomly selected dates. Detected cluster radius distributions were also analyzed for comparison with actual simulated sizes. In addition, the rate of zero radius clusters was assessed. Comparison of the ability of zip based and street address based methods to correctly detect both the location and size of these clusters provides a relative measure of detection accuracy.
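The accuracy measures named above reduce to a few summary statistics. The sketch below assumes displacements and detected radii are held in plain lists of miles and uses the population standard deviation for the coefficient of variation; both choices, and the function names, are illustrative assumptions rather than the original analysis code.

```python
import statistics
from typing import Dict, Sequence


def displacement_summary(displacements: Sequence[float]) -> Dict[str, float]:
    """Mean, standard deviation and coefficient of variation (sd / mean)."""
    mean = statistics.fmean(displacements)
    sd = statistics.pstdev(displacements)
    cv = sd / mean if mean else float("nan")
    return {"mean": mean, "sd": sd, "cv": cv}


def zero_radius_rate(radii: Sequence[float]) -> float:
    """Fraction of detected clusters with a radius of zero ('singlets')."""
    return sum(1 for r in radii if r == 0) / len(radii)


def relative_reduction(reference_mean: float, candidate_mean: float) -> float:
    """Fractional reduction in mean displacement relative to a reference method."""
    return (reference_mean - candidate_mean) / reference_mean
```

With these definitions, a method whose mean displacement is 3.5 miles against a reference of 10 miles would show a relative reduction of 0.65; these figures are purely illustrative and are not results from the study.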
Cluster Location Analysis
Table: Simulated Cluster Displacement Statistics (headers: Cluster Radius (mi); Coefficient of Variation (c_v = σ/μ))
Cluster Size Analysis
Table: Rate of Zero Radius Clusters at Simulated Radii (columns: 2 Mile Radius; 5 Mile Radius; 10 Mile Radius. Rows: ESSENCE - Zip; Bernoulli - Zip; Bernoulli - Street)
The increasing use of geo-spatial technologies in public health and epidemiology has made geocoding, the process of assigning approximated geographic coordinates to address data, a common data processing operation. Consequently, the quality of geocoding methods and their effect on analytical outcomes have become a concern. Issues such as geocoding accuracy, success rates, and address data quality can substantially affect or even drive conclusions drawn from spatial analysis [13, 14]. The impetus behind this study was the need to increase the geographic specificity of cluster detection methods commonly used in syndromic surveillance. Increasing the spatial resolution of geocoding methods from the zip code to the street address level was tested towards this end. Interestingly, further improvement of geocoding methods is ultimately what is needed to realize viable street level spatial detection.
Cluster Location Analysis
The Bernoulli-Street scenario yielded the most promising results in the location analysis, reducing mean displacement by 65% relative to ESSENCE-Zip for the 2 and 5 mile radius clusters and by 31% for the 10 mile radius clusters. The larger clusters displayed more modest improvements due to the influence of distant unassociated cases on detection results. It is noteworthy that the results for the Bernoulli-Zip method did not differ significantly from those of ESSENCE-Zip. This indicates that zip code based spatial detection yields less than optimal spatial accuracy regardless of the scan statistic probability model.
Analysis results also indicated that greater displacements, and therefore reduced detection accuracy, may occur in rural areas where zip code areas are larger. An additional confounding issue noted at rural installations is the tendency for military personnel to live in close proximity on base and submit non-geocodable building information as their address. Urban-based MTFs responded more consistently to the Bernoulli-Street method since zip codes are smaller and populations more dispersed.
Cluster Size Analysis
In general, the Bernoulli-Street cluster size distributions were quasi normal and sizes slightly exceeded the simulated radii. The offset of the distributions is to be expected since actual incident data surrounding the simulated clusters naturally expand detected sizes. In contrast, ESSENCE-Zip distributions were fragmented, at least partially due to the incidental spacing of zip code centroids. They also had unacceptably high rates of zero radius clusters, known as 'singlets'. These have dubious worth since they represent only a single generalized location within one zip code area. Surveillance alerts that are raised as a result of singlets or relatively large clusters tend to be taken less seriously than small to moderate size clusters. No singlets were found in analysis results for the Bernoulli-Street method.
In summary, the measures employed for this study indicate that the Bernoulli-Street scenario displayed the best detection accuracy with regard to both location and size. Displacement of detected clusters from simulated loci was reduced by 65% for 2 and 5 mile radius clusters and by 31% for 10 mile radius clusters when street level incident data were used with the Bernoulli statistical model. Cluster size distributions were also more favorable than with either zip code based test scenario.
Improving geocoding accuracy and success rates may further enhance the accuracy of street level spatial detection. Since ESSENCE-Zip uses zip code level geocoding, it tends to concentrate incidents at zip code centroids. This incident "stacking" contributes to the high rates of zero radius clusters. A significant amount of stacking occurs even with street addresses, i.e., zip code centroids are used if geocoding fails or the patient is not a local resident (100 mile rule). Given that street address geocoding success rates were only 73.5%, zip codes were used for more than one quarter of mapped incidents.
Further improvements in spatial detection accuracy are dependent on systematic improvements in the collection of individual-level address information. Individual data utilized by MTFs are generally captured by the Defense Enrollment Eligibility Reporting System (DEERS) and pushed to AHLTA on a monthly basis. Patients should be encouraged to submit accurate home street addresses in a strict, standardized format during enrollment. Visiting MTF patients present a dilemma since their permanent address is automatically used. Ideally, provision should be made for visiting patients to submit a workplace or lodging address in the interest of successful disease surveillance. The best case scenario would be to record both work and residential addresses in all cases, but presently this may be difficult to implement. Lastly, on-base residents must be encouraged to submit geocodable street addresses rather than building or barrack names. Enhanced address records would improve geocoding success rates, resulting in less reliance on zip code centroids. Consequently, the intuitive effect of street address level geocoding is realized: true incident spatial patterns emerge and the location and size of detected clusters are more accurate.
During the initial stages of this study Dr. Kenneth L. Cox served as a Colonel in the U.S. Air Force, Director of Global Health Surveillance - Force Health Protection & Readiness (FHP&R), and Chief Functional Proponent for ESSENCE. FHP&R is a program of the Office of the Assistant Secretary of Defense (DASD) - Health Affairs and is the sponsor of this research.
We thank Colonel Michael G. Butel, current Director of Global Health Surveillance, FHP&R, for authorizing this publication. We acknowledge the Planned Systems International, Inc. program management team of Daniel Boccolucci and Marilyn Ehrhardt for their support during this study. We also thank Howard Burkom of the Johns Hopkins University Applied Physics Laboratory for discussions on analytical approaches and insights into the ESSENCE implementation of the spatial scan statistic.
1. Olson KL, Grannis SJ, Mandl KD: Privacy protection versus cluster detection in spatial epidemiology. Am J Pub Health. 2006, 96: 11. 10.2105/AJPH.2005.069526.
2. Ozonoff A, Jeffery C, Manjourides J, White LF, Pagano M: Effect of spatial resolution on cluster detection: a simulation study. Int J Health Geogr. 2007, 6: 52. 10.1186/1476-072X-6-52.
3. Grubesic TH, Matisziw TC: On the use of ZIP codes and ZIP code tabulation areas (ZCTAs) for the spatial analysis of epidemiological data. Int J Health Geogr. 2006, 5: 58. 10.1186/1476-072X-5-58.
4. Kulldorff M, Information Management Services, Inc: SaTScan™ V7.0: Software for the spatial and space-time scan statistics. 2006, http://www.satscan.org
5. Kulldorff M: A spatial scan statistic. Commun Stat Theory Methods. 1997, 26: 1481-1496. 10.1080/03610929708831995.
6. Xing J, Burkom H, Moniz L, Edgerton J, Leuze M, Tokars J: Evaluation of sliding baseline methods for spatial estimation for cluster detection in the biosurveillance system. Int J Health Geogr. 2009, 8: 45. 10.1186/1476-072X-8-45.
7. ESSENCE: Block 2 - End User Manual. Executive information/decision support program office, TRICARE management activity, U.S. army medical research, Document ID: MHSESSENCE-DO0001-44-200712-v1-1. 2007, https://eids.ha.osd.mil
8. Cassa C, Grannis SJ, Overhage JM, Mandl KD: A context-sensitive approach to anonymizing spatial surveillance data: impact on outbreak detection. J Am Med Inform Assoc. 2006, 39: 160-165.
9. Armstrong M, Rushton G, Zimmerman D: Geographically masking health data to preserve confidentiality. Stats in Med. 1999, 18: 497-525. 10.1002/(SICI)1097-0258(19990315)18:5<497::AID-SIM45>3.0.CO;2-#.
10. Hutwagner L, Thompson W, Seeman GM, Treadwell T: The bioterrorism preparedness and response early aberration reporting system (EARS). J Urban Health. 2003, 80 (2): i89-i96.
11. Cassa C, Olson KL, Mandl KD: A software tool for creating simulated outbreaks to benchmark surveillance systems. BMC Med Inform Decis Mak. 2005, 5: 22. 10.1186/1472-6947-5-22.
12. Sinnott RW: Virtues of the Haversine. Sky and Telescope. 1984, 68 (2): 159-
13. Wey CL, Griesse J, Kightlinger L, Wimberly MC: Geographic variability in geocoding success for West Nile virus cases in South Dakota. Health Place. 2009, 15 (4): 1108-1114. 10.1016/j.healthplace.2009.06.001.
14. Cayo MR, Talbot TO: Positional error in automated geocoding of residential addresses. Int J Health Geogr. 2003, 2 (1): 10. 10.1186/1476-072X-2-10.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.