Evaluating geographic imputation approaches for zip code level data: an application to a study of pediatric diabetes
© Hibbert et al; licensee BioMed Central Ltd. 2009
Received: 27 April 2009
Accepted: 8 October 2009
Published: 8 October 2009
There is increasing interest in the study of place effects on health, facilitated in part by geographic information systems. Incomplete or missing address information reduces geocoding success. Several geographic imputation methods have been suggested to overcome this limitation. Accuracy evaluation of these methods can be focused at the level of individuals and at higher group-levels (e.g., spatial distribution).
We evaluated the accuracy of eight geo-imputation methods for address allocation from ZIP codes to census tracts at the individual and group level. The spatial apportioning approaches underlying the imputation methods included four fixed (deterministic) and four random (stochastic) allocation methods using land area, total population, population under age 20, and race/ethnicity as weighting factors. Data included more than 2,000 geocoded cases of diabetes mellitus among youth aged 0-19 in four U.S. regions. The imputed distribution of cases across tracts was compared to the true distribution using a chi-squared statistic.
At the individual level, population-weighted (total or under age 20) fixed allocation showed the greatest level of accuracy, with correct census tract assignments averaging 30.01% across all regions, followed by the race/ethnicity-weighted random method (23.83%). The true distribution of cases across census tracts was that 58.2% of tracts exhibited no cases, 26.2% had one case, 9.5% had two cases, and less than 3% had three or more. This distribution was best captured by random allocation methods, with no significant differences (p-value > 0.90). However, significant differences in distributions based on fixed allocation methods were found (p-value < 0.0003).
Fixed imputation methods seemed to yield greatest accuracy at the individual level, suggesting use for studies on area-level environmental exposures. Fixed methods result in artificial clusters in single census tracts. For studies focusing on spatial distribution of disease, random methods seemed superior, as they most closely replicated the true spatial distribution. When selecting an imputation approach, researchers should consider carefully the study aims.
There has long been recognition that place or geographic area can impact health behaviors and health outcomes [1–4]. The advent of geographic information system (GIS) technology and its widespread dissemination have enormously simplified the identification and characterization of place via address match geocoding, i.e. the assignment of geographic coordinates to a street address through interpolation based on a proportional distance between addresses in a record and an address range for a street segment.
The validity of epidemiological studies involving geocoded data relies on the proportion of cases that can be geocoded and on the positional accuracy of the geocodes. Successful address match geocoding relies, in part, on the availability of complete and correct address information. However, address information in combination with health attributes is often considered protected health information under the Health Insurance Portability and Accountability Act (HIPAA). Thus only limited address information, such as a ZIP code, may be available for research.
In the presence of missing or incomplete address data, investigators must decide whether to discard the incomplete data or, based on a variety of assumptions, allocate them to a representative location, e.g. a geometric center or centroid of the smallest geographic unit available, which in the US is typically a ZIP code. Discarding incomplete data ensures a database with a high level of accuracy, but may result in a significant reduction in total cases available for analysis. Furthermore, if incompleteness of address data is associated with other attributes under study (e.g. if incomplete data are spatially correlated or predominantly located in rural areas), exclusion could lead to a geographic selection bias.
Allocating cases to the smallest geographic unit available for all data points ensures that the database retains the maximum possible number of cases, although this method has several drawbacks. When allocated to the centroid of a geographic unit, cases may fall into uninhabited areas such as lakes or national parks. Also, the geographic units themselves may vary greatly in size and location over a short period of time, as has been shown for postal ZIP codes in the United States (U.S.).
Geo-imputation introduces a third option by using available address data in conjunction with assumptions based on available demographic or geographic data. Spatial apportionment of data has a long history of utilization in social sciences [10–14]. More recently, geo-imputation has become popular in epidemiological studies for allocation of individual study participants to geographic units [15, 16]. Very little is known, however, with respect to the accuracy of geographic imputation methods.
The purpose of the current study was to evaluate the accuracy and utility of a variety of geo-imputation approaches for ZIP code data at the individual level (i.e. correct allocation of an individual case to a census tract) and at the group level (i.e. appropriate spatial distribution of cases across tracts). In the context of a project on the spatial epidemiology of diabetes, we used data from the SEARCH for Diabetes in Youth study. We also aimed to describe the data at hand with respect to address completeness and geocoding success.
The present study was approved by Institutional Review Boards (IRB) from all participating entities and conducted using HIPAA compliant procedures. The SEARCH for Diabetes in Youth Study was initiated in 2000 to estimate the population prevalence and incidence of all types of diabetes in youth in the U.S. by age, gender, race/ethnicity, and diabetes type in four geographically defined populations and two membership/health-plan-based populations using consistent methodology for case ascertainment and classification. For the present study, data from the four geographically defined populations were included, which represent four distinct geographic U.S. regions of varying urban and rural characteristics, population densities, and socioeconomic status. Study sites included Colorado (all 64 counties), Ohio (six counties surrounding Cincinnati, OH, including two in Kentucky and one in Indiana), South Carolina (all 46 counties), and Washington (five counties surrounding Seattle, WA). The study areas varied widely with respect to urban and rural landscapes. Washington and Ohio were exclusively confined to the Seattle, WA and Cincinnati, OH areas respectively, which contained the highest mean population densities at the Census tract level (1379.22 and 1327.66 per square kilometer, respectively). South Carolina contained the largest amount of rural landscape with a mean tract population density of 416.77 per square kilometer. The regional land area sizes varied from 6,826 km2 in the Ohio site to 269,736 km2 in the Colorado site. Land area was calculated in ArcGIS 9.3 using an equal area projection.
Geocoding of data
The study population included 2,538 youth aged 0-19 years: 2,068 cases diagnosed with type 1 and type 2 diabetes between 2002 and 2003, and 470 other diabetes cases that were part of a SEARCH case-control study. Cases were geocoded based on street address (address matching), ZIP code, or county depending on the availability of address information. The 2000 TIGER (Topographically Integrated Geographic Encoding and Referencing) road network was used for geocoding in ArcGIS 9.3 and was complemented with ZIP Code Tabulation Areas (ZCTA). The ZCTA was first used in the 2000 Census, and was created to overcome the difficulties in defining the land area encompassed by a ZIP code. ZCTAs are created through the aggregation of Census blocks into areas that most closely correspond with ZIP code areas.
Table 1. Data completeness and geocoding success by site (columns: Full Address Available; Missing Address (ZIP code only); Geocoded Full Address)
Data Cleaning and Quality
In a first step, topological anomalies in the ZCTA boundaries were removed. While the ZCTA files contain polygons for individual ZIP codes, water bodies and areas where no addressable postal locations existed were also contained in the file. Unlike other statistical entities from the Census, such as a tract or block group, ZCTAs do not necessarily require a contiguous boundary. This means that a given ZCTA may actually be composed of two or more noncontiguous polygons. These anomalies in the ZCTA boundaries file were dealt with using an approach similar to that taken by Grubesic and Matisziw, whereby polygons identified by the Census as water polygons and polygons containing no addresses were removed. ZCTAs composed of multiple polygons were dissolved into a single polygon based on a common ZIP code.
Calculation of Census Tract Weighting Factors
Weighting by land area (columns: ZCTA Area (km2); Tract Area in ZCTA (km2); Proportion Tract Area in ZCTA)
Two general types of geo-imputation methods were evaluated: fixed (deterministic) and random (stochastic) approaches. For each of these, both population-based and area-based weighting factors were applied.
For the fixed allocation approaches, all cases within a ZCTA were allocated to the tract with the largest weighting factor as described above (i.e. area, total population, or total youth population weighting factor). These methods are abbreviated in the text and tables as a) FixedArea: Fixed area-weighted allocation; b) FixedPop: Fixed total population-weighted allocation; and c) Fixed019: Fixed population-weighted using 0-19 age group. In addition, we performed the most commonly used fixed allocation method which allocates a case to the ZIP centroid, which was designated d) FixedZip.
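The fixed allocation rule described above can be sketched in a few lines of Python. This is a minimal illustration only: the tract identifiers, the weights, the function name, and the tie-breaking rule are hypothetical and are not taken from the GIS tool used in the study.

```python
def fixed_allocation(tract_weights):
    """Return the tract with the largest weighting factor in one ZCTA.

    tract_weights maps a tract identifier to its weight within the ZCTA
    (for example, its share of ZCTA land area or population).
    """
    if not tract_weights:
        raise ValueError("ZCTA has no candidate tracts")
    # Ties are broken by tract identifier so the result is reproducible.
    # This is an assumption; the source does not say how ties were handled.
    return max(sorted(tract_weights), key=lambda t: tract_weights[t])


# Hypothetical ZCTA overlapping three tracts, weighted by land area share.
area_weights = {"tract_A": 0.84, "tract_B": 0.07, "tract_C": 0.09}
assigned = fixed_allocation(area_weights)
print(assigned)
```

Because every case in the ZCTA receives the same tract, the same call serves all cases sharing a ZIP code, which is also why fixed methods concentrate cases in single tracts.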
Weighting and ranges for allocation to tracts (columns: Tract Area in ZCTA (km2); Proportion Tract Area in ZCTA; ranges for allocation)
Ranges for allocation: 0.00 - 0.84; 0.84 - 0.91; 0.91 - 0.96; 0.96 - 0.99; 0.99 - 1.00
The random allocation methods are abbreviated in the text and tables as a) RandArea: Random area-weighted allocation; b) RandPop: Random total population-weighted allocation; c) Rand019: Random population-weighted using 0-19 year age group; and d) RandRace019: Random method using allocation by population distribution of 0-19 year old population by race/ethnicity. Race/ethnicity groups considered included non-Hispanic white, African American, Asian, Native American, and multi-ethnic/other. These categories represented all possible groups within the dataset.
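One way to read the allocation ranges listed above (0.00 - 0.84, 0.84 - 0.91, and so on) is as cumulative bounds over the tract weights: a uniform random draw falls into exactly one range, which selects the tract. The sketch below illustrates that idea in Python under this assumption. The function names and the seed are hypothetical, and the proportions are simply the widths implied by the listed ranges.

```python
import bisect
import itertools
import random


def cumulative_bounds(weights):
    """Upper bound of each tract's allocation range, in tract order."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return list(itertools.accumulate(w / total for w in weights))


def allocate(bounds, u):
    """Index of the tract whose range contains the draw u in [0, 1)."""
    index = bisect.bisect_right(bounds, u)
    # Guard against floating-point round-off in the final bound.
    return min(index, len(bounds) - 1)


def random_allocation(weights, n_cases, seed=None):
    """Assign n_cases to tracts with probability proportional to weight."""
    rng = random.Random(seed)
    bounds = cumulative_bounds(weights)
    return [allocate(bounds, rng.random()) for _ in range(n_cases)]


# Range widths implied by 0.00-0.84, 0.84-0.91, 0.91-0.96, 0.96-0.99, 0.99-1.00.
proportions = [0.84, 0.07, 0.05, 0.03, 0.01]
bounds = cumulative_bounds(proportions)
assignments = random_allocation(proportions, n_cases=10, seed=1)
```

Swapping the area proportions for total population, youth population, or race/ethnicity-specific youth population shares would give the other random variants without changing the allocation step.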
Data are presented descriptively as percentages and absolute numbers. Individual level accuracy assessments are represented as the percentage of cases allocated correctly to a tract through geo-imputation methods. The distribution of cases to tracts achieved by the allocation methods was compared to the true distribution using the Chi-square statistic.
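The two accuracy measures can be illustrated with a short Python sketch. It assumes a goodness-of-fit form of the Chi-square statistic in which the true tract counts serve as the expected values; the source does not state which form was used, and the function names and example values are hypothetical.

```python
def percent_correct(imputed_tracts, true_tracts):
    """Percentage of cases whose imputed tract equals the true tract."""
    if len(imputed_tracts) != len(true_tracts):
        raise ValueError("both lists must describe the same cases")
    if not true_tracts:
        raise ValueError("at least one case is required")
    matches = sum(i == t for i, t in zip(imputed_tracts, true_tracts))
    return 100.0 * matches / len(true_tracts)


def chi_square_statistic(observed, expected):
    """Pearson chi-square statistic: sum of (O - E)^2 / E over categories.

    observed: number of tracts in each case-count category after imputation.
    expected: number of tracts in each category in the true data (assumed
    to play the role of the expected counts).
    """
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have equal length")
    if any(e <= 0 for e in expected):
        raise ValueError("expected counts must be positive")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical example: four cases, two allocated to the correct tract.
accuracy = percent_correct(["a", "b", "c", "d"], ["a", "x", "c", "y"])
statistic = chi_square_statistic([12, 18], [10, 20])
```

A p-value would then come from the Chi-square distribution with the appropriate degrees of freedom, which a statistics package can supply.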
Address data characteristics and geocoding characteristics are summarized in Table 1. No site had complete address information for all cases, but both Colorado and Ohio had a markedly higher proportion of full addresses available than South Carolina and Washington, which were unable to obtain full addresses on a fraction of cases due to HIPAA-related restrictions. An address is considered to be full if it contains a street number, street name, street type and ZIP code. South Carolina had a markedly higher number of addresses with PO Box or RR (rural route) designations. Both the Ohio and Colorado sites had the overall highest proportion of successfully geocoded addresses (CO = 86.4%, OH = 89.5%). The geocoding success rate (expressed as a proportion of full addresses available) was highly consistent across sites: 92% in Colorado, 97% in Ohio, 88% in South Carolina, and 98% in Washington.
To evaluate the various geo-imputation methods, the dataset was limited to those cases with a geocoded full address (total 1,931 cases). Each of the eight allocation methods was applied to the site-specific data assuming that the only piece of address information available was a ZIP code (i.e. a worst case scenario), and the result was then compared with the known, true location.
Table 4. Individual level accuracy of fixed and random geo-imputation methods by site
Results of the evaluation of group level accuracy are summarized in Additional File 1. The column entitled "True" lists the number of tracts that contain a given number of cases ranging from 0 to greater than 5. Given that diabetes in youth is a rare condition and our study was focused on incident cases, it was not surprising that across the entire study area more than 50% of all tracts did not contain a single case. In general, between 24 and 29% of tracts contained a single case with a sequentially decreasing proportion of tracts containing multiple cases. The remainder of the table describes the allocation of cases to tracts achieved by each of the eight imputation methods.
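The group level summary described above, the number of tracts containing 0, 1, 2, ... up to more than 5 cases, can be tabulated as in the following Python sketch. The function name, the tract identifiers, and the example cases are hypothetical; the same tabulation would be applied to the true tracts and to each method's imputed tracts.

```python
from collections import Counter


def tracts_by_case_count(case_tracts, all_tracts, top=5):
    """Count how many tracts contain 0, 1, ..., top, or more than top cases.

    case_tracts: one tract identifier per case (true or imputed).
    all_tracts: every tract in the study area, including tracts without cases.
    """
    cases_per_tract = Counter(case_tracts)
    overflow = f">{top}"
    table = Counter()
    for tract in all_tracts:
        n = cases_per_tract.get(tract, 0)
        table[n if n <= top else overflow] += 1
    labels = list(range(top + 1)) + [overflow]
    return {label: table.get(label, 0) for label in labels}


# Hypothetical study area with five tracts and three cases.
all_tracts = ["t1", "t2", "t3", "t4", "t5"]
case_tracts = ["t1", "t1", "t2"]
distribution = tracts_by_case_count(case_tracts, all_tracts)
print(distribution)
```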
Chi-square statistics associated with group level accuracy (p-values in source order, eight per row: four fixed methods, then four random methods)
p < 0.0001, p < 0.0001, p < 0.0001, p < 0.0001, p = 0.1848, p = 0.9594, p = 0.9359, p = 0.5799
p < 0.0001, p < 0.0001, p < 0.0001, p < 0.0001, p = 0.686, p = 0.9594, p = 0.9359, p = 0.8673
p < 0.0001, p < 0.0001, p < 0.0001, p < 0.0001, p = 0.5065, p = 0.1786, p = 0.8824, p = 0.9580
p < 0.0001, p = 0.0003, p = 0.0004, p < 0.0001, p = 0.9144, p = 0.9518, p = 0.999, p = 0.5719
The individual level accuracy of eight imputation methods was assessed for over 2,000 cases of diabetes across four U.S. regions. This study is among the few to determine accuracy of geo-imputation methods using collected clinical data that had been geocoded through HIPAA compliant procedures. The vast majority of published epidemiologic work to date that has dealt with incomplete address information has reported allocating missing data to the ZIP code centroid [9, 23, 24]. This can be problematic as ZIP codes are less spatiotemporally stable than Census statistical areas such as tracts or block groups. Investigators should pay particular attention when comparing identical ZIP codes from datasets that are temporally dissimilar.
At the level of individual assignment, fixed population-weighted methods showed a mean accuracy of 30.26% (Min 23.03%, Max 33.54% using total population weight) and 30.45% (Min 23.03%, Max 37.98% using youth population ages 0-19 weight). Although these geo-imputation methods led to a disproportionate number of cases allocated to a single tract within a ZCTA, instances exist where this method would be useful. Heavily urbanized residential areas with high population density will contain tracts and ZCTAs smaller in land area and simplify distance calculation to exposure sites.
Although the individual case accuracy of the random methods was lower than fixed methods, randomization allowed for each tract in a ZCTA to have a chance of a case being allocated to it. This allowed for a distribution more closely approximating that seen in reality (i.e. the True column in Additional File 1). Randomized allocation applied to the youth population from Census SF1 was found to provide the best approximation of the true distribution of cases within census tracts for all sites.
Individual accuracy of all methods varied geographically. Colorado results were lowest among most of the eight methods. Colorado comprised the largest total land area and South Carolina was the least densely populated of the four sites. Tract size for Colorado was also largest, averaging 254 km2. We anticipated that sites with tracts of smaller land area would achieve the highest accuracy; Washington and Ohio had the smallest tracts, with average areas of 29.53 km2 and 26.14 km2 respectively. However, South Carolina (average tract area 92.32 km2) results were consistently highest among all eight methods, with Ohio and Washington ranking second or third when comparing each method's accuracy across sites (Table 4).
Compared to the fixed allocation methods, random population-weighted methods showed a mean accuracy of 22.64% at the individual level (Min 20.34%, Max 28.63% using total population weights), 21.07% (Min 17.47%, Max 26.72% using youth population ages 0-19 weights) and 23.83% (Min 18.30%, Max 30.13% using youth population and race/ethnicity). Henry and Boscoe saw a similar accuracy of 25.9% using total population as a weighting mechanism.
At the level of group accuracy, the RandPop and Rand019 methods performed similarly across all sites except Colorado, with RandPop (p = 0.9594) slightly better than Rand019 (p = 0.9359); in South Carolina, RandRace019 performed best (p = 0.9580). This may be due in part to both the rural nature of South Carolina and the larger number of people over 65, particularly within coastal areas. RandArea performed the poorest across all sites when compared to the true distribution. To the best of our knowledge, this is the first paper to evaluate the ability of geo-imputation approaches to approximate distribution of cases across space.
In our study geography, a ZIP code overlapped with a median of 4 (minimum 1, maximum 29) Census tracts. This relationship sets a sort of upper limit on the individual-level accuracy of any imputation method: as the number of tracts per ZIP code increases, the likelihood of correct assignment of an individual decreases, which explains the low overall magnitude of the individual level accuracy of the geo-imputation methods. Furthermore, this relationship between ZIP codes and tracts is likely responsible for the fact that in our data, the fixed allocation methods performed better than any of the random allocation methods at the individual level.
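The relationship between tracts per ZIP code and individual-level accuracy can be made concrete under a simplifying assumption that is ours and not the study's: suppose a case's true tract within a ZIP code follows the same weights used for imputation. A fixed method that always picks the largest-weight tract is then correct with probability equal to that largest weight, while weighted random allocation is correct with probability equal to the sum of squared weights, which can never exceed the largest weight. The Python sketch below computes both quantities; the function names and example weights are hypothetical.

```python
def expected_accuracy_fixed(weights):
    """Probability of a correct tract when the largest weight is always chosen."""
    total = sum(weights)
    return max(weights) / total


def expected_accuracy_random(weights):
    """Probability of a correct tract under weighted random allocation.

    Assumes the true tract and the imputed tract are drawn independently
    from the same normalized weights, so the match probability is sum(p^2).
    """
    total = sum(weights)
    return sum((w / total) ** 2 for w in weights)


# Four equally weighted tracts (the median number of tracts per ZIP code).
equal_four = [1, 1, 1, 1]
# One dominant tract, using hypothetical proportions.
skewed = [0.84, 0.07, 0.05, 0.03, 0.01]

for name, weights in [("equal", equal_four), ("skewed", skewed)]:
    fixed = expected_accuracy_fixed(weights)
    rand = expected_accuracy_random(weights)
    print(f"{name}: fixed={fixed:.3f} random={rand:.3f}")
```

With four equally weighted tracts both approaches give 0.25, and the gap in favor of fixed allocation opens as the weights become more uneven. Real data will depart from this idealized model to the extent that the weights misdescribe where cases actually live.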
Henry and Boscoe showed that weighting using multiple covariates such as race/ethnicity in addition to age achieves higher accuracy. Correspondingly, we refined the weighting using the population of youth aged 0-19 years by additionally considering the racial/ethnic composition of the population of youth. Consistent with previous findings, this approach produced a slight increase in accuracy in the Colorado, Ohio, and South Carolina study sites at the individual level. However, the Washington site experienced a 4% drop in accuracy when accounting for race. It is conceivable that in the Washington site, both the lower levels of residential racial segregation in urban Seattle and the larger ethnic and multi-racial diversity of the Seattle population contribute to the loss in specificity of an assignment, thereby increasing inaccuracy.
It is important to note that the geo-imputation methods shown were conducted entirely within the GIS framework and utilized custom tools developed to handle the random allocation and extend the capabilities of the GIS. Although it is entirely possible to use purely statistical allocation, GIS was essential to both the rapid implementation of the geo-imputation methods as well as the weighting calculations, particularly the area-based weights. Investigators wishing to use geo-imputation methods should take into account the benefits offered in these software packages. Investigators may contact the author to obtain the tool created to perform the geo-imputations presented in this paper.
It has been well established that geocoding success rate can differ significantly with respect to urban and rural areas and can be seen as being correlated with population density [6, 25, 26]. Since address match geocoding is accomplished through interpolation along a street segment, a longer segment common to rural areas may introduce greater error. Furthermore, addresses drawn from rural areas are more likely to contain PO Boxes or Rural Routes as address information, confounding the geocoding process.
A fundamental, very conservative assumption of the present analysis is that a ZIP code is the only address portion available for the entire data set. In many instances geo-imputation would only be applied to the non-geocodable subset of the addresses. Addresses lacking other portions of a geocodable address (in this case, street number, street name, street type) would likely produce different results using these imputation methods. Furthermore, geo-imputation cannot fully compensate for low-quality address data, although it can provide a valuable solution in instances where an analysis will be conducted at spatial units smaller than those available for all cases. Other methods such as dasymetric mapping [28, 29], manual intervention/interactive geocoding or re-coding using a different geocoding strategy may in some instances be preferable.
Although ZCTAs are used by the Census to represent the land area covered by a ZIP code, investigators must consider the potential for spatiotemporal mismatch of current ZIP codes to Census derived ZCTAs. Since the primary function of ZIP codes is to aid the USPS in efficient mail delivery, ZIP codes are necessarily updated frequently between Census dates to reflect changes in population, and the changes may not be well documented.
In summary, our evaluation of geo-imputation approaches for ZIP code level data indicates that while fixed imputation methods yield the greatest accuracy at the individual level, random methods most closely replicate the true distribution of locations across space. Our study illustrates the wide range of geo-imputation approaches that may be considered above and beyond the commonly used ZIP code centroid method. It remains up to the investigator to fully understand the implications of handling missing address data with the methods available and to carefully consider the purpose of the study when selecting an imputation approach.
We would like to thank the SEARCH investigators, staff and participants for making this project possible.
The project was supported by Award Number R01DK077131 from the National Institute of Diabetes and Digestive and Kidney Diseases. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institute of Diabetes and Digestive and Kidney Diseases or the National Institutes of Health.
1. Snow J: On the Mode of Communication of Cholera. 1855, London: Churchill.
2. Cromley EK, McLafferty SL: GIS and Public Health. 2002, New York: Guilford Press.
3. Gatrell A: Geographies of Health. 2002, Malden, MA: Blackwell.
4. Lawson AB: Statistical Methods in Spatial Epidemiology. 2006, New York: Wiley, 2.
5. Zimmerman DL: Statistical methods for incompletely and incorrectly geocoded cancer data. Geocoding Health Data: The Use of Geographic Codes in Cancer Prevention and Control, Research and Practice. Edited by: Rushton G, Armstrong MP, Gittler J, Greene BR, Pavlik CE, West MM, Zimmerman DL. 2007, Boca Raton, Florida: CRC Press.
6. Bonner MR, Daikwon H, Nie J, Rogerson P, Vena JE, Freudenheim JL: Positional accuracy of geocoded addresses in epidemiologic research. Epidemiology. 2003, 14: 408-412.
7. Rushton G, Armstrong MP, Gittler J, Greene BR, Pavlik CE, West MM, Zimmerman D: Geocoding in cancer research: a review. Am J Prev Med. 2006, 30: S16-S24. 10.1016/j.amepre.2005.09.011.
8. Rushton G, Armstrong MP, Gittler J, Greene BR, Pavlik CE, West MM, Zimmerman DL: Geocoding Health Data: The Use of Geographic Codes in Cancer Prevention and Control. 2007, Boca Raton, FL: CRC Press.
9. Krieger N, Waterman P, Chen JT, Soobader MJ, Subramanian SV, Carson R: Zip code caveat: bias due to spatiotemporal mismatches between zip codes and US census-defined geographic areas--the Public Health Disparities Geocoding Project. Am J Public Health. 2002, 92: 1100-1102. 10.2105/AJPH.92.7.1100.
10. Mohai P, Saha R: Reassessing Racial and Socioeconomic Disparities in Environmental Justice Research. Demography. 2006, 43: 2-10.1353/dem.2006.0017.
11. Kearney G, Kiros G: A spatial evaluation of socio demographics surrounding National Priorities List sites in Florida using a distance-based approach. International Journal of Health Geographics. 2009, 8: 33-10.1186/1476-072X-8-33.
12. Voss P, Long D, Hammer R: When census geography doesn't work: Using ancillary information to improve the spatial interpolation of demographic data. 1999, Center for Demography and Ecology, University of Wisconsin, Madison.
13. Truelove M: Measurement of spatial equity. Environment and Planning C: Government and Policy. 1993, 11: 1-10.1068/c110019.
14. Saporito S, Chavers JM, Nixon LC, McQuiddy MR: From here to there: Methods of allocating data between census geography and socially meaningful areas. Social Science Research. 2007, 36: 3-10.1016/j.ssresearch.2006.05.004.
15. Klassen AC, Curriero F, Kulldorff M, Alberg AJ, Platz EA, Neloms ST: Missing stage and grade in Maryland prostate cancer surveillance data, 1992-1997. Am J Prev Med. 2006, 30: S77-S87. 10.1016/j.amepre.2005.09.010.
16. Sheehan JT, DeChello LM, Kulldorff M, Gregorio DI, Gershman S, Mroszczyk M: The geographic distribution of breast cancer incidence in Massachusetts 1988 to adjusted for covariates. International Journal of Health Geographics. 2004, 3: 17-10.1186/1476-072X-3-17.
17. Henry KA, Boscoe FP: Estimating the accuracy of geographical imputation. International Journal of Health Geographics. 2008, 7: 3-10.1186/1476-072X-7-3.
18. SEARCH Study Group: SEARCH for Diabetes in Youth: a multicenter study of the prevalence, incidence and classification of diabetes mellitus in youth. Control Clin Trials. 2004, 25: 458-471. 10.1016/j.cct.2004.08.002.
19. ArcGIS 9.3. 2008, Redlands, CA: Environmental Systems Research Institute (ESRI).
20. US Census Bureau: Census 2000 ZIP Code Tabulation Areas Technical Documentation.
21. Grubesic TH, Matisziw TC: On the use of ZIP codes and ZIP code tabulation areas (ZCTAs) for the spatial analysis of epidemiological data. Int J Health Geogr. 2006, 5: 58-10.1186/1476-072X-5-58.
22. US Census Bureau: Census 2000 Summary File 1, Census of Population and Housing. 2001, Washington, DC: US Bureau of the Census.
23. Brooks N, Sethi R: The distribution of pollution: Community characteristics and exposure to air toxics. Journal of Environmental Economics and Management. 1997, 32: 233-250. 10.1006/jeem.1996.0967.
24. Beyer KMM, Schultz AF, Rushton G: Using ZIP Codes as Geocodes in Cancer Research. Geocoding Health Data: The Use of Geographic Codes in Cancer Prevention and Control, Research and Practice. Edited by: Rushton G, Armstrong MP, Gittler J, Greene BR, Pavlik CE, West MM, Zimmerman DL. 2007, Boca Raton, Florida: CRC Press.
25. Cayo MR, Talbot TO: Positional error in automated geocoding of residential addresses. Int J Health Geogr. 2003, 2: 10-10.1186/1476-072X-2-10.
26. Ward M, Nuckols J, Giglierano J, Bonner M, Wolter C, Airola M, Mix W, Colt J, Hartge P: Positional accuracy of two methods of geocoding. Epidemiology. 2005, 16: 4-10.1097/01.ede.0000147106.32027.3e.
27. Hurley S, Saunders T, Nivas R, Hertz A, Reynolds P: Post Office Box addresses: A challenge for Geographic Information System-based studies. Epidemiology. 2003, 14: 4-
28. Eicher CL, Brewer CA: Dasymetric Mapping and Areal Interpolation: Implementation and Evaluation. Cartography and Geographic Information Science. 2001.
29. Holt JB, Lo CP, Hodler TW: Dasymetric Estimation of Population Density and Areal Interpolation of Census Data. Cartography and Geographic Information Science. 2004, 31: 2-10.1559/1523040041649407.
30. Goldberg DW, Wilson JP, Knoblock CA, Ritz B, Cockburn MG: An effective and efficient approach for manually improving geocoded data. Int J Health Geogr. 2008, 7: 60-10.1186/1476-072X-7-60.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.