The effects of local street network characteristics on the positional accuracy of automated geocoding for geographic health studies
© Zimmerman and Li. 2010
Received: 20 August 2009
Accepted: 16 February 2010
Published: 16 February 2010
Automated geocoding of patient addresses for the purpose of conducting spatial epidemiologic studies results in positional errors. It is well documented that errors tend to be larger in rural areas than in cities, but possible effects of local characteristics of the street network, such as street intersection density and street length, on errors have not yet been documented. Our study quantifies effects of these local street network characteristics on the means and the entire probability distributions of positional errors, using regression methods and tolerance intervals/regions, for more than 6000 geocoded patient addresses from an Iowa county.
Positional errors were determined for 6376 addresses in Carroll County, Iowa, as the vector difference between each 100%-matched automated geocode and its ground-truthed location. Mean positional error magnitude was inversely related to proximate street intersection density. This effect was statistically significant for both rural and municipal addresses, but more so for the former. Also, the effect of street segment length on geocoding accuracy was statistically significant for municipal, but not rural, addresses; for municipal addresses mean error magnitude increased with length.
Local street network characteristics may have statistically significant effects on geocoding accuracy in some places, but not others. Even in those locales where their effects are statistically significant, street network characteristics may explain a relatively small portion of the variability among geocoding errors. It appears that additional factors besides rurality and local street network characteristics affect accuracy in general.
Spatial epidemiologic studies commonly include statistical analyses of the spatial locations of study participants' residential addresses in order to, for example, test for geographic clustering of disease or estimate relationships between environmental exposures and disease [1, 2]. Consequently, as part of the study's data assimilation process, the address provided by each study participant must be converted to geographic (e.g., latitude-longitude) coordinates, a procedure known as geocoding. In some studies, geocoding is performed by visiting each address with a global positioning system (GPS) receiver or by referencing a very accurate (e.g., orthophoto-rectified) image map; however, it is cheaper and hence much more common to obtain geocodes by an automated procedure, which uses widely available GIS software to match each address to a street segment georeferenced within a street database (e.g., a U.S. Census Bureau TIGER file) and then linearly interpolate the position of the address along that segment. This procedure, herein called automated geocoding, is also known as street geocoding. Alternative procedures, such as parcel geocoding and "rooftop" (address-point) geocoding, are growing in use, but in the United States, at least, they are not yet as prevalent as street geocoding. Furthermore, parcel geocoding typically has much lower address match rates than street geocoding.
Unfortunately, the geocodes obtained by any procedure contain positional errors, defined as the (vector) differences between the locations of addresses ascertained by geocoding and their corresponding true locations. Thus every geocoding procedure has associated with it some level of inaccuracy. Some procedures, however, are more inaccurate than others; in particular, automated geocoding is much more inaccurate than geocoding via GPS receivers or image maps. Several recent investigations have demonstrated that automated geocoding frequently results in positional errors of several hundred meters or more [4–15]. For example, one study in a four-county area of upstate New York found that 10% of a sample of rural addresses geocoded with errors of more than 1.5 km, and 5% geocoded with errors exceeding 2.8 km.
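The definition of a positional error as a vector difference can be illustrated with a minimal sketch. It assumes coordinates are already expressed in a projected system with units of meters, and it takes the error to be the geocode minus the true location; the function name and the sign convention are illustrative assumptions rather than details taken from this study.

```python
import math

def positional_error(geocode, truth):
    """Return (dx, dy, magnitude) for the error vector geocode - truth.

    Both arguments are (easting, northing) pairs in meters (assumed
    projected coordinates). The sign convention is an assumption.
    """
    dx = geocode[0] - truth[0]   # east-west displacement
    dy = geocode[1] - truth[1]   # north-south displacement
    return dx, dy, math.hypot(dx, dy)

# Hypothetical geocode lying 30 m east and 40 m north of the true location.
dx, dy, magnitude = positional_error((30.0, 40.0), (0.0, 0.0))
```

The magnitude is what the regression analyses summarize, while the pair (dx, dy) is what the tolerance regions describe.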
Zandbergen lists four main components of positional errors associated with automated geocoding. First, the address may be assigned to the wrong street segment, due to errors in the input address fields or the street database. This typically results in very large positional errors. Second, the address may be assigned to the correct street segment, but the geographic coordinates of the entire segment in the street database are incorrect (e.g., shifted 100 m to the east). Together with the first component, this second component highlights the importance of using an accurate street database, a point that has been emphasized recently. Third, the interpolated assignment of the address along the (correct) street segment may not coincide with the actual location of the address, due either to usage of only a portion of the segment's nominal address range or to less than perfect correspondence between a linear house numbering scheme and the actual numbering scheme on that segment, or both. Finally, the default offset (usually a uniform perpendicular distance of 10 to 15 m) used in automated geocoding may not accurately reflect the actual distance of the residence from the street centerline.
Positional errors introduce location uncertainties into the data that may affect spatial analytic methods. Documented effects of positional errors on spatial statistical analyses include an inflation of standard errors of parameter estimates and a reduction in power to detect spatial clusters and trends [18–22]. In order to better relate the size of these effects to the degree of automated geocoding inaccuracy, it is important to know how accuracy is affected by various geographic characteristics of an address. Such knowledge and understanding may make it possible, for example, to put context-specific confidence bounds or tolerance bounds on the magnitude of an address's positional error. Furthermore, they may allow one to simulate more realistic, context-specific positional errors for use in studies of the effects of geocoding inaccuracy on the power of various statistical tests for clusters, spatial trends, and other important spatial patterns and features. Finally, they can facilitate and improve measurement-error model methods and imputation methods for adjusting spatial statistical analyses for geocoding inaccuracy [24–26].
Our current level of understanding of the geographic factors affecting automated geocoding accuracy is rather limited, however. One factor known to be important is whether an address lies in a rural or urban area. Every published study that has compared positional errors for rural and urban addresses within the same geographic region has found that automated geocodes of the former are, on average, not as accurate as those of the latter. The ratio (rural to urban) of mean positional error magnitudes has been variously reported as approximately 1.4:1 for a small study in western New York, 4:1 for a study spanning 49 states, 3:1 or 10:1 (depending on whether the automated geocoding was performed in-house or by a commercial firm) for a study in south central Iowa, 5:1 for a study in upstate New York, and 5:1 for the address data (from western Iowa) presented in this article. Another factor known to affect positional errors, at least in rural areas with strongly rectilinear street networks, is the axial orientation (north-south or east-west) of the street on which an address lies; specifically, the directional error component in the direction aligned with the street tends to be greater than the component in the orthogonal direction. Presumably, this is due to errors of interpolation along the street segment that are larger, on average, than errors of offset from the segment. Furthermore, scatterplots of positional errors displayed in [7, 12, 15] and a formal statistical analysis have revealed that the empirical probability distribution of positional errors is approximated poorly by a single bivariate normal distribution but quite well by a two-component or three-component mixture of bivariate normal or t distributions. The reason these mixtures fit better appears to be the fact that they can account for disparate components of errors (e.g., interpolation and offset errors) having considerably different levels of variability.
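To make the mixture idea concrete, the following sketch simulates positional errors from a two-component mixture of zero-mean bivariate normal distributions with independent coordinates. The mixing weights and standard deviations are hypothetical values chosen only for illustration: the first component mimics small offset-like errors, and the second mimics larger interpolation-like errors along an east-west street. None of these numbers are estimates from the studies cited above.

```python
import random

def simulate_errors(n, weights, sigmas, seed=0):
    """Draw n error vectors (dx, dy) from a mixture of bivariate normals.

    weights: mixing proportions, one per component.
    sigmas:  (sd_x, sd_y) per component, in meters; coordinates are
             assumed independent within a component (a simplification).
    """
    rng = random.Random(seed)
    errors = []
    for _ in range(n):
        # Pick a component according to the mixing proportions.
        k = rng.choices(range(len(weights)), weights=weights)[0]
        sd_x, sd_y = sigmas[k]
        errors.append((rng.gauss(0.0, sd_x), rng.gauss(0.0, sd_y)))
    return errors

# Hypothetical E-W street: most errors are small, a minority are large
# and stretched along the street (x) direction.
errors = simulate_errors(1000, [0.8, 0.2], [(10.0, 10.0), (300.0, 20.0)])
```

A single bivariate normal fitted to such data would have to compromise between the tight core and the elongated tail, which is one way to see why a mixture can fit better.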
Notwithstanding what has been learned about automated geocoding's positional errors from previous studies, there is still much that is not well understood. For instance, there are readily available covariates in addition to rurality and orientation that may be associated with automated geocoding accuracy. Among these are local characteristics of the street network, for example street intersection density or street segment length. Street intersection density could be construed as a more refined measure of rurality than the dichotomous rural/urban classification measure used heretofore. As such, we might expect that in rural areas at least, accuracy would increase with an increase in street intersection density. Street segment length might be suspected of being associated with accuracy because of how address interpolation algorithms work. That is, since a linear interpolation algorithm places an address proportionately along a street segment, according to where the residence number falls in the range of street numbers assigned to the segment's endpoints, it is reasonable to expect the magnitudes of positional errors to be approximately proportional to segment length.
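The proportional placement described above can be sketched as follows. This is a simplified illustration, not the algorithm of any particular GIS package: it ignores the side-of-street offset and odd/even address parity, and the function and parameter names are assumptions.

```python
def interpolate_address(house_number, from_number, to_number, start, end):
    """Place an address along a straight street segment by linear interpolation.

    from_number/to_number: nominal address range at the segment's endpoints.
    start/end: (x, y) coordinates of the endpoints, in meters.
    """
    if to_number == from_number:
        fraction = 0.5  # degenerate range: fall back to the midpoint
    else:
        fraction = (house_number - from_number) / (to_number - from_number)
    fraction = min(1.0, max(0.0, fraction))  # keep the point on the segment
    x = start[0] + fraction * (end[0] - start[0])
    y = start[1] + fraction * (end[1] - start[1])
    return x, y

# House number 150 in a nominal range of 100 to 200 lands at the midpoint.
long_segment = interpolate_address(150, 100, 200, (0.0, 0.0), (1600.0, 0.0))
short_segment = interpolate_address(150, 100, 200, (0.0, 0.0), (400.0, 0.0))
```

If the residence actually sits 30% of the way along the street rather than 50%, the along-street error is 0.2 times the segment length: 320 m on the 1600 m segment but only 80 m on the 400 m segment. This is the sense in which error magnitude might be expected to scale with segment length.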
In this article we present analyses of the effects of several factors, including local characteristics of the street network, on automated geocoding accuracy. In our analyses we consider not only how means of positional error magnitudes are affected by such characteristics, but also, more comprehensively, how the entire distribution of positional error magnitudes is so affected. We use confidence intervals to characterize uncertainty associated with estimating mean positional error magnitudes, but for characterizing uncertainty associated with estimating the distribution of error magnitudes we use tolerance intervals, i.e. intervals that contain, with a given level of certainty, a fixed proportion of the error magnitudes. Tolerance intervals have a long history of use in engineering and the physical sciences for the purpose of quantifying the uncertainty of errors associated with manufacturing and other physical processes, and it is entirely reasonable to apply them, for the same purpose, to errors incurred by geocoding. Furthermore, to characterize uncertainty associated with estimating the bivariate distribution of positional error vectors, we construct tolerance regions.
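As a simple illustration of what a tolerance bound is, the sketch below computes a distribution-free upper tolerance bound from order statistics. This is a standard textbook construction and is deliberately simpler than the regression-based bounds used in this article: it ignores covariates entirely. It relies on the fact that the k-th smallest of n observations covers at least a proportion p of the population with probability P(B ≤ k - 1), where B is Binomial(n, p). Function names are illustrative.

```python
import math

def binom_pmf(j, n, p):
    """Binomial(n, p) probability of exactly j successes, via logs for stability."""
    log_pmf = (math.lgamma(n + 1) - math.lgamma(j + 1) - math.lgamma(n - j + 1)
               + j * math.log(p) + (n - j) * math.log(1.0 - p))
    return math.exp(log_pmf)

def upper_tolerance_rank(n, p=0.95, confidence=0.95):
    """Smallest rank k such that the k-th order statistic bounds at least a
    proportion p of the population with the stated confidence, or None if
    the sample is too small for any order statistic to qualify."""
    cdf = 0.0
    for k in range(1, n + 1):
        cdf += binom_pmf(k - 1, n, p)  # cdf is now P(B <= k - 1)
        if cdf >= confidence:
            return k
    return None

def upper_tolerance_bound(values, p=0.95, confidence=0.95):
    """Distribution-free upper tolerance bound for a list of error magnitudes."""
    k = upper_tolerance_rank(len(values), p, confidence)
    return None if k is None else sorted(values)[k - 1]
```

For a sample of 100 error magnitudes, the 99th smallest value serves as a 95% upper tolerance bound for the lower 95% of the population; with fewer than 59 observations no such bound exists at these levels.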
The main purpose of this article is to investigate the relationship between automated geocoding accuracy and various geographic and street network characteristics, namely rurality, street orientation, street intersection density, and street segment length. For this purpose, we use a rather large set of real geocoded addresses from an Iowa county.
The address data upon which this investigation is based are a subset of all 9298 residential addresses in Carroll County, Iowa, USA, current as of 31 December 2005, which we obtained in conjunction with a comprehensive study of rural health in Iowa by the Iowa Department of Public Health and other researchers at the University of Iowa.
Corresponding to each address, the following covariates were measured: (1) a dichotomous rurality variable (rural or municipal); (2) a dichotomous street segment orientation variable (north-south or east-west); (3) street segment length; and (4) street intersection density. Street segment lengths were calculated automatically using ArcGIS's field calculator function and VBScript code available from ArcGIS help files. Street intersection density for a given address was measured by counting the number of intersections in a circular buffer of radius one mile centered on the address.
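The buffer-based intersection count can be sketched in a few lines. The sketch assumes that addresses and intersections are given as planar (x, y) coordinates in meters and uses a brute-force scan; an actual GIS workflow would typically use a spatial index or a buffer tool instead. The function name is illustrative.

```python
import math

MILE_IN_METERS = 1609.344

def intersection_density(address, intersections, radius=MILE_IN_METERS):
    """Count street intersections within a circular buffer around an address.

    address: (x, y) in meters (assumed projected coordinates).
    intersections: iterable of (x, y) intersection locations in meters.
    """
    ax, ay = address
    return sum(1 for ix, iy in intersections
               if math.hypot(ix - ax, iy - ay) <= radius)

# Hypothetical layout: three of the four intersections fall inside the buffer.
count = intersection_density(
    (0.0, 0.0),
    [(0.0, 0.0), (1609.0, 0.0), (1700.0, 0.0), (-1000.0, -1000.0)],
)
```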
$$y = \alpha + \beta x + e, \qquad (1)$$

where y represents the magnitude of a positional error, or some transformation thereof; x represents the covariate of interest (e.g., street intersection density or street segment length); α and β are the y-intercept and slope, respectively, of an assumed straight line relating the expectation of y to x; and e represents model error. In accordance with fitting the model by ordinary least squares and subsequent normal theory-based inference, we assumed that the model errors are independent and identically distributed as normal random variables with mean zero and unknown variance σ². We also considered larger, multiple regression models similar to (1), but which included one or more of the dichotomous covariates and interactions among them.
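A minimal ordinary least squares fit of a straight-line model of this kind can be sketched as follows. The code uses only the closed-form estimates for simple linear regression; the function name and the returned keys are illustrative, and the data in the usage example are made up.

```python
import math

def fit_simple_regression(x, y):
    """Ordinary least squares fit of y = alpha + beta * x + e.

    Returns the estimates, the residual standard deviation s (on n - 2
    degrees of freedom), and the coefficient of determination R^2.
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    beta = sxy / sxx
    alpha = y_bar - beta * x_bar
    rss = sum((yi - alpha - beta * xi) ** 2 for xi, yi in zip(x, y))
    tss = sum((yi - y_bar) ** 2 for yi in y)
    return {
        "alpha": alpha,
        "beta": beta,
        "s": math.sqrt(rss / (n - 2)),
        "r2": 1.0 - rss / tss,
    }

# Made-up example: log error magnitude decreasing with intersection density.
fit = fit_simple_regression([2, 5, 9, 14, 20], [5.1, 4.8, 4.6, 4.1, 3.9])
```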
and $\bar{x}$ is the average of the $x_i$'s. Under the same model, an upper 100(1 - α)% tolerance bound for the lower 100(1 - p)% of the population of positional error magnitudes for addresses with covariate equal to x is given by an expression of the same form as the upper 100(1 - α)% confidence limit in (2), except that $t_{\alpha/2,\,n-2}$ is replaced with $t_{\alpha,\,n-2,\,\delta(x)}$, the 100(1 - α)th percentile of a noncentral t distribution with n - 2 degrees of freedom and noncentrality parameter δ(x). Here $\delta(x) = z_p / c(x)$ and $z_p$ is the 100(1 - p)th percentile of the standard normal distribution.
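For orientation, the standard normal-theory expressions that match this description can be written out as below. This is a sketch under the assumption that c(x) denotes the usual standard-error factor for the fitted mean in simple linear regression, and that s is the residual standard deviation on n - 2 degrees of freedom; the hats and the symbol $\hat{y}(x)$ are notation introduced here for illustration.

```latex
% Fitted mean, standard-error factor, and residual variance (assumed notation)
\hat{y}(x) = \hat{\alpha} + \hat{\beta}x, \qquad
c(x) = \sqrt{\frac{1}{n} + \frac{(x - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}}, \qquad
s^2 = \frac{1}{n-2} \sum_{i=1}^{n} \bigl(y_i - \hat{y}(x_i)\bigr)^2

% Two-sided 100(1 - \alpha)% confidence limits for the mean of y at x
\hat{y}(x) \pm t_{\alpha/2,\,n-2}\; s\, c(x)

% Upper 100(1 - \alpha)% tolerance bound for the lower 100(1 - p)% of the population at x
\hat{y}(x) + t_{\alpha,\,n-2,\,\delta(x)}\; s\, c(x), \qquad \delta(x) = \frac{z_p}{c(x)}
```

Because c(x) shrinks as n grows while the tolerance bound's critical value grows with δ(x), the confidence limits tighten around the fitted line with more data, whereas the tolerance bound settles near the fitted mean plus $z_p$ residual standard deviations.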
Here the center of the region is the centroid of positional errors, S is the sample covariance matrix of positional errors, and the two critical values that appear in the region's definition are the 100(1 - α)th percentile of the chi-square distribution with 2(n - 1) degrees of freedom and the 100(1 - p)th percentile of the noncentral chi-square distribution with 2 degrees of freedom and noncentrality parameter 2/n. The second tolerance region, due to Di Bucchianico et al., is the minimum volume ellipse containing a prescribed number of the observed positional errors; that number is defined in terms of $z_\alpha$, the 100(1 - α)th percentile of the standard normal distribution, and [·], the greatest integer function. This tolerance region is nonparametric, i.e., distribution-free, meaning that it is valid regardless of the actual bivariate distribution of the positional errors. For the sample sizes of subgroups occurring in this study (which all exceed 600), exact determination of minimum volume ellipses (and hence the desired tolerance regions) was computationally prohibitive, so we determined them approximately via the resampling algorithm of Rousseeuw and Van Zomeren.
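The resampling idea behind an approximate minimum volume ellipse can be sketched as follows. This is an illustration in the spirit of subset-resampling algorithms, not a reproduction of the implementation used in this study: it repeatedly draws three points, uses their mean and covariance to define an ellipse shape, inflates that shape until it covers k observations, and keeps the candidate with the smallest area. The required count k is taken as an input, and all names and the number of trials are assumptions.

```python
import math
import random

def approx_min_area_ellipse(points, k, trials=500, seed=0):
    """Approximate the smallest ellipse covering at least k of the 2-D points.

    Returns (area, center, shape), where shape = (a, b, d) encodes the
    symmetric matrix [[a, b], [b, d]] and the ellipse is the set of u with
    (u - center)' shape^{-1} (u - center) <= 1. Returns None if every
    sampled triple is degenerate.
    """
    rng = random.Random(seed)
    best = None
    for _ in range(trials):
        sub = rng.sample(points, 3)
        mx = sum(p[0] for p in sub) / 3.0
        my = sum(p[1] for p in sub) / 3.0
        a = sum((p[0] - mx) ** 2 for p in sub) / 2.0
        d = sum((p[1] - my) ** 2 for p in sub) / 2.0
        b = sum((p[0] - mx) * (p[1] - my) for p in sub) / 2.0
        det = a * d - b * b
        if det <= 1e-12:
            continue  # collinear or coincident points: no usable shape
        # Squared Mahalanobis distances of all points from the subset mean.
        d2 = sorted((d * (x - mx) ** 2 - 2.0 * b * (x - mx) * (y - my)
                     + a * (y - my) ** 2) / det for x, y in points)
        m2 = d2[k - 1]  # inflation factor so that k points are covered
        area = math.pi * m2 * math.sqrt(det)
        if best is None or area < best[0]:
            best = (area, (mx, my), (m2 * a, m2 * b, m2 * d))
    return best

def inside(point, center, shape, tol=1e-9):
    """True if the point lies in the ellipse defined by center and shape."""
    a, b, d = shape
    dx, dy = point[0] - center[0], point[1] - center[1]
    q = (d * dx * dx - 2.0 * b * dx * dy + a * dy * dy) / (a * d - b * b)
    return q <= 1.0 + tol
```

More trials give a better approximation at higher cost; exact minimization over all subsets is what becomes prohibitive for samples of several hundred points.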
Although the proposed tolerance intervals for error magnitudes account for the effects of street segment length and street intersection density, neither tolerance region for the errors themselves does. To the authors' knowledge, multi-dimensional tolerance regions that condition on the values of continuous covariates such as these are not yet available. However, we do obtain separate tolerance regions for each of the four subgroups formed by the two categories of rurality and the two categories of street orientation.
Results and Discussion
Descriptive statistics for positional error magnitudes and directional displacement magnitudes of automated geocodes of Carroll County addresses. Rows: |Δx| and |Δy| for rural addresses on N-S streets, rural addresses on E-W streets, municipal addresses on N-S streets, and municipal addresses on E-W streets.
First, to characterize the proportion of variation in log positional error magnitude attributable to the effect of rurality, we carried out a one-factor analysis of variance. We found the effect of rurality to be highly significant ($P < 2.0 \times 10^{-16}$), but it explains only 28% of the overall variation in log positional error magnitude.
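The proportion of variation explained in a one-factor analysis of variance is the between-group sum of squares divided by the total sum of squares. A minimal sketch, with made-up numbers and an illustrative function name:

```python
def variance_explained(groups):
    """Proportion of total variation explained by a grouping factor.

    groups: dict mapping a group label (e.g. "rural", "municipal") to a
    list of responses (e.g. log positional error magnitudes).
    Returns SS_between / SS_total.
    """
    all_values = [v for values in groups.values() for v in values]
    grand_mean = sum(all_values) / len(all_values)
    ss_total = sum((v - grand_mean) ** 2 for v in all_values)
    ss_between = sum(
        len(values) * (sum(values) / len(values) - grand_mean) ** 2
        for values in groups.values()
    )
    return ss_between / ss_total

# Made-up log error magnitudes for two small groups.
share = variance_explained({"rural": [1.0, 3.0], "municipal": [5.0, 7.0]})
```

A highly significant factor can still leave most of the variation unexplained when the sample is large, which is the situation reported above.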
We note that the correlation between street length and intersection density is -0.19 for rural addresses and -0.03 for municipal addresses -- values that are very similar to the correlations between error magnitude and intersection density. As a consequence, the partial correlations between error magnitude and either covariate, adjusted for the other covariate, are virtually identical to the corresponding ordinary correlations. That is, the relationships between error magnitude and either covariate, which were described in the previous two paragraphs, are not affected by whether we do or do not adjust for the values of the other covariate.
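The standard first-order partial correlation formula shows how the adjustment works. Here e denotes error magnitude, and x and z denote the two covariates; these symbols are notation introduced for illustration.

```latex
r_{ex \cdot z} = \frac{r_{ex} - r_{ez}\, r_{xz}}{\sqrt{\bigl(1 - r_{ez}^{2}\bigr)\bigl(1 - r_{xz}^{2}\bigr)}}
```

The correction term in the numerator is the product $r_{ez} r_{xz}$, so when the two covariates are only weakly correlated with each other (for example, $r_{xz} = -0.03$ for municipal addresses) the numerator barely changes, and the factor $\sqrt{1 - r_{xz}^{2}}$ in the denominator is close to 1.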
In an effort to obtain a model for the entire set of log positional error magnitudes (rural and municipal) with the greatest possible explanatory power, we also fitted a multiple linear regression model with covariates rurality, street length, and street intersection density and their two-way and three-way interactions. All coefficient estimates are highly significant. However, the overall R² for the model is only 0.31, which is not much larger than that for the model that includes only the effect of rurality. Thus, the degree of explanatory power for the model is disappointingly modest.
Confidence limits and tolerance bounds
Two-sided 95% confidence limits for mean positional error magnitude, and 95% tolerance bound for the lower 95% of positional error magnitudes, at the 10th, 50th, and 90th percentiles of the statistically most important covariate (street length for municipal addresses and street intersection density for rural addresses).
Mean positional errors for rural addresses are about five times larger, and more strongly clustered in the E-W and N-S axial directions, than their municipal counterparts.
The effect of street segment length on geocoding accuracy was statistically significant for municipal addresses, for which, as expected, mean error magnitude increased with length. There was no such effect for rural addresses, however. We note that a similar phenomenon -- a significant positive street length effect for municipal but not rural addresses -- was observed for another, much smaller dataset of 95 addresses (54 municipal, 41 rural) of cancer patients from Kentucky (Eric Durbin, personal communication). This suggests that this phenomenon may not be uncommon, but more evidence is needed to substantiate this.
The effect of proximate street intersection density on geocoding accuracy was statistically significant for rural addresses, and as expected this effect was such that mean error magnitude was inversely related to intersection density. For municipal addresses, this inverse relationship was also found to be statistically significant, but it was of much smaller magnitude.
Although the effects of one or more street network characteristics were found to be statistically significant, unfortunately they explained only a modest proportion of the variability in the positional errors, especially when considered in the context of a model that accounts for rurality. Thus, the utility of street length and street intersection density as predictors of geocoding accuracy appears to be limited. It is possible that local street network characteristics other than the ones we considered contribute more to geocoding accuracy. Additional measurable factors that could be studied include the nominal address range for a street segment (some segments have much larger ranges than others) and the ratio of actual address range (maximum house number minus minimum house number) to nominal address range for a segment. Further work is needed to determine whether these or other measurable covariates affect accuracy.
Carroll County's strongly rectilinear road network, its relatively high rural/municipal population ratio, and its lack of a truly urban area are typical of many counties in the midwestern region of the United States, but not of many counties in other regions. Thus, the extent to which results for other areas would be similar to those for Carroll County is unknown. We hope that others will perform similar investigations using address data from regions of the United States or other countries with less rectilinear road networks and larger urban areas.
In addition to investigating the significance of street network characteristic effects, we applied methodology for obtaining confidence intervals and tolerance intervals for positional error magnitudes of the Carroll County addresses, which take into account the values of local street network characteristics. For the Carroll County positional error vectors themselves, we obtained elliptical tolerance regions, both parametric (normality-based) and nonparametric, which accounted for rurality and street orientation. Despite the relatively small proportion of the overall variability explained by the covariates, accounting for them in the computation of tolerance intervals and regions does appear to be worthwhile, as they facilitate a more address-specific assessment of likely positional error than is possible when the covariates are ignored.
In focusing our attention on geocoding errors, we have ignored the fact that for many studies, automated geocoding is incomplete; that is, not all addresses can be assigned point-level spatial coordinates by the software. In fact, it is common in practice for 20% or even as many as 40% of subjects' addresses to fail to geocode using standard software and street files. For example, Gregorio et al. and Oliver et al. present public health studies in which 14% and 26%, respectively, of addresses could not be assigned a point location via automated geocoding. For the Carroll County addresses of the present study this figure was 20% (36% rural, 15% municipal) under a 60%-match criterion (and slightly higher under a 100%-match criterion). The possible effects of street network characteristics on the failure to geocode are a topic for future study.
Finally, we note that our study focused on global effects of street network characteristics on geocoding accuracy. Alternatively, one could allow these effects to be spatially varying and use methods of local modeling such as geographically weighted regression to characterize their spatial variation.
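The basic idea of geographically weighted regression can be sketched as a straight-line fit repeated at each focal location, with observations down-weighted by their distance from that location. The sketch below assumes a Gaussian kernel and a fixed bandwidth, both of which are illustrative choices; it is not an analysis performed in this study, and the names are hypothetical.

```python
import math

def local_regression(focal, locations, x, y, bandwidth):
    """Weighted least squares fit of y = alpha + beta * x around a focal point.

    focal: (x, y) map coordinates of the location of interest.
    locations: list of (x, y) map coordinates, one per observation.
    bandwidth: Gaussian kernel bandwidth, in the same units as the coordinates.
    Returns (alpha, beta) estimated with distance-decaying weights.
    """
    w = [math.exp(-0.5 * (math.hypot(lx - focal[0], ly - focal[1]) / bandwidth) ** 2)
         for lx, ly in locations]
    sw = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    beta = sxy / sxx
    alpha = y_bar - beta * x_bar
    return alpha, beta
```

Evaluating this at many focal locations yields a map of local slopes, which would show whether the effect of, say, street segment length differs from one part of a county to another.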
The work of the authors was supported by Grant N01-PC-35143 from the National Cancer Institute (NCI), National Institutes of Health, U.S. Department of Health and Human Services. The views expressed are solely those of the authors and do not represent the views of NCI. We thank Carl Wilburn, GIS Coordinator for Carroll County, Iowa for providing address data and E-911 geocodes for Carroll County.
- Lawson AB: Statistical Methods in Spatial Epidemiology. New York: John Wiley & Sons; 2001.
- Waller LA, Gotway CA: Applied Spatial Statistics for Public Health Data. Hoboken, New Jersey: John Wiley & Sons; 2004.
- Zandbergen PA: Influence of geocoding quality on environmental exposure assessment of children living near high traffic roads. BMC Public Health 2007, 7: 37.
- Dearwent SM, Jacobs RR, Halbert JB: Locational uncertainty in georeferencing public health datasets. J Expo Anal Environ Epidemiol 2001, 11: 329–334.
- Krieger N, Waterman P, Lemieux K, Zierler S, Hogan JW: On the wrong side of the tracts? Evaluating the accuracy of geocoding in public health research. Am J Public Health 2001, 91: 1114–1116.
- Bonner MR, Han D, Nie J, Rogerson P, Vena JE, Freudenheim JL: Positional accuracy of geocoded addresses in epidemiologic research. Epidemiology 2003, 14: 408–412.
- Cayo MR, Talbot TO: Positional error in automated geocoding of residential addresses. Int J Health Geogr 2003, 2: 10.
- McElroy JA, Remington PL, Trentham-Dietz A, Robert SA, Newcomb PA: Geocoding addresses from a large population-based study: lessons learned. Epidemiology 2003, 14: 399–407.
- Whitsel EA, Rose KM, Wood JL, Henley AC, Liao D, Heiss G: Accuracy and repeatability of commercial geocoding. Am J Epidemiol 2004, 160: 1023–1029.
- Yang DH, Bilaver LM, Hayes O, Goerge R: Improving geocoding practices: Evaluation of geocoding tools. J Med Syst 2004, 28: 361–370.
- Ward MH, Nuckols JR, Giglierano J, Bonner MR, Wolter C, Airola M, Mix W, Colt JS, Hartge P: Positional accuracy of two methods of geocoding. Epidemiology 2005, 16: 542–547.
- Whitsel EA, Quibrera PM, Smith RL, Catellier DJ, Liao D, Henley AC, Heiss G: Accuracy of commercial geocoding: assessment and implications. Epidemiologic Perspectives and Innovations 2006, 3: 8.
- Kravets N, Hadden WC: The accuracy of address coding and the effects of coding errors. Health Place 2007, 13: 293–298.
- Schootman M, Sterling DA, Struthers J, Yan Y, Laboube T, Emo B, Higgs G: Positional accuracy and geographic bias of four methods of geocoding in epidemiologic research. Annals of Epidemiology 2007, 17: 464–470.
- Zimmerman DL, Fang X, Mazumdar S, Rushton GR: Modeling the probability distribution of positional errors incurred by residential address geocoding. Int J Health Geogr 2007, 6: 1.
- Zandbergen PA: Geocoding quality and implications for spatial analysis. Geography Compass 2009, 3: 647–680.
- Frizzelle BG, Evenson KR, Rodriguez DA, Laraia BA: The importance of accurate road data for spatial applications in public health: customizing a road network. International Journal of Health Geographics 2009, 8: 24.
- Waller LA: Statistical power and design of focused clustering studies. Stat Med 1996, 15: 765–782.
- Jacquez GM, Waller LA: The effect of uncertain locations on disease cluster statistics. In Quantifying Spatial Uncertainty in Natural Resources: Theory and Applications for GIS and Remote Sensing. Edited by: Mowrer HT, Congalton RG. Chelsea, Michigan: Arbor Press; 2000:53–64.
- Ozonoff A, Jeffery C, Manjourides J, White LF, Pagano M: Effect of spatial resolution on cluster detection: a simulation study. Int J Health Geogr 2007, 6: 52.
- Zimmerman DL: Statistical methods for incompletely and incorrectly geocoded cancer data. In Pages 165–180 in Geocoding Health Data: The Use of Geographic Codes in Cancer Prevention and Control, Research and Practice. Edited by: Rushton G, Armstrong MP, Gittler J, Greene BR, Pavlik CE, West MM, Zimmerman DL. Boca Raton, Florida: CRC Press; 2008.
- Mazumdar S, Rushton G, Smith BJ, Zimmerman DL, Donham KJ: Geocoding accuracy and the recovery of relationships between environmental exposures and health. Int J Health Geogr 2008, 7: 13.
- Jacquez GM, Rommel R: Local indicators of geocoding accuracy (LIGA): theory and application. Int J Health Geogr 2009, 8: 60.
- Barber JJ, Gelfand AE, Silander JA: Modelling map positional error to infer true feature location. The Canadian Journal of Statistics 2006, 34: 659–676.
- Zimmerman DL, Sun P: Estimating spatial intensity and variation in risk from locations subject to geocoding errors. 2009, in press.
- Henry KA, Boscoe FP: Estimating the accuracy of geographical imputation. Int J Health Geogr 2008, 7: 3.
- Graybill FA: Theory and Application of the Linear Model. Boston: Duxbury Press; 1976.
- John S: A tolerance region for multivariate normal distributions. Sankhya 1963, 25: 363–368.
- Di Bucchianico A, Einmahl JHJ, Mushkudiani NA: Smallest nonparametric tolerance regions. The Annals of Statistics 2001, 29: 1320–1343.
- Rousseeuw P, Van Zomeren BC: Robust distances: simulations and cutoff values. In Pages 195–203 in Directions in Robust Statistics and Diagnostics II. Edited by: Stahel W, Weisberg S. New York: Springer; 1991.
- Zandbergen PA: Positional accuracy of spatial data: non-normal distributions and a critique of the National Standard for Data Accuracy. Transactions in GIS 2008, 12: 103–130.
- Gregorio DI, Cromley E, Mrozinski R, Walsh SJ: Subject loss in spatial analysis of breast cancer. Health Place 1999, 5: 173–177.
- Oliver MN, Matthews KA, Siadaty M, Hauck FR, Pickle LW: Geographic bias related to geocoding in epidemiologic studies. Int J Health Geogr 2005, 4: 29.
- Fotheringham AS, Brunsdon C, Charlton M: Geographically Weighted Regression. Chichester: John Wiley & Sons Ltd; 2002.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.