Spatial epidemiologic studies commonly include statistical analyses of the spatial locations of study participants' residential addresses in order to, for example, test for geographic clustering of disease or estimate relationships between environmental exposures and disease [1, 2]. Consequently, as part of the study's data assimilation process, the address provided by each study participant must be converted to geographic (e.g., latitude-longitude) coordinates, a procedure known as geocoding. In some studies, geocoding is performed by visiting each address with a global positioning system (GPS) receiver or by referencing a very accurate (e.g., orthophoto-rectified) image map; however, it is cheaper and hence much more common to obtain geocodes by an automated procedure, which uses widely available GIS software to match each address to a street segment georeferenced within a street database (e.g., a U.S. Census Bureau TIGER file) and then linearly interpolates the position of the address along that segment. This procedure, herein called automated geocoding, is also known as street geocoding. Alternative procedures, such as parcel geocoding and "rooftop" (address-point) geocoding, are growing in use, but in the United States, at least, they are not yet as prevalent as street geocoding. Furthermore, parcel geocoding typically has much lower address match rates than street geocoding .
Unfortunately, the geocodes obtained by any procedure contain positional errors, defined as the (vector) differences between the locations of addresses ascertained by geocoding and their corresponding true locations. Thus every geocoding procedure has associated with it some level of inaccuracy. Some procedures, however, are less accurate than others; in particular, automated geocoding is much less accurate than geocoding via GPS receivers or image maps. Several recent investigations have demonstrated that automated geocoding frequently results in positional errors of several hundred meters or more [4–15]. For example, one study  in a four-county area of upstate New York found that 10% of a sample of rural addresses geocoded with errors of more than 1.5 km, and 5% geocoded with errors exceeding 2.8 km.
Zandbergen  lists four main components of positional errors associated with automated geocoding. First, the address may be assigned to the wrong street segment, due to errors in the input address fields or the street database. This typically results in very large positional errors. Second, the address may be assigned to the correct street segment, but the geographic coordinates of the entire segment in the street database are incorrect (e.g., shifted 100 m to the east). Together with the first component, this second component highlights the importance of using an accurate street database, a point emphasized recently in . Third, the interpolated assignment of the address along the (correct) street segment may not coincide with the actual location of the address, due either to usage of only a portion of the segment's nominal address range or to less than perfect correspondence between a linear house numbering scheme and the actual numbering scheme on that segment, or both. Finally, the default offset (usually a uniform perpendicular distance of 10 to 15 m) used in automated geocoding may not accurately reflect the actual distance of the residence from the street centerline.
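The interpolation and offset steps behind the third and fourth error components can be made concrete with a minimal sketch. It assumes planar coordinates in meters, a straight street segment, and a 12 m default offset chosen from the 10 to 15 m range mentioned above; the function name `street_geocode` and its parameters are hypothetical and do not correspond to any particular GIS package.

```python
import math


def street_geocode(seg_start, seg_end, addr_low, addr_high, house_number,
                   offset_m=12.0, side="left"):
    """Place a house number along a straight segment by linear interpolation,
    then shift it perpendicular to the street centerline.

    Assumptions (illustrative only): planar x/y coordinates in meters and a
    nominal address range [addr_low, addr_high] spread evenly over the segment.
    """
    (x0, y0), (x1, y1) = seg_start, seg_end
    length = math.hypot(x1 - x0, y1 - y0)
    if length == 0 or addr_high == addr_low:
        raise ValueError("degenerate segment or address range")

    # Fraction of the way along the segment implied by the house number.
    t = (house_number - addr_low) / (addr_high - addr_low)
    px, py = x0 + t * (x1 - x0), y0 + t * (y1 - y0)

    # Unit vector along the street and its left-hand normal.
    ux, uy = (x1 - x0) / length, (y1 - y0) / length
    nx, ny = -uy, ux
    sign = 1.0 if side == "left" else -1.0
    return (px + sign * offset_m * nx, py + sign * offset_m * ny)


# A 1000 m east-west segment whose nominal range is 100-198.
print(street_geocode((0, 0), (1000, 0), 100, 198, 149))  # (500.0, 12.0)
```

If the houses on this hypothetical segment actually occupy only its first 300 m, number 149 would still be placed at the 500 m mark, which illustrates how partial use of the nominal address range produces interpolation error.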
Positional errors introduce location uncertainties into the data that may affect spatial analytic methods. Documented effects of positional errors on spatial statistical analyses include an inflation of standard errors of parameter estimates and a reduction in power to detect spatial clusters and trends [18–22]. In order to better relate the size of these effects to the degree of automated geocoding inaccuracy, it is important to know how accuracy is affected by various geographic characteristics of an address. Such knowledge may make it possible, for example, to put context-specific confidence bounds or tolerance bounds on the magnitude of an address's positional error. Furthermore, it may allow one to simulate more realistic, context-specific positional errors for use in studies of the effects of geocoding inaccuracy on the power of various statistical tests for clusters, spatial trends, and other important spatial patterns and features; see, for example, . Finally, it can facilitate and improve measurement-error model methods and imputation methods for adjusting spatial statistical analyses for geocoding inaccuracy [24–26].
Our current level of understanding of the geographic factors affecting automated geocoding accuracy is rather limited, however. One factor known to be important is whether an address lies in a rural or urban area. Every published study that has compared positional errors for rural and urban addresses within the same geographic region has found that automated geocodes of the former are, on average, not as accurate as those of the latter. The ratio (rural to urban) of mean positional error magnitudes has been variously reported as approximately 1.4:1 for a small study in western New York , 4:1 for a study spanning 49 states , 3:1 or 10:1 (depending on whether the automated geocoding was performed in-house or by a commercial firm) for a study in south central Iowa , 5:1 for a study in upstate New York , and 5:1 for the address data (from western Iowa) presented in this article. Another factor known to affect positional errors, at least in rural areas with strongly rectilinear street networks, is the axial orientation (north-south or east-west) of the street on which an address lies ; specifically, the directional error component in the direction aligned with the street tends to be greater than the component in the orthogonal direction. Presumably, this is due to errors of interpolation along the street segment that are larger, on average, than errors of offset from the segment. Furthermore, scatterplots of positional errors displayed in [7, 12, 15] and formal statistical analysis reported in  have revealed that the empirical probability distribution of positional errors is approximated poorly by a single bivariate normal distribution but quite well by a two-component or three-component mixture of bivariate normal or t distributions. These mixtures appear to fit better because they can account for disparate components of errors (e.g., interpolation and offset errors) having considerably different levels of variability .
Notwithstanding what has been learned about automated geocoding's positional errors from previous studies, there is still much that is not well understood. For instance, there are readily available covariates in addition to rurality and orientation that may be associated with automated geocoding accuracy. Among these are local characteristics of the street network, for example street intersection density or street segment length. Street intersection density could be construed as a more refined measure of rurality than the dichotomous rural/urban classification measure used heretofore. As such, we might expect that in rural areas at least, accuracy would increase with an increase in street intersection density. Street segment length might be suspected of being associated with accuracy because of how address interpolation algorithms work. That is, since a linear interpolation algorithm places an address proportionately along a street segment, according to where the residence number falls in the range of street numbers assigned to the segment's endpoints, it is reasonable to expect the magnitudes of positional errors to be approximately proportional to segment length.
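The proportionality argument can be illustrated with a small sketch. It assumes, purely for illustration, that the mismatch between where the house number falls in the nominal address range and where the residence actually sits, each expressed as a fraction of the segment, does not itself depend on segment length; the function name `along_street_error` is hypothetical.

```python
def along_street_error(segment_length_m, nominal_fraction, true_fraction):
    """Along-street interpolation error, in meters.

    nominal_fraction: position implied by the house number within the
                      segment's address range (0 = start, 1 = end).
    true_fraction:    actual position of the residence along the segment.
    Both fractions are illustrative inputs, not estimates from any data set.
    """
    return abs(nominal_fraction - true_fraction) * segment_length_m


# The same fractional mismatch on segments of increasing length.
for length_m in (100.0, 400.0, 1600.0):
    print(length_m, along_street_error(length_m, 0.75, 0.5))
```

Under this assumption the error grows linearly with segment length. Whether real positional errors behave this way is an empirical question of the kind the analyses in this article address.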
In this article we present analyses of the effects of several factors, including local characteristics of the street network, on automated geocoding accuracy. In our analyses we consider not only how means of positional error magnitudes are affected by such characteristics, but also, more comprehensively, how the entire distribution of positional error magnitudes is affected. We use confidence intervals to characterize uncertainty associated with estimating mean positional error magnitudes, but for characterizing uncertainty associated with estimating the distribution of error magnitudes we use tolerance intervals, i.e., intervals that contain, with a given level of certainty, a fixed proportion of the error magnitudes. Tolerance intervals have a long history of use in engineering and the physical sciences for the purpose of quantifying the uncertainty of errors associated with manufacturing and other physical processes, and it is entirely reasonable to apply them, for the same purpose, to errors incurred by geocoding. Furthermore, to characterize uncertainty associated with estimating the bivariate distribution of positional error vectors, we construct tolerance regions.
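To make the notion of a tolerance bound concrete, the following sketch computes a one-sided, distribution-free upper tolerance bound from order statistics. This is one standard textbook construction, shown here only as an illustration; it is not necessarily the construction used for the analyses in this article, and the function names are hypothetical. The k-th smallest error magnitude is an upper bound on a proportion p of all error magnitudes, with confidence at least gamma, when P(Binomial(n, p) <= k - 1) >= gamma.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))


def upper_tolerance_rank(n, proportion, confidence):
    """Smallest rank k (1-based) such that the k-th order statistic covers
    `proportion` of the population with at least `confidence`; None if the
    sample is too small for any rank to qualify."""
    for k in range(1, n + 1):
        if binom_cdf(k - 1, n, proportion) >= confidence:
            return k
    return None


def upper_tolerance_bound(errors, proportion=0.95, confidence=0.95):
    """Distribution-free upper tolerance bound for error magnitudes.
    Assumes the errors are independent draws from a continuous distribution."""
    k = upper_tolerance_rank(len(errors), proportion, confidence)
    if k is None:
        return None
    return sorted(errors)[k - 1]
```

With 95% coverage and 95% confidence, this construction requires at least 59 observations, in which case the bound is the sample maximum; with fewer observations, `upper_tolerance_bound` returns `None`.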
The main purpose of this article is to investigate the relationship between automated geocoding accuracy and various geographic and street network characteristics, namely rurality, street orientation, street intersection density, and street segment length. For this purpose, we use a rather large real data set of geocoded addresses from an Iowa county.