Positional error in automated geocoding of residential addresses
© Cayo and Talbot; licensee BioMed Central Ltd. 2003
Received: 10 September 2003
Accepted: 19 December 2003
Published: 19 December 2003
Public health applications using geographic information system (GIS) technology are steadily increasing. Many of these rely on the ability to locate where people live with respect to areas of exposure from environmental contaminants. Automated geocoding is a method used to assign geographic coordinates to an individual based on their street address. This method often relies on street centerline files as a geographic reference. Such a process introduces positional error in the geocoded point. Our study evaluated the positional error caused during automated geocoding of residential addresses and how this error varies between population densities. We also evaluated an alternative method of geocoding using residential property parcel data.
Positional error was determined for 3,000 residential addresses using the distance between each geocoded point and its true location as determined with aerial imagery. Error was found to increase as population density decreased. In rural areas of an upstate New York study area, 95 percent of the addresses geocoded to within 2,872 m of their true location. Suburban areas revealed less error where 95 percent of the addresses geocoded to within 421 m. Urban areas demonstrated the least error where 95 percent of the addresses geocoded to within 152 m of their true location. As an alternative to using street centerline files for geocoding, we used residential property parcel points to locate the addresses. In the rural areas, 95 percent of the parcel points were within 195 m of the true location. In suburban areas, this distance was 39 m while in urban areas 95 percent of the parcel points were within 21 m of the true location.
Researchers need to determine if the level of error caused by a chosen method of geocoding may affect the results of their project. As an alternative method, property data can be used for geocoding addresses if the error caused by traditional methods is found to be unacceptable.
There has been a dramatic increase in the number of public health applications using GIS. Software and hardware are now more accessible, affordable, and easier to use. Environmental, health and socio-demographic data are readily available through the Internet and optical disk media. Many colleges and universities now offer courses in GIS and spatial analysis. An increase in public awareness of these advances has led to increased demand for studies and maps investigating spatial relationships between health outcome, environmental risk factors and exposure.
Environmental health studies often rely on GIS and geocoding software to help delineate areas of potential exposure and to locate where people live in relation to these areas. A number of studies have used residential locations to determine whether individuals live within defined zones of exposure. Geschwind et al. geocoded congenital malformation cases and controls to estimate the increased risk associated with living within 1 mile of hazardous waste sites. English et al. used geocoded residential addresses to assess whether there was an elevated odds ratio of childhood asthma hospitalizations in children living within 550 feet (168 m) of roads with heavy traffic in San Diego, California. In a study of breast cancer on Long Island, New York, one-kilometer grid cells were created and cases, controls and chemical facilities were assigned to individual cells through automated geocoding methods. The risk of developing postmenopausal breast cancer was found to increase as the number of chemical facilities sharing the same cells as study subjects increased. More recently, Reynolds et al. geocoded childhood cancer cases to census tracts in California and used USEPA data to assign hazardous air pollutant scores to each tract. There was an increased risk of developing childhood cancer as the exposure level increased. Kitto et al. used GIS to locate nearly 45,000 residential radon screening measurements, which were then associated with surficial geology. The association between surficial geology types and radon measures was used to predict radon levels in towns across New York.
Geocoded health data are also used to map rates of disease in order to determine areas of high or low incidence [6, 7]. Rate maps can be used in conjunction with spatial statistics such as the local Moran's I or the Spatial Scan Statistic to locate the general areas where the rates are unlikely to be due to chance. Further investigations or more rigorous epidemiology studies are often needed to clarify the association of risk factors and adverse health outcomes when high rates are detected.
Many GIS software packages provide for street level geocoding. Geocoding software matches residential addresses to street reference files containing geographic centerline coordinates, street numbers, street names and postal codes. Researchers undertaking projects having a geocoding component should be concerned that positional error can be introduced by commonly used algorithms. Of more concern, they need to understand if this error could impact study results. Nondifferential errors with respect to exposure classification will bias the association between a risk factor and the health outcome towards the null, limiting the ability to detect true effects. The capability to detect an association thus depends on the magnitude of this error. However, if the positional error is systematic, it is possible an association may be found between a health outcome and an exposure where none actually exists. In the case of disease surveillance activities, localized high or low rates of disease may appear as an artifact of geocoding errors .
The percentage of addresses that geocode is commonly referred to as a match rate. The inability to geocode addresses can lead to a loss of study population, causing sample bias and reduced statistical power in detecting important associations. Several investigators have provided statistics related to match rates [11–17]. Researchers have found that differences in these rates are dependent on population density [18, 19]. This is because street reference files, such as the U.S. Census TIGER (Topologically Integrated Geographic Encoding and Referencing) files or commercially enhanced TIGER files, often contain more complete address information in more densely populated areas. Gregorio et al. analyzed the match rates of breast cancer cases from the Connecticut Tumor Registry and found that women of color and women living in low income neighborhoods were more likely to be successfully geocoded compared to white women and women living in higher income areas. Investigators should be aware that the geographic differences in match rates can alter study results. For example, if more cases are matched in inner city minority neighborhoods, the disease incidence may appear higher in these areas due to larger subject loss in other areas.
Achieving high match rates depends on accurate and complete address information in both the study subject records and the street reference files. Many types of problems can occur in both, such as: spelling errors; street suffix, prefix and abbreviation inconsistencies; and erroneous ZIP code information. Reference files also contain errors such as missing, incomplete, and incorrect street segments and address ranges. The North American Association of Central Cancer Registries provides an extensive overview and guideline covering the standardization and geocoding of patient addresses, problems encountered, and recommendations for improving the geocoding process of disease registries. Match rates alone are not sufficient to evaluate a geocoding result. Some investigators have also provided statistics related to the percent of geocoded addresses misclassified by town, census area [22, 16], and land parcel. The level of misclassification will change depending on the geographic scale of the regions used.
Very limited published information exists on positional error in automated street level geocoding. Hertz, of the California Department of Health Services, conducted a pilot study to assess geocoding accuracy of 70 addresses (A. Hertz, personal correspondence, 2002). In his study, Hertz used aerial photos to determine the true location of each address and geocoded the same group using three different commercial products. Hertz found positional error to range from 20 to 80 m depending on the product used, with some extreme outliers over 250 m. Researchers at the University of Connecticut compared the locations of addresses geocoded using the U.S. Census Bureau's TIGER files to ground truth locations of approximately 536 addresses in Stratford, Connecticut. Four of these addresses were located more than 500 feet (152 m) from the correct location (E. Cromley, unpublished manuscript, 1997). In a recently published study, Bonner et al. found differences between urban and non-urban addresses when examining distances between the geocoded and GPS determined locations. They found 89 percent of the addresses were within 100 meters in urban areas of Erie and Niagara Counties, NY, while in the non-urban areas 69 percent were within 100 meters.
This study had several objectives. The primary objective was to evaluate positional error in automated geocoding of residential addresses. We measured positional error by calculating the distance between geocoded locations provided by a commonly available off-the-shelf product and their corresponding true locations. This commercial product uses a proprietary enhanced version of the TIGER files. The second objective was to evaluate how this error varies between urban, suburban, and rural population densities. A third objective was to determine if the error can be reduced by adjusting default settings in the geocoding software. The street offset setting allows the user to change the default for how far a geocoded address is placed from a street centerline while the corner inset setting determines how far a geocoded address is placed along a street from an intersection. The final objective was to compare the error observed in the traditional geocoding method, which relies on linear interpolation, to a point geocoding method using property parcel data.
In order to successfully geocode a residential address, a valid street number, street name and ZIP code are required. New York State Office of Real Property Services (NYSORPS) property data contain parcel specific street number and name information, but lack the ZIP code of the parcel address. We could not reliably assign parcel ZIP codes to 3,145 addresses (1.5%) in our residential property file. This group was excluded from further analysis and should have no effect on the overall results since it represents a very small portion of the addresses in the four county area.
Geocoding match rates by population density. Values are based on exact matching of house number, street name, and ZIP code. Columns: Number of Residential Addresses; Number Exact Matched; Percent Exact Matched.
In order to measure positional error, we determined the true location for a random sample of 1,000 addresses from each of the three population density classes for a total sample size of 3,000 addresses. This selection was drawn from the group that matched exactly on street number, street name, and ZIP code. We define the "true" location of each address as the point that visually represented the approximate center of the house using 1 m resolution digitally enhanced aerial orthoimagery. The orthoimagery was flown from 1994–98 and has a horizontal accuracy of 10 m . Through this method, 2,674 (89%) addresses of the study group were assigned true locations.
Closely spaced homes were the most common problem in identifying the true location in urban areas. One meter resolution orthoimagery made it difficult to delineate some of the building rooftops. A more common problem in suburban addresses was dark rooftops surrounded by dense canopy cover from trees. In rural areas, detached garages, barns, and other large outbuildings made it difficult to distinguish the actual house.
Overall, we found only small differences in our ability to assign a true location between the three density classes. Four individuals were involved in creating true locations for the study sample addresses. A QA/QC assessment was performed on a random sample of 100 addresses to compare their decisions of where to place the true location. Results showed that discrepancies between all individuals were minimal, averaging only 3.3 m.
Fieldwork was completed in the summer of 2001 for the remaining 326 addresses which could not be confidently identified using in-house techniques. Staff used real time Global Positioning System technology and mapping software as a navigational aid to locate the address and identify the correct structure for that address. As with the in-house procedure, the point was then manually placed in the center of the correct structure using aerial imagery.
Once true locations of all 3,000 addresses in our sample were determined, we calculated the straight-line distance between coordinates of the true locations and the automated geocoded points. This allowed us to compute the positional error, by population density, from traditional automated geocoding.
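The error calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' actual workflow: it assumes coordinates are already in a projected system with units of meters, and the function names (positional_error, summarize, summarize_by_class) and the linear-interpolation percentile rule are choices made here for illustration.

```python
import math
from collections import defaultdict

def positional_error(true_xy, geocoded_xy):
    """Straight-line distance between two points in a projected CRS (meters assumed)."""
    return math.hypot(geocoded_xy[0] - true_xy[0], geocoded_xy[1] - true_xy[1])

def percentile(sorted_vals, p):
    """Percentile of a pre-sorted list using linear interpolation, p in [0, 100]."""
    if not sorted_vals:
        raise ValueError("no values")
    k = (len(sorted_vals) - 1) * p / 100.0
    lo, hi = math.floor(k), math.ceil(k)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (k - lo)

def summarize(errors):
    """Mean, radial RMSE, and 95th percentile of a list of distances."""
    errs = sorted(errors)
    n = len(errs)
    return {
        "n": n,
        "mean": sum(errs) / n,
        "rmse": math.sqrt(sum(e * e for e in errs) / n),
        "p95": percentile(errs, 95),
    }

def summarize_by_class(records):
    """records: iterable of (density_class, true_xy, geocoded_xy) tuples."""
    groups = defaultdict(list)
    for density, true_xy, geocoded_xy in records:
        groups[density].append(positional_error(true_xy, geocoded_xy))
    return {density: summarize(errs) for density, errs in groups.items()}
```

Calling summarize_by_class on records labeled "urban", "suburban" and "rural" would yield one summary per density class, which is the form in which the error statistics in this article are reported.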
Without knowing the optimal settings, geocoding was initially performed using a street offset and corner inset of zero. We adjusted the default offset and inset settings in MapMarker to see if the positional error in the geocoded addresses could be reduced. The sample address file was re-geocoded using 5 m iterations of these values and compared to true locations to determine the optimal combination.
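The search over offset and inset values can be sketched as a simple grid search. This is an illustration under stated assumptions: the geocode callable stands in for whatever geocoding engine is used (it is hypothetical and is not MapMarker's interface), the search limits of 50 m and 100 m are arbitrary, and mean distance is used as the criterion because the results are reported as mean positional error.

```python
import itertools
import math

def grid_search(geocode, sample, max_offset=50, max_inset=100, step=5):
    """Return (mean_error, offset, inset) minimizing mean positional error.

    geocode: hypothetical callable (address, offset, inset) -> (x, y) in meters.
    sample:  list of (address, (true_x, true_y)) pairs.
    max_offset, max_inset: assumed search limits in meters, not from the study.
    """
    best = None
    offsets = range(0, max_offset + 1, step)
    insets = range(0, max_inset + 1, step)
    for offset, inset in itertools.product(offsets, insets):
        total = 0.0
        for address, (tx, ty) in sample:
            gx, gy = geocode(address, offset, inset)
            total += math.hypot(gx - tx, gy - ty)
        mean_err = total / len(sample)
        # Keep the first combination that achieves the lowest mean error.
        if best is None or mean_err < best[0]:
            best = (mean_err, offset, inset)
    return best
```

The same function could be run once on the pooled sample and once per density class to compare a single setting against class-specific settings.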
We also investigated whether directional bias in the error could be introduced by data conversion issues, such as inconsistent projections or datums in the various GIS layers. A rose diagram was constructed using the directional error of the 3,000 addresses. We also calculated the angle of each error and used the modified Rayleigh test to determine whether the directions of the errors were uniform for both the automated geocoded points and the property parcel points.
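A test of directional uniformity can be sketched as follows. Note the assumptions: this is the standard Rayleigh test with a first-order p-value approximation, exp(-z), and not the modified variant cited in the text; bearings are measured clockwise from grid north; function names are illustrative.

```python
import math

def bearing_deg(true_xy, geocoded_xy):
    """Direction of the error vector, in degrees clockwise from grid north."""
    dx = geocoded_xy[0] - true_xy[0]
    dy = geocoded_xy[1] - true_xy[1]
    return math.degrees(math.atan2(dx, dy)) % 360.0

def rayleigh_test(angles_deg):
    """Standard Rayleigh test for circular uniformity.

    Returns (mean resultant length, z statistic, approximate p-value).
    The p-value uses the first-order approximation exp(-z), which is
    adequate for large samples; small samples need a corrected formula.
    """
    n = len(angles_deg)
    c = sum(math.cos(math.radians(a)) for a in angles_deg)
    s = sum(math.sin(math.radians(a)) for a in angles_deg)
    r_bar = math.hypot(c, s) / n
    z = n * r_bar ** 2
    p = min(1.0, math.exp(-z))
    return r_bar, z, p
```

A small p-value indicates that the error directions cluster around a preferred bearing, which would suggest a systematic shift such as a datum or projection mismatch. A large p-value means uniformity cannot be rejected.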
TIGER based positional error. Positional error is calculated by measuring the distance between address locations determined by automated geocoding methods using enhanced TIGER files and the true location of the houses. RMSE = Root Mean Square Error (radial). N = 1000 for each density class.
We found the optimal combination of the street offset and corner inset for the entire sample to be 15 m and 50 m respectively. This combination of values, however, only reduced the overall mean positional error from 272 to 265 m. Optimal values were also determined for the rural, suburban and urban areas separately, but these provided little additional benefit over using a single setting for all density areas. Using unique values for each area provided an additional reduction in the mean error of 2.1 m in rural areas, 0.1 m in the suburban areas, and 0.7 m in urban areas.
Parcel based positional error. Positional error is calculated by measuring the distance between property parcel locations and the true location of the houses. RMSE = Root Mean Square Error (radial). N = 1000 for each density class.
A visual inspection of the rose diagram showed that the directions of the error were well dispersed. The Rayleigh test confirmed that the angles of the errors were uniform for both the automated geocoded points and the parcel points.
This project used address data typical of those geocoded for health studies. We calculated error only for the addresses which had an exact match on house number, street name and ZIP code to the reference files. Had we considered the addresses that matched on less stringent criteria, both match rates and positional error would have increased. Yu showed that small improvements in match rates achieved by relaxing the matching criteria result in large decreases in positional accuracy. Researchers often sacrifice positional accuracy in order to reduce subject loss from lower match rates when resources are limited for accurately geocoding study subjects.
Several factors explain the positional error in the geocoded locations. The original TIGER files have a horizontal positional accuracy of ± 167 feet (51 m) . The geocoding engine used in this project incorporates enhanced versions of these files. We are unaware of the improvement in geometric accuracy of these street centerlines over original TIGER files. Although it is difficult to measure, we feel that positional accuracy of the enhanced files represents a significant source of positional error in the geocoded addresses. Further research is needed to quantify this contribution to the error.
A more dominant source of error originates in the interpolation algorithms used to determine an address along a street centerline. Address ranges can be incorrect or reversed in the reference files, which causes houses to be geocoded to either the wrong side or wrong end of the street. Larger positional error was observed in rural areas. Generally, rural areas consist of longer streets with fewer intersections. The software must interpolate where to place an address based on the street numbers assigned to the ends of each street segment. As the street segments increase in length, the interpolation error will also increase. In a study of vehicle accident locations, Levine et al. reached a similar conclusion that geocoding error is a function of street segment length and that urban areas typically contain shorter segments. Since the software often assumes uniform intervals between street numbers along a street segment, interpolation errors increase when homes are not evenly spaced along a street. Parcels tend to be larger and less consistent in size in less densely populated areas. The median parcel size in our random sample was found to be 472 m² in the urban areas, 1,214 m² in the suburban areas and 3,035 m² in the rural areas. The variation in parcel size showed a similar trend. The standard deviation was 445 m² for properties classified urban, 5,024 m² for suburban, and 56,046 m² for rural. Finally, there is a greater variation in the distance houses are located from the street centerlines in rural areas. In the urban settings, a common problem was row type housing or condominiums. Reference files space the addresses uniformly along the street when in fact the addresses are clustered together.
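The interpolation step described above can be illustrated with a simplified sketch. It assumes a single straight street segment in projected coordinates (meters) and a plain linear rule with offset and inset parameters. The commercial engine used in this study is proprietary, so this is a generic model of the technique and not its actual algorithm; the function name and parameters are hypothetical.

```python
import math

def interpolate_address(house_no, seg_start, seg_end, from_no, to_no,
                        offset=0.0, inset=0.0, side=1):
    """Place a house number along a straight street segment.

    seg_start, seg_end: (x, y) ends of the centerline segment.
    from_no, to_no:     address range assigned to the two ends.
    offset: perpendicular distance from the centerline.
    inset:  distance kept clear at each end of the segment.
    side:   +1 for the left of the direction of travel, -1 for the right.
    """
    (x0, y0), (x1, y1) = seg_start, seg_end
    length = math.hypot(x1 - x0, y1 - y0)
    if length == 0:
        raise ValueError("zero-length segment")
    # Uniform spacing assumption: position is proportional to house number.
    if to_no == from_no:
        frac = 0.5
    else:
        frac = (house_no - from_no) / (to_no - from_no)
    frac = min(1.0, max(0.0, frac))
    inset = min(inset, length / 2.0)
    dist = inset + frac * (length - 2.0 * inset)
    ux, uy = (x1 - x0) / length, (y1 - y0) / length  # unit vector along street
    x = x0 + ux * dist - side * offset * uy          # shift along left normal (-uy, ux)
    y = y0 + uy * dist + side * offset * ux
    return x, y
```

Under this rule, house number 150 on a 1,000 m segment with range 100 to 200 is always placed at the midpoint, wherever the house actually stands. On long rural segments with unevenly spaced homes, the gap between that midpoint and the true position can reach hundreds of meters, which is the mechanism the paragraph above describes.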
A spatial non-stationary process is evident if statistical parameters such as the mean and variance change with location. Non-stationarity of the positional error may have some important implications in environmental health studies. For example, in urban areas where the error is small we may notice an association between an environmental risk factor and a particular health outcome. In other areas having greater error, associations may be more difficult to detect. In addition, some types of environmental exposures, such as exposures to air pollution from traffic or exposures to agricultural pesticides, are associated with population density. If the level of error varies as the level of exposure changes, the study parameters which estimate the relationship between a risk factor and a health outcome may also be impacted. Global statistical methods perform poorly at uncovering important associations in which the statistical parameters vary locally due to non-stationarity. This has led to an increase in the use of local spatial statistical methods for detecting clustering and localized associations between health outcomes and risk factors.
Though no systematic directional bias in our random sample of addresses was found, we did not determine if systematic error may be present in localized areas. Addresses on a particular street, in close proximity to each other, or within the same ZIP code may all have error of similar direction and distance. For example, the geocoding software may place all the addresses in a local area at some distance from the true street location if a street is misnamed, has incorrect address ranges, or if a ZIP code is incorrect in the street reference file. Burra et al.  demonstrated that very localized geocoding errors in which less than one percent of mortality cases are placed in the wrong census area can lead to differences in up to 75 percent of comparative mortality figures in Hamilton, Ontario, census tracts. Once they had more accurately geocoded the cases they found that approximately 80 percent of the difference in number of cases occurred in only 4 of the 88 census tracts studied. In addition, the size and shape of the clusters that were detected using the local Moran I statistic changed when the geocoding errors were corrected. This was a result of errors being concentrated in localized areas. Further work is needed to measure the strength of the spatial autocorrelation of the geocoding errors by distance and direction.
We attempted to reduce the positional error by optimizing the offset and inset default values in the software. Changing these values contributed very little to reducing overall error. Previous work by Ratcliffe  also did not yield significant reduction in the positional error by altering the offset and inset distances.
The use of property parcel points provides one solution for reducing error when the level of error in traditional geocoding methods is not acceptable. The parcel data clearly contains more accurate locations for the individual houses compared to TIGER based files. Parcel centroids are rarely at the exact location of the house. In urban areas the centroid will more closely represent the location of the actual house because of smaller parcels and more uniform spacing of homes. In rural areas this becomes less likely. The use of parcel data may also help to improve match rates since the parcel data is updated on a yearly basis for tax purposes, while commonly used street centerline files are often updated less frequently.
Though the use of parcel points provides greater positional accuracy, the parcel addresses are often not standardized. Residential and commercial addresses are collected by thousands of local governments across the country. This can lead to a lack of standardization in the way addresses are stored in the data files. The challenge is to standardize the millions of New York addresses and add a ZIP code to each property parcel record. Commercial software programs are available to help standardize the parcel addresses. Once standardized, linkage to health outcome data could be achieved more efficiently and with the same effort as using currently available TIGER based files.
We considered using data from local county emergency E911 systems to improve geocoding accuracy. However, we found that each county in New York State developed their own E911 system for providing route directions to emergency responders. These systems range from simple text based to more elaborate systems using GIS. The files used in E911 systems come from a variety of sources. Some counties rely exclusively on TIGER based files, some use real property assessment data, while others use files from telephone or electric utility companies. The county E911 data is often considered either confidential or proprietary depending on the source. For example, E911 systems often contain the addresses of unlisted telephone numbers. The advantages of using New York State real property data are that the format is more consistent across the state and is available through freedom of information requests. In addition, most of the counties and municipalities report the data directly to NYSORPS. This minimizes the number of requests needed in developing a statewide reference file.
There are some limitations in our study. Since the TIGER files are often derived from data provided by state and local governments, the geometric accuracy and address range completeness may differ in other areas. For this reason, it is difficult to predict if the magnitude of the geocoding error resulting from positional inaccuracy and interpolation error would be similar in other areas of the country. However, we would expect that interpolation issues contributing to positional error will remain the dominant source and correlate highly with population density in most areas. This is due to such issues as longer street segments and houses being spaced further apart in less densely populated areas. In addition to population density, there may be other predictors of positional error such as population growth or sociodemographic variables. Further research is needed in this regard. This study assumes our true locations to be error free. We recognize there is some positional error in the true locations assigned. However, this error is quite small compared to the error caused by the automated geocoding process and should not have a major impact on our results.
We only provide results from one geocoding package. We are uncertain of how the results would change if other products were used on the same set of address data. Most products we are aware of rely on the use of TIGER or enhanced TIGER files. As the geometric accuracy and completeness of the street centerline files improve, we would expect positional error to decrease. However, because houses are often not spaced evenly along streets, there will continue to be greater error using linear interpolation techniques compared to using parcel points to locate addresses.
It is important that researchers determine if the level of error caused by a chosen method of geocoding may affect the results of their project. In the past, researchers appeared to pay little attention to understanding positional error from geocoding. Foote and Huebner report that only recently has more attention been devoted to problems introduced by error, inaccuracy, and imprecision in spatial data and how this can "make or break" a GIS project . The location derived from the geocoding process is often used as input to other operations such as assignment of exposure or socioeconomic class. These assignments are often based on models which also have inherent error. When multiple operations are strung together, errors are often compounded making it difficult to evaluate the accuracy of the final result [34, 35]. Burra et al.  suggest that small geocoding errors, when combined with other types of error in the data, may be amplified into large errors in the final results. Though researchers may be aware that error propagates through the various analyses, they are unable to estimate the accuracy of the final results without first recording the errors of intermediate operations such as geocoding.
Krieger et al. recommend "that all public health projects involving geocoding evaluate and report on methods to verify the accuracy of their geocoding methodology". If the error caused by traditional methods is not acceptable, one consideration is the use of property data to geocode health data.
We are currently conducting further analyses to determine the implications positional error has on the misclassification of individuals with respect to exposure. Copeland et al. provide examples of how to measure the underlying true value of a study's odds ratio or relative risk if the sensitivity and specificity of a classification procedure can be measured. We also need to examine whether the errors are random and bias study results towards the null, or whether there are systematic errors which could lead to erroneous positive results.
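The effect of nondifferential exposure misclassification, and its correction when sensitivity and specificity are known, can be sketched as below. This uses the common textbook back-correction for a 2x2 table and may not match the exact presentation in the cited work; the counts, sensitivity and specificity in the usage note are invented for illustration only.

```python
def observed_counts(exposed, unexposed, se, sp):
    """Apply misclassification: expected observed (exposed, unexposed) counts."""
    obs_exposed = se * exposed + (1.0 - sp) * unexposed
    return obs_exposed, (exposed + unexposed) - obs_exposed

def corrected_counts(obs_exposed, obs_unexposed, se, sp):
    """Invert the misclassification to recover (exposed, unexposed) counts."""
    if se + sp <= 1.0:
        raise ValueError("sensitivity + specificity must exceed 1")
    total = obs_exposed + obs_unexposed
    exposed = (obs_exposed - (1.0 - sp) * total) / (se + sp - 1.0)
    return exposed, total - exposed

def odds_ratio(case_exp, case_unexp, ctrl_exp, ctrl_unexp):
    """Odds ratio from the four cells of a case-control table."""
    return (case_exp * ctrl_unexp) / (case_unexp * ctrl_exp)
```

As a hypothetical example, take 100 exposed and 100 unexposed cases with 50 exposed and 150 unexposed controls, giving a true odds ratio of 3.0. With sensitivity 0.8 and specificity 0.9 applied equally to cases and controls, the expected observed table is 90/110 and 55/145, an odds ratio of about 2.16, which is biased towards the null. Passing the observed counts through corrected_counts recovers the original table and the odds ratio of 3.0.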
We thank Chris Pantea and Valerie Haley for their assistance with statistical analysis, Pat Steen and Frank Schoonbeck for their assistance in fieldwork, and Jim Bowers and Deepa Varadarajulu for their assistance in determining photo corrected true locations for residential addresses. We thank Syni-An Hwang, Steve Forand, Francis Boscoe, and Gwen Babcock for providing editorial comments on this manuscript.
This article is published under license to BioMed Central Ltd. This is an Open Access article: verbatim copying and redistribution of this article are permitted in all media for any purpose, provided this notice is preserved along with the article's original URL.