Geographic bias related to geocoding in epidemiologic studies
© Oliver et al; licensee BioMed Central Ltd. 2005
Received: 02 November 2005
Accepted: 10 November 2005
Published: 10 November 2005
This article describes geographic bias in GIS analyses with unrepresentative data owing to missing geocodes, using as an example a spatial analysis of prostate cancer incidence among whites and African Americans in Virginia, 1990–1999. Statistical tests for clustering were performed and such clusters mapped. The patterns of missing census tract identifiers for the cases were examined by generalized linear regression models.
The county of residency for all cases was known, and 26,338 (74%) of these cases were geocoded successfully to census tracts. Cluster maps showed patterns that appeared markedly different, depending upon whether one used all cases or those geocoded to the census tract. Multivariate regression analysis showed that, in the most rural counties (where the missing data were concentrated), the percent of a county's population over age 64 and with less than a high school education were both independently associated with a higher percent of missing geocodes.
We found statistically significant pattern differences resulting from spatially non-random differences in geocoding completeness across Virginia. Appropriate interpretation of maps, therefore, requires an understanding of this phenomenon, which we call "cartographic confounding."
Keywordsbias (epidemiology) confounding factors epidemiology geographic information systems
Epidemiologists and public health researchers are increasingly using geographic information systems (GIS) to assess the association between population health and the area characteristics of where people reside [1–8]. However, spatial analyses are fraught with challenges. The initial task of assigning geographic locations to study subjects – geocoding – can be difficult [9, 10]. As Krieger et al have noted , the completeness with which geocoding is performed varies, which can affect the findings of spatial epidemiologic analyses .
There are two separate and equally important problems that can arise in the process of matching addresses to locations. First, addresses may be assigned longitudes and latitudes that are unacceptably far from their actual locations. This positional inaccuracy, owing to incorrect addresses or the assignment of incorrect latitudes and longitudes to correctly recorded addresses, can lead to bias in a study's outcomes [13–16]. This paper does not address this issue; rather, it focuses on the second problem, which is differential match rates by geographic region.
Such differential match rates can give biased results because GIS analyses may be based on unrepresentative data and a consequent information  bias in which important data are missing in a non-random fashion. Non-random missingness, a term used by statisticians to describe this information bias , can result from social, economic, political, and other reasons. One of the central contributions of spatial analyses of disease is that geographic location can be used to help account for unmeasured and unmeasurable risks for disease[19, 20]. However, if the risk factors for missing locations are some of the same risks as those for the disease under study, then this confounding of risk factor with place makes an unbiased spatial analysis more difficult to achieve . We illustrate this latter type of bias through an analysis of prostate cancer incidence over a 10-year period in Virginia.
In a study of prostate cancer incidence and race in Virginia , we found that the median household income and urban status of an area were associated positively with prostate cancer incidence for both African Americans and whites. The level of poverty and lower education were associated with decreased incidence among whites but not African Americans. Statistically significant associations were found only at the census tract level, disappearing when the analyses were conducted at the county level.
We sought to discern whether the differences we noted at the census-tract level analysis were real or simply an example of the modifiable areal unit problem (MAUP) , in which one obtains different results and inferences when the same set of data is grouped in different sized areal units. For example, Krieger et al  found in their analysis of cancer incidence and mortality in Massachusetts and Rhode Island that significant findings were lost when the analyses were conducted with areal units larger than the census tract. Gregorio et al , however, found in their cluster analysis of incident prostate and breast cancer cases in Connecticut that there were few differences in results across areal units, and there was no compelling need in typical cancer surveillance studies to prepare data at areas finer than the census tract.
In our study, only 74% of the cases were geocoded to the census-tract level, whereas 100% of the cases had county codes. When the statistical analyses were conducted at the county level with the same reduced data set (74% of the cases) as that used at the census-tract level, the results were similar, regardless of the geographic unit of analysis. (The generalized linear mixed modeling used to conduct this analysis and the results are described in detail elsewhere .) When the county analyses were conducted using 100% of the cases, the associations between predictor variables and prostate cancer incidence disappeared. These statistical analyses indicated the differences in results were not owing to the MAUP, but possibly to missing data.
In this paper, we report the findings of a study we conducted to evaluate whether unrepresentative data resulted merely from being missing, or whether the "missingness" of the data itself was confounded geographically with our covariates.
Census Tract geocoding results broken down by address type.
% Of address types (No.)
One of the most important contributions of using GIS technology in epidemiologic research is to help us discern geographic patterns of disease. We found that geographic patterns of prostate cancer incidence at the census-tract level in Virginia may reflect the distribution of the available data rather than real, underlying disease patterns. In an analysis of cancer incidence, where the census population is the denominator, information bias may result from missing geocoded data. The effect of missing geocoded data could be different, for example, in an analysis of the proportion of late-stage disease, where the denominator would be the successfully geocoded cases. However, even in this situation, if screening practices are different in urban versus rural communities, then the apparent proportion of late-stage disease could be biased.
The percentage of data able to be geocoded at the census-tract level in this analysis increased over the study period, reflecting the progressive implementation of the enhanced 911 rules by the Federal Communications Commission. These rules, among other things, require assigning street addresses to rural locations. Despite these improvements, as well as increasing prostate cancer screening from 1990–94 to 1995–99, the location of high-rate clusters did not differ markedly between these two time periods. However, for each time period the spatial location of clusters among cases geocoded to the census tract versus those located in the county were vastly different.
These findings demonstrate statistically significant pattern differences resulting from spatially non-random differences in geocoding completeness across Virginia. In classic epidemiologic terms, a measure of the effect of one factor on disease risk can be biased because of its association with another factor (confounder) and the disease. Similarly, when the factor of interest is geographic, a factor related to the disease that is not distributed randomly across the study area can confound the appearance of maps of that disease. Appropriate interpretation of maps, therefore, requires an understanding of this phenomenon, which we call "cartographic confounding."
In this study, systematically missing data are a result of location; however, a location's urban or rural status and associated sociodemographic characteristics were found to be associated with the likelihood of missing data from that location, as well as to the likelihood of disease in that area. Spatial patterns of disease incidence, therefore, may confound cartographically the location and sociodemographic risk factors for the disease. In our study of prostate cancer incidence in Virginia, the findings that area-level measures of income and urban status are associated with increased incidence are tempered by the possibility of cartographic confounding. This problem is particularly vexing when evaluating geographic health disparities, as the possible bias of one's statistical analyses depends upon the proportion of rural population in the study .
Cartographic confounding is geographically based, i.e., related to location. Methods designed to account for possibly unrepresentative data, therefore, also should account for this geographic component. One approach to dealing with this problem is to minimize missing data through imputation of geocodes .
When conducting spatial analyses, sound analyses depend upon assessing for possible bias and cartographic confounding resulting from insufficient geocoding that leads to systematically missing data. In this way, the power of geographic information science can be more effectively brought to bear on important issues of public health and the inferences from the analyses are more likely to be correct.
Incidence data are from the Virginia Cancer Registry (VCR), 1990 – 1999. There were a total of 37,373 malignant neoplasms of the prostate in that period, with 27,414 in whites, 8,252 in African Americans, and 1,707 in others. Incident cases were geocoded to the street level using ArcGIS and its StreetMap USA 2000 database.(Products of the Environmental Science Research Institute, Redlands, CA.) A point-in-polygon methodology [30, 31] was used to attribute 1990 census tracts to cases. Case counts were aggregated to the census tract. All cases were assigned county codes by the VCR. County codes were checked against the address-matched geocodes as a quality control measure. County codes were incorrect in only 1% of cases.
For the study period, the North American Association of Central Cancer Registries (NAACCR) reports that the VCR has 90% case ascertainment . The VCR is not given NAACCR's top ranking primarily because the number of cases ascertained by death certificate only is too high or not available, depending on the year.
Area-based measures were derived from the 1990 U.S. Census data (U.S. Bureau of the Census Summary Tape File 3A). These measures were used so that presumed exposures occurred before disease incidence. The poverty variable was a measure of the percentage of persons in a census tract below the poverty level, categorized as <10%, 10 – 19%, and ≥20%. A near-poor variable measured the percent of the tract's population between 100 and 200% of the federal poverty level. The tract's median household income was also used as a variable. A low-education variable measured the percentage of persons in a census tract 25 years or older who had less than a high-school education. A high-education variable did likewise for that percentage with at least 4 years of college. The percent of a tract's population that was rural (≤50%, 51 to <100%, and 100%) was another variable. These cutpoints were chosen based on frequencies for this measure in our data. The percent of female heads of household was another predictor variable.
For both the African-American and white populations during the study period, we used an area allocation method [19, 30] to produce population averages over the study period at the tract level. (The 1990 and 2000 census tracts do not match exactly, and this method was utilized to adjust for this fact.) The 1990 and 2000 county boundaries were directly comparable. As a result, we created a direct average of populations without any manipulation. The resulting averages were annualized over the 10-year study period, and these figures were used to calculate age-adjusted incidence rates of prostate cancer by the direct method utilizing the 2000 U.S. standard million .
Exploratory spatial data analysis
Annualized, age-adjusted prostate cancer incidence rates for African Americans and whites were calculated at the census tract and county levels. Owing to the low case counts at younger ages, we used three age categories (<50, 50–74, and ≥75). These incidence rates were mapped at the tract and county levels.
Low case counts, sparse populations, or both result in unstable incidence rates. We used a weighted, two-dimensional, median-based smoothing algorithm called "headbanging" to reduce this noise , allowing patterns to emerge from the data.
Hierarchical Poisson regression modeling, using the SAS GLIMMIX macro (SAS 2001), was performed to assess prostate cancer incidence for all census tracts and counties in Virginia by the patient's age at diagnosis and sociodemographic characteristics of the census tracts. Specifically, the number of prostate cancer cases in census tract i (i = 1, ..., 1673), age group j (j = 1, 2, 3), denoted d ij , was assumed to be distributed as a Poisson random variable, with a mean n ij λ ij where n ij is the corresponding population at risk and λ ij is the incidence rate in census tract i and age group j. We assumed a log-linear rate structure, with the county or tract intercept of the regression model conceptualized as a random effect with a spatial correlation structure to account for spatial autocorrelation in the data.
The statistical analyses were stratified by racial category. Owing to the sparseness of the VCR data in other racial categories, we only analyzed data for African Americans and whites. All main effects and two-way interactions were initially screened for significance in a logistic regression model, and a final model for the hierarchical Poisson regression was constructed using stepwise, backward variable selection. Only highly significant interaction terms (p < 0.005) were retained in the logistic model to account for the multiple comparisons inherent in the selection process . The full results of this study of prostate cancer incidence in Virginia, 1990–99, is available elsewhere .
In the current study, we conducted analyses to assess whether prostate cancer cases clustered within the study area. As noted by Waller and Gotway , global clustering indicates clustering exists at some point in one's study area, whereas local clustering refers to the presence of a cluster at a specific site. We evaluated the raw count data for global clustering, using Tango's Maximum Excess Events Test (MEET) . The statistical code to execute the MEET was provided to the authors in R (an open-source statistical package similar to S-Plus [Insightful Corporation]) by Prof. Toshiro Tango. We also assessed the count data with a spatial scan statistic (SaTScan) [37, 38] to identify statistically significant local clusters.
The patterns of missing tract identifiers were examined by generalized linear regression models in SAS 9.1 (SAS Institute, Inc. Cary, NC) that included percent of tract population over age 64, percent of tract population aged 25 or older with less than a high school education, percent of tract population aged 25 or older with at least a college education, and the median household income in the area. These factors have been found to be associated with prostate cancer incidence [4–6, 22]. Moreover, in our prior Virginia study , the geographic distribution of missing data and that of several of these covariates was similar, which, we hypothesized, might be the result of confounding between the two.
This project was supported by grant K07 CA099983 from the National Cancer Institute, HRSA CFDA No. 93.984, Academic Units in Primary Care-Family Medicine, and a grant from the Paul Mellon Prostate Cancer Research Institute. The authors would like to thank David Stinchcomb, Denise Lewis, and Barry Miller for informative discussions on this topic. We would like to especially thank the reviewers of our original manuscript. Their thoughtful comments helped immensely to improve the quality of this paper.
- Ricketts TC: Geographic Information Systems and Public Health. Annu Rev Public Health. 2003, 24: 1-6. 10.1146/annurev.publhealth.24.100901.140924.PubMedView ArticleGoogle Scholar
- McLafferty SL: GIS and Health Care. Annu Rev Public Health. 2003, 24: 25-42. 10.1146/annurev.publhealth.24.012902.141012.PubMedView ArticleGoogle Scholar
- Rushton G: Public Health, GIS, and Spatial Analytic Tools. Annu Rev Public Health. 2003, 24: 43-56. 10.1146/annurev.publhealth.24.012902.140843.PubMedView ArticleGoogle Scholar
- Krieger N, Chen JT, Waterman PD, Soobader MJ, Subramanian SV, Carson R: Geocoding and monitoring of U.S. socioeconomic inequalities in mortality and cancer incidence: Does the choice of area-based measure and geographic level matter? The Public Health Disparities Geocoding Project. Am J Epidemiol. 2002, 156: 471-482. 10.1093/aje/kwf068.PubMedView ArticleGoogle Scholar
- GK S, BA M, B H, Edwards BK: 2003, Bethesda, MD, National Cancer Institute, NIH Pub No 03-5417: Area socioeconomic variations in U.S. cancer incidence, mortality, stage, treatment, and survival, 1975-1999, NCI Cancer Surveillance Monograph Series, No. 4,
- Krieger N, Chen JT, Waterman PD, Rehkopf DH, Subramanian SV: Painting a truer picture of U.S. socioeconomic and racial/ethnic health inequalities: The Public Health Disparities Geocoding Project. Am J Public Health. 2005, 95: 312-323. 10.2105/AJPH.2003.032482.PubMedPubMed CentralView ArticleGoogle Scholar
- Pickle LW, Feuer EJ, Edwards BK: Prediction of incident cancer cases in non-SEER counties. ASA Proceedings of the Biometrics Section. 2000Google Scholar
- Pickle LW, Feuer EJ, Edwards BK: U.S. predicted cancer incidence, 1999: Complete maps by county and state from spatial projection models. NCI Cancere Surveillance Monograph Series No. 5. 2003, Bethesda, MD, National Cancer Institute, NIH Publication No. 03-5435Google Scholar
- McElroy JA, Remington PL, Trentham-Dietz A, Robert SA, Newcomb PA: Geocoding addresses from a large population-based study: lessons learned. Epidemiology. 2003, 14: 399-407.PubMedGoogle Scholar
- Boscoe FP, Kielb MS, Schymura MJ, Bolani TM: Assessing and improving census tract completeness. J Registry Management. 2002, 29: 117-123.Google Scholar
- Krieger N, Waterman PD, Lemieux K, Zierler S, Hogan JW: On the wrong side of the tracts? Evaluating the accuracy of geocoding in public health research. Am J Public Health. 2001, 91: 1114-1116.PubMedPubMed CentralView ArticleGoogle Scholar
- Krieger N, Waterman PD, Chen JT, Soobader MJ, Subramanian SV, Carson R: Zip code caveat: Bias due to spatitemporal mismatches between zip codes and U.S. census-defined geographic areas -- The Public Health Disparities Geocoding Project. Am J Public Health. 2002, 92: 1100-1102.PubMedPubMed CentralView ArticleGoogle Scholar
- Cayo MR, Talbot TO: Positional error in automated geocoding of residential addresses. Int J Health Geogr. 2003, 2: 10-10.1186/1476-072X-2-10.PubMedPubMed CentralView ArticleGoogle Scholar
- Bonner MR, Han D, Nie J, Rogerson P, Vena JE, Freudenheim JL: Positional accuracy of geocoded addresses in epidemiologic research. Epidemiology. 2003, 14: 408-412.PubMedGoogle Scholar
- Rushton G: Selecting appropriate geocoding methods for cancer control and prevention program activities. 2005, [http://www.uiowa.edu/~gishlth/giswkshp/GCD_Rushton_files/frame.htm#slide0001.htm]Google Scholar
- Whitsel EA, Rose KM, Wood JL, Henley AC, Liao D, Heiss G: Accuracy and repeatability of commercial geocoding. Am J Epidemiol. 2004, 160: 1023-1029. 10.1093/aje/kwh310.PubMedView ArticleGoogle Scholar
- Rothman KJ, Greenland S: Modern epidemiology. 1998, Philadelphia, PA, Lippincott Williams & Wilkins, SecondGoogle Scholar
- Vach W: Some issues in estimating the effect of prognostic factors from incomplete covariate data. Stat Med. 1997, 16: 57-72. 10.1002/(SICI)1097-0258(19970115)16:1<57::AID-SIM471>3.3.CO;2-J.PubMedView ArticleGoogle Scholar
- EK C, McLafferty SL: GIS and public health. 2002, New York, The Guilford PressGoogle Scholar
- Waller LA, Gotway CA: Applied spatial statistics for public health data. 2004, Hoboken, NJ, John Wiley & Sons, Inc.View ArticleGoogle Scholar
- Gregorio DI, Cromley E, Mrozinski R, Walsh SJ: Subject loss in spatial analysis of breast cancer. Health Place. 1999, 5: 173-177. 10.1016/S1353-8292(99)00004-0.PubMedView ArticleGoogle Scholar
- Oliver MN, Smith E, Siadaty M, Hauck FR, Pickle LW: A spatial analysis of prostate cancer incidence and race in Virginia, 1990-1999. Am J Prev Med. 2005Google Scholar
- Gregorio DI, Dechello LM, Samociuk H, Kulldorff M: Lumping or splitting: seeking the preferred areal unit for health geography studies. Int J Health Geogr. 2005, 4: 6-10.1186/1476-072X-4-6.PubMedPubMed CentralView ArticleGoogle Scholar
- Beyers N, Gie RP, Zietsman HL, Kunneke M, Hauman J, Tatley M, Donald PR: The use of a geographical information system (GIS) to evaluate the distribution of tuberculosis in a high-incidence community. S Afr Med J. 1996, 86: 40-1, 44.PubMedGoogle Scholar
- Miles-Doan R, Kelly S: Geographic concentration of violence between intimate partners. Public Health Rep. 1997, 112: 135-141.PubMedPubMed CentralGoogle Scholar
- Rushton G, Lolonis P: Exploratory spatial analysis of birth defect rates in an urban population. Stat Med. 1996, 15: 717-726. 10.1002/(SICI)1097-0258(19960415)15:7/9<717::AID-SIM243>3.0.CO;2-0.PubMedView ArticleGoogle Scholar
- Hurley SE, Saunders TM, Nivas R, Hertz A, Reynolds P: Post office box addresses: a challenge for geographic information system-based studies. Epidemiology. 2003, 14: 386-391.PubMedGoogle Scholar
- Kravets N, Hadden WC: The accuracy of address coding and the effects of coding errors. Health Place. 2005Google Scholar
- Klassen AC, Curriero FC, Hong JH, Williams C, Kulldorff M, Meissner HI, Alberg A, Ensminger M: The role of area-level influences on prostate cancer grade and stage at diagnosis. Prev Med. 2004, 39: 441-448. 10.1016/j.ypmed.2004.04.031.PubMedView ArticleGoogle Scholar
- PA L, MF G, DJ M, DW R: Geographic information systems and science. 2001, New York, John Wiley & SonsGoogle Scholar
- Centers for Disease Control and Prevention (CDC): Behavioral Risk Factor Surveillance System Survey Data. 2000, Atlanta, GA, U.S. Department of Health and Human Services, Centers for Disease Control and PreventionGoogle Scholar
- North American Association of Central Cancer Registries: Cancer in North America, 1988-2002. Appendix C: Indicators of data quality for all participating registries. 2005, [http://www.naaccr.org/index.asp?Col_Sectionkey=11&Col_ContentID=50=12&Col_ContentID=54]Google Scholar
- Fleiss JL: Statistical methods for rates and proportions. 1981, New York, NY, John Wiley and SonsGoogle Scholar
- Mungiole M, Pickle LW, Simonson KH: Application of a weighted headbanging algorithm to mortality data maps. Statistics in Medicine. 1999, 18: 3201-3209. 10.1002/(SICI)1097-0258(19991215)18:23<3201::AID-SIM310>3.3.CO;2-L.PubMedView ArticleGoogle Scholar
- LR C: Simultaneous statistical inference in the normal multiple linear regression model. JASA. 1973, 68: 457-461.View ArticleGoogle Scholar
- Tango T: A test for spatial disease clustering adjusted for multiple testing. Statistics in Medicine. 2000, 19: 191-204. 10.1002/(SICI)1097-0258(20000130)19:2<191::AID-SIM281>3.0.CO;2-Q.PubMedView ArticleGoogle Scholar
- Kuldorff M, Nagarwalla N: Spatial disease clusters: detection and inference. Statistics in Medicine. 1995, 14: 799-810.View ArticleGoogle Scholar
- Kuldorff M: A spatial scan statistic. Communications in Statistics Theory and Methods. 1997, 26: 1481-1496.View ArticleGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.