Lumping or splitting: seeking the preferred areal unit for health geography studies
© Gregorio et al; licensee BioMed Central Ltd. 2005
Received: 28 January 2005
Accepted: 23 March 2005
Published: 23 March 2005
Findings are compared on geographic variation of incident and late-stage cancers across Connecticut using different areal units for analysis.
Few differences in results were found for analyses across areal units. Global clustering of incident prostate and breast cancer cases was apparent regardless of the level of geography used. The test for local clustering found approximately the same locales, populations at risk and estimated effects. However, some discrepancies were uncovered.
In the absence of conditions calling for surveillance of small area cancer clusters ('hot spots'), the rationale for accepting the burdens of preparing data at levels of geography finer than the census tract may not be compelling.
The geographic study of cancer patterns can be an important tool in disease control and prevention, as well as a resource for generating hypotheses about pathogenesis. Unfortunately, there is little practical guidance available as to whether or how to select an 'ideal' level of geography for surveillance of events with distinctive spatial autocorrelations. Designating the geo-spatial locations of health events (i.e., 'geocoding') so as to be accurate (within acceptable error), precise (to a desired areal unit of analysis) and 'fit for use' (applicable to other available data) can be vexing, even for those with great skill and experience [4–12]. On the one hand, small areal units containing few at-risk subjects will yield less reliable rates than larger units, whereas on the other hand, large areal units have potential to blur meaningful variation occurring within locales. Communicating and interpreting results that disentangle underlying risks from methodological artifact is important for public health workers and epidemiologists alike.
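The reliability side of this trade-off can be illustrated numerically. The sketch below is not part of the study: it assumes case counts follow a Poisson distribution, so that the relative standard error of a crude rate is roughly one over the square root of the expected number of cases. The baseline rate and the population sizes are hypothetical values chosen only for illustration.

```python
import math

def relative_standard_error(population, rate_per_100k):
    """Approximate relative SE of a crude rate, assuming Poisson-distributed counts."""
    expected_cases = population * rate_per_100k / 100_000
    if expected_cases <= 0:
        raise ValueError("expected case count must be positive")
    return 1.0 / math.sqrt(expected_cases)

# Hypothetical at-risk populations for progressively larger areal units.
HYPOTHETICAL_UNITS = {"block group": 1_000, "census tract": 4_000, "town": 20_000}
ASSUMED_RATE = 150.0  # hypothetical cases per 100,000 over the study period

for name, pop in HYPOTHETICAL_UNITS.items():
    rse = relative_standard_error(pop, ASSUMED_RATE)
    print(f"{name:>12}: population {pop:>6,}, relative SE ~ {rse:.0%}")
```

Under these assumptions the relative error shrinks with the square root of the population at risk, which is why rates for larger units are more stable even as they average over local variation.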
Procedures for spatial analyses of suspected cancer 'hot spots' may be unnecessary and even inappropriate for studies of rate variation across large areas, as well as those intended to evaluate resource allocations [16, 17]. At the same time, concerns to protect confidentiality of geographically referenced health data by those entrusted to collect and manage surveillance data may effectively eliminate some options for analysis. While the underpinnings of the 'modifiable areal unit problem (MAUP)' have been well described [18, 19], there is little guidance on how to deal effectively with the problem, and there are few real examples of whether or how differing aggregation units affect actual results. Amrhein's treatment of simulated data suggests fewer disparities across findings with greater aggregation of data. Krieger et al., examining all-cause and selected cause-specific mortality and cancer incidence rates across Massachusetts and Rhode Island, found analyses by block group and census tract performed comparably, although tract-level analyses were found to offer greater linkage to area-based socio-economic indicators. Sheehan et al. reported few differences for town, zip code or census tract-level analyses of breast cancer incidence across Massachusetts, but noted case counts fluctuated due to various geocoding problems.
Here, we address the problem of modifiable areal units while examining breast and prostate cancer incidence during a 5 year interval (1988–92) across Connecticut. Initially, we utilized geographically referenced data furnished to us by the CT Tumor Registry to consider differences in incidence and late-stage cases according to town and census tract. Evidence of either global or local clustering was evaluated using Oden's Ipop and the spatial scan statistic. Subsequently, we independently ascertained the census block group and exact latitude-longitude coordinates of recorded cases to consider whether greater precision of location modified or enhanced initial findings.
Prostate cancer incidence
Ipop global clustering. Case count correlations for the geographic distribution of invasive and late-stage prostate or breast cancer incidence within or among selected areal units of analysis, Connecticut, 1988–92. (Table rows: Prostate Cancer Incidence; Breast Cancer Incidence; Late Stage Prostate Cancer Incidence; Late Stage Breast Cancer Incidence.)
Spatial scan statistic clusters. Approximate locations with elevated invasive and late-stage prostate or breast cancer incidence according to selected areal units of analysis, Connecticut, 1988–92. (Table columns: Coordinates (Lat.; Long.); Cases in Cluster. Table rows: Prostate Cancer Incidence; Breast Cancer Incidence; Late Stage Prostate Cancer Incidence, including Place of Residence; Late Stage Breast Cancer Incidence, including Place of Residence. Three entries read 'No significant clusters detected'.)
Breast cancer incidence
Significant global clustering was found at each level of analysis. According to Ipop test results, the percent of incident breast cancer cases clustering among geographic units was somewhat less than that for prostate cancer in results for town (24.2% vs. 42.6%) or block group (27.9% vs. 40.0%), but similar when examined according to census tract (36.3% vs. 39.9%).
Proportion of late stage prostate cancer
Results of the Ipop statistic for tract and town level analyses did not reveal global clustering of late-stage prostate cancers, but significant, albeit minimal, clustering (i.e., only 0.3% of clustering was attributed to cases in adjacent block groups) was indicated in analysis by block group.
Proportion of late stage breast cancer
Spatial analysis of health necessarily addresses issues about the accuracy of geocoded data, the requirements of time and training necessary to complete tasks, the threats to protecting confidentiality of sensitive health records and the interpretability of results for given areal units of analysis. Desire for greater precision challenges data safeguards as well as the technical capacity of available GIS systems. Surveillance by aggregating records into large areal units will yield greater proportions of accurate and protected records but possibly at the expense of capacity to identify discrete locales with elevated rates/proportions of health outcomes.
Our effort to contrast geographic analyses of prostate and breast cancers according to differing aggregation units across Connecticut yielded much, but not complete, consistency across analyses. Like others [20, 22], we found in most instances that results obtained by block group level data mirrored those based on the census tract. As such, interpretations based on geocoded data available through the CTR were not appreciably enhanced by our further efforts to specify finer levels of geography. Global clustering of incident prostate and breast cancer cases was apparent for either level of geography and the test for local clustering found approximately the same locales, populations at risk and estimated effects.
On the other hand, some discrepancies were uncovered. Secondary cluster locations varied by level of analysis. More importantly, analysis of breast cancer incidence by town yielded an approximate location of a significant primary cluster some distance from results based on block group or tract. It is possible that this discrepancy is not a product of analytic scale but the consequence of differing ability to geocode records across all locales. Testing this hunch would require analyses in which cases excluded from one level of analysis were excluded from all other analyses. As our intention was not a pure test of MAUP but a 'simulation' of the choice investigators might confront when selecting between a geographically referenced file in hand (CTR generated) and one independently created using original address data, we did not pursue this line of inquiry here.
The local tests for late stage prostate cancer produced similar findings of significant clustering for analysis by exact coordinates, block group, tract or town, whereas results for the global clustering test were significant only for the block group analysis. Significant global clustering of late stage breast cancer was found using block group or tract, but not town or exact coordinates; significant local clustering was found only for the block group level analysis. Divergence across analyses could reflect distinctions among the levels of aggregation or merely subtle differences in the relative size of our data sets. It is noted that analysis of disaggregate (point) data raises issues separate from those specific to MAUP, the focus of this paper. It goes without saying that statistical procedures predicated on disaggregate (point) data would be unavailable if only aggregate files were available.
When analyzing geographic health data, concern regarding scale effects attributable to MAUP is unavoidable. Increased aggregation of data reduces power to detect very small clusters but stabilizes rate estimates. For now, the magnitude and direction of artifact generated by a given areal unit cannot be reliably predicted. Consequently, analysts will continue to be driven to select a preferred areal unit for analysis based on pragmatic rather than scientific considerations. In the absence of conditions calling for surveillance of small area cancer clusters ('hot spots'), the rationale for analysts to accept the technical, political and substantive burdens of preparing data at levels of geography finer than the census tract may not be compelling. The added protections to personal health data, the ease of interpretation and the applicability of similarly structured census and survey data argue for geographic studies to prioritize census tract level analyses.
Spatial and population characteristics of selected areal units of Connecticut.
Area (sq. km) | 1990 Population 20 & Over | Persons 20 + years per sq. mile
956 to 2,383 | 72,931 to 635,829 | 54 to 382
13 to 160 | 443 to 100,552 | 6 to 2,426
0.5 to 249 | 19 to 45,623 | 9 to 3,943
<0.01 to 160 | 0 to 7,507 | 6 to 9,077
<0.01 to 86 | 0 to 5,415 | 0 to 21,333
0 to 2,796
Between 1988 and 1992, the Connecticut Tumor Registry (CTR) recorded incidence and stage of diagnosis of 10,054 invasive cancers of the prostate (ICD-9-185) and 12,518 breast cancers (ICD-9-174) among State residents. The Institutional Review Boards of the University of Connecticut and Connecticut State Department of Public Health approved our access to, and analysis of, the information reported here.
Table 4. Geocoding of incident prostate and breast cancer cases, Connecticut, 1988–92. Row categories:
- Incident cases with town of residence recorded by the Connecticut Tumor Registry (CTR)
- Census tract of residence recorded by CTR
- Geocoded block group & street address of residence
  - Geocoded street address on 1st try (stringent criteria)
  - Geocoded street address on 2nd try (relaxed criteria)
- Nursing home resident excluded for analysis by block group and exact coordinates
- Record not geocoded
  - Post Office box listed
  - No street address listed
  - No house number listed
  - Listed address unable to geocode
To examine whether geographic patterns of cancer incidence and late stage change at finer units of analysis, we subsequently used the full street address available within the CTR record to independently assign latitude-longitude coordinates and census block group to the place of residence for 9,207 prostate (92%) and 11,864 breast (95%) cancer records. Our purpose was neither to augment nor correct the CTR data, but to generate separate geographically-referenced files to study cancer patterns according to aggregation units otherwise unavailable to external researchers. This accounts for the seemingly incongruous observation that 11,753 records were geocoded (by us) to block group whereas only 10,924 records were geocoded (by CTR) to tract. The result of our effort, vis-à-vis data provided by the CTR, is summarized in Table 4. As there is no 'gold standard' available to validate geocoded results, no effort was made to enumerate or resolve ambiguities that could be noted if files were directly compared.
Approximately one-half of records geocoded in this manner were categorized using stringent coding criteria (i.e., an address conforms completely to a street location recognized by geocoding software); the remainder were completed using 'relaxed' procedures (i.e., an address bearing one or more incongruities was assigned to the 'most likely' street location by the geocoding software). We were unable to geocode 847 prostate and 654 breast cancer records because only a Post Office box was available, no street or house number was recorded or the recorded address could not be matched to a recognized street location. Records for individuals with addresses associated with nursing homes were not included in this phase of analysis (179 prostate and 111 breast cancer records, respectively), leaving totals of 9,028 prostate cancers and 11,753 breast cancers for study.
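The record accounting described above can be checked with simple arithmetic. The short sketch below uses only the counts reported in the text; the variable and function names are ours and purely illustrative.

```python
# Counts reported in the text for Connecticut, 1988-92.
RECORDED = {"prostate": 10_054, "breast": 12_518}      # cases recorded by the CTR
NOT_GEOCODED = {"prostate": 847, "breast": 654}        # PO box, missing or unmatched address
NURSING_HOME = {"prostate": 179, "breast": 111}        # excluded from block group / point analyses

def geocoding_summary(site):
    """Return (records geocoded, records analyzed, percent geocoded) for a cancer site."""
    geocoded = RECORDED[site] - NOT_GEOCODED[site]
    analyzed = geocoded - NURSING_HOME[site]
    percent_geocoded = round(100 * geocoded / RECORDED[site])
    return geocoded, analyzed, percent_geocoded

for site in ("prostate", "breast"):
    print(site, geocoding_summary(site))
# prostate (9207, 9028, 92)
# breast (11864, 11753, 95)
```

The results reproduce the 9,207 (92%) and 11,864 (95%) geocoded records and the 9,028 and 11,753 records retained for block group and exact-coordinate analyses.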
Numerous tests for spatial randomness (i.e., are geographical patterns due to random fluctuations/chance or true underlying variability?) are available. For purposes of illustration, we selected one global clustering and one cluster detection test to evaluate geographic variations of disease rates.
Oden's Ipop indicates whether there is an overall pattern of spatial aggregation of cases throughout the study region, without regard to specific locations where aggregation might occur. Group data are used to generate a weighted correlation coefficient, adjusted for population size, that indicates the extent to which case counts within given locations are associated with values of neighboring locales (i.e., are places with high frequencies adjacent to places with similarly high frequencies?). The significance of the computed value is evaluated in relation to an expectation derived by a hypothetical null spatial distribution of data. Oden's Ipop was calculated using ClusterSeer v2.06 software.
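To give a sense of what a global clustering statistic measures, the sketch below computes the ordinary Moran's I on area-level rates with binary adjacency weights, together with a permutation test. This is a simplified stand-in and not Oden's Ipop: it omits the population adjustment that Ipop introduces and does not reproduce the ClusterSeer calculation. The six-unit chain of areas and their rates are hypothetical.

```python
import random

def morans_i(values, neighbors):
    """Moran's I with binary adjacency weights.

    values: list of area-level rates.
    neighbors: dict mapping each index to the indices of adjacent areas (symmetric).
    """
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    denom = sum(d * d for d in dev)
    weight_sum = sum(len(adj) for adj in neighbors.values())
    cross = sum(dev[i] * dev[j] for i, adj in neighbors.items() for j in adj)
    return (n / weight_sum) * (cross / denom)

def permutation_p_value(values, neighbors, n_perm=999, seed=0):
    """One-sided p-value: how often a random relabeling is at least as clustered."""
    rng = random.Random(seed)
    observed = morans_i(values, neighbors)
    shuffled = list(values)
    at_least_as_large = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if morans_i(shuffled, neighbors) >= observed:
            at_least_as_large += 1
    return observed, (at_least_as_large + 1) / (n_perm + 1)

# Hypothetical example: six areas in a row, each adjacent to its immediate neighbors.
chain = {i: [j for j in (i - 1, i + 1) if 0 <= j < 6] for i in range(6)}
rates = [10.0, 9.0, 8.0, 2.0, 1.0, 0.0]  # high rates sit next to high rates
statistic, p_value = permutation_p_value(rates, chain)
print(f"Moran's I = {statistic:.2f}, permutation p = {p_value:.3f}")
```

Positive values indicate that similar rates tend to be adjacent; values near or below zero indicate no such tendency or an alternating pattern.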
The spatial scan statistic looks for significant concentration of cases at specific locations within a study region without preconceptions about where concentrations might be found. The spatial scan statistic utilizes scanning circles of varying location and size so as to contain 0–25% of the State's population at risk to identify places where the number of observed cases exceeds expectation under a null hypothesis that incidence is proportional to population density. The spatial scan statistic was calculated using SaTScan 3.1.
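The scanning procedure can be sketched as follows. This is a minimal illustration under stated assumptions, not the SaTScan implementation: it uses the Poisson-model log likelihood ratio for high-rate windows, grows circles around each unit's centroid until 25% of the population would be exceeded, ignores ties in distance, and omits the Monte Carlo replications needed to attach a p-value. The eight units and their counts are hypothetical.

```python
import math

def poisson_llr(cases_in, expected_in, total_cases):
    """Log likelihood ratio for one window under a Poisson model (elevated rates only)."""
    if expected_in <= 0 or cases_in <= expected_in:
        return 0.0
    cases_out = total_cases - cases_in
    expected_out = total_cases - expected_in
    llr = cases_in * math.log(cases_in / expected_in)
    if cases_out > 0:
        llr += cases_out * math.log(cases_out / expected_out)
    return llr

def most_likely_cluster(units, max_pop_fraction=0.25):
    """units: list of (x, y, population, cases). Returns (llr, center index, member indices)."""
    total_pop = sum(u[2] for u in units)
    total_cases = sum(u[3] for u in units)
    best = (0.0, None, [])
    for i, (cx, cy, _, _) in enumerate(units):
        # Visit units in order of distance from the candidate center.
        order = sorted(range(len(units)),
                       key=lambda j: math.hypot(units[j][0] - cx, units[j][1] - cy))
        pop = cases = 0
        members = []
        for j in order:
            pop += units[j][2]
            if pop > max_pop_fraction * total_pop:
                break  # window would exceed the population cap
            cases += units[j][3]
            members.append(j)
            expected = total_cases * pop / total_pop
            llr = poisson_llr(cases, expected, total_cases)
            if llr > best[0]:
                best = (llr, i, list(members))
    return best

# Hypothetical units along a line, equal populations, one unit with excess cases.
units = [(float(x), 0.0, 1000, c) for x, c in enumerate([5, 5, 5, 5, 5, 5, 20, 5])]
llr, center, members = most_likely_cluster(units)
print(f"most likely cluster: center unit {center}, members {members}, LLR = {llr:.2f}")
```

In practice the significance of the most likely cluster is judged by recomputing the maximum log likelihood ratio on many data sets simulated under the null hypothesis.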
Among the available address matched records, 9,207 (92%) prostate cancer and 11,864 (95%) breast cancer records contained sufficient information for geographic analyses of 'late stage' disease across the State. Historical SEER summary stage classifications were used where regional/distant prostate or breast cancers were noted among 2,198 (28%) and 4,119 (40%) records, respectively. Analyses of geographic distribution of disease stage (regional/distant versus local) using Oden's Ipop and the spatial scan statistic were completed according to town, census tract and census block group of residence. The spatial scan also was applied using exact place of residence coordinates of cases; because necessary group boundaries for discrete residential locations are unavailable, Oden's Ipop could not be used with individual coordinates. Maptitude 4.5 software was used to map cluster locations with markedly high incidence rates (Figures 1 and 2) or proportions of late-stage disease (Figures 3 and 4).
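When the outcome is a proportion, such as late-stage versus local disease among staged cases, a scan window can be scored with a Bernoulli-type log likelihood ratio. The text does not state which probability model was used for the late-stage analyses, so the sketch below is an assumption offered for illustration; the function name and the example counts are hypothetical.

```python
import math

def _xlogy(x, y):
    """Return x * log(y), treating 0 * log(0) as 0."""
    return 0.0 if x == 0 else x * math.log(y)

def bernoulli_llr(late_in, staged_in, late_total, staged_total):
    """Log likelihood ratio for a window with an elevated late-stage proportion.

    late_in / staged_in: late-stage and all staged cases inside the window.
    late_total / staged_total: the same counts for the whole study region.
    """
    staged_out = staged_total - staged_in
    if staged_in == 0 or staged_out == 0:
        return 0.0
    late_out = late_total - late_in
    p_in = late_in / staged_in
    p_out = late_out / staged_out
    if p_in <= p_out:
        return 0.0  # only windows with a higher proportion inside are of interest
    p_all = late_total / staged_total
    alternative = (_xlogy(late_in, p_in) + _xlogy(staged_in - late_in, 1 - p_in)
                   + _xlogy(late_out, p_out) + _xlogy(staged_out - late_out, 1 - p_out))
    null = _xlogy(late_total, p_all) + _xlogy(staged_total - late_total, 1 - p_all)
    return alternative - null

# Hypothetical window: 9 of 10 staged cases are late stage, against 50 of 100 overall.
print(f"LLR = {bernoulli_llr(9, 10, 50, 100):.2f}")
```

A window whose late-stage proportion matches or falls below the proportion elsewhere scores zero, so only locales with a relative excess of late-stage disease can emerge as candidate clusters.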
This publication/project was made possible through a Cooperative Agreement between the Centers for Disease Control and Prevention (CDC) and the Association of Teachers of Preventive Medicine (ATPM), award number U50/CCU300860 project number TS-0431; its contents are the responsibility of the authors and do not necessarily reflect the official views of the CDC or ATPM.
- Rushton G: Methods to evaluate geographic access to health services. J Public Health Manag Pract. 1999, 5: 93-100.
- Klassen AC, Curriero FC, Hong JH, Williams C, Kulldorff M, Meissner HI, Alberg A, Ensminger M: The role of area-level influences on prostate cancer grade and stage at diagnosis. Prev Med. 2004, 39: 441-448. 10.1016/j.ypmed.2004.04.031.
- Cromley EK, Cromley RG: An analysis of alternative classification schemes for medical atlas mapping. Eur J Cancer. 1996, 32A: 1551-1559. 10.1016/0959-8049(96)00130-X.
- Chen W, Petitti DB, Enger S: Limitations and potential uses of census-based data on ethnicity in a diverse community. Ann Epidemiol. 2004, 14: 339-345. 10.1016/j.annepidem.2003.07.002.
- Rushton G: Selecting appropriate geocoding methods for cancer control and prevention program activities. [http://www.uiowa.edu/~gishlth/giswkshp/GCD_Rushton_files/frame.htm#slide0001.htm]
- Gregorio DI, Cromley E, Mrozinski R, Walsh SJ: Subject loss in spatial analysis of breast cancer. Health Place. 1999, 5: 173-177. 10.1016/S1353-8292(99)00004-0.
- Yang DH, Bilaver LM, Hayes O, Goerge R: Improving geocoding practices: evaluation of geocoding tools. J Med Syst. 2004, 28: 361-370. 10.1023/B:JOMS.0000032851.76239.e3.
- Cayo MR, Talbot TO: Positional error in automated geocoding of residential addresses. Int J Health Geogr. 2003, 2: 10. 10.1186/1476-072X-2-10.
- Krieger N, Waterman P, Lemieux K, Zierler S, Hogan JW: On the wrong side of the tracts? Evaluating the accuracy of geocoding in public health research. Am J Public Health. 2001, 91: 1114-1116.
- Bonner MR, Han D, Nie J, Rogerson P, Vena JE, Freudenheim JL: Positional accuracy of geocoded addresses in epidemiologic research. Epidemiology. 2003, 14: 408-412.
- McElroy JA, Remington PL, Trentham-Dietz A, Robert SA, Newcomb PA: Geocoding addresses from a large population-based study: lessons learned. Epidemiology. 2003, 14: 399-407.
- Hurley SE, Saunders TM, Nivas R, Hertz A, Reynolds P: Post office box addresses: a challenge for geographic information system-based studies. Epidemiology. 2003, 14: 386-391.
- Thun MJ, Sinks T: Understanding cancer clusters. CA Cancer J Clin. 2004, 54: 273-280.
- Kulldorff M, Nagarwalla N: Spatial disease clusters: detection and inference. Stat Med. 1995, 14: 799-810.
- Sturgeon SR, Schairer C, Gail M, McAdams M, Brinton LA, Hoover RN: Geographic variation in mortality from breast cancer among white women in the United States. J Natl Cancer Inst. 1995, 76: 1846-1853.
- Rushton G, West M: Women with localized breast cancer selecting mastectomy treatment, Iowa, 1991–1996. Public Health Rep. 1999, 114: 370-371.
- Gregorio DI, Kulldorff M, Barry L, Samociuk H, Zarfos K: Geographic differences in primary therapy for early-stage breast cancer. Ann Surg Oncol. 2001, 8: 844-849.
- Openshaw S, Alvanides S: Applying geocomputing to the analysis of spatial distributions. Geographic information systems: Principles and technical issues. Edited by: Longley P, Goodchild M, Maguire D, Rhind D. 1999, New York: John Wiley and Sons, Inc, 1: 2
- Amrhein C: Searching for the elusive aggregation effect: Evidence from statistical simulations. Environment & Planning A. 1994, 27: 105-09.
- Krieger N, Chen JT, Waterman PD, Soobader MJ, Subramanian SV, Carson R: Geocoding and monitoring of US socioeconomic inequalities in mortality and cancer incidence: does the choice of area-based measure and geographic level matter?. Am J Epidemiol. 2002, 156: 471-482. 10.1093/aje/kwf068.
- Krieger N, Chen JT, Waterman PD, Rehkopf DH, Subramanian SV: Race/ethnicity, gender and monitoring socioeconomic gradients in health: a comparison of area-based socioeconomic measures – the Public Health Disparities Geocoding Project. Am J Public Health. 2003, 93: 1655-1671.
- Sheehan TJ, Gershman ST, MacDougal L, Danley RA, Mroszczyk M, Sorensen AM, Kulldorff M: Geographic surveillance of breast cancer screening by tracts, towns and zip codes. J Public Health Manag Pract. 2000, 6: 48-57.
- Oden N: Adjusting Moran's I for population density. Stat Med. 1995, 14: 17-26.
- Kulldorff M: A spatial scan statistic. Commun Stat Theory Methods. 1997, 26: 1481-1496.
- Gregorio DI, Cromley E, Tate JP, Mrozinski R, Walsh SJ, Flannery J: Subject loss in spatial analysis of breast cancer. Health and Place. 1999, 5: 173-77. 10.1016/S1353-8292(99)00004-0.
- Waller LA, Gotway CA: Applied Spatial Statistics for Public Health Data. 2004, New York: Wiley
- Census of Population and Housing, 1990 [United States]: Summary Tape File 1, Connecticut. [http://www.census.gov]
- Caliper Corporation: Maptitude Geographic Information System for Windows. ver 4.5. 2001, Newton, MA
- Lawson AB, Kulldorff M: A review of cluster detection methods. Disease mapping and risk assessment for public health decision-making. Edited by: Lawson AB, Biggeri A, Bohning D, Lesaffre E, Veil J, Bertollini R. 1999, London: Wiley, 99-110.
- TerraSeer, Inc: ClusterSeer. ver. 2.07; 2002–2003. [http://www.terraseer.com/products/clusterseer.html]
- Kulldorff M, Information Management Services, Inc: SaTScan. ver. 3.1. 2003, [http://www.statscan.org]
- National Cancer Institute: SEER Extent of Disease – 1988: Codes and Coding Instructions. 1998, [http://seer.cancer.gov/manuals/EOD10Dig.pub.pdf]
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.