On the use of ZIP codes and ZIP code tabulation areas (ZCTAs) for the spatial analysis of epidemiological data
© Grubesic and Matisziw; licensee BioMed Central Ltd. 2006
Received: 16 October 2006
Accepted: 13 December 2006
Published: 13 December 2006
While the use of spatially referenced data for the analysis of epidemiological data is growing, issues associated with selecting the appropriate geographic unit of analysis are also emerging. A particularly problematic unit is the ZIP code. Because ZIP codes and ZIP code tabulation areas (ZCTAs) lack standardization and are highly dynamic in structure, their use for the spatial analysis of disease presents a unique challenge to researchers. Problems associated with these units for detecting spatial patterns of disease are explored.
A brief review of ZIP codes and their spatial representation is conducted. Though frequently represented as polygons to facilitate analysis, ZIP codes are actually defined at a narrower spatial resolution reflecting the street addresses they serve. This research shows that their generalization as continuous regions is an imposed structure that can have serious implications in the interpretation of research results. ZIP code areas and Census-defined ZCTAs, two commonly used polygonal representations of ZIP code address ranges, are examined in an effort to identify the spatial statistical sensitivities that emerge given differences in how these representations are defined. Here, comparative analysis focuses on the detection of patterns of prostate cancer in New York State. Of particular interest for studies utilizing local spatial statistical tests is that differences in the topological structures of ZIP code areas and ZCTAs give rise to different spatial patterns of disease. These differences are related to the different methodologies used in the generalization of ZIP code information. Given the difficulty associated with generating ZIP code boundaries, both ZIP code areas and ZCTAs contain numerous representational errors which can have a significant impact on spatial analysis. While the use of ZIP code polygons for spatial analysis is relatively straightforward, ZCTA representations contain additional topological features (e.g. lakes and rivers) and fragmented polygons that can hinder spatial analysis.
Caution must be exercised when using spatially referenced data, particularly that which is attributed to ZIP codes and ZCTAs, for epidemiological analysis. Researchers should be cognizant of representational errors associated with both geographies and their resulting spatial mismatch, especially when comparing the results obtained using different topological representations. While ZCTAs can be problematic, topological corrections are easily implemented in a geographic information system to remedy erroneous aggregation effects.
As the production and consumption of spatial data continues to increase, the subsequent use and abuse of spatially referenced data is also on the rise. Jacquez provides a timely review of the key issues, outlining a number of limitations to working with spatial and temporal data. For example, one of the major issues confronting analysts is spatiotemporal mismatch. Broadly defined, this occurs when data collected in both space and time do not coincide. For example, Jacquez highlights a recent study of lung cancer on Long Island that used cancer data collected at the ZIP+4 level reported for 1994–97. Cancer incidence was then compared to air toxics data from the Environmental Protection Agency for 1996. In this particular instance, the mismatch is both spatial and temporal.
A second concern highlighted by Jacquez and others [3–5] is the issue of granularity in epidemiological data. In sum, granularity deals with the spatial and temporal resolution of data. Because human health applications must adhere to patient privacy protocols, individual-level data is frequently aggregated to larger spatial units for analysis. For instance, rather than utilizing geocoded household data corresponding to individual patients, these records are aggregated to the ZIP code level for analysis. This process prevents unwanted disclosure or reconstruction of patient identity. However, it also reduces the ability of analysts to compare data across spatial units. For example, if one set of data is aggregated to census tracts and another set to ZIP codes, issues relating to the modifiable areal unit problem emerge.
A third major issue of interest is more technical in nature, that of polygons, topology and computational geometry. As noted by Jacquez, many spatial statistical techniques are predicated on the accurate representation of areal units (polygons), points and lines. If there are problems with areal units, such as self-intersection, the resulting statistical analyses can be interlaced with errors.
As with most technical issues, epidemiologists, geographers and other analysts are aware of the limitations and caveats of working with spatial data. For example, in a study of cerebrovascular disease in New York State, Han et al. note:
"[t]here may be some bias related to spatial mismatch, since we have used ZIP-code level hospitalization data and ZCTA-level population and income data in our analysis.... Unfortunately, we could not find any empirical study that validates this issue of spatial mismatch."
Of particular interest in the previous statement is the issue of bias and spatial mismatch between ZIP code areas and ZIP code tabulation areas (ZCTAs). In fact, the problems of spatiotemporal mismatches between these two units have largely gone unnoticed. While Krieger et al. provide a brief overview regarding many of the technical differences between ZIP codes and ZCTAs, a full treatise of the differences, particularly how these differences may bias empirical analysis, is not available.
The purpose of this study is to 1) reexamine the use and misuse of ZIP codes and ZCTAs for epidemiological analysis, 2) provide enough technical detail on the construction of ZIP code and ZCTA boundaries, and their associated characteristics, to supply analysts with a more complete picture of their utility for spatial analysis, 3) provide an empirically based analysis of the spatial and statistical mismatch between ZIP code areas and ZCTAs, highlighting their relative weaknesses, and 4) develop a methodological approach for rectifying the problems inherent to ZCTA topologies, so that more direct comparisons between ZCTA and ZIP code-based analysis may be performed.
Results and discussion
Issues of spatial misrepresentation and mismatch
In the context of longitudinal spatial analyses, the ability to match spatial units through time is important. Fortunately, the hierarchically nested spatial units provided by the Census Bureau (e.g. blocks, block groups, tracts, counties, etc.) simplify this task. In most cases, changes to the spatial structure of Census tracts, and even block groups, can be tracked between the decennial surveys. As a result, accurate longitudinal analyses are much easier to perform. However, for temporally and spatially dynamic areal units that are not hierarchically nested, the problems of spatiotemporal mismatch are significant. Not surprisingly, the ZIP code and its spatial characteristics are of concern. Exceedingly popular for epidemiological analysis, the ZIP code has become a de facto spatial unit for the study of disease distribution and etiology [9–13].
Zone Improvement Plan codes, or ZIP codes as they are commonly known, originated as a way of classifying street segments, address ranges and delivery points to expedite the delivery of mail. Given that ZIP codes can be associated with most places of human habitation in the United States, they present researchers with an alternative means of collecting, visualizing, and analyzing spatial information. However, given their use in directing the distribution of mail, ZIP codes are not attributed to space in general, but rather to roads, post offices, and other facilities within the U.S. postal system. For instance, if an area does not have a recognized delivery point or address range, no ZIP code is assigned. Geographically, the best examples of this are in desolate and uninhabited places such as the Sonoran Desert in Arizona, the Mojave Desert in California and the Klamath Mountains in Oregon. Simply put, if no residential areas or business establishments exist, there is no need to deliver mail or assign a five-digit ZIP code. The process for making ZIP codes accessible for spatial analysis has involved their generalization into polygonal units representing the spatial extent of ZIP code delivery areas (referred to here as ZIP code areas). In large part, the tiling of the United States with ZIP code areas has been accomplished by various private data vendors. More recently, the U.S. Census Bureau has produced its own ZIP code topology for area-based representations: ZIP Code Tabulation Areas (ZCTAs).
The use of ZIP codes for applications other than postal delivery can present many challenges and there are several major issues worth summarizing. First, the United States Postal Service (USPS) makes updates to its ZIP codes regularly, providing this information in the biweekly Postal Bulletin. However, for analysts unfamiliar with a particular area, understanding the magnitude and nature of these changes is a challenge. For example, it is not uncommon for postal delivery routes to be realigned or for ZIP codes to be split. More importantly, ZIP codes can be discontinued, added or expanded from one month or year to the next. Thus, where longitudinal studies are concerned, even the slightest modification in ZIP codes and their associated coverage can create a spatiotemporal discontinuity. Many private data vendors update ZIP code area databases quarterly. However, even this relatively short time lag between updates can be problematic for areas where significant changes were made, particularly for syndromic surveillance or infectious outbreaks. Further, if analysts fail to make use of available updates, problems can also emerge. Another difficulty associated with ZIP code areas is the significant variation in geographic extent [8, 10]. Grubesic notes that the average size of a ZIP code area in Wyoming is 1,430 square kilometers, while the average size of a ZIP code area in New Jersey is 12.8 square kilometers. The USPS does not attempt to optimize the size or population allocation of ZIP codes, given that the sole purpose of the ZIP code is to expedite the delivery of mail. As a result, ZIP codes can range in size from a single building to a delivery zone spanning hundreds of square miles and crossing several political jurisdictions.
Table 1. A summary of Census ZCTA characteristics
- ZCTAs are linked to Census blocks and every tabulation block has a single ZCTA code
- ZCTAs cover all tabulation blocks in the United States and Puerto Rico
- ZCTAs may consist of two or more discontiguous areas
- A ZCTA code represents a five-digit ZIP code where possible
- In large undeveloped areas where there are no master address file (MAF) addresses with five-digit ZIP codes, the ZCTA code assigned is based on the three-digit ZIP code (e.g. XX for tracts of undeveloped land and HH for water features)
Table 2. Numerical differences between ZCTA and ZIP code (GDT 2000) geographic base files in New York State. Measures compared: number of polygons, number of unique records, and standard deviation in size.
CZU_i < 1: decreased level of uncertainty
CZU_i = 1: average level of uncertainty
CZU_i > 1: increased level of uncertainty
Figure 2 suggests that while many of the GDT ZIP codes in New York State include fewer than expected numbers of non-native street segments, many others display an increased level of uncertainty. Clearly, this suggests the presence of a relatively substantial gap between the ZIP codes assigned to linear features and their location relative to interpolated ZIP code areas. Interestingly, much of this uncertainty can be attributed to the process of ZIP code polygon interpolation, which is outlined in the next section.
ZIP code polygon interpolation
The process for developing ZIP code area polygons is relatively laborious. As mentioned previously, these areal units are not developed and distributed by the USPS. Rather, private data vendors, such as GDT/TeleAtlas and Caliper, generate these boundaries. Boundaries are created by using several important pieces of information. First, data vendors leverage mail-stop (i.e. residential and business addresses) information from the USPS and their associated street segments. Second, other non-street features are also analyzed, including water bodies, parks, and large tracts of undeveloped land. Third, ZIP+4 state directories are used to differentiate delivery zones and the corresponding boundaries for areas that might not have a clear-cut group of street segments. Finally, technicians make telephone inquiries to area post offices in an effort to determine predominant ZIP codes. Once all of this information is collected, ZIP code polygons are manually digitized. This process, particularly the use of manual digitizing routines, can lead to polygon generalization and a "smoother" geographic boundary file.
The process for developing ZCTAs by the U.S. Census Bureau is much different. As highlighted in Table 1, ZCTAs have some relatively distinct features that ZIP codes do not. Many of these features relate to the characteristics of the Census blocks on which they are based. There is no standard spatial extent of Census blocks. Some blocks are relatively small (i.e. those located in a city), while others are large and irregular, covering many square miles. Utilizing Census block boundaries, USPS ZIP code data and the 2000 Master Address File (MAF), the Census Bureau calculated the number of addresses associated with each ZIP code represented in each tabulation block and then assigned the ZCTA that represented the most frequently occurring ZIP code, with preference given to residential addresses. If no ZIP code data were available, ZCTA codes were assigned from an adjoining block. Finally, it is important to remember that since the size of Census blocks varies widely over space, zone delineation is guided more by the Census geographies than by the distribution of ZIP coded addresses.
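The block-level assignment rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Census Bureau's procedure: the function name `assign_zcta`, the input layout, and the ZIP codes are hypothetical, and "preference given to residential addresses" is interpreted here as counting residential addresses first and falling back to all addresses only when a block has none.

```python
from collections import Counter

def assign_zcta(blocks, neighbors):
    """Assign one ZCTA code per Census block (illustrative sketch).

    blocks:    {block_id: [(zip_code, is_residential), ...]}
    neighbors: {block_id: [adjoining block_id, ...]}
    """
    assigned = {}
    for block_id, addresses in blocks.items():
        residential = [z for z, is_res in addresses if is_res]
        # Assumption: residential addresses are counted first; all addresses
        # are used only when the block has no residential address.
        pool = residential or [z for z, _ in addresses]
        if pool:
            # Most frequently occurring ZIP code in the block
            assigned[block_id] = Counter(pool).most_common(1)[0][0]
    # Blocks with no ZIP-coded address take the code of an adjoining block.
    for block_id in blocks:
        if block_id not in assigned:
            for nb in neighbors.get(block_id, []):
                if nb in assigned:
                    assigned[block_id] = assigned[nb]
                    break
    return assigned

# Hypothetical blocks: A is residential, B is business only,
# C has no addresses, D mixes business and residential addresses.
blocks = {
    "A": [("13308", True), ("13308", True), ("13421", False)],
    "B": [("13421", False), ("13421", False)],
    "C": [],
    "D": [("13421", False), ("13421", False), ("13308", True)],
}
neighbors = {"C": ["B"]}
result = assign_zcta(blocks, neighbors)
```

Because each block receives exactly one code, a block such as D, whose addresses are split between two ZIP codes, is attributed entirely to one of them. This is the source of the overlap that a single-code zoning scheme cannot represent.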
The standard GDT (2000) ZIP code boundaries for Blossvale (ZIP code 13308) are highlighted in yellow. The ZCTA boundaries for the same ZIP code and the neighboring Lake Oneida are displayed in red. There are several critical points worth addressing here. First, the 13308 ZCTA and GDT ZIP code area representations are not in complete spatial correspondence, given that there are a number of slight deviations between these two areal units. Clearly, this represents a spatial mismatch. Second, notice that a small water feature, Fish Creek, cuts the 13308 ZCTA in half. When one examines the raw geographic base files for ZCTAs, 13308 actually appears twice. That is, there are two separate and distinct entries in the geographic base file for the 13308 ZCTA. Thus, if the ZCTA remains uncorrected, data assigned to the ZCTA will be represented twice. Additionally, if an adjacency matrix is constructed, as is often necessary in spatial statistical analysis, the two 13308 polygons are not treated as neighbors because they are split by the 130 HH water feature polygon. Therefore, inclusion of these polygons can muddle spatial relationships between ZCTAs that have socioeconomic, demographic and epidemiologic data associated with them. Clearly, any lack of adjustment to the ZCTA geographic base file incorporates these types of errors into the subsequent analysis.
Given this background in ZIP code area interpolation and ZCTA development, there are several questions remaining to be answered. First, how do these potential spatial inconsistencies manifest in the real world? Second, what kind of impact would these problems have on spatial-statistical analysis? Third, how does one correct these problems to ensure consistency and accuracy in an analysis?
Mitigating topological anomalies in the ZCTA geographic base file
To illustrate some of the issues associated with use of ZIP code areas and ZCTAs in spatial analysis, both topologies for New York State were obtained for analysis. In order to compare ZIP code areas with ZCTAs in New York, several important steps must be undertaken to mitigate the topological anomalies between these two geographic base files. Based on year 2000 ZIP code data from GDT, New York is covered by 1,599 ZIP code areas. Conversely, 2,450 Census ZCTAs cover the state (Table 2). In part, this high number of ZCTAs is a product of the 398 water features found in the state that fragment the ZCTAs. To bring these two geographies into greater accord, several steps must be taken to adjust the ZCTA file for the presence of these features:
1. In order to rectify the topological anomalies in the ZCTA file, one must remove all ZCTAs with HH codes. This eliminates all water features in the file. While the features are still visible, they are no longer entities in the geographic base file. It is not as critical to remove features with XX codes, because these actually do represent land masses with no formal addresses in the system, rarely splitting a ZCTA into multiple features like a river or creek might (See Figure 4).
2. All five-digit ZCTA entries that consist of multiple polygons (e.g. split by a water feature) must be dissolved on a common attribute ID. In virtually every case, this can be the ZCTA code. The dissolve process merges polygons into single features, removing double or triple entries in the geographic base file and ignoring any splits in polygon continuity that may have been created by water features.
3. Cancer incident cases, population, or whatever variables of interest are being analyzed, must be reaggregated back to the topologically rectified ZCTA geographic base file for analysis. This effectively removes the aggregation errors (e.g. double counting) from the original file.
4. Finally, if one is conducting a spatial statistical analysis that relies on neighborhood information, the adjacency matrix must be recalculated using the rectified ZCTA file. Again, because the water features are removed, and ZCTA polygons are now dissolved on a common attribute, the newly calculated adjacency matrix will represent a more realistic and accurate snapshot of spatial relationships between polygons.
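Steps 1 to 3 can be sketched without a GIS. The following is a minimal illustration, not the workflow used in the study: polygon fragments are reduced to sets of vertices, the helper names `rectify` and `reaggregate` are hypothetical, and the coordinates and case counts are invented for the example. In practice the same operations would be a selection, a dissolve and an attribute join in a geographic information system.

```python
from collections import defaultdict

def rectify(fragments):
    """Steps 1-2: drop water features, then dissolve fragments on the ZCTA code.

    fragments: list of (zcta_code, vertices), vertices being a set of (x, y).
    Returns {zcta_code: merged vertex set}.
    """
    dissolved = defaultdict(set)
    for code, vertices in fragments:
        if code.endswith("HH"):      # step 1: water feature code, e.g. "130HH"
            continue
        dissolved[code] |= vertices  # step 2: merge polygons sharing a code
    return dict(dissolved)

def reaggregate(units, cases_by_code):
    """Step 3: attach each count exactly once to its dissolved unit."""
    return {code: cases_by_code.get(code, 0) for code in units}

# Hypothetical base file: ZCTA 13308 is split in two by a water polygon.
fragments = [
    ("13308", {(0, 0), (1, 0), (1, 1)}),
    ("130HH", {(1, 0), (1, 1), (2, 0), (2, 1)}),
    ("13308", {(2, 0), (2, 1), (3, 1)}),
    ("13421", {(3, 1), (4, 1)}),
]
cases = {"13308": 12, "13421": 7}  # invented counts

# Joining counts to the raw file repeats the 13308 count on both fragments.
naive_total = sum(cases.get(code, 0) for code, _ in fragments)

units = rectify(fragments)
rectified = reaggregate(units, cases)
rectified_total = sum(rectified.values())
```

In this toy file the uncorrected join reports 31 cases where only 19 exist, which is the double counting that the dissolve and reaggregation remove. The adjacency matrix of step 4 would then be rebuilt from `units` rather than from `fragments`.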
After correcting for the water polygons, the ZCTA and ZIP code area boundary files are in nearly complete correspondence. For the analysis that follows, ZIP code based prostate cancer incidence data was obtained from the New York State Department of Health (NYSDOH). As discussed in the methodology section, data for some ZIP code areas were aggregated in this particular dataset. In an attempt to accurately represent this data, both the New York ZIP code area and ZCTA geographies used in this analysis were subject to similar aggregation of areas where necessary. Given this aggregation, the GDT ZIP code areas, subsequently modified to meet confidentiality requirements by the NYSDOH, numbered 1,384 while the topologically adjusted ZCTA file now includes 1,389 areas, yielding a difference of only 5 polygons. This small difference can be attributed to five partitions of land with no five-digit ZIP codes, areas maintained by the Census Bureau in the ZCTA file (i.e. XX codes).
In summary, ZIP code areas and ZCTAs are not directly comparable units of observation. In addition to displaying significant differences in size and extent, there is a major disconnect in the way these units are generated. These differences stem from the fact that ZIP codes are based on address ranges developed for mail delivery, and their representation as polygons does not accurately portray all of the linear features in a ZIP code. Given the methods by which these areal units are generated, there are many instances where ZIP ranges are misclassified by ZIP code areas and ZCTAs. Our research also suggests that ZCTAs present some challenges that analysts must address, particularly in their spatial representation. As noted previously, Census blocks are used for building ZCTA boundaries. In addition to the errors introduced by representing linear features with polygons, each block is assigned a single ZCTA code. While this is good for looking at census data, if there is overlap or underlap between ZIP code segments, the ZCTA zoning scheme is unable to accurately portray these differences. Further, the incorporation of water features and uninhabited areas into the ZCTA geographic base file can also complicate spatial analysis.
In conclusion, the problem of spatiotemporal mismatch is significant for ZIP codes and ZCTAs. Caution must be used when attempting to compare statistical results across both time and space when these units are used. More importantly, analysts must also weigh the cost/time benefits of rectifying ZCTA topology for conducting epidemiological analysis. While this certainly involves more work and GIS processing time, the benefits of these modifications are significant.
Methods
Observed values of prostate cancer incidence were retrieved from the New York State Cancer Registry. ZIP code boundaries were created by Geographic Data Technology for the year 2000 and subsequently modified by the NYSDOH. These modifications include the following:
1. Some adjacent ZIP codes were combined due to confidentiality requirements because an insufficient number of cases of prostate cancer were reported.
2. A subset of residential point ZIP codes with no defined delivery area and ZIPs too small to be included in the GDT file were also combined with adjacent ZIP code areas.
3. NYSDOH also eliminated uninhabited islands from the ZIP code area file.
ZCTA boundaries were delineated by the U.S. Census Bureau for the year 2000. The street network used for calculating CZU_i was based on TIGER 2000 data.
The coefficient of ZIP code uncertainty is calculated as follows:
\[ CZU_i = \frac{x_i / y_i}{\sum_i x_i / \sum_i y_i} \]
where the sums run over all ZIP codes in the larger spatial unit, and
x_i = the number of non-native ZIP code street segments in ZIP code i
y_i = the number of street segments in ZIP code i
As mentioned previously, CZU_i measures the local concentration of non-native street segments within a ZIP code area relative to the number of non-native segments for a larger spatial unit (e.g. a metropolitan area or a state). Segments with no ZIP codes were not included in this computation given that there is no way of telling whether or not they actually contained an address and which ZIP it was attributed to. It is also important to remember that CZU_i says nothing about the length of these street segments. However, with a slight adjustment to both the numerator and denominator, the magnitude of uncertainty, as measured by the distance associated with each non-native street segment, could be quantified.
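A small worked example may help with interpretation. The sketch below assumes that the coefficient takes a location-quotient form, that is, the share of non-native segments in ZIP code i divided by the same share for the larger spatial unit. The function name `czu` and the segment counts are hypothetical.

```python
def czu(non_native, total):
    """Coefficient of ZIP code uncertainty, assuming a location-quotient form.

    non_native: {zip_code: number of non-native street segments (x_i)}
    total:      {zip_code: number of street segments (y_i)}
    Every ZIP code in `total` is assumed to contain at least one segment.
    """
    # Share of non-native segments across the larger spatial unit
    region_share = sum(non_native.values()) / sum(total.values())
    return {z: (non_native[z] / total[z]) / region_share for z in total}

# Invented counts for three ZIP code areas of 100 segments each
non_native = {"A": 5, "B": 15, "C": 25}
total = {"A": 100, "B": 100, "C": 100}
scores = czu(non_native, total)
```

Across the three areas 15 percent of segments are non-native, so area B sits at the average (a value of 1), area A falls below it (about 0.33) and area C above it (about 1.67). Replacing the counts with summed segment lengths would give the distance-weighted variant mentioned above.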
ZIP code and ZCTA contiguity measurements were quantified through the use of a spatial weights matrix, W. Elements of W are specified as c_ij, where c_ij = 1 if i and j share a common boundary or vertex, and 0 otherwise. For the purposes of this study, first order properties include only those vertices and boundaries that are contiguous to the observation (ZIP code or ZCTA) in question (viz. a queen's contiguity matrix). While there are alternatives to this spatial weights matrix (e.g. rook, or distance based), the selection of a queen's based measure provided an effective approach for highlighting the topological complexities of the ZCTA geographic base layer. A more robust contiguity matrix, using other spatial lags or polygon boundary lengths, would be appropriate for a formal analysis of cancer incidence and clustering.
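A first-order queen's contiguity matrix can be illustrated as follows. This is a simplified sketch, not the software used in the study: it assumes a topologically clean file in which adjoining polygons share identical vertex coordinates, and the function name `queen_weights` and the toy geometry are hypothetical.

```python
def queen_weights(polygons):
    """Build a binary queen contiguity matrix.

    polygons: {unit_id: list of (x, y) vertices}
    Returns (ids, W) where W[i][j] = 1 if units i and j share a vertex.
    """
    ids = sorted(polygons)
    vertex_sets = {k: set(polygons[k]) for k in ids}
    W = [
        [1 if a != b and vertex_sets[a] & vertex_sets[b] else 0 for b in ids]
        for a in ids
    ]
    return ids, W

# Four unit squares meeting at (1, 1), plus one isolated square E
polygons = {
    "A": [(0, 0), (1, 0), (1, 1), (0, 1)],
    "B": [(1, 0), (2, 0), (2, 1), (1, 1)],
    "C": [(0, 1), (1, 1), (1, 2), (0, 2)],
    "D": [(1, 1), (2, 1), (2, 2), (1, 2)],
    "E": [(5, 5), (6, 5), (6, 6), (5, 6)],
}
ids, W = queen_weights(polygons)
```

Units A and D touch only at the corner (1, 1), so they are neighbors under the queen criterion but would not be under a rook criterion, which requires a shared edge. Unit E has no neighbors, which is the situation a fragment isolated by a water polygon can end up in.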
The statistical analysis of local spatial association was conducted by using a local Moran's I test statistic. The local Moran's I is defined as:
\[ I_i = z_i \sum_j w_{ij} z_j \]
where
x_i and x_j are observations for locations i and j (with mean μ),
z_i = (x_i - μ),
z_j = (x_j - μ), and
w_ij = an element of the spatial weights matrix, with values of 0 or 1.
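The statistic can be computed directly from these definitions. The sketch below assumes the unstandardized form in which the deviation at location i is multiplied by the weighted sum of deviations at its neighbors; many software packages additionally scale by the variance and assess significance by permutation, which is not shown. The function name `local_moran` and the data are hypothetical.

```python
def local_moran(x, W):
    """Unstandardized local Moran's I for each location.

    x: list of observations
    W: binary spatial weights matrix as a list of lists (w_ij in {0, 1})
    """
    n = len(x)
    mu = sum(x) / n
    z = [v - mu for v in x]  # deviations from the mean
    return [z[i] * sum(W[i][j] * z[j] for j in range(n)) for i in range(n)]

# Four areas arranged in a chain: 0-1, 1-2, 2-3 are neighbors
x = [10, 12, 2, 4]
W = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
I = local_moran(x, W)
```

With a mean of 7, the deviations are 3, 5, -5 and -3. The two end areas have neighbors deviating in the same direction and receive positive values (15), while the two middle areas each border one high and one low neighbor and receive negative values (-10). Changing W, for example by splitting or merging polygons, changes these values even when the observations are unchanged.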
1. Jacquez GM: Current practices in the spatial analysis of cancer: flies in the ointment. International Journal of Health Geographics. 2004, 3 (22).
2. Jacquez GM, Greiling DA: Local clustering in breast, lung and colorectal cancer in Long Island, New York. International Journal of Health Geographics. 2003, 2 (3).
3. Boscoe FP, Ward MH, Reynolds P: Current practices in spatial analysis of cancer data: data characteristics and data sources for geographic studies of cancer. International Journal of Health Geographics. 2004, 3 (28).
4. Miller HJ, Wentz EA: Representation and spatial analysis in geographic information systems. Annals of the Association of American Geographers. 2003, 93: 574-594. 10.1111/1467-8306.9303004.
5. Johnson GD: Small area mapping of prostate cancer incidence in New York State (USA) using fully Bayesian hierarchical modeling. International Journal of Health Geographics. 2004, 3 (29).
6. Openshaw S: The modifiable areal unit problem. Concepts and techniques in modern geography. 1984, Norwich: Geobooks, 38.
7. Han D, Carrow SS, Rogerson PA, Munschauer FE: Geographical variation of cerebrovascular disease in New York State: the correlation with income. International Journal of Health Geographics. 2005, 4 (25).
8. Krieger N, Waterman P, Chen JT, Soobader MJ, Subramanian SV, Carson R: ZIP code caveat: bias due to spatiotemporal mismatches between ZIP codes and US census-defined geographic areas – the Public Health Disparities Geocoding Project. Am J Public Health. 2002, 92: 1100-1102.
9. Wang F: Spatial clusters of cancers in Illinois 1986–2000. J Med Syst. 2004, 28 (3): 237-56. 10.1023/B:JOMS.0000032842.78643.38.
10. Cook WH, Grala K, Wallis RC: Avian GIS models to signal human risk for West Nile virus in Mississippi. International Journal of Health Geographics. 2006, 5 (36).
11. Acevedo GD: ZIP code-level risk factors for tuberculosis: neighborhood environment and residential segregation in New Jersey, 1985–1992. Am J Public Health. 2001, 91 (5): 734-741.
12. Luo W, Wang F: Measures of spatial accessibility to healthcare in a GIS environment: Synthesis and a case study in Chicago region. Env Plan B. 2003, 30 (6): 865-884. 10.1068/b29120.
13. Dohn MN, White ML, Vigdorth EM, Ralph Buncher C, Hertzberg VS, Baughman RP, George Smulian A, Walzer PD: Geographic clustering of Pneumocystis carinii pneumonia in patients with HIV infection. Am J Respir Crit Care Med. 162 (5): 1617-1621.
14. ZIP Code Frequently Asked Questions. [http://www.usps.com/ncsc/ziplookup/zipcodefaqs.htm]
15. Grubesic TH: ZIP codes and spatial analysis: Problems and prospects. Socio-Economic Planning Sciences.
16. ZIP code tabulation areas (ZCTA) frequently asked questions. [http://www.census.gov/geo/ZCTA/zctafaq.html]
17. Census 2000 ZIP code tabulation areas technical documentation. [http://www.census.gov/geo/ZCTA/zcta_tech_doc.pdf]
18. Cova TJ, Church RL: Contiguity constraints for single-region site search problems. Geographical Analysis. 2000, 32 (4): 306-329.
19. Geographic Data Technology/TeleAtlas: ZIP code boundary files (year 2000). Lebanon. 2001.
20. Krieger N, Waterman P, Lemieux K, Zierler S, Hogan JW: On the wrong side of the tracts? Evaluating accuracy of geocoding for public health research. Am J Public Health. 2001, 91: 1114-1116.
21. Ratcliffe JH: On the accuracy of TIGER type geocoded address data in relation to cadastral and census areal units. International Journal of Geographical Information Science. 2001, 15 (5): 473-485. 10.1080/13658810110047221.
22. Grubesic TH, Murray AT: Assessing the locational uncertainties of geocoded data. Proceedings from the 24th Urban Data Management Symposium. Chioggia. 27–29 October 2004.
23. Caliper Corporation. [http://www.caliper.com]
24. Geographic Data Technology/TeleAtlas: Ohio ZIP Code areas. [http://www.co.warren.oh.us/warrengis/metadata/ohZIP.htm]
25. U.S. Census Bureau: Master Address File (MAF) Basics. [http://www.census.gov/geo/mod/maf_basics.pdf]
26. New York State Department of Health (NYSDOH): New York State Cancer Registry. [http://www.health.state.ny.us/statistics/cancer/registry/nyscr.htm]
27. Moonan PK, Bayona M, Quitagua TN, Oppong J, Dunbar D, Jost KC, Burgess G, Singh KP, Weis SE: Using GIS technology to identify areas of tuberculosis transmission and incidence. International Journal of Health Geographics. 2004, 3 (23).
28. Anselin L: Local Indicators of Spatial Association – LISA. Geographical Analysis. 1995, 27 (2): 93-115.
29. Anselin L, Syabri I, Kho Y: GeoDa: An Introduction to Spatial Data Analysis. Geographical Analysis. 2006, 38 (1): 5-22. 10.1111/j.0016-7363.2005.00671.x.
30. McLaughlin CC, Boscoe FP: Effects of randomization methods on statistical inference in disease cluster detection. Health and Place. 2007, 13 (1): 152-163. 10.1016/j.healthplace.2005.11.003.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.