A comparison of spatial clustering and cluster detection techniques for childhood leukemia incidence in Ohio, 1996 – 2003
© Wheeler. 2007
Received: 16 January 2007
Accepted: 27 March 2007
Published: 27 March 2007
Spatial cluster detection is an important tool in cancer surveillance to identify areas of elevated risk and to generate hypotheses about cancer etiology. There are many cluster detection methods used in spatial epidemiology to investigate suspicious groupings of cancer occurrences in regional count data and case-control data, where controls are sampled from the at-risk population. Numerous studies in the literature have focused on childhood leukemia because of its relatively large incidence among children compared with other malignant diseases and substantial public concern over elevated leukemia incidence. The main focus of this paper is an analysis of the spatial distribution of leukemia incidence among children from 0 to 14 years of age in Ohio from 1996–2003 using individual case data from the Ohio Cancer Incidence Surveillance System (OCISS).
Specifically, we explore whether there is statistically significant global clustering and whether there are statistically significant local clusters of individual leukemia cases in Ohio using numerous published methods of spatial cluster detection, including spatial point process summary methods, a nearest neighbor method, and a local rate scanning method. We use the K function, Cuzick and Edwards' method, and the kernel intensity function to test for significant global clustering and the kernel intensity function and Kulldorff's spatial scan statistic in SaTScan to test for significant local clusters.
We found some evidence, although inconclusive, of significant local clusters in childhood leukemia in Ohio, but no significant overall clustering. The findings from the local cluster detection analyses are not consistent for the different cluster detection techniques, where the spatial scan method in SaTScan does not find statistically significant local clusters, while the kernel intensity function method suggests statistically significant clusters in areas of central, southern, and eastern Ohio. The findings are consistent for the different tests of global clustering, where no significant clustering is demonstrated with any of the techniques when all age cases are considered together.
This comparative study for childhood leukemia clustering and clusters in Ohio revealed several research issues in practical spatial cluster detection. Among them, flexibility in cluster shape detection should be an issue for consideration.
Spatial cluster detection is an important tool in cancer surveillance to identify areas of elevated risk and to generate subsequent hypotheses about cancer etiology. A spatial disease cluster may be defined as an area with an unusually elevated disease incidence rate [1, 2]. There are several cluster detection methods used in spatial epidemiology to investigate apparently suspicious groupings of cancer occurrences in both regional count data and case-control data, where the controls are often sampled from the at-risk population and are used to estimate local relative risk or local rates, depending on the method utilized. Numerous studies [3, 4] in the literature have focused on childhood leukemia because of its relatively large incidence among children compared with other malignant diseases, its apparent tendency to cluster, and the substantial public concern over locally elevated leukemia incidence. Many cluster-inducing factors have been considered in the literature on leukemia, including infectious agents and population mixing [6, 7], environmental pollution, such as benzene, pesticides, and radiation, and geographic variation in other risk factors, such as inherited genetic risk, maternal alcohol consumption and cigarette smoking, and socioeconomic status. There are many studies of potential cancer clusters in the literature, and the reader is referred to two useful reviews [15, 16].
In this paper, we present an empirical analysis of the spatial distribution of leukemia incidence among children from 0 to 14 years of age in Ohio from 1996–2003 using individual case data from the Ohio Cancer Incidence Surveillance System (OCISS) in response to public concern about potentially elevated cancer risk among children in areas of Ohio. There has been no previous comprehensive and systematic spatial analysis of potential clustering of childhood leukemia in Ohio. Other studies [7, 17] of potential clusters of childhood leukemia in Ohio do not include spatial analysis methods or individual case data, and instead typically use chi-square tests of differences in expected and observed case counts in census or political units. This approach is not expressly a test for clustering or clusters, but a test of elevated counts inside an often heterogeneously populated area, for example, a county, and the test for one area is considered independently of other areas. This approach does not consider whether areas with significantly more cases than expected are spatially juxtaposed [18, 19]. We choose not to use aggregated case data at the census level because we have access to individual case and control data, want to avoid unstable regional rates caused by small observed case counts and small population counts [20, 21], and want to avoid the modifiable areal unit problem (MAUP) arising from using political boundaries that are arbitrarily related to public health. More specifically, we explore whether there is statistically significant global clustering and whether there are statistically significant local clusters of individual leukemia cases using numerous published methods of spatial cluster detection. We therefore address the questions of whether childhood leukemia cases have a significant tendency to cluster in Ohio and where the most unusual groupings of cases, if any, are located.
The evaluation of the null hypothesis of no significant global spatial clustering of childhood leukemia uses three different methods: the K function, the kernel intensity function, and Cuzick and Edwards' method. See Waller and Jacquez for a discussion of hypotheses in tests for disease clustering. We evaluate the null hypothesis of no local areas of elevated childhood leukemia risk using the kernel intensity function and Kulldorff's scan statistic. The distinction between clustering and cluster detection tests has been made in the literature [1, 19, 23–25], and we follow that distinction in this paper. Clustering and cluster detection tests are viewed as complementary, as they test different hypotheses. A simulation study by Waller et al. indicated that it is possible to have a significant cluster, but no overall significant clustering. In spatial point processes, the first-order property (intensity function) of the process is used for a test of clusters and the second-order property (K function) is used as a test for global clustering.
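To make the second-order test concrete, the following is a minimal sketch of a K function comparison for case-control data. It assumes the common formulation in which the K function of the cases is compared with that of the controls and significance is judged by randomly relabeling cases and controls; the function names, the absence of edge correction, and the single distance `h` are simplifications made for this illustration and are not taken from the analysis reported here.

```python
import math
import random

def k_function(points, h, area):
    """Naive Ripley's K estimate at distance h (no edge correction)."""
    n = len(points)
    if n < 2:
        return 0.0
    pairs = 0  # ordered pairs (i, j), i != j, that lie within distance h
    for i in range(n):
        for j in range(n):
            if i != j and math.dist(points[i], points[j]) <= h:
                pairs += 1
    return area * pairs / (n * (n - 1))

def k_difference(cases, controls, h, area):
    """K of cases minus K of controls; positive values suggest that cases
    are more clustered than the at-risk population at scale h."""
    return k_function(cases, h, area) - k_function(controls, h, area)

def random_labeling_p_value(cases, controls, h, area, n_sim=199, seed=0):
    """One-sided Monte Carlo p-value under random labeling (illustrative)."""
    rng = random.Random(seed)
    observed = k_difference(cases, controls, h, area)
    pooled = list(cases) + list(controls)
    n_cases = len(cases)
    at_least = 0
    for _ in range(n_sim):
        rng.shuffle(pooled)  # reassign case/control labels at random
        simulated = k_difference(pooled[:n_cases], pooled[n_cases:], h, area)
        if simulated >= observed:
            at_least += 1
    return observed, (at_least + 1) / (n_sim + 1)
```

In practice the difference would be examined over a range of distances rather than a single `h`, and the double loop is quadratic in the number of points, which is consistent with the long run times noted for large control samples.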
Our comparison of cluster detection methods is similar in spirit to Griffith's comparison of disease mapping techniques for West Nile Virus, and is motivated by the numerous and diverse analytical options currently available to cancer prevention researchers investigating potential clusters with case-control data. There have been methodological comparison papers in the literature for spatial cluster detection [27–31], but none exclusively for individual level data. The set of methods we compare in this paper includes the leading published methods designed for individual level case data that are currently implemented in publicly available software. We use R software to implement the K function and kernel intensity function, ClusterSeer software for Cuzick and Edwards' method, and SaTScan for Kulldorff's scan statistic. The reader interested in a comparison of general functionality of free software that may be used for cluster analysis is referred to a review by Anselin, although not all features compared in the review are expressly for individual case data. We next briefly review each of the clustering and cluster detection techniques and then present and compare the findings from them.
Kernel intensity function
Cuzick and Edwards' method
[Table: Results of the Cuzick and Edwards' test (Monte Carlo P-value)]
We next applied Cuzick and Edwards' method to subsets of the case data, using three sets for ages 0–4, 5–9, and 10–14 and one for ALL type cases. In the interest of space, we report only the summary of each subset analysis. There was no overall significant clustering or significant clustering at any level of k for cases age 0–4. There was significant clustering for cases age 5–9 with k = 7 (p-value = 0.04), but no overall significant clustering. There was no overall significant clustering or significant clustering at any level of k for cases age 10–14. There was significant clustering for cases of type ALL with all ages with k = 6 (p-value = 0.048), but no overall significant clustering. The results suggest some clustering at six or seven nearest neighbors, depending on the subset of cases, but no overall clustering, regardless of the set of cases. The relevance of nearest neighborhood structures of size six or seven for some leukemia cases is currently unknown, but could be a subject of future inquiry with a credible hypothesis. However, there may not be a factor that can be quantified to explain the significance of this apparent structure.
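The nearest neighbor statistic behind these results can be sketched as follows. For a given k, the statistic counts, over all cases, how many of each case's k nearest neighbors (among cases and controls combined) are themselves cases, and a Monte Carlo p-value is obtained by randomly relabeling the points. This is an illustrative sketch only: the function names are hypothetical, distance ties are broken by index order, and it does not reproduce the ClusterSeer implementation or its combined test across values of k.

```python
import math
import random

def k_nearest_indices(points, k):
    """For each point, the indices of its k nearest other points."""
    neighbors = []
    for i, p in enumerate(points):
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )
        neighbors.append(others[:k])
    return neighbors

def cuzick_edwards_tk(is_case, neighbors):
    """T_k: number of (case, neighbor) pairs in which the neighbor is a case."""
    return sum(
        is_case[j]
        for i, nbrs in enumerate(neighbors)
        if is_case[i]
        for j in nbrs
    )

def monte_carlo_p_value(points, is_case, k, n_sim=199, seed=0):
    """One-sided Monte Carlo p-value for T_k under random labeling."""
    rng = random.Random(seed)
    neighbors = k_nearest_indices(points, k)  # geometry is fixed across simulations
    observed = cuzick_edwards_tk(is_case, neighbors)
    labels = list(is_case)
    at_least = 0
    for _ in range(n_sim):
        rng.shuffle(labels)  # permute case/control labels over fixed locations
        if cuzick_edwards_tk(labels, neighbors) >= observed:
            at_least += 1
    return observed, (at_least + 1) / (n_sim + 1)
```

Because the neighbor structure depends only on the locations, it is computed once and reused for every relabeling; only the labels change between simulations.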
Typically, when public health professionals investigate a potential cluster, they use a much smaller study area than a state, perhaps using the spatial extent of a county or area surrounding a town. To better mimic this type of investigation, and to evaluate the sensitivity of the spatial scan statistic's test for significance to the size of the study area, we next report results from a cluster detection analysis in a spatial subset of the study area. We selected a contiguous set of five counties, Union, Franklin, Delaware, Madison, and Champaign, which contained the most likely SaTScan cluster for cases age 0–14. In practice, a public health analyst would not refine the study area around a previously detected cluster. The most likely cluster found by SaTScan with this subset of data is the same 43 cases in the most likely cluster with all of the Ohio data, but the p-value is now 0.71, instead of the value of 0.81 found with the complete dataset. The highlighted subset of counties and most likely cluster are visualized in Figure 9. This raises the point that the size of the study area can affect the result of the significance test in SaTScan. Naturally, the conclusion of no significant cluster in this situation does not change, but it could in some circumstances, with a cluster changing status from insignificant to significant depending on how the analyst defines the study area. We make note of this more as a practical issue for consideration than as a criticism of SaTScan. Since the study area provides the context for interpretation in the investigation of the question of whether cases cluster in an area, the question of interest changes if the study area is changed. The relationship between study area size and the research question considered in a cluster detection study has also been discussed by Jacquez and Greiling.
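The dependence on the study area can be illustrated with the log likelihood ratio of a single candidate window. The sketch below assumes the Bernoulli model of the spatial scan statistic, which is the usual choice for case-control data; the function names are hypothetical and the counts in the usage note are invented for illustration, not taken from the Ohio data.

```python
import math

def _xlogy(x, y):
    """x * log(y), with the convention that the term is 0 when x == 0."""
    return 0.0 if x == 0 else x * math.log(y)

def bernoulli_log_likelihood(cases, total):
    """Maximized Bernoulli log likelihood for `cases` among `total` points."""
    p = cases / total
    return _xlogy(cases, p) + _xlogy(total - cases, 1 - p)

def bernoulli_scan_llr(c_in, n_in, c_all, n_all):
    """Log likelihood ratio for one window under the Bernoulli model.

    c_in, n_in   : cases and total points (cases + controls) inside the window
    c_all, n_all : cases and total points in the whole study area
    Returns 0 unless the case proportion inside exceeds that outside.
    """
    c_out, n_out = c_all - c_in, n_all - n_in
    if n_in == 0 or n_out == 0 or c_in / n_in <= c_out / n_out:
        return 0.0
    alternative = (bernoulli_log_likelihood(c_in, n_in)
                   + bernoulli_log_likelihood(c_out, n_out))
    null = bernoulli_log_likelihood(c_all, n_all)
    return alternative - null
```

For example, a hypothetical window holding 40 cases among 300 points gives a different value with `bernoulli_scan_llr(40, 300, 1000, 12000)` than with `bernoulli_scan_llr(40, 300, 150, 1500)`, because the comparison population outside the window changes. The p-value additionally depends on the Monte Carlo distribution of the maximum ratio over all windows, which also changes with the study area.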
The three methods used to detect global clustering, the K function, the kernel intensity function ratio summary, and Cuzick and Edwards' method, all found no statistically significant clustering of childhood (age 0–14) leukemia in Ohio from 1996–2003. Cuzick and Edwards' method also found no significant clustering in three separate age groups of cases and ALL type cases. These findings are not entirely surprising given the large and diverse study area of Ohio, in which it is doubtful that one particular risk factor would have a consistent or sustained effect across space that would result in clustering demonstrated at the state scale. It is more likely that factors which could explain clustering of cases would have local or regional influence, and one factor could be associated with clustering in one area while another factor could be related to clustering in a different area. Given the scale of the study area in this analysis, the search for local cancer clusters is the more useful investigation, and also the one with more public interest. In the investigation of potential clusters, there were inconsistent findings from the two methods used to detect clusters. The kernel intensity function ratio suggested some significant local clusters in cases age 0–14 in portions of central and eastern Ohio, while the spatial scan statistic in SaTScan found no significant clusters. SaTScan also found no significant clusters for three different age groups and ALL type cases. Some reassurance comes from the fact that some of the most likely SaTScan clusters are in the same areas as the significant elevated log relative risk areas from the kernel intensity function ratios. Still, the cancer cluster investigator is left to wonder which results are more trustworthy in this circumstance. Unfortunately, without a well-designed simulation study that reflects the current study situation and where the true clusters are known, one cannot definitively reach a conclusion on this matter.
A simulation study that tests for different types of clusters is left for future research.
One practical reason to favor the kernel intensity function method is that it tests for local clusters and explicitly uses a summary measure of the local results to test for global clustering; it is unique in this regard. Another advantage of the kernel intensity function method is that it provides the log relative risk surface over the entire study area, so one can visualize the local peaks and valleys in the risk of disease. In addition, the kernel is more flexible in its shape than SaTScan's circular scanning window. There have been advancements in the literature, however, with scan statistics designed to detect elliptical clusters as well as more flexibly shaped clusters. An arbitrarily shaped non-scanning method based on minimum spanning trees has also been recently introduced. A disadvantage of the kernel intensity ratio is that one must select the bandwidth in advance of calculating the log relative risk, and results can certainly vary depending on the selected bandwidth. One possibility to overcome this may be to use a Bayesian framework for kernel intensity estimation, where the kernel bandwidth would be estimated from the data while simultaneously calculating the log relative risks.
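The sensitivity to bandwidth can be seen in a minimal sketch of the log relative risk at a single location, computed as the log ratio of kernel density estimates for cases and controls. The Gaussian kernel, the common bandwidth for cases and controls, and the function names are assumptions made for this illustration; the sketch includes no edge correction and no Monte Carlo tolerance limits.

```python
import math

def gaussian_kernel_density(location, points, bandwidth):
    """Bivariate Gaussian kernel density estimate at `location`."""
    total = 0.0
    for p in points:
        d2 = (location[0] - p[0]) ** 2 + (location[1] - p[1]) ** 2
        total += math.exp(-d2 / (2.0 * bandwidth ** 2))
    return total / (len(points) * 2.0 * math.pi * bandwidth ** 2)

def log_relative_risk(location, cases, controls, bandwidth):
    """Log ratio of case density to control density at `location`.

    Raises an error if either density underflows to zero, which can happen
    far from all points when the bandwidth is very small.
    """
    case_density = gaussian_kernel_density(location, cases, bandwidth)
    control_density = gaussian_kernel_density(location, controls, bandwidth)
    return math.log(case_density / control_density)
```

Evaluating `log_relative_risk` at the same location with a small and a large bandwidth will generally give different values: a small bandwidth emphasizes local excesses of cases, while a large bandwidth smooths them toward the study-wide average, which is why the choice must be made, and reported, before interpreting the surface.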
Numerous practical issues with spatial case-control cluster detection were encountered in this study. First, the selection of controls is crucial in these case-control spatial clustering studies. We found a traditional epidemiology ratio of 3 to 1 to be inadequate with our systematic sampling scheme, and believe that would be true with a purely random sampling scheme as well. We tentatively recommend using as many controls as possible, taking into consideration the cost of acquiring them and of computing, as some methods such as the K function and SaTScan can take substantial run time with a large number of points in the study. More research is needed to determine, if possible, an optimal number of controls and sampling scheme. In this study, we also realized the importance of avoiding unnecessary spatial error when possible, in terms of geocoding and map units. Of course, there is inherent locational uncertainty in these data. Invariably, in the address matching process of individual records there will be observations for which an exact address match is not possible. These records can be geocoded to census boundary or ZIP Code centroids or omitted from the study, where the decision on the handling of these records could depend on the study area scale. For a large study area, using census tract or ZIP Code centroid matches may be deemed acceptable in searching for an approximate cluster location, whereas county centroids may be viewed as providing spatial locations that are too inaccurate. We omitted centroid-matched points after checking visually that they were not spatially influential, i.e. occurring in one area only or exclusively in rural areas, to avoid inducing artificial clustering in cases or controls. We also used UTM map coordinates to prevent adding spatial error to our Euclidean distance calculations. An alternative would be to use great circle distance calculation for records in latitude and longitude coordinates.
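The great circle alternative mentioned above can be sketched with the haversine formula. This is an illustrative sketch that assumes a spherical Earth with a mean radius of about 6371 km; the function name is hypothetical, and the approach is not the one used in this study, which relied on projected UTM coordinates.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; spherical approximation

def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine great circle distance in km between two points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2.0) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2.0) ** 2)
    # min() guards against rounding slightly above 1 for near-antipodal points
    return 2.0 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))

# At roughly 40 degrees north, one degree of latitude and one degree of
# longitude cover clearly different ground distances.
one_degree_latitude = great_circle_km(40.0, -83.0, 41.0, -83.0)
one_degree_longitude = great_circle_km(40.0, -83.0, 40.0, -82.0)
```

The two example distances differ by roughly a quarter, which shows why treating raw latitude and longitude as planar coordinates would distort Euclidean distances at Ohio's latitude and, with them, any nearest neighbor or kernel calculation.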
This comparative study for childhood leukemia clustering and clusters in Ohio is the first one with individual level case and control data. The study produced results that lead to different conclusions based on the method utilized regarding the significance of clusters and also revealed several open research issues in practical spatial cluster detection. In summary, we found some evidence, although inconclusive, of significant local clusters in childhood (age 0–14) leukemia in Ohio during years 1996–2003, but no significant overall clustering when considering all case ages simultaneously. The spatial scan statistic in SaTScan found no significant clusters, while the kernel intensity function ratio found clusters, some of irregular shape, in areas of central, southern, and eastern Ohio. It should be pointed out that different methods used to test for clustering look for different types of clusters, and one method may not find a cluster while another method does, and both may be correct depending on the underlying true cluster. Consideration of the potential shape of clusters in the study area appears to be an important issue. In considering future work with these data, a subsequent study should test for spatial clusters in ALL type cases by age groups based on the finding of Dockerty and his coauthors of significant clustering using Cuzick and Edwards' method in age subgroups of ALL cases, but not in ALL cases age 0–14. Additional future work could systematically investigate the sensitivity of the results from the methods selected to the ratio of controls to cases, to different sizes of the study area, and to different control sampling schemes, such as simple random, stratified, or probability proportional to size cluster sampling. A potentially interesting and relevant future comparison would be between the results presented here and those from methods for regional count data at the county level.
There is additional effort involved in spatial case-control cluster studies compared to regional count cluster studies, and it would be worthwhile to see if the additional data needs and computational cost result in substantially increased power to detect clusters.
Cancer incidence data used in this study were obtained from the Ohio Cancer Incidence Surveillance System, Ohio Department of Health (ODH), a registry participating in the National Program of Cancer Registries of the Centers for Disease Control and Prevention (CDC). Use of these data does not imply ODH or CDC either agrees or disagrees with any presentations, analyses, interpretations or conclusions. Information about the OCISS can be obtained at .
The author thanks Holly Engelhardt and Robert Indian of the Ohio Department of Health for providing case data and John Paulson from the Ohio Vital Statistics Department for providing control data. The author acknowledges assistance from James Fisher and Mario Davidson of the Arthur G. James Cancer Hospital at The Ohio State University with data processing of the cases. The author also thanks Lance Waller for sharing R code for the K function and kernel intensity estimation and for helpful comments on an earlier draft that led to improvement of this paper.
- Waller LA, Hill EG, Rudd RA: The geography of power: statistical performance of tests of clusters and clustering in heterogeneous populations. Statistics in Medicine 2006, 25:853–865.
- Rothman KJ: A sobering start for the cluster busters' conference. American Journal of Epidemiology 1990, 132(Supplement):S6-S13.
- Dockerty JD, Sharples KJ, Borman B: An assessment of spatial clustering of leukaemias and lymphomas among young people in New Zealand. Journal of Epidemiology and Community Health 1999, 53:154–158.
- Alexander F: Viruses, clusters, and clustering of childhood leukaemia: a new perspective? European Journal of Cancer 1993, 29A:1424–43.
- Heath CW, Hasterlick RJ: Leukaemia amongst children in a suburban community. American Journal of Medicine 1963, 34:796–812.
- Kinlen LJ: Childhood cancer and population mixing. American Journal of Epidemiology 2004, 159:716.
- Clark BR, Ferketich AK, Fisher JL, Harris RE, Wilkins JR: Childhood leukemia and population mixing in Ohio. Pediatric Blood & Cancer, in press
- Lagakos SW, Wessen BJ, Zelen M: An analysis of contaminated well water and health effects in Woburn, Massachusetts. Journal of the American Statistical Association 1986, 81:583–96.
- Duarte-Davidson R, Courage C, Rushton L, Levy L: Benzene in the environment: an assessment of the potential risks to the health of the population. Occupational and Environmental Medicine 2001, 58:2–13.
- Fasal E, Jackson EW, Klauber MR: Leukemia and lymphoma mortality and farm residence. American Journal of Epidemiology 1968, 87:267–274.
- Draper GJ, Stiller CA, Cartwright RA, Craft AW, Vincent TJ: Cancer in Cumbria and in the vicinity of the Sellafield nuclear installation, 1963–90. British Medical Journal 1993, 306:89–94.
- Schwartz SO, Greenspan I, Brown ER: Leukaemia cluster in Niles Ill: immunologic data on families of leukemic patients and others. Journal of the American Medical Association 1963, 186:106–8.
- Korte JE, Hertz-Picciotto I, Shulz MR, Ball LM, Duell EJ: The contribution of benzene to smoking-induced leukemia. Environmental Health Perspectives 2000, 108(4):333–339.
- Poole C, Greenland S, Luetters C, Kelsey JL, Mezei G: Socioeconomic status and childhood leukaemia: a review. International Journal of Epidemiology 2006, 35:370–384.
- Alexander FE, Boyle P: Do cancers cluster? Spatial Epidemiology: Methods and Applications (Edited by: Elliot P, Wakefield JC, Best NG, Briggs DJ). New York: Oxford University Press 2000, 302–316.
- Bithell JF, Vincent TJ: Geographical variations in childhood leukaemia incidence. Spatial Epidemiology: Methods and Applications (Edited by: Elliot P, Wakefield JC, Best NG, Briggs DJ). New York: Oxford University Press 2000, 317–332.
- Community Health Assessments Section; BHSIOS-Prevention; Ohio Department of Health: Case review of leukemia among residents of Marion County, Ohio, 1992-and graduates of River Valley High School, 1963–2000. Columbus, Ohio 2001.
- Rogerson PA: The detection of clusters using a spatial version of the chi-square goodness-of-fit statistic. Geographical Analysis 1999, 31(1):130–147.
- Waller LA, Gotway CA: Applied Spatial Statistics for Public Health Data. New York: John Wiley 2004.
- Gatrell AC: Geographies of Health: An Introduction. Oxford: Blackwell 2002.
- Devine OJ, Louis TA, Halloran ME: Identifying areas with elevated disease incidence rates using empirical Bayes estimators. Geographical Analysis 1996, 28(3):187–199.
- Waller LA, Jacquez GM: Disease models implicit in statistical tests of disease clustering. Epidemiology 1995, 6(6):584–590.
- Besag J, Newell J: The detection of clusters in rare diseases. Journal of the Royal Statistical Society, Series A 1991, 154:143–155.
- Gangnon RE: Impact of prior choice on local Bayes factors for cluster detection. Statistics in Medicine 2006, 25:883–895.
- Lawson AB: Disease cluster detection: a critique and a Bayesian proposal. Statistics in Medicine 2006, 25:897–916.
- Griffith DA: A comparison of six analytical disease mapping techniques as applied to West Nile Virus in the coterminous United States. International Journal of Health Geographics 2005, 4:18.
- Fotheringham AS, Zhan FB: A comparison of three exploratory methods for cluster detection in spatial point patterns. Geographical Analysis 1996, 28(3):200–218.
- Ozdenerol E, Williams BL, Kang SY, Magsumbol MS: Comparison of spatial scan statistic and spatial filtering in estimating low birth weight clusters. International Journal of Health Geographics 2005, 4:19.
- Aamodt G, Samuelsen SO, Skrondal A: A simulation study of three methods for detecting disease clusters. International Journal of Health Geographics 2006, 5:15.
- Song C, Kulldorff M: Power evaluation of disease clustering tests. International Journal of Health Geographics 2003, 2:9.
- Kulldorff M, Song C, Gregorio D, Samociuk H, DeChello L: Cancer map patterns: are they random or not? American Journal of Preventive Medicine 2006, 30(2S):S37-S49.
- R [http://www.r-project.org/]
- TerraSeer, Inc ClusterSeer Users Guide 2 2002.
- Kulldorff M: SaTScan User Guide v7.0 2006. [http://www.satscan.org/]
- Anselin L: Review of cluster analysis software. North American Association of Central Cancer Registries Springfield, IL 2004.
- ESRI: ArcGIS 9.1 Users Guide 2005.
- Surveillance, Epidemiology, and End Results (SEER) Program [http://www.seer.cancer.gov]
- Ohio Cancer Incidence Surveillance System Advisory Board Report to the Ohio General Assembly House and Senate Finance Committees 2002.
- Cuzick J, Edwards R: Spatial clustering for inhomogeneous populations. Journal of the Royal Statistical Society B 1990, 52(1):73–104.
- Armstrong MP, Rushton G, Zimmerman DL: Geographically masking health data to preserve confidentiality. Statistics in Medicine 1999, 18(5):497–525.
- Ripley BD: Modeling spatial patterns (with discussion). Journal of the Royal Statistical Society, Series B 1977, 39:172–212.
- Diggle PJ: Statistical Analysis of Spatial Point Patterns. London: Academic Press 1983.
- Besag J: Discussion of "Modeling spatial patterns" by B. D. Ripley. Journal of the Royal Statistical Society, Series B 1977, 39:193–195.
- Ripley BD: The second-order analysis of stationary point patterns. Journal of Applied Probability 1976, 13:255–266.
- Kelsall JE, Diggle PJ: Non-parametric estimation of spatial variation in relative risk. Statistics in Medicine 1995, 14:2335–2342.
- Scott DW: Multivariate Density Estimation: Theory, Practice, and Visualization. New York: John Wiley 1992.
- Kulldorff M: A spatial scan statistic. Communications in Statistics: Theory and Methods 1997, 26:1487–1496.
- Kulldorff M: Commentary: geographical distribution of sporadic Creutzfeldt-Jakob disease in France. International Journal of Epidemiology 2002, 31:495–496.
- Huillard d'Aignaux J, Cousens SN, Delasnerie-Lauprête N, Brandel JP, Salomon D, Laplanche JL, Hauw JJ, Alpêrovitch A: Analysis of the geographical distribution of sporadic Creutzfeldt-Jakob disease in France between 1992 and 1998. International Journal of Epidemiology 2002, 31:490–495.
- Jacquez GM, Greiling DA: Local clustering in breast, lung and colorectal cancer in Long Island, New York. International Journal of Health Geographics 2003, 2:3.
- Kulldorff M, Huang L, Pickle L, Duczmal L: An elliptic spatial scan statistic. Statistics in Medicine 2006, 25(22):3929–3943.
- Tango T, Takahashi K: A flexibly shaped spatial scan statistic for detecting clusters. International Journal of Health Geographics 2005, 4:11.
- Assunçao R, Costa M, Tavares A, Ferreira S: Fast detection of arbitrarily shaped disease clusters. Statistics in Medicine 2006, 25:723–742.
- Botella-Rocamora P, López-Quílez A: Intensity estimation of a complex spatial point process by a mixture [abstract]. Valencia 8 Meeting 2006.
- Jacquez GM: Current practices in the spatial analysis of cancer: flies in the ointment. International Journal of Health Geographics 2004, 3:22.
- Ohio Department of Health [http://www.odh.ohio.gov/]
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.