A simulation study for geographic cluster detection analysis on population-based health survey data using spatial scan statistics
International Journal of Health Geographics volume 21, Article number: 11 (2022)
Abstract
Background
In public health and epidemiology, spatial scan statistics can be used to identify spatial cluster patterns of health-related outcomes from population-based health survey data. Although it is appropriate to consider the complex sample design and sampling weights when analyzing complex sample survey data, many studies of spatial cluster detection use the observed survey responses without these considerations.
Methods
We conducted a simulation study to investigate which data type derived from complex survey data is most suitable, comparing the spatial cluster detection results of three approaches: (1) individual-level data, (2) weighted individual-level data, and (3) aggregated data.
Results
The results of the spatial cluster detection varied depending on the data type. To compare the performance of spatial cluster detection, sensitivity and positive predictive value (PPV) were evaluated over 100 iterations. The average sensitivity was high for all three approaches, but the average PPV was higher when using aggregated data than when using individual-level data with or without sampling weights.
Conclusions
Through the simulation study, we found that aggregate-level data are more appropriate than the other data types when searching for spatial clusters using spatial scan statistics on population-based health survey data.
Introduction
It is important to identify geographical disparities in health outcomes related to chronic diseases [1], physical activity [2], behavioral health [3], and mental health [4]. In particular, identifying locations with significantly high- or low-risk health outcomes would be useful for guiding targeted health programs and shaping health policies to reduce health disparities [5]. Health authorities often conduct health surveys of the general population; these survey data can therefore be used to analyze spatial cluster patterns.
Among the various statistical methods for geographic cluster detection, the spatial scan statistic proposed by Kulldorff [6] has been widely used in epidemiologic studies. This method calculates a likelihood ratio test statistic that compares the inside and outside of each scanning window. The areas in the scanning window that maximizes the test statistic are identified as the most likely cluster. Monte Carlo hypothesis testing is typically used to obtain a p-value for testing the statistical significance of the most likely cluster. Spatial scan statistics have been developed for various probability models such as Poisson [6], Bernoulli [6], normal [7, 8], ordinal [9], and multinomial [10]. The spatial scan statistic method based on these models is available through the SaTScan™ software [11]. The method has also been extended to a regression modeling approach with different regression coefficients for cluster detection [12,13,14].
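To make the statistic concrete, the Bernoulli-based likelihood ratio for a candidate window \(Z\) can be sketched as follows. The symbols \(c_{Z}\), \(n_{Z}\), \(C\), \(N\), \(M\), and \(R\) are introduced here only for illustration; the form follows the standard Bernoulli model formulation in [6]. Let \(c_{Z}\) be the number of cases among the \(n_{Z}\) individuals inside \(Z\), and let \(C\) be the number of cases among all \(N\) individuals. When scanning for high rates,
$$\lambda_{Z} = \frac{\left( \frac{c_{Z}}{n_{Z}} \right)^{c_{Z}} \left( 1 - \frac{c_{Z}}{n_{Z}} \right)^{n_{Z} - c_{Z}} \left( \frac{C - c_{Z}}{N - n_{Z}} \right)^{C - c_{Z}} \left( 1 - \frac{C - c_{Z}}{N - n_{Z}} \right)^{(N - n_{Z}) - (C - c_{Z})}}{\left( \frac{C}{N} \right)^{C} \left( 1 - \frac{C}{N} \right)^{N - C}}$$
if \(c_{Z}/n_{Z} > (C - c_{Z})/(N - n_{Z})\), and \(\lambda_{Z} = 1\) otherwise. The test statistic is \(\lambda = \max_{Z} \lambda_{Z}\). With \(M\) Monte Carlo replications generated under the null hypothesis, the p-value is commonly computed as
$$p = \frac{R}{M + 1},$$
where \(R\) is the rank of the observed \(\lambda\) among the \(M + 1\) values (the observed statistic and the \(M\) simulated ones).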
Public health surveillance [15] is conducted to collect, analyze, and interpret health-related data for planning, implementing, and evaluating public health policies. As part of public health surveillance, health-related data are collected through population-based surveys. The data obtained from these ongoing surveys can be used to understand trends in public health [16]. Such health surveys are often based on complex sampling [17] approaches, including several design features such as stratification, cluster sampling, and disproportionate sampling. Sample design features need to be incorporated into the estimation and analysis to generalize the results to the entire population. Therefore, it seems appropriate to consider sample designs and sampling weights when exploring spatial cluster patterns with the spatial scan statistic.
Some studies have conducted geographic cluster detection analysis using spatial scan statistics on population-based health survey data. However, most of these studies utilized observed survey responses, without considering sample designs and sampling weights. Roberson et al. [18] identified spatial clusters of high stroke prevalence using the spatial scan statistic under the discrete Poisson probability model for a population-based health survey (Behavioral Risk Factor Surveillance System). They specified the number of stroke cases in each county, derived from the observed binary responses, as the case variable in the analysis. Kebede et al. [19] conducted a study to identify spatial clusters of high health coverage among women aged 15‒49 years, using the Bernoulli-based spatial scan statistic on a population-based health survey (Ethiopian Demographic and Health Survey). Similarly, they specified the number of health coverage cases observed for a binary response as the case variable in the analysis.
Two approaches are available for utilizing the survey responses observed with binary outcomes. One approach is to use individual-level data as is, observed with binary responses represented by 0 and 1. In this case, spatial cluster detection can be conducted using the Bernoulli-based spatial scan statistic [6]. The other approach is to use aggregate-level data, which summarizes the individual-level data into regional-level rates for each location. The sampling design and weights can be considered when calculating the region-level rates. For this type of data, spatial cluster detection can be conducted using the weighted normal spatial scan statistic [8], which is used to identify clusters with high rates of regional measures (e.g., mortality rate and disease prevalence at the regional level) with a heterogeneous population.
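The aggregation step described above can be illustrated with a minimal sketch. It assumes the regional rate is the weighted mean of the binary responses (sum of weights of cases divided by sum of all weights), which is a common survey estimator but is not stated explicitly in this article; the function name `weighted_rates` and the toy records are hypothetical.

```python
from collections import defaultdict

def weighted_rates(records):
    """Collapse (region, y, weight) records, with y in {0, 1}, into one
    weighted rate per region: sum(w * y) / sum(w)."""
    numerator = defaultdict(float)
    denominator = defaultdict(float)
    for region, y, w in records:
        numerator[region] += w * y
        denominator[region] += w
    return {region: numerator[region] / denominator[region] for region in denominator}

# Hypothetical toy data: three respondents in each of two regions.
records = [
    ("A", 1, 2.0), ("A", 0, 1.0), ("A", 0, 1.0),
    ("B", 1, 1.0), ("B", 1, 3.0), ("B", 0, 6.0),
]
rates = weighted_rates(records)
print(rates)
```

In this toy example the unweighted rates would be 1/3 for region A and 2/3 for region B, so the weights change both the level and the ordering of the regional rates.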
It is unclear which model is appropriate when health survey data come from a complex survey design. We can use individual-level data or data aggregated at the regional level for spatial cluster detection of disease prevalence. To account for the sampling design with individual-level data, the frequencies weighted by the sampling weights can be used as binary data. First, we applied these approaches to the Korea Community Health Survey (KCHS), which is one of several population-based health surveys in South Korea. We identified statistically significant spatial clusters with high rates of male diabetes diagnoses. Having found that the cluster detection results differed considerably depending on the type of data, we conducted a simulation study to examine which of the three approaches is more appropriate, using data sampled from a hypothetical population. Several design features were taken into account when generating the simulation data to mimic real health survey data, such as stratification with different sampling proportions and post-stratification weights. We compared the accuracy of detected clusters in terms of sensitivity and positive predictive value.
The Korea Community Health Survey (KCHS) data
The KCHS has been conducted annually by the Korea Disease Control and Prevention Agency since 2008 to investigate both public health status and health behaviors at community health centers [20]. KCHS data were collected from an average of 900 adults per community health center (“si/gun/gu” or district level). The survey is based on a complex sample design. Survey data with sample weights can be provided upon request at https://chs.kdca.go.kr/chs.
We used the responses to the diabetes diagnosis question in the 2018 KCHS as the outcome to search for geographic clusters with high rates of diabetes prevalence. There were 250 administrative districts in South Korea in 2018, with the exception of two districts located on Jeju Island. Spatial cluster detection analysis was conducted on the (1) individual-level data, (2) weighted individual-level data, and (3) aggregated data. The Bernoulli-based spatial scan statistic was used for the first two data types and the weighted normal spatial scan statistic for the third. We used a circular scanning window and the optimal maximum reported cluster size (MRCS) determined by the Gini coefficient [21], while the maximum scanning window size (MSWS) was fixed at 50%. The study participants were divided into male and female subgroups for analysis. All analyses were conducted using the SaTScan™ software version 9.6. This study only shows the results for men. Figures 1 and 2 show cluster detection results for the three different approaches, with and without age adjustment. Only clusters that were statistically significant at the 0.05 level were reported. Tables 1 and 2 give the number of identified spatial clusters with high diabetes diagnosis rates at the optimal MRCS value.
The detected clusters differed considerably depending on the approach. These results motivated the present study. With or without age adjustment, the weighted normal model applied to the aggregated data found a single significant cluster in the northeast area of South Korea. When dealing with survey data, it is necessary to consider sampling weights for proper inference. One may therefore think that it would be more appropriate to use data weighted by the sampling weights rather than the observed individual data. However, the Bernoulli model identified too many significant clusters in the weighted data, which could be due to the inflated sample size. Using the observed survey data without weights, the clusters detected by the Bernoulli model were similar, to a certain degree, to those based on aggregated data. With age adjustment, only one significant cluster was detected, and its location was similar to that of the cluster detected by the weighted normal model. Without age adjustment, the most likely cluster was similar to that from the weighted normal model; however, another significant cluster was also detected in the southwest area.
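One way to see why an inflated sample size could produce many significant clusters is to look at how the Bernoulli log-likelihood ratio behaves when every count is multiplied by a common factor. The sketch below is illustrative only: the function `bernoulli_llr`, the counts, and the weight factor of 40 are hypothetical, and treating weighted frequencies as if each respondent were replicated is an assumption about how the weighted data enter the model, not a description of the analysis in this study.

```python
import math

def xlogy(x, y):
    """Return x * log(y), with the convention 0 * log(0) = 0."""
    return 0.0 if x == 0 else x * math.log(y)

def bernoulli_llr(c, n, C, N):
    """Log-likelihood ratio for a window with c cases among n individuals,
    given C cases among N individuals overall (high-rate scan only)."""
    if c / n <= (C - c) / (N - n):
        return 0.0

    def loglik(cases, total):
        p = cases / total
        return xlogy(cases, p) + xlogy(total - cases, 1 - p)

    return loglik(c, n) + loglik(C - c, N - n) - loglik(C, N)

# Hypothetical window: 23% prevalence inside versus 20% outside.
c, n, C, N = 207, 900, 1827, 9000
factor = 40  # hypothetical common weight applied to every respondent

observed = bernoulli_llr(c, n, C, N)
inflated = bernoulli_llr(factor * c, factor * n, factor * C, factor * N)
print(observed, inflated, inflated / observed)
```

Because every term of the log-likelihood is a count multiplied by the logarithm of a proportion, scaling all counts by a common factor leaves the proportions unchanged and multiplies the log-likelihood ratio by that factor. A modest difference in prevalence can therefore look far more extreme when weighted frequencies are analyzed as if they were observed counts.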
We expected to discover common geographic patterns regardless of the data type used from the survey data. However, the significant spatial clusters with high rates of diabetes diagnosis varied depending on the type of data. The patterns of spatial cluster detection results were similar when using other health outcomes in the 2018 KCHS data. Thus, we aimed to assess which data type derived from binary survey responses is more appropriate for spatial cluster detection using the spatial scan statistic through a simulation study.
Simulation study
A simulation study was performed to investigate which type of data [individual-level data (frequency and weighted frequency) or aggregate-level data (crude rate estimates)] obtained from a complex sample survey is more appropriate for spatial cluster detection with the spatial scan statistic. First, we generated a hypothetical population dataset based on the administrative districts in South Korea in 2018. The study area consisted of 250 districts. We then drew 100 sample datasets, one per iteration, from the hypothetical population dataset in a manner similar to the KCHS sampling procedure. Finally, we computed the weighted frequency (individual-level data) and crude rate estimates (aggregate-level data) for each sample dataset using SAS software [22] version 9.4, based on the sample design and sampling weights. For each iteration, we applied the Bernoulli-based spatial scan statistic [6] to the two types of individual-level data and the weighted normal spatial scan statistic [8] to the aggregate-level data derived from the simulated sample dataset. Age adjustment was not considered in this simulation study. As in the KCHS analysis, we only identified statistically significant clusters.
Here, we briefly review the sampling procedure of KCHS, which is based on a complex sample design that uses a two-stage stratified cluster sampling procedure. The surveyed population was stratified by the smallest administrative unit (“dong/eup/myeon”) and housing unit (general house/apartment), which were the first and second strata, respectively. In the first stage, a sample area (“tong/ban/ri”), as a primary sampling unit, was selected for each housing unit type within each administrative unit, based on the number of households through probability proportional to size sampling. In the second stage, households were selected through systematic sampling. The detailed sampling procedure is described in a brief report describing the survey [20].
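The two selection mechanisms named above can be sketched in a few lines. This is a generic illustration of probability proportional to size selection followed by systematic sampling, not the KCHS implementation; the function names `pps_select` and `systematic_sample`, the household counts, and the second-stage sample size are all hypothetical.

```python
import random

def pps_select(sizes, rng):
    """Select one unit index with probability proportional to its size."""
    return rng.choices(range(len(sizes)), weights=sizes, k=1)[0]

def systematic_sample(units, k, rng):
    """Select k units at a fixed interval after a random start."""
    step = len(units) / k
    start = rng.random() * step
    # min() guards against a floating-point index equal to len(units).
    return [units[min(int(start + i * step), len(units) - 1)] for i in range(k)]

rng = random.Random(0)
households_per_area = [120, 300, 80]            # hypothetical sample areas
area = pps_select(households_per_area, rng)     # first stage: pick a sample area
households = list(range(households_per_area[area]))
chosen = systematic_sample(households, 5, rng)  # second stage: pick households
print(area, chosen)
```

Larger sample areas are more likely to be chosen in the first stage, while the fixed interval in the second stage spreads the selected households across the whole area.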
Sensitivity and positive predictive value (PPV) were used to evaluate the accuracy of the simulation results. Sensitivity was defined as the proportion of districts belonging to the true cluster that were included in significant clusters. PPV was defined as the proportion of districts included in significant clusters that belonged to the true cluster. The average and standard deviation of sensitivity and PPV over 100 iterations are presented. This simulation study was performed using R software [23] version 4.0.2 with the rsatscan package [24] to iteratively run the SaTScan™ software in the R environment.
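The two accuracy measures can be computed from sets of district identifiers, as in the following sketch. The function name `sensitivity_ppv` and the district identifiers are hypothetical, and returning a PPV of 0 when no significant cluster is detected is an assumption made here; the article does not say how such iterations were handled.

```python
def sensitivity_ppv(true_cluster, detected):
    """Return (sensitivity, PPV) for one iteration.

    true_cluster: districts belonging to the true cluster(s)
    detected:     districts included in any significant detected cluster
    """
    true_cluster, detected = set(true_cluster), set(detected)
    overlap = len(true_cluster & detected)
    sensitivity = overlap / len(true_cluster)
    # Assumption: PPV is set to 0 when nothing is detected.
    ppv = overlap / len(detected) if detected else 0.0
    return sensitivity, ppv

# Hypothetical iteration: an 18-district true cluster, of which 15 districts
# are detected together with 5 districts outside the true cluster.
true_cluster = range(1, 19)
detected = list(range(1, 16)) + [101, 102, 103, 104, 105]
sens, ppv = sensitivity_ppv(true_cluster, detected)
print(round(sens, 3), round(ppv, 3))
```

A method that flags large parts of the study area keeps sensitivity high but lowers PPV, which is why the two measures are reported together.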
Population data generation

(Step1) It was assumed that the population was stratified by age group (20‒34 years, 35‒49 years, 50‒64 years, and 65 years and over) and sex. Stratification by age group and sex was denoted by \(j\) (\(j\) = 1 for males aged 20‒34 years, 2 for males aged 35‒49 years, 3 for males aged 50‒64 years, 4 for males aged 65+ years, 5 for females aged 20‒34 years, 6 for females aged 35‒49 years, 7 for females aged 50‒64 years, and 8 for females aged 65+ years).

(Step2) We defined two true cluster models with different sizes and shapes using a geographical map of South Korea for 2018. The two true cluster models are shown in Fig. 3. The true cluster in Model (A) was composed of 18 districts located in the northeast, including the coastal areas. We assumed two true clusters in Model (B), one cluster identical to Model (A) and another composed of 12 districts located in the central region. The prevalence rate was set to 0.3 for each district belonging to the true clusters and 0.2 for each district not belonging to the true clusters.

(Step3) For each district, we generated binary outcomes for individuals from a binomial distribution with the actual population of South Korea in 2018 and the prevalence rate defined in Step2. Binary outcomes were generated from the binomial distribution \(\text{B}(N_{kj}, p_{kj})\), where \(N_{kj}\) and \(p_{kj}\) denote the actual population and prevalence rate, respectively, for the \(j\)th stratification of the \(k\)th district.
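Steps 1 to 3 can be sketched as follows. The prevalence rates of 0.3 and 0.2 come from Step2; everything else (the function name `generate_population`, four districts instead of 250, and 5000 individuals per stratification instead of the actual 2018 population) is a hypothetical toy setup. Individual outcomes are drawn as independent Bernoulli trials, so the number of cases in each stratification follows \(\text{B}(N_{kj}, p_{kj})\).

```python
import random

P_CLUSTER, P_BACKGROUND = 0.3, 0.2  # prevalence rates defined in Step2

def generate_population(pop_sizes, cluster_districts, seed=0):
    """pop_sizes: {(k, j): N_kj}. Returns {(k, j): list of 0/1 outcomes}."""
    rng = random.Random(seed)
    population = {}
    for (k, j), n_kj in pop_sizes.items():
        p_kj = P_CLUSTER if k in cluster_districts else P_BACKGROUND
        population[(k, j)] = [1 if rng.random() < p_kj else 0 for _ in range(n_kj)]
    return population

def district_rate(population, k):
    """Observed prevalence in district k, pooled over its stratifications."""
    outcomes = [y for (kk, _), ys in population.items() if kk == k for y in ys]
    return sum(outcomes) / len(outcomes)

# Toy setup: 4 districts x 8 stratifications; districts 1 and 2 form the true cluster.
pop_sizes = {(k, j): 5000 for k in range(1, 5) for j in range(1, 9)}
population = generate_population(pop_sizes, cluster_districts={1, 2})
print(district_rate(population, 1), district_rate(population, 3))
```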
Sample data generation

(Step1) We defined the sample size for each district (\(n_{k}\)) between 900 and 920.

(Step2) The sample size (\(n_{kj}\)) for each stratification of each district was drawn from a multinomial distribution, with the sample size (\(n_{k}\)) defined in Step1 and the sampling proportion (\(q_{kj}\)). The assumed sampling proportions are listed in Table 3. In sampling proportion scenario (1), simple random sampling (SRS) was assumed, which means that \(q_{kj}\) was calculated as \(N_{kj} /N_{k}\). In sampling proportion scenario (2), we used the actual proportion of the 2018 KCHS by age group and sex as the sampling proportion. In sampling proportion scenario (3), we set a higher sampling proportion for the 35‒49 and 50‒64 year age groups and a lower sampling proportion for the 20‒34 and 65 and over age groups, for both males and females. This means that the sampling proportion in scenario (3) was more dispersed than the actual proportion of the 2018 KCHS [i.e., sampling proportion scenario (2)]. Through this scenario, we considered a situation where certain groups of the population were more or less likely to be sampled than others, which could cause sampling bias.

(Step3) We randomly sampled \(n_{kj}\) individuals from the hypothetical population dataset for each stratification of each district.

(Step4) The sampling weight (\(w_{kj}\)) of a sampled individual in district \(k\) and stratification \(j\) was calculated as the inverse of the probability that the individual was selected. The sampling weight was expressed as follows:
$$w_{kj} = \frac{N_{kj}}{n_{k} \times q_{kj}}.$$
The sampling weight (\(w_{kj}\)) was then adjusted using a post-stratification weight. The post-stratification weight was calculated as the ratio of the actual population from the 2018 Korean census to the sum of the sampling weights by age group and sex for each district. As assumed in the population data generation, we used stratification by age group and sex divided into eight stratifications. Writing the post-stratification weight as \(ps_{kj}\), it was calculated as follows:
$$ps_{kj} = \frac{N_{kj}}{\sum_{i = 1}^{n_{kj}} w_{kj}} = \frac{N_{kj}}{n_{kj} \times w_{kj}}.$$
Finally, the final sampling weight (\(w_{kj}^{final}\)) was calculated as follows:
$$w_{kj}^{final} = w_{kj} \times ps_{kj}.$$
The sampling procedure of the simulation study was conducted according to that described by Vandendijck et al. [25].
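The sample generation steps for a single district can be sketched as below. This is a minimal illustration under stated assumptions, not the code used in the study: the function name `draw_sample`, the symbol `ps_kj` for the post-stratification weight, the toy population, and the sampling proportions are hypothetical, and the final weight is taken to be the product of the sampling weight and the post-stratification weight. Stratifications that receive no sampled individuals are simply skipped, which is also an assumption.

```python
import random
from collections import Counter

def draw_sample(population_k, q_k, n_k, rng):
    """population_k: {j: list of 0/1 outcomes}; q_k: {j: sampling proportion}.
    Returns {j: (sampled outcomes, final weight)} for one district."""
    strata = sorted(population_k)
    # Step2: multinomial allocation of the district sample size over stratifications.
    counts = Counter(rng.choices(strata, weights=[q_k[j] for j in strata], k=n_k))
    result = {}
    for j in strata:
        n_kj = counts[j]
        if n_kj == 0:
            continue  # assumption: unsampled stratifications are dropped
        N_kj = len(population_k[j])
        sampled = rng.sample(population_k[j], n_kj)  # Step3: random draw
        w_kj = N_kj / (n_k * q_k[j])                 # Step4: sampling weight
        ps_kj = N_kj / (n_kj * w_kj)                 # post-stratification weight
        result[j] = (sampled, w_kj * ps_kj)          # final weight
    return result

# Toy district: 8 stratifications of 5000 people with 20% prevalence each.
population_k = {j: [1] * 1000 + [0] * 4000 for j in range(1, 9)}
q_k = {1: 0.05, 2: 0.2, 3: 0.2, 4: 0.05, 5: 0.05, 6: 0.2, 7: 0.2, 8: 0.05}
sample = draw_sample(population_k, q_k, n_k=900, rng=random.Random(1))

weighted_total = sum(len(ys) * w for ys, w in sample.values())
weighted_cases = sum(sum(ys) * w for ys, w in sample.values())
print(weighted_total, weighted_cases / weighted_total)
```

Under this reading the final weight reduces to \(N_{kj}/n_{kj}\), so the weights in a district sum to its population. This makes explicit how a sample of about 900 respondents is scaled up to tens of thousands of weighted observations when weighted frequencies are used.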
Results of simulation study
The simulation results were obtained for each combination of the true cluster model and sampling proportion scenario (two true cluster models and three sampling proportion scenarios). The average and standard deviation of sensitivity and PPV are presented in Table 4.
The simulation results showed a similar tendency for the average and standard deviation of sensitivity and PPV in all scenarios. The average sensitivity was generally high in all scenarios, regardless of which of the three types of data was used, while the average PPV was the highest in all scenarios when using the summary measure (crude rate estimates) from the aggregate-level data. Although the difference was not large, the average sensitivity for the aggregated data was the highest in four of six scenarios. Interestingly, the average PPV was very low when using the weighted frequency compared with the frequency and crude rate estimates. We found that a very large number of clusters were identified throughout the entire study area when using the weighted frequency, as seen in the real data analysis of the 2018 KCHS. In addition, when using the aggregate-level data, the standard deviations of sensitivity and PPV were relatively low across all scenarios, which implies that more consistent and stable results can be obtained than with the other approaches. Using the aggregated data from the complex survey seemed to reflect the true spatial cluster patterns better than using the other types of data.
Discussion
In this study, we examined which approach is more appropriate for spatial cluster detection using data from a population-based health survey. When analyzing the KCHS data, we found that the detected geographic cluster patterns of high disease prevalence varied depending on the type of data. To investigate which data type is more appropriate for spatial cluster detection using spatial scan statistics, we conducted a simulation study. The simulation study showed that area-level summary measure estimates are better at detecting spatial clusters with spatial scan statistics under various scenarios. In all scenarios, although the average sensitivity was similarly high regardless of which of the three types of data was used, the average PPV was the highest when using the area-level rate estimates. Therefore, summary measure estimates (aggregate-level data), which take the sample design and sampling weights into account, seem more appropriate than the other types of data for geographical cluster detection with the spatial scan statistic.
One limitation of this study is that we only partially implemented the sampling procedure of KCHS in the simulation study. KCHS is based on a two-stage stratified cluster sampling procedure; however, to simplify the simulation process, we did not consider the cluster sampling features. Nevertheless, this simplified sampling procedure appears to yield meaningful results because sampling weights are available for the data sampled from the hypothetical population.
Conclusion
Based on our findings from the simulation study, it seems more appropriate to use aggregate-level data (the rate estimates) among the three types of data from a population-based health survey when detecting spatial clusters with the spatial scan statistic. Further simulation studies that consider other sampling features, such as cluster sampling, will be needed to obtain more comprehensive results.
Availability of data and materials
The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request.
Abbreviations
MRCS: Maximum reported cluster size
MSWS: Maximum scanning window size
KCHS: Korea Community Health Survey
PPV: Positive predictive value
SRS: Simple random sampling
References
1. Kauhl B, Maier W, Schweikart J, Keste A, Moskwyn M. Exploring the small-scale spatial distribution of hypertension and its association to area deprivation based on health insurance claims in Northeastern Germany. BMC Public Health. 2018;18(1):121.
2. Tamura K, Puett RC, Hart JE, Starnes HA, Laden F, Troped PJ. Spatial clustering of physical activity and obesity in relation to built environment factors among older women in three US states. BMC Public Health. 2014;14(1):1–16.
3. Huang L, Tiwari RC, Pickle LW, Zou Z. Covariate adjusted weighted normal spatial scan statistics with applications to study geographic clustering of obesity and lung cancer mortality in the United States. Stat Med. 2010;29(23):2410–22.
4. Yamaoka K, Suzuki M, Inoue M, Ishikawa H, Tango T. Spatial clustering of suicide mortality and associated community characteristics in Kanagawa prefecture, Japan, 2011–2017. BMC Psychiatry. 2020;20(1):1–15.
5. Braveman P. Health disparities and health equity: concepts and measurement. Annu Rev Public Health. 2006;27:167–94.
6. Kulldorff M. A spatial scan statistic. Commun Stat Theory Methods. 1997;26(6):1481–96.
7. Kulldorff M, Huang L, Konty K. A scan statistic for continuous data based on the normal probability model. Int J Health Geogr. 2009;8:58.
8. Huang L, Tiwari RC, Zou Z, Kulldorff M, Feuer EJ. Weighted normal spatial scan statistic for heterogeneous population data. J Am Stat Assoc. 2009;104:886–98.
9. Jung I, Kulldorff M, Klassen AC. A spatial scan statistic for ordinal data. Stat Med. 2007;26:1594–607.
10. Jung I, Kulldorff M, Richard OJ. A spatial scan statistic for multinomial data. Stat Med. 2010;29:1910–8.
11. Kulldorff M. Information Management Services, Inc. SaTScan v9.6: software for the spatial and space-time scan statistics. 2018. www.satscan.org.
12. Jung I. A generalized linear models approach to spatial scan statistics for covariate adjustment. Stat Med. 2009;28(7):1131–43.
13. Lee J, Gangnon RE, Zhu J. Cluster detection of spatial regression coefficients. Stat Med. 2017;36:1118–33.
14. Lee J, Sun Y, Chang HH. Spatial cluster detection of regression coefficients in a mixed-effect model. Environmetrics. 2020;31:e2578.
15. Thacker SB, Berkelman RL. Public health surveillance in the United States. Epidemiol Rev. 1988;10:164–90.
16. Carlson SA, Densmore D, Fulton JE, Yore MM, Kohl HW 3rd. Differences in physical activity prevalence and trends from 3 US surveillance systems: NHIS, NHANES, and BRFSS. J Phys Act Health. 2009;6(S1):S18–27.
17. Heeringa SG, West BT, Berglund PA. Applied survey data analysis. Boca Raton: Chapman and Hall/CRC; 2017.
18. Roberson S, Dawit R, Moore J, Odoi A. An exploratory investigation of geographic disparities of stroke prevalence in Florida using circular and flexible spatial scan statistics. PLoS ONE. 2019;14(8):1–16.
19. Kebede SA, Liyew AM, Tesema GA, et al. Spatial distribution and associated factors of health insurance coverage in Ethiopia: further analysis of Ethiopia demographic and health survey, 2016. Arch Public Health. 2020;78(1):1–10.
20. Kang YW, Ko YS, Kim YJ, et al. Korea community health survey data profiles. Osong Public Health Res Perspect. 2015;6(3):211–7.
21. Han J, Zhu L, Kulldorff M, et al. Using Gini coefficient to determining optimal cluster reporting sizes for spatial scan statistics. Int J Health Geogr. 2016;15:27.
22. SAS Institute Inc. SAS 9.4 help and documentation. Cary: SAS Institute Inc., 2002–2012; 2017.
23. R Core Team. R: a language and environment for statistical computing. Vienna: R Foundation for Statistical Computing; 2013.
24. Kleinman K. Rsatscan: tools, classes, and methods for interfacing with SaTScan standalone software. 2015. https://CRAN.R-project.org/package=rsatscan.
25. Vandendijck Y, Faes C, Kirby RS, et al. Model-based inference for small area estimation with sampling weights. Spat Stat. 2016;18:455–73.
Acknowledgements
Not applicable.
Funding
This study was supported by the Research Program, funded by the Korea Disease Control and Prevention Agency (B0080520000732).
Author information
Contributions
IJ conceived the study. JM conducted the simulations and analyzed the data. All authors drafted the manuscript. All authors read and approved the final manuscript.
Ethics declarations
Ethics approval and consent to participate
This study was approved by the SNU Research Ethics Team (IRB No. E1912/001010).
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Moon, J., Jung, I. A simulation study for geographic cluster detection analysis on population-based health survey data using spatial scan statistics. Int J Health Geogr 21, 11 (2022). https://doi.org/10.1186/s12942-022-00311-6