Identifying geographical disparities in health outcomes is important for chronic diseases [1], physical activity [2], behavioral health [3], and mental health [4]. In particular, identifying locations where the risk of a health outcome is significantly high or low can guide targeted health programs and shape health policies to reduce health disparities [5]. Health authorities often conduct health surveys of the general population; analyzing spatial cluster patterns in these survey data may therefore be helpful.

Among the various statistical methods for geographic cluster detection, the spatial scan statistic proposed by Kulldorff [6] has been widely used in epidemiologic studies. This method calculates a likelihood ratio test statistic that compares the inside and the outside of a scanning window. The areas in the scanning window that maximizes the test statistic are identified as the most likely cluster. Monte Carlo hypothesis testing is typically used to obtain a p-value for testing the statistical significance of the most likely cluster. Spatial scan statistics have been developed for various probability models, such as Poisson [6], Bernoulli [6], normal [7, 8], ordinal [9], and multinomial [10]. Spatial scan statistics based on these models are available in the SaTScan™ software [11]. The method has also been extended to a regression modeling approach in which cluster detection is based on differing regression coefficients [12,13,14].
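
To make the likelihood ratio and Monte Carlo steps concrete, the following is a minimal sketch of a Bernoulli-based scan for high-rate clusters. It is an illustration under stated assumptions, not the SaTScan™ implementation: the function names (`bernoulli_llr`, `max_llr`, `monte_carlo_p`), the toy counts, and the precomputed list of candidate windows are all hypothetical, whereas an actual scan would typically build circular windows of varying radius around each location.

```python
import math
import random


def xlogy(x, y):
    """Return x * log(y), using the convention 0 * log(0) = 0."""
    return 0.0 if x == 0 else x * math.log(y)


def bernoulli_llr(c, n, C, N):
    """Log-likelihood ratio for a window holding c cases among n individuals,
    given C cases among N individuals overall (high-rate clusters only)."""
    if n == 0 or n == N:
        return 0.0
    p_in, p_out = c / n, (C - c) / (N - n)
    if p_in <= p_out:  # not a high-rate window
        return 0.0
    p0 = C / N
    alt = (xlogy(c, p_in) + xlogy(n - c, 1 - p_in)
           + xlogy(C - c, p_out) + xlogy(N - n - (C - c), 1 - p_out))
    null = xlogy(C, p0) + xlogy(N - C, 1 - p0)
    return alt - null


def max_llr(cases, sizes, windows):
    """Return (largest LLR, window attaining it) over the candidate windows."""
    C, N = sum(cases), sum(sizes)
    best = (0.0, None)
    for w in windows:
        c = sum(cases[i] for i in w)
        n = sum(sizes[i] for i in w)
        llr = bernoulli_llr(c, n, C, N)
        if llr > best[0]:
            best = (llr, w)
    return best


def monte_carlo_p(cases, sizes, windows, n_sim=999, seed=0):
    """Monte Carlo p-value: reassign the C cases at random among all N
    individuals and compare each simulated maximum LLR with the observed one."""
    rng = random.Random(seed)
    observed, window = max_llr(cases, sizes, windows)
    # One entry per individual, recording the location that individual lives in.
    owners = [i for i, n in enumerate(sizes) for _ in range(n)]
    C = sum(cases)
    exceed = 0
    for _ in range(n_sim):
        sim = [0] * len(sizes)
        for i in rng.sample(owners, C):
            sim[i] += 1
        if max_llr(sim, sizes, windows)[0] >= observed:
            exceed += 1
    return observed, window, (exceed + 1) / (n_sim + 1)


# Hypothetical toy data: five locations, six candidate windows.
cases = [30, 28, 5, 6, 4]
sizes = [50, 50, 50, 50, 50]
windows = [(0,), (1,), (0, 1), (2, 3), (3, 4), (2, 3, 4)]
llr, cluster, p_value = monte_carlo_p(cases, sizes, windows, n_sim=199)
print(cluster, round(llr, 2), p_value)
```

The p-value is computed as the rank of the observed maximum among the simulated maxima, which is why the number of replications is usually chosen so that the denominator `n_sim + 1` is a round number.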

Public health surveillance [15] is conducted to collect, analyze, and interpret health-related data for planning, implementing, and evaluating public health policies. As part of public health surveillance, health-related data are collected through population-based surveys. The data obtained from these ongoing surveys can be used to understand trends in public health [16]. Such health surveys are often based on complex sampling [17], which combines several design features such as stratification, cluster sampling, and disproportionate sampling. To generalize the results to the entire population, these sample design features need to be incorporated into estimation and analysis. It therefore seems appropriate to account for the sample design and sampling weights when exploring spatial cluster patterns with the spatial scan statistic.

Some studies have applied spatial scan statistics to population-based health survey data for geographic cluster detection. However, most of these studies used the observed survey responses without considering sample designs and sampling weights. Roberson et al. [18] identified spatial clusters of high stroke prevalence by applying the spatial scan statistic under the discrete Poisson probability model to a population-based health survey (Behavioral Risk Factor Surveillance System). They specified the number of stroke cases in each county, derived from the observed binary responses, as the case variable in the analysis. Kebede et al. [19] identified spatial clusters of high health coverage among women aged 15‒49 years by applying the Bernoulli-based spatial scan statistic to a population-based health survey (Ethiopian Demographic and Health Survey). Similarly, they specified the number of observed health coverage cases, derived from a binary response, as the case variable in the analysis.

Two approaches are available for using survey responses observed as binary outcomes. One approach is to use the individual-level data as they are, with the binary responses coded as 0 and 1. In this case, spatial cluster detection can be conducted using the Bernoulli-based spatial scan statistic [6]. The other approach is to use aggregate-level data, which summarize the individual-level data into a region-level rate for each location. The sampling design and weights can be taken into account when calculating the region-level rates. For this type of data, spatial cluster detection can be conducted using the weighted normal spatial scan statistic [8], which is used to identify clusters with high values of regional measures (e.g., mortality rate and disease prevalence at the regional level) when the population is heterogeneous.
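
As an illustration of the aggregation step, the sketch below computes a region-level rate as the ratio of the weighted number of positive responses to the sum of the weights. This ratio-type estimator is one common choice and is an assumption here, since the text does not specify the estimator used; the function name `weighted_regional_rates` and the toy records are hypothetical, and no design-based variance is computed.

```python
from collections import defaultdict


def weighted_regional_rates(records):
    """Survey-weighted rate per region.

    records: iterable of (region, y, weight), where y is a 0/1 response
    and weight is the sampling weight of the respondent.
    """
    num = defaultdict(float)  # weighted count of positive responses
    den = defaultdict(float)  # sum of weights (estimated population size)
    for region, y, w in records:
        num[region] += w * y
        den[region] += w
    return {region: num[region] / den[region] for region in den}


# Hypothetical respondents: (region, binary response, sampling weight).
records = [
    ("A", 1, 2.0), ("A", 0, 1.0), ("A", 0, 1.0),
    ("B", 1, 1.0), ("B", 1, 3.0), ("B", 0, 4.0),
]
rates = weighted_regional_rates(records)
unweighted = weighted_regional_rates((r, y, 1.0) for r, y, _ in records)
print(rates, unweighted)
```

In this toy example the unweighted rates differ between the two regions, whereas the weighted rates coincide, which shows how ignoring the weights can change the regional pattern that a scan statistic sees.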

It is unclear which model is appropriate when health survey data come from a complex survey design. For spatial cluster detection of disease prevalence, either individual-level data or data aggregated at the regional level can be used. To account for the sampling design, frequencies weighted by the sampling weights can also be used as binary data. We first applied these approaches to the Korea Community Health Survey (KCHS), one of several population-based health surveys in South Korea, and identified statistically significant spatial clusters with high rates of diabetes diagnosis among men. Because the cluster detection results differed markedly depending on the type of data, we then conducted a simulation study, using data sampled from a hypothetical population, to examine which of the three approaches is most appropriate. To mimic real health survey data, several design features were taken into account when generating the simulation data, such as stratification with different sampling proportions and post-stratification weights. We compared the accuracy of the detected clusters in terms of sensitivity and positive predictive value.
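
The two accuracy measures can be sketched as follows for a simulation in which the true cluster is known. This is a minimal illustration that counts regions; whether the measures are computed over regions or over population is not stated above, so region counting is an assumption, as are the function name `cluster_accuracy` and the convention of returning a positive predictive value of 0 when nothing is detected.

```python
def cluster_accuracy(detected, true_cluster):
    """Return (sensitivity, positive predictive value) for a detected cluster.

    detected, true_cluster: collections of region identifiers.
    Sensitivity = share of true-cluster regions that were detected.
    PPV         = share of detected regions that lie in the true cluster.
    """
    detected, true_cluster = set(detected), set(true_cluster)
    if not true_cluster:
        raise ValueError("true_cluster must contain at least one region")
    overlap = len(detected & true_cluster)
    sensitivity = overlap / len(true_cluster)
    # Assumed convention: PPV is 0 when no region is detected.
    ppv = overlap / len(detected) if detected else 0.0
    return sensitivity, ppv


# Hypothetical region identifiers.
sens, ppv = cluster_accuracy(detected={1, 2, 3, 4}, true_cluster={2, 3, 4, 5, 6})
print(sens, ppv)
```

A detected cluster that is too large tends to have high sensitivity but low positive predictive value, and one that is too small tends to show the opposite, so the two measures are usually reported together.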