 Methodology
 Open access
Optimizing the maximum reported cluster size for the multinomial-based spatial scan statistic
International Journal of Health Geographics volume 22, Article number: 30 (2023)
Abstract
Background
Correctly identifying spatial disease clusters is a fundamental concern in public health and epidemiology. The spatial scan statistic is widely used for detecting spatial disease clusters in spatial epidemiology and disease surveillance. Many studies default to a maximum reported cluster size (MRCS) set at 50% of the total population when searching for spatial clusters. However, this default setting can sometimes report clusters that are larger than the true clusters and therefore include less relevant regions. For the Poisson, Bernoulli, ordinal, normal, and exponential models, a Gini coefficient has been developed to optimize the MRCS. Yet, no measure is available for the multinomial model.
Results
We propose two versions of a spatial cluster information criterion (SCIC) for selecting the optimal MRCS value for the multinomial-based spatial scan statistic. Our simulation study suggests that SCIC improves the accuracy of reporting true clusters. Analysis of the Korea Community Health Survey (KCHS) data further demonstrates that our method identifies more meaningful small clusters compared to the default setting.
Conclusions
Our method focuses on improving the performance of the spatial scan statistic by optimizing the MRCS value when using the multinomial model. In public health and disease surveillance, the proposed method can be used to provide more accurate and meaningful spatial cluster detection for multinomial data, such as disease subtypes.
Introduction
In public health and disease surveillance, the spatial scan statistic is a widely used method for identifying spatial clusters with significantly high or low risk of disease outcomes. This method is based on a likelihood ratio test statistic computed for each scanning window to compare the inside of the window with the outside. The scanning window that maximizes the test statistic is identified as the most likely cluster. Secondary clusters with high values of the test statistic are also identified. The statistical significance of the most likely cluster and secondary clusters is determined using Monte Carlo hypothesis testing. The spatial scan statistic has been developed for various probability models such as Poisson [1], Bernoulli [1], exponential [2], ordinal [3], normal [4, 5], and multinomial [6]. SaTScan™ software is freely available for conducting spatial cluster detection analysis using various models of the spatial scan statistic.
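The Monte Carlo step can be illustrated with a minimal sketch. It assumes the usual rank-based p-value, in which the observed maximum test statistic is ranked among the maxima obtained from replicate datasets generated under the null hypothesis. The function name and the gamma-distributed placeholder values are hypothetical and serve only to make the example runnable.

```python
import numpy as np

def monte_carlo_p_value(observed_stat, null_stats):
    """Rank-based Monte Carlo p-value: (1 + #{null >= observed}) / (1 + R)."""
    null_stats = np.asarray(null_stats, dtype=float)
    n_at_least = int(np.sum(null_stats >= observed_stat))
    return (1 + n_at_least) / (1 + null_stats.size)

# Placeholder values standing in for the maximum test statistics of
# 999 datasets simulated under the null hypothesis (hypothetical).
rng = np.random.default_rng(0)
null_max_stats = rng.gamma(shape=2.0, scale=1.5, size=999)

p_value = monte_carlo_p_value(12.0, null_max_stats)
print(f"Monte Carlo p-value: {p_value:.3f}")
```

With 999 replicates, the smallest attainable p-value is 1/1000, which is why replicate counts of this form are common.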
The spatial scan statistic differs from spatial clustering methods such as ADCN [7] and STICC [8] in that the method is designed for identifying clusters rather than dividing spatial data into distinct subgroups. A cluster is defined as a geographically and/or temporally bounded group of occurrences of sufficient size and concentration to be unlikely to have occurred by chance [9]. The clusters are characterized by the statistical distribution of the outcome, not just by the distance between geographic objects as in density-based clustering. Spatial clustering methods are commonly used in geodata mining [10,11,12], while the spatial scan statistic is widely utilized for detecting geographic disease clusters [13,14,15].
In SaTScan™, researchers are required to specify the scanning window shape and the maximum scanning window size (MSWS). In many studies, the MSWS value is set to the default setting, which is 50% of the total population. A simulation study by Ribeiro and Costa [16] revealed that spatial cluster detection results can vary depending on the MSWS value. Nevertheless, running the analysis multiple times with different MSWS values to find the best result is not advisable, as it may lead to a multiple testing problem, as argued by Han et al. [17]. Han et al. proposed an alternative approach in which the analysis is run with a fixed large MSWS value while the maximum reported cluster size (MRCS) is varied. Setting the MRCS value to the default 50% may result in the reporting of clusters larger than the true clusters, encompassing less meaningful regions. Therefore, it is advisable to carefully select an optimal MRCS value.
Several studies have recently developed criteria to select the optimal value of the MRCS. Han et al. [17] proposed an optimization criterion using the Gini coefficient [18] specifically for the Poisson-based spatial scan statistic. Their simulation study showed that the proposed Gini coefficient effectively identified the correct clusters. However, it is important to note that the Gini coefficient needs to be defined differently for different probability models. Kim and Jung [19], Yoo and Jung [20], and Lee et al. [21] developed the Gini coefficient for the ordinal, normal, and exponential-based spatial scan statistics, respectively. Yet, no Gini coefficient has been developed for the multinomial-based spatial scan statistic. The difficulty in defining a clear Gini coefficient for the multinomial-based spatial scan statistic arises from its inapplicability to nominal values.
Other studies [22,23,24] have proposed alternative criteria for selecting the optimal MRCS or MSWS. However, these studies only evaluated the performance of their methods for the Poisson-based spatial scan statistic. Because the methods are likelihood-based optimization criteria, they can potentially be extended to other probability models. Nevertheless, it remains crucial to carefully evaluate the effectiveness of these methods when applied to probability models other than the Poisson model.
In this study, we propose a spatial cluster information criterion (SCIC) inspired by the formulation of the Bayes information criterion (BIC) [25] to choose the optimal MRCS value for the multinomial-based spatial scan statistic. The SCIC can be defined for the spatial scan statistic irrespective of the underlying probability model, as its approach is rooted in the likelihood ratio test statistic. To assess the performance of our proposed method, we conducted a simulation study for both the multinomial-based and ordinal-based spatial scan statistics. We compared the performance of our proposed method with that of existing approaches. To exemplify the methodology, we utilized the Korea Community Health Survey (KCHS) data collected by the Korea Centers for Disease Control and Prevention.
Methods
Spatial scan statistic for multinomial data
The multinomial-based spatial scan statistic [6] is used to detect disease clusters with statistically different disease-type distributions. Let \({p}_{k}\) and \({q}_{k}\) denote the probabilities of category \(k\) inside and outside the scanning window \(z\), respectively. If we want to identify regions with different disease-type distributions, the null and alternative hypotheses are stated as
$${H}_{0}: {p}_{1}={q}_{1}, \cdots , {p}_{K}={q}_{K}\ \text{for all } z\in Z \quad \text{vs.} \quad {H}_{a}: \text{not } {H}_{0}\ \text{for some } z\in Z$$
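where \(Z\) denotes the set of all scanning windows and \(K\) denotes the total number of categories. The likelihood ratio test statistic, given the scanning window \(z\), is denoted as
$${\lambda }_{z}=\frac{\prod _{k=1}^{K}{\left(\frac{\sum _{i\in z}{c}_{ik}}{\sum _{k'=1}^{K}\sum _{i\in z}{c}_{ik'}}\right)}^{\sum _{i\in z}{c}_{ik}}{\left(\frac{\sum _{i\notin z}{c}_{ik}}{\sum _{k'=1}^{K}\sum _{i\notin z}{c}_{ik'}}\right)}^{\sum _{i\notin z}{c}_{ik}}}{\prod _{k=1}^{K}{\left(\frac{{C}_{k}}{C}\right)}^{{C}_{k}}}$$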
where \({c}_{ik}\) is the number of cases belonging to category \(k\) inside the region \(i\), \({C}_{k}\) is the total number of cases belonging to category \(k\) in the whole study area and \(C\) is the total number of cases in the whole study area.
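As an illustration of how this statistic behaves, the following minimal sketch computes the log-likelihood ratio for a single window from a region-by-category count table. It assumes the standard multinomial form, in which the category proportions estimated inside and outside the window are compared with the pooled proportions over the whole study area. The function name and the input layout (a count array plus a boolean window mask) are hypothetical choices for the example, not part of SaTScan™.

```python
import numpy as np

def multinomial_llr(counts, in_window):
    """Log-likelihood ratio of one scanning window under the multinomial model.

    counts    : (n_regions, K) array, counts[i, k] = cases of category k in region i
    in_window : boolean array of length n_regions, True for regions inside the window
    """
    counts = np.asarray(counts, dtype=float)
    in_window = np.asarray(in_window, dtype=bool)
    inside = counts[in_window].sum(axis=0)    # category totals inside the window
    outside = counts[~in_window].sum(axis=0)  # category totals outside the window
    pooled = inside + outside                 # C_k for the whole study area

    def max_loglik(cat_counts):
        # Multinomial log-likelihood at the MLE; zero counts contribute nothing.
        total = cat_counts.sum()
        positive = cat_counts > 0
        return float(np.sum(cat_counts[positive] * np.log(cat_counts[positive] / total)))

    return max_loglik(inside) + max_loglik(outside) - max_loglik(pooled)

# Two regions, two categories: the window holds only category 1 cases.
llr = multinomial_llr([[20, 0], [0, 20]], [True, False])
print(round(llr, 4))
```

When the category distribution is identical inside and outside the window, the value is zero; it grows as the two distributions diverge and as the number of cases increases.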
Spatial cluster information criterion (SCIC)
Now we propose an optimization criterion called the spatial cluster information criterion (SCIC) for selecting the optimal MRCS value. Our criterion draws inspiration from the formulation of the Bayes information criterion (BIC) [25], which is a widely used criterion in statistical modeling for model selection. The BIC for a candidate model \({M}_{u}\) is defined as
$$BIC\left({M}_{u}\right)=-2\log L\left(\widehat{{\theta }_{u}}|y\right)+u\cdot \log\left(v\right)$$
where \(y\) is the observed data, \(L\left({\theta }_{u}|y\right)\) is the likelihood of \(y\) given the model \({M}_{u}\), \(\widehat{{\theta }_{u}}\) is the maximum likelihood estimate (MLE) of \({\theta }_{u}\), that is, the value that maximizes \(L\left({\theta }_{u}|y\right)\), \(u\) is the number of parameters in the model \({M}_{u}\), and \(v\) is the total number of observations. The second component of the BIC equation is a penalty term, which penalizes models with additional parameters. The model exhibiting the minimum BIC value is considered the most appropriate selection [26].
We define the SCIC as the sum of the log-likelihood ratio (LLR) test statistic for all significant clusters, along with a penalty term. In the multinomial-based spatial scan statistic, the LLR test statistic for each scanning window is used to measure the degree of heterogeneity in the spatial distribution of the categories. A higher LLR test statistic indicates a greater difference between the category distribution within the scanning window and that of the surrounding area. However, as the scanning window size increases, there is a tendency for the LLR test statistic to rise due to the growing number of cases included within the window.
The spatial scan statistic has faced criticism for its tendency to identify clusters that are considerably larger than the actual clusters, often incorporating neighboring regions with no elevated risk of disease occurrence [27,28,29]. This tendency is mainly noticeable when the default settings of MSWS and MRCS, both set at 50%, are used with circular scanning windows. Optimizing the MRCS improves the spatial scan statistic's ability to identify clusters with greater precision [17, 19,20,21]. To utilize the sum of the LRT statistics as an optimizing criterion, we need to offset the inflation of the test statistic due to a large number of observations within the window.
The penalty term in the SCIC is defined in two versions. In the first version, the penalty term is calculated by multiplying the logarithm of the number of cases within the significant clusters by the product of the number of categories and the number of significant clusters. In the second version, we substitute the number of regions inside the significant clusters for the number of cases. This is based on the understanding that the number of cases within a cluster tends to increase as the number of regions inside the cluster increases. Both versions serve as optimization criteria with similar implications. For the multinomial model, the algorithm for computing the SCIC is as follows:

(Step 1) For a given MRCS \(m\)% (\(m\)=1, …, 50), denote \({J}_{m}\) significant clusters reported using the multinomial-based spatial scan statistic by \({Z}_{1}^{\left(m\right)}, \cdots , {Z}_{{J}_{m}}^{\left(m\right)}\).

(Step 2) For each \(m\), calculate the SCIC for all significant clusters as follows:
$${SCIC}_{1}\left(m\right)=-2\sum _{j=1}^{{J}_{m}}\log\left({\lambda }_{{Z}_{j}^{\left(m\right)}}\right)+K\cdot {J}_{m}\cdot \log\left({\tau }^{\left(m\right)}\right)\quad \text{(Version 1)}$$
$${SCIC}_{2}\left(m\right)=-2\sum _{j=1}^{{J}_{m}}\log\left({\lambda }_{{Z}_{j}^{\left(m\right)}}\right)+K\cdot {J}_{m}\cdot \log\left({\delta }^{\left(m\right)}\right)\quad \text{(Version 2)}$$
where \({\lambda }_{{Z}_{j}^{\left(m\right)}}\) denotes the LRT statistic for the multinomial-based spatial scan statistic given the \({j}^{th}\) significant cluster \({Z}_{j}^{\left(m\right)}\), \(K\) is the total number of categories, and \({\tau }^{\left(m\right)}\) and \({\delta }^{\left(m\right)}\) denote the sum of the number of total cases and the sum of the number of regions inside all significant clusters, respectively. As in the BIC, the likelihood term enters with a negative sign, so that larger LLR values lower the criterion while the penalty term raises it.

(Step 3) Choose the MRCS which minimizes the SCIC as the optimal MRCS.
Figure 1 illustrates the flowchart of the proposed method.
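The three steps can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: it assumes that the LLR values of the significant clusters, the number of cases, and the number of regions inside them have already been extracted from the scan output for each candidate MRCS. It takes the likelihood term with the BIC-style sign, that is, minus twice the summed LLR, so that minimization rewards larger LLR values. The function names, the `results` layout, and all numbers are hypothetical, and candidate values with no significant cluster are simply skipped, which is an assumption of this sketch.

```python
import math

def scic(cluster_llrs, n_categories, size_in_clusters):
    """-2 * sum of cluster LLRs + K * J * log(size), with J = number of clusters."""
    n_clusters = len(cluster_llrs)
    if n_clusters == 0 or size_in_clusters <= 0:
        raise ValueError("at least one significant cluster with positive size is required")
    penalty = n_categories * n_clusters * math.log(size_in_clusters)
    return -2.0 * sum(cluster_llrs) + penalty

def select_optimal_mrcs(results, n_categories, version=1):
    """Return the MRCS minimizing the SCIC, together with all SCIC values.

    version 1 penalizes by the number of cases, version 2 by the number of regions.
    """
    size_key = "cases" if version == 1 else "regions"
    scores = {
        m: scic(r["llrs"], n_categories, r[size_key])
        for m, r in results.items()
        if r["llrs"]  # assumption: skip MRCS values without significant clusters
    }
    return min(scores, key=scores.get), scores

# Hypothetical scan summaries for three candidate MRCS values (in percent).
results = {
    5:  {"llrs": [40.0], "cases": 80,  "regions": 5},
    10: {"llrs": [41.0], "cases": 160, "regions": 10},
    50: {"llrs": [42.0], "cases": 500, "regions": 30},
}
best_m, scores = select_optimal_mrcs(results, n_categories=4, version=1)
print(best_m, {m: round(s, 2) for m, s in scores.items()})
```

In this toy input the LLR grows only slightly as the reported cluster gets larger, so the penalty dominates and the smallest candidate is selected.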
Elbow method, MCSP, and MCHSP
For the Poisson-based spatial scan statistic, optimization criteria such as the elbow method [22], the maximum clustering set-proportion (MCSP) [23], and the maximum clustering heterogeneous set-proportion (MCHSP) [24] have been proposed to determine the optimal value of MRCS or MSWS. Since these methods are likelihood-based optimization criteria, we have adapted them to the multinomial model in order to evaluate and compare their performance with our proposed approaches. The procedure is the same as for the SCICs, with the only difference being the measure that is calculated. It is important to emphasize that we should consider optimizing MRCS, not MSWS, to avoid the multiple testing problem, as noted by Han et al. [17].
The elbow method [30] is commonly employed in unsupervised learning to determine the optimal number of clusters by identifying the elbow point. In the context of selecting the optimal MRCS value, Meysami et al. [22] proposed an optimization criterion for the Poisson model by adopting the method for finding the optimal elbow point as suggested by Delgado et al. [31]. We employ the method for the multinomial model by calculating the negative sum of the likelihood ratio test (LRT) statistic values over all \({J}_{m}\) significant clusters for each \(m\) as
$$LRT\left(m\right)=-\sum _{j=1}^{{J}_{m}}{\lambda }_{{Z}_{j}^{\left(m\right)}}$$
where \({\lambda }_{{Z}_{j}^{\left(m\right)}}\) denotes the LRT statistic value for the \({j}{\text{th}}\) significant cluster \({Z}_{j}^{\left(m\right)}\) (\(j\)= 1, …, \({J}_{m}\)). If no significant cluster is present, use the maximum LRT statistic. The elbow plot is constructed by connecting the points (\(m, LRT\left(m\right)\)) for \(m\)= 1, …, 50. For each \(m\), we calculate the orthogonal distance between the point (\(m, LRT(m)\)) and the line connecting the first and last points. The optimal MRCS is the one that maximizes this orthogonal distance.
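The orthogonal-distance rule can be sketched as follows. The example assumes that the distance is measured on the raw axes without rescaling, which may differ from the normalization used in [31]; the function name and the criterion values are hypothetical.

```python
import numpy as np

def elbow_by_orthogonal_distance(x, y):
    """Index of the point farthest from the line joining the first and last points."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx, dy = x[-1] - x[0], y[-1] - y[0]
    length = np.hypot(dx, dy)
    if length == 0:
        raise ValueError("first and last points must differ")
    # Perpendicular distance of each (x, y) from the chord through the endpoints.
    distances = np.abs(dx * (y - y[0]) - dy * (x - x[0])) / length
    return int(np.argmax(distances)), distances

# Hypothetical negative summed LRT values for a few candidate MRCS values (percent).
mrcs = [1, 2, 3, 4, 5, 10, 20, 50]
neg_sum_lrt = [-5.0, -20.0, -30.0, -33.0, -34.0, -35.0, -36.0, -37.0]

idx, distances = elbow_by_orthogonal_distance(mrcs, neg_sum_lrt)
print("elbow at MRCS =", mrcs[idx])
```

Because the distance depends on the scale of both axes, the selected point can change if the MRCS axis or the criterion axis is rescaled; this is one reason the choice of normalization matters in practice.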
Ma et al. [23] proposed the maximum clustering set-proportion (MCSP) as an optimization criterion to determine the optimal value of the MSWS for the Poisson-based spatial scan statistic. This criterion assumes that all identified significant clusters are homogeneous clusters with the same relative risks. However, considering the issue of multiple testing, analyzing the data multiple times with different MSWS values to select the best result might not be appropriate. In our study, we adapt the MCSP criterion to the multinomial model and utilize it to select the optimal MRCS, while keeping the MSWS value fixed at 50%. To apply the MCSP to the multinomial model, we first define the union cluster set \({Z}_{A}^{\left(m\right)}\) by merging all \({J}_{m}\) clusters for each \(m\) as
$${Z}_{A}^{\left(m\right)}=\bigcup _{j=1}^{{J}_{m}}{Z}_{j}^{\left(m\right)}$$
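where \({Z}_{j}^{\left(m\right)}\) is the \({j}{\text{th}}\) detected significant cluster (\(j\)= 1, …, \({J}_{m}\)). Then, we calculate the union log-likelihood ratio (LLR) test statistic \(log{\lambda }_{{Z}_{A}^{\left(m\right)}}\) given the union cluster set \({Z}_{A}^{\left(m\right)}\) as
$$\log{\lambda }_{{Z}_{A}^{\left(m\right)}}=\sum _{k=1}^{K}\left[\left(\sum _{i\in {Z}_{A}^{\left(m\right)}}{c}_{ik}\right)\log\left(\frac{\sum _{i\in {Z}_{A}^{\left(m\right)}}{c}_{ik}}{\sum _{i\in {Z}_{A}^{\left(m\right)}}{c}_{i}}\right)+\left(\sum _{i\notin {Z}_{A}^{\left(m\right)}}{c}_{ik}\right)\log\left(\frac{\sum _{i\notin {Z}_{A}^{\left(m\right)}}{c}_{ik}}{\sum _{i\notin {Z}_{A}^{\left(m\right)}}{c}_{i}}\right)-{C}_{k}\log\left(\frac{{C}_{k}}{C}\right)\right]$$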
where \({c}_{ik}\), \({C}_{k}\), and \(C\) were as defined previously and \({c}_{i}\) is the number of cases inside the region \(i\). The optimal MRCS is the one that maximizes the union LLR test statistic \(log{\lambda }_{{Z}_{A}^{\left(m\right)}}\).
Considering the possibility of detected significant clusters being heterogeneous with varying relative risks, Wang et al. [24] introduced the maximum clustering heterogeneous set-proportion (MCHSP) as an optimization criterion to determine the optimal value of the MSWS. As with the MCSP, we adapt the MCHSP criterion to the multinomial model and utilize it to select the optimal MRCS, while maintaining a fixed MSWS value of 50%. For each \(m\), we define the heterogeneous cluster set \({Z}_{B}^{\left(m\right)}\) by merging \({J}_{m}\) detected significant clusters into \({W}_{m} ({W}_{m}\le {J}_{m})\) merged clusters according to their spatial contiguity.
Then we calculate the union LLR test statistic \(log{\lambda }_{{Z}_{B}^{\left(m\right)}}\) given the heterogeneous cluster set \({Z}_{B}^{\left(m\right)}\) as
The optimal MRCS is the one that maximizes the union LLR test statistic \(log{\lambda }_{{Z}_{B}^{\left(m\right)}}\).
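The two merging steps used by the MCSP and the MCHSP can be sketched as set operations. The example assumes that clusters are given as sets of region identifiers, that spatial contiguity is supplied as an adjacency dictionary, and that two clusters are merged whenever they share a region or contain neighboring regions. This merging rule, the function names, and the toy geography are assumptions of the sketch, not details taken from [23] or [24].

```python
def union_cluster_set(clusters):
    """Union of all significant clusters (homogeneous setting, MCSP)."""
    return set().union(*clusters)

def merge_contiguous_clusters(clusters, adjacency):
    """Merge clusters that overlap or touch into W <= J groups (MCHSP setting)."""
    clusters = [set(c) for c in clusters]
    parent = list(range(len(clusters)))

    def find(a):
        # Union-find root lookup with path halving.
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    def touches(a, b):
        # True if the clusters share a region or have adjacent regions.
        return bool(a & b) or any(adjacency.get(r, set()) & b for r in a)

    for i in range(len(clusters)):
        for j in range(i + 1, len(clusters)):
            if touches(clusters[i], clusters[j]):
                parent[find(i)] = find(j)

    merged = {}
    for i, c in enumerate(clusters):
        merged.setdefault(find(i), set()).update(c)
    return sorted(merged.values(), key=min)

# Seven hypothetical regions arranged in a row: 1-2-3-4-5-6-7.
adjacency = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3, 5}, 5: {4, 6}, 6: {5, 7}, 7: {6}}
clusters = [{1, 2}, {3}, {6, 7}]

print(union_cluster_set(clusters))
print(merge_contiguous_clusters(clusters, adjacency))
```

Here the first two clusters are neighbors and collapse into one merged cluster, while the third remains separate, so three detected clusters become two merged clusters.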
Simulation study
We conducted a simulation study to evaluate the performance of the proposed method for the multinomial model in comparison to other existing methods. The study region comprised Seoul and Gyeonggi Province in South Korea, consisting of 69 districts. For the simulation, we considered five different true cluster models as depicted in Fig. 2. True cluster models (A) and (B) represented one circular-shaped and one elliptical-shaped true cluster, respectively, each consisting of 5 districts, which accounted for 8% of the entire study region. True cluster model (C) depicted one irregular-shaped true cluster with 10 districts, representing 15% of the entire study region. True cluster models (D) and (E) assumed two circular-shaped and two elliptical-shaped true clusters, respectively, each consisting of 5 districts.
For each true cluster model, we considered various scenarios of the alternative hypothesis, assuming four categories. The parameter settings for the alternative hypotheses were adopted from a previous study [6]. The null hypothesis was set to equal probabilities of 0.25 for each of the four categories. In the previous study [6], several alternative hypotheses were used to evaluate the multinomial-based spatial scan statistic, and the statistic was shown to work well under those hypotheses. Because this study aimed to assess a method for optimizing the MRCS for the multinomial-based spatial scan statistic, we evaluated its performance under the same hypotheses. Furthermore, because the alternative hypotheses satisfy the likelihood ratio ordering, we were also able to evaluate the performance of the ordinal model [3]. For the true cluster models with two clusters, we included heterogeneous settings where different alternative hypotheses were assigned to each cluster, as well as homogeneous settings where the same alternative hypotheses were applied to both clusters. This allowed us to examine the performance of the proposed method in more plausible heterogeneous settings, where the relative risks of each category differ between the two clusters. We considered four alternative hypotheses for the true cluster models with one cluster and two homogeneous clusters, as well as three alternative hypotheses for the true cluster models with two heterogeneous clusters. This resulted in a total of 26 scenarios (3 × 4 for the one-cluster models, 2 × 4 for the two homogeneous clusters, and 2 × 3 for the two heterogeneous clusters). Table 1 presents the simulation scenarios for the true cluster model along with their respective alternative hypotheses.
Under each scenario, we generated 1000 datasets, each containing 1000 cases distributed among four categories. For each dataset, we repeatedly identified clusters by varying the MRCS values. In SaTScan™, the MRCS value was set to 1%, 2%, 3%, 4%, 5%, 6%, 8%, 10%, 12%, 15%, 20%, 25%, 30%, 35%, 40%, 45%, and 50%. As SaTScan™ provides Gini coefficient values for these 17 candidate MRCS values in the Bernoulli and Poisson models, we computed the SCICs, Gini coefficient (for the ordinal model), Elbow method, MCSP, and MCHSP values for these 17 candidate MRCS values for consistency. Then, we compared the clusters reported by each method at its selected optimal MRCS with the true clusters. Regarding the scanning window shape, we present the simulation results obtained with elliptical windows as the main results because Kulldorff et al. [32] found that the spatial scan statistic with elliptical windows exhibited good power when the shape of the true cluster is elliptical or circular.
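A single simulated dataset of this kind can be generated with a short sketch. Several details are assumptions made only to keep the example runnable: cases are assigned to the 69 districts uniformly at random rather than in proportion to the actual district populations, the true cluster is taken to be districts 0 to 4, and the alternative probabilities (0.1, 0.2, 0.3, 0.4) are placeholders rather than the values listed in Table 1.

```python
import numpy as np

def simulate_dataset(n_regions, cluster_regions, p_inside, p_outside, n_cases, rng):
    """Return an (n_regions, K) table of category counts for one simulated dataset."""
    # Assumption: every case is equally likely to fall in any region.
    region_of_case = rng.integers(0, n_regions, size=n_cases)
    cases_per_region = np.bincount(region_of_case, minlength=n_regions)

    in_cluster = np.zeros(n_regions, dtype=bool)
    in_cluster[list(cluster_regions)] = True

    counts = np.zeros((n_regions, len(p_outside)), dtype=int)
    for i in range(n_regions):
        probs = p_inside if in_cluster[i] else p_outside
        counts[i] = rng.multinomial(cases_per_region[i], probs)
    return counts

rng = np.random.default_rng(2023)
counts = simulate_dataset(
    n_regions=69,
    cluster_regions=range(5),            # hypothetical true cluster of 5 districts
    p_inside=[0.1, 0.2, 0.3, 0.4],       # placeholder alternative hypothesis
    p_outside=[0.25, 0.25, 0.25, 0.25],  # null probabilities from the text
    n_cases=1000,
    rng=rng,
)
print(counts.shape, counts.sum())
```

Repeating this call 1000 times per scenario, and scanning each dataset at the 17 candidate MRCS values, would reproduce the overall structure of the simulation described above.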
Over 1000 randomly generated datasets, we recorded the frequency at which each candidate MRCS value was selected as the optimal MRCS for each method. To compare the performance of the proposed method with other existing methods and the default setting (MRCS value of 50%), we used sensitivity, positive predictive value (PPV), and misclassification as the performance measures, as per a previous study [33]. Sensitivity represents the proportion of correctly identified districts within the true cluster, while PPV represents the proportion of correctly identified districts within the detected cluster. Higher values of these measures indicate greater precision in identifying the true cluster. A lower sensitivity means that the method failed to identify some districts that belong to the true cluster. A lower PPV means that the method identified some districts that do not belong to the true cluster. Misclassification indicates the proportion of incorrectly identified districts within the true or detected cluster. Higher sensitivity and PPV values, along with lower misclassification values, indicate better performance in accurately identifying clusters. We calculated the average sensitivity, PPV, and misclassification over 1000 simulated datasets for two sets of MRCS values: (1) those selected by \({SCIC}_{1}\), \({SCIC}_{2}\), Gini coefficient (only for the ordinal model), Elbow method, MCSP, and MCHSP, and (2) the default value of 50%. The simulation was conducted using SaTScan™ version 10.0 and R software version 4.0.2, employing the ‘rsatscan’ package [34].
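The three measures can be computed directly from the sets of true and detected districts, as in the following minimal sketch. Sensitivity and PPV follow the definitions given above. For misclassification, the sketch adopts one possible reading of the definition, namely the number of missed plus falsely included districts divided by the number of districts in the true or detected cluster; the exact denominator used in [33] may differ, and the function name is hypothetical.

```python
def detection_accuracy(true_regions, detected_regions):
    """Sensitivity, PPV, and misclassification for one simulated dataset."""
    true_regions = set(true_regions)
    detected_regions = set(detected_regions)
    if not true_regions:
        raise ValueError("the true cluster must contain at least one region")

    correct = true_regions & detected_regions
    sensitivity = len(correct) / len(true_regions)
    # PPV is undefined when nothing is detected; NaN is returned in that case.
    ppv = len(correct) / len(detected_regions) if detected_regions else float("nan")

    # Assumed definition: wrongly classified districts over the union of both sets.
    wrong = true_regions ^ detected_regions
    misclassification = len(wrong) / len(true_regions | detected_regions)
    return sensitivity, ppv, misclassification

# Hypothetical example: 5 true districts, 3 detected, 2 of them correct.
sens, ppv, mis = detection_accuracy({1, 2, 3, 4, 5}, {4, 5, 6})
print(f"sensitivity={sens:.2f}, PPV={ppv:.2f}, misclassification={mis:.2f}")
```

Averaging these values over the simulated datasets gives the summary figures compared between the optimized and default MRCS settings.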
Results
Simulation study results
Tables 2, 3, 4, and 5 present the simulation results for cluster model (B). The other results are provided in Additional file 1. For cluster models (A), (B), (D), and (E), all five methods most often selected the optimal MRCS value equal to the size of the true cluster from the 17 candidate MRCS values, regardless of the alternative hypothesis scenario. For cluster model (C), with an irregular-shaped cluster, all five methods most often chose an optimal MRCS value of 12%, which is smaller than the size of the true cluster (30%), irrespective of the alternative hypothesis scenario. When using the optimal MRCS value instead of the default setting, the methods tend to report multiple informative smaller clusters instead of reporting a single larger cluster that contains the true irregular cluster.
The proposed methods consistently exhibited higher sensitivity and positive predictive value (PPV) at the most frequently selected MRCS value than the default setting. Additionally, the rate of misclassification was much lower. The overall sensitivity of the proposed methods was slightly lower than that of the default setting. However, the overall PPV was higher than that of the default setting. Across all scenarios, it appears that all five methods yielded similar overall detection accuracy in terms of sensitivity, PPV, and misclassification. The overall sensitivity of \({SCIC}_{1}\) was comparable to \({SCIC}_{2}\), while the overall PPV of \({SCIC}_{1}\) was slightly higher than that of \({SCIC}_{2}\).
The simulation results for the ordinal model are provided in Additional file 2 (Tables A23–A48). For the ordinal model, the proposed methods and the other three methods showed trends similar to those observed for the multinomial model. The sensitivity and PPV of \({SCIC}_{1}\) and \({SCIC}_{2}\) at the most often selected MRCS value were higher than those of the default setting. The overall PPV of the proposed methods was higher than that of the default setting, while the sensitivity was comparable. Additionally, the misclassification rate was consistently lower. We noticed that the overall sensitivity of the \({SCIC}_{2}\) was slightly higher than that of the \({SCIC}_{1}\) in cluster models (D) and (E), which involve two clusters. The Gini coefficient exhibited higher sensitivity and PPV, and lower misclassification at the most often chosen MRCS value, but its overall performance was quite similar to that of the default setting.
Application to Korea Community Health Survey data
We used the Korea Community Health Survey (KCHS) data to illustrate the usefulness of the proposed method. The KCHS is an annual survey conducted by the Korea Disease Control and Prevention Agency since 2008 to gather community-based health statistics. This survey was carried out across 253 community health centers, covering various aspects such as health behaviors, self-reported health indicators, and demographic characteristics. For our analysis, we used the ‘reason for starting to drink’ as the nominal categorical variable from the 2019 KCHS data. Subjects who had never consumed alcohol were excluded. The ‘reason for starting to drink’ was categorized into four groups: (1) recommended by people, (2) out of curiosity, (3) to promote friendship, and (4) other reasons. It would be valuable to examine the spatial autocorrelation to assess whether this outcome variable exhibits inherent spatial dependency. However, based on the literature search conducted thus far, it seems that there is no established method for calculating spatial autocorrelation in the context of multinomial data. The results of the spatial cluster detection analysis might provide insights into spatial autocorrelation. Using the multinomial-based spatial scan statistic with elliptical windows, we searched for regions in Seoul and Gyeonggi Province that exhibited distinct distributions of the ‘reason for starting to drink’ among males in their 20s and 30s.
The reported clusters differed depending on the method used to optimize the MRCS value. Figure 3 shows a map of the significant spatial clusters reported by each method. A summary of those clusters is presented in Table 6. The \({SCIC}_{1}\) and \({SCIC}_{2}\) methods selected an optimal MRCS of 10%, which is smaller than the default setting. When using the default setting, three large clusters were reported. In contrast, the proposed methods identified six smaller clusters that seem to carry more meaningful information. Cluster 1 reported using the SCICs lies within cluster 1 reported using the default setting. Similarly, cluster 2 reported using the SCICs lies within cluster 2 reported using the default setting. Clusters 3, 4, and 5 reported using the SCICs lie within cluster 3 reported using the default setting. The proposed methods seemed to reveal more meaningful smaller clusters that were not identified by the default setting. It is worth noting that cluster 4 reported using the SCICs was a hidden smaller cluster with the highest relative risk (RR) in category 3, rather than in category 1 as in cluster 3 identified under the default setting. Additionally, the proposed methods reported another region as cluster 6, which went unnoticed by the default setting.
The Elbow method selected 4% as the optimal MRCS, while the MCSP and MCHSP selected 2% as optimal. These three methods identified clusters that either consisted of smaller clusters within the clusters detected by the default setting, smaller clusters partially overlapping with the default clusters, or smaller clusters in entirely new regions without any overlap with the default clusters. Those clusters could provide more informative and interpretable results compared to those identified using the default setting. However, the clusters obtained using these methods are primarily composed of very small clusters consisting of only one or two regions. Particularly when using the MCHSP method, it might be difficult to consider them as clusters since some reported clusters consisting of one region are remote and not adjacent to other clusters.
Discussion and conclusion
To select the optimal MRCS value when using the spatial scan statistic, several optimization criteria have been developed, such as the Gini coefficient [17, 19,20,21], MCSP [23], MCHSP [24], and Elbow method [22]. However, the Gini coefficient for the multinomial model has not been developed. The other optimization criteria (i.e., MCSP, MCHSP, and Elbow method) have been developed and evaluated only for the Poisson model. Thus, we have proposed the SCIC to choose the optimal MRCS value for the multinomial-based spatial scan statistic.
We have evaluated the performance of the proposed methods through an extensive simulation study. Particularly, in the scenarios with the two heterogeneous clusters, we observed consistent and robust results for both the multinomial and ordinal models: (1) the SCICs mostly selected the MRCS value that matched the size of the true cluster as the optimal MRCS, and (2) the detection accuracy achieved at the optimal MRCS using SCICs outperformed the results obtained with the default setting. We have also evaluated the performance of the existing methods by adapting them to the multinomial model. The overall detection accuracy obtained using the proposed methods was comparable to that of other existing methods. This might be because these methods are all defined based on the likelihood. While the sensitivity of the proposed methods at the selected optimal MRCS value was higher than the default setting, the overall sensitivity was slightly lower. This could be considered a limitation of our method, as it suggests the potential for missing certain regions of true clusters in some situations. However, this trend was observed across all evaluated methods.
Despite delivering comparable performance, the existing methods have certain limitations. The Gini coefficient cannot be applied to the multinomial model. The Elbow method assumes that the sum of the LRT statistic for significant clusters monotonically increases as the MRCS values increase. However, in certain cases, multiple significant clusters may be reported at small MRCS values, causing the sum of the LRT statistic to initially increase and then decrease. As a result, identifying the proper elbow point becomes challenging. The MCSP and MCHSP methods require distinct definitions of the union log-likelihood ratio test statistic for each probability model. Additionally, the MCHSP method suffers from a lengthy computation time due to the necessity of calculating the spatial contiguity matrix.
We have introduced the SCICs for the multinomial model, which can be easily extended to all probability models based on likelihood. These criteria offer computational efficiency as they directly calculate the criteria without requiring any modification of the test statistics. Consequently, we propose that utilizing the SCICs when selecting the optimal MRCS for the multinomial and ordinal-based spatial scan statistics would be beneficial. By employing the SCICs, we anticipate identifying more meaningful and interpretable clusters compared to using the default setting.
Between the two versions of the SCICs, we find that the \({SCIC}_{1}\) appears more appropriate as it includes information on the number of cases in addition to the regional information. Through simulation results of the multinomial model, we observed that the \({SCIC}_{1}\) outperformed the \({SCIC}_{2}\) in terms of PPV. However, in the simulation results of the ordinal model, both the overall sensitivity and PPV were comparable between the \({SCIC}_{1}\) and \({SCIC}_{2}\) in the single cluster setting. In the two clusters setting, the overall sensitivity of \({SCIC}_{2}\) was slightly higher than that of \({SCIC}_{1}\). Nevertheless, the differences in overall sensitivity between the \({SCIC}_{1}\) and \({SCIC}_{2}\) were minimal and not deemed significant.
In summary, we propose a novel approach to optimizing the MRCS value for the multinomial-based spatial scan statistic. Compared to the default setting, our SCIC measures improve the accuracy of reported clusters. The SCIC measures also have the advantage over existing measures of being easily extended to other probability models. In public health and disease surveillance, our approach has the potential to enhance spatial cluster detection by providing greater accuracy and meaningful insights.
Availability of data and materials
The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request.
Abbreviations
BIC: Bayes information criterion
KCHS: Korea Community Health Survey
LRT: Likelihood ratio test
LLR: Log-likelihood ratio
MCSP: Maximum clustering set-proportion
MRCS: Maximum reported cluster size
MSWS: Maximum scanning window size
PPV: Positive predictive value
SCIC: Spatial cluster information criterion
References
Kulldorff M. A spatial scan statistic. Commun Stat Theory Methods. 1997;26(6):1481–96.
Cook AJ, Gold DR, Li Y. Spatial cluster detection for censored outcome data. Biometrics. 2007;63(2):540–9.
Jung I, Kulldorff M, Klassen AC. A spatial scan statistic for ordinal data. Stat Med. 2007;26(7):1594–607.
Kulldorff M, Huang L, Konty K. A scan statistic for continuous data based on the normal probability model. Int J Health Geogr. 2009;8:58.
Huang L, Tiwari RC, Zou Z, Kulldorff M, Feuer EJ. Weighted normal spatial scan statistic for heterogeneous population data. J Am Stat Assoc. 2009;104(487):886–98.
Jung I, Kulldorff M, Richard OJ. A spatial scan statistic for multinomial data. Stat Med. 2010;29(18):1910.
Mai G, Janowicz K, Hu Y, Gao S. ADCN: an anisotropic density-based clustering algorithm for discovering spatial point patterns with noise. Trans GIS. 2018;22:348–69.
Kang Y, Wu K, Gao S, Ng I, Rao J, Ye S, Zhang F, Fei T. STICC: a multivariate spatial clustering method for repeated geographic pattern discovery with consideration of spatial contiguity. Int J Geogr Inf Sci. 2022;36(8):1518–49.
Knox. Detection of clusters. In: Elliott P, editor. Methodologies of Enquiry into Disease Clustering. Wembley: Small Area Health Statistics Unit; 1989. p. 17–22.
Hu Y, Gao S, Janowicz K, Yu B, Li W, Prasad S. Extracting and understanding urban areas of interest using geotagged photos. Comput Environ Urban Syst. 2015;54:240–54.
Damiani ML, Issa H, Fotino G, Heurich M, Cagnacci F. Introducing presence and stationarity index to study partial migration patterns: an application of a spatiotemporal clustering technique. Int J Geogr Inf Sci. 2016;30(5):907–28.
Huang Q. Mining online footprints to predict user's next location. Int J Geogr Inf Sci. 2017;31:523–41.
Gruebner O, Lowe S, Tracy M, Joshi S, Cerdá M, Norris F, Subramanian S, Galea S. Mapping concentrations of posttraumatic stress and depression trajectories following Hurricane Ike. Sci Rep. 2016;6:32242.
Cordes J, Castro MC. Spatial analysis of COVID-19 clusters and contextual factors in New York City. Spat Spatiotemporal Epidemiol. 2020;34:100355.
Richards Steed R, Bakian AV, Smith KR, Wan N, Brewer S, Medina R, VanDerslice J. Evidence of transgenerational effects on autism spectrum disorder using multigenerational space-time cluster detection. Int J Health Geogr. 2022;21:13.
Ribeiro SHR, Costa MA. Optimal selection of the spatial scan parameters for cluster detection: a simulation study. Spat Spatiotemporal Epidemiol. 2012;3(2):107–20.
Han J, Zhu L, Kulldorff M, Hostovich S, Stinchcomb DG, Tatalovich Z, Lewis DR, Feuer EJ. Using Gini coefficient to determining optimal cluster reporting sizes for spatial scan statistics. Int J Health Geogr. 2016;15:27.
Gini C. Variabilità e mutabilità. Reprinted in Memorie di metodologica statistica (Ed. Pizetti E, Salvemini T). Rome: Libreria Eredi Virgilio Veschi; 1912.
Kim S, Jung I. Optimizing the maximum reported cluster size in the spatial scan statistic for ordinal data. PLoS ONE. 2017;12:e0182234.
Yoo H, Jung I. Optimizing the maximum reported cluster size for normal-based spatial scan statistics. Commun Stat Appl Methods. 2018;25:373–83.
Lee S, Moon J, Jung I. Optimizing the maximum reported cluster size in the spatial scan statistic for survival data. Int J Health Geogr. 2021;20:33.
Meysami M, French JP, Lipner EM. Estimating the optimal population upper bound for scan methods in retrospective disease surveillance. Biom J. 2021;63:1633–51.
Ma Y, Yin F, Zhang T, Zhou XA, Li X. Selection of the maximum spatial cluster size of the spatial scan statistic by using the maximum clustering set-proportion statistic. PLoS ONE. 2017;11(1):e0147918.
Wang W, Zhang T, Yin F, Xiao X, Chen S, Zhang X, Li X, Ma Y. Using the maximum clustering heterogeneous set-proportion to select the maximum window size for the spatial scan statistic. Sci Rep. 2020;10:4900.
Schwarz G. Estimating the dimension of a model. Ann Stat. 1978;6:461–4.
Neath AA, Cavanaugh JE. The Bayesian information criterion: background, derivation, and applications. WIRE Comput Stat. 2012;4:199–203.
Tango T, Takahashi K. A flexibly shaped spatial scan statistic for detecting clusters. Int J Health Geogr. 2005;4:11.
Tango T. A test for spatial disease clustering adjusted for multiple testing. Stat Med. 2000;19:191–204.
Tango T. Spatial scan statistics can be dangerous. Stat Methods Med Res. 2021;30(1):75–86.
Kodinariya TM, Makwana PR. Review on determining number of cluster in k-means clustering. Int J. 2013;1(6):90–5.
Delgado H, Anguera X, Fredouille C, Serrano J. Novel clustering selection criterion for fast binary key speaker diarization. INTERSPEECH. 2015. p. 3091–5.
Kulldorff M, Huang L, Pickle L, Duczmal L. An elliptic spatial scan statistic. Stat Med. 2006;25:3929–43.
Costa MA, Assunção RM, Kulldorff M. Constrained spanning tree algorithms for irregularly-shaped spatial clustering. Comput Stat Data Anal. 2012;56:1771–83.
Kleinman K. rsatscan: Tools, classes, and methods for interfacing with SaTScan stand-alone software. 2015. https://CRAN.R-project.org/package=rsatscan/.
Acknowledgements
Not applicable.
Funding
Not applicable.
Author information
Contributions
IJ conceived the study. JM and MK conducted the simulations and analyzed the data. JM drafted the manuscript. All authors read and approved the final manuscript.
Ethics declarations
Ethics approval and consent to participate
This study was approved by the SNU Research Ethics Team (IRB No. E1912/001010).
Consent for publication
Not applicable.
Competing interests
The authors declare no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Additional file 1.
Simulation results for multinomial model (A1–A22).
Additional file 2.
Simulation results for ordinal model (A23–A48).
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Moon, J., Kim, M. & Jung, I. Optimizing the maximum reported cluster size for the multinomial-based spatial scan statistic. Int J Health Geogr 22, 30 (2023). https://doi.org/10.1186/s12942-023-00353-4
DOI: https://doi.org/10.1186/s12942-023-00353-4