
Optimizing the maximum reported cluster size for the multinomial-based spatial scan statistic

Abstract

Background

Correctly identifying spatial disease clusters is a fundamental concern in public health and epidemiology. The spatial scan statistic is widely used for detecting spatial disease clusters in spatial epidemiology and disease surveillance. Many studies default to a maximum reported cluster size (MRCS) set at 50% of the total population when searching for spatial clusters. However, this default setting can sometimes report clusters that are larger than the true clusters and therefore include less relevant regions. For the Poisson, Bernoulli, ordinal, normal, and exponential models, a Gini coefficient has been developed to optimize the MRCS. Yet, no such measure is available for the multinomial model.

Results

We propose two versions of a spatial cluster information criterion (SCIC) for selecting the optimal MRCS value for the multinomial-based spatial scan statistic. Our simulation study suggests that SCIC improves the accuracy of reporting true clusters. Analysis of the Korea Community Health Survey (KCHS) data further demonstrates that our method identifies more meaningful small clusters compared to the default setting.

Conclusions

Our method focuses on improving the performance of the spatial scan statistic by optimizing the MRCS value when using the multinomial model. In public health and disease surveillance, the proposed method can be used to provide more accurate and meaningful spatial cluster detection for multinomial data, such as disease subtypes.

Introduction

In public health and disease surveillance, the spatial scan statistic is a widely used method for identifying spatial clusters with significantly high or low risk of disease outcomes. This method is based on a likelihood ratio test statistic that compares the inside of each scanning window with its outside. The scanning window that maximizes the test statistic is identified as the most likely cluster. Secondary clusters with high values of the test statistic are also identified. The statistical significance of the most likely cluster and secondary clusters is determined using Monte Carlo hypothesis testing. The spatial scan statistic has been developed for various probability models such as Poisson [1], Bernoulli [1], exponential [2], ordinal [3], normal [4, 5], and multinomial [6]. SaTScan™ software is freely available for conducting spatial cluster detection analysis using various models of the spatial scan statistic.

The spatial scan statistic differs from spatial clustering methods such as ADCN [7] and STICC [8] in that it is designed for identifying clusters rather than dividing spatial data into distinct subgroups. A cluster is defined as a geographically and/or temporally bounded group of occurrences of sufficient size and concentration to be unlikely to have occurred by chance [9]. The clusters are characterized by the statistical distribution of the outcome, not just by the distance between geographic objects as in density-based clustering. Spatial clustering methods are commonly used in geodata mining [10,11,12], while the spatial scan statistic is widely utilized for detecting geographic disease clusters [13,14,15].

In SaTScan™, researchers are required to specify the scanning window shape and the maximum scanning window size (MSWS). In many studies, the MSWS value is set to the default setting, which is 50% of the total population. A simulation study by Ribeiro and Costa [16] revealed that spatial cluster detection results can vary depending on the MSWS value. However, this does not mean that the analysis should be run multiple times with different MSWS values to find the best result, because doing so may lead to a multiple testing problem, as argued by Han et al. [17]. Han et al. proposed an alternative approach in which the analysis is run with a fixed large MSWS value while the maximum reported cluster size (MRCS) is varied. Setting the MRCS value to the default 50% may result in the reporting of clusters larger than the true clusters, encompassing less meaningful regions. Therefore, it is advisable to carefully select an optimal MRCS value.

Several studies have recently developed criteria to select the optimal value of the MRCS. Han et al. [17] proposed an optimization criterion using the Gini coefficient [18] specifically for the Poisson-based spatial scan statistic. Their simulation study showed that the proposed Gini coefficient effectively identified the correct clusters. However, it is important to note that the Gini coefficient needs to be defined differently for different probability models. Kim and Jung [19], Yoo and Jung [20], and Lee et al. [21] developed the Gini coefficient for the ordinal-, normal-, and exponential-based spatial scan statistics, respectively. Yet, no Gini coefficient has been developed for the multinomial-based spatial scan statistic. The difficulty in defining a clear Gini coefficient for the multinomial-based spatial scan statistic arises from its inapplicability to nominal values.

Other studies [22,23,24] have proposed alternative criteria for selecting the optimal MRCS or MSWS. However, these studies only evaluated the performance of their methods for the Poisson-based spatial scan statistic. Because the methods are likelihood-based optimization criteria, they can potentially be extended to other probability models. Nevertheless, it remains crucial to carefully evaluate the effectiveness of these methods when applied to probability models other than the Poisson model.

In this study, we propose a spatial cluster information criterion (SCIC) inspired by the formulation of the Bayes Information Criterion (BIC) [25] to choose the optimal MRCS value for the multinomial-based spatial scan statistic. The SCIC can be defined for the spatial scan statistic irrespective of the underlying probability model, as its approach is rooted in the likelihood ratio test statistic. To assess the performance of our proposed method, we conducted a simulation study for both the multinomial-based and ordinal-based spatial scan statistics. We compared the performance of our proposed method with that of existing approaches. To exemplify the methodology, we utilized the Korea Community Health Survey (KCHS) data collected by the Korea Centers for Disease Control and Prevention.

Methods

Spatial scan statistic for multinomial data

The multinomial-based spatial scan statistic [6] is used to detect disease clusters with statistically different disease-type distributions. Let \({p}_{k}\) and \({q}_{k}\) denote the probabilities of category \(k\) inside and outside the scanning window \(z\), respectively. If we want to identify regions with different disease-type distributions, the null and alternative hypotheses are stated as

$${H}_{0}: {p}_{1}={q}_{1}, \ldots, {p}_{K}={q}_{K}\;\text{for all}\;z\in Z\quad \text{vs.}\quad {H}_{1}: \text{not}\;{H}_{0}$$

where \(Z\) denotes the set of all scanning windows and \(K\) denotes the total number of categories. The likelihood ratio test statistic, given the scanning window \(z\), is denoted as

$${\lambda }_{z}=\frac{\prod _{k}\left\{{\left(\frac{\sum _{i\in z}{c}_{ik}}{\sum _{k}\sum _{i\in z}{c}_{ik}}\right)}^{\sum _{i\in z}{c}_{ik}}\cdot {\left(\frac{\sum _{i\notin z}{c}_{ik}}{\sum _{k}\sum _{i\notin z}{c}_{ik}}\right)}^{\sum _{i\notin z}{c}_{ik}}\right\}}{\prod _{k}\left\{{\left(\frac{{C}_{k}}{C}\right)}^{{C}_{k}}\right\}}$$

where \({c}_{ik}\) is the number of cases belonging to category \(k\) inside the region \(i\), \({C}_{k}\) is the total number of cases belonging to category \(k\) in the whole study area and \(C\) is the total number of cases in the whole study area.
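
As a minimal sketch (not the SaTScan™ implementation), the logarithm of \({\lambda }_{z}\) can be computed directly from a region-by-category count matrix. The function name `multinomial_llr`, the array layout, and the convention \(0\cdot log\left(0\right)=0\) are assumptions made for illustration only.

```python
import numpy as np

def multinomial_llr(counts, in_window):
    """Return log(lambda_z) for one scanning window z.

    counts    : (n_regions, K) array with counts[i, k] = c_ik
    in_window : boolean array, True where region i belongs to z
    """
    counts = np.asarray(counts, dtype=float)
    in_window = np.asarray(in_window, dtype=bool)

    inside = counts[in_window].sum(axis=0)    # sum over i in z of c_ik
    outside = counts[~in_window].sum(axis=0)  # sum over i not in z of c_ik
    total = inside + outside                  # C_k

    def loglik(n_k):
        # sum_k n_k * log(n_k / n), skipping empty categories (assumed 0*log 0 = 0)
        nz = n_k > 0
        return float(np.sum(n_k[nz] * np.log(n_k[nz] / n_k.sum())))

    # log numerator (inside + outside) minus log denominator (whole study area)
    return loglik(inside) + loglik(outside) - loglik(total)
```

When the category distribution inside the window equals the distribution outside, the function returns 0; the value grows as the two distributions diverge.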

Spatial cluster information criterion (SCIC)

Now we propose an optimization criterion called the spatial cluster information criterion (SCIC) for selecting the optimal MRCS value. Our criterion draws inspiration from the formulation of the Bayes information criterion (BIC) [25], which is a widely used criterion in statistical modeling for model selection. The BIC for a candidate model \({M}_{u}\) is defined as

$$BIC\left({M}_{u}\right)=-2\cdot logL\left(\widehat{{\theta }_{u}}|y\right)+u\cdot log\left(v\right),$$

where \(y\) is the observed data, \(L\left({\theta }_{u}|y\right)\) is the likelihood of \(y\) given the model \({M}_{u}\), \(\widehat{{\theta }_{u}}\) is the maximum likelihood estimate (MLE) of \({\theta }_{u}\), that is, the value that maximizes \(L\left({\theta }_{u}|y\right)\), \(u\) is the number of parameters in the model \({M}_{u}\), and \(v\) is the total number of observations. The second component of the BIC is a penalty term, which penalizes models with additional parameters. The model with the minimum BIC value is considered the most appropriate selection [26].

We define the SCIC from the sum of the log-likelihood ratio (LLR) test statistics of all significant clusters, combined with a penalty term. In the multinomial-based spatial scan statistic, the LLR test statistic for each scanning window measures the degree of heterogeneity in the spatial distribution of the categories. A higher LLR test statistic indicates a greater difference between the category distribution inside the scanning window and that of the surrounding area. However, as the scanning window size increases, the LLR test statistic tends to rise because more cases are included within the window.

The spatial scan statistic has faced criticism for its tendency to identify clusters that are considerably larger than the actual clusters, often incorporating neighboring regions with no elevated risk of disease occurrence [27,28,29]. This tendency is mainly noticeable when the default settings of MSWS and MRCS, both set at 50%, are used with circular scanning windows. Optimizing the MRCS improves the spatial scan statistic’s ability to identify clusters with greater precision [17, 19,20,21]. To utilize the sum of the LRT statistics as an optimizing criterion, we need to offset the inflation of the test statistic due to a large number of observations within the window.

The penalty term in the SCIC is defined in two versions. In the first version, the penalty term is calculated by multiplying the logarithm of the number of cases within the significant clusters by the product of the number of categories and the number of significant clusters. In the second version, we substitute the number of regions inside the significant clusters for the number of cases. This is based on the understanding that the number of cases within a cluster tends to increase as the number of regions inside the cluster increases. Both versions serve as optimization criteria with similar implications. For the multinomial model, the algorithm for computing the SCIC is as follows:

  • (Step 1) For a given MRCS \(m\)% (\(m\)=1, …, 50), denote \({J}_{m}\) significant clusters reported using the multinomial-based spatial scan statistic by \({Z}_{1}^{\left(m\right)}, \cdots , {Z}_{{J}_{m}}^{\left(m\right)}\).

  • (Step 2) For each \(m\), calculate the SCIC for all significant clusters as follows:

    $${SCIC}_{1}\left(m\right)=-2\sum _{j=1}^{{J}_{m}}log\left({\lambda }_{{Z}_{j}^{\left(m\right)}}\right)+K\cdot {J}_{m}\cdot log\left({\tau }^{\left(m\right)}\right)$$
    (Version 1)
    $${SCIC}_{2}\left(m\right)=-2\sum _{j=1}^{{J}_{m}}log\left({\lambda }_{{Z}_{j}^{\left(m\right)}}\right)+K\cdot {J}_{m}\cdot log\left({\delta }^{\left(m\right)}\right)$$
    (Version 2)

    where \({\lambda }_{{Z}_{j}^{\left(m\right)}}\) denotes the LRT statistic for the multinomial-based spatial scan statistic given the \({j}^{th}\) significant cluster \({Z}_{j}^{\left(m\right)}\), \(K\) is the total number of categories, and \({\tau }^{\left(m\right)}\) and \({\delta }^{\left(m\right)}\) denote the sum of the number of total cases and the sum of the number of regions inside all significant clusters, respectively.

  • (Step 3) Choose the MRCS which minimizes the SCIC as the optimal MRCS.
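
The three steps above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the names `scic` and `select_mrcs` and the layout of `results` are hypothetical, the cluster summaries are assumed to have been extracted from a scan statistic run beforehand, and candidate MRCS values with no significant cluster are simply skipped, a case the steps above do not address.

```python
import math

def scic(llrs, K, size):
    """SCIC for one MRCS value.

    llrs : log(lambda) of each significant cluster reported at this MRCS
    K    : number of categories
    size : tau (total cases, version 1) or delta (number of regions,
           version 2) inside all significant clusters
    """
    J = len(llrs)  # number of significant clusters, J_m
    return -2.0 * sum(llrs) + K * J * math.log(size)

def select_mrcs(results, K, version=1):
    """Pick the MRCS that minimizes the SCIC.

    results : dict mapping each candidate MRCS m to a dict with keys
              'llrs', 'cases' and 'regions' (hypothetical layout)
    """
    key = "cases" if version == 1 else "regions"
    # Assumption: candidates without any significant cluster are skipped.
    scores = {m: scic(r["llrs"], K, r[key])
              for m, r in results.items() if r["llrs"]}
    best = min(scores, key=scores.get)
    return best, scores

# Made-up cluster summaries for two candidate MRCS values (illustration only).
example = {
    5:  {"llrs": [30.0], "cases": 80,  "regions": 5},
    50: {"llrs": [32.0], "cases": 400, "regions": 30},
}
best_v1, scores_v1 = select_mrcs(example, K=4, version=1)
best_v2, scores_v2 = select_mrcs(example, K=4, version=2)
```

In this made-up example the larger cluster gains only a little in LLR but pays a larger penalty, so both versions prefer the 5% candidate.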

Figure 1 illustrates the flowchart of the proposed method.

Fig. 1 The flowchart of the proposed method

Elbow method, MCS-P, and MCHS-P

For the Poisson-based spatial scan statistic, optimization criteria such as the elbow method [22], the maximum clustering set–proportion (MCS-P) [23], and the maximum clustering heterogeneous set-proportion (MCHS-P) [24] have been proposed to determine the optimal value of MRCS or MSWS. Since these methods are likelihood-based optimization criteria, we have adapted them to the multinomial model in order to evaluate and compare their performance with our proposed approaches. The procedure follows the same steps as for the SCICs; the only difference is the measure being calculated. It is important to emphasize that the MRCS, not the MSWS, should be optimized to avoid the multiple testing problem, as noted by Han et al. [17].

The elbow method [30] is commonly employed in unsupervised learning to determine the optimal number of clusters by identifying the elbow point. In the context of selecting the optimal MRCS value, Meysami et al. [22] proposed an optimization criterion for the Poisson model by adopting the method for finding the optimal elbow point as suggested by Delgado et al. [31]. We employ the method for the multinomial model by calculating the negative sum of the likelihood ratio test (LRT) statistic values over all \({J}_{m}\) significant clusters for each \(m\) as

$$-LRT\left(m\right)=-\sum _{j=1}^{{J}_{m}}{\lambda }_{{Z}_{j}^{\left(m\right)}}$$

where \({\lambda }_{{Z}_{j}^{\left(m\right)}}\) denotes the LRT statistics value for the \({j}{\text{th}}\) significant cluster \({Z}_{j}^{\left(m\right)}\) (\(j\)= 1, …, \({J}_{m}\)). If no significant cluster is present, use the maximum LRT statistic. The elbow plot is constructed by connecting the points (\(m, -LRT\left(m\right)\)) for \(m\)= 1, …, 50. For each \(m\), we calculate the orthogonal distance between each point (\(m, -LRT(m)\)) and the line connecting the first and last points. The optimal MRCS is the one that maximizes this orthogonal distance.
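
The orthogonal-distance rule described above can be sketched as follows. The function name `elbow_mrcs` is hypothetical, and the sketch assumes the raw (unscaled) axes are used; whether the cited method rescales the axes first is not stated here.

```python
import numpy as np

def elbow_mrcs(m_values, neg_lrt):
    """Return the MRCS whose point (m, -LRT(m)) lies farthest from the
    straight line joining the first and last points, plus all distances."""
    x = np.asarray(m_values, dtype=float)
    y = np.asarray(neg_lrt, dtype=float)
    dx, dy = x[-1] - x[0], y[-1] - y[0]
    # Point-to-line distance: |dy*(x - x0) - dx*(y - y0)| / sqrt(dx^2 + dy^2)
    dist = np.abs(dy * (x - x[0]) - dx * (y - y[0])) / np.hypot(dx, dy)
    return m_values[int(np.argmax(dist))], dist

# Made-up curve that drops sharply and then flattens (illustration only).
best_m, distances = elbow_mrcs([1, 2, 3, 4, 5], [0.0, -8.0, -9.0, -9.5, -10.0])
```

For this made-up curve the farthest point is at \(m=2\), where the curve bends; both end points have distance zero by construction.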

Ma et al. [23] proposed the maximum clustering set–proportion (MCS-P) as an optimization criterion to determine the optimal value of the MSWS for the Poisson-based spatial scan statistic. This criterion assumes that all identified significant clusters are homogeneous clusters with the same relative risks. However, considering the issue of multiple testing, analyzing the data multiple times with different MSWS values to select the best result might not be appropriate. In our study, we adapt the MCS-P criterion to the multinomial model and utilize it to select the optimal MRCS, while keeping the MSWS value fixed at 50%. To apply the MCS-P to the multinomial model, we first define the union cluster set \({Z}_{A}^{\left(m\right)}\) by merging all \({J}_{m}\) clusters for each \(m\) as

$${Z}_{A}^{\left(m\right)}={\bigcup }_{j=1}^{{J}_{m}}{Z}_{j}^{\left(m\right)}$$

where \({Z}_{j}^{\left(m\right)}\) is the \({j}{\text{th}}\) detected significant cluster (\(j\)= 1, …, \({J}_{m}\)). Then, we calculate the union log-likelihood ratio (LLR) test statistic \(log{\lambda }_{{Z}_{A}^{\left(m\right)}}\) given the union cluster set \({Z}_{A}^{\left(m\right)}\) as

$$log{\lambda }_{{Z}_{A}^{\left(m\right)}}=\sum _{k}\left\{\sum _{i\in {Z}_{A}^{\left(m\right)}}{c}_{ik}\cdot log\left(\frac{\sum _{i\in {Z}_{A}^{\left(m\right)}}{c}_{ik}}{\sum _{i\in {Z}_{A}^{\left(m\right)}}{c}_{i}}\right)+\left({C}_{k}-\sum _{i\in {Z}_{A}^{\left(m\right)}}{c}_{ik}\right)\cdot log\left(\frac{{C}_{k}-\sum _{i\in {Z}_{A}^{\left(m\right)}}{c}_{ik}}{C-\sum _{i\in {Z}_{A}^{\left(m\right)}}{c}_{i}}\right)\right\}-\sum _{k}{C}_{k}\cdot log\left(\frac{{C}_{k}}{C}\right)$$

where \({c}_{ik}\), \({C}_{k}\), and \(C\) were as defined previously and \({c}_{i}\) is the number of cases inside the region \(i\). The optimal MRCS is the one that maximizes the union LLR test statistic \(log{\lambda }_{{Z}_{A}^{\left(m\right)}}\).
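
A minimal sketch of the union LLR follows, assuming each reported cluster is given as a set of region indices into a region-by-category count matrix. The name `union_llr` and this data layout are hypothetical. The null term is subtracted, following the definition of \({\lambda }_{z}\) as a ratio; because it is constant across \(m\), it does not affect which MRCS maximizes the criterion.

```python
import numpy as np

def _loglik(n_k):
    # sum_k n_k * log(n_k / n), skipping empty categories (assumed 0*log 0 = 0)
    n_k = np.asarray(n_k, dtype=float)
    nz = n_k > 0
    return float(np.sum(n_k[nz] * np.log(n_k[nz] / n_k.sum())))

def union_llr(counts, clusters):
    """Union LLR for MCS-P: all significant clusters are merged into one set.

    counts   : (n_regions, K) array with counts[i, k] = c_ik
    clusters : iterable of sets of region indices (one set per cluster)
    """
    counts = np.asarray(counts, dtype=float)
    union = set().union(*clusters)                  # Z_A
    mask = np.zeros(len(counts), dtype=bool)
    mask[np.array(sorted(union), dtype=int)] = True
    inside = counts[mask].sum(axis=0)
    outside = counts[~mask].sum(axis=0)
    return _loglik(inside) + _loglik(outside) - _loglik(inside + outside)
```

Overlapping clusters contribute each region only once, since the union is taken over region indices before any counts are summed.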

Considering the possibility of detected significant clusters being heterogeneous with varying relative risks, Wang et al. [24] introduced the maximum clustering heterogeneous set-proportion (MCHS-P) as an optimization criterion to determine the optimal value of the MSWS. As with the MCS-P, we adapt the MCHS-P criterion to the multinomial model and use it to select the optimal MRCS, while maintaining a fixed MSWS value of 50%. For each \(m\), we define the heterogeneous cluster set \({Z}_{B}^{\left(m\right)}\) by merging \({J}_{m}\) detected significant clusters into \({W}_{m} ({W}_{m}\le {J}_{m})\) merged clusters according to their spatial contiguity:

$${Z}_{B}^{\left(m\right)}=\left\{{Z}_{{B}_{1}}^{\left(m\right)}, {\ldots , Z}_{{B}_{{W}_{m}}}^{\left(m\right)}\right\}$$

Then we calculate the union LLR test statistic \(log{\lambda }_{{Z}_{B}^{\left(m\right)}}\) given the heterogeneous cluster set \({Z}_{B}^{\left(m\right)}\) as

$$log{\lambda }_{{Z}_{B}^{\left(m\right)}}=\sum _{k}\left\{\sum _{i\in {Z}_{{B}_{1}}^{\left(m\right)}}{c}_{ik}\cdot log\left(\frac{\sum _{i\in {Z}_{{B}_{1}}^{\left(m\right)}}{c}_{ik}}{\sum _{i\in {Z}_{{B}_{1}}^{\left(m\right)}}{c}_{i}}\right)+\cdots +\sum _{i\in {Z}_{{B}_{{W}_{m}}}^{\left(m\right)}}{c}_{ik}\cdot log\left(\frac{\sum _{i\in {Z}_{{B}_{{W}_{m}}}^{\left(m\right)}}{c}_{ik}}{\sum _{i\in {Z}_{{B}_{{W}_{m}}}^{\left(m\right)}}{c}_{i}}\right)+\left({C}_{k}-\sum _{i\in {Z}_{B}^{\left(m\right)}}{c}_{ik}\right)\cdot log\left(\frac{{C}_{k}-\sum _{i\in {Z}_{B}^{\left(m\right)}}{c}_{ik}}{C-\sum _{i\in {Z}_{B}^{\left(m\right)}}{c}_{i}}\right)\right\}-\sum _{k}{C}_{k}\cdot log\left(\frac{{C}_{k}}{C}\right)$$

The optimal MRCS is the one that maximizes the union LLR test statistic \(log{\lambda }_{{Z}_{B}^{\left(m\right)}}\).
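
The merging and the heterogeneous LLR can be sketched as follows. The names `merge_contiguous` and `mchs_llr`, and the representation of contiguity as a symmetric dictionary `neighbors` mapping each region to its adjacent regions, are assumptions for illustration; treating two clusters as contiguous when they overlap or contain adjacent regions is likewise an assumed reading of "spatial contiguity". The null term is subtracted, following the definition of \({\lambda }_{z}\) as a ratio.

```python
import numpy as np

def _loglik(n_k):
    # sum_k n_k * log(n_k / n), skipping empty categories (assumed 0*log 0 = 0)
    n_k = np.asarray(n_k, dtype=float)
    nz = n_k > 0
    return float(np.sum(n_k[nz] * np.log(n_k[nz] / n_k.sum())))

def merge_contiguous(clusters, neighbors):
    """Merge clusters that overlap or touch into W_m <= J_m groups."""
    merged = []
    for c in map(set, clusters):
        # Existing groups sharing a region with c or adjacent to one of its regions
        touching = [g for g in merged
                    if g & c or any(neighbors.get(i, set()) & g for i in c)]
        for g in touching:
            merged.remove(g)
            c |= g
        merged.append(c)
    return merged

def mchs_llr(counts, clusters, neighbors):
    """Heterogeneous-set LLR: each merged group keeps its own distribution."""
    counts = np.asarray(counts, dtype=float)
    groups = merge_contiguous(clusters, neighbors)
    total = counts.sum(axis=0)                     # C_k
    inside_all = np.zeros_like(total)
    llr = 0.0
    for g in groups:
        n_k = counts[sorted(g)].sum(axis=0)        # category counts in one group
        llr += _loglik(n_k)
        inside_all += n_k
    return llr + _loglik(total - inside_all) - _loglik(total), groups
```

Unlike the union LLR of the MCS-P, two separated clusters with opposite category distributions do not cancel each other here, because each merged group contributes its own term.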

Simulation study

We conducted a simulation study to evaluate the performance of the proposed method for the multinomial model in comparison to other existing methods. The study region comprised Seoul and Gyeonggi Province in South Korea, consisting of 69 districts. For the simulation, we considered five different true cluster models as depicted in Fig. 2. True cluster models (A) and (B) represented one circular-shaped and one elliptical-shaped true cluster, respectively, each consisting of 5 districts, which accounted for 8% of the entire study region. True cluster model (C) depicted one irregular-shaped true cluster with 10 districts, representing 15% of the entire study region. True cluster models (D) and (E) assumed two circular-shaped and two elliptical-shaped true clusters, respectively, each consisting of 5 districts.

Fig. 2 True cluster models in the simulation study

For each true cluster model, we considered various scenarios of the alternative hypothesis, assuming four categories. The parameter setting for the alternative hypothesis was adopted from a previous study [6]. The null hypothesis was set to equal probabilities of 0.25 for each of the four categories. In the previous study [6], several different alternative hypotheses were used to evaluate the multinomial-based spatial scan statistic, and the statistic was shown to work well under those hypotheses. Because this study assesses a method for optimizing the MRCS for the multinomial-based spatial scan statistic, we considered it appropriate to evaluate its performance under the same hypotheses. Furthermore, because the alternative hypotheses satisfy the likelihood ratio ordering, we were also able to evaluate the performance of the ordinal model [3]. For the true cluster models with two clusters, we included heterogeneous settings where different alternative hypotheses were assigned to each cluster, as well as homogeneous settings where the same alternative hypothesis was applied to both clusters. This allowed us to examine the performance of the proposed method in more plausible heterogeneous settings, where the relative risks of each category differ between the two clusters. We considered four alternative hypotheses for the true cluster models with one cluster and with two homogeneous clusters, as well as three alternative hypotheses for the true cluster models with two heterogeneous clusters. This resulted in a total of 26 scenarios. Table 1 presents the simulation scenarios for the true cluster model along with their respective alternative hypotheses.

Table 1 Simulation scenarios for the true cluster model and alternative hypothesis

Under each scenario, we generated 1000 datasets, each containing 1000 cases distributed among four categories. For each data set, we repeatedly identified clusters by varying the MRCS values. In SaTScan™, the MRCS value was set to 1%, 2%, 3%, 4%, 5%, 6%, 8%, 10%, 12%, 15%, 20%, 25%, 30%, 35%, 40%, 45%, and 50%. As SaTScan™ provides Gini coefficient values for these 17 candidate MRCS values in the Bernoulli and Poisson models, we computed the SCICs, Gini coefficient (for the ordinal model), Elbow method, MCS-P and MCHS-P values for these 17 candidate MRCS values for consistency. Then, we compared the clusters reported by each method using the optimal MRCS selected, with the true clusters. Regarding the scanning window shape, we presented the simulation results obtained when using the elliptical windows as the main results because Kulldorff et al. [32] found that the spatial scan statistic with elliptic windows exhibited good performance in terms of the power when the shape of the true cluster is elliptical or circular.
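
The data-generation step at the start of this paragraph can be sketched as follows. This is a loose illustration, not the authors' procedure: the function name `simulate_dataset` is hypothetical, the allocation of cases to districts (uniform by default) is an assumption because it is not specified above, and the in-cluster probabilities in the usage line are placeholders rather than values from Table 1.

```python
import numpy as np

def simulate_dataset(n_regions, cluster, p_inside, p_outside,
                     n_cases=1000, region_weights=None, rng=None):
    """Draw one (n_regions, K) table of category counts.

    cluster        : set of region indices forming the true cluster
    p_inside       : category probabilities inside the true cluster
    p_outside      : category probabilities elsewhere (null: all equal)
    region_weights : share of cases per region (assumed uniform if None)
    """
    rng = np.random.default_rng(rng)
    if region_weights is None:
        region_weights = np.full(n_regions, 1.0 / n_regions)
    # Step 1: distribute the cases over regions.
    per_region = rng.multinomial(n_cases, region_weights)
    # Step 2: within each region, distribute its cases over categories.
    counts = np.zeros((n_regions, len(p_outside)), dtype=int)
    for i, n_i in enumerate(per_region):
        probs = p_inside if i in cluster else p_outside
        counts[i] = rng.multinomial(n_i, probs)
    return counts

# Placeholder scenario: 69 districts, a 5-district cluster, four categories.
sim = simulate_dataset(69, {0, 1, 2, 3, 4},
                       p_inside=[0.1, 0.2, 0.3, 0.4],
                       p_outside=[0.25] * 4, rng=0)
```

Repeating the call with different seeds would give the replicate datasets for one scenario; the resulting count tables would then be passed to the scan statistic at each candidate MRCS.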

Over 1000 randomly generated datasets, we recorded the frequency at which each candidate MRCS value was selected as the optimal MRCS for each method. To compare the performance of the proposed method with other existing methods and the default setting (MRCS value of 50%), we used sensitivity, positive predictive value (PPV), and misclassification as the performance measures, as per a previous study [33]. Sensitivity represents the proportion of correctly identified districts within the true cluster, while PPV represents the proportion of correctly identified districts within the detected cluster. A lower sensitivity means that the method failed to identify some districts that belong to the true cluster. A lower PPV means that the method identified some districts that do not belong to the true cluster. Misclassification indicates the proportion of incorrectly identified districts within the true or detected cluster. Higher sensitivity and PPV values, along with lower misclassification values, indicate better performance in accurately identifying clusters. We calculated the average sensitivity, PPV, and misclassification over 1000 simulated datasets for two sets of MRCS values: (1) those selected by SCIC1, SCIC2, Gini coefficient (only for the ordinal model), Elbow method, MCS-P, and MCHS-P, and (2) the default value of 50%. The simulation was conducted using SaTScan™ version 10.0 and R software version 4.0.2, employing the ‘rsatscan’ package [34].
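
The three performance measures can be sketched for a single simulated dataset as follows. The function name `detection_accuracy` is hypothetical. The denominator used for misclassification (the union of true and detected districts) is one reading of the verbal definition above and should be treated as an assumption, as should returning NaN for PPV when nothing is detected.

```python
def detection_accuracy(true_cluster, detected):
    """Sensitivity, PPV and misclassification for sets of district indices."""
    true_cluster, detected = set(true_cluster), set(detected)
    hit = true_cluster & detected                # correctly identified districts
    wrong = true_cluster ^ detected              # missed or falsely included
    sensitivity = len(hit) / len(true_cluster)
    # Assumption: PPV is undefined (NaN) when no district is detected.
    ppv = len(hit) / len(detected) if detected else float("nan")
    # Assumption: misclassification is taken relative to the union.
    misclassification = len(wrong) / len(true_cluster | detected)
    return sensitivity, ppv, misclassification

# Five true districts; three detected, two of which are correct.
sens, ppv, mis = detection_accuracy({1, 2, 3, 4, 5}, {4, 5, 6})
```

Here two of the five true districts are found (sensitivity 0.4), two of the three detected districts are correct (PPV 2/3), and four of the six districts involved are wrong in one direction or the other.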

Results

Simulation study results

Tables 2, 3, 4, 5 present the simulation results for cluster model (B). The other results are provided in Additional file 1. For cluster models (A), (B), (D), and (E), all five methods most often selected the optimal MRCS value equal to the size of the true cluster from the 17 candidate MRCS values, regardless of the alternative hypothesis scenario. For cluster model (C), with an irregular-shaped cluster, all five methods most often chose an optimal MRCS value of 12%, which is smaller than the size of the true cluster (15%), irrespective of the alternative hypothesis scenario. When using the optimal MRCS value instead of the default setting, the methods tend to report multiple informative smaller clusters instead of a single larger cluster that contains the true irregular cluster.

Table 2 Multinomial model: simulation results for the true cluster model (B) and alternative hypothesis (1) using elliptical windows
Table 3 Multinomial model: simulation results for the true cluster model (B) and alternative hypothesis (2) using elliptical windows
Table 4 Multinomial model: simulation results for the true cluster model (B) and alternative hypothesis (3) using elliptical windows
Table 5 Multinomial model: simulation results for the true cluster model (B) and alternative hypothesis (4) using elliptical windows

The proposed methods consistently exhibited higher sensitivity and positive predictive value (PPV) at the most frequently selected MRCS value than the default setting. Additionally, the rate of misclassification was much lower. The overall sensitivity of the proposed methods was slightly lower than that of the default setting. However, the overall PPV was higher than that of the default setting. Across all scenarios, it appears that all five methods yielded similar overall detection accuracy in terms of sensitivity, PPV, and misclassification. The overall sensitivity of SCIC1 was comparable to SCIC2, while the overall PPV of SCIC1 was slightly higher than that of SCIC2.

The simulation results for the ordinal model are provided in Additional file 2 (Tables A23–A48). For the ordinal model, the proposed methods and the other three methods showed trends similar to those observed for the multinomial model. The sensitivity and PPV of SCIC1 and SCIC2 at the most often selected MRCS value were higher than those of the default setting. The overall PPV of the proposed methods was higher than that of the default setting, while the sensitivity was comparable. Additionally, the misclassification rate was consistently lower. We noticed that the overall sensitivity of the SCIC2 was slightly higher than that of the SCIC1 in cluster models (D) and (E), which involve two clusters. The Gini coefficient exhibited higher sensitivity and PPV, and lower misclassification at the most often chosen MRCS value, but its overall performance was quite similar to that of the default setting.

Application to Korea Community Health Survey data

We used the Korea Community Health Survey (KCHS) data to illustrate the usefulness of the proposed method. The KCHS is an annual survey conducted by the Korea Disease Control and Prevention Agency since 2008 to gather community-based health statistics. This survey was carried out across 253 community health centers, covering various aspects such as health behaviors, self-reported health indicators, and demographic characteristics. For our analysis, we used the ‘reason for starting to drink’ as the nominal categorical variable from the 2019 KCHS data. Subjects who had never consumed alcohol were excluded. The ‘reason for starting to drink’ was categorized into four groups: (1) recommended by people, (2) out of curiosity, (3) to promote friendship, and (4) other reasons. It would be valuable to examine the spatial autocorrelation to assess whether this outcome variable exhibits inherent spatial dependency. However, based on our literature search, there appears to be no established method for calculating spatial autocorrelation for multinomial data. The results of the spatial cluster detection analysis might provide insights into spatial autocorrelation. Using the multinomial-based spatial scan statistic with elliptical windows, we searched for regions in Seoul and Gyeonggi province that exhibited distinct distributions of the ‘reason for starting to drink’ among males in their 20s and 30s.

The reported clusters differed depending on the method used to optimize the MRCS value. Figure 3 shows a map of the significant spatial clusters reported by each method. A summary of those clusters is presented in Table 6. The SCIC1 and SCIC2 methods selected an optimal MRCS of 10%, which is smaller than the default setting. When using the default setting, three large clusters were reported. In contrast, the proposed methods identified six smaller clusters that seem to carry more meaningful information. Cluster 1 reported using the SCICs lies within cluster 1 reported using the default setting. Similarly, cluster 2 reported using the SCICs lies within cluster 2 reported using the default setting. Clusters 3, 4, and 5 reported using the SCICs lie within cluster 3 reported using the default setting. The proposed methods seemed to reveal more meaningful smaller clusters that were not identified by the default setting. It is worth noting that cluster 4 reported using the SCICs was a hidden smaller cluster with the highest relative risk (RR) in category 3, rather than in category 1 as was the case for cluster 3 identified under the default setting. Additionally, the proposed methods reported additional regions as cluster 6, which went unnoticed by the default setting.

Fig. 3 A map of the significant spatial clusters identified using the multinomial-based spatial scan statistic with elliptical windows at the MRCS suggested by (1) default setting, (2) SCIC1, (3) SCIC2, (4) elbow method, (5) MCS-P, and (6) MCHS-P

Table 6 A summary of the significant spatial clusters identified using the multinomial-based spatial scan statistic with elliptical windows at the MRCS suggested by (1) default setting, (2) SCIC1, (3) SCIC2, (4) elbow method, (5) MCS-P, and (6) MCHS-P

The Elbow method selected 4% as the optimal MRCS, while the MCS-P and MCHS-P selected 2% as optimal. These three methods identified clusters that either consisted of smaller clusters within the clusters detected by the default setting, smaller clusters partially overlapping with the default clusters, or smaller clusters in entirely new regions without any overlap with the default clusters. Those clusters could provide more informative and interpretable results compared to those identified using the default setting. However, the clusters obtained using these methods are primarily composed of very small clusters consisting of only one or two regions. Particularly when using the MCHS-P method, it might be difficult to consider them as clusters since some reported clusters consisting of one region are remote and not adjacent to other clusters.

Discussion and conclusion

To select the optimal MRCS value when using the spatial scan statistics, several optimization criteria have been developed such as the Gini coefficient [17, 19,20,21], MCS-P [23], MCHS-P [24], and Elbow method [22]. However, the Gini coefficient for the multinomial model has not been developed. The other optimization criteria (i.e., MCS-P, MCHS-P and Elbow method) have been developed and evaluated only for the Poisson model. Thus, we have proposed the SCIC to choose the optimal MRCS value for the multinomial-based spatial scan statistic.

We have evaluated the performance of the proposed methods through an extensive simulation study. Particularly, in the scenarios with the two heterogeneous clusters, we observed consistent and robust results for both the multinomial and ordinal models: (1) the SCICs mostly selected the MRCS value that matched the size of the true cluster as the optimal MRCS, and (2) the detection accuracy achieved at the optimal MRCS using SCICs outperformed the results obtained with the default setting. We have also evaluated the performance of the existing methods by adapting them to the multinomial model. The overall detection accuracy obtained using the proposed methods was comparable to that of other existing methods. This might be because these methods are all defined based on the likelihood. While the sensitivity of the proposed methods at the selected optimal MRCS value was higher than that of the default setting, the overall sensitivity was slightly lower. This could be considered a limitation of our method, as it suggests the potential for missing certain regions of true clusters in some situations. However, this trend was observed across all evaluated methods.

Despite delivering comparable performance, the existing methods have certain limitations. The Gini coefficient cannot be applied to the multinomial model. The Elbow method assumes that the sum of the LRT statistics for significant clusters increases monotonically as the MRCS value increases. However, in certain cases, multiple significant clusters may be reported at small MRCS values, causing the sum of the LRT statistics to increase initially and then decrease. As a result, identifying the proper elbow point becomes challenging. The MCS-P and MCHS-P methods require a distinct definition of the union log-likelihood ratio test statistic for each probability model. Additionally, the MCHS-P method suffers from a lengthy computation time because it requires calculating the spatial contiguity matrix.
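
The difficulty with the Elbow method can be illustrated with a minimal sketch. The code below is not the procedure of [22]; it uses a generic elbow heuristic (the point farthest from the chord joining the first and last points of the curve), and the MRCS grid and the summed LRT values are hypothetical numbers chosen only for illustration. It checks whether the curve is monotone before an elbow is sought.

```python
from typing import Sequence


def is_monotone_increasing(values: Sequence[float]) -> bool:
    """Return True if the sequence never decreases."""
    return all(b >= a for a, b in zip(values, values[1:]))


def elbow_index(x: Sequence[float], y: Sequence[float]) -> int:
    """Index of the point farthest from the chord joining the end points.

    Generic elbow heuristic (an assumption of this sketch), not
    necessarily the rule used by the Elbow method in [22].
    """
    x0, y0 = x[0], y[0]
    dx, dy = x[-1] - x0, y[-1] - y0
    # Perpendicular distance to the chord, up to the constant chord length.
    dist = [abs(dy * (xi - x0) - dx * (yi - y0)) for xi, yi in zip(x, y)]
    return max(range(len(dist)), key=dist.__getitem__)


# Hypothetical MRCS grid (percent of the population) and hypothetical
# sums of LRT statistics over the significant clusters at each MRCS.
mrcs = [1, 2, 4, 6, 8, 10, 20, 50]
well_behaved = [10.0, 20.0, 34.0, 35.0, 36.0, 36.5, 37.0, 38.0]
rise_then_fall = [25.0, 40.0, 38.0, 30.0, 29.0, 28.0, 28.0, 27.0]

for name, curve in [("well-behaved", well_behaved),
                    ("rise-then-fall", rise_then_fall)]:
    if is_monotone_increasing(curve):
        print(name, "elbow at MRCS =", mrcs[elbow_index(mrcs, curve)], "%")
    else:
        print(name, "curve is not monotone; the elbow is ill-defined")
```

In the second curve, several small significant clusters inflate the sum at small MRCS values, so the monotonicity assumption fails and any elbow rule of this kind would essentially report the peak rather than a genuine elbow.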

We have introduced the SCICs for the multinomial model, which can be easily extended to any likelihood-based probability model. These criteria are computationally efficient because they are calculated directly, without requiring any modification of the test statistics. Consequently, we suggest using the SCICs when selecting the optimal MRCS for the multinomial- and ordinal-based spatial scan statistics. By employing the SCICs, we anticipate identifying more meaningful and interpretable clusters than with the default setting.

Between the two versions of the SCICs, we find that the SCIC1 appears more appropriate because it includes information on the number of cases in addition to the regional information. In the simulation results for the multinomial model, the SCIC1 outperformed the SCIC2 in terms of PPV. In the simulation results for the ordinal model, however, both the overall sensitivity and PPV were comparable between the SCIC1 and SCIC2 in the single-cluster setting. In the two-cluster setting, the overall sensitivity of the SCIC2 was slightly higher than that of the SCIC1. Nevertheless, the differences in overall sensitivity between the SCIC1 and SCIC2 were minimal and not deemed significant.
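
As a reminder of how the two accuracy measures compared above behave, the following sketch computes sensitivity and PPV from sets of region identifiers. It assumes a simple region-count definition (each region weighted equally); the simulation study may weight regions differently, for example by population or by number of cases. The function name and the region labels are hypothetical.

```python
from typing import Iterable, Tuple


def sensitivity_ppv(true_regions: Iterable[str],
                    detected_regions: Iterable[str]) -> Tuple[float, float]:
    """Region-count sensitivity and PPV of a reported cluster set.

    sensitivity = |true AND detected| / |true|
    PPV         = |true AND detected| / |detected|
    """
    true_set, detected_set = set(true_regions), set(detected_regions)
    hits = len(true_set & detected_set)
    sensitivity = hits / len(true_set) if true_set else float("nan")
    ppv = hits / len(detected_set) if detected_set else float("nan")
    return sensitivity, ppv


# Hypothetical true cluster of four regions.
true_cluster = {"A", "B", "C", "D"}

# A large reported cluster covers the truth but adds irrelevant regions.
large_report = {"A", "B", "C", "D", "E", "F", "G", "H"}
# A smaller reported cluster misses one true region but adds none.
small_report = {"A", "B", "C"}

print(sensitivity_ppv(true_cluster, large_report))
print(sensitivity_ppv(true_cluster, small_report))
```

The two calls show the trade-off discussed above: a large reported cluster can reach full sensitivity at the cost of PPV, whereas a smaller reported cluster can raise PPV while missing part of the true cluster.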

In summary, we propose a novel approach to optimizing the MRCS value for the multinomial-based spatial scan statistic. Compared to the default setting, our SCIC measures improve the accuracy of reported clusters. Unlike the existing measures, the SCIC measures also extend easily to other probability models. In public health and disease surveillance, our approach has the potential to enhance spatial cluster detection by providing greater accuracy and more meaningful insights.

Availability of data and materials

The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request.

Abbreviations

BIC: Bayes information criterion

KCHS: Korea Community Health Survey

LRT: Likelihood ratio test

LLR: Log-likelihood ratio

MCS-P: Maximum clustering set-proportion

MRCS: Maximum reported cluster size

MSWS: Maximum scanning window size

PPV: Positive predictive value

SCIC: Spatial cluster information criterion

References

  1. Kulldorff M. A spatial scan statistic. Commun Stat Theory Methods. 1997;26(6):1481–96.

  2. Cook AJ, Gold DR, Li Y. Spatial cluster detection for censored outcome data. Biometrics. 2007;63(2):540–9.

  3. Jung I, Kulldorff M, Klassen AC. A spatial scan statistic for ordinal data. Stat Med. 2007;26(7):1594–607.

  4. Kulldorff M, Huang L, Konty K. A scan statistic for continuous data based on the normal probability model. Int J Health Geogr. 2009;8:58.

  5. Huang L, Tiwari RC, Zou Z, Kulldorff M, Feuer EJ. Weighted normal spatial scan statistic for heterogeneous population data. J Am Stat Assoc. 2009;104(487):886–98.

  6. Jung I, Kulldorff M, Richard OJ. A spatial scan statistic for multinomial data. Stat Med. 2010;29(18):1910.

  7. Mai G, Janowicz K, Hu Y, Gao S. ADCN: an anisotropic density-based clustering algorithm for discovering spatial point patterns with noise. Trans GIS. 2018;22:348–69.

  8. Kang Y, Wu K, Gao S, Ng I, Rao J, Ye S, Zhang F, Fei T. STICC: a multivariate spatial clustering method for repeated geographic pattern discovery with consideration of spatial contiguity. Int J Geogr Inf Sci. 2022;36(8):1518–49.

  9. Knox. Detection of clusters. In: Elliott P, editor. Methodologies of Enquiry into Disease Clustering. Wembley: Small Area Health Statistics Unit; 1989. p. 17–22.

  10. Hu Y, Gao S, Janowicz K, Yu B, Li W, Prasad S. Extracting and understanding urban areas of interest using geotagged photos. Comput Environ Urban Syst. 2015;54:240–54.

  11. Damiani ML, Issa H, Fotino G, Heurich M, Cagnacci F. Introducing presence and stationarity index to study partial migration patterns: an application of a spatio-temporal clustering technique. Int J Geogr Inf Sci. 2016;30(5):907–28.

  12. Huang Q. Mining online footprints to predict user’s next location. Int J Geogr Inf Sci. 2017;31:523–41.

  13. Gruebner O, Lowe S, Tracy M, Joshi S, Cerdá M, Norris F, Subramanian S, Galea S. Mapping concentrations of posttraumatic stress and depression trajectories following Hurricane Ike. Sci Rep. 2016;6:32242.

  14. Cordes J, Castro MC. Spatial analysis of COVID-19 clusters and contextual factors in New York City. Spat Spatio-temporal Epidemiol. 2020;34:100355.

  15. Richards Steed R, Bakian AV, Smith KR, Wan N, Brewer S, Medina R, VanDerslice J. Evidence of transgenerational effects on autism spectrum disorder using multigenerational space-time cluster detection. Int J Health Geogr. 2022;21:13.

  16. Ribeiro SHR, Costa MA. Optimal selection of the spatial scan parameters for cluster detection: a simulation study. Spat Spatio-temporal Epidemiol. 2012;3(2):107–20.

  17. Han J, Zhu L, Kulldorff M, Hostovich S, Stinchcomb DG, Tatalovich Z, Lewis DR, Feuer EJ. Using Gini coefficient to determining optimal cluster reporting sizes for spatial scan statistics. Int J Health Geogr. 2016;15:27.

  18. Gini C. Variabilità e mutabilità. Reprinted in Memorie di metodologica statistica (Ed. Pizetti E, Salvemini T). Rome: Libreria Eredi Virgilio Veschi; 1912.

  19. Kim S, Jung I. Optimizing the maximum reported cluster size in the spatial scan statistic for ordinal data. PLoS ONE. 2017;12:e0182234.

  20. Yoo H, Jung I. Optimizing the maximum reported cluster size for normal-based spatial scan statistics. Commun Stat Appl Methods. 2018;25:373–83.

  21. Lee S, Moon J, Jung I. Optimizing the maximum reported cluster size in the spatial scan statistic for survival data. Int J Health Geogr. 2021;20:33.

  22. Meysami M, French JP, Lipner EM. Estimating the optimal population upper bound for scan methods in retrospective disease surveillance. Biom J. 2021;63:1633–51.

  23. Ma Y, Yin F, Zhang T, Zhou XA, Li X. Selection of the maximum spatial cluster size of the spatial scan statistic by using the maximum clustering set-proportion statistic. PLoS ONE. 2016;11(1):e0147918.

  24. Wang W, Zhang T, Yin F, Xiao X, Chen S, Zhang X, Li X, Ma Y. Using the maximum clustering heterogeneous set-proportion to select the maximum window size for the spatial scan statistic. Sci Rep. 2020;10:4900.

  25. Schwarz G. Estimating the dimension of a model. Ann Stat. 1978;6:461–4.

  26. Neath AA, Cavanaugh JE. The Bayesian information criterion: background, derivation, and applications. WIRE Comput Stat. 2012;4:199–203.

  27. Tango T, Takahashi K. A flexibly shaped spatial scan statistic for detecting clusters. Int J Health Geogr. 2005;4:11.

  28. Tango T. A test for spatial disease clustering adjusted for multiple testing. Stat Med. 2000;19:191–204.

  29. Tango T. Spatial scan statistics can be dangerous. Stat Methods Med Res. 2021;30(1):75–86.

  30. Kodinariya TM, Makwana PR. Review on determining number of cluster in k-means clustering. Int J. 2013;1(6):90–5.

  31. Delgado H, Anguera X, Fredouille C, Serrano J. Novel clustering selection criterion for fast binary key speaker diarization. INTERSPEECH. 2015. p. 3091–5.

  32. Kulldorff M, Huang L, Pickle L, Duczmal L. An elliptic spatial scan statistic. Stat Med. 2006;25:3929–43.

  33. Costa MA, Assunção RM, Kulldorff M. Constrained spanning tree algorithms for irregularly-shaped spatial clustering. Comput Stat Data Anal. 2012;56:1771–83.

  34. Kleinman K. rsatscan: tools, classes, and methods for interfacing with SaTScan stand-alone software. 2015. https://CRAN.R-project.org/package=rsatscan/.

Acknowledgements

Not applicable.

Funding

Not applicable.

Author information

Contributions

IJ conceived the study. JM and MK conducted the simulations and analyzed the data. JM drafted the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Inkyung Jung.

Ethics declarations

Ethics approval and consent to participate

This study was approved by the SNU Research Ethics Team (IRB No. E1912/001-010).

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Additional file 1.

Simulation results for multinomial model (A1–A22).

Additional file 2.

Simulation results for ordinal model (A23–A48).

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article

Moon, J., Kim, M. & Jung, I. Optimizing the maximum reported cluster size for the multinomial-based spatial scan statistic. Int J Health Geogr 22, 30 (2023). https://doi.org/10.1186/s12942-023-00353-4

  • DOI: https://doi.org/10.1186/s12942-023-00353-4

Keywords