Detecting multiple spatial disease clusters: information criterion and scan statistic approach

Background Detecting the geographical tendency for the presence of a disease or incident is, particularly at an early stage, a key challenge for preventing severe consequences. Given recent rapid advancements in information technologies, it is required a comprehensive framework that enables simultaneous detection of multiple spatial clusters, whether disease cases are randomly scattered or clustered around specific epicenters on a larger scale. We develop a new methodology that detects multiple spatial disease clusters and evaluates its performance compared to existing other methods. Methods A novel framework for spatial multiple-cluster detection is developed. The framework directly stands on the integrated bases of scan statistics and generalized linear models, adopting a new information criterion that selects the appropriate number of disease clusters. We evaluated the proposed approach using a real dataset, the hospital admission for chronic obstructive pulmonary disease (COPD) in England, and simulated data, whether the approach tends to select the correct number of clusters. Results A case study and simulation studies conducted both confirmed that the proposed method performed better compared to conventional cluster detection procedures, in terms of higher sensitivity. Conclusions We proposed a new statistical framework that simultaneously detects and evaluates multiple disease clusters in a large study space, with high detection power compared to conventional approaches.


Introduction
In the middle of the 19th century, a deadly cholera outbreak affected the Soho area of London, UK. John Snow, a British physician, plotted the cases of cholera victims on a map and identified many victims within a short distance of a water pump on Broad Street. The disease map led him to a historic landmark, with the water from the pump identified as the source of cholera [1]. However, what if other cholera victims had also clustered around another pump just 200 yards away? Would this still be considered as a single cluster or preferably another cluster with a different epicenter? Although the cause of disease or incident cannot be determined only by mapping the victims, disease maps are useful in initial investigations of disease causes. Whether the cases of diseases are scattered randomly or clustered around multiple specific centers is a long-standing question in epidemiological studies [2].
To date, detecting the tendency of a clustering incident, particularly at an early stage, is still a key challenge for practitioners in preventing severe epidemics and pandemics. Given recent rapid advancements in the utility of combined health and geographical information, the Takahashi and Shimadzu Int J Health Geogr (2020) 19:33 challenge has become more complex and has initiated a range of methodological developments. Based on the domain with which disease clusters are dealt, the types of disease clustering are threefold: being purely temporal, purely spatial, and spatio-temporal, for each of which different test techniques are proposed [2]. In particular, spatial clusters indicate a spatial tendency for the presence of a disease or incident, the risk of which is relatively high to other surrounding regions.
There have been many statistical tests widely used [3] for identifying meaningful spatial clusters. Amongst those techniques, a class called the general test [4] searches for clusters without any preconceived assumptions on their locations. Whether the statistical significance information of each cluster is available, however, depends on the technique employed [5]. The techniques that do not determine any statistical significance are called global clustering tests, techniques developed by Moran [6], Whitemore et al. [7], Oden [8], Tango [9], Rogerson [10] and Bonetti and Pagano [11]. In contrast, the other techniques that provide the statistical significance information, on which the present study focus, are called cluster detection tests (CDTs), including those proposed by Besag and Newell [4], Turnbull et al. [12], Kulldorff and Nagarwalla [13], Kulldorff [14], Tango [15].
Within CDTs, the circular spatial scan statistic [14] has been used extensively along with SaTScan software [16]; examples include, as part of their cancer surveillance initiative, investigating the geographical variation of breast, lung, prostate, and colorectal cancer incidences in New York State [17]. A distinctive feature of the methodology is to adopt a circular scanning window varying its size for defining potential clusters. Such a fixed shape of the scanning window could perform less effective when detecting clusters that lie in non-circular shape regions, like regions alongside a river [18]. More recent developments focus on non-circular cluster forms, employing different spatial scan statistics; examples can be found in Patil [19], Assuncao et al. [20], and Tango and Takahashi [18]. The flexibly shaped scan statistic [18] implemented in FleXScan software [21] adopts the scan approach with an exhaustive search of all cluster candidates within a given radius of any area. This approach balances out the unfeasible exhaustive search by restricting it within pre-specified neighborhoods of each area [20]. Tango and Takahashi [22] also proposed a flexible spatial scan statistic implemented with a restricted likelihood ratio. Their technique requires much less computational time compared to the original statistic and effectively detects clusters of any shape when the relative risk (RR) becomes large.
Even though such extensive methodological developments have been made, there seems to have been little attention to the accurate statistical evaluation on the simultaneous detection of multiple clusters, in other words, identifying an appropriate number of cluster regions at the same time. A significant shortcoming of previous CDTs is that they cannot provide any statistical significance information for the identified multiple clusters. Such a limitation is simply because most of the methodologies focus on "single" cluster detection while investigating the extended study space within which more than one cluster is expected. Some CDTs can be adjusted for multiple cluster detection employing spatial scan statistics [14,[23][24][25], by iteratively running a conventional CDT single cluster detection algorithmit leaves out sub-regions that are already identified as disease clusters in previous iterations until satisfactory results are obtained [14]. While the detection procedure is recursively performed, the cluster of the first choice is often referred to as the "primary" cluster, while the remaining clusters are referred to as "secondary" clusters; the conventional procedure is therefore often named as the secondary-cluster procedure (SCP).
The utility of CDTs becomes challenging when evaluating the number of clusters that lie within the study region. Each iteration of cluster detection in SCPs identifies only one cluster; thus, any test statistics, including associated p-values of the iteration, are only valid for evaluating that specific cluster. As a consequence, the current conventional approaches fail to provide an accurate assessment for selected multiple clusters. Therefore, a comprehensive approach is needed. A recent study suggests that a combined approach of statistical modeling and model selection can offer a potential solution by illustrating a case study that detects purely temporal clusters with a time series model in Takahashi and Shimadzu [26]. However, it is not always straightforward if the time series framework is directly applicable to a spatial context, which involves an extra dimension. It is unable to take advantage of the ordering structure in data-time series data are one-directional along with time, from the beginning to the end, but spatial data do not possess such a clear ordering structure. It is even unclear whether a similar approach can perform with a high detection power for cluster detection and, thus, extra care is required to develop a multiple-cluster detection framework in spatial contexts.
Here, we propose a unified framework that enables simultaneous detection and evaluation of multiple spatial-clusters by combining generalized linear models (GLMs) and information criterion approaches. The framework encompasses the procedure proposed for detecting purely temporal clusters in Takahashi and Shimadzu [26] as a special case. We present an illustrative example, the hospital admission for chronic obstructive Page 3 of 11 Takahashi and Shimadzu Int J Health Geogr (2020) 19:33 pulmonary disease (COPD) in England, available from a textbook [27], for evaluating the performance of the proposed method. The results are compared with an SCP approach. The consistency property of the proposed procedure is also investigated in a simulation study.

Methods
The proposed method will be evaluated through real and simulation data. As an illustrative example, we applied the method to the spatial distribution of the hospital admission for COPD in England for 2010 and compared the detection performance with an SCP for the spatial tendency of disease risk. COPD is a group of lung conditions that cause breathing difficulties, including emphysema and chronic bronchitis, and is common in the middle to older aged adults who smoke. Although the leading cause of COPD is smoking, some cases are due to long-term exposure to harmful fumes or dust. Figure 1 shows the spatial distribution of the risk of hospital admission for COPD. There were m = 324 sub-regions (local authorities) in England amongst which the total number of cases reported was 22,293. The data was taken from the book "Spatio-Temporal Methods in Environmental Epidemiology" by Shaddick and Zidek [27] (from the authors' website: http://empsl ocal.ex.ac.uk/ peopl e/staff /gs454 /). The color gradient corresponds to standardized admission rates adjusted by the underlying age-sex profile of the population within the sub-region; a darker color indicates a higher rate of COPD hospital admission. A simulation study is set up to investigate the consistent property, whether the proposed method tends to select the correct number of clusters when the actual number of clusters is known. The simulation data are motivated by the COPD data to keep some reality in the spatial distribution of disease. However, the focus is given on the evaluation of detecting low RR clusters ranging from 1 to 1.6.
In the simulation study, we assumed five clusters [A-E; Fig. 2] consisting of a different number of sub-regions, with each cluster showing a different RR according to the seven different scenarios (S1-S7) shown in Table 1. For instance, Scenario 1 (S1) indicated the null, i.e., there was no cluster, whereas Scenarios 2-5 (S2-S5) had five clusters (A-E) and Scenarios 6 and 7 (S6 and S7) assumed only single cluster (A) in the study area. For the remaining sub-regions (B-E), the RRs were set to 1.0. We generated 1000 datasets for each scenario and compared the estimated power calculated from the two cluster detection tests, the SCP and the proposed methods, at a significance level of 0.05.

Methodological developments
We first describe the challenge in detecting multipleclusters in a spatial extent, formulating it as a mixture Poisson GLM. Here, the formulation allows that the proposed procedure directly stands on the likelihood principle and encompasses the SCP as a special case, demonstrating the critical fact that selecting appropriate multiple clusters is an exact parallel to the covariate selection in regression modeling, i.e., model selection. We then propose a new criterion for choosing a model with the appropriate number of clusters in favor of the maximum marginal likelihood, in a similar manner in deriving the Bayesian information criterion (BIC).

Multiple-cluster model and its likelihood
Consider a study space (or area) G consisting of m segments (or sub-regions), each of which corresponds to the smallest element in the space (e.g., counties and states). We write the number of cases within segment i as Y i , which is assumed to follow a Poisson distribution independently with an expected value µ i -i.e., Y i |µ i ∼ Poisson(µ i ) . And the observations (which is not random variable) of which Y i is denoted in lowercase as y i , i = 1, 2, . . . , m . Additionally, let W denote the set of all potential scanning zones (sets of connected segments) of any size, the construction of which set W relies on an employed scanning method. Assuming that there are K clusters: w = {w 1 , w 2 , . . . , w K } , in space G , each mutually exclusive window w k contains a set of adjacent segments as a cluster; i.e., w k ∩ w k ′ = φ for w k = w k ′ . Note that K = 0 and K = 1 indicate no cluster and a single cluster in the study space, respectively. The number of cases, y i , is expected to be higher within hot-spot clusters compared to in other parts of the study space. The expected number of cases can be modeled as for K ≥ 1 and log µ i = α 0 + log µ 0 i for K = 0 . Here, the indicator variable z ki = 1 , if segment i is a member of k− th cluster ( i ∈ w k ) and z ki = 0 otherwise. Note that all coefficients are positive, β k > 0 . For segments that fall

Fig. 2 Assumed cluster areas A-E in simulation studies
Page 5 of 11 Takahashi and Shimadzu Int J Health Geogr (2020) 19:33 into the k-th hot-spot cluster, w k , a parameter of model (1), becomes θ i = θ w k = exp (α + β k ) . In contrast, for those that fall outside of the clusters ( w ), the parameter is θ i = θw = exp (α) . Here, there is some flexibility in the constant term µ 0 i := µ 0 i (x i ) that is often modeled as a function of other covariates x i , such as demographic or environmental factors; this yields the null model; i.e., the expected number of cases, when there is initially no cluster in the study space such that β = 0 . The null model is therefore described as log µ i = α + log µ 0 i . The likelihood function of model (1) can be constructed as follows. Now, f i y i |z, ψ = f (y i |µ 0 i , z, ψ) is the probability function of Y i = y i given the two arguments: the locations of a hot-spot window, z: = z(w) = (z ki ) , which is a K × m matrix, and the parameters ψ = (α, β 1 , β 2 , . . . , β K ) . The conditional loglikelihood function can be expressed as where z 0i = 1 if i / ∈ K k=1 w k , and otherwise as z 0i = 0 . If we assume z to be randomly selected from a probability function h(z) , the complete (full) log-likelihood function of ψ becomes: where L(ψ) is the likelihood function of ψ.

Information criterion for selecting an appropriate K
Multiple-cluster model (1) suggests that the problem of detecting multiple clusters can be approached as a model selection problem to find an appropriate number of clusters, K (≤ K max ) . We propose a new information criterion that chooses K in favor of the maximum marginal likelihood, ML(y, z) = ∫ exp{log L(ψ)}g(ψ)dψ , where g(ψ) is a prior probability function of parameter ψ . This can be achieved as follows. Applying Taylor expansion and Laplace approximations to the marginal likelihood function, it can be approximated [28] as where ψ is the maximum likelihood estimator of ψ, and q = K + 1 . The model evaluation criterion can then be obtained by eliminating terms with an order less than O(1) with respect to the large sample size m ; that is, To select an appropriate number of clusters, K , we define a relative difference statistic based on criterion C(K ) as where C 0 = C(0) , the criterion under the null model. Appropriate multiple clusters are selected from the set of candidates w = (w 1 , w 2 , . . . , w K ) with respect to max K RDC(K ).
For the calculation of the proposed criterion (2), the probability function h(z) must be specified. We recommend h(z) = (1/m) K as an approximation of the probability of selecting locations w given the fixed windows size, shape, and direction, when the window size is relatively very small, #{i|i ∈ w} ≪ m , with respect to the whole data size m . Thus, a cluster selection criterion is now given as C(K ) = −2l ψ |z − 2 log (h(z)) + (K + 1) log m, (K ≥ 1).

Statistical significance of overall clusters
The Monte Carlo hypothesis testing procedure evaluates the statistical significance of appropriate models in the same manner as the standard scan statistic. Under the null hypothesis, a large number of random datasets are generated; however, for each of these, max K RDC(K ) is instead calculated as a test statistic (see details [26]).

Candidates of multiple clusters w
For the multiple-cluster model (1), candidate clusters, z , i.e., w among a large number of combinations of sets in W , must be chosen in advance. Using an SCP method, namely the flexibly shaped scan statistic, we sequentially selected candidate clusters w * 1 , w * 2 , . . . , w * K max up to the predefined maximum number K max . While the single cluster detection procedure is iteratively applied, the cluster of the first choice, w * 1 , is often called the "primary" cluster, with the remaining w * 2 , w * 3 , . . . , w * K max referred to as "secondary" clusters. Note that K max = 1 corresponds to the detection of only the primary cluster. In practice, we predefine the maximum number of candidates (e.g., K max = 10, 20, . . . ) or a p-value threshold, p s (e.g., p s < 0.5, 0.8, 1.0 ) derived as the "secondary cluster" by SCPs, as there are no overlaps among the candidate clusters. The p-value for each cluster selected by an SCP is often calculated by the Monte Carlo hypothesis testing procedure. The selection of candidates may differ depending on the scanning method used (e.g., circular, flexible, and so forth).

An illustrative example
As an illustrative example, we applied the method to the COPD data in England ( m = 324 sub-regions) for 2010, shown in Fig. 1. A comparison of our proposed method and conventional SCP revealed a distinctive difference in the number of detected clusters. The proposed method tended to detect more clusters compared to the conventional SCP approach, as shown in Fig. 3 and Table 2. Note that some clusters are next to each other as if they are in the same single cluster, for example w * 1 , w * 2 , w * 11 , w * 12 ; however, they are not because their RRs differ. In the analysis, the candidate clusters w were chosen by the restricted flexible shaped scan statistic [22] with the maximum number of the area as 20. The p-values were calculated by the Monte Carlo hypothesis testing procedure with 9999 replications for each cluster selected by the SCP.
Our proposed method suggested a total of 15 clusters w * 1 , w * 2 , . . . , w * 15 with the p-value of the multiple cluster C(K ) = −2l ψ |z + (3K + 1) log m, (K ≥ 1). model as p M = 0.0001 ( C(15) = 2926.92 and RDC(15) = 0.2242, where C 0 = 3724.78). In contrast, the conventional SCP detected K = 10 clusters w * 1 , w * 2 , . . . , w * 10 at a significance level of p s < 0.05 (Table 2). Although clusters w * 11 , w * 12 , . . . , w * 15 with p s > 0.05 were excluded by the conventional SCP approach, the proposed method suggested that they should be included, as the p-value of the multiple cluster model was p M = 0.0001. Table 3 shows the number of detected significant multiple clusters K of the SCP and proposed procedure along with the total power among 1000 datasets for each scenario. Note that the RRs of S5 were set to resemble those of the first five clusters in the example data (Table 1). Table 4 shows the sensitivity (Sen) and positive predictive value (PPV) of regions detected as significant, as well as their averages and number of detections with Sen = 1 and PPV = 1 among the 1000 datasets.

Simulation study
The total powers for both procedures were very similar, except for S4. However, the SCP tended to detect a smaller number of clusters compared to the proposed method. The sensitivity of the SCP was lower than that of the proposed procedure. Notably, for weak clusters with low RRs, RR = 1.3 (S3), RR = 1.2 (S4), and mixed RRs (S5), the SCP failed to detect the five clusters with a higher power. Therefore, the sensitivity of the SCP and the probability of Sen = 1 for these scenarios were much lower than that of the proposed procedure.
In contrast, the proposed procedure tended to detect more clusters than the actual value. The PPVs of the proposed procedure were slightly lower than those of the SCP approach, but its sensitivity appeared to be higher. These simulation results suggest that the proposed procedure can detect regions within the assumed clusters with RR > 1.0 accurately with slightly extended regions. A similar performance was observed in scenarios S6 and S7 for which a single cluster was assumed.

Discussion
Several studies have been conducted to detect multiple clusters using scan statistics other than SCPs. For example, Zhang et al. [23] proposed an adjusted p-value for a sequential detection approach, recursively locating clusters based upon all previously detected clusters. Although this method performs better with a higher power than conventional SCPs, the relative sizes of the adjusted p-values for secondary clusters are irrelevant to the order in which the clusters are sequentially detected; thus, the k-th cluster may have a smaller p-value than the previously detected (k − 1)-th cluster. Additionally, the procedure can only evaluate the significance of individual clusters but not of multiple clusters as a whole.
In the spatial context, a multiple cluster detection procedure using spatial scan statistics was described in [24,25]. However, this method cannot assess the significance of multiple clusters as a whole. A generalized linear mixed model with Moran's I statistic and stepwise procedure allows for multiple cluster evaluation, accounting for random spatial effects. The power of the approach is lower than that of the standard scan statistic [29]. A recent study [30] suggested a quasi-likelihood approach that deals with spatial correlation. However, quasilikelihood suffers from the multiple testing problem in selecting multiple clusters, as the approach does not provide a full-likelihood. Our approach avoids this issue by utilizing the model selection framework with the proposed information criterion based on the full-likelihood principle.
We proposed an information criterion for selecting an appropriate number of clusters. The information criterion approach is based on the framework proposed by Takahashi and Shimadzu [26] for detecting multiple temporal-clusters. The idea of model selection has been used in more general statistical modeling contexts; for instance, Akaike information criterion (AIC) and Bayesian information criterion (BIC) are used to estimate the number of multiple clusters [31,32] and finite mixtures [33]. However, in situations  Table 3 The number of detected significant multiple clusters K of the secondary-cluster procedure (SCP) and proposed procedures in the simulation study where large datasets are used, conventional information criteria, including −2 log likelihood, AIC, and BIC, perform poorly and cannot accurately select an appropriate number of clusters. The proposed criterion is derived from the marginal likelihood of the multiple cluster model and accounts for the probability distribution of selected candidate clusters. Our examples and simulations clearly demonstrate that the proposed criteria perform well for identifying appropriate multiple clusters. Figure 4 shows the comparison of the proposed criterion C with other conventional criteria: −2 log L , AIC, and BIC, at K ( K = 0, 1, . . . , 20 ). Although some inflection points were observed at around K = 11, the proposed criterion C attained a minimum value, i.e., the maximum value of RDC , at K = 15. In contrast, other criteria monotonically decrease and do not reach minimum values for K ≤ 20.
A more conservative p-value is calculated by the secondary procedure as compared to the primary cluster procedure [23,34]. Thus, the former identifies fewer significant secondary clusters relative to true clusters. This was observed in our simulation study, while the proposed procedure tends to detect more clusters, contrasting the reported result in the purely temporal setting [26], although this may largely depend on the scenario assumed.
Our case study and simulation studies demonstrate that the proposed framework performs well, although some limitations remain. First, multiple cluster detection depends on the scanning method initially used, and we adopted the conventional secondary procedure to preselect candidate clusters for a GLM. This implies that choosing the optimal scan statistic with high detection accuracy is essential. It requires further investigations on various detection test statistics as well as other scanning methods, including the union cluster situation. Second, the spatial dependence structure must be considered for better cluster detection. These methods will provide insight for future research.

Conclusion
We proposed a new statistical framework that combines the scan statistic and GLMs to simultaneously detect and evaluate multiple disease clusters in a large study space. The framework can determine whether the presence of a specific disease or incident is entirely random over geographical space. We also developed a new information criterion to select the appropriate number of clusters in the spatial context. Together with these approaches, the proposed framework enables the estimation and evaluation of multiple clusters with high detection power, as demonstrated in our simulation study. Further, a distinctive feature of our simultaneous   detection framework is that it can calculate the p-value of detected multiple-clusters as a whole, as opposed to one at a time, as in conventional SCPs.