A semiparametric cluster detection method — a comprehensive power comparison with Kulldorff's method
© Wen and Kedem; licensee BioMed Central Ltd. 2009
Received: 12 September 2009
Accepted: 31 December 2009
Published: 31 December 2009
A semiparametric density ratio method which borrows strength from two or more samples can be applied to moving window of variable size in cluster detection. The method requires neither the prior knowledge of the underlying distribution nor the number of cases before scanning. In this paper, the semiparametric cluster detection procedure is combined with Storey's q-value, a type of controlling false discovery rate (FDR) method, to take into account the multiple testing problem induced by the overlapping scanning windows.
It is shown by simulations that for binary data, using Kulldorff's Northeastern benchmark data, the semiparametric method and Kulldorff's method performs similarly well. When the data are not binary, the semiparametric methodology still works in many cases, but Kulldorff's method requires the choices of a correct probability model, namely the correct scan statistic, in order to achieve comparable power as the semiparametric method achieves. Kulldorff's method with an inappropriate probability model may lose power.
The semiparametric method proposed in the paper can achieve good power when detecting localized cluster. The method does not require a specific distributional assumption other than the tilt function. In addition, it is possible to adapt other scan schemes (e.g., elliptic spatial scan, flexible shape scan) to search for clusters as well.
Scan statistics arise when scanning in time, or space, or both, looking for clusters of certain events or cases. Here the cluster can be defined as a certain spatial or temporal subregion where the the probability distribution of an event is different from that in the rest of the region. More generally, a cluster is a subregion where the behavior of an observable is different from the behavior of the observable in the rest of the region. For instance, a subregion comprised of several counties with higher rate of one certain type of cancer than the rate of other counties in the study region defines a cancer cluster . Besides epidemiology, more examples of clusters can be found in many other fields, including criminology, genetics, mining, astronomy, and so on. See Glaz and Balakrishnan (1999), Glaz et al. (2001) [2, 3].
As a contemporary research topic, research on scan statistics can be traced back to the 1960's . Since then it came out many scan statistics methods, such as scan statistics with Poisson assumptions , scan statistics in purely temporal domain , Upper-level set scan [6–8], Flexible shaped scan [9, 10], etc. Among them, Kulldorff's spatial scan statistics has been one of the most popular methods and has been applied in many scientific fields for detecting either purely spatial or higher dimensions clusters, e.g. space-time scan [11–15]. However, Kulldorff's method requires assumptions on the underlying distribution (Bernoulli, Poisson, exponential, etc.) of the scan region. If an inappropriate model is chosen, the power of cluster detection may significantly decrease . In addition, the method requires knowledge of the number of events over the region of interest for Monte Carlo hypothesis testing.
Kedem and Wen (2007)  proposed a semiparametric cluster detection procedure which requires much less than complete distributional assumptions, and which does not require the number of cases prior to scanning. They also performed a limited power study showing that when data are largely compatible with the Bernoulli or Poisson assumptions, the semiparametric method could be as powerful as the focused test  and as Kulldorff's method, but the semiparametric method could achieve higher power when data deviate from the Bernoulli or Poisson assumptions. However, the power study did not involve a scanning stage where the moving window scan the whole study region for cluster candidates. It was assumed that the cluster was already known, and only wanted to compare which method, either Kulldorff's method or the semiparametric one, had higher power in detecting the difference correctly between the cluster region and the background. Hence, the power study was called limited. Although the case study for the North Humberside childhood leukemia data set [11, 18] of the paper had the scan stage, the significant level was not adjusted for multiple testing yet.
In this paper, we extended the original semiparametric cluster detection procedure by incorporating Storey's q-value method [19, 20], a type of false discovery rate (FDR) methodology , to take into account the multiple testing problem inherent in cluster detection. A comprehensive power study was then conducted using Kulldorff's northeastern benchmark data set  for binary power comparison and generated quantized normal data for non-binary power comparison. The results show that for binary data the semiparametric method and Kulldorff's method are quite comparable. However, when data are not binary, the semiparametric method performs well regardless of the probability model whereas Kulldorff's method requires the correct model in order to get a comparably good power. When the chosen model is inappropriate, Kulldorff's method may not achieve good power. This findings are also consistent with the limited power study performed in Kedem and Wen (2007) . The extended semiparametric method was again illustrated by the North Humberside childhood leukemia data set.
Kulldorff's Scan Statistics
By Kulldorff's scan statistics method, each scan window Z is associated with a likelihood ratio test statistic λ (Z) which can be computed based on the chosen underlying probability model and the observed data. The scan window associated with the maximum λ (Z) is defined as the primary cluster candidate occurring not by random chance. The maximum likelihood ratio itself is called the Kulldorff's spatial scan statistic, and the null hypothesis is rejected for large value. After the spatial scan statistic and the primary cluster candidate are determined, a Monte Carlo hypothesis procedure or a permutation test procedure is executed to generate the probability distribution of Kulldorff's scan statistic under the null hypothesis of no cluster in the study region, and a p-value is obtained [23, 24].
For different types of data, Kulldorff proposed different scan statistics based on the corresponding probability models.
Bernoulli-based scan statistic is used when individual entities have only two states such as an individual person having cancer or not. The part (a) of Figure 1 shows a typical setup of Kulldorff's scan statistics method. The following notation is used.
G: the whole study region
Z: the scan window
μ(G): the total number of individual entities (e.g. people) in G
μ(Z): the number of individual entities in Z
n G : the total number of events in G
n Z : the number of events inside Z
p: the rate of events that occurred inside the scan window Z
q: the rate of events that occurred outside the scan window Z
Where . If we were scanning for cold spots, then "> " would change to "< "; if for either hot or cold spots, then it would be "≠ " (Kulldorff, 1999). After all the λ (Z) are computed, we determine the primary cluster candidate from , and reject the null for large values of . A Monte Carlo based p-value can be obtained from randomization of the cases across the whole study region given the total number of cases.
where n Z is the observed number of cases inside the scan window Z, and e Z is the expected number of cases inside the scan window Z under the null hypothesis of no cluster. As before, if we were scanning for a cluster other than hot spot, we simply change the inequality sign as needed. After getting all the λ (Z)'s, find and the Monte Carlo based p-value. When p is small, e.g. a rare disease, the Poisson model gives a close approximation to the Bernoulli model and obtains very similar results .
Jung et al. (2006)  proposed an ordinal model, which is used when individual entities have K > 2 ordinal categories such as the different stages of a kidney disease, stage 1, 2, 3, 4 and 5. A higher category may reflect a more serious cancer stage. Suppose the study region consists of I sub-regions and the variable of interest is recorded in K categories. Each individual in the analytical sample falls into one category. Let c ik be the number of individuals in location i who fall into category k, where i = 1, 2,..., I and k = 1, 2,..., K. Let C k = Σ i c ik be the number of observations in category k across the study region, and C = Σ k C k = Σ k Σ i c ik be the total number of observations in the whole study region.
where and are the MLEs of p k and q k under the alternative hypothesis. A "Pool-Adjacent-Violators" algorithm [25, 26] can be applied to compute and . After all the λ (Z)'s are obtained, get . A Monte Carlo based p-value can be obtained by randomization of the observations across the study region given the total number of observations in each category, C1, C2,..., C K .
The above gives a brief summary of Kulldorff's scan statistics under Bernoulli model, Poisson model, and ordinal model, respectively. As for other types of scan statistics in Kulldorff's scan statistics family, see Huang et al. (2007) for the exponential model , Kulldorff et al. (2007) for the multivariate scan model , Kulldorff et al. (2009) for the normal model , and Kulldorff et al. (2006) for elliptic window scans .
Kulldorff's method has been applied successfully in many applications. However, it requires assumptions on the underlying distribution (Bernoulli, Poisson, exponential, etc.) of the scan region and the knowledge of the number of events over the region of interest for Monte Carlo hypothesis testing. Moreover, It may take considerable amount of computation time to obtain the Monte Carlo based p-value.
Semiparametric Scan Statistics Method
Kedem and Wen (2007)  proposed a semiparametric scan statistics method for cluster detection using a density ratio model studied by Fokianos et al. (2001) and Qin and Zhang (1997) [30, 31]. Similar to Kulldorff's scanning procedure, the semiparametric scan statistics method also uses a movable variable-size window to scan the whole study region, but performs for each window a two-sample test. Because the moving window generates a large set of overlapping scan windows, it results in a large number of semiparametric test statistics and their corresponding p-values as well. The largest test statistic is defined as the semiparametric scan statistic. To take account of the multiple-testing problem induced by the multiple scanning windows, the original p-values are replaced by Storey's q-values, a type of controlling false discovery rate (FDR) methodology. Then the primary cluster candidate will be the window (subregion) corresponding to the largest test statistic or equivalently the smallest q-value. If this q-value is less than a pre-decided level (say 0.05), we claim the located cluster candidate is a true cluster. We describe briefly the semiparametric scan statistics method and the q-value FDR methodology in the following subsections.
Semiparametric Test Statistics
Here h(x) is a known vector-valued function of x which may take on a scalar form such as x or a vector-valued form such as (x, x2)', α1 is a scalar, but β1 can be a scalar or vector depending on h(x). Clearly, β1 = 0 implies α1 = 0. Therefore, β1 = 0 implies the null hypothesis of no cluster is accepted since the probability distribution inside and outside the scanning window are equal. See Kedem and Wen (2007)and Fokianos et al. (2001) [16, 30] for the general setup for m ≥ 2 samples.
where t i is i th observation in the combined sample t= (x1, x2)', and ρ1 = n1/n2.
where Σ is a special case of S -1 V S -1 under q = 1. See Appendix 2 for the definition of the matrices S and V.
Kedem and Wen (2007)  proposed three test statistics for testing the null hypothesis H0: β1 = 0, which means no cluster.
χ1 test statistic
where ρ1 = n1/n2 and Var [h(t)] is the covariance matrix of h(t) with respect to the reference distribution. It follows under H0 that χ1 is approximately distributed as Chi-square with v degrees of freedom, where v is the dimension of β1, which is the same as the dimension of the function h(x). The null hypothesis H0: β1 = 0 is rejected for large values of χ1. In practice, Var [h(t)] is replaced by its estimator [16, 30]. In addition, if the data are binary (0-1), the χ1 statistic can be simplified to a close form of equation (15) in the Appendix 1.
χ2 test statistic
tests for a general linear hypothesis Hθ= c, where θ= , w is the total number of samples excluding the reference sample. H is p' × [(1 + v)w)] predetermined matrix of rank v', v' < (1 + v)w, c is a vector in , and the variance-covariance matrix Σ = S -1 V S -1 . It follows under H0 that χ2 is asymptotically distributed as χ2 with (v') degrees of freedom provided the inverse exists , and H0 is rejected for large values. This test is useful when one wants to borrow information from other sources besides the current study region, although we did not use it in this paper.
Likelihood ratio test statistic
Under H0: β1 = 0, LR is asymptotically approximately distributed as Chi-square with v degrees of freedom, and H0 is rejected for large values. Recall v is the dimension of β1 and it depends on the choice of the function h(x). Similarly, when the data are 0-1 binary, the likelihood ratio statistic reduces to equation (16) in the Appendix 1.
Storey's FDR Method and q-value
Classification of m hypotheses tests
# of Accept
# of Reject
# of true nulls
# of true alternatives
In many cases, when m, the total number of hypotheses, is large, it easily to have significant ones, which makes Pr(R > 0) ≈ 1. Thus, pFDR (eq. 11) is close to FDR (eq. 10) in numerical value, but pFDR has some conceptual advantages. A thorough motivation of using pFDR rather than FDR can be found in Storey (2003) .
where Γ α is a nested significance region at level α . This means the q-value is the minimum pFDR at which the corresponding test may be called significant. In this way, q-value simultaneously takes into account the multiple testing problem.
In our semiparametric scan situation, we generate a lot of overlapping scan windows, and each window associates to a hypothesis test and a test statistic, so m here is usually large. Because of the large m, we adopted the algorithm in Storey et al. (2003)  to estimated the q-value for each scan window as follows:
5. Choose the region with the largest test statistic, for example, the largest likelihood ratio test statistic, as the primary cluster candidate, and its q-value is q(p(1)), the smallest q-value among all the tests. If q(p(1)) is less than a pre-decided false discovery rate, say q(p(1)) < 0.05, we claim there is clustering and the located primary candidate is a true cluster region.
The semiparametric approach has several advantages. First, the reference (or background) distribution, G(x), and all the parameters such as β1 are estimated from the combined data t, not just from a single sample either inside the window or outside the window. Second, for a properly chosen h(x), the above tests are quite powerful. The simulation results indicate that the χ1-test competes well with the corresponding F-test . Moreover, Gagnon (2005)  shows that for m = 2 the χ1-test can be more powerful than the common t-test for a known h(x) but unspecified distributions. Third, in testing equidistribution, other than an assumption regarding the tilt function h(t), the semiparametric density ratio method does not require distributional assumptions. Fourth, the semiparametric method can be applied to either continuous or discrete distributions. Fifth, since the asymptotic distributions of the above mentioned test statistics are known, in principle there is no need for the time consuming Monte-Carlo methods to compute the p-values. Lastly, Storey's q-value method can adjust the original p-values, and easily handles the multiple testing problem inherent in cluster detection.
Data Sets for the Power Comparison
Both Kulldorff's and the semiparametric scan statistics methods are suitable for both binary and non-binary data. So we use both types of data to perform the simulated power study. The following two subsections give a brief description of the binary and the non-binary data sets used in the study.
Binary Data Set — Northeastern USA Benchmark Data
For binary data scans, we use Kulldorff's Northeastern USA purely spatial benchmark data  which consist of 245 counties in northeastern United States, from Maine, New York, Rhode Island, Pennsylvania, Maryland, Washington DC, among others. Each county is graphically represented by its centroid coordinates. The case data, which is the number of people who have breast cancer, are aggregated to each county centroid with the total number of cases in northeastern states (the whole study region) being fixed. The population data of each county are based on the female population of the 1990 census. The benchmark data set contains two types of data, hot-spot clusters and global clustering data, generated in different ways.
Data are generated by a first-order clustering model, where cases are located independently of each other but the relative risk is different in different geographical areas. In this benchmark data set, the cluster region can be either a single region containing one or more counties, or a collection of multiple regions, where the risk of breast cancer is much higher than in the rest of area. Three types of clusters, rural, urban, and mixed, are generated depending on the location of the cluster. A rural cluster is the type of cluster where a small population relative to a large graphical area, such as the Grand Isle county in north Vermont close to the Canadian border. An urban cluster is the type of cluster which has a large population relative to a small graphical area, such as New York county which includes Manhattan. A mixed cluster is the type of cluster where a big city is surrounded by rural areas, such as Allegheny county in western Pennsylvania where Pittsburgh is. See Kulldorff et al. (2003)  for more detail.
Data are generated by purely second-order clustering model, where any one particular case is randomly located, but the location of cases are dependent on each other. Thus, under the alternative hypothesis of global clustering, cases are clustered wherever they occur in the region. In this benchmark data set, a certain number of cases are first generated to be randomly located throughout the whole northeastern states. These original cases then generate other new cases close by. If each original case generates one additional case, we call them twins; if two additional cases are generated, we call them triplets. The case generation is based on a global chain rN-nearest neighbor rule. The global chain is constructed by a Hamiltonian cycle chain which passes through as many counties as possible exactly once and any two counties next to each other on the chain always border each other graphically. For twins, the additional case is assigned to county j if Σ k I(d ik <d ij )n k <rN ≤ Σ k I(d ik ≤ d ij )n k , where n k is the population size of county k, N = Σ k n k is the total population size, r is some constant in the interval of (0,0.5) with the bigger r the broader geographical area to spread the cases, and d ij is the distance in one particular direction along the chain connecting county i and county j. For triplets, the two new cases are assigned in opposite directions along the chain. Data sets corresponding to different r were generated from a probability distribution or be determined in advance. Notice that although the data generation mechanism of the second-order clustering model is very different with the first-order clustering model, the resulting point patterns may look quite similar, and hence indistinguishable.
Both hot-spot and global clustering data sets in this benchmark data set have two separate groups of data with a total of 600 and 6000 simulated cases, respectively. For each group the same null hypothesis of no cluster is used where the relative risk for each county is equal, and the cases as well as their locations are independent of each other. In order to perform power comparison, 100000 random data sets with a total of 600 and 6000 cases were generated under the null hypothesis, respectively. They are used to estimate the critical cut-off point of significance for each group. For the alternative hypothesis of clusters which are called scenarios in this paper, 10000 random data sets per scenarios were generated to estimate the power using the previous determined cut-off points.
For each group of fixed total cases (600 or 6000 cases), Kulldorff generated 35 hot-spot clustering scenarios and 26 global clustering scenarios for his power comparison. For instance, a scenario of "rural and urban 600, size 4" means that the total of number of cases in the whole study region is 600 and the study region contains two hot-spot clusters. One cluster is in a rural region including four counties, and the other one is in an urban region including four counties also. A scenario of "global clustering twin 6000, exponential 0.02" means that the total number of cases is 6000 and the cases are generated under the alternative hypothesis of global clustering. The value r is not constant but randomly generated from an exponential distribution with mean 0.02 among the data set generated under this scenario.
We did not use all the scenarios in Kulldorff's Northeastern US benchmark data in this paper. Instead, we randomly chose one or two scenarios from each clustering pattern. Finally, nine hot-spot clustering scenarios and six global clustering scenarios were used in our power study for the binary data case. After selecting these scenarios, Kulldorff's method with the Poisson model (Bernoulli model is also appropriate) and the semiparametric method with the tilt function h(x) = x were applied to the same data sets in each scenario to compare the power. Because data are binary, the simplified semiparametric likelihood ratio test statistic (eq. 16) is used in the computation. All the tests are two-sided to detect for either high or low valued clusters.
Non-binary Data Set — Ordinal Categorical Data
Parameters for Generating the Ordinal Categorical Data from Quantized Normal
Inside the cluster
Outside the cluster
Quantized Normal II
μ = 13, σ2= 8
μ = 13, σ2= 4
Quantized Normal III
μ = 7.2, σ2= 13
μ = 6, σ2= 9
Quantized Normal III (small)
μ = 6.5, σ2= 13
μ = 6, σ2= 9
For each type of data, the average sample size within each state is around 130, hence a total number of 130 * 18 = 2340 observations in each generated data set. Cluster patterns included single cluster and multiple clusters for Quantized Normal II, Quantized Normal III, and Quantized Normal III (small) data, respectively. To perform the power comparison within a reasonable time scale, we used 100 runs, namely 100 random data sets, per cluster pattern in our power comparison study. The single clusters means only one state was randomly chosen as the cluster region. By multiple clusters we mean four states were randomly chosen simultaneously as the cluster region, where the four states are not necessarily contiguous. Notice that multiple clusters constitute a stronger clustering pattern which in general is easier to detect.
In our power comparison for non-binary ordinal categorical data, Kulldorff's method chose Poisson model (inappropriate model) and the ordinal model (correct model) using SaTScan v7.0.1 http://www.satscan.org/. 999 replications were used to decide the p-value under each model. The semiparametric method chose the vector tilt function h(x) = (x, x2)'. Observe that it is also appropriate to choose h(x) = (x, x2)' for binary data case, because in that case the coefficient of x2 term is 0. See Appendix 1 for detail.
We compared the power of Kulldorff's and the semiparametric scan statistics methods in detecting potential clusters. Because there are various cluster patterns, either single or multiple clusters, and each cluster region may contain one or more counties or states, for simplicity, in our power comparison we more focus on detecting the existence rather than the precise delineation of the cluster region. For instance, for a data set with a pattern of multiple cluster regions and multiple counties in each cluster region, we deem the detection successful whenever a significance (i.e., p-value or q-value less than 0.05) is obtained. We do not strictly require the detected cluster region to be exactly the same as originally simulated. The detected cluster region could fully or partially cover the desired area.
Comparison for Binary Data
For scenario 0, both Figures 3 and 4 show that the type I error of Kulldorff's method are all exactly 0.05 for both 600 cases and 6000 cases. This is expected since Kulldorff's method uses a Monte Carlo procedure to derive the cut-off point for the corresponding significance level. The power from the semiparametric method under the q-value significance level of 0.1 is 0.052 for the 600 cases, and 0.054 for the 6000 cases, almost equal to Kulldorff's type I error level. Notice that higher q-value significance level means in favor of the alternative. If the q-value significance level is lower down to 0.05, the type I error of the semiparametric method reduced to 0.027 for both cases. It shows that if one wants to make the power comparison under exactly the same type I error level, say 0.05, the q-value significance level must be set in the interval between 0.05 to 0.1. In addition, the type I error of the semiparametric method with Bonferroni correction is the lowest, which is not a surprise, since Bonferroni correction is most conservative.
For scenarios No. 1 to 9, both figures show that the two methods work very well in detecting hot-spot clusters, but Kulldorff's method seems to be slightly more powerful. When the study area has a stronger pattern of clustering, such as containing more cluster regions, the power of both methods increases, which demonstrates the validity of the two methods. For instance, scenario No. 6 in Figure 3 has a total of 600 cases and two cluster regions where each cluster region contains 16 counties. The power of Kulldorff's method is 0.996, whereas the power of semiparametric method with q-value significance level of 0.1 and 0.05 obtain the power 0.996 and 0.993, respectively, which is quite close to the power of Kulldorff's. The semiparametric method with Bonferroni correction is 0.968, which is expected to be the lowest, but still is quite reasonable.
For scenarios No. 10 to 15, there is no clear cut to say Kulldorff's method or semiparametric method is better, and both methods do not achieve high power as in hot-spot detection. This is so because both Kulldorff's and the semiparametric methods are not designed to detect global clustering pattern. It is also shown that for both methods, the larger the r is, the lower is the detection power. This is reasonable, because larger r means larger range for the data to spread, hence more difficult to form a local cluster pattern.
Comparison for Ordinal Categorical Data
For ordinal categorical data which are generated from the quantized normal II type, Figure 5 shows that the semiparametric method with the likelihood ratio test has the highest power of detecting potential clusters among all the tests. The semiparametric method with the χ1 test works well but not as good as the likelihood ratio test. This is in line with the limited power study in Kedem and Wen (2007)  which showed the likelihood ratio test was the most powerful tests among three tests from the semiparametric density ratio model. Moreover, when the clustering pattern is stronger, for instance, the study region contains multiple clusters, the power of χ1 test increases to 0.94, which is almost close to the power of the likelihood ratio test. The power of Kulldorff's method with Poisson model is very low, because Poisson model is designed to detect the difference in the mean not in the variance. Kulldorff's method with ordinal model is comparable to the semiparametric method with the χ1 test in detecting potential clusters.
For ordinal categorical data which are generated from the quantized normal III type, Figure 6 shows that the semiparametric method performs in the same way as in the quantized normal II case. The semiparametric method with the likelihood ratio test still has the highest power in both single and multiple clusters situation. The power of all tests increases as the clustering pattern becomes stronger. Kulldorff's method with the ordinal model performs equally well as compared to semiparametric method. Kulldorff's method with Poisson model "works" but still much less powerful than the other tests. This is because for the quantized normal III data, Kulldorff's Poisson model can detect the changes in the mean while still ignoring changes in the variance.
The part (a) of Figure 7 shows the power results when the difference (in the mean) between the cluster and non-cluster regions is small. The data are still ordinal categorical data generated from the quantized normal III type. However, this time the difference is that the mean in the cluster region was set to be close to the the non-cluster region, while the variances inside and outside the cluster region were kept the same as in the previous case (refer to Table 2 for the simulation parameters). In this way, the clusters were more difficult to detect. Not surprisingly, the power for the semiparametric method with the likelihood ratio test and Kulldorff's method with the ordinal model continue achieving the highest power, although it is much lower than the power in the previous case where the differences inside and outside the cluster region were substantial. The power of the χ1 test of the semiparametric method also decreases due to the weaker cluster pattern, and Kulldorff's method with the Poisson model yields the lowest power almost 0. The multiple cluster case is similar to the single cluster case but with a relatively higher power. Interestingly, if we look at the accuracy, which means detecting the true cluster region correctly with its exact size, the most accurate method as shown in the part (b) of Figure 7 is the semiparametric method with the likelihood ratio test statistic, which achieves more than 50% higher accuracy rate than Kulldorff's method with the ordinal model.
A Case Study — North Humberside childhood leukemia data set
Discussion and Conclusion
In this paper, We extended the semiparametric cluster detection method of Kedem and Wen (2007)  by incorporating Storey's q-value method, a type of FDR method, to overcome the multiple-testing problem raised by numerous scan windows of variable size. The results of the power study show that when detecting localized clusters, both the semiparametric and Kulldorff's method can achieve comparably good power. For binary population-case data, Kulldorff's method with Poisson model may have a slightly higher power than the semiparametric method with the tilt function h(x) = x. We may choose the tilt function as h(x) = (x, x2)' for binary data as well although the x2 term is not necessary. See Appendix 1 for a more detailed explanation. For non-binary data, such as ordinal categorical data, the semiparametric method with the tilt function h(x) = (x, x2)' is slightly more powerful than Kulldorff's method with the ordinal model. If Kulldorff's method is applied with an inappropriate model, it may fail to detect any potential clusters as shown in the simulation study for non-binary data. We also find that in our semiparametric method the likelihood ratio test seems to have higher power than the χ1 test in detecting potential clusters. When the localized clustering pattern is strong, for instance, multiple cluster regions or the difference inside and outside the cluster region is large, both tests obtain good power. When the clustering pattern is weak, that is, the difference is not that large, the likelihood ratio test seems to be acuter, while the χ1 test could be insensitive to the undesired fluctuations. In addition, the Likelihood ratio test only can test "either high or low values", but χ1 test can be easily transformed into a one-sided test  for two-sample comparison and scalar tilt function. In practice, it is prudent to use both tests for potential clusters. Potentially, if one wants to borrow information from other sources besides the current study region, the χ2-test of the semiparametric method could be used.
In the power study for non-binary data, we only used the ordinal categorical data, which are integer data, to compare the power of the two methods. However, Kulldorff's method also provide exponential model to analyze the survival data, which are mostly assumed to be exponentially distributed, and a normal model to analyze continuous data. A future study is to compare the power of the semiparametric method and Kulldorff's method for continuous data. But we may expect the semiparametric method will still work well, since the semiparametric density ratio model was originated from continuous data. In addition, the semiparametric model may have a more consistent setup than Kulldorff's method. Kullorff's method needs to choose different models, namely different scan statistics, for different types of data, whereas the semiparametric method requires no complete distributional assumptions except for the tilt function h(x). It could choose h(x) = (x, x2)' for many types of data, no matter continuous or discrete.
If the underlying distribution is known exactly, semiparametric could choose the appropriate tilt function h(x) to get the best performance. For instance, if the data are known from Bernoulli or Poisson distribution, we can use the tilt function h(x) = x; if the data are from normal, we can use h(x) = (x, x2)'; if the data are from Gamma distribution, we can use h(x) = (x; log x)'. A clue of how to choose a satisfactory h(x) for a given situation can be derived from common exponential families. More examples can be found in Kedem and Wen (2007) or Kay and Little (1987) [16, 42]. In addition, Fokianos et. al. (2001)  suggested that h(x) could be approximated by a polynomial, B splines, or kernel estimation. If the underlying distribution is not known, there could be a problem of the misspecified tilt function. Fokianos and Kaimi has discussed that a misspecified h(x) could decrease the power of the corresponding tests [43, 44]. Yet, there are examples where very different choices of h(x) could lead to similar test results. For instance, in an application to meteorological data in Kedem et al. (2004)  the choice of h(x) = x or h(x) = log x led to very similar results. It seems that for a non-homogeneous regional variance the choice of h(x) = (x, x2)' suggested by the normal distribution is sensible.
The semiparametric density ratio model essentially tests the homogeneity or equidistribution of two or more samples, therefore, besides Kulldorff's circular scan window, the semiparametric method may also adapt to other shape of the scanning window or scanning schemes, such as the Kulldorff's elliptic window scan, irregular shape upper-level set scan, or flexible scan, etc. More precisely, in scanning for clusters, and regardless of the regular or irregular shape of the scanning window, as long as the window separates the whole study region into two samples, one inside the window and one outside the window, the semiparametric method can be applied. However, it also makes the semiparametric method ignore the information about the location of the cases except whether the case is inside or outside the current evaluated window. Thus the semiparametric method may not have good power for global type clustering as shown in scenarios 10 to 15 in Figure 3 and 4 where clustering occurs throughout the study region.
A limiting factor of this power study might be that in the power study for non-binary type data we only used 100 runs at one point for each power comparison. That is because it is tedious and time consuming to run SatScan software and document the results manually. In addition, the semiparametric method also takes longer time to numerically estimate the parameters α and β especially when the combined sample size is big, while for binary scan the MLEs of α and β have already been obtained in closed forms. However, although 100-run does not sound like a large number in simulation, the results are still meaningful. From Figures 5 and 6, it is already clear to show the semiparametric method achieves comparable good power as Kulldorff's method with the correct model achieves. If Kulldorff's method is applied with an inappropriate model, the power may decrease a lot. Figure 7 demonstrates that when the difference between the cluster and non-cluster region is small, the detecting power for both methods decrease, but the likelihood ratio test from the semiparametric method seems have better accuracy in detecting clusters.
A last note is about the π0 in Storey's q-value method. The critical part of the q-value method of controlling the false discovery rate is to give a good estimate of π0, the proportion of the true null hypotheses among all the tests. The current method we used in this study is based on the algorithm suggested by Storey et al. (2003)  and assume the distribution of p-value from each test is uniform over the (0,1) interval. However, Yang (2004)  pointed out that if the p-values were not uniformly distributed, the power of the q-value method may decrease. He suggested to compute a weighted average of π0 from the distribution of the raw p-values which are greater than a threshold (say 0.4). Thus it may give a better control and more robust estimate of π0.
Appendix 1 — Simplified Semiparametric Test Statistics for Binary Data
If the data are 0-1 binary, such as cancer or not cancer, we can simplify the likelihood and obtain closed forms for the estimated parameters. Recall:
N G : The combined sample size for the whole study region.
n G : The number of cases in the whole study region.
N Z : The sample size within the scan window.
n G : The number of cases within the scan window.
ti: The i th observation from the combined sample in the whole study region, i = 1,..., N G .
ρ1: The relative sample size, which is equal to N Z /(N G - N Z ).
x Zj : The j th observation within the scan window Z, j = 1,..., N Z .
If β1 = 0, it implies α1 = 0 and eα 1+β 1= 1, which means the relative rates inside and outside the scan window are equal.
Where and is as in equation (14).
the x2 term is confounded with x, and β12 is not estimable. Therefore, tilt function h(x) = x with parameters α1, β1 are sufficient for binary case.
Appendix 2 — Derivation of S, V
The last term is 0 for j ≠ j' and (n j /n)Var [h(∈j 1)] for j = j'.
Special thanks to Professor Martin Kulldorff for all of his help during this research. Also thanks for the valuable comments from the three reviewers.
- Pickle LW, Feuer EJ, Edwards BK: U.S. Predicted Cancer Incidence, 1999: Complete Maps by County and State From Spatial Projection Models. National Cancer Institute, Cancer Surveillance Monograph No. 5. 2003, NIH Publication No. 03-5435.Google Scholar
- Glaz J, Balakrishnan N: Scan Statistics and Applications. 1999, New York: SpringerView ArticleGoogle Scholar
- Glaz J, Naus J, Wallenstein S: Scan statistics. 2001, New York: SpringerView ArticleGoogle Scholar
- Naus JI: The distribution of the size of the maximum cluster of points on a line. Journal of the American Statistical Association. 1965, 60: 532-538. 10.2307/2282688.View ArticleGoogle Scholar
- Shmueli G, Fienberg SE: Current and potential statistical methods for monitoring multiple data streams for bio-surveillance. Statistical Methods in Counter-Terrorism: Game Theory, Modeling, Syndromic Surveillance, and Biometric Authentication. Edited by: Wilson A, Wilson G, Olwell DH. 2006, New York: Springer, 109-140.View ArticleGoogle Scholar
- Patil GP, Taillie C: Upper level set scan statistic for detecting arbitrarily shaped hotspots. Environmental and Ecological Statistics. 2004, 11: 183-197. 10.1023/B:EEST.0000027208.48919.7e.View ArticleGoogle Scholar
- Patil GP, Rathbun SL, Acharya R, Patankar P, Modarres R: Geoinformatic surveillance of hotspot detection, prioritization and early warning. Proceedings of the 2005 National Conference on Digital Government Research. 2005, 301-302.Google Scholar
- Modarresa R, Patil GP: Hotspot Detection with Bivariate Data. Journal of Statistical Planning and Inference. 2007, 11: 3643-3654. 10.1016/j.jspi.2007.03.040.View ArticleGoogle Scholar
- Tango T, Takahashi K: A flexibly shaped spatial scan statistic for detecting clusters. International Journal of Health Geographics. 2005, 4: 11-25. 10.1186/1476-072X-4-11.PubMedPubMed CentralView ArticleGoogle Scholar
- Takahashi K, Kulldorff M, Tango T, Yih K: A flexibly shaped space-time scan statistic for disease outbreak detection and monitoring. International Journal of Health Geographics. 2008, 7: 14-27. 10.1186/1476-072X-7-14.PubMedPubMed CentralView ArticleGoogle Scholar
- Kulldorff M: A spatial scan statistic. Communication in Statististics: Theory and methods. 1997, 26: 1481-1496. 10.1080/03610929708831995.View ArticleGoogle Scholar
- Kulldorff M: Spatial scan statistics: models, calculations, and applications. Scan Statistics and Applications. Edited by: Glaz J, Balakrishnan N. 1999, Boston: Birkhäuser, 303-322.View ArticleGoogle Scholar
- Kulldorff M: Prospective time periodic geographical disease surveillance using a scan statistic. Journal of the Royal Statistical Society A. 2001, 164 (part 1): 61-72. 10.1111/1467-985X.00186.View ArticleGoogle Scholar
- Huang L, Kulldorff M, Gregorio D: A spatial scan statistic for survival data. Biometrics. 2007, 63 (1): 109-118. 10.1111/j.1541-0420.2006.00661.x.PubMedView ArticleGoogle Scholar
- Jung I, Kulldorff M, Klassen AC: A spatial scan statistic for ordinal data. Statistics in Medicine. 2007, 26 (7): 1594-1607. 10.1002/sim.2607.PubMedView ArticleGoogle Scholar
- Kedem B, Wen S: Semiparametric cluster detection. Journal of Statistical Theory and Practice. 2007, 1: 49-72.View ArticleGoogle Scholar
- Waller LA, Lawson AB: The power of focused tests to detect disease clustering. Statistics in Medicine. 1995, 14: 2291-2308. 10.1002/sim.4780142103.PubMedView ArticleGoogle Scholar
- Alexander F, Cartwright R, McKinney PA, J RT: Investigation of spatial clustering of rare diseases: childhood malignancies in North Humberside. Journal of Epidemiology and Community Health. 1990, 44: 39-46. 10.1136/jech.44.1.39.PubMedPubMed CentralView ArticleGoogle Scholar
- Storey JD: A direct approach to false discovery rates. Journal of the Royal Statistical Society Series B. 2002, 64: 479-498. 10.1111/1467-9868.00346.View ArticleGoogle Scholar
- Storey JD: The positive false discovery rate: a bayesian interpretation and the q-value. Annals of Statistics. 2003, 31: 2013-2035. 10.1214/aos/1074290335.View ArticleGoogle Scholar
- Benjamini Y, Hochberg Y: Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society Series B. 1995, 57 (1): 289-300.Google Scholar
- Kulldorff M, Tango T, Park PJ: Power comparison for disease clustering tests. Computational Statistics & Data Analysis. 2003, 42: 665-684. 10.1016/S0167-9473(02)00160-3.View ArticleGoogle Scholar
- Kulldorff M, Heffernan R, Hartman J, Assunção RM, Mostashari F: A space-time permutation scan statistic for the early detection of disease outbreaks. PLoS Medicine. 2005, 2: 216-224. 10.1371/journal.pmed.0020059.View ArticleGoogle Scholar
- Dwass M: Modified randomization tests for nonparametric hypotheses. The Annuals of Mathematical Statistics. 1957, 28: 181-187. 10.1214/aoms/1177707045.View ArticleGoogle Scholar
- Barlow RE, Bartholomew DJ, Bremner JM, Brunk HD: Statistical Inference under Order Restrictions. 1974, New York: WileyGoogle Scholar
- Dykstra R, Kochar S, Robertson T: Inference for likelihood ratio ordering in the two-sample problem. Journal of the American Statistical Association. 1995, 90 (431): 1034-1040. 10.2307/2291340.View ArticleGoogle Scholar
- Kulldorff M, Mostashari F, Duczmal L, Yih WK, Kleinman K, Platt R: Multivariate scan statistics for disease surveillance. Statistics in Medicine. 2007, 26 (8): 1824-1833. 10.1002/sim.2818.PubMedView ArticleGoogle Scholar
- Kulldorff M, Huang L, Konty K: A spatial scan statistic for normally distributed data. 2009,http://populationmedicine.net/content/docs/normalscan-posted.pdfGoogle Scholar
- Kulldorff M, Huang L, Pickle L, Duczmal L: An elliptic spatial scan statistic. Statistics in Medicine. 2007, 25 (22): 3929-3943. 10.1002/sim.2490.View ArticleGoogle Scholar
- Fokianos K, Kedem B, Qin J, Short DA: A semiparametric approach to the one-way layout. Technometric. 2001, 43: 56-65. 10.1198/00401700152404327.View ArticleGoogle Scholar
- Qin J, Zhang B: A goodness of fit test for logistic regression models based on case-control data. Biometrika. 1997, 84: 609-618. 10.1093/biomet/84.3.609.View ArticleGoogle Scholar
- Qin J, Lawless JF: Empirical likelihood and general estimating equations. The Annals of Statistics. 1994, 22: 300-325. 10.1214/aos/1176325370.View ArticleGoogle Scholar
- Sen PK, Singer JM: Large sample method in statistics, An introduction with Applications. 1993, London: Chapman and Hall/CRCView ArticleGoogle Scholar
- Reiner A, Yekutieli D, Benjamini Y: Identifying differentially expressed genes using false discovery rate controlling procedures. Bioinformatics. 2003, 19 (3): 368-375. 10.1093/bioinformatics/btf877.PubMedView ArticleGoogle Scholar
- Storey JD, Tibshirani R: Statistical significance for genome-wide studies. Proceedings of the National Academy of Sciences. 2003, 100: 9440-9445. 10.1073/pnas.1530509100.View ArticleGoogle Scholar
- Tsai C, Hsueh M, Chen J: Estimation of False Discovery Rates in Multiple Testing: Application to Gene Microarray Data. Biometrics. 2003, 59 (4): 1071-1081. 10.1111/j.0006-341X.2003.00123.x.PubMedView ArticleGoogle Scholar
- Storey JD, Taylor JE, Siegmund D: Strong control, conservative point estimation, and simultaneous conservative consistency of false discovery rates: A unified approach. Journal of the Royal Statistical Society Series B. 2004, 66: 187-205. 10.1111/j.1467-9868.2004.00439.x.View ArticleGoogle Scholar
- Gagnon R: Certain Computational Aspects of Power Efficiency and of State Space Models. PhD thesis, University of Maryland, College Park, Department of Mathematics. 2005Google Scholar
- Cuzick J, Edwards R: Spatial clustering for inhomogeneous populations. Journal of the Royal Statistical Society B. 1990, 52: 73-104.Google Scholar
- McKinney PA, Alexander FE, Cartwright RA, Parker L: Parental occupations of children with leukemia in west Cumbria, North Humberside, and Gateshead. British Medical Journal. 1991, 302: 681-687. 10.1136/bmj.302.6778.681.PubMedPubMed CentralView ArticleGoogle Scholar
- Colt JS, Blair A: Parental Occupational Exposures and Risk of Childhood Cancer. Environmental Health Perspectives Supplements. 1998, 106 (S3): 909-925. 10.2307/3434208.View ArticleGoogle Scholar
- Kay R, Little S: Transformations of the explanatory variables in the logistic regression model for binary data. Biometrika. 1987, 74: 495-501. 10.1093/biomet/74.3.495.View ArticleGoogle Scholar
- Fokianos K, I K: On the effect of misspecifying the density of ratio model. Annals of the Institute of Statistical Mathematics. 2006, 58 (3): 475-497. 10.1007/s10463-005-0022-8.View ArticleGoogle Scholar
- Fokianos K: Density ratio model selection. Journal of Statistical Computations and Simulation. 2007, 77 (9): 805-819. 10.1080/10629360600673857.View ArticleGoogle Scholar
- Kedem B, Wolff DB, Fokianos K: Statistical comparison of algorithms. IEEE Transactions on Instrumentation and Measurement. 2004, 53: 770-776. 10.1109/TIM.2004.827301.View ArticleGoogle Scholar
- Yang X: Qvalue methods may not always control false discovery rate in genomic applications. Proceedings of the 2004 IEEE Computational Systems Bioinformatics Conference. 2004, 556-557.Google Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.