Methodology  Open  Published:
A semiparametric cluster detection method — a comprehensive power comparison with Kulldorff's method
International Journal of Health Geographicsvolume 8, Article number: 73 (2009)
Abstract
Background
A semiparametric density ratio method which borrows strength from two or more samples can be applied to moving window of variable size in cluster detection. The method requires neither the prior knowledge of the underlying distribution nor the number of cases before scanning. In this paper, the semiparametric cluster detection procedure is combined with Storey's qvalue, a type of controlling false discovery rate (FDR) method, to take into account the multiple testing problem induced by the overlapping scanning windows.
Results
It is shown by simulations that for binary data, using Kulldorff's Northeastern benchmark data, the semiparametric method and Kulldorff's method performs similarly well. When the data are not binary, the semiparametric methodology still works in many cases, but Kulldorff's method requires the choices of a correct probability model, namely the correct scan statistic, in order to achieve comparable power as the semiparametric method achieves. Kulldorff's method with an inappropriate probability model may lose power.
Conclusions
The semiparametric method proposed in the paper can achieve good power when detecting localized cluster. The method does not require a specific distributional assumption other than the tilt function. In addition, it is possible to adapt other scan schemes (e.g., elliptic spatial scan, flexible shape scan) to search for clusters as well.
Background
Scan statistics arise when scanning in time, or space, or both, looking for clusters of certain events or cases. Here the cluster can be defined as a certain spatial or temporal subregion where the the probability distribution of an event is different from that in the rest of the region. More generally, a cluster is a subregion where the behavior of an observable is different from the behavior of the observable in the rest of the region. For instance, a subregion comprised of several counties with higher rate of one certain type of cancer than the rate of other counties in the study region defines a cancer cluster [1]. Besides epidemiology, more examples of clusters can be found in many other fields, including criminology, genetics, mining, astronomy, and so on. See Glaz and Balakrishnan (1999), Glaz et al. (2001) [2, 3].
As a contemporary research topic, research on scan statistics can be traced back to the 1960's [4]. Since then it came out many scan statistics methods, such as scan statistics with Poisson assumptions [3], scan statistics in purely temporal domain [5], Upperlevel set scan [6–8], Flexible shaped scan [9, 10], etc. Among them, Kulldorff's spatial scan statistics has been one of the most popular methods and has been applied in many scientific fields for detecting either purely spatial or higher dimensions clusters, e.g. spacetime scan [11–15]. However, Kulldorff's method requires assumptions on the underlying distribution (Bernoulli, Poisson, exponential, etc.) of the scan region. If an inappropriate model is chosen, the power of cluster detection may significantly decrease [16]. In addition, the method requires knowledge of the number of events over the region of interest for Monte Carlo hypothesis testing.
Kedem and Wen (2007) [16] proposed a semiparametric cluster detection procedure which requires much less than complete distributional assumptions, and which does not require the number of cases prior to scanning. They also performed a limited power study showing that when data are largely compatible with the Bernoulli or Poisson assumptions, the semiparametric method could be as powerful as the focused test [17] and as Kulldorff's method, but the semiparametric method could achieve higher power when data deviate from the Bernoulli or Poisson assumptions. However, the power study did not involve a scanning stage where the moving window scan the whole study region for cluster candidates. It was assumed that the cluster was already known, and only wanted to compare which method, either Kulldorff's method or the semiparametric one, had higher power in detecting the difference correctly between the cluster region and the background. Hence, the power study was called limited. Although the case study for the North Humberside childhood leukemia data set [11, 18] of the paper had the scan stage, the significant level was not adjusted for multiple testing yet.
In this paper, we extended the original semiparametric cluster detection procedure by incorporating Storey's qvalue method [19, 20], a type of false discovery rate (FDR) methodology [21], to take into account the multiple testing problem inherent in cluster detection. A comprehensive power study was then conducted using Kulldorff's northeastern benchmark data set [22] for binary power comparison and generated quantized normal data for nonbinary power comparison. The results show that for binary data the semiparametric method and Kulldorff's method are quite comparable. However, when data are not binary, the semiparametric method performs well regardless of the probability model whereas Kulldorff's method requires the correct model in order to get a comparably good power. When the chosen model is inappropriate, Kulldorff's method may not achieve good power. This findings are also consistent with the limited power study performed in Kedem and Wen (2007) [16]. The extended semiparametric method was again illustrated by the North Humberside childhood leukemia data set.
Methods
Kulldorff's Scan Statistics
Kulldorff's scan statistics method uses a large collection of overlapping scan windows to detect clusters, both the location and the size, and evaluate their significance. See Figure 1 for an illustration. For spatial data, the method first imposes a circular window on a map and let the circle centroid move across the study region. For any given centroid, the radius of the window varies continuously from zero to some upper limit. Usually this upper limit is set to be the radius which covers 50% of the whole study region or population. In this way, the method generates a large set of scan windows ℤ with different centroid and size. Under the null hypothesis of no cluster, the underlying behavior of the data throughout the whole study region is the same. Under the alternative hypothesis, there is at least one scan window for which the underlying behavior inside the window is different as compared with its complement, which means any scan window Z (Z ∈ ℤ) could be a potential cluster. In practice, some data are updated periodically. In such cases, spacetime scans are used [13, 23]. The scanning procedure of spacetime scans is almost identical to the purely spatial scan, except that the scan window becomes a threedimensional cylinder (see part (b) of Figure 1) instead of twodimensional circle. Since the statistical formulation of spacetime scan is identical to the twodimensional case, we only describe the 2D purely spatial scan in this paper.
By Kulldorff's scan statistics method, each scan window Z is associated with a likelihood ratio test statistic λ (Z) which can be computed based on the chosen underlying probability model and the observed data. The scan window associated with the maximum λ (Z) is defined as the primary cluster candidate occurring not by random chance. The maximum likelihood ratio itself is called the Kulldorff's spatial scan statistic, and the null hypothesis is rejected for large value. After the spatial scan statistic and the primary cluster candidate are determined, a Monte Carlo hypothesis procedure or a permutation test procedure is executed to generate the probability distribution of Kulldorff's scan statistic under the null hypothesis of no cluster in the study region, and a pvalue is obtained [23, 24].
For different types of data, Kulldorff proposed different scan statistics based on the corresponding probability models.
Bernoulli model
Bernoullibased scan statistic is used when individual entities have only two states such as an individual person having cancer or not. The part (a) of Figure 1 shows a typical setup of Kulldorff's scan statistics method. The following notation is used.

G: the whole study region

Z: the scan window

μ(G): the total number of individual entities (e.g. people) in G

μ(Z): the number of individual entities in Z

n_{ G }: the total number of events in G

n_{ Z }: the number of events inside Z

p: the rate of events that occurred inside the scan window Z

q: the rate of events that occurred outside the scan window Z
We test the null hypothesis H_{0}: p = q of no cluster. It can be shown that, for an alternative hypothesis that there is a hot spot cluster p > q, each scan window Z invokes a likelihood ratio test statistic as in equation (1):
Where . If we were scanning for cold spots, then "> " would change to "< "; if for either hot or cold spots, then it would be "≠ " (Kulldorff, 1999). After all the λ (Z) are computed, we determine the primary cluster candidate from , and reject the null for large values of . A Monte Carlo based pvalue can be obtained from randomization of the cases across the whole study region given the total number of cases.
Poisson model
Poissonbased scan statistic is used for the comparison of the number of cases inside and outside a scan window when searching for clusters. Suppose we have a study region composed of I subregions. For i = 1, 2,..., I, assume the number of events x_{ i }which occur in an "interval" μ (A_{ i }) is a Poisson process with intensity rate p inside the scan window and q outside, respectively. Similarly to the Bernoulli model, the likelihood ratio test statistic for each scan window Z for testing H_{0}: p = q is shown in equation (2):
where n_{ Z }is the observed number of cases inside the scan window Z, and e_{ Z }is the expected number of cases inside the scan window Z under the null hypothesis of no cluster. As before, if we were scanning for a cluster other than hot spot, we simply change the inequality sign as needed. After getting all the λ (Z)'s, find and the Monte Carlo based pvalue. When p is small, e.g. a rare disease, the Poisson model gives a close approximation to the Bernoulli model and obtains very similar results [11].
Ordinal model
Jung et al. (2006) [15] proposed an ordinal model, which is used when individual entities have K > 2 ordinal categories such as the different stages of a kidney disease, stage 1, 2, 3, 4 and 5. A higher category may reflect a more serious cancer stage. Suppose the study region consists of I subregions and the variable of interest is recorded in K categories. Each individual in the analytical sample falls into one category. Let c_{ ik }be the number of individuals in location i who fall into category k, where i = 1, 2,..., I and k = 1, 2,..., K. Let C_{ k }= Σ_{ i }c_{ ik }be the number of observations in category k across the study region, and C = Σ_{ k }C_{ k }= Σ_{ k }Σ_{ i }c_{ ik }be the total number of observations in the whole study region.
The null hypothesis of no cluster in this model means p_{1} = q_{1}, ..., p_{ k }= q_{ k }, where p_{ k }and q_{ k }are the unknown probabilities that an observation belongs to category k inside and outside the scanning window, respectively. To detect subregions with high rates of higher stages as compared with the rest of the area, one possible alternative hypothesis could be . Obviously when K = 2, the ordinal model set up reduces to the Bernoulli model. Following similar scan procedures as in Bernoulli and Poisson models, the likelihood ratio test statistic for each scan window is:
where and are the MLEs of p_{ k }and q_{ k }under the alternative hypothesis. A "PoolAdjacentViolators" algorithm [25, 26] can be applied to compute and . After all the λ (Z)'s are obtained, get . A Monte Carlo based pvalue can be obtained by randomization of the observations across the study region given the total number of observations in each category, C_{1}, C_{2},..., C_{ K }.
The above gives a brief summary of Kulldorff's scan statistics under Bernoulli model, Poisson model, and ordinal model, respectively. As for other types of scan statistics in Kulldorff's scan statistics family, see Huang et al. (2007) for the exponential model [14], Kulldorff et al. (2007) for the multivariate scan model [27], Kulldorff et al. (2009) for the normal model [28], and Kulldorff et al. (2006) for elliptic window scans [29].
Kulldorff's method has been applied successfully in many applications. However, it requires assumptions on the underlying distribution (Bernoulli, Poisson, exponential, etc.) of the scan region and the knowledge of the number of events over the region of interest for Monte Carlo hypothesis testing. Moreover, It may take considerable amount of computation time to obtain the Monte Carlo based pvalue.
Semiparametric Scan Statistics Method
Kedem and Wen (2007) [16] proposed a semiparametric scan statistics method for cluster detection using a density ratio model studied by Fokianos et al. (2001) and Qin and Zhang (1997) [30, 31]. Similar to Kulldorff's scanning procedure, the semiparametric scan statistics method also uses a movable variablesize window to scan the whole study region, but performs for each window a twosample test. Because the moving window generates a large set of overlapping scan windows, it results in a large number of semiparametric test statistics and their corresponding pvalues as well. The largest test statistic is defined as the semiparametric scan statistic. To take account of the multipletesting problem induced by the multiple scanning windows, the original pvalues are replaced by Storey's qvalues, a type of controlling false discovery rate (FDR) methodology. Then the primary cluster candidate will be the window (subregion) corresponding to the largest test statistic or equivalently the smallest qvalue. If this qvalue is less than a predecided level (say 0.05), we claim the located cluster candidate is a true cluster. We describe briefly the semiparametric scan statistics method and the qvalue FDR methodology in the following subsections.
Semiparametric Test Statistics
Using Kulldorff's circular window scan scheme, consider a scan window which separates the study region into two parts. This results in two samples,
where x_{1} is the sample inside the scanning window with sample size n_{1}, and the x_{2} is the sample outside the scanning window with sample size n_{2}, g_{ j }(x) is the probability density function of x_{ ji }, j = 1, 2, i = 1,..., n_{ j }. Choosing the sample outside the scanning window as the reference sample and g_{2}(x) as the reference density, it is assumed that the density ratio between the density inside the scanning window and the reference density has an exponential form
Here h(x) is a known vectorvalued function of x which may take on a scalar form such as x or a vectorvalued form such as (x, x^{2})', α_{1} is a scalar, but β_{1} can be a scalar or vector depending on h(x). Clearly, β_{1} = 0 implies α_{1} = 0. Therefore, β_{1} = 0 implies the null hypothesis of no cluster is accepted since the probability distribution inside and outside the scanning window are equal. See Kedem and Wen (2007)and Fokianos et al. (2001) [16, 30] for the general setup for m ≥ 2 samples.
Following the profiling procedure discussed in Qin and Lawless (1994) [32], Qin and Zhang (1997), and Fokianos et al. (2001), we obtain the loglikelihood:
where t_{ i }is i th observation in the combined sample t= (x_{1}, x_{2})', and ρ_{1} = n_{1}/n_{2}.
It follows from the likelihood (4) that the maximum likelihood estimators and from the combined sample are asymptotically normal as n →∞,
where Σ is a special case of S^{1}V S^{1} under q = 1. See Appendix 2 for the definition of the matrices S and V.
The maximum likelihood estimator of the reference (background) distribution function from the combined sample is given as,
Kedem and Wen (2007) [16] proposed three test statistics for testing the null hypothesis H_{0}: β_{1} = 0, which means no cluster.
χ_{1} test statistic
where ρ_{1} = n_{1}/n_{2} and Var [h(t)] is the covariance matrix of h(t) with respect to the reference distribution. It follows under H_{0} that χ_{1} is approximately distributed as Chisquare with v degrees of freedom, where v is the dimension of β_{1}, which is the same as the dimension of the function h(x). The null hypothesis H_{0}: β_{1} = 0 is rejected for large values of χ_{1}. In practice, Var [h(t)] is replaced by its estimator [16, 30]. In addition, if the data are binary (01), the χ_{1} statistic can be simplified to a close form of equation (15) in the Appendix 1.
χ_{2} test statistic
tests for a general linear hypothesis Hθ= c, where θ= , w is the total number of samples excluding the reference sample. H is p' × [(1 + v)w)] predetermined matrix of rank v', v' < (1 + v)w, c is a vector in , and the variancecovariance matrix Σ = S^{1}V S^{1}. It follows under H_{0} that χ^{2} is asymptotically distributed as χ_{2} with (v') degrees of freedom provided the inverse exists [33], and H_{0} is rejected for large values. This test is useful when one wants to borrow information from other sources besides the current study region, although we did not use it in this paper.
Likelihood ratio test statistic
Under H_{0}: β_{1} = 0, LR is asymptotically approximately distributed as Chisquare with v degrees of freedom, and H_{0} is rejected for large values. Recall v is the dimension of β_{1} and it depends on the choice of the function h(x). Similarly, when the data are 01 binary, the likelihood ratio statistic reduces to equation (16) in the Appendix 1.
Storey's FDR Method and qvalue
Controlling the false discovery rate (FDR) is a less conservative way to handle multiple testing problems. It was first proposed by Benjamini and Hochberg (1995) [21]. Since then, the FDR methodology has been further developed and applied to many fields, especially in genomic research [34–36]. FDR is defined to be the expected proportion of falsely rejected hypotheses (false positives) as in equation (10),
where V and R are defined in Table 1. From the table it is clear that if m = m_{0}, which means all the null hypotheses are true, then FDR is equivalent to the familywise error rate (FWER). To see that, recalling FWER is defined as P(V ≥ 1), m = m_{0} makes S = 0, and hence = 1 for all R > 0, and therefore FDR = 1 * P(R > 0) = P(V ≥ 1) = FWER. If m_{0} <m, then FDR ≤ FWER, which means a potential gain in power at the cost of increasing the likelihood of making type I errors [21].
Storey (2002) and Storey et al. (2004) [19, 37] improved the original Benjamini and Hochberg's FDR methodology by estimating the π_{0} = m_{0}/m, the proportion of true null hypotheses. In addition, Storey used positive FDR (pFDR) as in equation (11) instead of FDR as in equation (10).
In many cases, when m, the total number of hypotheses, is large, it easily to have significant ones, which makes Pr(R > 0) ≈ 1. Thus, pFDR (eq. 11) is close to FDR (eq. 10) in numerical value, but pFDR has some conceptual advantages. A thorough motivation of using pFDR rather than FDR can be found in Storey (2003) [20].
The qvalue is defined to be the pFDR analogue of the pvalue. It gives the error measurement with respect to pFDR for each observed test statistic of each particular hypothesis. More precisely, the qvalue of one particular observed test statistic T = t from a set of tests can be defined to be
where Γ_{ α }is a nested significance region at level α [19]. This means the qvalue is the minimum pFDR at which the corresponding test may be called significant. In this way, qvalue simultaneously takes into account the multiple testing problem.
In our semiparametric scan situation, we generate a lot of overlapping scan windows, and each window associates to a hypothesis test and a test statistic, so m here is usually large. Because of the large m, we adopted the algorithm in Storey et al. (2003) [35] to estimated the qvalue for each scan window as follows:
1. Obtain the pvalue for each scan window, and sort
2. Estimate π_{0} using a cubic spline function. First, for a range of λ, say λ = 0, 0.01,...,0.95, calculate
Then let be the natural cubic spline of . Finally, set the estimate of π_{0} to be .
3. Calculate
4. For i = m  1, m  2,...,1, calculate the estimated qvalue for the i th most significant one as
5. Choose the region with the largest test statistic, for example, the largest likelihood ratio test statistic, as the primary cluster candidate, and its qvalue is q(p(1)), the smallest qvalue among all the tests. If q(p(1)) is less than a predecided false discovery rate, say q(p(1)) < 0.05, we claim there is clustering and the located primary candidate is a true cluster region.
The semiparametric approach has several advantages. First, the reference (or background) distribution, G(x), and all the parameters such as β_{1} are estimated from the combined data t, not just from a single sample either inside the window or outside the window. Second, for a properly chosen h(x), the above tests are quite powerful. The simulation results indicate that the χ_{1}test competes well with the corresponding Ftest [30]. Moreover, Gagnon (2005) [38] shows that for m = 2 the χ_{1}test can be more powerful than the common ttest for a known h(x) but unspecified distributions. Third, in testing equidistribution, other than an assumption regarding the tilt function h(t), the semiparametric density ratio method does not require distributional assumptions. Fourth, the semiparametric method can be applied to either continuous or discrete distributions. Fifth, since the asymptotic distributions of the above mentioned test statistics are known, in principle there is no need for the time consuming MonteCarlo methods to compute the pvalues. Lastly, Storey's qvalue method can adjust the original pvalues, and easily handles the multiple testing problem inherent in cluster detection.
Data Sets for the Power Comparison
Both Kulldorff's and the semiparametric scan statistics methods are suitable for both binary and nonbinary data. So we use both types of data to perform the simulated power study. The following two subsections give a brief description of the binary and the nonbinary data sets used in the study.
Binary Data Set — Northeastern USA Benchmark Data
For binary data scans, we use Kulldorff's Northeastern USA purely spatial benchmark data [22] which consist of 245 counties in northeastern United States, from Maine, New York, Rhode Island, Pennsylvania, Maryland, Washington DC, among others. Each county is graphically represented by its centroid coordinates. The case data, which is the number of people who have breast cancer, are aggregated to each county centroid with the total number of cases in northeastern states (the whole study region) being fixed. The population data of each county are based on the female population of the 1990 census. The benchmark data set contains two types of data, hotspot clusters and global clustering data, generated in different ways.
Hotspot clusters
Data are generated by a firstorder clustering model, where cases are located independently of each other but the relative risk is different in different geographical areas. In this benchmark data set, the cluster region can be either a single region containing one or more counties, or a collection of multiple regions, where the risk of breast cancer is much higher than in the rest of area. Three types of clusters, rural, urban, and mixed, are generated depending on the location of the cluster. A rural cluster is the type of cluster where a small population relative to a large graphical area, such as the Grand Isle county in north Vermont close to the Canadian border. An urban cluster is the type of cluster which has a large population relative to a small graphical area, such as New York county which includes Manhattan. A mixed cluster is the type of cluster where a big city is surrounded by rural areas, such as Allegheny county in western Pennsylvania where Pittsburgh is. See Kulldorff et al. (2003) [22] for more detail.
Global clustering
Data are generated by purely secondorder clustering model, where any one particular case is randomly located, but the location of cases are dependent on each other. Thus, under the alternative hypothesis of global clustering, cases are clustered wherever they occur in the region. In this benchmark data set, a certain number of cases are first generated to be randomly located throughout the whole northeastern states. These original cases then generate other new cases close by. If each original case generates one additional case, we call them twins; if two additional cases are generated, we call them triplets. The case generation is based on a global chain rNnearest neighbor rule. The global chain is constructed by a Hamiltonian cycle chain which passes through as many counties as possible exactly once and any two counties next to each other on the chain always border each other graphically. For twins, the additional case is assigned to county j if Σ_{ k }I(d_{ ik }<d_{ ij })n_{ k }<rN ≤ Σ_{ k }I(d_{ ik }≤ d_{ ij })n_{ k }, where n_{ k }is the population size of county k, N = Σ_{ k }n_{ k }is the total population size, r is some constant in the interval of (0,0.5) with the bigger r the broader geographical area to spread the cases, and d_{ ij }is the distance in one particular direction along the chain connecting county i and county j. For triplets, the two new cases are assigned in opposite directions along the chain. Data sets corresponding to different r were generated from a probability distribution or be determined in advance. Notice that although the data generation mechanism of the secondorder clustering model is very different with the firstorder clustering model, the resulting point patterns may look quite similar, and hence indistinguishable.
Both hotspot and global clustering data sets in this benchmark data set have two separate groups of data with a total of 600 and 6000 simulated cases, respectively. For each group the same null hypothesis of no cluster is used where the relative risk for each county is equal, and the cases as well as their locations are independent of each other. In order to perform power comparison, 100000 random data sets with a total of 600 and 6000 cases were generated under the null hypothesis, respectively. They are used to estimate the critical cutoff point of significance for each group. For the alternative hypothesis of clusters which are called scenarios in this paper, 10000 random data sets per scenarios were generated to estimate the power using the previous determined cutoff points.
For each group of fixed total cases (600 or 6000 cases), Kulldorff generated 35 hotspot clustering scenarios and 26 global clustering scenarios for his power comparison. For instance, a scenario of "rural and urban 600, size 4" means that the total of number of cases in the whole study region is 600 and the study region contains two hotspot clusters. One cluster is in a rural region including four counties, and the other one is in an urban region including four counties also. A scenario of "global clustering twin 6000, exponential 0.02" means that the total number of cases is 6000 and the cases are generated under the alternative hypothesis of global clustering. The value r is not constant but randomly generated from an exponential distribution with mean 0.02 among the data set generated under this scenario.
We did not use all the scenarios in Kulldorff's Northeastern US benchmark data in this paper. Instead, we randomly chose one or two scenarios from each clustering pattern. Finally, nine hotspot clustering scenarios and six global clustering scenarios were used in our power study for the binary data case. After selecting these scenarios, Kulldorff's method with the Poisson model (Bernoulli model is also appropriate) and the semiparametric method with the tilt function h(x) = x were applied to the same data sets in each scenario to compare the power. Because data are binary, the simplified semiparametric likelihood ratio test statistic (eq. 16) is used in the computation. All the tests are twosided to detect for either high or low valued clusters.
Nonbinary Data Set — Ordinal Categorical Data
For nonbinary data scan, we used the simulated ordinal categorical data with one data point corresponding to one observation. The data are aggregated to state level distributed in 18 US states. Most of them are middle west and south states, including Alabama (AL), Arkansas (AR), Texas (TX), Virginia (VA), etc. Each state is graphically represented by its centroid coordinate. Figure 2 shows the map of the states included in our simulation study.
The ordinal categorical data are generated from quantized normal data which truncate the integer part of the original data generated from normal distribution. We refer to quantized normal II and quantized normal III data as in Kedem and Wen (2007) [16]. The quantized normal II data are derived from normal data with the same mean but relatively large variance difference inside and outside the cluster region (inside variance is twice of the outside variance). The quantized normal III data are derived from normal data with both different mean and variance inside and outside the cluster region. To see how Kulldorff's and the semiparametric methods perform when the difference (in mean) between the cluster and the noncluster region is small, namely a more difficult cluster to be detected, the data set "Quantized Normal III (small)" was generated. We did not generate data under quantized normal I (same variance, different mean) mechanism, because it has shown that as long as the mean difference is reasonably large, Kulldorff's method with even an inappropriate scan statistic still would work [16]. Table 2 lists the mean and variance parameters used in this paper to generate the quantized normal data
For each type of data, the average sample size within each state is around 130, hence a total number of 130 * 18 = 2340 observations in each generated data set. Cluster patterns included single cluster and multiple clusters for Quantized Normal II, Quantized Normal III, and Quantized Normal III (small) data, respectively. To perform the power comparison within a reasonable time scale, we used 100 runs, namely 100 random data sets, per cluster pattern in our power comparison study. The single clusters means only one state was randomly chosen as the cluster region. By multiple clusters we mean four states were randomly chosen simultaneously as the cluster region, where the four states are not necessarily contiguous. Notice that multiple clusters constitute a stronger clustering pattern which in general is easier to detect.
In our power comparison for nonbinary ordinal categorical data, Kulldorff's method chose Poisson model (inappropriate model) and the ordinal model (correct model) using SaTScan v7.0.1 http://www.satscan.org/. 999 replications were used to decide the pvalue under each model. The semiparametric method chose the vector tilt function h(x) = (x, x^{2})'. Observe that it is also appropriate to choose h(x) = (x, x^{2})' for binary data case, because in that case the coefficient of x^{2} term is 0. See Appendix 1 for detail.
Results
Power Comparison
We compared the power of Kulldorff's and the semiparametric scan statistics methods in detecting potential clusters. Because there are various cluster patterns, either single or multiple clusters, and each cluster region may contain one or more counties or states, for simplicity, in our power comparison we more focus on detecting the existence rather than the precise delineation of the cluster region. For instance, for a data set with a pattern of multiple cluster regions and multiple counties in each cluster region, we deem the detection successful whenever a significance (i.e., pvalue or qvalue less than 0.05) is obtained. We do not strictly require the detected cluster region to be exactly the same as originally simulated. The detected cluster region could fully or partially cover the desired area.
Comparison for Binary Data
The results of power comparison for the binary casepopulation data using the northeastern US benchmark data set are shown in Figures 3 and 4 for a total number of 600 and 6000 cases, respectively. Kulldorff's method used Poisson model, and the semiparametric method used the likelihood ratio test. All tests are twosided tests. The radius of the scan window varies continuously from covering one county to covering no more than 50% of the whole study region. In each figure, scenario 0 is under the null hypothesis of no cluster, scenarios 1 to 9 are hotspot clustering scenarios, and scenarios 10 to 15 are global clustering scenarios. Each scenario contains five quantities. They are "Kull.paper", "Kull.me", "SemiwFDR = 0.1", "SemiwFDR = 0.05", and "Bonferroni". "Kull.paper" is the corresponding power copied from Kulldorff et al. (2003) paper [22]. "Kull.me" is the corresponding power computed by us based on Kulldorff Poisson model. It is shown from Figure 3 and 4 that the column of "Kull.paper" is almost equal to the column of "Kull.me", which confirms the validity of our computation. So in the latter section, we will use "Kulldorff's method" without distinguishing these two power results. "SemiwFDR = 0.1" is the corresponding power computed based on semiparametric method with a qvalue significance level 0.1. "SemiwFDR = 0.0.5" is the corresponding power based on semiparametric method with a qvalue significance level 0.05. The smaller the qvalue significance level is, the harder it is to reject the null hypothesis, hence the lower the power to detect the cluster. "Bonferroni" is the corresponding power computed based on semiparametric method using Bonferroni correction with 5% familywise error rate, where the original obtained pvalues are multiplied by the number of tests to obtain Bonferroni corrected pvalues. If at least one Bonferroni corrected pvalue is less than 0.05, we claim there is clustering. As we know Bonferroni correction is a popular but conservative approach to handle multiple testing problems, we expect column of "Bonferroni" should have the lowest power among the five.
For scenario 0, both Figures 3 and 4 show that the type I error of Kulldorff's method are all exactly 0.05 for both 600 cases and 6000 cases. This is expected since Kulldorff's method uses a Monte Carlo procedure to derive the cutoff point for the corresponding significance level. The power from the semiparametric method under the qvalue significance level of 0.1 is 0.052 for the 600 cases, and 0.054 for the 6000 cases, almost equal to Kulldorff's type I error level. Notice that higher qvalue significance level means in favor of the alternative. If the qvalue significance level is lower down to 0.05, the type I error of the semiparametric method reduced to 0.027 for both cases. It shows that if one wants to make the power comparison under exactly the same type I error level, say 0.05, the qvalue significance level must be set in the interval between 0.05 to 0.1. In addition, the type I error of the semiparametric method with Bonferroni correction is the lowest, which is not a surprise, since Bonferroni correction is most conservative.
For scenarios No. 1 to 9, both figures show that the two methods work very well in detecting hotspot clusters, but Kulldorff's method seems to be slightly more powerful. When the study area has a stronger pattern of clustering, such as containing more cluster regions, the power of both methods increases, which demonstrates the validity of the two methods. For instance, scenario No. 6 in Figure 3 has a total of 600 cases and two cluster regions where each cluster region contains 16 counties. The power of Kulldorff's method is 0.996, whereas the power of semiparametric method with qvalue significance level of 0.1 and 0.05 obtain the power 0.996 and 0.993, respectively, which is quite close to the power of Kulldorff's. The semiparametric method with Bonferroni correction is 0.968, which is expected to be the lowest, but still is quite reasonable.
For scenarios No. 10 to 15, there is no clear cut to say Kulldorff's method or semiparametric method is better, and both methods do not achieve high power as in hotspot detection. This is so because both Kulldorff's and the semiparametric methods are not designed to detect global clustering pattern. It is also shown that for both methods, the larger the r is, the lower is the detection power. This is reasonable, because larger r means larger range for the data to spread, hence more difficult to form a local cluster pattern.
Comparison for Ordinal Categorical Data
The results of power comparison for nonbinary ordinal categorical data using the simulated middle west and south US data set are shown in Figures 5, 6 and 7. Similar as in the power comparison for binary data, the radius of the scan window varies continuously from covering one county to covering no more than 50% of the whole study region. All the tests are twosided tests. The significance level for Kulldorff's method is still 0.05, and the qvalue significance level for the tests in the semiparametric method is set at 0.05 as well. This means the true type I error level of the semiparametric method might be lower than Kulldorff's as showed in the last subsection.
For ordinal categorical data which are generated from the quantized normal II type, Figure 5 shows that the semiparametric method with the likelihood ratio test has the highest power of detecting potential clusters among all the tests. The semiparametric method with the χ_{1} test works well but not as good as the likelihood ratio test. This is in line with the limited power study in Kedem and Wen (2007) [16] which showed the likelihood ratio test was the most powerful tests among three tests from the semiparametric density ratio model. Moreover, when the clustering pattern is stronger, for instance, the study region contains multiple clusters, the power of χ_{1} test increases to 0.94, which is almost close to the power of the likelihood ratio test. The power of Kulldorff's method with Poisson model is very low, because Poisson model is designed to detect the difference in the mean not in the variance. Kulldorff's method with ordinal model is comparable to the semiparametric method with the χ_{1} test in detecting potential clusters.
For ordinal categorical data which are generated from the quantized normal III type, Figure 6 shows that the semiparametric method performs in the same way as in the quantized normal II case. The semiparametric method with the likelihood ratio test still has the highest power in both single and multiple clusters situation. The power of all tests increases as the clustering pattern becomes stronger. Kulldorff's method with the ordinal model performs equally well as compared to semiparametric method. Kulldorff's method with Poisson model "works" but still much less powerful than the other tests. This is because for the quantized normal III data, Kulldorff's Poisson model can detect the changes in the mean while still ignoring changes in the variance.
The part (a) of Figure 7 shows the power results when the difference (in the mean) between the cluster and noncluster regions is small. The data are still ordinal categorical data generated from the quantized normal III type. However, this time the difference is that the mean in the cluster region was set to be close to the the noncluster region, while the variances inside and outside the cluster region were kept the same as in the previous case (refer to Table 2 for the simulation parameters). In this way, the clusters were more difficult to detect. Not surprisingly, the power for the semiparametric method with the likelihood ratio test and Kulldorff's method with the ordinal model continue achieving the highest power, although it is much lower than the power in the previous case where the differences inside and outside the cluster region were substantial. The power of the χ_{1} test of the semiparametric method also decreases due to the weaker cluster pattern, and Kulldorff's method with the Poisson model yields the lowest power almost 0. The multiple cluster case is similar to the single cluster case but with a relatively higher power. Interestingly, if we look at the accuracy, which means detecting the true cluster region correctly with its exact size, the most accurate method as shown in the part (b) of Figure 7 is the semiparametric method with the likelihood ratio test statistic, which achieves more than 50% higher accuracy rate than Kulldorff's method with the ordinal model.
A Case Study — North Humberside childhood leukemia data set
The semiparametric method with Storey's qvalue for adjusting the multipletesting was again applied to the North Humberside childhood leukemia data set under the same conditions as used in Kedem and Wen (2007) [16]. As shown in Figure 8, both Kulldorff's method and the semiparametric method point to the same cluster as the primary cluster candidate, but but Kulldorff's method gave a highly insignificant pvalue of 0.674, whereas the semiparametric method gave a significant pvalue of 0.002 without adjusting for multiple testing and a close to boundary qvalue of 0.073 adjusted by the Storey's approach. Thus, the results from the semiparametric method are more in line with the medical and environmental studies of Cuzick et al. (1990), Mckinney et al. (1991), and Colt and Blair (1998), which suggested that the North Humberside region had paternal exposure to solvents which were strongly associated with leukemia [39, 41]. It seems the results from the semiparametric method that the located cluster candidate is a true cluster are more credible. More conclusive results may be obtained by increasing the sample size.
Discussion and Conclusion
In this paper, We extended the semiparametric cluster detection method of Kedem and Wen (2007) [16] by incorporating Storey's qvalue method, a type of FDR method, to overcome the multipletesting problem raised by numerous scan windows of variable size. The results of the power study show that when detecting localized clusters, both the semiparametric and Kulldorff's method can achieve comparably good power. For binary populationcase data, Kulldorff's method with Poisson model may have a slightly higher power than the semiparametric method with the tilt function h(x) = x. We may choose the tilt function as h(x) = (x, x^{2})' for binary data as well although the x^{2} term is not necessary. See Appendix 1 for a more detailed explanation. For nonbinary data, such as ordinal categorical data, the semiparametric method with the tilt function h(x) = (x, x^{2})' is slightly more powerful than Kulldorff's method with the ordinal model. If Kulldorff's method is applied with an inappropriate model, it may fail to detect any potential clusters as shown in the simulation study for nonbinary data. We also find that in our semiparametric method the likelihood ratio test seems to have higher power than the χ_{1} test in detecting potential clusters. When the localized clustering pattern is strong, for instance, multiple cluster regions or the difference inside and outside the cluster region is large, both tests obtain good power. When the clustering pattern is weak, that is, the difference is not that large, the likelihood ratio test seems to be acuter, while the χ_{1} test could be insensitive to the undesired fluctuations. In addition, the Likelihood ratio test only can test "either high or low values", but χ_{1} test can be easily transformed into a onesided test [16] for twosample comparison and scalar tilt function. In practice, it is prudent to use both tests for potential clusters. Potentially, if one wants to borrow information from other sources besides the current study region, the χ_{2}test of the semiparametric method could be used.
In the power study for nonbinary data, we only used the ordinal categorical data, which are integer data, to compare the power of the two methods. However, Kulldorff's method also provide exponential model to analyze the survival data, which are mostly assumed to be exponentially distributed, and a normal model to analyze continuous data. A future study is to compare the power of the semiparametric method and Kulldorff's method for continuous data. But we may expect the semiparametric method will still work well, since the semiparametric density ratio model was originated from continuous data. In addition, the semiparametric model may have a more consistent setup than Kulldorff's method. Kullorff's method needs to choose different models, namely different scan statistics, for different types of data, whereas the semiparametric method requires no complete distributional assumptions except for the tilt function h(x). It could choose h(x) = (x, x^{2})' for many types of data, no matter continuous or discrete.
If the underlying distribution is known exactly, semiparametric could choose the appropriate tilt function h(x) to get the best performance. For instance, if the data are known from Bernoulli or Poisson distribution, we can use the tilt function h(x) = x; if the data are from normal, we can use h(x) = (x, x^{2})'; if the data are from Gamma distribution, we can use h(x) = (x; log x)'. A clue of how to choose a satisfactory h(x) for a given situation can be derived from common exponential families. More examples can be found in Kedem and Wen (2007) or Kay and Little (1987) [16, 42]. In addition, Fokianos et. al. (2001) [30] suggested that h(x) could be approximated by a polynomial, B splines, or kernel estimation. If the underlying distribution is not known, there could be a problem of the misspecified tilt function. Fokianos and Kaimi has discussed that a misspecified h(x) could decrease the power of the corresponding tests [43, 44]. Yet, there are examples where very different choices of h(x) could lead to similar test results. For instance, in an application to meteorological data in Kedem et al. (2004) [45] the choice of h(x) = x or h(x) = log x led to very similar results. It seems that for a nonhomogeneous regional variance the choice of h(x) = (x, x^{2})' suggested by the normal distribution is sensible.
The semiparametric density ratio model essentially tests the homogeneity or equidistribution of two or more samples, therefore, besides Kulldorff's circular scan window, the semiparametric method may also adapt to other shape of the scanning window or scanning schemes, such as the Kulldorff's elliptic window scan, irregular shape upperlevel set scan, or flexible scan, etc. More precisely, in scanning for clusters, and regardless of the regular or irregular shape of the scanning window, as long as the window separates the whole study region into two samples, one inside the window and one outside the window, the semiparametric method can be applied. However, it also makes the semiparametric method ignore the information about the location of the cases except whether the case is inside or outside the current evaluated window. Thus the semiparametric method may not have good power for global type clustering as shown in scenarios 10 to 15 in Figure 3 and 4 where clustering occurs throughout the study region.
A limiting factor of this power study might be that in the power study for nonbinary type data we only used 100 runs at one point for each power comparison. That is because it is tedious and time consuming to run SatScan software and document the results manually. In addition, the semiparametric method also takes longer time to numerically estimate the parameters α and β especially when the combined sample size is big, while for binary scan the MLEs of α and β have already been obtained in closed forms. However, although 100run does not sound like a large number in simulation, the results are still meaningful. From Figures 5 and 6, it is already clear to show the semiparametric method achieves comparable good power as Kulldorff's method with the correct model achieves. If Kulldorff's method is applied with an inappropriate model, the power may decrease a lot. Figure 7 demonstrates that when the difference between the cluster and noncluster region is small, the detecting power for both methods decrease, but the likelihood ratio test from the semiparametric method seems have better accuracy in detecting clusters.
A last note is about the π_{0} in Storey's qvalue method. The critical part of the qvalue method of controlling the false discovery rate is to give a good estimate of π_{0}, the proportion of the true null hypotheses among all the tests. The current method we used in this study is based on the algorithm suggested by Storey et al. (2003) [35] and assume the distribution of pvalue from each test is uniform over the (0,1) interval. However, Yang (2004) [46] pointed out that if the pvalues were not uniformly distributed, the power of the qvalue method may decrease. He suggested to compute a weighted average of π_{0} from the distribution of the raw pvalues which are greater than a threshold (say 0.4). Thus it may give a better control and more robust estimate of π_{0}.
Appendix 1 — Simplified Semiparametric Test Statistics for Binary Data
If the data are 01 binary, such as cancer or not cancer, we can simplify the likelihood and obtain closed forms for the estimated parameters. Recall:
N_{ G }: The combined sample size for the whole study region.
n_{ G }: The number of cases in the whole study region.
N_{ Z }: The sample size within the scan window.
n_{ G }: The number of cases within the scan window.
ti: The i th observation from the combined sample in the whole study region, i = 1,..., N_{ G }.
ρ_{1}: The relative sample size, which is equal to N_{ Z }/(N_{ G } N_{ Z }).
x_{ Zj }: The j th observation within the scan window Z, j = 1,..., N_{ Z }.
For binary data, choose the tilt function h(x) = x. Then the profile loglikelihood with parameters α_{1} and β_{1} is
The resulting maximum likelihood estimators are,
If β_{1} = 0, it implies α_{1} = 0 and e^{α 1+β 1}= 1, which means the relative rates inside and outside the scan window are equal.
By equation (7), then χ_{1} test statistic then can be simplified to
Where and is as in equation (14).
By equation (9), then the likelihood ratio test statistic can be simplified to
If the tilt function h(x) is chosen as (x,x^{2})', the parameter β_{11} and β_{12} are associated with the x and x^{2} term, respectively. Notice that for 01 binary data, x = x^{2} in value, thus
the x^{2} term is confounded with x, and β_{12} is not estimable. Therefore, tilt function h(x) = x with parameters α_{1}, β_{1} are sufficient for binary case.
Appendix 2 — Derivation of S, V
The matrices S, V are derived by repeated differentiation of the loglikelihood (eq. 17), which is the generalized version of equation (4) for m samples (m = q + 1) with the m th sample being the reference sample.
Recall that the asymptotic covariance matrix of the estimates in (5) is given by the product
First define
Then E[∇ l(α_{1},..., α_{ q }, β_{1},..., β_{ q })]=0. To obtain the score second moments it is convenient to define ρ_{ m }≡ 1, w_{ m }(t) ≡ 1
and,
for j, j' = 1,..., q. Then, the entries in
are
The last term is 0 for j ≠ j' and (n_{ j }/n)Var [h(∈_{j 1})] for j = j'.
Next, as n → 1,
where S is a q(1 + p) × q(1 + p) matrix with entries corresponding to j, j' = 1,..., q,
References
 1.
Pickle LW, Feuer EJ, Edwards BK: U.S. Predicted Cancer Incidence, 1999: Complete Maps by County and State From Spatial Projection Models. National Cancer Institute, Cancer Surveillance Monograph No. 5. 2003, NIH Publication No. 035435.
 2.
Glaz J, Balakrishnan N: Scan Statistics and Applications. 1999, New York: Springer
 3.
Glaz J, Naus J, Wallenstein S: Scan statistics. 2001, New York: Springer
 4.
Naus JI: The distribution of the size of the maximum cluster of points on a line. Journal of the American Statistical Association. 1965, 60: 532538. 10.2307/2282688.
 5.
Shmueli G, Fienberg SE: Current and potential statistical methods for monitoring multiple data streams for biosurveillance. Statistical Methods in CounterTerrorism: Game Theory, Modeling, Syndromic Surveillance, and Biometric Authentication. Edited by: Wilson A, Wilson G, Olwell DH. 2006, New York: Springer, 109140.
 6.
Patil GP, Taillie C: Upper level set scan statistic for detecting arbitrarily shaped hotspots. Environmental and Ecological Statistics. 2004, 11: 183197. 10.1023/B:EEST.0000027208.48919.7e.
 7.
Patil GP, Rathbun SL, Acharya R, Patankar P, Modarres R: Geoinformatic surveillance of hotspot detection, prioritization and early warning. Proceedings of the 2005 National Conference on Digital Government Research. 2005, 301302.
 8.
Modarresa R, Patil GP: Hotspot Detection with Bivariate Data. Journal of Statistical Planning and Inference. 2007, 11: 36433654. 10.1016/j.jspi.2007.03.040.
 9.
Tango T, Takahashi K: A flexibly shaped spatial scan statistic for detecting clusters. International Journal of Health Geographics. 2005, 4: 1125. 10.1186/1476072X411.
 10.
Takahashi K, Kulldorff M, Tango T, Yih K: A flexibly shaped spacetime scan statistic for disease outbreak detection and monitoring. International Journal of Health Geographics. 2008, 7: 1427. 10.1186/1476072X714.
 11.
Kulldorff M: A spatial scan statistic. Communication in Statististics: Theory and methods. 1997, 26: 14811496. 10.1080/03610929708831995.
 12.
Kulldorff M: Spatial scan statistics: models, calculations, and applications. Scan Statistics and Applications. Edited by: Glaz J, Balakrishnan N. 1999, Boston: Birkhäuser, 303322.
 13.
Kulldorff M: Prospective time periodic geographical disease surveillance using a scan statistic. Journal of the Royal Statistical Society A. 2001, 164 (part 1): 6172. 10.1111/1467985X.00186.
 14.
Huang L, Kulldorff M, Gregorio D: A spatial scan statistic for survival data. Biometrics. 2007, 63 (1): 109118. 10.1111/j.15410420.2006.00661.x.
 15.
Jung I, Kulldorff M, Klassen AC: A spatial scan statistic for ordinal data. Statistics in Medicine. 2007, 26 (7): 15941607. 10.1002/sim.2607.
 16.
Kedem B, Wen S: Semiparametric cluster detection. Journal of Statistical Theory and Practice. 2007, 1: 4972.
 17.
Waller LA, Lawson AB: The power of focused tests to detect disease clustering. Statistics in Medicine. 1995, 14: 22912308. 10.1002/sim.4780142103.
 18.
Alexander F, Cartwright R, McKinney PA, J RT: Investigation of spatial clustering of rare diseases: childhood malignancies in North Humberside. Journal of Epidemiology and Community Health. 1990, 44: 3946. 10.1136/jech.44.1.39.
 19.
Storey JD: A direct approach to false discovery rates. Journal of the Royal Statistical Society Series B. 2002, 64: 479498. 10.1111/14679868.00346.
 20.
Storey JD: The positive false discovery rate: a bayesian interpretation and the qvalue. Annals of Statistics. 2003, 31: 20132035. 10.1214/aos/1074290335.
 21.
Benjamini Y, Hochberg Y: Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society Series B. 1995, 57 (1): 289300.
 22.
Kulldorff M, Tango T, Park PJ: Power comparison for disease clustering tests. Computational Statistics & Data Analysis. 2003, 42: 665684. 10.1016/S01679473(02)001603.
 23.
Kulldorff M, Heffernan R, Hartman J, Assunção RM, Mostashari F: A spacetime permutation scan statistic for the early detection of disease outbreaks. PLoS Medicine. 2005, 2: 216224. 10.1371/journal.pmed.0020059.
 24.
Dwass M: Modified randomization tests for nonparametric hypotheses. The Annuals of Mathematical Statistics. 1957, 28: 181187. 10.1214/aoms/1177707045.
 25.
Barlow RE, Bartholomew DJ, Bremner JM, Brunk HD: Statistical Inference under Order Restrictions. 1974, New York: Wiley
 26.
Dykstra R, Kochar S, Robertson T: Inference for likelihood ratio ordering in the twosample problem. Journal of the American Statistical Association. 1995, 90 (431): 10341040. 10.2307/2291340.
 27.
Kulldorff M, Mostashari F, Duczmal L, Yih WK, Kleinman K, Platt R: Multivariate scan statistics for disease surveillance. Statistics in Medicine. 2007, 26 (8): 18241833. 10.1002/sim.2818.
 28.
Kulldorff M, Huang L, Konty K: A spatial scan statistic for normally distributed data. 2009,http://populationmedicine.net/content/docs/normalscanposted.pdf
 29.
Kulldorff M, Huang L, Pickle L, Duczmal L: An elliptic spatial scan statistic. Statistics in Medicine. 2007, 25 (22): 39293943. 10.1002/sim.2490.
 30.
Fokianos K, Kedem B, Qin J, Short DA: A semiparametric approach to the oneway layout. Technometric. 2001, 43: 5665. 10.1198/00401700152404327.
 31.
Qin J, Zhang B: A goodness of fit test for logistic regression models based on casecontrol data. Biometrika. 1997, 84: 609618. 10.1093/biomet/84.3.609.
 32.
Qin J, Lawless JF: Empirical likelihood and general estimating equations. The Annals of Statistics. 1994, 22: 300325. 10.1214/aos/1176325370.
 33.
Sen PK, Singer JM: Large sample method in statistics, An introduction with Applications. 1993, London: Chapman and Hall/CRC
 34.
Reiner A, Yekutieli D, Benjamini Y: Identifying differentially expressed genes using false discovery rate controlling procedures. Bioinformatics. 2003, 19 (3): 368375. 10.1093/bioinformatics/btf877.
 35.
Storey JD, Tibshirani R: Statistical significance for genomewide studies. Proceedings of the National Academy of Sciences. 2003, 100: 94409445. 10.1073/pnas.1530509100.
 36.
Tsai C, Hsueh M, Chen J: Estimation of False Discovery Rates in Multiple Testing: Application to Gene Microarray Data. Biometrics. 2003, 59 (4): 10711081. 10.1111/j.0006341X.2003.00123.x.
 37.
Storey JD, Taylor JE, Siegmund D: Strong control, conservative point estimation, and simultaneous conservative consistency of false discovery rates: A unified approach. Journal of the Royal Statistical Society Series B. 2004, 66: 187205. 10.1111/j.14679868.2004.00439.x.
 38.
Gagnon R: Certain Computational Aspects of Power Efficiency and of State Space Models. PhD thesis, University of Maryland, College Park, Department of Mathematics. 2005
 39.
Cuzick J, Edwards R: Spatial clustering for inhomogeneous populations. Journal of the Royal Statistical Society B. 1990, 52: 73104.
 40.
McKinney PA, Alexander FE, Cartwright RA, Parker L: Parental occupations of children with leukemia in west Cumbria, North Humberside, and Gateshead. British Medical Journal. 1991, 302: 681687. 10.1136/bmj.302.6778.681.
 41.
Colt JS, Blair A: Parental Occupational Exposures and Risk of Childhood Cancer. Environmental Health Perspectives Supplements. 1998, 106 (S3): 909925. 10.2307/3434208.
 42.
Kay R, Little S: Transformations of the explanatory variables in the logistic regression model for binary data. Biometrika. 1987, 74: 495501. 10.1093/biomet/74.3.495.
 43.
Fokianos K, I K: On the effect of misspecifying the density of ratio model. Annals of the Institute of Statistical Mathematics. 2006, 58 (3): 475497. 10.1007/s1046300500228.
 44.
Fokianos K: Density ratio model selection. Journal of Statistical Computations and Simulation. 2007, 77 (9): 805819. 10.1080/10629360600673857.
 45.
Kedem B, Wolff DB, Fokianos K: Statistical comparison of algorithms. IEEE Transactions on Instrumentation and Measurement. 2004, 53: 770776. 10.1109/TIM.2004.827301.
 46.
Yang X: Qvalue methods may not always control false discovery rate in genomic applications. Proceedings of the 2004 IEEE Computational Systems Bioinformatics Conference. 2004, 556557.
Acknowledgements
Special thanks to Professor Martin Kulldorff for all of his help during this research. Also thanks for the valuable comments from the three reviewers.
Author information
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
BK proposed the idea to apply the semiparametric density ration model to cluster detection. SW implemented the idea and developed the semiparametric cluster detection method. SW wrote the Splus code, conducted the power comparison, and wrote the manuscript. BK gave valuable comments to the manuscript. All authors have read and approved the final manuscript.
Authors’ original submitted files for images
Rights and permissions
About this article
Received
Accepted
Published
DOI
Keywords
 Poisson Model
 Cluster Region
 Likelihood Ratio Test Statistic
 Global Cluster
 Power Comparison