A log-Weibull spatial scan statistic for time to event data

Background Spatial scan statistics have been used to identify geographic clusters of elevated numbers of cases of a condition, such as disease outbreaks. Accompanied by an appropriate distribution, these statistics can also identify geographic areas with either longer or shorter times to event. Other authors have proposed spatial scan statistics based on the exponential and Weibull distributions. Results We propose the log-Weibull as an alternative distribution for the spatial scan statistic for time to event data and compare the log-Weibull and Weibull distributions through simulation studies. The effect of type I differential censoring and the power have been investigated through simulated data. The methods are also illustrated on time to specialist visit data for discharged patients presenting to emergency departments for atrial fibrillation and flutter in Alberta during 2010–2011. We found that northern regions of Alberta had longer times to specialist visit than other areas. Conclusions We proposed the spatial scan statistic for the log-Weibull distribution as a new approach for detecting spatial clusters for time to event data. The simulation studies suggest that the test performs well for log-Weibull data.


Background
The existence of more than presumed numbers of cases of a disease condition in a geographic region is referred to as a spatial disease cluster. Timely detection of spatial disease clusters enables health authorities to better understand the distribution of disease and if possible, control disease. A large number of methods have been proposed and applied by authors for the identification and evaluation of geographical disease clusters and disease surveillance, and the spatial scan statistics (SSS) is one of them. The SSS, with its possible extensions has been widely used as a standardized approach for the last two decades, not only in the disease clustering but also in various other fields of study like natural disasters [1], forestry [2], astronomical data [3], history [4], and psychology [5]. It was first proposed by Kulldorff and Nagarwalla and has the capability of identifying spatial clusters of variable sizes and locations [6]. The key reasons for the popularity of this method include that it identifies the cluster location and tests the tendency to cluster [7]. According to Costa and Assunção, the latter advantage is considered to be more important in terms of health related interventions than global clustering results [7]. The SSS's based on the Bernoulli and Poisson models are frequently used for count data for cluster identification and geographical disease surveillance [8,9]. These scan statistics have been further extended to other kinds of data such as ordinal [11], multinomial [12], continuous [13], and correlated count data [14].
Time to event data with a censoring component (e.g., survival data) are one of the important health outcomes for which the SSS is of interest [9]. The SSS for time to event data is used to determine whether there are geographical clusters with longer than expected and/or shorter than expected times to event. The exponential [9] and Weibull [10] SSS's (adjusted for censoring) have already been developed for time to event data. We propose the log-Weibull as an alternative distribution for the SSS for cluster detection of time to event data. The log-Weibull distribution has wide applications in extreme value theory. Our focus is to establish a new SSS for the detection of rare and extreme events.
In the Methods section, we describe the existing Weibull SSS and the newly developed SSS based on the log-Weibull distribution. The Application section contains the results from the identification of clusters of longer times to specialist follow-up after an emergency department presentation for atrial fibrillation and flutter in Alberta, Canada. Simulation studies are performed to investigate power, the effect of right (type I) differential censoring, and ability to identify the true cluster by the log-Weibull and Weibull spatial scan statistics.

Methods
The SSS identifies the geographic zones from a study region that have the strongest indication of representing a spatial cluster. It uses data, such as administrative health data, collected for geographical sub-regions, each characterized by a centroid (population or geographic based). The SSS imposes a circular search window of radius r on each centroid, centred at the centroid's coordinates [6]. A zone (Z) defined by this circular window comprises all the individuals in the sub-regions whose centroids lie inside the circle [6]. For the purpose of the analysis, an upper bound r* is chosen for the circular window, expressed as a percentage of the total population [10]. For each sub-region's centroid, the nearest neighbours that together cover up to r* percent of the total population are determined. For any given centroid, the radius of the window is expanded continuously from 0 up to this upper bound [10]. During the expansion, a new zone is created each time a new neighbouring centroid is included in the circular window [14]. Zones defined in this way have irregular geographical boundaries that depend on the size and shape of the sub-regions whose centroids lie inside the spatial scan window [14].
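The zone construction described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `build_zones`, the use of planar Euclidean distance (the application uses latitude and longitude), and the rule of stopping at the first neighbour that would exceed the population cap are all assumptions made for the example.

```python
import math

def build_zones(centroids, populations, max_fraction=0.10):
    """Return the distinct candidate zones as sorted tuples of sub-region indices.

    centroids    : list of (x, y) planar coordinates (assumed; real data use lat/long)
    populations  : population of each sub-region
    max_fraction : upper bound on the share of the total population in a zone
    """
    total = sum(populations)
    zones = set()
    for cx, cy in centroids:
        # Order every sub-region by distance from the current centroid
        # (the centroid itself comes first); ties are broken by index.
        order = sorted(
            range(len(centroids)),
            key=lambda j: (math.hypot(centroids[j][0] - cx, centroids[j][1] - cy), j),
        )
        members, pop = [], 0
        for j in order:
            # Stop expanding once the next neighbour would exceed the cap.
            if pop + populations[j] > max_fraction * total:
                break
            members.append(j)
            pop += populations[j]
            # Each newly absorbed centroid defines a new zone.
            zones.add(tuple(sorted(members)))
    return sorted(zones)
```

Because different centroids can generate the same set of sub-regions, the sketch keeps each zone only once.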
The methodology of the SSS is based on calculating the maximum log likelihood ratio (LLR). The SSS partitions the geographical area into zones (i.e., areas of potential cluster versus the rest of the study region), and the LLR is calculated each time a new zone is created for each centroid [8,10]. The zone $\hat{Z}$ that maximizes the LLR is called the primary (most likely) cluster. The hypotheses under consideration are $H_0$: the disease risk is constant over $\hat{Z} \cup \hat{Z}^c$ versus $H_1$: there is an elevated risk in $\hat{Z}$.
Let $G$ be the whole study region, which can be partitioned into the mutually exclusive sub-regions $Z$ and $Z^c$, where $Z$ indicates a zone designated to be a potential cluster and $Z^c$ is the rest of the study region. Let $N = n_{in} + n_{out}$ be the total number of individuals in $G$, where $n_{in}$ and $n_{out}$ are the total numbers of individuals inside and outside the zone, respectively. The subscripts "in" and "out" indicate that the quantities are calculated from the individuals inside and outside the zone, respectively.
Let the $i$th individual have a time to event $T_i$ and an indicator $\delta_i$ that records whether the time is censored or not [9]. The observed time is defined as $t_i = \min(T_i, L_i)$, where $L_i$ denotes the censoring time, so that $\delta_i = 1$ when the event is observed and $\delta_i = 0$ when the time is censored. Let $R = r_{in} + r_{out}$ be the total number of uncensored observations, where $r_{in}$ and $r_{out}$ are the total numbers of uncensored observations inside and outside the zone, respectively. These are defined as $r_{in} = \sum_{i \in Z} \delta_i$ and $r_{out} = \sum_{i \in Z^c} \delta_i$.

Weibull distribution
Bhatt and Tiwari established the SSS based on the Weibull distribution. The Weibull model generalizes the exponential model by adding a shape parameter to the existing scale parameter [10]. The additional parameter allows the Weibull hazard function to take different shapes rather than being constant. We provide a brief summary of the methodology; complete details can be found in the paper by Bhatt and Tiwari [10]. Let the times to event $T_i$ $(i = 1, \ldots, N)$ be i.i.d. with the Weibull probability density function (PDF) $f(T_i) = \frac{1}{\theta}\, p\, T_i^{p-1} \exp\left(-T_i^{p}/\theta\right)$, where $\theta$ and $p$ are the scale and shape parameters, respectively. Let the time to event for each individual inside the zone follow the Weibull distribution with $\theta_{in}$ and $p_{in}$ as the scale and shape parameters, respectively. Similarly, assume that the times to event for individuals outside the zone are Weibull distributed with $\theta_{out}$ and $p_{out}$ as the scale and shape parameters, respectively. The null hypothesis under consideration is $H_0: \theta_{in} = \theta_{out}$ versus the alternative hypotheses $H_1: \theta_{in} < \theta_{out}$, $H_1: \theta_{in} > \theta_{out}$, or $H_1: \theta_{in} \neq \theta_{out}$. The alternative hypotheses indicate that at least one zone is detected with either shorter than expected, longer than expected, or simultaneously both longer and shorter than expected times to event. The likelihood ratio test statistic for the Weibull SSS under each $H_1$ is derived in [10].

Log-Weibull distribution
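The log-Weibull distribution is a special case of the generalized extreme value distribution. It is often used to model the distribution of extreme values, strength, event history data such as quick wear-out after reaching a certain age, and logarithms of times [17]. We assume that the times to event $T_i$ $(i = 1, \ldots, N)$ are independently and identically distributed (i.i.d.) with the log-Weibull PDF $f(T_i) = \frac{1}{b} \exp\left(\frac{T_i - a}{b}\right) \exp\left[-\exp\left(\frac{T_i - a}{b}\right)\right]$, where $a$ and $b$ are the location and scale parameters, respectively. The survival function for the log-Weibull distribution is $S(T_i) = \exp\left[-\exp\left(\frac{T_i - a}{b}\right)\right]$. Let the time to event for each individual inside zone $Z$ be log-Weibull distributed with $a_{in}$ and $b_{in}$ as the location and scale parameters, respectively. Similarly, the time to event for each individual outside zone $Z$ (i.e., inside $Z^c$) follows the log-Weibull distribution with $a_{out}$ and $b_{out}$ as the location and scale parameters, respectively. The null hypothesis $H_0: b_{in} = b_{out}$ for any $Z$ is contrasted with one of three alternative hypotheses: $H_1: b_{in} < b_{out}$, $H_1: b_{in} > b_{out}$, or $H_1: b_{in} \neq b_{out}$. The likelihood function $L(Z) = L(Z, b_{in}, b_{out})$ for the log-Weibull SSS combines the density of each uncensored observation with the survival function of each censored observation and can be written as $L(Z, b_{in}, b_{out}) = \prod_{i \in Z} f(t_i; a_{in}, b_{in})^{\delta_i} S(t_i; a_{in}, b_{in})^{1-\delta_i} \prod_{i \in Z^c} f(t_i; a_{out}, b_{out})^{\delta_i} S(t_i; a_{out}, b_{out})^{1-\delta_i}$. Under $H_0$, the likelihood $L_0$ is maximized subject to $b_{in} = b_{out}$. So, the likelihood ratio statistic for $H_1: b_{in} \neq b_{out}$ is $\lambda = \max_Z L(Z, \hat{b}_{in}, \hat{b}_{out}) / L_0$, where $\hat{b}_{in}$ and $\hat{b}_{out}$ are the maximum likelihood estimates inside and outside the zone. In order to address the alternative hypotheses $b_{in} < b_{out}$ and $b_{in} > b_{out}$, the function is multiplied by the indicators $I(b_{in} < b_{out})$ and $I(b_{in} > b_{out})$, respectively.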
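A censoring-adjusted log-likelihood for one group of individuals can be sketched as below. The sketch assumes the standard smallest-extreme-value form of the log-Weibull distribution, with survival function exp(-exp((t - a)/b)), and the usual right-censored likelihood in which uncensored times contribute the density and censored times contribute the survival function. The function names are hypothetical, and the maximization over the parameters that a scan statistic needs is not shown.

```python
import math

def log_weibull_logpdf(t, a, b):
    """Log density of the log-Weibull distribution with location a and scale b."""
    z = (t - a) / b
    return -math.log(b) + z - math.exp(z)

def log_weibull_logsurv(t, a, b):
    """Log survival function: log S(t) = -exp((t - a) / b)."""
    return -math.exp((t - a) / b)

def censored_loglik(times, deltas, a, b):
    """Right-censored log-likelihood.

    deltas[i] = 1 means the event was observed (density contribution);
    deltas[i] = 0 means the time is censored (survival contribution).
    """
    return sum(
        d * log_weibull_logpdf(t, a, b) + (1 - d) * log_weibull_logsurv(t, a, b)
        for t, d in zip(times, deltas)
    )
```

For a candidate zone, the log-likelihood would be the sum of `censored_loglik` evaluated on the individuals inside the zone with (a_in, b_in) and on those outside with (a_out, b_out).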

Permutation test procedure
Since the distribution of the test statistic has no closed analytical form, a permutation test procedure is used to assess the statistical significance of the selected clusters. The exact distribution of the times to event is unknown, so it is not possible to generate simulated data under the null hypothesis. To overcome this, the observed pairs $\{(t_i, \delta_i), i = 1, 2, \ldots, N\}$ are permuted 999 times among the individual geographical coordinates of the original study region [9]. For each permuted dataset, the LLR is calculated for each zone, and the maximum LLR, attained by the most likely cluster of that dataset, is saved. A p value is calculated as the fraction of permutations whose maximum LLR is at least as extreme as the test statistic from the observed time to event data [18]. This permutation step ensures that, no matter how the observed time to event data are distributed, this distribution is preserved for each permuted dataset. This property provides valid statistical inference, since all the permuted datasets are identically distributed [9]. Secondary clusters are the significant spatial clusters that do not overlap with the primary cluster [9]. These clusters are ranked by their corresponding LLR values, and the associated p values are calculated by comparing the kth (say) highest LLR in the real dataset with the maximum LLR in the randomly permuted datasets [9]. Note that the use of a permutation test procedure means that the exact p values will vary between successive analyses of the same dataset.
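The permutation step can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: `statistic` stands for any function returning the maximum LLR over all zones, `toy_statistic` is a hypothetical stand-in used only to make the example runnable, and the p value uses the common Monte Carlo convention of counting the observed dataset among the replicates, which gives a smallest attainable value of 0.001 with 999 permutations.

```python
import random

def permutation_p_value(pairs, locations, statistic, n_perm=999, seed=0):
    """Monte Carlo p value obtained by permuting (time, delta) pairs over locations.

    pairs     : list of (t_i, delta_i) tuples, kept together when permuted
    locations : fixed spatial information, one entry per individual
    statistic : callable(pairs, locations) -> test statistic (e.g. maximum LLR)
    """
    rng = random.Random(seed)
    observed = statistic(pairs, locations)
    shuffled = list(pairs)
    at_least_as_extreme = 0
    for _ in range(n_perm):
        # Reassign the observed pairs to the fixed locations at random.
        rng.shuffle(shuffled)
        if statistic(shuffled, locations) >= observed:
            at_least_as_extreme += 1
    # The observed dataset counts as one replicate (assumed convention).
    return (at_least_as_extreme + 1) / (n_perm + 1)

def toy_statistic(pairs, in_zone):
    """Hypothetical stand-in: mean observed time inside one fixed zone minus outside."""
    inside = [t for (t, _), z in zip(pairs, in_zone) if z]
    outside = [t for (t, _), z in zip(pairs, in_zone) if not z]
    return sum(inside) / len(inside) - sum(outside) / len(outside)
```

Permuting the pairs rather than the times alone keeps each censoring indicator attached to its own observed time.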

Emergency data application
We illustrate the log-Weibull SSS on population based administrative data (age ≥ 35) for patients discharged from the emergency department (ED) who presented with atrial fibrillation and flutter (AFF) in the province of Alberta from April 1, 2010, to March 31, 2011. In 2003, the province of Alberta was divided into nine administrative health areas, also called Regional Health Authorities (RHAs) [19]. These RHAs were further partitioned into 70 sub-Regional Health Authorities (sRHAs) (Fig. 1, numbered 1-70). The sRHAs have diverse population sizes, ranging from 550 to 140,211 with a median population size of 46,075 in 2011, and are the smallest geographical units available for analysis. For each sRHA, the latitude and longitude of the population-based centroid are provided by Alberta Health [19]. Distances between the pairs of sRHA population-based centroids are ordered and used to create the nearest neighbours.
The key outcome of interest is the time from ED discharge for AFF to the first specialist visit during the 365 days of the study period. A specialist in this study is a cardiology (CARD) or internal medicine (INMD) physician. A specialist follow-up visit can occur between the ED end time and the end of the study. Each discharged ED presentation from April 1, 2010, to March 31, 2011, with a follow-up visit to a specialist between its ED end time and March 31, 2011, is considered a complete time to event outcome. If the patient did not have a specialist visit by the end of the study (March 31, 2011), the outcome is referred to as right (type I) censored. Each Alberta resident making at least one discharged ED presentation for AFF during the fiscal year is referred to as a case (patient).
The methodology used in this study does not adjust for repeated ED presentations of cases. Hence, independent patient data are obtained by taking only the last ED visit out of multiple visits. The calculations are performed using R and S-Plus [20,21]. Each cluster can contain at most r* = 10% of the study population. The variable scanning windows are created for each sRHA to absorb neighbours up to 10% of the total population. This upper bound is chosen based on the feasibility of analysis and time restrictions. There are about 1.95 million adults in the study population; among them, the discharged subset comprises 3039 cases (30% censored, 54% male) with an average age of 68.04 years. The identified primary and secondary clusters are shown in Table 1 and Fig. 1. The most likely cluster with significantly longer times to event is mainly from RHAs R7-R9. This cluster is identified with 260 observed cases. The LLR is 710.75 with an associated p value (P) of 0.001. This SSS provides two different statistically significant secondary clusters. The first is a part of R6 and the second is a combination of sRHAs from R1 and R3. Median times to event are 177, 51, and 104 days inside the primary, secondary (1), and secondary (2) detected clusters, respectively. The corresponding 95% CIs are 128-223, 38-75, and 77-150 days. For the rest of the province, excluding the primary and both secondary clusters, the median event time is 78 days and the 95% CI is 71-84 days. Figure 2 shows the Kaplan-Meier curves for the detected primary and secondary clusters and the rest of the province. The SSS based on the Weibull distribution has also been applied to the same Alberta Health data and detects the same primary cluster as the log-Weibull SSS, i.e., from RHAs R7-R9, with no significant secondary cluster.
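Median times to event under right censoring are commonly read off the Kaplan-Meier curve. The sketch below shows one way to do this, taking the median as the earliest time at which the estimated survival drops to 0.5 or below; whether the reported medians were computed exactly this way is an assumption, and the function name `km_median` is hypothetical.

```python
def km_median(times, deltas):
    """Kaplan-Meier median: first time at which the estimated S(t) <= 0.5.

    deltas[i] = 1 for an observed event, 0 for a right-censored time.
    Returns None when the curve never reaches 0.5 (too much censoring).
    """
    data = sorted(zip(times, deltas))
    n_at_risk = len(data)
    surv = 1.0
    i = 0
    while i < len(data):
        t = data[i][0]
        # Count all individuals tied at time t and how many of them had the event.
        ties = sum(1 for tt, _ in data if tt == t)
        events = sum(1 for tt, d in data if tt == t and d == 1)
        if events:
            surv *= 1.0 - events / n_at_risk
            if surv <= 0.5:
                return t
        # Events and censored individuals both leave the risk set after t.
        n_at_risk -= ties
        i += ties
    return None
```

Applying such a function separately to the individuals inside each detected cluster and to the rest of the study region gives group-wise medians of the kind reported above.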

Simulation studies
Simulation studies are conducted to investigate the power of detecting a potential cluster and the effect of right differential censoring on cluster detection. All of the datasets are analyzed with the log-Weibull and Weibull SSS's. Time to event data are randomly generated for 500 individuals with five different probability models: the exponential, Weibull, log-Normal, gamma, and log-Weibull. The Alberta geography is used as the geography for analysis and the Alberta population is used to create the zones for the simulation studies. Like the spatial scan analysis of the real administrative data, an upper bound of 10% is imposed on the population size.
For all simulated datasets, a true cluster of 25 individuals is created at a subregion of the R201 sRHA, with longer times to event than the rest of the province. This subregion was chosen because it is rural and away from the rural cluster detected in the real Alberta ED data. R201 was assigned the same percentage of individuals as in the real dataset (i.e., approximately 5% of the cases in each simulated dataset). This choice allowed the simulation studies to run in a reasonable amount of time. Right differential censoring is added with ratios of 20%:20%, 20%:40%, and 40%:20% for inside:outside the true cluster. For example, 20%:40% means that 20% censoring is used within the true cluster and 40% outside the true cluster.
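One way to generate data with a chosen inside:outside censoring ratio is sketched below. The mechanism is an assumption made for illustration, since the text does not specify how the censoring percentages are imposed: a fixed fraction of individuals in each group is selected at random and their observed time is drawn uniformly below their event time. Exponential event times and the function name `simulate_group` are likewise illustrative choices.

```python
import random

def simulate_group(n, mean, censor_frac, rng):
    """Simulate (observed time, delta) pairs for one group.

    n           : number of individuals in the group
    mean        : mean of the exponential event times (illustrative model)
    censor_frac : fraction of the group that is right censored
    """
    n_cens = round(censor_frac * n)
    censored = set(rng.sample(range(n), n_cens))
    data = []
    for i in range(n):
        t = rng.expovariate(1.0 / mean)
        if i in censored:
            # Assumed mechanism: the censoring time falls uniformly before the event.
            data.append((rng.uniform(0.0, t), 0))
        else:
            data.append((t, 1))
    return data

rng = random.Random(2018)
inside = simulate_group(25, 10.0, 0.20, rng)    # true cluster, 20% censored
outside = simulate_group(475, 2.0, 0.40, rng)   # rest of the region, 40% censored
```

This corresponds to the 20%:40% setting with 25 individuals inside the true cluster and 475 outside.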
One thousand simulated datasets are generated from the probability models defined above, using the differential censoring settings, under the alternative hypothesis that clusters with longer than expected times to event exist. The choice of 1000 simulations matches the number used in the development of the Weibull SSS [10] and was computationally feasible. For comparability, the parameters of each probability model are chosen so that they provide a constant mean of 2 outside the true cluster and means of 10, 15, and 20 inside the true cluster for each censoring ratio. These values were chosen to be similar to the inside:outside ratio of mean times to event in the real data used in the application.
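Matching the mean across probability models amounts to solving each model's mean formula for one parameter while the other is held fixed. The helper functions below are a sketch under assumed parameterizations (the standard scale-shape Weibull, the mean-variance log-normal, the shape-scale gamma, and the log-Weibull form with survival function exp(-exp((t - a)/b))); the function names and the particular shape and scale values used with them are hypothetical, not the settings of the study.

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant

def weibull_scale_for_mean(mean, shape):
    """Standard Weibull, S(t) = exp(-(t/scale)^shape): mean = scale * Gamma(1 + 1/shape)."""
    return mean / math.gamma(1.0 + 1.0 / shape)

def lognormal_mu_for_mean(mean, sigma):
    """Log-normal: mean = exp(mu + sigma^2 / 2)."""
    return math.log(mean) - sigma ** 2 / 2.0

def gamma_scale_for_mean(mean, shape):
    """Gamma with shape k and scale s: mean = k * s."""
    return mean / shape

def log_weibull_location_for_mean(mean, b):
    """Log-Weibull with S(t) = exp(-exp((t - a)/b)): mean = a - EULER_GAMMA * b."""
    return mean + EULER_GAMMA * b
```

For the exponential model the mean is the single parameter, so no helper is needed.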
For each simulated dataset, 999 random permutations are performed to obtain the p values from the permutation testing procedure. Let $Z^*$, $Z^{(m)}$, and $M$ represent the true cluster, the cluster identified in the $m$th simulation, and the total number of simulations, respectively. Power is calculated as the proportion of datasets out of 1000 having p values < 0.05 [9,10], not necessarily detecting the true cluster, i.e., $\text{Power} = \frac{1}{M} \sum_{m=1}^{M} I\left(p^{(m)} < 0.05\right)$, where $p^{(m)}$ is the p value of the cluster identified in the $m$th simulation. In order to observe the strength of identification of the true cluster by each SSS, three different proportions are calculated for mutually exclusive situations from the 1000 randomly generated datasets under each probability model: the detected cluster $Z^{(m)}$ coincides exactly with the true cluster $Z^*$, $Z^{(m)}$ is a larger cluster that includes $Z^*$, and $Z^{(m)}$ does not identify $Z^*$. In addition to the three cluster performance measures listed above, a global indicator for performance assessment has been used [22], based on the coefficient developed by Tanimoto [23,24]. The Tanimoto coefficient (TC) is computed for each simulated dataset and measures the similarity between a simulated and a detected cluster as the ratio of the intersecting cluster cohort to the union cluster cohort. In order to calculate TC, four types of spatial units (SUs) are counted, defined by whether a unit belongs to the simulated cluster and whether it belongs to the detected cluster. Global performance is assessed using the average and cumulated Tanimoto coefficients, TC a and TC c, which take both location accuracy and power into account at the same time. Guttmann et al. have assessed the superiority of TC c over TC a based on their functional properties and variability, and observed that TC c has more power for capturing low accuracy in cluster location [22].
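The performance measures can be sketched as follows. Power and the per-dataset Tanimoto coefficient follow directly from the descriptions above. The forms used for the average and cumulated coefficients (a mean of per-dataset ratios versus a ratio of summed intersections to summed unions) and the convention of treating a dataset with no significant cluster as an empty detection are assumptions, and all function names are hypothetical.

```python
def power(p_values, alpha=0.05):
    """Proportion of simulated datasets whose most likely cluster has p < alpha."""
    return sum(p < alpha for p in p_values) / len(p_values)

def tanimoto(true_cluster, detected):
    """Ratio of the intersection to the union of the true and detected clusters."""
    true_cluster, detected = set(true_cluster), set(detected)
    union = true_cluster | detected
    return len(true_cluster & detected) / len(union) if union else 0.0

def tanimoto_average(true_cluster, detections):
    """Assumed form of TC_a: mean of the per-dataset coefficients."""
    return sum(tanimoto(true_cluster, d) for d in detections) / len(detections)

def tanimoto_cumulated(true_cluster, detections):
    """Assumed form of TC_c: summed intersections over summed unions."""
    truth = set(true_cluster)
    inter = sum(len(truth & set(d)) for d in detections)
    union = sum(len(truth | set(d)) for d in detections)
    return inter / union

# Three hypothetical simulation results for a true cluster of units {1, 2}:
# an exact match, a larger cluster containing it, and no significant cluster.
detections = [{1, 2}, {1, 2, 3}, set()]
tc_a = tanimoto_average({1, 2}, detections)
tc_c = tanimoto_cumulated({1, 2}, detections)
```

In this toy case the per-dataset coefficients are 1, 2/3, and 0, so the average is 5/9, while the cumulated version pools the counts to give 4/7.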
Using the log-Weibull SSS (Table 2, Figs. 3 and 4), the results show that the values of power vary from 0.326 to 0.721 for the 20%:20% censoring setting, from 0.148 to 0.941 for the 20%:40% setting, and from 0.350 to 0.737 for the 40%:20% setting. Overall, the maximum power is seen when the data are generated under the Weibull distribution, and the minimum power is observed for the datasets generated from the gamma and exponential probability models.
The proportions of datasets perfectly identifying the true cluster fluctuate for the log-Weibull SSS. They are between 0.000 and 0.310 for the 20%:20% case, between 0.000 and 0.186 for the 20%:40% censoring ratio, and between 0.000 and 0.264 for the 40%:20% censoring setting. For the log-Weibull SSS, high proportions of datasets detect a larger cluster that includes the true cluster. These proportions range from 0.000 to 1.000 for all three differential censoring situations. Overall, the maximum proportion of perfect identification is achieved for the datasets generated from the log-Weibull distribution. The datasets from the exponential distribution have the highest proportions of large cluster identification including the true cluster among all five probability models. Slight decreases are found in the power and in the strength of identification of the true cluster for each model when comparing the 20%:20% case with the 20%:40% and 40%:20% censoring cases.
For the log-Weibull SSS, the values of TC a range from 0.060 to 0.448 for all three censoring situations. The TC c values lie between 0.189 and 0.491, with very little variability among the five probability models used to generate the data.
For the Weibull SSS (Table 3, Figs. 5 and 6), the overall results for the power and for all the proportions are less variable than the results of the log-Weibull SSS. The power values for detecting a potential cluster are between 0.256 and 0.971 for the 20%:20% censoring setting, between 0.230 and 0.999 for the 20%:40% censoring ratio, and between 0.355 and 0.981 for the 40%:20% case. The proportions of datasets perfectly detecting the true cluster are high for all three censoring situations across all of the datasets as compared to the log-Weibull SSS, and are lowest for the exponential model. The non-zero proportions of datasets generated under the five probability distributions that do not identify the true cluster are between 0.000 and 0.997. The power values increase as the difference between the means inside and outside the cluster increases, and similar effects are seen for the strength of detection of the true cluster.
For the Weibull SSS, the values of TC a and TC c range from 0.090 to 0.478 and from 0.226 to 0.489, respectively. This study shows that the Weibull SSS has more similar

Table 2 Simulation study results for the log-Weibull spatial scan statistic

Average and cumulated Tanimoto coefficients of the log-Weibull spatial scan statistic for cluster detection under right differential censoring. Datasets are generated using five probability models with outside cluster mean = 2

Table 3 Simulation study results for the Weibull spatial scan statistic

Discussion
The spatial scan statistic (SSS) is a widely used statistical technique for the identification of the spatial clusters of different data types by using various probability distributions. In the context of time to event data, the SSS has the ability to detect geographical clusters of cases with either longer and/or shorter than expected event times. These clusters can be adjusted for censoring, if the appropriate probability model is used.
We have proposed the SSS for the log-Weibull distribution as a new approach for detecting spatial clusters for time to event data. The log-Weibull distribution has wide applications in extreme value theory for modeling extreme and rare events. The new log-Weibull method and the Weibull SSS are applied to administrative data from Alberta Health consisting of the time from ED discharge for an AFF presentation to the first specialist visit within 365 days in Alberta during 2010-2011. Results from the SSS show that the primary cluster is detected in the Peace Country, Northern Lights, and Aspen Regional Health Authorities. The most likely cluster comprises rural areas in northern Alberta that have sparse or low populations and are farther from major metropolitan centres. The results suggest that people living in these northern rural areas may not have regular or quick access to specialist follow-up care after an ED presentation. Our results agree with the recognized issue of health care access for rural residents, for which strategies such as mobile services, telehealth, and rotating specialists have been suggested and/or implemented [25]. While we recognize that the censoring might be quite early for patients with an ED visit in late 2011 and that the methods may be affected by short follow-up, these effects would apply across all areas of the province, and we feel that the results are likely linked to real clustering and are plausible given the recognized issue of health care access.
The simulation studies indicate that the power of detecting the potential cluster is higher for the 20%:20% censoring ratio than for the 20%:40% and 40%:20% settings. The same holds for the identification of the true cluster. Whether the Weibull or the log-Weibull distribution is used for the SSS, the effect of right differential censoring on power and on detection of the true cluster is similar. For both of the probability models used under the SSS's, as the difference between the mean times to event inside and outside the true cluster increases, the power and the proportion of detection of the true cluster also increase. It can be observed from the overall results of both SSS's that the Weibull SSS has good power for detecting a potential cluster for datasets generated from any of the five probability models used in this study. However, overall, the log-Weibull SSS's performance is satisfactory for data distributed as log-Weibull. For the identification of the true cluster, the Weibull SSS shows less variability on the simulated datasets than the log-Weibull SSS. The log-Weibull SSS shows the most power to detect a true cluster for the datasets generated from the log-Weibull distribution. When various differential censoring situations are considered, the global performance indicators for the log-Weibull SSS do not vary widely. Conversely, when there was less censoring inside the cluster than outside the cluster, the log-Weibull SSS had highly variable performance that depended on the underlying data distribution.
The results based on the global indicator for performance assessment also support the above conclusions: the Weibull SSS detects the true cluster with more power and location accuracy at the same time, whereas the log-Weibull SSS shows high accuracy in detecting significant clusters for the datasets generated from the log-Weibull probability distribution. It is also observed that the log-Weibull SSS has a good ability to detect a broader cluster that includes the true cluster, rather than identifying the exact true cluster. We suggest that the log-Weibull SSS can be used to detect a spatial cluster for time to event data distributed as log-Weibull. Based on the simulation study results for both SSS's, the log-Weibull SSS proved to be less effective than the Weibull SSS when the dataset is generated from the exponential distribution. When the underlying data distribution is not exponential, the log-Weibull SSS has slightly lower performance than the Weibull SSS; however, the log-Weibull SSS had similar performance across different underlying data distributions, especially when the censoring ratio is higher inside the true cluster than outside the true cluster.
There are many opportunities for future work. For example, the proposed methodology based on the SSS for the log-Weibull distribution does not adjust for important factors such as age and gender. In future, the analysis could adjust for such covariates when identifying potential clusters for time to event data. Furthermore, the newly developed method can only be applied in a purely spatial setting. The space-time scan statistic has been developed by other authors in both retrospective [15] and prospective [16] forms. In the future, the SSS based on the log-Weibull distribution can be extended to the space-time setting, and similar simulation studies can be performed to investigate the power of detecting space-time clusters.

Conclusions
We have proposed a new SSS using the log-Weibull distribution. The new method has been applied to specialist follow-up data in Alberta, and the SSS's have been