 Methodology
 Open access
A logWeibull spatial scan statistic for time to event data
International Journal of Health Geographics volume 17, Article number: 20 (2018)
Abstract
Background
Spatial scan statistics have been used for the identification of geographic clusters of elevated numbers of cases of a condition such as disease outbreaks. These statistics accompanied by the appropriate distribution can also identify geographic areas with either longer or shorter time to events. Other authors have proposed the spatial scan statistics based on the exponential and Weibull distributions.
Results
We propose the logWeibull as an alternative distribution for the spatial scan statistic for time to events data and compare and contrast the logWeibull and Weibull distributions through simulation studies. The effect of type I differential censoring and power have been investigated through simulated data. Methods are also illustrated on time to specialist visit data for discharged patients presenting to emergency departments for atrial fibrillation and flutter in Alberta during 2010–2011. We found northern regions of Alberta had longer times to specialist visit than other areas.
Conclusions
We proposed the spatial scan statistic for the logWeibull distribution as a new approach for detecting spatial clusters for time to event data. The simulation studies suggest that the test performs well for logWeibull data.
Background
A spatial disease cluster is a geographic region with more cases of a disease condition than expected. Timely detection of spatial disease clusters enables health authorities to better understand the distribution of disease and, if possible, to control it. Many methods have been proposed and applied for the identification and evaluation of geographical disease clusters and for disease surveillance, and the spatial scan statistic (SSS) is one of them.
The SSS, with its possible extensions has been widely used as a standardized approach for the last two decades, not only in the disease clustering but also in various other fields of study like natural disasters [1], forestry [2], astronomical data [3], history [4], and psychology [5]. It was first proposed by Kulldorff and Nagarwalla and has the capability of identifying spatial clusters of variable sizes and locations [6]. The key reasons for the popularity of this method include that it identifies the cluster location and tests the tendency to cluster [7]. According to Costa and Assunção, the latter advantage is considered to be more important in terms of health related interventions than global clustering results [7]. The SSS’s based on the Bernoulli and Poisson models are frequently used for count data for cluster identification and geographical disease surveillance [8, 9]. These scan statistics have been further extended to other kinds of data such as ordinal [11], multinomial [12], continuous [13], and correlated count data [14].
Time to event data along with the censoring component (e.g., survival data) is one of the important health outcomes for which the SSS is of interest [9]. The SSS for time to event data is used to determine if there are geographical clusters with either longer than expected and/or shorter than expected time to event. The exponential [9] and Weibull [10] SSS’s (adjusted for censoring) have already been developed for time to event data. We propose the logWeibull as an alternative distribution for the SSS for cluster detection of time to event data. The logWeibull distribution has wide applications in extreme value theory. Our focus is to establish a new SSS for the detection of rare and extreme events.
In the Methods section, we describe the existing Weibull SSS and the newly developed SSS based on the logWeibull distribution. The Application section contains the results from the identification of clusters of longer times to specialist follow-up after an emergency department presentation for atrial fibrillation and flutter in Alberta, Canada. Simulation studies are performed to investigate power, the effect of right (type I) differential censoring, and the ability of the logWeibull and Weibull spatial scan statistics to identify the true cluster.
Methods
The SSS identifies the geographic zones from a study region that have the strongest indication of representing a spatial cluster. It uses data such as administrative health data collected for geographical subregions, each characterized by a centroid (population or geographic based). The SSS places a circular search window of radius r on each centroid, centred at the coordinates of that centroid [6]. A zone (Z) defined by this circular window comprises all the individuals in the subregions whose centroids lie inside the circle [6]. For the purpose of the analysis, an upper bound r* is chosen for the size of the circular window, expressed as a percentage of the total population [10]. For each region’s centroid, the nearest neighbours that together cover at most r* percent of the total population are identified. For any given centroid, the radius of the window is expanded continuously from 0 up to this bound [10]. During the expansion, a new zone is created every time a new neighbouring centroid falls inside the circular window [14]. Zones defined in this way have irregular geographical boundaries that depend on the size and shape of the subregions whose centroids lie inside the spatial scan window [14].
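The zone construction described above can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: the function name `build_zones`, the use of planar Euclidean distance between centroids, and the rule that a neighbour is skipped once it would push the zone past the population cap are all assumptions made for the example.

```python
import math

def build_zones(centroids, populations, max_frac=0.10):
    """Return the distinct candidate zones as sorted tuples of subregion indices.

    centroids   : list of (x, y) centroid coordinates (planar, an assumption)
    populations : list of subregion population counts
    max_frac    : upper bound r* on the share of the total population in a zone
    """
    total = sum(populations)
    zones, seen = [], set()
    for cx, cy in centroids:
        # Order every subregion by its distance from the current window centre.
        order = sorted(
            range(len(centroids)),
            key=lambda j: math.hypot(centroids[j][0] - cx, centroids[j][1] - cy),
        )
        members, pop = [], 0
        for j in order:
            if pop + populations[j] > max_frac * total:
                break  # the window may not grow past the population bound
            members.append(j)
            pop += populations[j]
            zone = tuple(sorted(members))
            if zone not in seen:  # different centres can produce the same zone
                seen.add(zone)
                zones.append(zone)
    return zones
```

Each centre contributes a nested sequence of zones, so the number of candidate zones grows roughly with the number of subregions times the average number of neighbours admitted under the cap.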
The methodology of the SSS is based on calculating the maximum log likelihood ratio (LLR). The SSS partitions the geographical area into zones (i.e., areas of potential cluster versus the rest of the study region) and the LLR is calculated every time when a new zone is created for each centroid [8, 10]. The zone maximizing the LLR is called the primary (most likely) cluster. Let the primary cluster be the zone \(\hat{Z}\) that maximizes the LLR. The hypothesis under consideration is:
H_{0}: The disease risk is constant over \(\hat{Z} \cup \hat{Z}^{c}\) vs. H_{1}: There is an elevated risk in \(\hat{Z}\).
Let \(G\) be the whole study region which can be partitioned into \(Z\) and \(Z^{c}\) mutually exclusive subregions, where Z indicates a zone designated to be a potential cluster and \(Z^{c}\) is the rest of the study region. Let \(N = n_{in} + n_{out}\) be the total number of individuals in \(G\), where \(n_{in}\) and \(n_{out}\) are the total individuals inside and outside the zone, respectively. The subscripts “in” and “out” indicate that the objects are calculated from the individuals inside and outside the zone, respectively.
Let the ith individual have a time to event \(T_{i} ,\,(i = 1,\ldots,N)\) or a fixed right censoring time \(L_{i}\). The event time \(T_{i}\) is observed if \(T_{i} \le L_{i}\) \((\delta_{i} = 1)\), and \(L_{i}\) is observed if \(T_{i} > L_{i}\) \((\delta_{i} = 0)\), where \(\delta_{i}\) is the indicator of whether the time is uncensored [9]. The observed time is defined as \(t_{i} = \min (T_{i} ,L_{i} )\). Let \(R = r_{in} + r_{out}\) be the total number of uncensored observations, where \(r_{in}\) and \(r_{out}\) are the total number of uncensored observations inside and outside the zone, respectively. These are defined as \(r_{in} = \sum\nolimits_{i \in Z} {\delta_{i} }\) and \(r_{out} = \sum\nolimits_{{i \in Z^{c} }} {\delta_{i} }\).
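The censoring notation can be made concrete with a short sketch. The function names `observe` and `uncensored_counts` and the representation of a zone as a set of individual indices are hypothetical choices for illustration only.

```python
def observe(event_times, censor_times):
    """Return observed times t_i = min(T_i, L_i) and indicators delta_i."""
    t = [min(T, L) for T, L in zip(event_times, censor_times)]
    # delta_i = 1 when the event is seen (T_i <= L_i), 0 when censored.
    delta = [1 if T <= L else 0 for T, L in zip(event_times, censor_times)]
    return t, delta

def uncensored_counts(delta, zone):
    """Return (r_in, r_out) for a zone given as a set of individual indices."""
    r_in = sum(d for i, d in enumerate(delta) if i in zone)
    r_out = sum(d for i, d in enumerate(delta) if i not in zone)
    return r_in, r_out
```

For example, event times (2, 5, 7, 1) with censoring times (3, 3, 10, 1) give observed times (2, 3, 7, 1) and indicators (1, 0, 1, 1), so a zone containing the first two individuals has \(r_{in} = 1\) and \(r_{out} = 2\).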
Weibull distribution
Bhatt and Tiwari established the SSS based on the Weibull distribution. The Weibull model generalizes the exponential model by adding a shape parameter to the existing scale parameter [10]. The additional parameter allows the Weibull hazard function to take different shapes rather than being constant. We provide a brief summary of the methodology; complete details can be found in the paper by Bhatt and Tiwari [10]. Let the times to event \(T_{i}\), \((i = 1,\ldots,N)\), be i.i.d. with the Weibull probability density function (PDF) \(f\left( T_{i} \right) = \frac{p}{\theta} T_{i}^{p - 1} e^{- T_{i}^{p} / \theta}\), where \(\theta\) and \(p\) are the scale and shape parameters, respectively. Let the time to event for each individual inside the zone be distributed as the Weibull distribution with \(\theta_{in}\) and \(p_{in}\) as the scale and shape parameters, respectively. Similarly, assume that the times to event for individuals outside the zone are Weibull distributed with \(\theta_{out}\) and \(p_{out}\) as the scale and shape parameters, respectively. The null hypothesis under consideration is \(H_{0} :\theta_{in} = \theta_{out}\) versus the alternative hypotheses \(H_{1} :\theta_{in} < \theta_{out}\), \(H_{1} :\theta_{in} > \theta_{out}\), or \(H_{1} :\theta_{in} \ne \theta_{out}\). The alternative hypotheses state that at least one zone exists with shorter than expected, longer than expected, or either longer or shorter than expected times to event, respectively. The likelihood ratio test statistic \(\lambda\) for the Weibull SSS for \(H_{1} :\theta_{in} \ne \theta_{out}\) is given in [10].
Under the PDF above, the MLE of the scale parameter in a group with \(r\) uncensored observations is \(\hat{\theta} = \sum {t_{i}^{p} } / r\), so a smaller \(\theta\) corresponds to a larger value of \(r / \sum {t_{i}^{p} }\). For \(H_{1} :\theta_{in} < \theta_{out}\), \(\lambda\) is therefore multiplied by \(I\left( \frac{r_{in}}{\sum\nolimits_{i \in Z} t_{i}^{p}} > \frac{r_{out}}{\sum\nolimits_{i \in Z^{c}} t_{i}^{p}} \right)\), and similarly for \(H_{1} :\theta_{in} > \theta_{out}\), it is multiplied by \(I\left( \frac{r_{in}}{\sum\nolimits_{i \in Z} t_{i}^{p}} < \frac{r_{out}}{\sum\nolimits_{i \in Z^{c}} t_{i}^{p}} \right)\).
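For a single candidate zone, the Weibull log likelihood ratio can be sketched as follows. The closed form used here follows from the censored Weibull likelihood with \(\hat{\theta} = \sum t_{i}^{p} / r\) in each group; it is a reconstruction from the definitions above and not a transcription of the statistic in [10]. The function name `weibull_llr`, the boolean `in_zone` flags, the labels `"longer"` and `"shorter"`, and the treatment of the shape `p` as a fixed input are assumptions of this example.

```python
import math

def _r_log_ratio(r, s):
    """Return r * ln(r / s), using the convention 0 * ln(0) = 0."""
    return r * math.log(r / s) if r > 0 else 0.0

def weibull_llr(t, delta, in_zone, p, alternative="two-sided"):
    """Log likelihood ratio for one zone under a Weibull model with common shape p.

    t, delta : observed times (positive) and event indicators
    in_zone  : booleans, True when the individual lies inside the zone
    """
    s_in = sum(ti ** p for ti, z in zip(t, in_zone) if z)
    s_out = sum(ti ** p for ti, z in zip(t, in_zone) if not z)
    r_in = sum(d for d, z in zip(delta, in_zone) if z)
    r_out = sum(d for d, z in zip(delta, in_zone) if not z)
    llr = (_r_log_ratio(r_in, s_in) + _r_log_ratio(r_out, s_out)
           - _r_log_ratio(r_in + r_out, s_in + s_out))
    # Compare theta_hat = s / r inside and outside by cross-multiplication,
    # which avoids dividing by a zero event count.
    if alternative == "longer" and not s_in * r_out > s_out * r_in:
        return 0.0
    if alternative == "shorter" and not s_in * r_out < s_out * r_in:
        return 0.0
    return llr
```

The scan statistic would then be the maximum of this quantity over all candidate zones.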
LogWeibull distribution
The logWeibull distribution is a special case of the generalized extreme value distribution. It is often used to model the distribution of extreme values, strength, event history data such as quick wear-out after reaching a certain age, and logarithms of times [17]. We assume that times to event \(T_{i}\), \((i = 1,\ldots,N)\), are independently and identically distributed (i.i.d.) with the logWeibull PDF \(f\left( T_{i} \right) = \frac{1}{b}\exp \left( \frac{T_{i} - a}{b} \right)\exp \left\{ - \exp \left( \frac{T_{i} - a}{b} \right) \right\}\), where \(a\) and \(b\) are the location and scale parameters, respectively. The survival function for the logWeibull distribution is \(S\left( T_{i} \right) = \exp \left\{ - \exp \left( \frac{T_{i} - a}{b} \right) \right\}.\)
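The PDF and survival function translate directly into code, and inverting the survival function gives a simple sampler. The sketch below also encodes the link to the Weibull parameterization used earlier: if \(W\) has survival function \(\exp(-w^{p}/\theta)\), then \(\ln W\) is logWeibull with \(a = \ln(\theta)/p\) and \(b = 1/p\). All function names are hypothetical.

```python
import math
import random

def logweibull_pdf(t, a, b):
    """Density f(t) = (1/b) exp(z) exp(-exp(z)) with z = (t - a) / b."""
    z = (t - a) / b
    return math.exp(z - math.exp(z)) / b

def logweibull_survival(t, a, b):
    """Survival function S(t) = exp(-exp((t - a) / b))."""
    return math.exp(-math.exp((t - a) / b))

def sample_logweibull(n, a, b, seed=0):
    """Draw n values by inverting S: t = a + b * ln(-ln(u)), u uniform on (0, 1)."""
    rng = random.Random(seed)
    draws = []
    while len(draws) < n:
        u = rng.random()
        if u > 0.0:  # random() can return 0.0, where ln(u) is undefined
            draws.append(a + b * math.log(-math.log(u)))
    return draws

def weibull_to_logweibull(theta, p):
    """Map Weibull (scale theta, shape p) to the (a, b) of the log of the variable."""
    return math.log(theta) / p, 1.0 / p
```

Because the support of the logWeibull distribution is the whole real line, sampled values can be negative, which is natural when the values represent logarithms of times.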
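Let the time to event for each individual inside zone \(Z\) be logWeibull distributed with \(a_{in}\) and \(b_{in}\) as the location and scale parameters, respectively. Similarly, the time to event for each individual outside zone \(Z\) (i.e., inside \(Z^{c}\)) follows the logWeibull distribution with \(a_{out}\) and \(b_{out}\) as the location and scale parameters, respectively. The null hypothesis \(H_{0} :b_{in} = b_{out}\) for any \(Z\) is contrasted with one of three alternative hypotheses: \(H_{1} :b_{in} < b_{out}\), \(H_{1} :b_{in} > b_{out}\), or \(H_{1} :b_{in} \ne b_{out}\). Each uncensored observation contributes the PDF and each censored observation contributes the survival function, so the likelihood function \(L\left( Z \right) = L\left( {Z,b_{in} ,b_{out} } \right)\) for the logWeibull SSS can be written as:
\(L\left( Z, b_{in}, b_{out} \right) = \prod\limits_{i \in Z} \left[ \frac{1}{b_{in}} \exp \left( \frac{t_{i} - a_{in}}{b_{in}} \right) \right]^{\delta_{i}} \exp \left\{ - \exp \left( \frac{t_{i} - a_{in}}{b_{in}} \right) \right\} \times \prod\limits_{i \in Z^{c}} \left[ \frac{1}{b_{out}} \exp \left( \frac{t_{i} - a_{out}}{b_{out}} \right) \right]^{\delta_{i}} \exp \left\{ - \exp \left( \frac{t_{i} - a_{out}}{b_{out}} \right) \right\}\)
Taking the natural log on both sides, we have
\(\ln L\left( Z, b_{in}, b_{out} \right) = - r_{in} \ln b_{in} + \sum\limits_{i \in Z} \left[ \delta_{i} \frac{t_{i} - a_{in}}{b_{in}} - \exp \left( \frac{t_{i} - a_{in}}{b_{in}} \right) \right] - r_{out} \ln b_{out} + \sum\limits_{i \in Z^{c}} \left[ \delta_{i} \frac{t_{i} - a_{out}}{b_{out}} - \exp \left( \frac{t_{i} - a_{out}}{b_{out}} \right) \right]\)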
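For \(H_{1} :b_{in} \ne b_{out}\), for at least one zone \(Z\), the corresponding likelihood ratio statistic is
\(\lambda = \frac{\mathop {\max }\limits_{Z} L\left( {Z,b_{in} ,b_{out} } \right)}{\hat{L}} = \frac{L\left( \hat{Z} \right)}{\hat{L}}\)
where \(\hat{Z}\) is the zone maximizing \(L\left( {Z,b_{in} ,b_{out} } \right)\) under \(H_{1}\), and \(\hat{L}\) is the maximum of \(L\left( {Z,b_{in} ,b_{out} } \right)\) under \(H_{0}\). The maximum likelihood estimators (MLE’s) of the parameters \(b_{in} ,\,b_{out} ,\,a_{in} ,\) and \(a_{out}\) for any arbitrary zone \(Z\) can be obtained from the following score equations,
\(\frac{\partial \ln L}{\partial a_{in}} = 0,\quad \frac{\partial \ln L}{\partial b_{in}} = 0,\quad \frac{\partial \ln L}{\partial a_{out}} = 0,\quad \frac{\partial \ln L}{\partial b_{out}} = 0.\)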
Thus the MLE’s of the scale parameters \(b_{in}\) and \(b_{out}\) are
\(\hat{b}_{in} = \frac{1}{r_{in}}\sum\limits_{i \in Z} \left( t_{i} - \hat{a}_{in} \right) \left[ e^{\left( t_{i} - \hat{a}_{in} \right) / \hat{b}_{in}} - \delta_{i} \right]\) and \(\hat{b}_{out} = \frac{1}{r_{out}}\sum\limits_{i \in Z^{c}} \left( t_{i} - \hat{a}_{out} \right) \left[ e^{\left( t_{i} - \hat{a}_{out} \right) / \hat{b}_{out}} - \delta_{i} \right]\), respectively.
Similarly, the MLE’s of the location parameters \(a_{in}\) and \(a_{out}\) are obtained by the equations \(r_{in} = \sum\limits_{i \in Z} e^{\left( t_{i} - \hat{a}_{in} \right) / \hat{b}_{in}}\) and \(r_{out} = \sum\limits_{i \in Z^{c}} e^{\left( t_{i} - \hat{a}_{out} \right) / \hat{b}_{out}}\), respectively.
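The scale and location equations are coupled and have no closed-form solution, so they must be solved numerically. One convenient route, shown here as a worked step and not as the authors' stated procedure, is to eliminate the location parameter. Writing \(r\), \(\hat{a}\), and \(\hat{b}\) for the quantities of either group and summing over that group:

```latex
\begin{align*}
% Location equation r = \sum_i e^{(t_i - \hat{a})/\hat{b}} solved for \hat{a}:
\hat{a} &= \hat{b}\,\ln\!\left( \frac{1}{r} \sum_{i} e^{t_i/\hat{b}} \right), \\
% Scale equation expanded, using \sum_i e^{(t_i - \hat{a})/\hat{b}} = r
% and \sum_i \delta_i = r so that the terms in \hat{a} cancel:
\hat{b} &= \frac{1}{r} \sum_{i} t_i\, e^{(t_i - \hat{a})/\hat{b}} - \frac{1}{r} \sum_{i} \delta_i t_i
         = \frac{\sum_{i} t_i\, e^{t_i/\hat{b}}}{\sum_{i} e^{t_i/\hat{b}}} - \frac{1}{r} \sum_{i} \delta_i t_i .
\end{align*}
```

The second line is a single equation in \(\hat{b}\) that can be solved by a one-dimensional root finder, after which \(\hat{a}\) follows from the first line.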
Under \(H_{1} :b_{in} \ne b_{out}\), the obtained MLE’s provide
Similarly, under \(H_{0} :b_{in} = b_{out}\),
So, the likelihood ratio statistic for \(H_{1} :b_{in} \ne b_{out}\) is
In order to address the alternative hypotheses \(b_{in} < b_{out}\) and \(b_{in} > b_{out}\), the function \(\lambda\) is multiplied by \(I\left( {\hat{b}_{in} < \hat{b}_{out} } \right)\) and \(I\left( {\hat{b}_{in} > \hat{b}_{out} } \right)\), respectively.
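The following sketch computes a logWeibull log likelihood ratio for one zone. It is illustrative only and rests on assumptions that the text does not confirm: the null model is taken to be a single logWeibull distribution with common \(a\) and \(b\) fitted to all individuals, the MLE is found by bisection on the one-dimensional scale equation obtained by eliminating \(\hat{a}\), and the names `fit_logweibull`, `max_loglik`, and `logweibull_llr` are hypothetical. At the MLE the location equation gives \(\sum e^{(t_{i} - \hat{a})/\hat{b}} = r\), so the maximized log likelihood of a group is \(- r\ln \hat{b} + \sum \delta_{i} (t_{i} - \hat{a})/\hat{b} - r\).

```python
import math

def fit_logweibull(t, delta, tol=1e-10):
    """Censored-data MLE (a_hat, b_hat) for one group of observations."""
    r = sum(delta)
    t_max, t_min = max(t), min(t)
    if r == 0 or t_max == t_min:
        raise ValueError("need at least one event and non-constant times")
    mean_events = sum(ti * di for ti, di in zip(t, delta)) / r

    def g(b):
        # Exponents are shifted by t_max so that exp never overflows.
        w = [math.exp((ti - t_max) / b) for ti in t]
        return sum(ti * wi for ti, wi in zip(t, w)) / sum(w) - mean_events - b

    lo, hi = 1e-6 * (t_max - t_min), t_max - t_min
    if g(lo) <= 0:
        raise ValueError("no interior solution for the scale parameter")
    while g(hi) > 0:  # expand the bracket until the sign changes
        hi *= 2.0
    while hi - lo > tol * hi:  # plain bisection
        mid = 0.5 * (lo + hi)
        if g(mid) > 0:
            lo = mid
        else:
            hi = mid
    b = 0.5 * (lo + hi)
    a = t_max + b * math.log(sum(math.exp((ti - t_max) / b) for ti in t) / r)
    return a, b

def max_loglik(t, delta):
    """Return the maximized log likelihood of one group and its scale estimate."""
    a, b = fit_logweibull(t, delta)
    r = sum(delta)
    ll = -r * math.log(b) + sum(d * (ti - a) / b for ti, d in zip(t, delta)) - r
    return ll, b

def logweibull_llr(t, delta, in_zone, alternative="two-sided"):
    """LLR for one zone; the null fits one common (a, b), which is an assumption."""
    t_in = [ti for ti, z in zip(t, in_zone) if z]
    d_in = [d for d, z in zip(delta, in_zone) if z]
    t_out = [ti for ti, z in zip(t, in_zone) if not z]
    d_out = [d for d, z in zip(delta, in_zone) if not z]
    ll_in, b_in = max_loglik(t_in, d_in)
    ll_out, b_out = max_loglik(t_out, d_out)
    ll_null, _ = max_loglik(t, delta)
    if alternative == "less" and not b_in < b_out:
        return 0.0
    if alternative == "greater" and not b_in > b_out:
        return 0.0
    return ll_in + ll_out - ll_null
```

Zones with no events or with identical observed times raise an error in this sketch; a full scan would need to skip or otherwise handle such zones.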
Permutation test procedure
Since the distribution of the test statistic \(\lambda\) has no closed analytical form, a permutation test procedure is used to assess the statistical significance of the selected clusters. The exact distribution of the times to event is unknown, so it is not possible to generate simulated data under the null hypothesis. To overcome this, the observed pairs \(\left\{ {\left( {t_{i} ,\delta_{i} } \right),i = 1,2, \ldots ,N} \right\}\) are permuted 999 times among the individual geographical coordinates of the original study region [9]. For each permuted dataset, the LLR is calculated for each zone and the maximum LLR, which corresponds to the most likely cluster in that dataset, is saved. A p value is calculated as the fraction of permuted datasets whose maximum LLR is at least as extreme as the test statistic from the observed time to event data [18]. This permutation step ensures that, no matter how the observed time to event data are distributed, this distribution is preserved for each permuted dataset. This property provides valid statistical inference since all the permuted datasets are identically distributed [9]. Secondary clusters are the significant spatial clusters that do not overlap with the primary cluster [9]. These clusters are ranked by their LLR values, and the associated p values are calculated by comparing the kth highest LLR in the real dataset with the maximum LLR in the randomly permuted datasets [9]. Note that the use of a permutation test procedure means that there will be variation in the exact p values for successive analyses of the same dataset.
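The permutation procedure can be sketched as below. The function `permutation_p_value` and its `stat_fn` argument (any function that maps the location-ordered list of \((t_{i}, \delta_{i})\) pairs to the maximum LLR over all zones) are hypothetical. Counting the observed dataset in both numerator and denominator is a common Monte Carlo convention assumed here; with 999 permutations it makes 0.001 the smallest attainable p value.

```python
import random

def permutation_p_value(pairs, stat_fn, n_perm=999, seed=0):
    """Monte Carlo p value for the scan statistic.

    pairs   : list of (t_i, delta_i), one per individual location
    stat_fn : maps such a list to the maximum LLR over all zones
    """
    rng = random.Random(seed)
    observed = stat_fn(pairs)
    shuffled = list(pairs)
    at_least_as_extreme = 0
    for _ in range(n_perm):
        # Reassign the (time, indicator) pairs to locations at random;
        # each pair stays intact, so the censoring pattern is preserved.
        rng.shuffle(shuffled)
        if stat_fn(shuffled) >= observed:
            at_least_as_extreme += 1
    return (at_least_as_extreme + 1) / (n_perm + 1)
```

Because the permutations are random, repeated runs with different seeds can give slightly different p values for the same dataset.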
Results
Emergency data application
We illustrate the logWeibull SSS on population-based administrative data (age ≥ 35) for patients discharged from the emergency department (ED) who presented with atrial fibrillation and flutter (AFF) in the province of Alberta from April 1, 2010, to March 31, 2011. In 2003, the province of Alberta was divided into nine administrative health areas, also called Regional Health Authorities (RHAs) [19]. These RHAs were further partitioned into 70 sub-Regional Health Authorities (sRHAs) (Fig. 1, numbered 1–70). The sRHAs have diverse population sizes, ranging from 550 to 140,211 with a median population size of 46,075 in 2011, and are the smallest geographical units available for analysis. The latitude and longitude of the population-based centroid of each sRHA are provided by Alberta Health [19]. Distances between the pairs of sRHA population-based centroids are ordered and used to create the nearest neighbours.
The key outcome of interest is the time from ED discharge for AFF to the first specialist visit within the 365 days of the study period. A specialist in this study is a cardiology (CARD) or internal medicine (INMD) physician. A specialist follow-up visit can occur between the ED end time and the end of the study. Each discharged ED presentation during April 1, 2010, to March 31, 2011, with a follow-up visit to a specialist between its ED end time and March 31, 2011, is considered a complete time to event outcome. If the patient did not have a specialist visit by the end of the study (March 31, 2011), the outcome is right (type I) censored. Each Alberta resident making at least one discharged ED presentation for AFF during the fiscal year is referred to as a case (patient).
The methodology used in this study does not adjust for repeated ED presentations of cases. Hence, to obtain independent patient data, only the last ED visit is used for patients with multiple visits. The calculations are performed using R and S-Plus [20, 21]. Each cluster can contain at most \(r* = 10\%\) of the study population. The variable scanning windows are created for each sRHA to absorb neighbours up to 10% of the total population. This upper bound is chosen based on the feasibility of analysis and time restrictions. There are about 1.95 million adults in the study population; the discharged subset comprises 3039 cases (30% censored, 54% male) with an average age of 68.04 years. The median time to event for the whole dataset is 81 days and the corresponding 95% confidence interval (CI) is 76–86 days.
The identified primary and secondary clusters are shown in Table 1 and Fig. 1. The most likely cluster with significantly longer times to events is mainly from the R7–R9 RHAs. This cluster contains 260 observed cases. The LLR is 710.75 with an associated p value (P) of 0.001. This SSS provides two different statistically significant secondary clusters. The first one is a part of R6 and the second cluster is a combination of sRHAs from R1 and R3. Median times to event are 177, 51, and 104 days inside the primary, secondary (1), and secondary (2) detected clusters, respectively. The corresponding 95% CIs are 128–223, 38–75, and 77–150 days. For the rest of the province, excluding the primary and both secondary clusters, the median event time is 78 days and the 95% CI is 71–84 days. Figure 2 shows the Kaplan–Meier curves for the detected primary and secondary clusters and the rest of the province. The SSS based on the Weibull distribution has also been applied to the same Alberta Health data and detects the same primary cluster as the logWeibull SSS, i.e., from the R7–R9 RHAs, with no significant secondary cluster.
Simulation studies
Simulation studies are conducted to investigate the power of detecting a potential cluster and the effect of right differential censoring on cluster detection. All of the datasets are analyzed with the logWeibull and Weibull SSS’s. Time to event data are randomly generated for 500 individuals with five different probability models: the exponential, Weibull, logNormal, gamma, and logWeibull. The Alberta geography is used as the geography for analysis and the Alberta population is used to create the zones for the simulation studies. Like the spatial scan analysis of the real administrative data, an upper bound of 10% is imposed on the population size.
For all simulated datasets, a true cluster of 25 individuals is created at a subregion of R201 sRHA, to have longer time to events than the rest of the province. This subregion was chosen because it was rural and away from the detected rural cluster in the real Alberta ED data. R201 was assigned the same percentage of individuals as of the real dataset (i.e., approximately 5% cases in each simulated data). This choice was feasible for simulation studies to run in a reasonable amount of time. Right differential censoring is added with the ratios of 20%:20%, 20%:40%, and 40%:20% for inside:outside the true cluster. For example, 20%:40% means that 20% censoring is used within the true cluster and 40% outside the true cluster.
One thousand simulated datasets are generated from the probability models defined above using the differential censoring settings under the alternative hypotheses of the existence of longer than expected time to event clusters. The choice of 1000 simulations is the same as what was chosen for the development of the Weibull SSS [10] and was computationally timely. For symmetry, parameters for each probability model are chosen in such a way that they provide a constant mean of 2 outside the true cluster and means of 10, 15, and 20 inside the true cluster for each censoring ratio. These values were chosen to be similar to the inside:outside times to event means ratio from real data used in the application.
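One way to generate a dataset of this kind is sketched below for the exponential model only. The text does not state how the censoring percentages were imposed, so the mechanism here is an assumption: each group receives a fixed censoring time equal to the quantile at which the stated share of event times is expected to be censored. The function name and argument names are hypothetical.

```python
import math
import random

def simulate_exponential(n_in, n_out, mean_in, mean_out, cens_in, cens_out, seed=0):
    """Return a list of (t_i, delta_i, inside) triples with differential censoring.

    cens_in, cens_out : expected censored fractions inside and outside the cluster
    """
    rng = random.Random(seed)
    groups = ((n_in, mean_in, cens_in, True), (n_out, mean_out, cens_out, False))
    data = []
    for n, mean, cens, inside in groups:
        # For an exponential with this mean, P(T > limit) = cens.
        limit = -mean * math.log(cens)
        for _ in range(n):
            event_time = rng.expovariate(1.0 / mean)
            observed = min(event_time, limit)
            delta = 1 if event_time <= limit else 0
            data.append((observed, delta, inside))
    return data
```

For instance, `simulate_exponential(25, 475, 10.0, 2.0, 0.2, 0.4)` would mimic the 20%:40% setting with means of 10 inside and 2 outside a true cluster of 25 individuals.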
For each simulated dataset, 999 random permutations are performed to get the p values from the permutation testing procedure. Let \(Z^{*}\), \(Z^{(m)}\), and \(M\) represent the true cluster, the cluster identified in the mth simulation, and the total number of simulations, respectively. Power is calculated as the proportion of the 1000 datasets having p values < 0.05 [9, 10], whether or not the true cluster is detected, i.e., \({\text{Power}} = \frac{1}{M}\sum\limits_{m = 1}^{M} {I_{{\left[ {p^{\left( m \right)} < 0.05} \right]}} }\), where \(p^{(m)}\) denotes the p value of the most likely cluster in the mth simulation.
In order to observe the strength of identification of the true cluster by each SSS, three different proportions are calculated for mutually exclusive situations from 1000 randomly generated datasets under each probability model for all censoring situations. These indicators are essentially the same as those reported for the exponential and Weibull based SSS’s [9, 10], and we have adapted slightly to reflect the aggregate nature of the data.
These are the proportion of datasets:

1. Perfectly identifying the true cluster \(\left( {{\text{PI}} = \frac{1}{M}\sum\limits_{m = 1}^{M} {I_{{\left[ {Z^{*} = Z^{\left( m \right)} } \right]}} } } \right)\);

2. Identifying a large cluster including the true cluster \(\left( {{\text{LC}} = \frac{1}{M}\sum\limits_{m = 1}^{M} {I_{{\left[ {Z^{*} \subset Z^{\left( m \right)} } \right]}} } } \right)\); and,

3. Not identifying the true cluster \(\left( {{\text{NI}} = \frac{1}{M}\sum\limits_{m = 1}^{M} {I_{{\left[ {Z^{*} \nsubseteq Z^{\left( m \right)} } \right]}} } } \right)\).
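Power and the three identification proportions can be computed together, as in the sketch below. Clusters are represented as sets of spatial unit identifiers. The function name `detection_summary` is hypothetical, and counting every simulation's most likely cluster in PI, LC, and NI regardless of its p value is an assumption of this example.

```python
def detection_summary(true_cluster, detected, p_values, alpha=0.05):
    """Return power, PI, LC and NI over M simulations.

    true_cluster : iterable of spatial unit ids forming Z*
    detected     : list of iterables, the most likely cluster Z(m) per simulation
    p_values     : list of permutation p values, one per simulation
    """
    truth = frozenset(true_cluster)
    found = [frozenset(z) for z in detected]
    m = len(found)
    return {
        "power": sum(p < alpha for p in p_values) / m,
        "PI": sum(z == truth for z in found) / m,       # Z* equals Z(m)
        "LC": sum(truth < z for z in found) / m,        # Z* is a proper subset of Z(m)
        "NI": sum(not truth <= z for z in found) / m,   # Z* is not contained in Z(m)
    }
```

The three identification categories are mutually exclusive and exhaustive, so PI, LC, and NI sum to one.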
In addition to the three cluster performance measures listed above, a global indicator for performance assessment has been used [22] based on the coefficient developed by Tanimoto [23, 24]. The Tanimoto coefficient (TC) is computed for each simulated data set and measures the similarity between a simulated and detected cluster by using the ratio of the intersecting cluster cohort to the union cluster cohort. In order to calculate TC, four types of spatial units (SUs) are calculated and defined as:

True Positive (TP) = SUs both within \(Z^{*}\) and \(Z^{(m)}\);

False Positive (FP) = SUs only within \(Z^{(m)}\);

False Negative (FN) = SUs only within \(Z^{*}\); and,

True Negative (TN) = SUs not within either cluster.
The TC computed for each simulated data set is \(TC = \frac{TP}{TP + FP + FN}\). The geographical region used in this simulation study is divided into 70 SUs. When no significant cluster is detected i.e., p value is higher than 0.05, we get TP = 0, FP = 0, TN = 69, and FN = 1.
The average Tanimoto coefficient (TC_{a}) and the cumulated Tanimoto coefficient (TC_{c}) were used as the statistics of TC. These are defined as \(TC_{a} = \frac{1}{M}\sum\limits_{m = 1}^{M} {\frac{{TP_{m} }}{{\left( {TP_{m} + FP_{m} + FN_{m} } \right)}}}\) and \(TC_{c} = \frac{{\sum\limits_{m = 1}^{M} {TP_{m} } }}{{\sum\limits_{m = 1}^{M} {\left( {TP_{m} + FP_{m} + FN_{m} } \right)} }}\). Global performance is assessed using TC_{a} and TC_{c} by taking both location accuracy and power into account at the same time. Guttmann et al. have assessed the superiority of TC_{c} over TC_{a} based on their functional properties and variability, and observed that TC_{c} has more power of capturing low accuracy in cluster location [22].
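The two Tanimoto summaries can be computed as follows. The sketch treats a simulation with p value of 0.05 or higher as having detected nothing, which gives TP = 0, FP = 0, and FN equal to the size of the true cluster, in line with the convention stated above. The function name `tanimoto_summary` is hypothetical.

```python
def tanimoto_summary(true_cluster, detected, p_values, alpha=0.05):
    """Return (TC_a, TC_c) over M simulations; the true cluster must be non-empty."""
    truth = set(true_cluster)
    per_simulation = []
    tp_total = 0
    denom_total = 0
    for zone, p in zip(detected, p_values):
        # A non-significant result counts as an empty detected cluster.
        found = set(zone) if p < alpha else set()
        tp = len(truth & found)
        fp = len(found - truth)
        fn = len(truth - found)
        per_simulation.append(tp / (tp + fp + fn))
        tp_total += tp
        denom_total += tp + fp + fn
    tc_a = sum(per_simulation) / len(per_simulation)
    tc_c = tp_total / denom_total
    return tc_a, tc_c
```

With a true cluster {1}, detected clusters {1}, {1, 2}, {3}, {1} and p values 0.01, 0.02, 0.03, 0.5, the per-simulation coefficients are 1, 0.5, 0, and 0, so \(TC_{a} = 0.375\) and \(TC_{c} = 2/6\).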
Using the logWeibull SSS (Table 2, Figs. 3 and 4), the results show that the values of power vary from 0.326 to 0.721 for the 20%:20% censoring, from 0.148 to 0.941 for the 20%:40% censoring situation, and range from 0.350 to 0.737 for the 40%:20% censoring case. Overall, the maximum power is seen when the data are generated under the Weibull distribution and the minimum power is observed for the datasets distributed with the gamma and exponential probability models.
The proportions of datasets perfectly identifying the true cluster fluctuate for the logWeibull SSS. They are between 0.000 and 0.310 for the 20%:20% case, between 0.000 and 0.186 for the 20%:40% censoring ratio, and between 0.000 and 0.264 for the 40%:20% censoring setting. For the logWeibull SSS, high proportions of datasets identify a large cluster that includes the true cluster. These proportions range from 0.000 to 1.000 for all three differential censoring situations. Overall, the maximum proportion of perfect identification is achieved for the datasets generated from the logWeibull distribution. The datasets from the exponential distribution have the highest proportions of large cluster identification including the true cluster among all five probability models. Some decreases are found in the power and the strength of identification of the true cluster for each model when comparing the 20%:20% case to the 20%:40% and 40%:20% censoring cases.
For the logWeibull SSS, the values of TC_{a} range from 0.060 to 0.448 for all three censoring situations. The TC_{c} values lie between 0.189 and 0.491, with very little variability among the five probability models used to generate the data.
For the Weibull SSS (Table 3, Figs. 5 and 6), the overall results for the power and for all the identification proportions are less variable than the results of the logWeibull SSS. The power values of detecting a potential cluster are between 0.256 and 0.971 for the 20%:20% censoring setting, between 0.230 and 0.999 for the 20%:40% censoring ratio, and between 0.355 and 0.981 for the 40%:20% case. The proportions of perfectly detecting the true cluster are high for all three censoring situations across all of the datasets as compared to the logWeibull SSS, and are lowest for the exponential model. The proportions of datasets generated under the five probability distributions that do not identify the true cluster are between 0.000 and 0.997. The power values increase as the difference between the means inside and outside the cluster increases, and similar effects are seen for the strength of detection of the true cluster.
For the Weibull SSS, the values of TC_{a} and TC_{c} range from 0.090 to 0.478 and from 0.226 to 0.489, respectively. This study shows that the Weibull SSS gives more consistent results for spatial cluster detection, in terms of power, proportions of cluster detection, and the global performance indicators, regardless of the probability model used for the data generation, whereas the performance of the logWeibull SSS is best when the datasets are generated from the logWeibull distribution.
Discussion
The spatial scan statistic (SSS) is a widely used statistical technique for the identification of the spatial clusters of different data types by using various probability distributions. In the context of time to event data, the SSS has the ability to detect geographical clusters of cases with either longer and/or shorter than expected event times. These clusters can be adjusted for censoring, if the appropriate probability model is used.
We have proposed the SSS for the logWeibull distribution as a new approach for detecting spatial clusters for time to event data. The logWeibull distribution has wide applications in extreme value theory for modeling extreme and rare events. The new logWeibull method and the Weibull SSS are applied to administrative data from Alberta Health consisting of time from ED discharge for an AFF presentation to first specialist visit within 365 days in Alberta during 2010–2011. Results from the SSS show that the primary cluster is detected in the Peace Country, Northern Lights, and Aspen Regional Health Authorities. The most likely cluster comprises rural areas in northern Alberta that are sparsely populated and are farther from major metropolitan centres. The results suggest that people living in these northern rural areas may not have regular or quick access to follow-up care from a specialist after an ED presentation. Our results are in agreement with the recognized issue of health care access for rural residents, and strategies such as mobile services, telehealth, and rotating specialists have been suggested and/or implemented [25]. While we recognize that the censoring might be quite early for patients with an ED visit late in the study period and that the methods may be affected by short follow-up, the effects would be across all areas of the province, and we feel that the results are likely linked to real clustering and are plausible given the recognized issue of health care access.
The simulation studies indicate that the power of detecting the potential cluster is higher for the 20%:20% censoring ratio as compared to the 20%:40% and 40%:20% settings. This comparison is also true in the context of identification of a true cluster. Whether the Weibull or the logWeibull distribution is used for the SSS, the effect of the right differential censoring on power and detection of the true cluster is similar. For both of the probability models used under the SSS’s, as the difference between the means of the time to event data inside and outside the true cluster increases, the power and proportion of detection of the true cluster also increase. It can be observed from the overall results of both SSS’s that the Weibull SSS has good power for detecting a potential cluster for the datasets distributed with any of the five probability models used in this study. However, overall the logWeibull SSS’s performance is satisfactory for the data distributed as the logWeibull. For the identification of the true cluster, the Weibull SSS shows less variability on the simulated datasets than the logWeibull SSS. The logWeibull SSS shows the most power to detect a true cluster for the datasets generated from the logWeibull distribution. When various differential censoring situations are considered, the global performance indicators for the logWeibull SSS do not vary widely. Conversely, when there was less censoring inside the cluster than outside the cluster, the logWeibull SSS had highly variable performance that depended on the underlying data distribution.
The results based on the global indicator for performance assessment also support the above conclusions, identifying that the Weibull SSS detects the true cluster with more power and location accuracy at the same time, whereas the logWeibull SSS shows high accuracy in detecting significant clusters for the datasets generated from the logWeibull probability distribution. It is also observed that the logWeibull SSS has a good ability to detect a broader cluster that includes the true cluster instead of identifying the exact true cluster. It is suggested that the logWeibull SSS can be used to detect a spatial cluster for time to event data distributed as logWeibull. Based on the simulation study results for both SSSs, the logWeibull SSS proved to be less effective than the Weibull SSS when the dataset is generated from the exponential distribution. When the underlying data distribution is not exponential, the logWeibull SSS has slightly lower performance than the Weibull SSS; however, the logWeibull SSS had similar performance across different underlying data distributions, especially when the censoring ratio is higher inside the true cluster than outside the true cluster.
There are many opportunities for future work. For example, the proposed methodology based on the SSS for the logWeibull distribution does not adjust for important factors such as age and gender. In the future, the analysis could adjust for such covariates when identifying potential clusters for time to event data. Furthermore, the newly developed method can only be applied in a purely spatial setting. The space–time scan statistic has been developed by other authors in both retrospective [15] and prospective [16] ways. In the future, the SSS based on the logWeibull distribution can be extended to the space–time setting, and similar simulation studies can be performed to investigate the power of detection of space–time clusters.
Conclusions
We have proposed a new SSS using the logWeibull distribution. The new method has been applied to specialist follow-up data in Alberta, and the Weibull and logWeibull SSS’s have been compared and contrasted on simulated time to event data. The simulation studies suggest that the SSS based on the logWeibull distribution performs well for logWeibull data. The logWeibull distribution, being a special case of the generalized extreme value distribution, has wide application in extreme value theory for modeling extreme and rare events.
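The link to the generalized extreme value (GEV) family can be written out as follows. This is a sketch under the assumption that the logWeibull distribution is taken in its Gumbel (type I extreme value) form; the symbols \mu, \sigma, \xi, k, \lambda, \alpha and \beta are introduced here for illustration and may differ from the parameterization used elsewhere in the article.

```latex
% GEV cumulative distribution function with location \mu, scale \sigma > 0, shape \xi:
G(x;\mu,\sigma,\xi) = \exp\left\{-\left[1 + \xi\,\frac{x-\mu}{\sigma}\right]^{-1/\xi}\right\},
\qquad 1 + \xi\,\frac{x-\mu}{\sigma} > 0 .

% Letting \xi \to 0 gives the Gumbel (maximum) distribution:
\lim_{\xi \to 0} G(x;\mu,\sigma,\xi) = \exp\left\{-\exp\left(-\frac{x-\mu}{\sigma}\right)\right\}.

% If T is Weibull with shape k and scale \lambda, so that
% P(T \le t) = 1 - \exp\{-(t/\lambda)^{k}\}, then Y = \log T satisfies
P(Y \le y) = 1 - \exp\left\{-\exp\left(\frac{y-\alpha}{\beta}\right)\right\},
\qquad \alpha = \log\lambda,\quad \beta = \frac{1}{k},

% which is the Gumbel (minimum) form; equivalently, -Y follows the Gumbel
% (maximum) distribution above with \mu = -\alpha and \sigma = \beta.
```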
Abbreviations
AFF: atrial fibrillation and flutter
CARD: cardiology
CI: confidence interval
ED: emergency department
INMD: internal medicine
LLR: log likelihood ratio
MLE’s: maximum likelihood estimators
P: p value
PDF: probability density function
RHAs: Regional Health Authorities
SSS: spatial scan statistics
sRHAs: subRegional Health Authorities
PI: perfect identification
LC: large cluster identification
NI: no identification
TP: true positive
FP: false positive
FN: false negative
TN: true negative
TC_{a}: average Tanimoto coefficient
TC_{c}: cumulated Tanimoto coefficient
References
Stevenson JR, Emrich CT, Mitchell JT, Cutter SL. Using building permits to monitor disaster recovery: a spatiotemporal case study of coastal Mississippi following Hurricane Katrina. Cartogr Geogr Inf Sci. 2010;37(1):57–68.
Coulston JW, Riitters KH. Geographic analysis of forest health indicators using spatial scan statistics. Environ Manag. 2003;31:764–73.
Marcos RDLF, Marcos CDLF. From star complexes to the field: open cluster families. Astrophys J. 2008;672:342–51.
Usher BM, Allen KL. Identifying kinship clusters: SatScan for genetic spatial analysis. Am J Phys Anthropol. 2005;126(Suppl 40):210–1.
Margai F, Henry N. A community-based assessment of learning disabilities using environmental and contextual risk factors. Soc Sci Med. 2003;56:1073–85.
Kulldorff M, Nagarwalla N. Spatial disease clusters: detection and inference. Stat Med. 1995;14:799–810.
Costa MA, Assunção RM. A fair comparison between the spatial scan and the Besag–Newell disease clustering tests. Environ Ecol Stat. 2005;12:301–19.
Kulldorff M. A spatial scan statistic. Commun Stat-Theory Methods. 1997;26:1481–96.
Huang L, Kulldorff M, Gregorio D. A spatial scan statistic for survival data. Biometrics. 2007;63:109–18.
Bhatt V, Tiwari N. A spatial scan statistic for the survival data based on Weibull distribution. Stat Med. 2013;33:1867–76.
Jung I, Kulldorff M, Klassen A. A spatial scan statistic for ordinal data. Stat Med. 2007;26:1594–607.
Jung I, Kulldorff M, Richard OJ. A spatial scan statistic for multinomial data. Stat Med. 2010;29:1910–8.
Kulldorff M, Huang L, Konty K. A scan statistic for continuous data based on the normal probability model. Int J Health Geogr. 2009;8:58.
Rosychuk RJ, Chang HM. A spatial scan statistic for compound Poisson data. Stat Med. 2013;32:5106–18.
Kulldorff M, Athas W, Feuer E, Miller B, Key C. Evaluating cluster alarms: a space–time scan statistic and brain cancer in Los Alamos. Am J Public Health. 1998;88:1377–80.
Kulldorff M. Prospective time periodic geographical disease surveillance using a scan statistic. J R Stat Soc. 2001;A164:61–72.
Reliability HotWire: The e-magazine for the reliability professional. 2005. http://www.weibull.com/hotwire/issue56/relbasics56.htm. Accessed 16 Sept 2015.
Knijnenburg TA, Wessels LFA, Reinders MJT, Shmulevich I. Fewer permutations, more accurate p values. Bioinformatics. 2009;25:i161–8.
Ellehoj E, Schopflocher D. Calculating small areas analysis: Definition of subregional geographic units in Alberta. Edmonton: Alberta Health and Wellness; 2003.
R Core Team. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria, 2014. http://www.R-project.org/.
TIBCO Software Inc. S-PLUS 8 Version 8.1.1. 2008.
Guttmann A, Li X, Feschet F, Gaudart J, Demongeot J, Boire J, Ouchchane L. Cluster detection tests in spatial epidemiology: a global indicator for performance assessment. PLoS ONE. 2015;10(6):e0130594.
Tanimoto TT. IBM internal report. IBM: Technical Report; 1957.
Rogers DJ, Tanimoto TT. A computer program for classifying plants. Science. 1960;132:1115–8.
Rural health services: Review final report. Understanding the concerns and challenges of Albertans who live in rural and remote communities. 2015. https://open.alberta.ca/dataset/18615231d9c247c783d206f24c099742/resource/df60d2407b024f428e626364b2ad4ba4/download/2015RuralHealthServicesReview.pdf March 2015.
Authors’ contributions
Both authors contributed to the conception and design of the study and to the analysis and interpretation of data. RR obtained the funding and directed the study. IU wrote the first draft and RR revised it critically for important intellectual content. Both authors read and approved the final manuscript.
Acknowledgements
The authors thank Alberta Health for providing the data. Disclaimer: This study is based in part on data provided by Alberta Health. The interpretation and conclusions contained herein are those of the researchers and do not necessarily represent the views of the Government of Alberta. Neither the Government nor Alberta Health express any opinion in relation to this study.
Competing interests
The authors declare that they have no competing interests.
Availability of data and materials
The data are the property of Alberta Health, and the authors are not permitted to provide them. Researchers who meet the criteria for access to confidential data can request the same data from Alberta Health. Researchers are welcome to inquire for further information at health.resdata@gov.ab.ca or visit http://www.health.alberta.ca/initiatives/healthresearch.html.
Consent for publication
Not applicable.
Ethics approval and consent to participate
The University of Alberta health research ethics board approved this study. Individual consent was not required.
Funding
This study is funded by a Discovery Grant held by Professor Rosychuk from the Natural Sciences and Engineering Research Council of Canada (NSERC; Ottawa, Canada). The sponsor had no role in the study design, the analysis and interpretation of data, the writing of the report, or the decision to submit the article for publication.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Usman, I., Rosychuk, R.J. A logWeibull spatial scan statistic for time to event data. Int J Health Geogr 17, 20 (2018). https://doi.org/10.1186/s12942-018-0137-9
DOI: https://doi.org/10.1186/s12942-018-0137-9