A flexibly shaped space-time scan statistic for disease outbreak detection and monitoring

Background Early detection of disease outbreaks enables public health officials to implement disease control and prevention measures at the earliest possible time. A time periodic geographical disease surveillance system based on a cylindrical space-time scan statistic has been used extensively for disease surveillance along with the SaTScan software. In the purely spatial setting, many different methods have been proposed to detect spatial disease clusters. In particular, some spatial scan statistics are aimed at detecting irregularly shaped clusters which may not be detected by the circular spatial scan statistic. Results Based on the flexible purely spatial scan statistic, we propose a flexibly shaped space-time scan statistic for early detection of disease outbreaks. The performance of the proposed space-time scan statistic is compared with that of the cylindrical scan statistic using benchmark data. In order to compare their performances, we have developed a space-time power distribution by extending the purely spatial bivariate power distribution. Daily syndromic surveillance data in Massachusetts, USA, are used to illustrate the proposed test statistic. Conclusion The flexible space-time scan statistic is well suited for detecting and monitoring disease outbreaks in irregularly shaped areas.


Background
The anthrax terrorist attacks in 2001, the severe acute respiratory syndrome (SARS) outbreak in 2002, and a concern about pandemic influenza have motivated many public health departments to develop early disease outbreak detection systems. Early detection of disease outbreaks enables public health officials to implement disease control and prevention measures at the earliest possible time. For an infectious disease, improvement in detection time by even one day might enable public health officials to control the disease before it becomes widespread. In many cities such as New York City [1], Washington, D.C. [2], Boston [3,4], Denver, and Minneapolis, real-time, geographic, early outbreak detection systems have been implemented. For a well-defined geographical area, standard disease surveillance uses purely temporal methods that seek anomalies in time series data without using spatial information [5]. The increased need for geographical cluster detection has coincided with an increasing availability of spatial data [6]. Investigators ask whether a geographical cluster is unlikely to have arisen by chance given random variations from the background incidence, accounting for the multiple comparisons inherent in the many possible cluster locations and sizes evaluated. Scan statistics are tools to answer such questions [7,8]. Increasingly, there is interest in the prospective surveillance of new data as they become available, in order to detect a localized disease outbreak as early as possible. Particularly in light of the perceived threat of bioterrorism and newly emerging infectious diseases, there has been a spate of recent interest in the development of geographic surveillance systems that can detect changes in spatial patterns of disease [9]. Recently, a time periodic geographical disease surveillance system based on a cylindrical space-time scan statistic was proposed by Kulldorff and colleagues [10,11].
Several different approaches to the statistical assessment of potential geographic clustering in either point- or area-based disease data have been developed [12,13]. Almost all of these purely spatial approaches are retrospective, in the sense that they describe statistical tests that are designed to be carried out once, on a set of data that has been collected from the recent past [9]. In particular, the circular spatial scan statistic [8] has been used extensively for the detection and evaluation of purely spatial disease clusters along with the SaTScan software [14]. For example, as part of their cancer surveillance initiative, the New York State Department of Health used the spatial scan statistic to look at the geographical variation of breast, lung, prostate, and colorectal cancer incidence in New York State, finding various statistically significant clusters but no local hotspots with greatly elevated risk [15]. However, as the statistic uses a circular scanning window of variable size to define the potential cluster area, it is difficult to correctly detect some non-circular clusters such as those along a river [16]. Recently, spatial scan statistics for irregularly shaped clusters have been proposed, using the same likelihood ratio test formulation as before. The spatial scan statistics proposed by Duczmal and Assunção [17], Patil and Taillie [18], Tango and Takahashi [16], Assunção et al. [19] and Kulldorff et al. [20] are aimed at detecting irregularly shaped clusters which may not be detected by the circular spatial scan statistic. Due to the unlimited geometric freedom of cluster shapes, some of these statistics run the risk of detecting quite large and very peculiarly shaped clusters. The flexible spatial scan statistic [16], which has been used along with the FleXScan software [21], has a parameter K as the pre-set maximum length of neighbors to be scanned, to avoid detecting clusters with a very peculiar shape.
In this paper, we propose a flexibly shaped space-time scan statistic ("flexible space-time scan statistic" hereafter) for the early detection of disease outbreaks. It is based on the flexible purely spatial scan statistic [16] and the prospective space-time scan statistic [10]. The performance of our proposed space-time scan statistic is compared with that of the cylindrical scan statistic, using the benchmark data provided by Kulldorff et al. [22]. In order to evaluate its performance we propose a space-time power distribution by extending the purely spatial bivariate power distribution [16]. Daily syndromic surveillance data in Massachusetts, USA, are used to illustrate the proposed method with real data.

The flexible space-time scan statistic
Consider the situation where an entire study area is divided into m regions (for example, counties, ZIP codes, enumeration districts, etcetera), and each region periodically reports the number of cases of a disease or syndrome under study. We assume that, under the null hypothesis of no clustering, the number of cases $N_{id}$ is a Poisson random variable with observed value $n_{id}$ and expected value $\mu_{id}$ in each region $i$ ($i = 1, \ldots, m$) at time $d$, where $\mu_{id}$ is proportional to the region's population size, or to a covariate-adjusted population at risk. Since we are only interested in detecting clusters that are alive (active) at the current time $t_P$, we only consider 'alive' clusters that are present in the following T time intervals, each ending at the current time: $[t_P - u + 1,\ t_P]$ for $u = 1, \ldots, T$, where T is a pre-specified maximum temporal length of the cluster.
A time periodic geographical disease surveillance system based on a cylindrical space-time scan statistic has already been proposed by Kulldorff [10]. The cylindrical space-time scan statistic uses a cylindrical window in three dimensions, where the base of the cylinder represents space and the height represents time. As with the purely spatial scan statistic, the cylindrical space-time scan statistic imposes a circular base Z on each region centroid for each of the T time intervals. For each centroid, the radius of the circle is varied from zero up to a pre-set maximum radius, for example, so that the window never includes more than 50% of the total population at risk [8]. In this paper, we use a pre-set maximum number of regions K to be included in the cluster as an upper bound on the radius. If the base contains the centroid of a region, then that whole region is included in the base. In total, a very large number of different but overlapping circular bases are created, each with a different set of neighboring regions and each being a possible candidate area containing a disease outbreak. Let $Z_{ik}$, $k = 1, \ldots, K$, denote the base composed of region $i$ and the $(k-1)$-nearest neighbors to $i$. Then, all the cylindrical windows to be scanned by the cylindrical scan statistic are the cylinders with base in the set $\{Z_{ik} \mid 1 \le i \le m,\ 1 \le k \le K\}$ and height in the set of T time intervals ending at $t_P$.

In contrast, the flexible space-time scan statistic which we propose in this paper imposes a three-dimensional prismatic window with an arbitrarily shaped base Z. For any given region $i$, we create the set of arbitrarily shaped bases consisting of $k$ connected regions ($1 \le k \le K$) including $i$. To avoid detecting a cluster of unlikely peculiar shape, the connected regions are restricted to subsets of the K-nearest neighbors to the region $i$, where K = 1 implies the region $i$ itself. Let $Z_{ik(j)}$, $j = 1, \ldots, j_{ik}$, denote the $j$th window which is a set of $k$ regions connected starting from the region $i$, where $j_{ik}$ is the number of $j$ satisfying $Z_{ik(j)} \subseteq Z_{iK}$ for $k = 1, \ldots, K$.
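Then, all the windows to be scanned are the prisms whose base is included in the set $\{Z_{ik(j)} \mid 1 \le i \le m,\ 1 \le k \le K,\ 1 \le j \le j_{ik}\}$ and whose height is one of the T time intervals ending at $t_P$. In other words, for any given region $i$, the cylindrical scan statistic considers K concentric circles for the base, whereas the flexible scan statistic considers K concentric circles plus all the sets of connected regions including the single region $i$, whose centroids are located within the K-th largest concentric circle.

For both statistics, the test statistic is the maximum log likelihood ratio (LLR) over all scanned windows, and the window attaining the maximum is the most likely cluster (MLC). Its significance is assessed by Monte Carlo hypothesis testing with B replicates generated under the null hypothesis, giving the p-value $p = \left(1 + \sum_{v=1}^{B} I(\mathrm{LLR}_v \ge \mathrm{LLR}^*)\right) / (B + 1)$, where $\mathrm{LLR}_v$ and $\mathrm{LLR}^*$ are the values of the test statistic for the $v$-th Monte Carlo replicate and for the observed data, respectively, and $I(\cdot)$ is the indicator function.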
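To make the difference between the two families of windows concrete, the following is a minimal Python sketch that enumerates the circular bases and the flexible (connected) bases for one region, together with the alive time intervals. It is a brute-force illustration under stated assumptions, not the algorithm used by FleXScan or SaTScan: the function names, the `centroids` and `adjacency` dictionaries, and the tie-breaking by region name are all hypothetical.

```python
from itertools import combinations


def k_nearest(i, centroids, K):
    """Region i followed by its K-1 nearest neighbours by centroid distance (the set Z_iK)."""
    xi, yi = centroids[i]
    others = [r for r in centroids if r != i]
    # Ties in distance are broken by region name (an arbitrary choice made for this sketch).
    others.sort(key=lambda r: ((centroids[r][0] - xi) ** 2 + (centroids[r][1] - yi) ** 2, r))
    return [i] + others[:K - 1]


def is_connected(regions, adjacency):
    """True if the given regions form one connected piece of the adjacency graph."""
    regions = set(regions)
    start = next(iter(regions))
    seen, stack = {start}, [start]
    while stack:
        r = stack.pop()
        for nb in adjacency.get(r, ()):
            if nb in regions and nb not in seen:
                seen.add(nb)
                stack.append(nb)
    return seen == regions


def cylindrical_bases(i, centroids, K):
    """The K concentric circular bases Z_i1, ..., Z_iK centred on region i."""
    nearest = k_nearest(i, centroids, K)
    return [frozenset(nearest[:k]) for k in range(1, K + 1)]


def flexible_bases(i, centroids, adjacency, K):
    """All connected sets of 1..K regions that contain i and lie inside Z_iK."""
    neighbours = k_nearest(i, centroids, K)[1:]
    bases = []
    for k in range(1, K + 1):
        for others in combinations(neighbours, k - 1):
            base = (i,) + others
            if is_connected(base, adjacency):
                bases.append(frozenset(base))
    return bases


def alive_intervals(t_p, T):
    """Time intervals of length 1..T that all end at the current time t_p."""
    return [(t_p - u + 1, t_p) for u in range(1, T + 1)]


# Toy map: four regions in a row, a - b - c - d.
centroids = {"a": (0, 0), "b": (1, 0), "c": (2, 0), "d": (3, 0)}
adjacency = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
print(cylindrical_bases("b", centroids, 3))
print(flexible_bases("b", centroids, adjacency, 3))
print(alive_intervals(30, 3))
```

In this toy map the flexible scan considers the base {b, c} for region b, which no circle centred on b produces under the tie-breaking used here. The exhaustive enumeration grows quickly with K, which is in line with the computational limitation discussed in the conclusion.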

Syndromic surveillance in Massachusetts
We applied the prospective flexible space-time scan statistic to daily syndromic surveillance data in eastern Massachusetts, mimicking a real time surveillance system. The data came from an electronic medical record system used by Harvard Vanguard Medical Associates [3,24]. We used the rash and respiratory data during August 1-30, 2005. The data are geographically aggregated to ZIP codes. The number of ZIP codes used was different for each syndrome: for example, cases of rash were analyzed in 252 ZIP codes and respiratory cases in 385. Note that for the flexible space-time scan statistic, a ZIP code with no data was treated like a ravine. For example, assume that ZIP codes $i_1$ and $i_2$ are adjacent to each other, and $i_2$ and $i_3$ are adjacent to each other, but $i_1$ and $i_3$ are not adjacent. If the data of $i_2$ do not exist in this situation, then $i_1$ and $i_3$ are assumed not to be connected via $i_2$.
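The "ravine" treatment of ZIP codes without data can be illustrated with a small adjacency graph. The sketch below is only one plausible way to implement the rule described above; the helper names and the dictionary representation of adjacency are assumptions made for illustration.

```python
def drop_missing(adjacency, has_data):
    """Remove regions without data; every edge touching them disappears too."""
    return {
        r: [nb for nb in nbs if nb in has_data]
        for r, nbs in adjacency.items()
        if r in has_data
    }


def reachable(a, b, adjacency):
    """True if b can be reached from a through a chain of adjacent regions."""
    seen, stack = {a}, [a]
    while stack:
        r = stack.pop()
        if r == b:
            return True
        for nb in adjacency.get(r, ()):
            if nb not in seen:
                seen.add(nb)
                stack.append(nb)
    return False


# i1 - i2 - i3: i1 and i3 touch only through i2.
adjacency = {"i1": ["i2"], "i2": ["i1", "i3"], "i3": ["i2"]}
print(reachable("i1", "i3", adjacency))                            # linked via i2
print(reachable("i1", "i3", drop_missing(adjacency, {"i1", "i3"})))  # i2 acts as a ravine
```

With this rule, a flexible window can never bridge a ZIP code that has no data, so i1 and i3 cannot appear in the same connected base unless some other path joins them.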
Based on the prior daily data for over a year in MA, the expected numbers of cases were calculated as the predicted means from a generalized linear mixed model (GLMM) as developed by Kleinman et al., adjusted for seasonal effect, day of week, etc. These are the same expectations used in the actual real time surveillance system [25]. We set K = 20 as the maximum length of the geographical window, and the maximum temporal length to be T = 7 days. The number of replications for the Monte Carlo procedure was set to B = 999. In disease outbreak detection, the recurrence interval (RI) is often used as an alternative to the p-value [14]. The measure reflects how often a cluster will be observed by chance, assuming that analyses are repeated on a regular basis with a periodicity equal to the period of the study; it is the reciprocal of the p-value, expressed in units of that period. For daily surveillance such as this analysis, a p-value of 0.001 corresponds to an RI of 1,000 days, i.e., 2.7 years, and an alpha level of 0.0027 corresponds to one expected false alarm every year.
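The relation between the Monte Carlo p-value and the recurrence interval can be checked with a few lines of Python. This sketch assumes the usual rank-based Monte Carlo p-value, (1 + number of replicates at least as large as the observed statistic) / (B + 1), and takes the RI as the analysis period divided by the p-value; the function names are hypothetical.

```python
def monte_carlo_p_value(llr_observed, llr_replicates):
    """Rank-based Monte Carlo p-value from B replicates simulated under the null."""
    B = len(llr_replicates)
    exceed = sum(1 for v in llr_replicates if v >= llr_observed)
    return (1 + exceed) / (B + 1)


def recurrence_interval_days(p_value, period_days=1.0):
    """Expected time between false alarms of this strength, in days."""
    return period_days / p_value


# With B = 999 replicates, the smallest attainable p-value is 1/1000.
replicates = [0.01 * v for v in range(999)]   # made-up null LLR values, all below 10
p = monte_carlo_p_value(10.0, replicates)
print(p, recurrence_interval_days(p) / 365.25)
print(recurrence_interval_days(0.0027) / 365.25)
```

With B = 999, the strongest possible signal therefore has an RI of 1,000 days for daily analyses, which is why 2.7 years is the longest RI that can be reported under this setting.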
The results of analysis during August 1-30 by the flexible and the cylindrical space-time scan statistics are given in Tables 1, 2 and Figure 1. The tables show results for the days with p < 0.0054, which corresponds to the RI of at least 6 months. When looking at rash outbreaks (Table 1), both tests detected the same cluster with a single ZIP code 01951 on August 7, with the same temporal length (6 days) and the same RI (2.7 years). Note that the clusters detected by both tests from August 8 to 10 are not signals of an outbreak because the number of cases on August 8 must be 0, and on August 9 and 10, the number of cases of the cluster was decreasing. For respiratory syndrome (  Table 2 because of shorter RIs), the cylindrical scan statistic kept detecting the same cluster, while the flexible scan statistic detected a similar but slightly different cluster each day. However, we should acknowledge the similar lack of evidence in Table 2 for a continued outbreak on August 13 to 14, because the number of additional cases on those days is very close to the expected number of additional cases. On the other hand, there is some evidence for an excess of cases on August 15 (23 additional cases), although the estimated relative risk is substantially reduced.

Statistical power, sensitivity and positive predictive value
In this section, we compare the flexible and cylindrical space-time scan statistics, using benchmark data from 176 New York City ZIP codes [14,22]. This benchmark data has been described in detail elsewhere [22], and here we only give a brief overview. Based on 2002 numbers, the total population is 8,003,510. The benchmark data sets contain a number of randomly located cases of a hypothetical disease or syndrome, generated either under the null model with no outbreaks or under one of eight different alternative models with an outbreak in one of four different locations and with either a high or modest excess risk. For each of the null and alternative models, three different sets of data sets were generated, with 31, 32, and 33 days, respectively. For each of the null models, 9,999 random data sets were generated. For each of the alternative models, 1,000 random data sets were generated.
For each data set, the total number of randomly allocated cases was 100 times the number of days (i.e., 3,100 cases  as the maximum temporal length of the cluster. We did not use the options to include purely temporal clusters (see details in [14]).

Standard statistical power
First of all, we estimated the standard statistical power, which is the probability that the null hypothesis is rejected at the α = 0.05 significance level, without considering the overlap between the detected and real clusters. The random data sets generated under the null model were used to get the critical values of the scan statistics. For α = 0.05, the critical value is defined as the 500th highest log likelihood ratio when ranking those values from all the 9,999 simulated data sets. The estimated power was then calculated as the proportion of the 1,000 random data sets that had a higher log likelihood ratio than the critical value obtained from the null data sets. The results are shown in Table 3. In general, the cylindrical space-time scan statistic has higher power for the three more compact clusters, while the flexible space-time scan statistic has higher power for the long and narrow Hudson River cluster. On Day 33 of the high excess risk outbreaks, both methods have very high power.
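The critical value and power estimate described above amount to a ranking and a proportion. A minimal Python sketch follows, assuming the maximum log likelihood ratios of the null and alternative data sets are already available as plain lists; the function names and the synthetic values in the example are illustrative only.

```python
def critical_value(null_llrs, alpha=0.05):
    """The round(alpha * (n + 1))-th highest null LLR; the 500th of 9,999 for alpha = 0.05."""
    ranked = sorted(null_llrs, reverse=True)
    rank = round(alpha * (len(ranked) + 1))
    return ranked[rank - 1]


def estimated_power(alt_llrs, crit):
    """Proportion of alternative data sets whose LLR is strictly above the critical value."""
    return sum(1 for v in alt_llrs if v > crit) / len(alt_llrs)


# Synthetic check: null LLRs 1, 2, ..., 9999, so the 500th highest value is 9500.
null_llrs = list(range(1, 10000))
crit = critical_value(null_llrs)
print(crit)
print(estimated_power([9400, 9500, 9600, 9700], crit))
```

Note that this power counts every rejection, regardless of whether the detected cluster overlaps the true one.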

Space-time power distribution
In order to compare the performance of the cluster detection tests, the standard power has been derived in the same manner as for usual hypothesis tests. However, it should be noted that standard statistical power reflects the 'power to reject the null hypothesis for whatever reasons,' while the probability of both rejecting the null hypothesis and accurately identifying the true cluster is a different matter altogether.
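In order to compare the performance of purely spatial cluster detection tests, Tango and Takahashi [16] proposed a spatial bivariate power distribution $P_0(l, s \mid s^*)$ based on Monte Carlo simulation, where $l$ is the length of the significant MLC, while $s$ is the number of regions identified out of the true cluster with $s^*$ regions. We extend this to a space-time (tri-variate) power distribution $P(l, s, t \mid s^*, t^*)$, which additionally records the temporal length $t$ of the detected cluster for a true cluster of temporal length $t^*$, where $U$ denotes the random variable of $t$ and $1 \le t \le T$.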
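As an illustration of how such a power distribution can be tabulated from simulation output, the sketch below counts, over all simulated data sets, how often a significant MLC has length l, contains s regions of the true cluster, and has temporal length t. The input format (one tuple per simulated data set) and the function name are assumptions made for this example; dividing by the total number of data sets makes the cells sum to the standard power.

```python
from collections import Counter


def power_distribution(results, true_regions):
    """Tabulate P(l, s, t) from simulation results.

    results: one (significant, detected_regions, temporal_length) tuple per data set.
    Only significant MLCs contribute; each cell is divided by the number of data sets.
    """
    true_regions = set(true_regions)
    counts = Counter()
    for significant, detected, t in results:
        if significant:
            l = len(set(detected))                  # length of the detected cluster
            s = len(set(detected) & true_regions)   # true regions it contains
            counts[(l, s, t)] += 1
    n = len(results)
    return {cell: c / n for cell, c in counts.items()}


# Four made-up simulation results for a true cluster {a, b}.
results = [
    (True, {"a", "b"}, 3),        # exact spatial detection
    (True, {"a", "b", "c"}, 3),   # one false positive region
    (True, {"a", "b"}, 3),
    (False, {"d"}, 1),            # not significant, ignored
]
dist = power_distribution(results, {"a", "b"})
print(dist)
print(sum(dist.values()))         # equals the standard power, 3/4
```

Summing the cells over t recovers the purely spatial bivariate distribution over (l, s).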
The complexity of the three-dimensional tri-variate power distributions suggests that we need some summary measure. Since the temporal accuracy is very similar, we focus on the geographical accuracy. We will compute the extended power of spatial cluster detection tests, as developed by Takahashi and Tango [26]. We will also define and compute geographical sensitivity and false positive rates.

The extended power
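We can consider two types of spatial misclassification when applying a cluster detection test (CDT). One is a false negative test result (FN), in which a region inside the true cluster is not detected; the other is a false positive test result (FP), in which a region outside the true cluster is detected. The extended power is based on the bivariate distribution $P_0(l, s \mid s^*)$ and penalties introduced for the FPs and FNs of the geographical detection as

$$I(w_-, w_+) = \sum_{l \ge 1} \sum_{s \ge 0} W(l, s; w_-, w_+)\, P_0(l, s \mid s^*),$$

where $W(l, s; w_-, w_+)$ is a weight function such that

$$W(l, s; w_-, w_+) = \sqrt{\left(1 - \min\{w_-(s^* - s),\, 1\}\right)\left(1 - \min\{w_+(l - s),\, 1\}\right)},$$

and $w_-$ and $w_+$ are the predefined penalties for the FNs and FPs (per region), respectively. This power includes the following three special powers: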

1. The standard power as I(0, 0).
2. The power to detect the geographical true cluster accurately as I(1, 1).
3. The power for which the MLC includes all the regions within the true cluster as I(1, 0).

[Table caption: power distribution P(l, s, t | s*, t*) for the Rockaways (s* = 5) on Day 33 (t* = 3) with high risk (RR = 8.48), where t is the temporal length of the detected cluster; rows in which all cells have zero power for both tests are not shown.]
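The extended power and its three special cases can be checked numerically. The Python sketch below assumes the weight is the geometric mean of (1 - penalty for false negatives) and (1 - penalty for false positives), with each penalty capped at 1, as the paper describes in its concluding remarks; the exact capped form, the function names and the toy distribution are assumptions for illustration. It also includes the profile Q(r | s*) = I(1/s*, r/s*) that is introduced in the next paragraph.

```python
from math import sqrt


def weight(l, s, s_star, w_minus, w_plus):
    """Geometric mean of (1 - FN penalty) and (1 - FP penalty), each penalty capped at 1."""
    fn_penalty = min(w_minus * (s_star - s), 1.0)   # s_star - s true regions were missed
    fp_penalty = min(w_plus * (l - s), 1.0)         # l - s detected regions are false
    return sqrt((1.0 - fn_penalty) * (1.0 - fp_penalty))


def extended_power(P, s_star, w_minus, w_plus):
    """Weighted sum of a bivariate power distribution P[(l, s)]."""
    return sum(weight(l, s, s_star, w_minus, w_plus) * p for (l, s), p in P.items())


def profile(P, s_star, r):
    """Profile Q(r | s*) = I(1/s*, r/s*) for 0 <= r <= 1."""
    return extended_power(P, s_star, 1.0 / s_star, r / s_star)


# Made-up bivariate power distribution for a true cluster with s* = 3 regions.
P = {(3, 3): 0.5, (4, 3): 0.2, (2, 2): 0.1, (5, 1): 0.1}
print(extended_power(P, 3, 0, 0))   # standard power: all cells count fully
print(extended_power(P, 3, 1, 1))   # only exact detection, l = s = s*
print(extended_power(P, 3, 1, 0))   # MLC contains the whole true cluster, s = s*
print(profile(P, 3, 0.0), profile(P, 3, 1.0))
```

In this toy distribution the three special powers are 0.9, 0.5 and 0.7, and the profile decreases from r = 0 to r = 1 as false positives are penalised more heavily.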
Takahashi and Tango [26] also proposed the profile of the extended power as $Q(r \mid s^*) = I(1/s^*, r/s^*)$, $(0 \le r \le 1)$, where $r = w_+/w_-$ with $w_- = 1/s^*$, because it is difficult to set the values of $w_-$ and $w_+$ in advance. Figure 3 shows the plots of the profile $Q(r \mid s^*)$ against $r$ $(0 \le r \le 1)$ for the flexible and cylindrical scan statistics applied to (a) the cluster A5 and (b) the Rockaways, both on Day 33 with high risk, based upon Tables 5 and 6.

Based upon the population, we can define the sensitivity $TP_2$ as the percentage of the population of the true cluster that lies in the detected cluster, and the positive predictive value (PPV) $PP_2$ as the percentage of the population of the detected cluster that lies in the true cluster. All these summary measures are better the larger they are, with 100 being the optimal. Table 7 shows the sensitivity and PPV of the flexible and cylindrical space-time scan statistics for each cluster with a high relative risk. For cluster A, the cylindrical scan statistic has higher PPV and higher sensitivity than the flexible one. For cluster A5, the cylindrical scan statistic has higher PPV on all days and higher sensitivity on day 31, but the flexible scan statistic has higher sensitivity on days 32 and 33. The same is true for the Rockaways cluster. For the Hudson River cluster, the flexible scan statistic has higher PPV than the cylindrical one. It also has higher sensitivity than the cylindrical scan statistic with the same upper limit K = 20 on the number of regions in the detected cluster, but lower sensitivity than the cylindrical scan statistic with a 50% upper limit on the cluster size. Note, though, that this difference in sensitivity is smaller than the difference in PPV that goes the other way.
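Population-based sensitivity and PPV for a single detected cluster can be computed as below. This is a sketch under the usual definitions (population shared by the detected and true clusters, divided by the population of the true cluster or of the detected cluster, times 100); the function name and the toy populations are hypothetical, and in a simulation study the values would be averaged over the data sets in which a cluster is detected.

```python
def sensitivity_ppv(detected, true_cluster, population):
    """Population-based sensitivity and PPV, both as percentages."""
    detected, true_cluster = set(detected), set(true_cluster)
    overlap = sum(population[r] for r in detected & true_cluster)
    sensitivity = 100.0 * overlap / sum(population[r] for r in true_cluster)
    ppv = 100.0 * overlap / sum(population[r] for r in detected) if detected else 0.0
    return sensitivity, ppv


# Toy example: the true cluster is {a, b}; the detected cluster is {b, c}.
population = {"a": 1000, "b": 3000, "c": 2000, "d": 4000}
print(sensitivity_ppv({"b", "c"}, {"a", "b"}, population))   # (75.0, 60.0)
```

A large detected cluster that swallows the true one scores 100 on sensitivity but low on PPV, which is why the two measures are reported together.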

Conclusion
In this paper, we have proposed a flexible space-time scan statistic to detect arbitrarily shaped disease outbreaks. We have also presented a tri-variate power distribution which is useful for evaluating the performance of cluster detection tests, informing us about the spatial and temporal accuracy of the detected clusters. For the benchmark data evaluated in this paper, the cylindrical scan statistic performs better for the small single ZIP code cluster, although by the third day of the outbreak both methods are almost perfect. For the small irregularly shaped clusters, A5 and Rockaways, the cylindrical scan statistic performs better on the first day of the outbreak, but as more data accumulate, the flexible scan statistic has certain advantages in determining the precise size and shape of the outbreak. For the large and narrow Hudson River cluster, the flexible scan statistic performs better than the cylindrical one, with slightly higher standard power, much higher PPV and slightly higher or lower sensitivity depending on the type of cylindrical method used. Results may be different for other types of regularly and irregularly shaped disease outbreaks, but the four examples used in this paper give some sense of the proposed method's performance.
For early detection, timeliness is much more important than geographical accuracy. When monitoring an ongoing outbreak, on the other hand, geographical accuracy becomes critical and is then the key objective, since we already know the outbreak is there. Our results suggest that we may use both the cylindrical and flexible scan statistics for disease outbreak detection, but for different purposes. Specifically, for detecting a new outbreak, one may want to use the cylindrical scan statistic. That is especially so if we expect the outbreak to start locally, within a reasonably small and compact area containing only a few ZIP codes. On the other hand, once the outbreak has spread to a larger area, and we want to monitor that spread, one may want to use the flexible scan statistic, with its ability to accurately determine the precise geographical extent of irregularly shaped outbreaks. This is especially true once the outbreak has left its local area of origin.
To evaluate the performance of the space-time scan statistics, we applied the extended power for purely spatial cluster detection tests (8), which is defined as the weighted sum of the bivariate power distribution wherein the weight is given by the geometric mean of (1 - penalty for the false negatives) and (1 - penalty for the false positives), including the standard power as a special case. We also applied the profile $Q(r \mid s^*)$ proposed by Takahashi and Tango [26]. This plot gave us a detailed description of the power of the cluster detection tests. It would be possible to extend it to a space-time version by also considering penalties for temporal false negatives and false positives, but we leave this problem for future work. Also, for the profile of the extended power, we chose to use a fixed cost of $w_- = 1/s^*$ for false negatives and a smaller or equal cost for false positives. For more general situations, we could plot the full bivariate extended power function on the unit square.
Similarly to the flexible spatial scan statistic in the purely spatial situation, the flexible space-time scan statistic proposed in this paper has a limitation on cluster size, because of the limitation of the speed of computation. The proposed scan statistic works well for small to moderate sized clusters. Although we set the maximum length of the geographical window to K = 20, this is not large enough to detect the 20 ZIP codes of the Hudson River cluster accurately, because this cluster is too long to be a subset of the 20 nearest neighbors of any region. Computation time depends on the size of the data set and K. Indeed, for the August 11 analysis of respiratory syndrome data in Massachusetts, with 385 ZIP codes, a maximum temporal length of T = 7 days, a maximum spatial size of K = 20, and ... A limitation of length may also prevent the analysis from presenting large clusters of unlikely and very peculiar shapes. These undesirable properties produced by the maximum likelihood ratio might suggest the use of a different criterion for model selection, including some penalized likelihood [20,29]. Also, for larger cluster sizes, the method is not practically feasible and a more efficient algorithm is needed.
In this paper, we considered the right cylinder or right prism as the cluster model, as an expansion of the cylindrical space-time scan statistic for prospective disease surveillance by Kulldorff [10]. This does not allow the scanning window to adjust itself as the disease outbreak grows or shrinks geographically over time. Recently, Iyengar has suggested using a square pyramid shaped window which can model growth (or shrinkage) and movement of the disease cluster [30]. For the proposed flexible space-time scan statistic, if we could consider flexibility in both space and time, that is, evaluating all connected subsets within a cylinder instead of in (4), we could detect more arbitrarily shaped clusters in space-time. For such an expansion, an efficient computational algorithm will be needed for the scanning process, as well as a more sophisticated mechanism for the interpretation of such complicatedly shaped clusters. The implementation and importance of such methods for disease surveillance and monitoring is an issue for future research.