A flexibly shaped space-time scan statistic for disease outbreak detection and monitoring
- Kunihiko Takahashi^{1}Email author,
- Martin Kulldorff^{2},
- Toshiro Tango^{1} and
- Katherine Yih^{2}
https://doi.org/10.1186/1476-072X-7-14
© Takahashi et al; licensee BioMed Central Ltd. 2008
Received: 28 November 2007
Accepted: 11 April 2008
Published: 11 April 2008
Abstract
Background
Early detection of disease outbreaks enables public health officials to implement disease control and prevention measures at the earliest possible time. A time periodic geographical disease surveillance system based on a cylindrical space-time scan statistic has been used extensively for disease surveillance along with the SaTScan software. In the purely spatial setting, many different methods have been proposed to detect spatial disease clusters. In particular, some spatial scan statistics are aimed at detecting irregularly shaped clusters which may not be detected by the circular spatial scan statistic.
Results
Based on the flexible purely spatial scan statistic, we propose a flexibly shaped space-time scan statistic for early detection of disease outbreaks. The performance of the proposed space-time scan statistic is compared with that of the cylindrical scan statistic using benchmark data. In order to compare their performances, we have developed a space-time power distribution by extending the purely spatial bivariate power distribution. Daily syndromic surveillance data in Massachusetts, USA, are used to illustrate the proposed test statistic.
Conclusion
The flexible space-time scan statistic is well suited for detecting and monitoring disease outbreaks in irregularly shaped areas.
Keywords
Background
The anthrax terrorist attacks in 2001, the severe acute respiratory syndrome (SARS) outbreak in 2002, and a concern about pandemic influenza have motivated many public health departments to develop early disease outbreak detection systems. Early detection of disease outbreaks enables public health officials to implement disease control and prevention measures at the earliest possible time. For an infectious disease, improvement in detection time by even one day might enable public health officials to control the disease before it becomes widespread. In many cities such as New York City [1], Washington, D.C. [2], Boston [3, 4], Denver, and Minneapolis, real-time, geographic, early outbreak detection system have been implemented. For a well-defined geographical area, standard disease surveillance uses purely temporal methods that seek anomalies in time series data without using spatial information [5]. The increased need for geographical cluster detection has coincided with an increasing availability of spatial data [6]. Investigators ask whether the geographical cluster is unlikely to have arisen by chance given random variations from the background incidence, according for the multiple comparisons inherent in the many possible cluster locations and size evaluated. Scan statistics are tools to answer such questions [7, 8]. Increasingly, there is interest in the prospective surveillance of new data as it becomes available in order to detect a localized disease outbreak as early as possible. Particularly in light of the perceived threat of bioterrorism and newly emerging infectious diseases, there has been a spate of recent interest in the development of geographic surveillance systems that can detect changes in spatial patterns of disease [9]. Recently, a time periodic geographical disease surveillance system based on a cylindrical space-time scan statistic was proposed by Kulldorff and colleagues [10, 11].
Several different approaches to the statistical assessment of potential geographic clustering in either point-or area-based disease data have been developed [12, 13]. Almost all of these purely spatial approaches are retrospective, in the sense that they describe statistical tests that are designed to be carried out once, on a set of data that has been collected from the recent past [9]. In particular, the circular spatial scan statistic [8] has been used extensively for the detections and evaluation of purely spatial disease clusters along with the SaTScan software [14]. For example, as part of their cancer surveillance initiative, the New York State Department of Health used the spatial scan statistic to look at the geographical variation of breast, lung, prostate, and colorectal cancer incidence in New York State, finding various statistically significant clusters but no local hotspots with greatly elevated risk [15]. However, as the statistic uses a circular scanning window with variable size to define the potential cluster area, it is difficult to correctly detect some non-circular clusters such as those along a river [16]. Recently, spatial scan statistics for irregular shaped clusters have been proposed, using the same likelihood ratio test formulation as before. The spatial scan statistics proposed by Duczmal and Assunção [17], Patil and Taillie [18], Tango and Takahashi [16], Assunção et al. [19] and Kulldorff et al. [20] are aimed at detecting irregularly shaped clusters which may not be detected by the circular spatial scan statistic. Due to the unlimited geometric freedom of cluster shapes, some of these statistics run the risk of detecting quite large and very peculiarly shaped clusters. The flexible spatial scan statistic [16], which has been used along with the FleXScan software [21], has a parameter K as the pre-set maximum length of neighbors to be scanned, to avoid detecting clusters with a very peculiar shape.
In this paper, we propose a flexibly shaped space-time scan statistic ("flexible space-time scan statistic" hereafter) for the early detection of disease outbreaks. It is based on the flexible purely spatial scan statistic [16] and the prospective space-time scan statistic [10]. The performance of our proposed space-time scan statistic is compared with that of the cylindrical scan statistic, using the benchmark data provided by Kulldorff et al. [22]. In order to evaluate its performance we propose a space-time power distribution by extending the purely spatial bivariate power distribution [16]. Daily syndromic surveillance data in Massachusetts, USA, are used to illustrate the proposed method with real data.
The flexible space-time scan statistic
Consider the situation where an entire study area is divided into m regions (for example, counties, ZIP codes, enumeration districts, etcetera), and each region is periodically reporting the number of cases of a disease or syndrome under study. We assume that, under the null hypothesis of no clustering, the number of cases N_{ id }is a Poisson random variable with the observed value n_{ id }and the expected values μ_{ id }in each region i(i = 1,...,m) at time d, where μ_{ id }is proportional to its population size, or a covariate-adjusted population at risk. Since we are only interested in detecting clusters that are alive (active) at the current time t_{ P }, we only consider 'alive' clusters that are present in the following T time intervals:
[t_{ P }- T + 1, t_{ P }], [t_{ P }- T + 2, t_{ P }],..., [t_{ P }- 1,t_{ P }], [t_{ P }, t_{ P }]
where T is a pre-specified maximum temporal length of the cluster.
with height in the set $\mathcal{Y}$. In other words, for any given region i, the cylindrical scan statistic consider K concentric circles for the base, whereas the flexible scan statistic consider K concentric circles plus all the sets of connected regions including the single region i, whose centroids are located within the K-th largest concentric circle.
where LLR_{ v }and LLR* is the value of the test statistic for the v-th Monte Carlo replicate and that for the observed data, respectively, and I(·) is the indicator function.
Syndromic surveillance in Massachusetts
We applied the prospective flexible space-time scan statistic to daily syndromic surveillance data in eastern Massachusetts mimicking a real time surveillance system. The data came from an electronic medical record system used by Harvard Vanguard Medical Associates [3, 24]. We used the rash and respiratory data during August 1–30, 2005. The data are geographically aggregated to ZIP codes. The number of ZIP codes used were different for each syndrome, for example cases of the rash were analyzed in 252 ZIP codes and respiratory in 385. Note that for the flexible space-time scan statistic, the ZIP code whose data does not exist, was treated like a ravine. For example, assume that ZIP codes i_{1} and i_{2}, i_{2} and i_{3} are adjacent each other, respectively, but i_{1} and i_{3} are not adjacent. If the data of i_{2} does not exist under the situation, then it is assumed that i_{1} and i_{3} are not directly connected.
Based on the prior daily data for over a year in MA, the expected number of cases were calculated as the predicted means from a generalized linear mixed model (GLMM) as developed by Kleinman et al, adjusted for seasonal effect, day of week, etc, these are the same expectations used in the actual real time surveillance system [25]. We set K = 20 as the maximum length of the geographical window, and the maximum temporal length to be T = 7 days. The number of replications for the Monte Carlo procedure was set to B = 999. In disease outbreak detection, the recurrence interval (RI) is often used as an alternative to the p-value [14]. The measure reflects how often a cluster will be observed by chance, assuming that analyzes are repeated on a regular basis with a periodicity equal to the period of the study. For daily surveillance such as this analysis, the p-value of 0.001 corresponds to the RI of 1,000 days, i.e., 2.7 years, and an alpha level of 0.0027 corresponds to one expected false alarm every year.
Detected outbreaks of Rash based on daily syndromic surveillance data in eastern Massachusetts during August 1–30, 2005.
Day | zip codes | cluster period | cases | expected | llr | R.I.(p-value) |
---|---|---|---|---|---|---|
Rash: | ||||||
- flexible | ||||||
Aug.07 | 01951 | Aug.02–07 | 7 | 0.0427 | 27.949 | 2.7 years(0.001) |
Aug.08 | 01951 | Aug.02–08 | 7 | 0.0545 | 26.259 | 2.7 years(0.001) |
Aug.09 | 01951 | Aug.03–09 | 6 | 0.0545 | 21.562 | 2.7 years(0.001) |
Aug.10 | 01951 | Aug.04–10 | 5 | 0.0545 | 17.315 | 2.7 years(0.001) |
- cylindrical | ||||||
Aug.07 | 01951 | Aug.02–07 | 7 | 0.0427 | 27.949 | 2.7 years(0.001) |
Aug.08 | 01951 | Aug.02–08 | 7 | 0.0545 | 26.259 | 2.7 years(0.001) |
Aug.09 | 01951 | Aug.03–09 | 6 | 0.0545 | 21.562 | 2.7 years(0.001) |
Aug.10 | 01951 | Aug.04–10 | 5 | 0.0545 | 17.315 | 2.7 years(0.001) |
Detected outbreaks of Respiratory based on daily syndromic surveillance data in eastern Massachusetts during August 1–30, 2005.
Day | zip codes | cluster period | cases | expected | llr | R.I.(p-value) |
---|---|---|---|---|---|---|
Respiratory: | ||||||
- flexible | ||||||
Aug.12 | 01720, 01742, 01752, 01754, 01772, 01775, 01776, 01778, 02451, 02462, 02481, 02493 | Aug.11–12 | 42 | 12.452 | 17.635 | 2.7 years (0.001) |
Aug.13 | 01720, 01742, 01749, 01752, 01754, 01772, 01775, 01776, 01778, 02451, 02462, 02481, 02493 | Aug.11–13 | 46 | 14.950 | 16.634 | 333 days (0.003) |
Aug.14 | 01720, 01742, 01749, 01752, 01754, 01772, 01775, 01776, 01778, 02451, 02462, 02481, 02493 | Aug.11–14 | 49 | 16.957 | 15.927 | 250 days (0.004) |
Aug.15 | 01702, 01720, 01742, 01749, 01752, 01754, 01772, 01775, 01776, 01778, 02481, 02493 | Aug.10–15 | 72 | 29.975 | 16.726 | 1.4 years (0.002) |
- cylindrical | ||||||
Aug.12 | 01701, 01702, 01718, 01719, 01720, 01742, 01749, 01752, 01754, 01772, 01773, 01775, 01776, 01778, 02451, 02453, 02481, 02493 | Aug.11–12 | 51 | 20.036 | 12.688 | 2.7 years (0.001) |
Aug.13 | 01701, 01702, 01718, 01719, 01720, 01742, 01749, 01752, 01754, 01772, 01773, 01775, 01776, 01778, 02451, 02453, 02481, 02493 | Aug.11–13 | 55 | 23.768 | 10.945 | 91 days (0.011) |
Aug.14 | 01701, 01702, 01718, 01719, 01720, 01742, 01749, 01752, 01754, 01772, 01773, 01775, 01776, 01778, 02451, 02453, 02481, 02493 | Aug.11–14 | 59 | 26.959 | 10.221 | 30 days (0.033) |
Aug.15 | 01701, 01702, 01718, 01719, 01720, 01742, 01749, 01752, 01754, 01772, 01773, 01775, 01776, 01778, 02451, 02453, 02481, 02493 | Aug.11–15 | 82 | 40.981 | 11.662 | 200 days (0.005) |
Statistical power, sensitivity and positive predictive value
In this section, we compare the flexible and cylindrical space-time scan statistics, using benchmark data from 176 New York City ZIP codes ([14, 22]). This benchmark data has been described in detail elsewhere [22], and here we only give a brief overview. Based on 2002 numbers, the total population is 8,003,510. The benchmark data sets contain a number of randomly located of cases of a hypothetical disease or syndrome, generated either under the null model with no outbreaks or under one of eight different alternative models with an outbreak in one of four different locations and with either a high or modest excess risk. For each of the null and alternative models, three different sets of data sets were generated, with 31, 32, and 33 days, respectively. For each of the null models, 9,999 random data sets were generated. For each of the alternative models, 1,000 random data sets were generated.
For each data set, the total number of randomly allocated cases was 100 times the number of days (i.e., 3,100 cases in the data sets containing 31 days). The number 100 was chosen to reflect the occurrence rate of certain syndromes common to the NYC emergency department(ED)-based syndromic surveillance system. Under the null model, each person living in NYC is equally likely to contract the disease, and the time of each case is assigned with equal probability to any given day. Thus, each case was randomly assigned to ZIP code i and day d with probability proportional to μ_{ id }= pop_{ i }, where pop_{ i }is the population of ZIP code i. For the alternative models, one or more ZIP codes were assigned an increased risk on Day 31 and, when applicable, on Days 32 and 33 as well. For these ZIP code and day combinations, μ_{ id }was multiplied by an assigned relative risk. For all other ZIP code and day combinations, μ_{ id }did not change. Each case was then randomly assigned with probability proportional to the new set of μ_{ id }to generate data under the alternative models.
1. Cluster A: a single ZIP code area in Brooklyn (circular area)
s* = 1, pop* = 85, 089, RR: high = 9.91, medium = 5.66
2. Cluster A5: the same ZIP code with 4 neighboring ZIP codes (non-circular area)
s* = 5, pop* = 318, 754, RR: high = 4.47, medium = 3.06
3. The Rockaways, 5 ZIP codes area (non-circular area)
s* = 5, pop* = 106, 738, RR: high = 8.48, medium = 5.01
4. Hudson River: 20 ZIP codes areas along the shore of the Hudson River (non-circular area)
s* = 20, pop* = 827, 382, RR: high = 2.97, medium = 2.24
A maximum length of the geographic window K = 20 was used for the flexible scan statistic, while the cylindrical scan statistic used a maximum of either K = 20 or a 50 % of the population at risk. A period of T = 3 days was used as the maximum temporal length of the cluster. We did not use the options to include purely temporal clusters (see details in [14]).
Standard statistical power
Standard power of the prospective space-time scan statistics – flexible and cylindrical – at different days of the outbreak
Power on Day 31 | Power on Day 32 | Power on Day 33 | |||||||||
---|---|---|---|---|---|---|---|---|---|---|---|
Outbreak areas | No. of zip codes s* | excess risk | flex. K = 20 | cylind. K = 20 | cylind. 50% pop | flex. K = 20 | cylind. K = 20 | cylind. 50% pop | flex. K = 20 | cylind. K = 20 | cylind. 50% pop |
Cluster A | 1 | high | 0.764 | 0.860 | 0.862 | 0.988 | 0.996 | 0.996 | 0.999 | 0.999 | 0.999 |
Cluster A5 | 5 | high | 0.797 | 0.850 | 0.847 | 0.994 | 0.996 | 0.996 | 1.000 | 1.000 | 1.000 |
The Rockaways | 5 | high | 0.769 | 0.855 | 0.840 | 0.992 | 0.997 | 0.997 | 1.000 | 1.000 | 1.000 |
Hudson River | 20 | high | 0.656 | 0.597 | 0.632 | 0.964 | 0.933 | 0.949 | 0.998 | 0.994 | 0.995 |
Cluster A | 1 | med. | 0.272 | 0.357 | 0.357 | 0.651 | 0.733 | 0.737 | 0.844 | 0.915 | 0.916 |
Cluster A5 | 5 | med. | 0.382 | 0.435 | 0.428 | 0.752 | 0.801 | 0.795 | 0.914 | 0.940 | 0.941 |
The Rockaways | 5 | med. | 0.261 | 0.373 | 0.348 | 0.648 | 0.768 | 0.759 | 0.848 | 0.924 | 0.917 |
Hudson River | 20 | med. | 0.290 | 0.257 | 0.297 | 0.631 | 0.582 | 0.610 | 0.845 | 0.782 | 0.803 |
Space-time power distribution
In order to compare the performance of the cluster detection tests, the standard power has been derived in the same manner as for usual hypothesis tests. However, it should be noted that standard statistical power reflect the 'power to reject the null hypothesis for whatever reasons,' while the probability of both rejecting the null hypothesis and accurately identifying the true cluster is a different matter altogether.
where U denotes the random variable of t and 1 ≤ t ≤ T.
Space-time power distribution P_{1}(l, s, t | s*, t*) for the Cluster A (s* = 1) on Day 31 (t* = 1) with high risk (RR= 9. 91), where t is a temporal length of detected cluster. The mark "*" is the powers of accurate detection.
(A) flexible (K = 20) | |||||||
---|---|---|---|---|---|---|---|
includes s assumed areas | |||||||
0 | 1 | ||||||
29- | 30- | 31- | 29- | 30- | 31- | ||
length l of areas | t = 3 | 2 | 1 | t = 3 | 2 | 1 | total |
1 | 0 | 0 | 0 | 0 | 6 | *315 | 321 |
2 | 0 | 0 | 0 | 0 | 1 | 50 | 51 |
3 | 0 | 0 | 0 | 1 | 2 | 34 | 37 |
4 | 0 | 0 | 0 | 1 | 6 | 34 | 41 |
5 | 0 | 0 | 0 | 2 | 5 | 48 | 55 |
6 | 1 | 0 | 0 | 2 | 2 | 51 | 56 |
7 | 1 | 0 | 0 | 4 | 13 | 35 | 53 |
8 | 0 | 1 | 0 | 1 | 12 | 28 | 42 |
9 | 1 | 1 | 2 | 5 | 6 | 28 | 43 |
10 | 0 | 0 | 0 | 2 | 7 | 22 | 31 |
11 | 1 | 0 | 0 | 1 | 4 | 8 | 14 |
12 | 1 | 1 | 2 | 1 | 0 | 10 | 15 |
13 | 0 | 0 | 0 | 1 | 1 | 2 | 4 |
14 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
15 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
16 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
17 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
18 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
19 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
20 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
total | 5 | 3 | 4 | 21 | 65 | 666 | 764 |
(B) cylindrical (K = 20) | |||||||
includes s assumed areas | |||||||
length l of areas | 0 | 1 | |||||
29- | 30- | 31- | 29- | 30- | 31- | ||
t = 3 | 2 | 1 | t = 3 | 2 | 1 | total | |
1 | 0 | 0 | 0 | 5 | 18 | *697 | 720 |
2 | 0 | 0 | 0 | 5 | 4 | 63 | 72 |
3 | 0 | 0 | 0 | 1 | 2 | 18 | 21 |
4 | 0 | 0 | 0 | 0 | 4 | 10 | 14 |
5 | 0 | 0 | 0 | 0 | 0 | 2 | 2 |
6 | 1 | 0 | 0 | 1 | 0 | 3 | 5 |
7 | 1 | 0 | 0 | 1 | 1 | 1 | 4 |
8 | 2 | 0 | 0 | 0 | 1 | 1 | 4 |
9 | 0 | 1 | 1 | 0 | 0 | 2 | 4 |
10 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
11 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
12 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
13 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
14 | 0 | 0 | 0 | 0 | 0 | 2 | 2 |
15 | 0 | 0 | 0 | 0 | 1 | 3 | 4 |
16 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
17 | 0 | 0 | 0 | 2 | 0 | 1 | 3 |
18 | 1 | 0 | 0 | 0 | 0 | 1 | 2 |
19 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
20 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
total | 5 | 1 | 1 | 15 | 31 | 807 | 860 |
Space-time power distribution P_{1}(l, s, t | s*, t*) for the Cluster A5 (s* = 5) on Day 33 (t* = 3) with high risk (RR = 4. 47), where t is a temporal length of detected cluster, and the raw all cells of which have zero powers of both tests is not shown. The mark "*" is the powers of accurate detection.
(A) flexible (K = 20) | |||||||
---|---|---|---|---|---|---|---|
includes s assumed areas | |||||||
3 | 4 | 5 | |||||
31- | 32- | 31- | 32- | 31- | 32- | ||
length l of areas | t = 3 | 2 | t = 3 | 2 | t = 3 | 2 | total |
1 | 0 | ||||||
2 | 0 | ||||||
3 | 12 | 0 | 12 | ||||
4 | 2 | 0 | 74 | 2 | 78 | ||
5 | 0 | 0 | 37 | 2 | *287 | 3 | 329 |
6 | 1 | 0 | 26 | 1 | 158 | 2 | 188 |
7 | 0 | 0 | 16 | 0 | 118 | 2 | 136 |
8 | 0 | 0 | 5 | 0 | 105 | 2 | 112 |
9 | 0 | 0 | 4 | 0 | 67 | 2 | 73 |
10 | 0 | 0 | 6 | 0 | 39 | 1 | 46 |
11 | 0 | 0 | 2 | 0 | 11 | 0 | 13 |
12 | 0 | 0 | 0 | 0 | 10 | 0 | 10 |
13 | 0 | 0 | 1 | 0 | 1 | 0 | 2 |
14 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
15 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
16 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
17 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
18 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
19 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
20 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
total | 15 | 0 | 171 | 5 | 797 | 12 | 1000 |
(B) cylindrical (K = 20) | |||||||
includes s assumed areas | |||||||
length l of areas | 3 | 4 | 5 | ||||
31- | 32- | 31- | 32- | 31- | 32- | ||
t = 3 | 2 | t = 3 | 2 | t = 3 | 2 | total | |
1 | 0 | ||||||
2 | 0 | ||||||
3 | 37 | 1 | 38 | ||||
4 | 2 | 0 | 301 | 7 | 310 | ||
5 | 2 | 0 | 32 | 0 | *0 | 0 | 34 |
6 | 0 | 0 | 5 | 0 | 516 | 10 | 521 |
7 | 0 | 0 | 0 | 0 | 64 | 1 | 65 |
8 | 0 | 0 | 0 | 0 | 5 | 0 | 5 |
9 | 0 | 0 | 0 | 0 | 3 | 0 | 3 |
10 | 0 | 0 | 0 | 0 | 3 | 0 | 3 |
11 | 0 | 0 | 0 | 0 | 4 | 0 | 4 |
12 | 0 | 0 | 0 | 0 | 2 | 0 | 2 |
13 | 0 | 0 | 0 | 0 | 3 | 1 | 4 |
14 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
15 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
16 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
17 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
18 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
19 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
20 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
total | 41 | 1 | 338 | 7 | 601 | 12 | 1000 |
Space-time power distribution P_{1}(l, s, t | s*, t*) for the Rockaways (s* = 5) on Day 33 (t* = 3) with high risk (RR = 8. 48), where t is a temporal length of detected cluster, and the raw all cells of which have zero powers of both tests is not shown. The mark "*" is the powers of accurate detection.
(A) flexible (K = 20) | |||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|
includes s assumed areas | |||||||||||
1 | 2 | 3 | 4 | 5 | |||||||
31- | 31- | 31- | 32- | 31- | 32- | 33 | 31- | 32- | 33 | ||
length l of areas | t = 3 | t = 3 | t = 3 | 2 | t = 3 | 2 | 1 | t = 3 | 2 | 1 | total |
1 | 0 | 0 | |||||||||
2 | 0 | 0 | 6 | ||||||||
3 | 0 | 2 | 6 | 0 | 22 | ||||||
4 | 0 | 0 | 20 | 0 | 181 | 1 | 0 | 204 | |||
5 | 0 | 0 | 22 | 0 | 50 | 0 | 0 | *571 | 2 | 0 | 626 |
6 | 0 | 0 | 3 | 0 | 26 | 1 | 0 | 23 | 1 | 0 | 52 |
7 | 0 | 0 | 1 | 0 | 9 | 0 | 0 | 54 | 1 | 1 | 66 |
8 | 0 | 0 | 1 | 0 | 2 | 0 | 0 | 11 | 0 | 0 | 14 |
9 | 0 | 0 | 1 | 0 | 2 | 0 | 0 | 6 | 0 | 0 | 8 |
10 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
11 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
12 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
13 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
14 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
15 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
16 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
17 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
18 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
19 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
20 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
total | 0 | 8 | 48 | 0 | 270 | 2 | 0 | 667 | 4 | 1 | 1000 |
(B) cylindrical (K = 20) | |||||||||||
includes s assumed areas | |||||||||||
length l of areas | 1 | 2 | 3 | 4 | 5 | ||||||
31- | 31- | 31- | 32- | 31- | 32- | 33 | 31- | 32- | 33 | ||
t = 3 | t = 3 | t = 3 | 2 | t = 3 | 2 | 1 | t = 3 | 2 | 1 | total | |
1 | 2 | 2 | |||||||||
2 | 0 | 8 | 8 | ||||||||
3 | 0 | 0 | 52 | 1 | 53 | ||||||
4 | 0 | 0 | 0 | 0 | 876 | 6 | 1 | 883 | |||
5 | 0 | 0 | 1 | 0 | 3 | 0 | 0 | *0 | 0 | 0 | 4 |
6 | 0 | 0 | 0 | 0 | 32 | 0 | 0 | 0 | 0 | 0 | 32 |
7 | 0 | 0 | 0 | 0 | 14 | 1 | 0 | 1 | 0 | 0 | 16 |
8 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 2 |
9 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
10 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
11 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
12 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
13 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
14 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
15 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
16 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
17 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
18 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
19 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
20 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
total | 2 | 8 | 53 | 1 | 927 | 7 | 1 | 1 | 0 | 0 | 1000 |
This tri-variate power distribution provides us with a detailed description of the space-time cluster detection tests performance. For the outbreak in cluster A with a single ZIP code, the cylindrical scan statistic has higher power to detect the cluster with complete accuracy, with P_{1}(l = 1, s = 1, t = 1 | s*, t*) = 697/1000, compared to 315/1000 for the flexible. Moreover, the flexible scan statistic has a heavier tail in the (s, t) = (1, 3) column than the cylindrical one. However the cylindrical scan detected some large clusters including several with l ≥ 15. For outbreaks in the non-circular shaped A5 and Rockaway clusters, the flexible scan statistic has higher power for complete accurate detection. Indeed, the cylindrical scan statistic cannot detect these clusters with complete accuracy since they are not circular, so that the power for complete accuracy is zero. Moreover, note that for cluster A5, the flexible scan statistic is more likely to include all the five areas in the true cluster (797 + 12 = 809/1000 versus 601 + 12 = 613/1000), and it is also more likely to avoid including any of the ZIP codes outside the true cluster (12 + 74 + 2 + 287 + 3 = 378/1000 versus 37 + 1 + 301 + 7 = 346/1000). For the Rockaway cluster, the flexible scan statistic is again more likely to include all the five areas in the true cluster (667 + 4 + 1 = 672 versus 1 + 0 + 1 = 1), but the cylindrical scan statistic avoids the ZIP codes outside the cluster more often (2 + 8 + 52 + 1 + 876 + 6 + 1 + 0 + 0 + 0 = 946/1000 versus 0 + 0 + 6 + 0 + 181 + 1 + 0 + 571 + 2 + 0 = 761/1000). Tables 5 and 6 show that the temporal accuracy of the detected cluster is very good for both methods. For example, for cluster A5, the flexible scan has P_{1}(+, +, 3 | s*, t*) = ∑_{ l }∑_{ s }P_{1}(l, s, 3 | s*, t*) = (15 + 171 + 797)/1000 = 0.983 while the cylindrical scan has P_{1}(+, +, 3 | s*, t*) = (41 + 338 + 601)/1000 = 0.980.
The complexity of the three-dimensional tri-variate power distributions suggests that we need some summary measure. Since the temporal accuracy is very similar, we focus on the geographical accuracy. We will compute the extended power of spatial cluster detection tests, as developed by Takahashi and Tango [26]. We will also define and compute geographical sensitivity and false positive rates.
The extended power
We can consider two types of spatial misclassifications when applying the cluster detection test (CDT). One is a false negative test result (FN) in which the CDT misses a region included in the true cluster. Sensitivity is 1 - FN rate. The other is a false positive test result (FP) in which the CDT incorrectly detects a region that is not present in the true cluster. The numbers of FNs and FPs for geographical detection are s* - s and l - s, respectively.
and w^{-} and w^{+} are the predefined penalties for the FNs and FPs (per region), respectively. This power includes the following three special powers:
1. The standard power as I(0, 0).
2. The power to detect the geographical true cluster accurately as I(1, 1).
3. The power for which the MLC includes all the regions within the true cluster as I(1, 0).
Takahashi and Tango [26] also proposed the profile of the extended power as
Q(r | s*) = I(1/s*, r/s*), (0 ≤ r ≤ 1)
Sensitivity and positive predictive value
All these summary measures are better the larger they are with 100 being the optimal.
Sensitivity and positive predictive value (PPV) of the flexible and cylindrical space-time scan statistics.
zip codes | population | |||||
---|---|---|---|---|---|---|
traditional power | sensitivity (%) | PPV (%) | sensitivity (%) | PPV (%) | ||
Cluster A; s* = 1; high risk | ||||||
Day 31 | flexible (K = 20) | |||||
cylindrical (K = 20) | 0.860 | 85.30 | 89.45 | 85.30 | 91.50 | |
cylindrical (50% pop) | 0.862 | 85.90 | 88.79 | 85.90 | 90.84 | |
Day 32 | flexible (K = 20) | 0.988 | 98.80 | 84.50 | 98.80 | 87.80 |
cylindrical (K = 20) | 0.996 | 99.50 | 97.44 | 99.50 | 98.18 | |
cylindrical (50% pop) | 0.996 | 99.50 | 97.33 | 99.50 | 98.08 | |
Day 33 | flexible (K = 20) | 0.999 | 99.90 | 96.27 | 99.90 | 97.32 |
cylindrical (K = 20) | 0.999 | 99.90 | 99.48 | 99.90 | 99.65 | |
cylindrical (50% pop) | 0.999 | 99.90 | 99.48 | 99.90 | 99.65 | |
Cluster A5; s* = 5; high risk | ||||||
Day 31 | flexible (K = 20) | 0.797 | 66.08 | 55.93 | 67.29 | 63.00 |
cylindrical (K = 20) | 0.850 | 69.62 | 80.35 | 71.65 | 84.21 | |
cylindrical (50% pop) | 0.847 | 70.62 | 78.17 | 71.62 | 82.02 | |
Day 32 | flexible (K = 20) | 0.994 | 92.22 | 70.17 | 92.94 | 76.73 |
cylindrical (K = 20) | 0.996 | 88.78 | 85.14 | 90.86 | 89.41 | |
cylindrical (50% pop) | 0.996 | 88.96 | 84.81 | 91.05 | 89.11 | |
Day 33 | flexible (K = 20) | 1.000 | 95.88 | 80.02 | 96.64 | 85.25 |
cylindrical (K = 20) | 1.000 | 91.42 | 87.32 | 93.66 | 91.67 | |
cylindrical (50% pop) | 1.000 | 91.42 | 87.30 | 93.66 | 91.65 | |
The Rockaways; s* = 5; high risk | ||||||
Day 31 | flexible (K = 20) | 0.769 | 60.68 | 72.09 | 69.04 | 73.58 |
cylindrical (K = 20) | 0.855 | 62.32 | 91.63 | 74.45 | 91.76 | |
cylindrical (50% pop) | 0.840 | 61.40 | 91.04 | 73.65 | 91.15 | |
Day 32 | flexible (K = 20) | 0.992 | 86.76 | 87.36 | 94.17 | 89.86 |
cylindrical (K = 20) | 0.997 | 77.00 | 96.84 | 92.75 | 97.46 | |
cylindrical (50% pop) | 0.997 | 77.00 | 96.84 | 92.75 | 97.46 | |
Day 33 | flexible (K = 20) | 1.000 | 92.16 | 93.81 | 97.15 | 95.97 |
cylindrical (K = 20) | 1.000 | 78.50 | 98.06 | 94.51 | 98.59 | |
cylindrical (50% pop) | 1.000 | 78.50 | 98.06 | 94.51 | 98.59 | |
Hudson River, s* = 20; high risk | ||||||
Day 31 | flexible (K = 20) | 0.656 | 20.07 | 64.99 | 26.00 | 69.72 |
cylindrical (K = 20) | 0.597 | 14.23 | 61.10 | 18.16 | 65.18 | |
cylindrical (50% pop) | 0.632 | 26.15 | 50.70 | 31.26 | 53.71 | |
Day 32 | flexible (K = 20) | 0.964 | 32.17 | 73.59 | 41.81 | 78.36 |
cylindrical (K = 20) | 0.933 | 24.13 | 61.55 | 31.58 | 66.69 | |
cylindrical (50% pop) | 0.949 | 42.90 | 50.27 | 51.50 | 53.96 | |
Day 33 | flexible (K = 20) | 0.998 | 34.91 | 79.39 | 46.27 | 84.17 |
cylindrical (K = 20) | 0.994 | 27.23 | 60.56 | 36.75 | 66.20 | |
cylindrical (50% pop) | 0.995 | 48.14 | 47.34 | 58.54 | 51.34 |
Conclusion
In this paper, we have proposed a flexible space-time scan statistic to detect arbitrarily shaped disease outbreaks. We have also presented a tri-variate power distribution which is useful for evaluating the performance of cluster detection tests, informing us about the spatial and temporal accuracy of the detected clusters in addition to the standard statistical power.
For the benchmark data evaluated in this paper, the cylindrical scan statistic performs better for the small single zip-code cluster, although by the third day of the outbreak both methods are almost perfect. For the small irregular shaped clusters, A5 and Rockaways, the cylindrical performs better on the first day of the outbreak, but as more data accumulates, the flexible scan statistic has certain advantages in determining the precise size and shape of the outbreak. For the large and narrow Hudson River cluster, the flexible scan statistic performs better than the cylindrical one, with slightly higher standard power, much higher PPV and slightly higher or lower sensitivity depending on the type of cylindrical method used. Results may be different for other types of regular and irregularly shaped disease outbreaks, but the four examples used in this paper gives some sense of the proposed methods performance.
For early detection, timeliness is much more important than geographical accuracy. When monitoring an occurring outbreak, on the other hand, geographical accuracy becomes critical and is then the key objective since we already know the outbreak is there. Our results suggest that we may use both the cylindrical and flexible scan statistic for disease outbreak detection, but for different purposes. Specifically, for detecting new outbreak that, one may want to use the cylindrical scan statistic. That is especially if we expect the outbreak to start locally, within a reasonably small and compact area containing only a few ZIP-codes. On the other hand, once the outbreak has spread to a larger area, and we want to monitor that spread, one may want to use the flexible scan statistic, with its ability to accuratly determine the precise geographical extent of irregular shaped outbreaks. This is especially true ones the outbreak has left its local area of origin.
To evaluate the performance of space-time scan statistic, we applied the extended power for purely spatial cluster detection test (8), which is defined as the weighted sum of the bivariate power distribution wherein the weight is given by the geometric mean of (1-penalty for the false negatives) and (1-penalty for the false positives), including the standard power as a special case. Also we applied the profile Q(r | s*) proposed by Takahashi and Tango [26]. This plot gave us a detailed description regarding power of cluster detection tests. Needless to say, it is possible to extend it to space-time version if we could consider the penalties for temporal false negatives and false positives, but we leave this problem for future work. Also, for the profile of the extended power, we chose to use a fixed cost of w^{-} = 1/s* for false negatives and a smaller or equal cost for false positives. For more general situations, we could plot the full bivariate extended power function on the unit square.
Similarly to the flexible spatial scan statistic in the purely spatial situation, the flexible space-time scan statistics proposed in this paper has a limitation of cluster size, because of the limitation of the speed of computation. The proposed scan statistic works well for small to moderate sized clusters. Although we set the maximum length of the geographical window to K = 20, this is not large enough to detect the 20 ZIP codes of the Hudson River cluster accurately because this cluster is too long to be the subset of the 20-th nearest neighbors of any region. Computation time depends on the size of the data set and K. Indeed, for the August 11 analysis of respiratory syndrome data in Massachusetts, with 385 ZIP codes, a maximum temporal length of T = 7 days, a maximum spatial size of K = 20, and with 999 Monte Carlo replications, the flexible space-time scan statistic took 87.7 minutes to run on a 3.06-GHz Pentium 4 computer, while the cylindrical space-time scan statistic took only 9.8 minutes.
A limitation of length may also prevent the analysis to present large clusters of unlikely and very peculiar shapes. These undesirable properties produced by maximum likelihood ratio might suggest the use of different criterion for model selection, including some penalized likelihood [20, 29]. Also, for larger cluster seizes, the method is not practically feasible and a more efficient algorithm is needed.
In this paper, we considered the right cylinder or right prism of the cluster model, as an expansion of the cylindrical space-time scan statistic for a prospective disease surveillance by Kulldorff [10]. This does not allow the scanning window to adjust itself as the disease outbreak grows or shrinks geographically over time. Recently, Iyengar has suggested using a square pyramid shape window which can model either growth (or shrinkage) and movement of the disease cluster [30]. For the proposed flexible space-time scan statistic, if we could consider the flexibility in both space and time, that is, evaluating all connected subsets within a cylinder instead of $\mathcal{W}$ in (4), we can detect more arbitrarily shaped clusters in space-time. For such an expansion, an efficient computational algorithm will be needed for the scanning process, as well as a more sophisticated mechanism for the interpretation of such complicatedly shaped clusters. The implementation and importance of such methods for disease surveillance and monitoring, is an issue for future research.
Declarations
Acknowledgements
The authors thank Allyson Abrams for comments concerning the syndromic surveillance data from Massachusetts, and Dr. Tetsuji Yokoyama for advice about C++ programming.
This research was partly founded by a Modeling Infectious Disease Agent Study (MIDAS) grant (No. U01GM076672) from the National Institute of General Medical Science, National Institutes of Health, USA, and a scientific grant (No. H16-Kenkou-039) from the Ministry of Health, Labour and Welfare, Japan.
Authors’ Affiliations
References
- Heffernan R, Mostashari F, Das D, Karpati A, Kulldorff M, Weiss D: Syndromic surveillance in public health practice, New York City. Emerging Infectious Diseases. 2004, 10: 858-864.PubMedView ArticleGoogle Scholar
- Lombardo J, Burkom H, Elbert E, Magruder S, Lewis SH, Loschen W, Sari J, Sniegoski C, Wojcik R, Pavlin J: A systems overview of the electronic surveillance system for the early notification of community-based epidemics (ESSENCE II). Journal of Urban Health. 2003, 80 (2 suppl.1): i32-i42.PubMedPubMed CentralGoogle Scholar
- Lazarus R, Kleinman K, Dashevsky I, Adams C, Kludt P, DeMaria A, Platt R: Use of automated ambulatory-care encounter records for detection of acute illness clusters, including potential bioterrorism events. Emerg Infect Dis. 2002, 8 (8): 753-760.PubMedPubMed CentralView ArticleGoogle Scholar
- Platt R, Bocchino C, Caldwell B, Harmon R, Kleinman K, Lazarus R, Nelson AF, Nordin JD, Ritzwoller P: Syndromic surveillance using minimum transfer of identifiable data: the example of the National Bioterrorism Syndromic Surveilance Demonstration Program. Journal of Urban Health. 2003, 80 (2 suppl.1): i25-i31.PubMedPubMed CentralGoogle Scholar
- Sonesson C, Bock D: A review and discussion of prospective statistical surveillance in public health. Journal of the Royal Statistical Society, Series A. 2003, 166: 5-21.View ArticleGoogle Scholar
- Lawson AB, Kleinman K, Eds: Spatial & Syndromic Surveillance for Public Health. 2005, Chichester: WileyGoogle Scholar
- Naus J, Wallenstein S: Temporal surveillance using scan statistics. Statistics in Medicine. 2006, 25: 311-324.PubMedView ArticleGoogle Scholar
- Kulldorff M: A spatial scan statistic. Communications in Statistics – Theory and Methods. 1997, 26: 1481-1496.View ArticleGoogle Scholar
- Rogerson PA, Yamada I: Monitoring change in spatial patterns of disease: comparing univariate and multivariate cumulative sum approaches. Statistics in Medicine. 2004, 23: 2195-2214.PubMedView ArticleGoogle Scholar
- Kulldorff M: Prospective time periodic geographical disease surveillance using a scan statistic. Journal of the Royal Statistical Society, Series A. 2001, 164: 61-72.View ArticleGoogle Scholar
- Kulldorff M, Heffernan R, Hartman J, Assunção R, Mostashari F: A space-time permutation scan statistic for disease outbreak detection. PLoS Medicine. 2005, 2 (3): e59-PubMedPubMed CentralView ArticleGoogle Scholar
- Lawson AB, Biggeri A, Böhning D, Lesaffre E, Viel JF, Bertollini R, Eds: Disease Mapping and Risk Assessment for Public Health. 1999, New York: WileyGoogle Scholar
- Lawson AB: Statistical Methods in Spatial Epidemiology. 2006, Chichester: Wiley, 2View ArticleGoogle Scholar
- Kulldorff M, Information Management Services, Inc: SaTScan version 7.0: software for the spatial and space-time scan statistics. 2007,http://www.satscan.org/Google Scholar
- Kulldorff M: Scan statistics for geographical disease surveillance: an overview. Spatial & Syndromic Surveillance for Public Health. Edited by: Lawson AB, Kleinman K. 2005, Chichester: Wiley, 115-131. 2View ArticleGoogle Scholar
- Tango T, Takahashi K: A flexibly shaped spatial scan statistic for detecting clusters. International Journal of Health Geographics. 2005, 4 (11):Google Scholar
- Duczmal L, Assunção R: A simulated annealing strategy for the detection of arbitrarily shaped spatial clusters. Computational Statistics & Data Analysis. 2004, 45: 269-286.View ArticleGoogle Scholar
- Patil GP, Taillie C: Upper level set scan statistic for detecting arbitrarily shaped hotspots. Environmental and Ecological Statistics. 2004, 11: 183-197.View ArticleGoogle Scholar
- Assunção R, Costa M, Tavares A, Ferreira S: Fast detection of arbitrarily shaped disease clusters. Statistics in Medicine. 2006, 25: 723-742.PubMedView ArticleGoogle Scholar
- Kulldorff M, Huang L, Pickle L, Duczmal L: An elliptic spatial scan statistic. Statistics in Medicine. 2006, 25: 3929-3943.PubMedView ArticleGoogle Scholar
- Takahashi K, Yokoyama T, Tango T: FleXScan version 2.0: Software for the Flexible Spatial Scan Statistic. Japan. 2007, http://www.niph.go.jp/soshiki/gijutsu/index_e.htmlGoogle Scholar
- Kulldorff M, Zhang Z, Hartman J, Heffernan R, Huang L, Mostashari F: Benchmark data and power calculations for evaluating disease outbreak detection methods. Morbidity and Mortality Weekly Report. 2004, 53 (Supplement 1): 144-151.PubMedGoogle Scholar
- Dwass M: Modified randomization tests for nonparametric hypotheses. The Annals of Mathematical Statistics. 1957, 28: 181-187.View ArticleGoogle Scholar
- Lazarus R, Kleinman K, Dashevsky I, DeMaria A, Platt R: Using automated medical records for rapid identification of illness syndromes (syndromic surveillance): the example of lower respiratory infection. BMC Public Health. 2001, 1: 9-PubMedPubMed CentralView ArticleGoogle Scholar
- Kleinman K, Lazarus R, Platt R: A generalized linear mixed models approach for detecting incident clusters of disease in small areas, with an application to biological terrorism. American Journal of Epidemiology. 2004, 159: 217-224.PubMedView ArticleGoogle Scholar
- Takahashi K, Tango T: An extended power of cluster detection tests. Statistics in Medicine. 2006, 25: 841-852.PubMedView ArticleGoogle Scholar
- Forsberg L, Bonetti M, Jeffery C, Ozonoff A, Pagano M: Distance-based methods for spatial and spatio-temporal surveillance. Spatial & Syndromic Surveillance for Public Health. Edited by: Lawson AB, Kleinman K. 2005, Chichester: Wiley, 115-131. 2Google Scholar
- Huang L, Kulldorff M, Gregorio D: A spatial scan statistic for survival data. Biometrics. 2007, 63: 109-118.PubMedView ArticleGoogle Scholar
- Duczmal L, Kulldorff M, Huang L: Evaluation of spatial scan statistics for irregularly shaped clusters. Journal of Computational and Graphical Statistics. 2006, 15 (2): 428-442.View ArticleGoogle Scholar
- Iyengar VS: Space-time clusters with flexible shapes. Morbidity and Mortality Weekly Report. 2005, 54 (Supplement): 71-76.PubMedGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.