### The spatial scan statistic

In the spatial disease surveillance setting, we monitor a set of spatial locations *s*_{
i
}, and are given an observed count (number of cases) *c*_{
i
}and an expected count *b*_{
i
}corresponding to each location. For example, each *s*_{
i
}may represent the centroid of a zip code, the corresponding count *c*_{
i
}may represent the number of Emergency Department visits with respiratory chief complaints in that zip code for some time period, and the corresponding expectation *b*_{
i
}may represent the expected number of respiratory ED visits in that zip code for that time period, estimated from historical data. We then wish to detect any spatial regions *S* where the counts are significantly higher than expected.

The spatial scan statistic [

11] detects clusters of increased counts by searching over a large number of spatial regions, where each region

*S* consists of some subset of the locations

*s*_{
i
}, and finding those regions which maximize some likelihood ratio statistic. Given a set of alternative hypotheses

*H*_{1}(

*S*) (each representing a cluster in some region

*S*) and a null hypothesis

*H*_{0} (representing no clusters), the likelihood ratio

*F*(

*S*) for a given region

*S* is the ratio of the data likelihoods under the alternative and null hypotheses:

If the null or alternative hypotheses have any free parameters, we can compute the likelihood ratio statistic using the maximum likelihood parameter values [

26]:

Once we have found the regions with the highest scores *F*(*S*), we must still determine which of these high-scoring regions are statistically significant, and which are likely to be due to chance. In spatial disease surveillance, the significant clusters are reported to the user as potential disease outbreaks which can then be further investigated. The regions with the highest values of the likelihood ratio statistic are those which are most likely to have been generated under the alternative hypothesis (cluster in region *S*) instead of the null hypothesis of no clusters. However, because we are maximizing the likelihood ratio over a large number of spatial regions, multiple hypothesis testing is a serious issue, and we are very likely to find many regions with high likelihood ratios even when the null hypothesis is true.

Kulldorff's original spatial scan approach [11] deals with this multiple testing issue by "randomization testing", generating a large number of replica datasets under the null hypothesis and finding the maximum region score for each replica dataset. The p-value of a region *S* is computed as
, where *R*_{
beat
}is the number of replica datasets with maximum region score higher than *F*(*S*), and *R* is the total number of replica datasets. In other words, a region *S* must score higher than approximately 95% of the replica datasets to be significant at *α* = .05. As discussed below, several other approaches exist for determining the statistical significance of detected regions, and these alternatives may be preferable in some cases.

### Variants of the spatial scan statistic

We consider three different variants of the spatial scan statistic: Kulldorff's original Poisson scan statistic [11] and the recently proposed expectation-based Poisson [21] and expectation-based Gaussian [22] statistics. We will refer to these statistics as KULL, EBP, and EBG respectively. Each of these statistics makes a different set of model assumptions, resulting in a different score function *F*(*S*). More precisely, they differ based on two main criteria: which distribution is used as a generative model for the count data (Poisson or Gaussian), and whether we adjust for the observed and expected counts outside the region under consideration.

The Poisson distribution is commonly used in epidemiology to model the underlying randomness of observed case counts, making the assumption that the variance is equal to the mean. If this assumption is not reasonable (i.e. counts are "overdispersed" with variance greater than the mean, or "underdispersed" with variance less than the mean), we should instead use a distribution which separately models mean and variance. One simple possibility is to assume a Gaussian distribution, and both the Poisson and Gaussian distributions lead to simple and easily computable score functions *F*(*S*). Other recently proposed spatial cluster detection methods have considered negative binomial [27], semi-parametric [28], and nonparametric [29] distributions, and these more complex model assumptions might be preferable in cases where neither Poisson nor Gaussian distributions fit the data.

A second distinction in our models is whether the score function *F*(*S*) adjusts for the observed and expected counts outside region *S*. The traditional Kulldorff scan statistic uses the ratio of observed to expected count (i.e. the observed relative risk) inside and outside region *S*, detecting regions where the risk is higher inside than outside. The expectation-based approaches (EBP and EBG) do not consider the observed and expected counts outside region *S*, but instead detect regions where the observed relative risk is higher than 1, corresponding to a higher than expected count.

All three methods assume that each observed count *c*_{
i
}is drawn from a distribution with mean proportional to the product of the expected count *b*_{
i
}and an unknown relative risk *q*_{
i
}. For the two Poisson methods, we assume *c*_{
i
}~ Poisson(*q*_{
i
}*b*_{
i
}), and for the expectation-based Gaussian method, we assume *c*_{
i
}~ Gaussian(*q*_{
i
}*b*_{
i
}, *σ*_{
i
}). The expectations *b*_{
i
}are obtained from time series analysis of historical data for each location *s*_{
i
}. For the Gaussian statistic, the variance
can also be estimated from the historical data for location *s*_{
i
}, using the mean squared difference between the observed counts
and the corresponding estimated counts
.

Under the null hypothesis of no clusters *H*_{0}, the expectation-based statistics assume that all counts are drawn with mean *equal* to their expectations, and thus *q*_{
i
}= 1 everywhere. Kulldorff's statistic assumes instead that all counts are drawn with mean *proportional* to their expectations, and thus *q*_{
i
}= *q*_{
all
}everywhere, for some unknown constant *q*_{
all
}. The value of *q*_{
all
}is estimated by maximum likelihood:
, where *C*_{
all
}and *B*_{
all
}are the aggregate observed count ∑ *c*_{
i
}and aggregate expected count ∑ *b*_{
i
}for all locations *s*_{
i
}respectively.

Under the alternative hypothesis

*H*_{1}(

*S*), representing a cluster in region

*S*, the expectation-based statistics assume that the expected counts inside region

*S* are multiplied by some constant

*q*_{
in
}> 1, and thus

*q*_{
i
}=

*q*_{
in
}inside region

*S* and

*q*_{
i
}= 1 outside region

*S*. The value of

*q*_{
in
}is estimated by maximum likelihood. For the expectation-based Poisson statistic, the maximum likelihood value of

*q*_{
in
}is

, where

*C*_{
in
}and

*B*_{
in
}are the aggregate count

and aggregate expectation

respectively. For the expectation-based Gaussian statistic, the maximum likelihood value of

*q*_{
in
}is

, where

and

. These sufficient statistics can be interpreted as weighted sums of the counts

*c*_{
i
}and expectations

*b*_{
i
}respectively, where the weighting of a location

*s*_{
i
}is inversely proportional to the coefficient of variation

. The resulting likelihood ratio statistics are:

for

*C*_{
in
}>

*B*_{
in
}, and

*F*_{
EBP
}(

*S*) = 1 otherwise.

for
, and *F*_{
EBG
}(*S*) = 1 otherwise. Note that these likelihood ratio statistics are only dependent on the observed and expected counts inside region *S*, since the data outside region *S* is assumed to be generated from the same distribution under the null and alternative hypotheses. Detailed derivations of these two statistics are provided in [22].

Kulldorff's scan statistic uses fundamentally different assumptions than the expectation-based statistics: under the alternative hypothesis

*H*_{1}(

*S*) of a cluster in region

*S*, it assumes that the expected counts inside and outside region

*S* are multiplied by some unknown constants

*q*_{
in
}and

*q*_{
out
}respectively, where

*q*_{
in
}>

*q*_{
out
}. In this case, the maximum likelihood estimate of

*q*_{
in
}is

as in the expectation-based Poisson statistic, and the maximum likelihood estimate of

*q*_{
out
}is

, where

and

respectively. The resulting likelihood ratio statistic is:

if
, and *F*_{
KULL
}(*S*) = 1 otherwise. Note that Kulldorff's statistic does consider the counts and expectations outside region *S*, and will only detect increased counts in a region *S* if the ratio of observed to expected count is higher inside the region than outside. Also, the term
is identical for all regions *S* for a given day of data, and can be omitted when computing the highest-scoring region. However, this term is necessary to calibrate scores between different days (e.g. when computing statistical significance). Detailed derivations of Kulldorff's statistic are found in [11] and [22].

It is an open question as to which of these three spatial scan statistics will achieve the highest detection performance in real-world outbreak detection scenarios. We hypothesize that EBG will outperform EBP for datasets which are highly overdispersed (since in this case the Poisson assumption of equal mean and variance is incorrect) and which have high average daily counts (since in this case the discrete distribution of counts may be adequately approximated by a continuous distribution). Furthermore, we note that Kulldorff's statistic will not detect a uniform, global increase in counts (e.g. if the observed counts were twice as high as expected for all monitored locations), since the ratio of risks inside and outside the region would remain unchanged. We hypothesize that this feature will harm the performance of KULL for outbreaks which affect many zip codes and thus have a large impact on the global risk
. In recent work, we have shown empirically that EBP outperforms KULL for detecting large outbreaks in respiratory Emergency Department visit data [30], and we believe that this will be true for the other datasets considered here as well. However, KULL may be more robust to misestimation of global trends such as day of week and seasonality, possibly resulting in improved detection performance.

### Using the spatial scan statistic for prospective disease surveillance

In the typical prospective surveillance setting [

31], we infer the expected counts for each location from the time series of historical data. Let us assume that for each spatial location

*s*_{
i
}, we have a time series of counts

for

*t* = 0 ...

*t*_{
max
}, where time 0 represents the present. Our goal is to find any region

*S* for which the most recent

*W* counts are significantly higher than expected, where

*W* denotes the temporal window size. In the more general, space-time setting, we must find the maximum of the score function

*F*(

*S*) over all spatial regions

*S* and all temporal windows

*W* = 1 ...

*W*_{
max
}. Space-time scan statistics are considered in detail in [

21,

31,

32]; here we consider the purely spatial setting with a fixed window size of

*W* = 1. In this simplified setting, we must compute the expected counts

for the current day from the time series of counts

, using some method of time series analysis, and then find spatial regions where the current day's counts are significantly higher than expected. Here we consider four variants of the 28-day moving average (MA), with and without adjustments for day of week and seasonality. The MA method computes expectations as follows:

The moving average may be adjusted for day of week trends (MA-DOW) by computing the proportion of counts

occurring in each location

*s*_{
i
}on each day of the week (

*j* = 1 ... 7), using 12 weeks of historical data:

When we predict the expected count for a given location on a given day, we choose the corresponding value of
and multiply our estimate by 7
. This method of static adjustment for day of week assumes that weekly trends have a constant and multiplicative effect on counts for each spatial location. This is similar to the log-linear model-adjusted scan statistic proposed by Kleinman et al. [24], with the difference that we use only the most recent data rather than the entire dataset to fit the model's parameters.

The 28-day moving average takes seasonality into account by only using the most recent four weeks of data, but it may lag behind fast-moving seasonal trends, causing many false positives (if it underestimates expected counts for an increasing trend) or false negatives (if it overestimates expected counts for a decreasing trend). Thus we can perform a simple seasonal adjustment by multiplying the 28-day moving average by the ratio of the "global" 7-day and 28-day moving averages:

This "moving average with current week adjustment" (MA-WK) method has the effect of reducing the lag time of our estimates of global trends. One potential disadvantage is that our estimates of the expected counts using the 7-day average may be more affected by an outbreak (i.e. the estimates may be contaminated with outbreak cases), but using global instead of local counts reduces the variance of our estimates and also reduces the bias resulting from contamination. We can further adjust for day of week (MA-WK-DOW) by multiplying by seven times the appropriate
, as discussed above.