An exact test to detect geographic aggregations of events

Background Traditional approaches to statistical disease cluster detection focus on the identification of geographic areas with high numbers of incident or prevalent cases of disease. Events related to disease may be more appropriate for analysis than disease cases in some contexts. Multiple events related to disease may be possible for each disease case and the repeated nature of events needs to be incorporated in cluster detection tests. Results We provide a new approach for the detection of aggregations of events by testing individual administrative areas that may be combined with their nearest neighbours. This approach is based on the exact probabilities for the numbers of events in a tested geographic area. The test is analogous to the cluster detection test given by Besag and Newell and does not require the distributional assumptions of a similar test proposed by Rosychuk et al. Our method incorporates diverse population sizes and population distributions that can differ by important strata. Monte Carlo simulations help assess the overall number of clusters identified. The population and events for each area as well as a nearest neighbour spatial relationship are required. We also provide an alternative test applicable to situations when only the aggregate number of events, and not the number of events per individual, are known. The methodology is illustrated on administrative data of presentations to emergency departments. Conclusions We provide a new method for the detection of aggregations of events that does not rely on distributional assumptions and performs well.


Background
In disease surveillance, statistical methods can be used to identify geographical areas that have statistically higher numbers of cases of disease than expected by chance. These geographical areas with aggregations of disease are called clusters. In some situations, the disease incidence or prevalence may not be the most or only relevant feature for analysis, and the analysis of events related to diseased individuals may be more appropriate. For the delivery of health services through emergency departments (EDs), the number of presentations to EDs can be more relevant than the number of distinct individuals seen in the ED. If there are many individuals that have multiple presentations, analysis based solely on the number of individuals, and not the number of presentations, will only be able to identify clusters with excess numbers of individuals and not clusters with excess presentations. Ignoring presentations prevents identification of clusters where more presentations, but not necessarily more individuals, are occurring than expected by chance. Surveillance of presentations to EDs can identify geographic areas with high presentations where access to other health care providers is limited and statistical detection of these areas necessitates incorporating repeated presentations by individuals.
There are several different tests for the identification of clusters of diseased individuals (cases) in geographic areas (see [1,2] for overviews). The geographic areas are generally administrative regions for which case and population counts are available. Besag and Newell [3] used the terms general and focused to distinguish between tests that identify clusters in a geographic region and those that identify clusters around specific geographic locations, such as hazardous waste sites. For both types of tests, the methods generally either examine areas with similar population size and compare the number of cases or examine areas with similar numbers of cases and compare the population sizes. Some methods can detect the location of clusters while others can identify areas with the tendency to cluster.
We focus on methods that can accommodate geographic areas with different population sizes. One exploratory method has been proposed by Openshaw et al. [4] and requires the user to specify a threshold. Their geographic analysis machine constructs overlapping circles and displays the circles with proportions of cases exceeding the threshold. The Turnbull et al. [5] method also uses overlapping circles but these circles require a user-specified population size. The maximum number of cases in each of the circles is determined and a Monte Carlo test assesses if the observed proportion is higher than expected. Kulldorff and Nagarwalla [6] generalize this approach and seek to identify a clustered circle, where the individuals inside the circle have greater risk than the individuals outside the circle, using a likelihood ratio test. Duczmal and Assunção [7] propose a similar approach that uses connected subgraphs to identify the cluster. Tango [8] takes a different strategy by using area proportions of cases and a measure of closeness. A chi-square test compares observed and expected proportions to detect clusters. Besag and Newell [3] look at similar numbers of cases and combine nearby areas to observe a userspecified number of cases. The population sizes of these areas with similar cases are then compared. More recently, Rosychuk et al. [9] have provided a similar method for events rather than cases by using a compound Poisson distribution and compared the two approaches [10].
Herein, we provide an approach to identify geographic areas with aggregations of events. These aggregations are called event clusters, or simply clusters when the distinction between aggregations of events and cases is clear. Our approach is analogous to the cluster detection test proposed by Besag and Newell [3] and does not require distributional assumptions, like the Rosychuk et al. [9] test. The approach is instead based on exact probabilities for the numbers of events in a tested geographic area. The method incorporates diverse population sizes and population distributions that can differ by important strata. This method requires the number of events per case is known. In addition, we provide an approach that can be used when the total number of events are known for the cases in a geographical area, but not the number of events per case. We describe the methodology and compare the methods using administrative data on emergency department (ED) presentations for self-inflicted injuries. We provide a simulation study that examines the false detection of clusters and conclude our discussion.

Results
We consider a geographic region divided into I cells where the population in cell i is denoted by n i , i = 1,..., I. The total population is labeled n n i i I = = ∑ 1 . The distances between cell centroids are calculated for each pair of cells. For each cell, the remaining cells are ordered based on increasing pairwise distances to determine the nearest neighbours. Specifically, we let i 1 be the cell that is closest to cell i, i 2 be the cell that is next closest to cell i, and so on. For convenience, we have i 0 = i. We first describe the Besag and Newell [3] approach for cases using a hypergeometric distribution. We provide our new exact method for events and review the compound Poisson approach developed by Rosychuk et al. [9]. We also describe the inclusion of strata variables and the choice of event cluster size.

Hypergeometric approach for cases
The Besag and Newell [3] method requires the number of cases and population in each cell. For each cell i, let C i be the random variable denoting the number of cases in the cell and let c i be the observed value. Further, let and the observed value of the total number of cases in the entire region, respectively. In practical applications for a rare disease, c i ≪ n i and generally c ≪ n i , for i = 1,..., I. Each cell i is tested separately and the test is based on combining nearest neighbours to obtain a minimum pre-specified number of cases, called the cluster size. For cell i, suppose that the cluster size is k. The test statistic, L i , is the smallest number of nearest neighbours that must be combined with i to have at least k cases, Smaller observed values of L i are indicative of higher numbers of cases in the vicinity of the tested cell.
Under a null hypotheses that all individuals are equally likely to be a case independent of other cases and geographic location, Besag and Newell [3] use a Poisson distribution as an approximation to the hypergeometric probability. Using the exact hypergeometric formulation, the probability that there are exactly x cases in a sample of m population is If we denote the population of the combined cells as We will refer to this testing approach as the hypergeometric case (HC) analysis in further discussions.
As each cell is tested separately and multiple testing is an issue, Besag and Newell [3] suggest Monte Carlo simulations to help assess the significance of overall clustering. Let R a denote the number of cells that are significant at significance level a. For each simulation, the c cases are randomly distributed to the I cells according to the proportion of population in each cell (i.e., cases are distributed to cell i based on probability n i /n). The testing is conducted and the number of significant cells for each simulation is recorded. The overall assessment of statistical significance is obtained as the proportion of simulations that are at least as extreme as the observed R a from the actual data set. We refer to this overall assessment as a p-value for overall clustering.

The proposed approaches for events
Our proposed method builds on the principles of the HC method and the approach of Rosychuk et al. [9] that concentrates on the number of disease-related events rather than the number of diseased individuals. As before, C i denotes the random variable corresponding to the number of cases in cell i. Let V i and v i be the random variable and observed value of the number of events in cell i, respectively. The total number of cases (i.e., individuals with at least one event) and events for the region are Each cell is tested separately as for the cases approach described above. The cluster size, now termed event cluster size, is the number of events rather than the number of cases used in the test. Let k* be the event cluster size for cell i. The test statistic is the smallest number of nearest neighbours that must be combined with cell i to have at least k* events, Smaller observed values of L i * are indicative of higher numbers of events in the vicinity of the tested cell.
The use of events requires changes to the null hypothesis and the exact probability. The null hypothesis is that every individual is equally likely to have events, independent of other individuals and geographic location. The hypergeometric probability is no longer appropriate if there is more than one event per case. If the number of events per case is known, then a multiple hypergeometric approach can be used as described below. If the number of events per case is unknown but the total number of cases and events are known, then a counting approach which employs occupancy numbers can be used.

Known number of events per case
Let C iy be the random variable denoting the number of cases in cell i that have exactly y events and let c iy denote its observed value. Suppose that the maximum number of events any subject has is Y, 1 ≤ Y. The total number of cases in cell i is C The probability of the number of events observed in a geographic area is based on a multiple hypergeometric distribution, since individuals with the same number of events can be thought of as classes and the subjects are sampled without replacement from these classes. The probability of observing x events among a sample of m individuals is where {r y } are non-negative integers from the set  , and we refer to this approach as the exact event (EE) analysis.
In practical application, the random variables representing events are replaced by the corresponding observed values and the expected number of events, n i:ℓ v/n, is helpful in describing the results. Simulated data sets are created analogously as described for the Besag and Newell approach for cases and a p-value for overall clustering is determined. We used a C ++ program for numerical results and M (x, n i:ℓ ) is quite easily obtained by convoluting binomial coefficients and storing the intermediate quantities. The terms ⎟ are common to each combination of (r 1 , ..., r Y ), and thus can be precomputed and simplified for better numerical stability by using logarithms.
Our new approach is similar to the test provided by Rosychuk et al. [9]. The difference in approaches is the way in which the key probabilities are calculated. In that work [9], instead of the probability in (4), probabilities from a compound Poisson distribution are used. Specifically, P (x, n i:ℓ ) replaces M (x, n i:ℓ ) in (5) and is defined by where Q(y) is the probability that a case has exactly y events. Those authors take Q(y) to be the observed frequency c •y /c and with this choice, the expected number of events becomes n i:ℓ (c/n) (v/c) = n i:ℓ v/n. This approach, referred to as the compound Poisson event (CPE), assumes that a compound Poisson distribution is appropriate and relies on estimates of the Q(y) that may not be very precise. Both methods have the same expected number of events but these events come from distinct distributions. As in the HC approach for cases, Monte Carlo simulations can be used to help assess overall clustering.

Aggregate number of events known only
In some situations, the total number of events, v i and v, and cases may be known but the exact number of events per case (e.g., c iy , c •y ) may not be known. Here, the relevant probability can be determined using a counting approach employing occupancy numbers (see [11] p. 38).
The number of ways that V events can be distributed among the n individuals is is the number of ways that x events can be distributed among the m individuals and is the number of ways to distribute the remaining Vx events among the remaining nm individuals. Putting these counts together, the probability of observing x events among a sample of m individuals is The significance level for the tested cell becomes and we refer to this approach as the aggregate event (AE) analysis. As in the EE approach, random variables are replaced by the corresponding observed values and simulation is used to assess overall significance.
A key difference with this new aggregate event method and the EE and CPE methods is that the latter two methods link the events to cases. The AE approach uses only aggregate events, assumes that events are indistinguishable, and does not include information on the number of cases or the chance that a particular number of events are observed from one subject. Hence a loss of information occurs with the use of the AE approach, however if the information available does not include the number of events per case then the EE and CPE methods are not applicable and the AE method permits analysis.

Stratification
Cells can have population distributions that differ on key characteristics such as gender and age. These strata can be added to the test to adjust for differing popula- The test statistic in (3) also applies when strata are added. The relevant probability is based on (4), but has to be modified to include the different strata and the multiple ways that the events may be distributed amongst the strata. The probability of observing where  = {(r 11 ,..., r Y1 ,..., r 1S ,..., r YS ), such that The incorporation of strata tends to reduce the computational load as compared to the calculations without strata. The number of cases with events in a particular stratum is smaller than the total number of cases with events and thus, fewer combinations of (r 1s , ..., r Ys ) will be required. These combinations can then be calculated through a convolution.
The AE approach can be similarly extended for strata. The total number of events in stratum s is Choosing k* The actual size of a real cluster will not be known. The tests are based on identifying a cluster of a particular size, k*. With the dependence on k*, its choice is key and will depend on the specific data context. If the population sizes are similar among cells, a pre-specified, meaningful k* may be chosen and used for testing each cell. If population sizes are quite different, then cells may be better tested at individually-chosen, pre-specified k*. Le et al. [12] proposed a testing algorithm for the Poisson version of the HC method and a similar scheme is used for CPE approach [9]. We follow the latter approach.
In our prescription, event cluster sizes k k k , , , are determined for cell i, i = 1, ..., I. Cell i is tested at k i0 * and if it is significant at a level a, testing is concluded for this cell. If it is not significant the cell is tested again at k i1 * . The testing algorithm proceeds in this matter until the tested cell is significant at an event cluster size or the set of event cluster sizes has been exhausted. This approach is similar to sequential analysis and the event cluster sizes are defined as for tests of size a, w = 0, 1, 2. The event cluster sizes are based on the 100 × (1 -a) percentile of the distribution of events with populations from the cell and up to three of its nearest neighbours. ¿From this definition, cells with significant tests will have the test statistic equal to w whereas cells without sufficient events will have to combine more than w neighbours (i.e., L i * > w). With potentially multiple event cluster sizes tested per cell, the Monte Carlo simulations are an important aspect to address the potentially increased testing. This approach to choosing the event cluster size can be easily extended to depend on strata and/or determined similarly for the AE approach.
We have chosen w to be at most 2 for our application because some of our geographic areas are quite large, either in population or geography, and w beyond 2 would potentially mean that large geographic areas and/ or large population areas would be tested together. It is unlikely that such a situation would yield a statistically significant cluster, since the population sizes become quite large. Other choices for w are possible and would depend on the particular geography involved and the user's preference. In particular, one could potentially have different numbers of tests per cell (i.e., w i ).

Self-inflicted injury data
We use data from an administrative database that includes all presentations to emergency departments (EDs) in the province of Alberta, Canada. We focus on presentations to EDs for self-inflicted injuries by adolescents (≥ 13 and < 18 years of age) during April 1, 1998, to March 31, 1999. Individuals with at least one ED presentation during the study period are the cases and the ED presentations are the events. Data are stratified by sex (male, female) and age group (13-14, 15-17 years).
The province of Alberta ( Figure 1) is divided into I = 68 sub-Regional Health Authorities (HAs) with very diverse population sizes (median 2975, range 599 to 10298). During the study period, adolescents numbered n = 223999 in the population and c = 764 individuals presented v = 852 times to the ED with self-inflicted injuries. Although most individuals presented once, the range of presentations was one to 18. For the HAs, the median number of cases was 7.5 (range 0 to 43) and the median number of events was 8 (range 1 to 51). Alberta Health and Wellness provided the administrative data and the distances between population-based centroids. For each HA, Alberta Health and Wellness has determined a population-based centroid that is the latitude and longitude of the centre of the geographic area weighted by population. The distances between centroids are calculated and the ordering is provided for each cell.
Key to this illustrative example is that individuals can make multiple ED presentations during the study period. If analysis is based on the cases alone and not the presentations, then a case with only one presentation and a case with 10 presentations are treated exactly the same. The information on the "extra presentations" is ignored and only areas with excess number of cases can be identified as a cluster. Areas that have excess numbers of presentations, but not necessarily excess numbers of cases, cannot be identified unless the analysis includes the presentation information. From the health services perspective, the presentations are the most appropriate unit of analysis and while identifying areas with excess cases is important, the identification of excess areas of presentations may suggest areas where disease severity may be greater or alternative health care options are not as available.
We compare the HC, CPE, and EE methods on the Alberta data set stratified by sex and age group ( Table 1). The table lists the size of the cluster tested (k or k*), observed test statistic (ℓ), and the observed (O) and expected (E) cases or events. Tests that were significant at a = 0.05 are indicated with an asterisk (*). HAs that are not significant were tested at all cluster sizes, from w = 0 to 2, with the results for the last test reported. For the significant HAs, the w is the same as the value of ℓ. For non-significant HAs, the number of cells (ℓ) that need to be combined to have at least the size of the cluster tested is larger than w. The HAs are grouped according to the Regional Health Authority that provides services to the HAs (e.g., HAs 1 to 5 are part of the same Regional Health Authority). Figures 2,  3, and 4 display the HAs that are significant, either alone or in combination with other HAs. For each method, the simulation-based overall tests suggest that the results are not likely to have occurred by chance. In the 1000 simulations for the HC case analysis, 3 had at least 15 clusters observed (overall p-value = 0.003). For the event analyses, the CPE approach had 5 simulations with at least 13 clusters (overall p-value = 0.005) and the EE approach had 0 simulations with at least 15 clusters (overall p-value = 0.000).
The HC and EE methods both identify 15 significant HAs as part of clusters, although the significant HAs are not the same in each analysis. The CPE method identifies 13 significant HAs, all of which are identified by the EE method. Generally, the HAs that were significant with all three methods had higher numbers of cases and events than expected by chance. Similarly, the HAs that were not found to be significant by any of the methods had fewer cases and events than expected by chance. The event cluster sizes are generally very close for the EE and CPE analyses, with the EE analyses having slightly lower sizes for some HAs. One thousand Monte Carlo simulations were done for each method and less than six simulated data sets in each approach were at least as extreme as the actual number of clusters observed. Hence, not all clusters are likely to be spurious.
We focus our discussion on the HAs where the methods have incongruous results. HAs 1, 5, 7, and 29 are significant as parts of clusters for the HC analysis, but not significant for the CPE and EE analyses. For these HAs, generally the number of cases is high and the number of events are just a bit lower than the CPE and EE cluster sizes. To be significant as a cluster alone in the CPE or EE analyses, HA 29 needed to have at least 30 events whereas it only had 28 events. HAs 3 and 4 were identified as clusters in the EE analysis, but not in the CPE analysis. In particular, the event cluster size required for the EE analysis was 43 based on the combination of HAs 3, 4, and 2. There were exactly 43 events in these combined cells and the EE test was significant. For the CPE analysis, the corresponding event cluster size was 44 and the combined HAs had too few events.  The fact that all of the significant HAs in the CPE analysis were all identified by the EE approach is not a feature that is likely to occur in all data sets. In particular, our event cluster sizes are relatively large and the relevant probabilities of the CPE and EE may be closer than in other situations. In addition, the compound Poisson distribution may not be an appropriate distribution for other data. Furthermore, our example had key HAs that were significant alone and when these HAs were combined with some other HAs, they too became significant. Typically, there would be some cells where the cells individually are not significant but when combined together do become significant.
In practice, a researcher would probably conduct both analyses based on cases and analyses based on events. Each analysis tests different aspects of data based on multiple disease-related events. If the median number of events is larger than one, then the analysis based on events is likely more informative. We would expect that users would prefer the EE approach over the CPE approach because it does not require distributional (parametric) assumptions, even if there is some power loss over parametric procedures. In addition, this approach can be very efficiently programmed even when there are strata variables. If there are no strata variables, then the CPE approach may be more timely.

Simulation
We examine the Type I Error of the EE and CPE approaches through simulation studies. In these studies, the Alberta cells and geographic relationship are used. The cell populations are set to be the Alberta population or the same population in each cell (1000, 5000, or 8000). Five settings for the probability of multiple events per case are considered ( Table 2) with varying means and skewness chosen for convenience. Setting S5 is based on the self-inflicted injury data in Alberta. The event rate is set to be 2 events per 1000 population. That is, the total number of events in each simulated data set was 136, 680, and 1088 for the settings with 1000, 5000, and 8000 population per cell, respectively. When the actual Alberta population is used, the total number of events is 447. With multiple event probabilities and crude event rates, the simulated data sets are created by randomly assigning the c •1 , c •2 , ... cases to the cells based on each cell's proportion of the total population (as in the Monte Carlo simulations for assessing overall clustering). These settings allow us to demonstrate that the detection of events is different than the detection of cases. In particular, clusters of events may be identified that are not also clusters of cases.
For each simulation setting, we generated 1000 data sets and applied the CPE and EE approaches to each     Table 3. With the discreteness of the distributions, the cluster size may coincide with a smaller significance level than the significance level setting of a = 0.05. We provide the effective significance level, a * , for each scenario based on the cluster size and provide the number of simulations that had at least one cluster detected. For the scenarios with constant cell populations the a * is the same for each simulation whereas for the data sets based on the Alberta population, the mean and standard deviation (SD) for a * are provided. As each simulated data set has different testing results, the mean and SD for the number of significant clusters are also given.
In general, the EE and CPE approaches have similar results and the results of each approach are close to the (effective) significance level. The CPE results tend to identify slightly fewer clusters than expected. With larger population sizes, the effective significance level is closer to 0.05. When the diverse population sizes of Alberta are used, the number of clusters detected is more variable. Both methods provide detection rates of false clusters that are close to what is expected by the (effective) significance level.

Conclusions
We have provided a new method for the detection of aggregations of events related to diseased individuals, called event clusters. This method builds on the approaches of Besag and Newell [3] for cases and of Rosychuk et al. [9] for events. Our approach uses the exact probability of observing events from a sample of the population. It is applicable to situations where cases may have multiple events and the number of events are of interest. The population sizes can differ from cell to cell and can be adjusted by strata. The method is easy to implement in computer code and requires a minimal amount of data from the administrative region. We used a testing algorithm similar to sequential testing to determine the size of clusters specific to each cell and compared the new method with two other approaches using data on presentations to emergency departments for self-inflicted injuries. In some contexts, like our emergency department presentations, the number of events can be more relevant than the number of cases and new method provides an exact approach for determining aggregations of these events. The use of cases only does not necessarily capture the relevant type of clustering and analyses based on events can also complement analyses based on cases.
Our approach is based on exact probabilities and does not require distribution assumptions as in the compound Poisson approach by Rosychuk et al. [9]. Those authors had to specify a probability distribution for the   Table 3 Simulation results for each cell size and scenario EE CPE n i Scenario a * (SD) Sim(SD) a * (SD) Sim(SD) Alberta S1 3.9(0.6) 4.0 (0.9) 3.9(0.6) 3.9 (0.9) The effective significance level, a * , is provided for each scenario and each approach (and standard deviations [SDs] for the Alberta scenarios). The mean number of significant cells (Sim) are provide along with SDs. Numbers are given as percentages (%).
situation when a case has exactly y events and use estimates in their application. The distribution could be misspecified, estimates may not be very precise, and the p-value does not capture the additional uncertainty of these estimates. Our method uses the multiple hypergeometric distribution that does require a distribution, or estimates, for the event probabilities. Our new approach is particularly appealing when analyses are conducted by strata variables. With more strata variables, there are fewer possible combinations to calculate and the use of convolutions makes the computational time less than that required by the compound Poisson approach.
One drawback of the testing algorithm is that the user must choose the number of cluster sizes to be tested. This choice is perhaps easier than choosing a particular cluster size to test, especially if areas have diverse population sizes. Furthermore, the structure of the algorithm dictates that statistically significant cells will have the unappealing feature that p-values will be close to the significance level. Additionally, overall clustering can be assessed through Monte Carlo simulations and the same approach could be used to examine individual cell tests (i.e., calculate the proportion of simulations for which a particular cell was statistically significant as described in Rosychuk et al. [9]).
Both our new method and the compound Poisson method require that the number of events per cases is known. We also introduced an approach for the situation when the number of cases and events are known, but not the number of events for each case. Administrative data sources may not record the number of events per case and we provided an analysis approach for this context. This approach considers events to be indistinguishable and does not specifically use the number of cases. While not preferable to the analysis with information on the events per case, this approach is applicable for less detailed administrative data.
Further work is necessary to compare the power of the different approaches to detect different sized clusters and to consider different data generating mechanisms. Most cases in our data example only had one event. Our event results were different than the case analysis but further examination should include larger numbers of events per case. Such investigations will give users a greater ability to decide which method is most appropriate for the surveillance of disease-related events in their jurisdiction.