# Detection of arbitrarily-shaped clusters using a neighbor-expanding approach: A case study on murine typhus in South Texas

Zhijun Yao^{1}, Junmei Tang^{2} and F Benjamin Zhan^{3}

International Journal of Health Geographics 2011, **10**:23

**DOI:** 10.1186/1476-072X-10-23

© Yao et al; licensee BioMed Central Ltd. 2011

**Received:** 8 November 2010

**Accepted:** 31 March 2011

**Published:** 31 March 2011

## Abstract

### Background

Kulldorff's spatial scan statistic has been one of the most widely used statistical methods for automatic detection of clusters in spatial data. One limitation of this method is that it relies on scan windows with predefined shapes in the search process, and therefore it cannot detect clusters with arbitrary shapes. We employ a new neighbor-expanding approach and introduce two new algorithms to detect clusters with arbitrary shapes in spatial data. These two algorithms are called the maximum-likelihood-first (MLF) algorithm and the non-greedy growth (NGG) algorithm. We then compare the performance of these two new algorithms with the spatial scan statistic (SaTScan), Tango's flexibly shaped spatial scan statistic (FlexScan), and Duczmal's simulated annealing (SA) method using two datasets. Furthermore, we use the methods to examine clusters of murine typhus cases in South Texas from 1996 to 2006.

### Results

When compared with the SaTScan and FlexScan methods, the two new algorithms were more flexible and sensitive in detecting clusters with arbitrary shapes in the test datasets. Clusters detected by the MLF algorithm were statistically more significant than those detected by the NGG algorithm. However, the NGG algorithm appears to be more stable when there are no extreme cluster patterns in the data. For the murine typhus data in South Texas, a large portion of the detected clusters were located in coastal counties where environmental conditions and the socioeconomic status of some population groups were at a disadvantage when compared with those in other counties with no clusters of murine typhus cases.

### Conclusion

The two new algorithms are effective in detecting the location and boundary of spatial clusters with arbitrary shapes. Additional research is needed to better understand the etiology of the concentration of murine typhus cases in some counties in south Texas.

## Introduction

In recent years, there has been a significant increase in public concern about environmental hazards and disease events [1]. The need to identify spatial patterns and discover their underlying causes has led to a variety of methods for this task. Cluster detection methods play an important role in modern epidemiological research and public health practice, offering clues to the spatial location of emerging diseases and knowledge of their etiological and pathological causes [1]. A number of spatial statistical methods have been incorporated in cluster detection given the wide adoption of statistical methods since the early 1960s [2]. Many of these methods were developed from statistical indices such as Local Indicators of Spatial Association (LISA) [3] and the local G statistic [4]. These indices were incorporated into some spatial cluster detection methods, such as A Multidirectional Optimum Ecotope-Based Algorithm (AMOEBA) proposed by Aldstadt and Getis [5]. Among these spatial statistical methods, the spatial scan statistic has been one of the most widely used [6, 7].

Inspired by the work of Openshaw et al. (1987) [8] and Turnbull et al. (1990) [9], Kulldorff (1997) developed a spatial scan statistic that can detect clusters of various sizes by placing and moving circular windows across the study area [7]. Rather than specifying the size of a potential cluster *a priori*, this method uses a scan window of varying size, corresponding to varying population and varying numbers of incidents. This method has been applied in many research fields. Examples of these applications include disease pattern analysis [10], criminology [6, 11], networks [12], as well as ecology and the environment [13]. However, the spatial scan statistic and other similar approaches suffer from some restrictions in practice [14, 15]. Although this method can be adapted to use any shape for scan windows [7], it still has limitations in practice because the predefined geometrical shapes of scan windows [15] leave a large number of candidate clusters out of the test. It is therefore necessary for researchers to develop methods that can be used to detect clusters with arbitrary shapes.

Recently, many methods and strategies have been proposed to improve the detection of clusters with arbitrary shapes by constructing scanning windows of irregular shapes. Tango and Takahashi (2005) presented a "flexibly shaped spatial scan statistic" (FlexScan), which uses a limited exhaustive search to detect arbitrarily shaped clusters by aggregating each area with its nearest neighboring areas [16]. The spatial scan statistic superimposes circular windows on the study area, while FlexScan generates irregularly shaped windows on each area by aggregating its nearest neighboring areas. To reduce the number of arbitrarily shaped scanning windows, Tango and Takahashi [16] limited the length of clusters, that is, the number of areas contained in a scanning window, to a relatively small value. This method extends the spatial scan statistic to detect irregular shapes but is only applicable for detecting clusters of small or moderate sizes. In addition, the determination of the threshold size of a cluster is very subjective, though Tango and Takahashi (2000) suggested choosing about 10 to 15 percent of the size of the whole study area as a reasonable number.

One solution to this problem involves setting a constraint to guide the search process so as to reduce the number of candidate scan windows. Patil and Taillie (2004) introduced the concept of an "upper level set" and developed an "upper level set scan statistic" [17]. Based on this statistic, a more generalized strategy named minimum spanning tree (also called a cheapest connecting network) was proposed by Assuncao et al. (2006) to reduce the number of neighbors to be searched [18]. The method is also described as a greedy growth search (GGS), which absorbs only those neighboring areas that maximize the likelihood of a new window. This idea was further improved in the Density-Equalizing Euclidean Minimum Spanning Tree (DEEMST) method proposed by Wieland and her colleagues (2007) [19]. The Minimum Spanning Tree method offers two different functions: in a static minimum spanning tree, the weight refers to the difference of risk rate; in a dynamic minimum spanning tree, the variance of the maximum likelihood ratio is taken into account. These methods are similar to GGS in that they absorb only the neighboring areas that maximize the likelihood of a new window during the search. They have the flexibility to start the search from any location in the study area.

GGS cannot avoid the local maximum problem [20]. Many algorithms have been adopted or developed to improve GGS. Genetic algorithms have been employed to limit the irregular shape of the most likely clusters [21–23]. Yiannakoulias et al. (2007) presented two approaches to improve the greedy growth search: one is a non-connectivity penalty that limits very irregular cluster shapes, and the other is a depth limit (*u*) that prevents the generation of large super-clusters from smaller clusters [1]. The depth limit terminates the search in GGS when it fails to increase the likelihood after a predefined number of steps.

Another well-known improvement is the "simulated annealing strategy" proposed by Duczmal and Assuncao (2004). This method is based on graph theory, in which nodes represent the centers of areas and edges represent the geographical relationships among areas [20]. The simulated annealing spatial scan statistic was improved by introducing a non-compactness penalty to reduce the chance that a cluster with an extremely irregular shape would be found [24]. Most of the recently proposed methods try to detect the globally most likely cluster [20, 23]. This is critical in cluster detection because the search process of some methods frequently leads to, or becomes stuck on, a locally most likely cluster.

In this article, we report the development of two algorithms that use a new neighbor-expanding approach based on the assumption that any subset of adjacent areas could make up a potential cluster, and that the shape of this cluster might not be circular or rectangular. These two algorithms are called the maximum-likelihood-first (MLF) algorithm and the non-greedy growth (NGG) algorithm. They build upon existing cluster detection techniques and adopt neighbor-expanding tactics to construct a set of scan windows instead of using only scan windows of predefined shapes. Furthermore, the proposed algorithms improve arbitrarily shaped cluster detection by mitigating the local maximum problem, since they search for the globally most likely cluster at each step of the search process.

### Two New Algorithms

#### Kulldorff's Spatial Scan Statistic

Because the two algorithms were built upon the spatial scan statistic, it is necessary to review the spatial scan statistic first. Kulldorff's scan statistic method starts by choosing an appropriate probability model of the data to compute the likelihood ratio test statistic λ(Z) for any scan window Z. After the primary cluster candidate with the maximum λ(Z) has been identified, a Monte Carlo hypothesis testing procedure assesses its statistical significance and yields a p-value [25].

The test contrasts the null hypothesis $H_0$ (constant probability for all areas) with the alternative hypothesis $H_1$ (the specific area $Z$ has a larger probability than outside areas) using either a Bernoulli model or a Poisson model. For a given region $Z$, the likelihood function based on the Bernoulli model can be expressed using expression (1), where $\mu(G)$ and $\mu(Z)$ are the total population of the study area and the population in region $Z$; $n_G$ and $n_Z$ are the total number of observed cases in the study area and in region $Z$; $p$ is the probability that an incident falls in region $Z$; and $q$ is the probability that an incident falls in the rest of the study area. The likelihood of observing $n_Z$ in region $Z$ is given by the function shown below:
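For orientation, the standard Bernoulli-model likelihood from Kulldorff [7] can be written in the notation defined above. This is a sketch of the textbook form, assumed here rather than taken from the article's own numbered expressions; $\hat{p}$, $\hat{q}$ and $L_0$ are symbols introduced only for this illustration.

$$
L(Z, p, q) = p^{\,n_Z}\,(1-p)^{\,\mu(Z)-n_Z}\; q^{\,n_G-n_Z}\,(1-q)^{\,(\mu(G)-\mu(Z))-(n_G-n_Z)}
$$

The likelihood is maximized at $\hat{p} = n_Z/\mu(Z)$ and $\hat{q} = (n_G-n_Z)/(\mu(G)-\mu(Z))$. Under the null hypothesis $p = q$, the maximized likelihood is

$$
L_0 = \left(\frac{n_G}{\mu(G)}\right)^{n_G}\left(1-\frac{n_G}{\mu(G)}\right)^{\mu(G)-n_G},
$$

so the likelihood ratio for a window can be taken as

$$
\lambda(Z) =
\begin{cases}
\dfrac{L(Z, \hat{p}, \hat{q})}{L_0} & \text{if } \hat{p} > \hat{q},\\[2ex]
1 & \text{otherwise.}
\end{cases}
$$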

Once the most likely cluster has been identified, the next step is to test the statistical significance of the detected clusters. To do so, a *p*-value derived from a Monte Carlo simulation is used. Monte Carlo hypothesis testing, proposed by Dwass in 1957 [26], was first introduced to cluster detection tests by Turnbull et al. [9]. In a Monte Carlo simulation, a large number of random replications are generated under a chosen distribution model, conditioned on the simulated number of cases being the same as in the real data. In this study, we used the real population counts in each area in the Monte Carlo replications. The disease events in each area are drawn from a non-homogeneous Poisson distribution with an area-specific mean. The likelihood ratio for each region is calculated for the replicated data in the same way as for the real data, so each simulated dataset yields its own maximum likelihood ratio. The *p*-value is then calculated from the rank of the real data among the sorted maximum likelihood ratios of the real and simulated data. For example, if there are *N* simulated datasets and one real dataset, the total number of datasets is *N*+1. If the real data rank *n*-th when all *N*+1 datasets are sorted by their maximum likelihood ratios in descending order (that is, *n*−1 simulations have a maximum likelihood ratio larger than or equal to the one obtained from the real data), the *p*-value is *n*/(*N*+1). The smaller the *p*-value, the less likely it is that the cluster is due to chance. Due to the uncertainty associated with cluster validation, it is suggested that the proposed approach be used as an exploratory rather than a deterministic cluster detection tool.
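The rank-based *p*-value described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' code: the function names `monte_carlo_p_value` and `simulate_null_cases` are hypothetical, and the null replication assumes that each area's mean is proportional to its population, in which case drawing Poisson counts conditioned on the total number of cases is equivalent to a multinomial draw.

```python
import random


def monte_carlo_p_value(observed_max_llr, simulated_max_llrs):
    """Rank of the real data among N + 1 data sets, divided by N + 1."""
    n_sims = len(simulated_max_llrs)
    # Rank 1 means no simulated data set reached the observed maximum.
    rank = 1 + sum(1 for s in simulated_max_llrs if s >= observed_max_llr)
    return rank / (n_sims + 1)


def simulate_null_cases(populations, total_cases, rng):
    """One null replication with the total number of cases held fixed.

    Assumption for this sketch: the expected count of an area is
    proportional to its population, so each case is assigned to an
    area with probability population / total population.
    """
    areas = list(populations)
    weights = [populations[a] for a in areas]
    counts = dict.fromkeys(areas, 0)
    for area in rng.choices(areas, weights=weights, k=total_cases):
        counts[area] += 1
    return counts
```

With 999 replications, the smallest attainable value is 1/1000 = 0.001, which matches the smallest *p*-values reported in the comparison tables.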

#### A New Neighbor-expanding Approach

Suppose we choose {16} as a seed region at the first length (highlighted in red in Figure 1a). We can then get its seven neighbors, areas 10, 11, 12, 15, 18, 22, and 23 (Figure 1b). Thus seven regions can be obtained at the second length based on region {16}: {10, 16}, {11, 16}, {12, 16}, {15, 16}, {18, 16}, {22, 16}, and {23, 16}. To obtain the third-length regions, we can choose region {15, 16} and get its neighbor areas: 14, 10, 11, 12, 19, 18, 21, 22, and 23. This gives 9 regions at the third length: {14, 15, 16}, {10, 15, 16}, {11, 15, 16}, {12, 15, 16}, {19, 15, 16}, {18, 15, 16}, {21, 15, 16}, {22, 15, 16}, and {23, 15, 16}.
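The expansion step in this example can be sketched as follows. The adjacency table is an assumption made for illustration: the neighbors of area 16 are taken from the text, while the neighbors listed for area 15 are a hypothetical choice that reproduces the nine third-length regions, since the map in Figure 1 is not available here. The function names are likewise hypothetical.

```python
def region_neighbors(region, adjacency):
    """Areas adjacent to at least one member of the region, excluding its members."""
    neighbors = set()
    for area in region:
        neighbors.update(adjacency.get(area, ()))
    return neighbors - set(region)


def expand_region(region, adjacency):
    """All regions one length longer, each formed by adding a single neighbor."""
    return [frozenset(region) | {n}
            for n in sorted(region_neighbors(region, adjacency))]


# Neighbors of 16 follow the text; neighbors of 15 are assumed for this sketch.
ADJACENCY = {
    16: {10, 11, 12, 15, 18, 22, 23},
    15: {14, 16, 19, 21},
}

second_length = expand_region({16}, ADJACENCY)      # seven regions
third_length = expand_region({15, 16}, ADJACENCY)   # nine regions
```

Because every region at one length can spawn several regions at the next, repeating `expand_region` over all results grows the candidate pool very quickly.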

As this search process continues, the number of regions increases exponentially as we aggregate more areas, which makes the process computationally very intensive. In order to reduce the number of regions, we developed two alternative algorithms for the construction of regions or scan windows: the maximum-likelihood-first (MLF) algorithm and the non-greedy growth (NGG) algorithm.

##### The Maximum-likelihood-first Algorithm

The principal goal of this algorithm is to direct the construction of new regions toward a global maximum. This maximum refers to the highest value we were able to obtain with the proposed approach. After analyzing equations (4) and (5), we found that it is hard to determine which of the following factors contributes most to the likelihood ratio: the number of cases, the population size, or the relationship between them. Thus, there is no clear guidance that could help us construct the scan windows with the highest likelihood ratios. Rather than constructing scan windows randomly, we focus on generating windows for the most promising clusters. We name this the maximum-likelihood-first (MLF) approach because it always constructs new promising clusters by expanding from the current best candidate, that is, the one yielding the maximum likelihood ratio.

When we detect clusters using the neighbor-expanding approach described above, it is very likely that the procedure will stick to some areas with high log likelihood ratios (LLRs) and be unable to search the entire study area. Usually, the LLRs of candidate clusters depend on the risk rates of their neighbors [19]. That is, areas with higher risk rates are more likely to have higher LLRs than those with lower risk rates, since the LLRs of clusters do not vary much if they contain the same subset of areas [7]. This means that if a candidate cluster overlaps largely with another candidate cluster with a high LLR, it may have a higher LLR than other areas which have not been explored. As a result, the search procedure tends to stay with one area and its neighbors if their LLRs increase fast at the beginning and decrease slowly. Therefore, it is necessary to set a threshold to stop the search around a particular area and its neighbors when the LLRs of the newly generated clusters fail to increase within a certain number of steps. This arrangement allows the search to move to other unexplored areas to detect other potential cluster centers. Originally suggested by Yiannakoulias, Rosychuk, and Hodgson (2007) as a depth limit adaptation [1], this idea is incorporated into the MLF algorithm.

As shown in Figure 2, this procedure is repeated until half of the total population or study area is covered. The cluster with the highest LLR is selected as the most likely cluster, while the secondary cluster is the cluster with the next highest LLR that does not overlap the most likely cluster. Since this approach does not focus on one or a few particular areas, it is expected to avoid the local maximum problem.
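The ideas above (always expand the current best candidate, and abandon a branch once its LLR has failed to increase for a set number of steps) can be sketched as a best-first search. This is a minimal illustration under stated assumptions, not the authors' implementation, since the procedure in Figure 2 is not reproduced here. The names `poisson_llr` and `mlf_search`, the parameters `depth_limit` and `max_pop_share`, and the choice of Kulldorff's Poisson LLR are all assumptions of this sketch, and populations are assumed to be positive.

```python
import heapq
import math


def poisson_llr(cases, expected, total_cases):
    """Kulldorff's Poisson log likelihood ratio for one window (0 if no excess)."""
    if cases <= expected:
        return 0.0
    inside = cases * math.log(cases / expected)
    rest = total_cases - cases
    outside = rest * math.log(rest / (total_cases - expected)) if rest > 0 else 0.0
    return inside + outside


def mlf_search(cases, population, adjacency, depth_limit=3, max_pop_share=0.5):
    """Best-first expansion: always grow the candidate with the highest LLR."""
    total_cases = sum(cases.values())
    total_pop = sum(population.values())

    def llr(region):
        c = sum(cases[a] for a in region)
        pop = sum(population[a] for a in region)
        return poisson_llr(c, total_cases * pop / total_pop, total_cases)

    # Heap entries: (-LLR, sorted areas, steps without an LLR increase).
    heap = [(-llr((a,)), (a,), 0) for a in cases]
    heapq.heapify(heap)
    seen = {frozenset((a,)) for a in cases}
    best_llr, best_region = 0.0, frozenset()

    while heap:
        neg_llr, region, stalled = heapq.heappop(heap)
        current = -neg_llr
        if current > best_llr:
            best_llr, best_region = current, frozenset(region)
        neighbors = set().union(*(adjacency[a] for a in region)) - set(region)
        for n in neighbors:
            child = frozenset(region) | {n}
            if child in seen:
                continue
            if sum(population[a] for a in child) > max_pop_share * total_pop:
                continue  # stop growing at the population cap
            seen.add(child)
            child_llr = llr(child)
            child_stalled = 0 if child_llr > current else stalled + 1
            if child_stalled <= depth_limit:  # depth limit: drop stalled branches
                heapq.heappush(heap, (-child_llr, tuple(sorted(child)), child_stalled))
    return best_region, best_llr
```

The sketch only returns the single best region; selecting a non-overlapping secondary cluster would require keeping the scored candidates and filtering them afterwards.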

##### The Non-greedy Growth Algorithm

The non-greedy growth (NGG) algorithm is an improved version of the greedy growth algorithm [1]. Several researchers have described how greedy growth approaches perform in searching for clusters with irregular shapes [1, 24]. The greedy growth search starts with areas having a high log likelihood ratio as seed areas for potential clusters. At each step, the search considers only the neighboring area that has the maximum LLR, or that maximizes the LLR when aggregated to form a new potential cluster. Similar to the procedure described above, the greedy growth algorithm joins other areas until a given population size or another threshold is reached. The same procedure is repeated from other seed areas.

The greedy growth approach sounds tempting, but it has an inherent deficiency in that it does not guarantee to find either the best solution or the global maximum. This method easily falls into the trap of local maximum since it excludes some areas which might potentially form a more promising cluster when they combine with other areas.

To solve this problem, we propose a new algorithm to minimize the impact of the local maximum problem. To distinguish it from traditional greedy growth approaches, we name it "the non-greedy growth algorithm". The algorithm includes not only the neighboring area with the local maximum but also many other neighboring areas in the search procedure. Usually the number of newly formed regions depends on the number of candidate regions and the number of neighbors of each region. With this method, we can set a constraint on each of these two numbers to control the number of newly formed regions at the next step of the search process. Previous studies suggest that the number of candidate regions increases exponentially, while the number of neighbors of each region does not change dramatically. Therefore, it is more reasonable to set a threshold on the number of candidate regions. If we choose only one candidate and the one neighbor with the highest LLR each time, this method reduces to the traditional greedy growth search. The opposite extreme is the naïve exhaustive approach, in which no limitation is set.

In the NGG algorithm, we set a threshold (M) on the maximum expected number of new regions at each iteration. Given that threshold and the average number of neighbors, we can easily determine how many candidate regions should be chosen to participate in the aggregation process. There are a few options for choosing the candidate regions: one is to take the most promising regions directly from the pool of candidates, and another is to choose them randomly. In the implementation reported in this paper, we used a combination of the two; that is, part of the candidates are taken from the top regions and the rest are chosen randomly.
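The candidate selection described above can be sketched as a level-by-level search. This is an illustration under stated assumptions rather than the authors' code: the names `select_candidates` and `ngg_search`, the parameters `max_new_regions` (playing the role of M), `top_share` and `max_pop_share`, the use of a Poisson LLR, and the fixed random seed are all choices made for this sketch.

```python
import math
import random


def poisson_llr(cases, expected, total_cases):
    """Kulldorff's Poisson log likelihood ratio for one window (0 if no excess)."""
    if cases <= expected:
        return 0.0
    rest = total_cases - cases
    outside = rest * math.log(rest / (total_cases - expected)) if rest > 0 else 0.0
    return cases * math.log(cases / expected) + outside


def select_candidates(scored, n_keep, top_share, rng):
    """Keep part of the candidates from the top by LLR and draw the rest at random."""
    ranked = sorted(scored, key=lambda item: item[0], reverse=True)
    n_top = min(len(ranked), max(1, round(n_keep * top_share)))
    chosen = ranked[:n_top]
    rest = ranked[n_top:]
    n_random = min(len(rest), n_keep - n_top)
    if n_random > 0:
        chosen += rng.sample(rest, n_random)
    return chosen


def ngg_search(cases, population, adjacency, max_new_regions=20,
               top_share=0.5, max_pop_share=0.5, seed=0):
    rng = random.Random(seed)
    total_cases = sum(cases.values())
    total_pop = sum(population.values())

    def llr(region):
        c = sum(cases[a] for a in region)
        pop = sum(population[a] for a in region)
        return poisson_llr(c, total_cases * pop / total_pop, total_cases)

    # Number of candidates to expand = threshold / average number of neighbors.
    avg_neighbors = sum(len(v) for v in adjacency.values()) / len(adjacency)
    n_keep = max(1, int(max_new_regions / avg_neighbors))

    frontier = [(llr(frozenset((a,))), frozenset((a,))) for a in cases]
    best = max(frontier, key=lambda item: item[0])
    while frontier:
        children = {}
        for _, region in select_candidates(frontier, n_keep, top_share, rng):
            neighbors = set().union(*(adjacency[a] for a in region)) - region
            for n in neighbors:
                child = region | {n}
                if sum(population[a] for a in child) <= max_pop_share * total_pop:
                    children[child] = llr(child)
        frontier = [(score, region) for region, score in children.items()]
        if frontier:
            best = max([best] + frontier, key=lambda item: item[0])
    return best[1], best[0]
```

Setting `max_new_regions` so low that only one candidate is kept, with `top_share=1.0`, makes the sketch follow a single best path, in the spirit of the greedy growth search; a very large `max_new_regions` approaches the exhaustive search.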

An initial comparison of MLF and NGG

| | Advantage | Disadvantage | Favored Situation |
|---|---|---|---|
| MLF | • results might be more significant, with higher LLRs • it is faster than NGG when there are few clusters | • it is hard to control when most clusters have relatively similar LLRs • only the cluster with the highest LLR is kept for the next search | • data containing few extreme clusters • small number of units |
| NGG | • the maximum number of candidate clusters is controllable • it is simple to implement | • the search procedure will continue until it reaches the stopping criteria | • large number of units |

#### Study Case and Data Preparation

The data used in the present study include geographic boundary shapefiles, population data, and disease data issued by the Texas Department of State Health Services. In this study, cluster detection was performed at the census block group level, and the geographic boundary shapefiles were obtained from the Environmental Systems Research Institute (ESRI) website and the ESRI Data DVD [28]. There are a total of 1,068 census block groups and 1,728,393 inhabitants in the study area. The population and socioeconomic data were derived from the 2000 Census Summary File 1 (SF1) and Census Summary File 3 (SF3) [29] and joined to the geographic boundary shapefile to allow for spatial cluster analysis. The disease data used in this study consist of 555 murine typhus cases reported to the Texas Department of State Health Services from 1996 to 2006. Although cases were reported throughout the year during this period, 44% of them were found in May, June, and July. The raw disease data were stored in an Excel file containing the geographical location of cases (latitude and longitude), the onset time of cases (year, month, and day), the age, gender, and race of patients, and the zip code and street name of cases. The disease data were spatially joined to the boundary file using ArcGIS 9.3.

## Results and Discussions

### Performance Test Using Simulated Data and Benchmark Data

The comparison between the MLF method, NGG method, SA method, Tango's FlexScan method and Kulldorff's SaTScan method using the synthesized data. Where a method has two rows, the second row is the second cluster it detected.

| Cluster shape | Method | Observed # | Expected # | LLR | p-value |
|---|---|---|---|---|---|
| Compact shape | MLF | 95 | 41.646 | 27.396 | 0.001 |
| | NGG | 95 | 41.646 | 27.396 | 0.001 |
| | SA | 95 | 41.464 | 27.396 | 0.001 |
| | FlexScan | 95 | 41.646 | 27.396 | 0.001 |
| | SaTScan | 95 | 41.646 | 27.396 | 0.001 |
| Ring shape | MLF | 90 | 39.273 | 26.083 | 0.001 |
| | NGG | 90 | 39.273 | 26.083 | 0.001 |
| | SA | 90 | 39.273 | 26.083 | 0.001 |
| | FlexScan | 32 | 15.273 | 7.165 | 0.836 |
| | SaTScan | 128 | 80.730 | 13.756 | 0.001 |
| Long shape | MLF | 50 | 21.010 | 15.069 | 0.001 |
| | NGG | 50 | 21.010 | 15.069 | 0.001 |
| | SA | 50 | 21.010 | 15.069 | 0.001 |
| | FlexScan | 30 | 12.606 | 8.866 | 0.432 |
| | SaTScan | 28 | 16.810 | 3.202 | 0.993 |
| Extreme shape | MLF | 115 | 51.343 | 32.513 | 0.001 |
| | NGG | 115 | 51.343 | 32.513 | 0.001 |
| | SA | 115 | 51.343 | 32.513 | 0.001 |
| | FlexScan | 65 | 29.020 | 17.477 | 0.003 |
| | | 45 | 20.091 | 11.877 | 0.081 |
| | SaTScan | 86 | 49.110 | 12.425 | 0.001 |
| | | 35 | 15.630 | 9.143 | 0.024 |
| Two-cluster | MLF | 70 | 30.970 | 19.367 | 0.001 |
| | | 35 | 15.485 | 9.343 | 0.016 |
| | NGG | 70 | 30.970 | 19.367 | 0.001 |
| | | 35 | 15.485 | 9.343 | 0.016 |
| | SA | 70 | 30.970 | 19.367 | 0.001 |
| | | 35 | 15.485 | 9.343 | 0.016 |
| | FlexScan | 70 | 30.970 | 19.367 | 0.001 |
| | | 25 | 11.061 | 6.599 | 0.954 |
| | SaTScan | 78 | 39.820 | 15.470 | 0.001 |
| | | 28 | 17.700 | 2.627 | 0.998 |

A comparison of the MLF method, NGG method, Duczmal's SA method, Tango's FlexScan method, and Kulldorff's SaTScan method using the benchmark data (population: 29,535,210; total cases: 58,943)

| | MLF | NGG | SA | FlexScan | SaTScan (circular) | SaTScan (elliptic) |
|---|---|---|---|---|---|---|
| Observed # | 17,002 | 17,743 | 15,122 | 6,980 | 21,039 | 15,122 |
| Expected # | 14,166 | 15,383 | 12,988 | 6,005 | 19,734 | 12,988 |
| LLR | 237.24 | 85.97 | 227.11 | 84.11 | 44.95 | 44.71 |
| p-value | 0.001 | 0.001 | 0.001 | 0.001 | 0.01 | 0.001 |

### Detection of Cluster with Arbitrary Shapes

Cluster detection analysis results for murine typhus cases in South Texas from 1996 ~ 2000 at the census block group level (population: 1,728,393; total cases: 391). ML = most likely cluster; Sec = secondary cluster.

| | MLF (ML) | MLF (Sec) | NGG (ML) | NGG (Sec) | FlexScan (ML) | FlexScan (Sec) | SaTScan circular (ML) | SaTScan elliptic (ML) | SaTScan circular (Sec) | SaTScan elliptic (Sec) | SA (ML) | SA (Sec) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LLR | 186.43 | 9.33 | 197.51 | 6.15 | 42.95 | 36.95 | 97.60 | 124.69 | 6.67 | 6.49 | 177.15 | N/A |
| # of zones | 71 | 11 | 94 | 1 | 16 | 9 | 127 | 121 | 27 | 3 | 164 | N/A |
| Observed # | 142 | 12 | 167 | 3 | 30 | 25 | 145 | 138 | 2518 | 6 | 220 | N/A |
| Expected # | 18.96 | 2.53 | 26.5 | 0.15 | 3.01 | 2.37 | 33.54 | 28.99 | 6.69 | 0.87 | 50.5 | N/A |
| p-value | 0.01 | 0.25 | 0.01 | 0.32 | 0.01 | 0.01 | 0.01 | 0.01 | 0.42 | 0.74 | 0.01 | N/A |

### Spatial Distribution of Clusters and Socioeconomic Factors

An examination of Figures 8-13 reveals that the most likely clusters are mainly located in the coastal counties, particularly in Nueces County. Caused by two organisms, *Rickettsia typhi* and *R. felis* [32], murine typhus is easily carried and transmitted by small mammals such as mice, domestic cats, and opossums, and by their associated fleas. The spread of murine typhus is thought to require a warm and humid environment, which is probably why most of the detected clusters are located in the coastal area.

Relation between the number of most likely cluster and high density population

| Method | # of cluster | Density > 1000: # of cluster | Density > 1000: percent (%) | Density > 2000: # of cluster | Density > 2000: percent (%) |
|---|---|---|---|---|---|
| MLF | 71 | 66 | 92.96 | 42 | 59.15 |
| NGG | 94 | 77 | 81.91 | 44 | 46.81 |
| SA | 164 | 107 | 65.24 | 54 | 32.93 |
| FlexScan | 16 | 16 | 100 | 13 | 81.25 |
| Elliptic SaTScan | 112 | 101 | 90.18 | 59 | 52.68 |
| Circular SaTScan | 127 | 116 | 91.34 | 67 | 52.76 |

We can also see a similarity between the distribution of cluster patterns and environmental factors. Most of the reported cases are found in urban areas with very high population densities. High population density usually brings problems such as increasing amounts of urban garbage and commensal rodents. These also increase the likely exposure of opossums, a peridomestic animal, to cat fleas and rickettsial pathogens, because opossums frequently visit human habitation in search of both food and harborage [25]. Moreover, high population densities also increase the number of household pets, which are another common host of cat fleas. Besides rats and mice, the cat flea readily transfers from parasitized cats and opossums to other animals of similar size.

To further verify and explain the detected cluster patterns, we collected and analyzed four other socioeconomic factors at both the county level and the census block group level: median household income, the percentage of the population below the poverty level, median year in which houses were built, and median value of owner-occupied housing units. Nueces County, which contains the majority of the most likely clusters, has a relatively higher median household income ($35,959) and median house value ($70,100) than the average for all 18 counties (median household income $27,026 and median house value $48,467). The economy of Nueces County, driven mainly by tourism and the petrochemical industry, depends on its largest coastal city, Corpus Christi, which also drives the development of related commercial real estate and other industries.

Socioeconomic data of the most likely cluster within Nueces County. The columns from MLF to Circular SaTScan refer to the block groups in the most likely cluster detected by each method.

| Socioeconomic factor | All block groups | MLF | NGG | SA | FlexScan | Elliptic SaTScan | Circular SaTScan |
|---|---|---|---|---|---|---|---|
| Median house income ($) | 35,959 | 31,167 | 30,469 | 33,521 | 26,427 | 28,419 | 30,580 |
| Poverty rate (%) | 18 | 21 | 19 | 19 | 24 | 26 | 23 |
| Median house built year | 1967 | 1958 | 1919 | 1963 | 1953 | 1957 | 1959 |
| Median house value ($) | 70,100 | 58,857 | 63,074 | 63,648 | 49,363 | 56,033 | 58,048 |

## Conclusion

There is an important difference among the performance of traditional SaTScan, FlexScan, SA, and the two algorithms (MLF and NGG) introduced in this paper. Kulldorff's method searches for the maximum likelihood ratio using a predefined geometrical shape (circle or ellipse), while the FlexScan method searches for the nearest maximum. For most circular clusters, the spatial scan statistic offers fast and efficient cluster detection in many applications, which is why this method is popular for providing an initial analysis in most cluster studies. The two new algorithms make it easier to find the location and boundaries of clusters with arbitrary shapes. Moreover, by adopting the idea of global-optimization strategies, the two new algorithms reduce the effects of the local maximum problem by searching for the global maximum of the likelihood ratios at each step.

We compared the clusters detected by the two new algorithms with those from SaTScan, FlexScan, and SA, and found that the neighbor-expanding method performed considerably better for clusters with arbitrary shapes. However, the computation time of the NGG algorithm was much longer than that of the MLF algorithm. This might be caused by the lack of a constraint when the NGG selects the seed to detect the next-level cluster in the search process. Without any penalty on the shape of the result, the NGG allows more detected clusters than the MLF and SA. One possible solution to this problem is to limit the degree of shape irregularity allowed in a detected cluster according to some appropriate criteria, minimizing the occurrence of false clusters. Alternatively, we could post-process the detected results after cluster analysis to remove the highly irregular ones, but this solution would require more detection time and expert knowledge in selecting an appropriate threshold.

One of the most critical components of environmental epidemiology is to estimate the associations between human exposures and health outcomes [33, 34]. In order to further understand the etiology of a disease, we need to explore the proximity, frequency, and magnitude of potential environmental hazards and their effects on humans. This cluster analysis helps us understand the geographic distribution of murine typhus in Texas. It indicates that the most likely cluster of murine typhus is mostly located in warm and humid areas, notably eastern Nueces County along coastal Texas. Moreover, at the census block group level, most of the block groups in the detected clusters (more than 80% for the MLF and NGG algorithms) are in high population density areas (population > 1000 per square kilometer) with lower household incomes and home values. These findings suggest that the distribution of murine typhus is associated with both environmental and socioeconomic factors.

The choice of scale/resolution in cluster analysis deserves some attention. In most case studies, we would prefer to choose a resolution fine enough that each areal unit represents the disease distribution in a relatively homogeneous area. Furthermore, the spatial aggregation of areal data may change the pattern of disease and make the results difficult to validate because of the modifiable areal unit problem (MAUP). A possible solution to this problem involves performing the cluster analyses at different scales of areal units to estimate the effects of MAUP; this issue will be addressed in future research. If possible, it would be much better to analyze the scale effect before conducting a cluster analysis. The choice of scale/resolution should be made separately for specific cases or specific diseases in different regions. Although there is no specific rule to follow, users of the algorithms should be very familiar with the characteristics of the disease in question as well as the study area before cluster detection is conducted.

## Declarations

### Acknowledgements

This article is based on part of Zhijun Yao's dissertation research under the supervision of F. Benjamin Zhan. Benjamin Zhan's work was in part supported by Wuhan University and the Chang Jiang Scholar Awards Program. The Chang Jiang Scholar Awards Program is jointly sponsored by China Ministry of Education and the Li Ka Shing Foundation (Hong Kong, China). The authors wish to thank the Texas Department of State Health Services for providing the data about Murine Typhus.

## References

- Yiannakoulias N, Rosychuk RJ, Hodgson J: Adaptations for finding irregularly shaped disease clusters. International Journal of Health Geographics. 2007, 6: 28-54. 10.1186/1476-072X-6-28.
- Burton I: The quantitative revolution and theoretical geography. The Canadian Geographer. 1963, 7: 151-162. 10.1111/j.1541-0064.1963.tb00796.x.
- Anselin L: Local Indicators of Spatial Association - LISA. Geographical Analysis. 1995, 27: 93-115. 10.1111/j.1538-4632.1995.tb00338.x.
- Getis A, Ord JK: The Analysis of Spatial Association by Use of Distance Statistics. Geographical Analysis. 1992, 24: 189-206. 10.1111/j.1538-4632.1992.tb00261.x.
- Aldstadt J, Getis A: Using AMOEBA to create a spatial weights matrix and identify spatial clusters. Geographical Analysis. 2006, 38: 327-343. 10.1111/j.1538-4632.2006.00689.x.
- Nakaya T, Yano K: Visualising crime clusters in a space-time cube: an exploratory data-analysis approach using space-time kernel density estimation and scan statistics. Transactions in GIS. 2010, 14: 223-239. 10.1111/j.1467-9671.2010.01194.x.
- Kulldorff M: A spatial scan statistic. Communications in Statistics-Theory and Methods. 1997, 26: 1481-1496. 10.1080/03610929708831995.
- Openshaw S, Charlton ME, Wymer C, Craft A: Mark I geographical analysis machine for the automated analysis of point data sets. International Journal of Geographical Information Systems. 1987, 1: 335-358. 10.1080/02693798708927821.
- Turnbull BW, Wano EJ, Burnett WS, Howe HL, Clark LC: Monitoring for clusters of disease: application to leukemia incidence in upstate New York. American Journal of Epidemiology. 1990, 132: 136-143.
- Fischer EAJ, Pahan D, Chowdhury SK, Oskam L, Richardus JH: The spatial distribution of leprosy in four villages in Bangladesh: An observational study. BMC Infectious Diseases. 2008, 8: 125-131. 10.1186/1471-2334-8-125.
- Minamisava R, Nouer SS, Morais NOL, Melo LK, Andrade ALS: Spatial clusters of violent deaths in a newly urbanized region of Brazil: highlight the social disparities. International Journal of Health Geographics. 2009, 8: 66-76. 10.1186/1476-072X-8-66.
- Duczmal L, Moreira GJP, Ferreira SJ, Takahashi RHC: Dual graph spatial cluster detection for syndromic surveillance in networks. Advances in Disease Surveillance. 2007, 4: 88-92.
- Tonini M, Tuia D, Ratle F: Detection of clusters using space-time scan statistics. International Journal of Wildland Fire. 2009, 18: 830-836. 10.1071/WF07167.
- Chen J, Roth RE, Naito AT, Lengerich EJ, MacEachren AM: Geovisual analytics to enhance spatial scan statistic interpretation: An analysis of U.S. cervical cancer mortality. International Journal of Health Geographics. 2008, 7: 57-75. 10.1186/1476-072X-7-57.
- Neill DB, Moore A, Sabhanani M: Detecting elongated disease cluster. Morbidity and Mortality Weekly Report. 2005, 54: 197-205.
- Tango T, Takahashi K: A flexibly shaped spatial scan statistic for detecting clusters. International Journal of Health Geographics. 2005, 4: 11-26. 10.1186/1476-072X-4-11.
- Patil GP, Taillie C: Upper level set scan statistic for detecting arbitrarily shaped hotspots. Environmental and Ecological Statistics. 2004, 11: 183-197. 10.1023/B:EEST.0000027208.48919.7e.
- Assuncao R, Costa M, Tavares A, Ferreira S: Fast detection of arbitrarily shaped disease clusters. Statistics in Medicine. 2006, 25: 723-742. 10.1002/sim.2411.
- Wieland SC, Brownstein JS, Berger B, Mandl KD: Density-equalizing Euclidean minimum spanning trees for the detection of all disease cluster shapes. PNAS. 2007, 104: 904-909. 10.1073/pnas.0609457104.
- Duczmal L, Assuncao R: A simulated annealing strategy for the detection of arbitrarily shaped spatial clusters. Computational Statistics and Data Analysis. 2004, 45: 269-286. 10.1016/S0167-9473(02)00302-X.
- Conley J, Gahegan M, Macgill J: A genetic approach to detecting clusters in point data sets. Geographical Analysis. 2005, 37: 286-317. 10.1111/j.1538-4632.2005.00617.x.
- Sahajpal R, Ramaraju GV, Bhatt V: Applying niching genetic algorithms for multiple cluster discovery in spatial analysis. Conference on Knowledge Discovery in Data Mining. 2005
- Duczmal L, Cancado ALF, Takahashi RHC, Bessegato LF: A genetic algorithm for irregularly shaped spatial scan statistics. Computational Statistics & Data Analysis. 2007, 52: 43-52.
- Duczmal L, Kulldorff M, Huang L: Evaluation of spatial scan statistics for irregularly shaped clusters. Journal of Computational and Graphical Statistics. 2006, 15: 428-442. 10.1198/106186006X112396.
- Wen S, Kedem B: A semiparametric cluster detection method - a comprehensive power comparison with Kulldorff's method. International Journal of Health Geographics. 2009, 8: 73-89. 10.1186/1476-072X-8-73.
- Dwass M: Modified randomization tests for nonparametric hypotheses. Annals of Mathematical Statistics. 1957, 28: 181-187. 10.1214/aoms/1177707045.
- Boostrom A, Beier MS, Macaluso JA, Macaluso KR, Sprenger D, Hayes J, Radulovic S, Azad AF: Geographic association of Rickettsia felis-infected opossums with human murine typhus, Texas. Emerging Infectious Diseases. 2002, 8: 549-554.
- ESRI: Download Census 2000 Tiger/line data. 2008, http://arcdata.esri.com/data/tiger2000/tiger_download.cfm
- U. S. Census Bureau: Your gateway to census 2000. 2000, http://en.wikipedia.org/wiki/urbanization
- Moura FR, Duczmal L, Tavares R, Takahashi RHC: Exploring multi-cluster structures with the multi-objective circular scan. Advances in Disease Surveillance. 2007, 2: 48-56.
- Demattei C, Molinari N, Daures JP: Arbitrarily shaped multiple spatial cluster detection for case event data. Computational Statistics and Data Analysis. 2007, 51: 3931-3945. 10.1016/j.csda.2006.03.011.
- Azad AF: Epidemiology of Murine Typhus. Annual Review of Entomology. 1990, 35: 553-569. 10.1146/annurev.en.35.010190.003005.
- Nuckols JR, Ward MH, Jarup L: Using geographic information systems for exposure assessment in environmental epidemiology studies. Environmental Health Perspectives. 2004, 1121: 1007-1015. 10.1289/ehp.6738.
- Ozkaynak H, Palma T, Touma JS, Thurman J: Modeling population exposures to outdoor sources of hazardous air pollutants. Journal of Exposure Science and Environmental Epidemiology. 2008, 18: 45-58. 10.1038/sj.jes.7500612.

## Copyright

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.