 Methodology
 Open Access
 Published:
Differentiating anomalous disease intensity with confounding variables in space
International Journal of Health Geographics volume 19, Article number: 37 (2020)
Abstract
Background
The investigation of perceived geographical disease clusters serves as a preliminary step that expedites subsequent etiological studies and analysis of epidemicity. With the identification of disease clusters of statistical significance, to determine whether or not the detected disease clusters can be explained by known or suspected risk factors is a logical next step. The models allowing for confounding variables permit the investigators to determine if some risk factors can explain the occurrence of geographical clustering of disease incidence and to investigate other hidden spatially related risk factors if there still exist geographical disease clusters, after adjusting for risk factors.
Methods
We propose to develop statistical methods for differentiating incidence intensity of geographical disease clusters of peak incidence and low incidence in a hierarchical manner, adjusted for confounding variables. The methods prioritize the areas with the highest or lowest incidence anomalies and are designed to recognize hierarchical (in intensity) disease clusters of respectively highrisk areas and lowrisk areas within close geographic proximity on a map, with the adjustment for known or suspected risk factors. The data on spatial occurrence of sudden infant death syndrome with a confounding variable of race in North Carolina counties were analyzed, using the proposed methods.
Results
The proposed Poisson model appears better than the one based on SMR, particularly at facilitating discrimination between the 13 counties with no cases. Our study showed that the difference in racial distribution of live births explained, to a large extent, the 3 previously identified hierarchical highintensity clusters, and a small region of 4 mutually adjacent counties with the higher raceadjusted rates, which was hidden previously, emerged in the southwest, indicating that unobserved spatially related risk factors may cause the elevated risk. We also showed that a large geographical cluster with the low raceadjusted rates, which was hidden previously, emerged in the mideast.
Conclusion
With the information on hierarchy in adjusted intensity levels, epidemiologists and public health officials can better prioritize the regions with the highest rates for thorough etiologic studies, seeking hidden spatially related risk factors and precisely moving resources to areas with genuine highest abnormalities.
Introduction
An important issue in spatial and temporal statistics is whether a set of discrete points are distributed randomly or they show a variety of signs of clustering. One of its major applications is in epidemiology; in particular, detecting and, more importantly, characterizing spatial and temporal clusters of adverse health events using existing health data collected on a basis of geographic units such as counties. The investigation of perceived incidence clusters or paucity of a certain disease is interesting in mathematics and probability per se. More importantly, we are interested in whether a finding indicating the presence of incidence anomalies (including clustering and paucity of disease incidence) will lead to a greater understanding of the etiology and underlying causal mechanism of disease or the identification of a common causal exposure for disease. With the identification of disease clusters of statistical significance, to determine whether or not the detected disease clusters can be explained by known or suspected risk factors is a logical next step.
The purpose of this paper is to develop and illustrate new statistical methods for differentiating incidence intensity of geographical disease clusters of peak incidence and low incidence, adjusted for covariates that are known or hypothesized risk factors, as well as testing for the presence of clustering. The methods are designed to recognize and construct hierarchical (in intensity) disease clusters of respectively highrisk areas and lowrisk areas within close geographic proximity or contiguity on a map, including confounding variables as covariates. The hierarchy in covariateadjusted intensity permits to occur between and within distinct geographical disease clusters. We propose to adjust for covariates by enumerating expected incidence of disease in each county through indirect standardization, conditioning on the total number of disease observed. The basic analysis is the one with no covariates. By including exactly one covariate at one time in further analyses, we can examine how the incorporated covariate affects the geographical disease incidence pattern. With information on geographical covariateadjusted incidence clustering patterns on a map, we can determine whether or not the previously detected geographical disease clusters of peak incidence or paucity of incidence can be explained by the covariates incorporated. We are further interested if there still exist geographical disease clusters of incidence anomalies, after adjusting for known or hypothesized risk factors, which could lead to further investigation into other spatially related risk factors that are hidden otherwise.
The existing statistical methods for epidemiologic incidence anomaly patterns generally focus on detecting and characterizing large or peak incidence over a time, space, or space–time series. The statistical methods that we introduce in this paper focus on geographical incidence paucity of as well as peak incidence of adverse health events, including confounding variables as covariates. In epidemiology, testing for disease aggregations is used to identify the association between the possible risk factors and the incidence of disease. In contrast, the detection of an unusually low incidence of disease indicates the presence of protective factors or the absence of risk factors associated with the disease. We and others previously proposed and formulated statistical methods for detecting an unusually low incidence of disease in a unit of time in a discrete time series and evaluated their sensitivity, power, and applicability, using a temporal series of data on adolescent suicide from the US National Center for Health Statistics and on childhood Langerhans cell histiocytosis patients in Taiwan [1, 2]. We articulated that statistical methods that are sensitive to incidence paucity in a unit of time characterize opposite aspects of an observed incidence pattern and can be as meaningful and useful in epidemiology as the methods that focus on incidence clustering in our articles. The same rationales hold true for statistical methods for spatial epidemiology and spatial statistics, as proposed here.
We illustrate proposed statistical methods, using the data on the spatial occurrence of sudden infant death syndrome (SIDS) in North Carolina counties over a 4 year period, 1974–1978. One possible confounding variable for SIDS is race. Information on racial distribution of live births is available. This data set has been analyzed in a variety of statistical and epidemiologic reports [3,4,5,6,7,8,9]. We choose this data set for orientation in order to readily illustrate the difference in utilities and applicability among these spatial statistical methods. Atkinson provides an early review of the SIDS incidence and notes the statistically nonsignificant clustering of incidence in time within county by single years of calendar time, using the EdererMyersMantel test [3]. Cressie and his colleagues develop methods to model the spatial trend (largescale variation) and spatial interdependence and autocorrelation (smallscale variation) in exploratory and confirmatory incidence analyses of this data [4,5,6]. Specifically, Cressie and Chan (1989) perform both unweighted and weighted linear regression and logistic regression of FreemanTukey transformed SIDS incidence on 5 explanatory variables, including population density, percentage urban, number of hospital beds, median family income, and FreemanTukey transformed nonwhite live birth rates, in modeling largescale variation. They use the Markov random field in modeling smallscale variation [4]. Instead of classifying the counties at high risk or medium risk to SIDS, according to the magnitude and ranking of the observed rate of incidence alone, Symons, Grimson, and Yuan propose to determine the classification of the incidence risk level, accounting for the variance of the estimate of the rate in a Poisson process [9]. They employ a mixture of Poisson distributions to model the disease incidence and determine criteria for classification, using maximum likelihood and Bayesian methods. These authors all focus on relevant but different respects of the spatial statistics problems from what we propose here.
Both the mapbased pattern recognition procedure [7] and the spatial scan statistic [8] can be used to identify disease clustering or disease clusters in a spatial point process in general. But they are designed for evaluating and characterizing different respects of spatially characteristic incidence clustering patterns and provide different information on spatial clustering. The spatial scan statistic is widely used for spatial cluster detection analysis and has been extended to a variety of models for detecting spatial, temporal, and space–time clusters, retrospectively or prospectively, using ordinal, survivaltime, multinominal, normal, and longitudinal data etc.. The spatial scan statistic searches for spatial disease clusters not explained by a baseline spatial point process without specifying their size or location a priori. It is able to identify the approximate location and range of the most likely disease clusters and secondary disease clusters and to perform a significance test for each cluster, based on the maximum likelihood ratio and using Monte Carlo hypothesis testing. While the identified most likely disease clusters may not be the areas with the highest incidence by the spatial scan statistic. In contrast, the pattern recognition procedure prioritizes the areas with the highest rates and is designed to determine hierarchical incidence intensity levels of mutually adjacent areas with the highest rates geographically. It is noted that this procedure exclusively focuses on peak incidence and does not allow for covariates in determining the hierarchical intensity levels. We previously used the pattern recognition procedure to investigate the spatial clustering patterns of dengue outbreaks in Taiwan [10].
Our proposed methods generalize the pattern recognition procedure in several respects and have following features to address important problems:

1.
We introduce the method for differentiating intensity of geographical disease clusters of low incidence in a hierarchical manner as well as testing for the presence of clustering.

2.
We propose the methods for taking into account covariates that are known or hypothesized risk factors of disease in constructing hierarchical (in adjusted intensity) clusters of highrisk areas and lowrisk areas close within geographic proximity, respectively.

3.
The use of indirectly covariateadjusted expected incidence permits the incorporation of covariates without the need of information on covariatespecific numbers of incidence in each area under study.

4.
Two distinct probability models are proposed to assess the deviation between the observed incidence and covariateadjusted expected incidence in each area.

5.
Two distinct neighborhood systems, adjacencybased and distancebased in the definition of close geographical proximity, are used for proposed models.
Instead of dividing the counties into high and mediumrisk categories on the basis of the incidence rates used in the practice and in the existing literature, we propose to divide the counties into high, medium, and lowrisk categories, then proceed to further differentiate incidence level of counties close within geographic proximity in the high and lowrisk categories respectively with and without the adjustment for confounding variables in a hierarchical manner.
The statistical methods that we propose in this report are not limited to advancing and generating studies of etiology of disease of interest with unknown causes. With the information on hierarchy in adjusted intensity levels, epidemiologists and public health officials can better prioritize the regions with the highest rates for thorough etiologic studies, seeking hidden spatially related risk factors, and precisely moving resources to areas with genuine highest abnormalities.
Methods
In this section, we first introduce the method for differentiating intensity of geographical disease clusters of low incidence as well as testing for the presence of clustering. It is based on the extension of the existing pattern recognition procedure, which focuses on hierarchical clusters of high incidence [7]. Next, we generalize these methods by taking into account covariates that are known or hypothesized risk factors of the disease. We consider a covariate of race for SIDS in this application. Thirdly, we consider two distinct neighborhood systems for the North Carolina counties in the application of the proposed spatial statistical models, which are adjacencybased and distancebased in the definition of geographical proximity. We illustrate proposed statistical methods, using the data on the spatial occurrence of SIDS in North Carolina counties in 1974–1978.
Study population
The data on SIDS patients with a confounding variable of race in 100 North Carolina counties provide an opportunity to illustrate the applications of the methods that we propose for geographical disease anomalies. SIDS is the third leading cause of all infant mortality in the US and remains the leading cause of death in infants aged from 1 month to 1 year. Its exact cause remains unknown. The frequency of SIDS appears to be influenced by social, economic, and cultural factors, such as maternal education, race or ethnicity, and poverty. Racial disparity in infants who died of SIDS has persisted. The rate of SIDS in nonHispanic African American infants and American Indian/Alaskan Native infants remains more than twice that of nonHispanic white infants in 2016 [11].
The data on the spatial occurrence of SIDS in North Carolina counties over a 4 year period, from July 01, 1974 to June 30, 1978, is used for illustration of our proposed methods. The information contained in this data set include the number of SIDS and the number of live births during this period for each of the 100 counties of North Carolina, as well as the countyseat locations. The number of live births was stratified into whites and nonwhites for each of the 100 counties. The total number of live births was 329,962, in which the numbers of white and nonwhite births were 224,881 and 105,081, respectively. The total number of SIDS was 667, in which the numbers of white and nonwhite SIDS were 268 and 399, respectively. The statewide incidence rate was 2.021 in deaths per 1000 live births. The overall incidence rates for the entire state by race were 1.192 for white children and 3.797 for nonwhite children per 1000 live births. Nonwhite SIDS rate was more than 3 times higher than that of whites, with the result that although nonwhites accounted for only 31.85% of the live births in the state during the study period, they accounted for 59.82% of all the SIDS cases reported. Details of the data sources and data collection methods have been described elsewhere [4].
Two distinct neighborhood systems for the North Carolina counties systems, distancebased and adjacencybased in the definition of close geographical proximity, are used for proposed models. They were determined by the criteria of being within 30 miles between the seats of 2 counties [4] and of sharing common geographical boundaries between 2 counties [5], respectively. The map on the 100 counties of North Carolina with county names is presented in Fig. 1a.
Existing mapbased pattern recognition procedure
The method developed by Mantel [12] was generalized by Cliff and Ord, who proposed the test statistic B_{1} = (1/2) Σ ω^{1}_{ij} x_{i} x_{j} where x_{i} = 1 if area i is a highrisk area for some disease and 0 otherwise, and where ω^{1}_{ij} = 1 if areas i and j are mutually adjacent geographically and 0 otherwise, ω^{1}_{ij} = ω^{1}_{ji}, ω^{1}_{ii} = 0 [13]. The sum ranges over all pairs of areas. It is an adjacencybased test statistic that measures spatial autocorrelation for binary data and uses the distribution of the number of adjacencies of geographic units. When highrisk areas tend to be geographically adjacent to each other, the value of B_{1} tends to be large. Using the test statistic B_{1}, one can test the null hypothesis of the random allocation of highrisk areas over the geographical region; that is, highrisk areas do not cluster. Cliff and Ord derived the expressions for the mean and variance of B_{1} under the assumptions of binomial and hypergeometric distributions [14]. We note that lower p values of B_{1} indicate high degrees of clustering or tight clustering, which conform to the adjacencybased definition of a cluster rather than being interpreted in the usual sense in this context.
Instead of selecting a specific threshold rate of incidence, the procedure proposes to first list the areas under study in rank order based on the disease intensity rates. It starts with classifying the 2 top ranking areas as highrisk areas and calculates the value of B_{1}. The procedure proceeds, letting the threshold rate of incidence vary downwards continuously, in which case it includes exactly one area with high rate at one time, up to an upper limit, such that at most we include, say, 25% of the total areas under study. Thus, the procedure subsequently includes the area with the 3rd highest rate and the other 2 areas with higher rates as highrisk areas and calculates the corresponding value of B_{1}. The p value is the probability that B_{1} is equal to or higher than the observed number of adjacencies involved between these 3 areas with the highest disease intensity rates. Therefore, the procedure provides the p value of B_{1} when the k top ranking areas among all areas under study are classified as highrisk areas for each k where k = 2, 3, 4….. The main feature of the procedure is to determine the hierarchical incidence intensity pattern through the distribution of p values of B_{1} for k = 2, 3, 4…., which is illustrated in an application to the SIDS data in North Carolina in 1974–1978 [7]. Figure 1a presents the countyspecific SIDS incidence intensity map in this period.
Instead of relying on the assumptions associated with the asymptotically normal distribution [14] or using Monte Carlo method [7], we propose to use simulationbased permutations using 1 million replicates based on the exact county boundary map, defined by ω^{1}, to obtain the null distribution of B_{1}. The basic geographic unit used in this report is a county. The distribution was simulated by randomly selecting exactly k counties among the 100 counties of North Carolina 1 million times and counting the number of the adjacent pairs appearing among the k counties for each of the 1 million replicates. The information on sharing common boundaries among North Carolina counties that we use here is available in the literature [5]. This process was applied for k = 2, 3, 4… 25. Each of the 24 distributions of B_{1} for k = 2, 3, 4… 25 is given in Table 1. We also used simulationbased permutations using 1 million replicates based on the exact district boundary map under study to obtain the null distribution of B_{1} in a study of dengue fever in Taiwan previously [10].
Hierarchical clusters of neighboring lowrisk areas
In the existing mapbased pattern recognition procedure, k indicates the number of top ranking counties of North Carolina with the higher intensity rates, classified as highrisk counties. While focusing on geographical disease clusters of incidence paucity, we propose to use k to indicate the number of top ranking counties with the lower intensity rates, classified as lowrisk counties. Using the pattern recognition procedure, we can determine the hierarchical (in intensity) spatial clusters of incidence paucity correspondingly.
Methods with adjustment of covariates
We propose to adjust for the effects of confounding variables by enumerating expected incidence (or number) of disease in each county through indirect standardization, conditioning on the total number of disease observed. We define a measure to be the expected number of patients with disease of interest in a spatial unit of the study region, a county of North Carolina in this application, with μ(A_{i}) = expected number of patients in county A_{i}. Assuming that the incidence for subpopulation j equals E_{j} in the entire study region and the size of subpopulation j in county A_{i} is a_{ij}, the expected incidence in county A_{i} is equal to the summation of the products of E_{j} and a_{ij}, across all j, μ(A_{i}) = \({\sum }_{j}{a}_{ij}{E}_{j}\). This indirectly covariateadjusted expected incidence for a county permits us to incorporate covariates with no need of information on covariatespecific numbers of incidence in every county. With the use of indirect standardization, the generalization to multiple covariates, adjusted for multiple confounding variables, is immediate.
The overall incidence rates for the state of North Carolina by race were 1.192 for white children and 3.797 for nonwhite children per 1000 live births in 1974–1978. With the information on the white and nonwhite numbers of live births for each of the 100 North Carolina counties [4], an expression of the raceadjusted expected incidence in county A_{i} is μ(A_{i}) = 1.192 × white births + 3.797 × nonwhite births, which is proportional to the expected number of SIDS.
We propose two models for assessing the discrepancy between the observed incidence and the countyspecific expected covariateadjusted incidence: the standardized morbidity ratio (SMR) and a Poisson model. The SMR, the ratio of the observed number of incidence to the expected number of incidence, provides a direct quantitative measure of the overall discrepancies between the observed incidence and expected covariateadjusted incidence per county.
Alternatively, we propose a Poisson probability model for assessing the probability of the deviation of a particular frequency to be attributed to sampling fluctuations. We propose to adjust for covariates under the assumption of Poisson random variables, which provide a crude way to account for the unequal variances of the county rates. The rank order is based on the probability associated with the assessment of the discrepancy between the observed incidence and countyspecific expected covariateadjusted incidence in a county. The Poisson distributional assumptions were previously employed by several investigators in studies of SIDS data [5, 6, 8, 9]. The existing literature generally assume a batch of independently and identically distributed Poisson random variables for distinct counties with the null hypothesis of homogeneity of statewide incidence [5, 6, 9]. In this report, we assume countyspecific Poisson probability models whose mean depends on the expected covariateadjusted incidence for each county.
Using the same expression of the raceadjusted expected incidence μ(A_{i}) for county A_{i}, we calculate the probability of departure from expected raceadjusted incidence for each county, based on the assumption of Poisson distribution. It is defined as follows, N_{i} = the number of incidence in county A_{i},
A small value of this probability D_{i} indicates that the SIDS incidence at county A_{i} is unusually high beyond the effect of race, if the observed incidence large than expected raceadjusted incidence, or that the SIDS incidence at county A_{i} is unusually low beyond the effect of race, if the observed incidence smaller than expected raceadjusted incidence.
The stepbystep guidelines for the use of the proposed SMR models are:

1.
Compute covariateadjusted expected incidence for each area under study.

2.
List the areas with the highest (lowest) rates of SMR in rank order, up to an upper limit, say, 25 percent of the total areas under study.

3.
Like the ordinary mapbased pattern recognition procedure, start with classifying the 2 top ranking areas as highrisk (lowrisk) areas and calculate the value of B_{1}.

4.
Proceed successively, including exactly one area with the high (low) rate of SMR next according to the rank order and the other areas with higher (lower) rates as highrisk (lowrisk) areas at each step with the use of B_{1}.

5.
Determine the areas with the lowest p values of B_{1} relative to surrounding p values in the ranking.

6.
The hierarchical characterization is constructed, based on the use of the inclusion points of these areas determined in the previous step.
The stepbystep guidelines for the use of the proposed Poisson models are:

1.
Compute covariateadjusted expected incidence for each area under study.

2.
Calculate the probability of departure from expected covariateadjusted incidence for each area based on the assumption of Poisson distribution.

3.
List the areas with the smallest probabilities in rank order and the observed incidence larger (smaller) than expected incidence, up to an upper limit, say, 25 percent of the total areas under study.

4.
Like the ordinary mapbased pattern recognition procedure, start with classifying the 2 top ranking areas as highrisk (lowrisk) areas and calculate the value of B_{1}.

5.
Proceed successively, including exactly one area with the small probability of Poisson distribution next according to the rank order and the other areas with smaller probabilities and the observed incidence larger (smaller) than expected incidence as highrisk (lowrisk) areas at each step with the use of B_{1}.

6.
Determine the areas with the lowest p values of B_{1} relative to surrounding p values in the ranking.

7.
The hierarchical characterization is constructed, based on the use of the inclusion points of these areas determined in the previous step.
Results
In this section, we first present the analysis of highrisk areas and lowrisk areas close within geographic proximity without adjustment of race, respectively. Secondly, we present the analysis with adjustment of race, using the proposed SMR model and Poisson model. Thirdly, in addition to an adjacencybased definition of geographical proximity, we consider and present the analysis based on the use of a distancebased neighborhood system.
Analysis without adjustment for race
We first repeated the analysis, which was previously performed [7]. We present the cluster statistics for 25 counties with the higher rates in Table 2. Our results were slightly different from those presented in Table 3 of their article: (1) the test statistic B_{1} in Table 2 was smaller than the corresponding one in Table 3 by 1 between Warren (12th in rank) and Lenoir (20th in rank) because the recognition of sharing common boundaries among North Carolina counties was different in certain counties by the authors; (2) the p values of B_{1} in Table 2 were obtained by the simulationbased permutations using 1 million replicates shown in Table 1, rather than by the 2000 replicates of the Monte Carlo method in their article.
The largest downward peaks in p values relative to surrounding p values occurred at the inclusion points of Bertie (8th in rank), Robeson (14th in rank), Pender (18th in rank), and Wayne (24th in rank) counties, shown in the p value^{1} column of Table 2. Robeson and Pender are close in location and on the rank scale. So the construction of the hierarchical characterization used the downward peaks at inclusion points of Bertie, Pender, and Wayne counties. Correspondingly, we determined the 3 groups of counties to use in constructing hierarchical clusters of mutually neighboring highrisk counties with 3 different levels of intensity. LevelH1 counties are the 8 top ranking counties; LevelH2, 10 counties ranking from 9 to 18; LevelH3, 6 counties ranking from 19 to 24. The overall incidence of the 8 LevelH1, 10 LevelH2, and 6 LevelH3 counties combined are 5.57, 3.95, and 2.79 per 1000 live births, respectively.
When the levelspecific intensity is placed on the map, 3 hierarchical intensity clusters of high SIDS emerge and are respectively located in the northeast (6 counties: 5 LevelH1 and 1 LevelH2 counties) with incidence of 4.98, the south (6 counties: 1 LevelH1 and 5 LevelH2 counties) with incidence of 4.06, and the mideast (6 counties: 1 LevelH1 and 5 LevelH3 counties) with incidence of 3.09 per 1000 live births, as shown in Fig. 1b. This hierarchical characterization is identical to the one in the previous study [7].
Previously, k indicates the number of top ranking counties of North Carolina with the higher intensity rates, classified as highrisk counties, given in Table 2. While focusing on geographical disease clusters of incidence paucity, k indicates the number of top ranking counties with the lower intensity rates, classified as lowrisk counties. The 25 counties with the lowest rates were listed by rank according to their rates in Table 3. There were 13 counties with 0 SIDS but different numbers of live births. Because the larger the population base for the rate, the more stable and reliable the rate will be, we discriminated between these 13 counties according to the numbers of live births: the ones with larger numbers of live births were considered higher in ranking. The original test statistic B_{1} = (1/2) Σ ω^{1}_{ij} x_{i} x_{j} was revised to be where x_{i} = 1 if area i is a lowrisk county for disease and 0 otherwise, and where ω^{1}_{ij} = 1 if counties i and j are mutually adjacent geographically and 0 otherwise, ω^{1}_{ij} = ω^{1}_{ji}, ω^{1}_{ii} = 0. It is noted that, with each k, the test statistic B_{1} in Table 3 was smaller than the corresponding one in Table 2, indicating that the highrisk counties show higher degrees of clustering geographically than the lowrisk counties.
Similarly, the method starts with classifying the 2 top ranking counties as lowrisk counties and calculates the value of B_{1}. We proceed successively, including exactly one county with the low rate next according to the rank order and the other counties with lower rates as lowrisk counties at each step with the use of B_{1}. We observed the large downward peaks in p values relative to surrounding p values at including Tyrrell (88^{th} in rank), Ashe (79th in rank), and Iredell (77th in rank) counties, shown in the p value^{1} column of Table 3. Ashe and Iredell are close on the rank scale. So the hierarchical characterization was determined using the downward peaks at inclusion points of Tyrrell and Iredell counties.
The use of the lowest p values at various counties in the rankings means to capture the groups of counties with highly tight clustering. The hierarchical (in intensity) spatial clusters of incidence paucity were optimally determined by using the inclusion points of Tyrrell and Iredell counties. We, therefore, determined the 2 groups of counties to use in constructing hierarchical clusters of mutually neighboring lowrisk counties with 2 different levels of intensity. LevelL1 counties are the 13 top ranking counties with 0 SIDS; LevelL2, 11 counties ranking from 87 to 77. The overall incidence of the 13 LevelL1 and 11 LevelL2 counties combined are 0 and 0.81 per 1000 live births, respectively. The 3 hierarchical lowintensity clusters appear respectively in the northwestern region (6 counties: 4 LevelL1 and 2 LevelL2) with incidence of 0.28, the midwestern region (9 counties: 1 LevelL1 and 8 LevelL2) with incidence of 0.70, and the eastern coastal region (3 counties: 3 LevelL1) with incidence of 0 per 1000 live births. Figure 1b presents the countyspecific SIDS incidence intensitylevel map.
Standardized morbidity ratio with adjustment for race
Table 4 presents the rank order and cluster statistics for 25 counties with the highest rates of SMR. The test statistic B_{1} in Table 4 was smaller than the B_{1} in Table 2 for each on rank order, suggesting that the counties with the higher SMR appear relatively disperse, compared with the counties with the higher raw incidence rates. Figure 2a presents the countyspecific SMRadjusted intensity map.
Previously detected geographical highintensity clusters in the northeast and south appeared less intensive in terms of rank order and smaller in size, after adjusting for race. It indicates that the high percentage of nonwhite births alone is not sufficient to explain the excess of SIDS risk thoroughly in the northeastern and southern regions. In addition, the previous cluster in the mideast disappeared after adjusting for race. On the other side, a small area, which was hidden previously, emerged in the southwest, comprising Rutherford (2nd in rank), McDowell (10th in rank), Transylvania (12th in rank), and Henderson (21st in rank) counties. The raw incidence and SMR of these 4 counties combined are 2.88 per 1000 live births and 1.977, respectively. It indicates that an excess of SIDS risk is observed in this region, after adjusting for race. The p values of B_{1} for the 25 top ranking counties do not appear cyclic over the rates, and no hierarchy in high SMRadjusted intensity would be recognized.
In the investigation of the geographical disease clustering pattern of low SMRadjusted intensity, we present the rank order and cluster statistics for 25 counties with the lowest SMR in Table 5. The discrimination between the 13 counties with 0 SIDS was based on the raceadjusted expected numbers of SIDS. The larger the covariateadjusted expected incidence, the higher in rank order the county will be. The large downward peaks in p values relative to surrounding p values occurred at including Clay (88th in rank) and Davie (79th in rank) counties, shown in the p value^{1} column of Table 5. We, thus, determined the 2 groups of counties to use in constructing hierarchical clusters of mutually neighboring raceadjusted lowrisk counties beyond the effect of race: LevelLS1 counties for the 13 top ranking counties with 0 SIDS; LevelLS2 for the 9 counties ranking from 87 to 79, as shown in Fig. 3b. The geographical disease clustering pattern of low SMRadjusted intensity, shown in Fig. 2a, b, does not appear very different from the geographical clustering pattern of low incidence intensity without adjustment, shown in Fig. 1a, b. One major reason was that the 13 counties with 0 SIDS were among the 13 top ranking counties in both settings. The use of SMR does not seem to effectively discriminate between the 13 counties with 0 SIDS.
Poissonmodel with adjustment for race
We list the counties in rank order based on D_{i} and present the cluster statistics for 25 top ranking counties with the smallest D_{i} and N_{i} ≥ μ(A_{i}) in Table 6. The 10 top ranking counties in Table 6 had significantly higher incidence than the expected raceadjusted incidence each at a nominal significance level of 0.05 under the assumption of Poisson distribution, suggesting that the departure from expected raceadjusted incidence for each of these 10 counties is too large to be attributed to chance alone. All but 2 top ranking counties in Table 6 also appear on the list of the 25 top ranking counties with the highest SMR in Table 4, except Robeson (18th in rank) and Wayne (24th in rank) in Table 6 vs. Pender (23rd in rank) and Stanly (25th in rank) in Table 4. So the geographical raceadjusted highintensity clustering patterns, characterized by the Poisson model and SMR, are similar, shown in Figs. 2a and 3a. The smaller the value of D_{i} with N_{i} ≥ μ(A_{i}), the darker in red the county is, shown in Fig. 3a. The conclusion based on the Poisson model was similar to the one based on the higher SMR rates. The p values of B_{1} for these 25 top ranking counties do not appear cyclic. No hierarchy in high raceadjusted intensity would be recognized beyond the effect of race by the Poisson model.
While the geographical raceadjusted lowintensity clustering pattern based on the Poisson model appears very different from the one based on the low SMR rates. The smaller the value of D_{i} with N_{i} ≤ μ(A_{i}), the darker in blue the county is, shown in Fig. 3a. There were only 3 counties (Alexander, Gates, and Macon) with 0 SIDS among the 25 top ranking counties with the smallest D_{i} and N_{i} ≤ μ(A_{i}) in Table 7. Although only the 4 top ranking counties had significantly lower incidence than the raceadjusted expected incidence each at a nominal significance level of 0.05, it remained meaningful and useful to search for and determine regions of several (or many) mutually adjacent counties with the low raceadjusted rates geographically, beyond the effect of race. Regions of similar low incidence may reveal information on the presence of protective factors or the absence of risk factors associated with SIDS.
The large downward peaks in p values of B_{1} relative to surrounding p values occurred at including Alexander (92nd in rank), Stokes (80th in rank), and Johnson (77th in rank) counties, shown in the p value^{1} column of Table 7. We determined the 3 groups of counties to use in constructing hierarchical clusters of mutually neighboring raceadjusted lowrisk counties with 3 different levels of intensity, after adjusting for the effect of race. LevelLP1 counties are the 9 top ranking counties (92nd–100th in rank); LevelLP2, 12 counties ranking from 80 to 91; LevelLP3, 3 counties ranking from 77 to 79. The raw incidence of the 9 LevelLP1, 12 LevelLP2, and 3 LevelLP3 counties are 0.85, 1.24, and1.65 per 1000 live births, respectively. The 3 hierarchical raceadjusted lowintensity clusters appear in the north (3 counties: 2 LevelLP1 and 1 LevelLP2) with raw incidence of 0.96, the midwest (6 LevelLP1 counties) with raw incidence of 0.71, and the mideast (10 counties: 1 LevelLP1, 6 LevelLP2, and 3 LevelLP3) with raw incidence of 1.40 per 1000 live births. Figure 3b presents this countyspecific intensitylevel map, by the Poisson model.
Our analysis showed that the difference in racial distribution of live births across North Carolina explained, to a large extent, the 3 previously identified hierarchical highintensity clusters in the northeast, south, and mideast, shown in Fig. 1a, b. None p value of B_{1} for all k was statistically significant at a nominal significance level of 0.05 in the testing for the presence of clustering of the counties with the higher raceadjusted incidence intensity, characterized by both the models. It was because these 25 top ranking counties with the highest adjusted intensity rates in the settings of either models were allocated sequentially and alternatively across the 3 relatively high adjustedintensity regions in the northeast, south, and southwest, shown in Figs. 2a and 3a. It indicates that the counties with the higher raceadjusted rates did not show high degree of clustering geographically. No hierarchy in high raceadjusted intensity would be recognized.
Intriguingly, a small region of 4 mutually adjacent counties with the higher raceadjusted rates, which was hidden previously, emerged in the southwest. The combined raw incidence and SMR of this region are 2.88 per 1000 live births and 1.977, respectively. The combined SMR of 1.977 in this small region was higher than the 95 of the 100 counties, shown in Table 4. Unobserved spatially related risk factors may cause the elevated risk in this region.
In contrast, the Poisson model appeared more appropriate than the model based on SMR in the study of geographical raceadjusted lowintensity clustering patterns, particularly at facilitating discrimination between the 13 counties with 0 SIDS but different numbers of live births. In the comparison of Fig. 1a, b with Fig. 3a, b, uneven distribution of the racespecific live births substantially changed the geographical raceadjusted lowintensity clustering patterns, too. The previously detected geographical lowintensity clusters in the northwest and the eastern coast disappeared, indicating that high percentage of white births alone was sufficient to explain the excess of SIDS risk in these 2 regions. The previously detected geographical lowintensity cluster in the midwest remained, but divided into 2 smaller distinct geographical clusters. A large geographical cluster of 10 mutually adjacent counties with the low raceadjusted rates, which was hidden previously, emerged in the mideast. It suggests that unidentified spatially related protective factors could explain the unusually lowrisk clusters in the midwest and the mideast. A summary of hierarchical cluster analysis based on different models is presented in Table 8.
Analysis with different neighborhood systems
The definition of neighborhood systems may govern the analysis outcomes. Because of the irregularity in shape, contour, size, and location among distinct areas under study, special spatial configurations of certain sampled areas may be intuitively clustering, but not all of which are actually adjacent geographically. Due to the inherent irregular nature of most spatial data, we consider two distinct neighborhood systems for the North Carolina counties in illustrating the spatial statistical models that we propose in this report. In addition to the one based on geographical adjacency that we have used previously, we extended the proposed methods by using a distancebased neighborhood system, in which the neighbor of counties is defined with being within 30 miles between the seats of the 2 counties in this application. The neighborhood information based on this criterion is available in the existing literature [4].
We propose the use of a distancebased definition of neighbors as follows: within 30 miles between the seats of the 2 counties, denoted by ω^{2}, in this application. We carried out the analysis with the test statistic B_{2} = (1/2) Σ ω^{2}_{ij} x_{i} x_{j}, where x_{i} = 1 if county i is a highrisk (or lowrisk) county for some disease and 0 otherwise, and where ω^{2}_{ij} = 1 if counties i and j whose seats are closer less than 30 miles and 0 otherwise, ω^{2}_{ij} = ω^{2}_{ji}, ω^{2}_{ii} = 0. The sum ranges over all pairs of counties. Again, we used simulationbased permutations using 1 million replicates based on ω^{2} and obtained the null distribution of B_{2}. Each of the 24 distributions of B_{2} for k = 2, 3, 4… 25 is given in Additional file 1. Table S1. The values of the test statistic B_{2} and the associated p value, denoted by p value^{2}, are presented on the right panel of Tables 2, 3, 4, 5, 6 and7.
The geographical incidence intensity clustering patterns characterized by B_{2} and p value^{2} were generally in close agreement with those by B_{1} and p value^{1} with and without the adjustment for race. It is not surprising in this particular application as the neighborhood system, defined using the 30mile criterion, corresponds nearly precisely to the one, defined by those counties mutually sharing common geographical boundaries, in North Carolina county system [4, 5].
Comparison with analysis using spatial scan statistic
Both our proposed statistical methods and the spatial scan statistic allow for confounding variables and are used to identify disease clustering or detect disease clusters in a spatial point process in general. They are sensitive to different respects of spatially characteristic incidence clustering patterns and structured to provide different spatial clustering information. The spatial scan statistic determines the most likely disease clusters and secondary disease clusters based on the maximum likelihood ratio, whose statistical significance is evaluated, using Monte Carlo hypothesis testing. The spatial scan statistic tends to detect relatively broad spatial clusters, and the detected most likely disease clusters may not be the regions with the highest rates. In the analysis of SIDS in North Carolina counties, the spatial scan statistic detected the most likely cluster in the south with incidence of 3.8 and the secondary cluster in the northeast with incidence of 4.1 per 1000 live births, with the statewide incidence of 2.0, shown in Table 1 [8].
In contrast, our proposed methods are designed to prioritize the counties with the highest or lowest, adjusted or unadjusted, intensity rates in the testing for the presence of spatial clustering. We let the threshold rate of incidence vary downwards continuously, in which case it includes exactly next one area in the rank order at one time, up to a certain upper limit. As shown in Table 2 and Figs. 1a, b, the highest intensive clustering region was recognized in the northeast with incidence of 4.98, and secondary intensive clustering region, in the south with incidence of 4.06 per 1, 00 live births.
After adjusting for race, the spatial scan statistic detected an emerging broad secondary cluster in the west, which was previously hidden and comprised 18 counties, with the SMR of 1.357, presented in Table 2 [8]. In contrast, our method, either based on the Poisson model or SMR, recognized a small highintensity region, which was previously hidden and emerged in the southwest, comprising only 4 counties: Rutherford, McDowell, Transylvania, and Henderson, shown in Tables 4 and 6 and Figs. 2a and 3a. This small regions in the southwest was a much smaller subset of the broad region identified by the spatial scan statistic and had a higher SMR of 1.977 per 1000 live births.
Discussion
In this paper, we have presented a general framework for differentiating intensity of geographical disease clusters of peak incidence and low incidence in a hierarchical manner with the adjustment for covariates as well as testing for the presence of disease clustering. The first method is structured for recognizing and constructing hierarchical (in intensity) disease clusters of low incidence. The second method generalizes to take into account covariates that are known or hypothesized risk factors of the disease in constructing hierarchical (in adjusted intensity) clusters of highrisk areas and lowrisk areas close within geographic proximity, respectively. We formulated the adjustment for covariates by calculating the expected number of cases in each county through indirect standardization. We proposed two probability models, a Poissondistributionmodel and SMR, to facilitate discrimination between the 100 North Carolina counties based on the deviation between the observed incidence and covariateadjusted expected incidence in each county, through which the hierarchy in adjusted intensity is recognized, beyond the effect of covariates.
The application to the data on North Carolina SIDS, using the proposed methods, shows that the two probability models performed similarly in the geographical raceadjusted intensity clustering analysis of counties with the highest rates. While the analysis was very different in the investigation of the mutually adjacent counties with the lowest adjusted rates. The Poisson model that can account for the unequal variances of the county rates performed better particularly at facilitating discrimination between the 13 counties with zero SIDS but different numbers of live births than the model based on SMR. The hierarchical raceadjusted lowintensity clusters, characterized by the Poisson model, should be more reliable, shown in Tables 6 and 7 and Fig. 3a, b.
With the information on hierarchy in adjusted intensity levels, provided by the application of our proposed methods, epidemiologists can best prioritize the regions with the highest rates within which to conduct thorough etiologic investigations and search for hidden spatially related risk factors. Similar research designs are commonly applied in studies of human genetics, in which a group of affected sibships with extreme traits are used for detecting commonly shared genetic defects of a disease of interest in gene mappings [15]. Meanwhile, public health officials can better prioritize the highrisk regions precisely and promptly move resources to areas with genuine highest abnormalities.
The identification of geographical and temporal disease clusters serves as a preliminary step that expedites subsequent etiological investigation and analysis of epidemicity. Most reports of perceived clusters do not lead to the identification of a common casual exposure for the events of interest [16]. The reasons for this are many. As Rothman and many others pointed out that vast resources spent on the investigation of possible alarms of disease clustering are often in vain. We should not be aiming to detect clustering, but to understand why clusters occur [17]. The four stages in the guidelines for the investigation of disease clusters issued by the US Centers for Disease Control in 1990 are (1) initial contact with and response to the individual who reported the cluster; (2) a preliminary assessment, including evaluations of whether an excess has occurred; (3) a formal feasibility study; and (4) a full etiologic investigation [18]. The ordinary statistical methods for detecting temporal and spatial clustering in disease incidence frequency alone are useful in the second stage. While, the utilities of our proposed statistical methods contribute to the third and fourth stages.
In addition to the focus on peak incidence, we extended the proposed methods to investigate geographical disease clusters of low incidence. We exemplified the utilities of recognizing and constructing the geographical hierarchical (in intensity) disease clusters of low incidence and peak incidence without and with the adjustment for covariates. The studies of incidence paucity and incidence clustering characterize opposite aspects of an observed geographical incidence pattern by using different parts of information from the data. In this report, we show that statistical methods that focus on geographical incidence paucity can be as meaningful and useful in spatial epidemiology and spatial statistics as the methods that focus on peak incidence in space. We articulate the difference in sensitivity, power, and applicability between the studies of incidence paucity and incidence clustering, using a temporal series of data, in our previous articles [1, 2].
Availability of data and materials
Information on data on the spatial occurrence of SIDS in North Carolina counties from July 01, 1974 to June 30, 1978, is available on the paper: Cressie N, Chan NH (1989). Spatial Modeling of Regional Variables. Journal of the American Statistical Association, 84(406):393–401.
References
 1.
Wu CC, Grimson RC, Amos CI, Shete S. Statistical methods for anomalous discrete time series based on minimum cell count. Biom J. 2008;50(1):86–96.
 2.
Wu CC, Grimson RC, Shete S. Exact statistical tests for heterogeneity of frequencies based on extreme values. Commun Stat Simul Comput. 2010;39(3):612–23.
 3.
Atkinson D. Epidemiology of sudden infant death in North Carolina: do cases tend to cluster?. Chapel Hill: University of North Carolina; 1979.
 4.
Cressie N, Chan NH. Spatial modeling of regional variables. J Am Stat Assoc. 1989;84(406):393–401.
 5.
Cressie N, Read TRC. Do sudden infant deaths come in clusters? Stat Decis. 1985;3(Supplement Issue No. 2):333–49.
 6.
Cressie N, Read TRC. Spatial data analysis of regional counts. Biom J. 1989;31:699–719.
 7.
Grimson RC, Wang KC, Johnson PWC. Search for hierarchical clusters of disease: spatial patterns of sudden infant death syndrome. Soc Sci Med. 1981;15(D):287–93.
 8.
Kulldorff M. A spatial scan statistic. Commun Stat Theory Methods. 1997;26(2):1481–96.
 9.
Symons MJ, Grimson RC, Yuan YC. Clustering of rare events. Biometrics. 1983;39(1):193–205.
 10.
Lai WT, Chen CH, Hung H, Chen RB, Shete S, Wu CC. Recognizing spatial and temporal clustering patterns of dengue outbreaks in Taiwan. BMC Infect Dis. 2018;18(1):256.
 11.
Carlin RF, Moon RY. Risk factors, protective factors, and current recommendations to reduce sudden infant death syndrome: a review. JAMA Pediatr. 2017;171(2):175–80.
 12.
Mantel N. The detection of disease clustering and a generalized regression approach. Can Res. 1967;27(2):209–20.
 13.
Cliff AD, Ord JK. Spatial autocorrelation. London: Pion Press; 1973.
 14.
Cliff AD, Ord JK. Spatial Processes: Models & Applications. Milton Park: Taylor & Francis; 1981.
 15.
Risch N, Zhang H. Extreme discordant sib pairs for mapping quantitative trait loci in humans. Science. 1995;268(5217):1584–9.
 16.
Rothman KJ, Greenland S. Modern epidemiology, 2nd edn. Philadelphia: Maple Press; 1998.
 17.
Rothman KJ. A sobering start for the cluster busters' conference. Am J Epidemiol. 1990;132(1 Suppl):S6–13.
 18.
Centers for Disease Control and Prevention (CDC). Guidelines for investigating clusters of health events. Morbidity and mortality weekly report (MMWR), vol. 39. Atlanta: Centers for Disease Control and Prevention (CDC); 1990. p. 1–23.
Acknowledgements
Not applicable.
Funding
The research presented in this manuscript was partially supported by the Taiwan Ministry of Science and Technology grant MOST 1092118M006005 to C.C.W.
Author information
Affiliations
Contributions
CCW designed the study, performed analyses, and drafted the manuscript. SS participated in project conception and discussion of results of the original manuscript. Both authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Additional file 1: Table S1.
Frequency distributions of the number of neighbors (seats of counties within 30 miles) simulated on the basis of 1 million random selections in North Carolina counties.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Wu, CC., Shete, S. Differentiating anomalous disease intensity with confounding variables in space. Int J Health Geogr 19, 37 (2020). https://doi.org/10.1186/s12942020002313
Received:
Accepted:
Published:
Keywords
 Geographical disease cluster
 Hierarchical
 Incidence clustering
 Sudden infant death Syndrome