- Methodology
- Open access
- Published:
Differentiating anomalous disease intensity with confounding variables in space
International Journal of Health Geographics volume 19, Article number: 37 (2020)
Abstract
Background
The investigation of perceived geographical disease clusters serves as a preliminary step that expedites subsequent etiological studies and analysis of epidemicity. With the identification of disease clusters of statistical significance, to determine whether or not the detected disease clusters can be explained by known or suspected risk factors is a logical next step. The models allowing for confounding variables permit the investigators to determine if some risk factors can explain the occurrence of geographical clustering of disease incidence and to investigate other hidden spatially related risk factors if there still exist geographical disease clusters, after adjusting for risk factors.
Methods
We propose to develop statistical methods for differentiating incidence intensity of geographical disease clusters of peak incidence and low incidence in a hierarchical manner, adjusted for confounding variables. The methods prioritize the areas with the highest or lowest incidence anomalies and are designed to recognize hierarchical (in intensity) disease clusters of respectively high-risk areas and low-risk areas within close geographic proximity on a map, with the adjustment for known or suspected risk factors. The data on spatial occurrence of sudden infant death syndrome with a confounding variable of race in North Carolina counties were analyzed, using the proposed methods.
Results
The proposed Poisson model appears better than the one based on SMR, particularly at facilitating discrimination between the 13 counties with no cases. Our study showed that the difference in racial distribution of live births explained, to a large extent, the 3 previously identified hierarchical high-intensity clusters, and a small region of 4 mutually adjacent counties with the higher race-adjusted rates, which was hidden previously, emerged in the southwest, indicating that unobserved spatially related risk factors may cause the elevated risk. We also showed that a large geographical cluster with the low race-adjusted rates, which was hidden previously, emerged in the mid-east.
Conclusion
With the information on hierarchy in adjusted intensity levels, epidemiologists and public health officials can better prioritize the regions with the highest rates for thorough etiologic studies, seeking hidden spatially related risk factors and precisely moving resources to areas with genuine highest abnormalities.
Introduction
An important issue in spatial and temporal statistics is whether a set of discrete points are distributed randomly or they show a variety of signs of clustering. One of its major applications is in epidemiology; in particular, detecting and, more importantly, characterizing spatial and temporal clusters of adverse health events using existing health data collected on a basis of geographic units such as counties. The investigation of perceived incidence clusters or paucity of a certain disease is interesting in mathematics and probability per se. More importantly, we are interested in whether a finding indicating the presence of incidence anomalies (including clustering and paucity of disease incidence) will lead to a greater understanding of the etiology and underlying causal mechanism of disease or the identification of a common causal exposure for disease. With the identification of disease clusters of statistical significance, to determine whether or not the detected disease clusters can be explained by known or suspected risk factors is a logical next step.
The purpose of this paper is to develop and illustrate new statistical methods for differentiating incidence intensity of geographical disease clusters of peak incidence and low incidence, adjusted for covariates that are known or hypothesized risk factors, as well as testing for the presence of clustering. The methods are designed to recognize and construct hierarchical (in intensity) disease clusters of respectively high-risk areas and low-risk areas within close geographic proximity or contiguity on a map, including confounding variables as covariates. The hierarchy in covariate-adjusted intensity permits to occur between and within distinct geographical disease clusters. We propose to adjust for covariates by enumerating expected incidence of disease in each county through indirect standardization, conditioning on the total number of disease observed. The basic analysis is the one with no covariates. By including exactly one covariate at one time in further analyses, we can examine how the incorporated covariate affects the geographical disease incidence pattern. With information on geographical covariate-adjusted incidence clustering patterns on a map, we can determine whether or not the previously detected geographical disease clusters of peak incidence or paucity of incidence can be explained by the covariates incorporated. We are further interested if there still exist geographical disease clusters of incidence anomalies, after adjusting for known or hypothesized risk factors, which could lead to further investigation into other spatially related risk factors that are hidden otherwise.
The existing statistical methods for epidemiologic incidence anomaly patterns generally focus on detecting and characterizing large or peak incidence over a time, space, or space–time series. The statistical methods that we introduce in this paper focus on geographical incidence paucity of as well as peak incidence of adverse health events, including confounding variables as covariates. In epidemiology, testing for disease aggregations is used to identify the association between the possible risk factors and the incidence of disease. In contrast, the detection of an unusually low incidence of disease indicates the presence of protective factors or the absence of risk factors associated with the disease. We and others previously proposed and formulated statistical methods for detecting an unusually low incidence of disease in a unit of time in a discrete time series and evaluated their sensitivity, power, and applicability, using a temporal series of data on adolescent suicide from the US National Center for Health Statistics and on childhood Langerhans cell histiocytosis patients in Taiwan [1, 2]. We articulated that statistical methods that are sensitive to incidence paucity in a unit of time characterize opposite aspects of an observed incidence pattern and can be as meaningful and useful in epidemiology as the methods that focus on incidence clustering in our articles. The same rationales hold true for statistical methods for spatial epidemiology and spatial statistics, as proposed here.
We illustrate proposed statistical methods, using the data on the spatial occurrence of sudden infant death syndrome (SIDS) in North Carolina counties over a 4 year period, 1974–1978. One possible confounding variable for SIDS is race. Information on racial distribution of live births is available. This data set has been analyzed in a variety of statistical and epidemiologic reports [3,4,5,6,7,8,9]. We choose this data set for orientation in order to readily illustrate the difference in utilities and applicability among these spatial statistical methods. Atkinson provides an early review of the SIDS incidence and notes the statistically non-significant clustering of incidence in time within county by single years of calendar time, using the Ederer-Myers-Mantel test [3]. Cressie and his colleagues develop methods to model the spatial trend (large-scale variation) and spatial interdependence and autocorrelation (small-scale variation) in exploratory and confirmatory incidence analyses of this data [4,5,6]. Specifically, Cressie and Chan (1989) perform both unweighted and weighted linear regression and logistic regression of Freeman-Tukey transformed SIDS incidence on 5 explanatory variables, including population density, percentage urban, number of hospital beds, median family income, and Freeman-Tukey transformed non-white live birth rates, in modeling large-scale variation. They use the Markov random field in modeling small-scale variation [4]. Instead of classifying the counties at high risk or medium risk to SIDS, according to the magnitude and ranking of the observed rate of incidence alone, Symons, Grimson, and Yuan propose to determine the classification of the incidence risk level, accounting for the variance of the estimate of the rate in a Poisson process [9]. They employ a mixture of Poisson distributions to model the disease incidence and determine criteria for classification, using maximum likelihood and Bayesian methods. These authors all focus on relevant but different respects of the spatial statistics problems from what we propose here.
Both the map-based pattern recognition procedure [7] and the spatial scan statistic [8] can be used to identify disease clustering or disease clusters in a spatial point process in general. But they are designed for evaluating and characterizing different respects of spatially characteristic incidence clustering patterns and provide different information on spatial clustering. The spatial scan statistic is widely used for spatial cluster detection analysis and has been extended to a variety of models for detecting spatial, temporal, and space–time clusters, retrospectively or prospectively, using ordinal, survival-time, multi-nominal, normal, and longitudinal data etc.. The spatial scan statistic searches for spatial disease clusters not explained by a baseline spatial point process without specifying their size or location a priori. It is able to identify the approximate location and range of the most likely disease clusters and secondary disease clusters and to perform a significance test for each cluster, based on the maximum likelihood ratio and using Monte Carlo hypothesis testing. While the identified most likely disease clusters may not be the areas with the highest incidence by the spatial scan statistic. In contrast, the pattern recognition procedure prioritizes the areas with the highest rates and is designed to determine hierarchical incidence intensity levels of mutually adjacent areas with the highest rates geographically. It is noted that this procedure exclusively focuses on peak incidence and does not allow for covariates in determining the hierarchical intensity levels. We previously used the pattern recognition procedure to investigate the spatial clustering patterns of dengue outbreaks in Taiwan [10].
Our proposed methods generalize the pattern recognition procedure in several respects and have following features to address important problems:
-
1.
We introduce the method for differentiating intensity of geographical disease clusters of low incidence in a hierarchical manner as well as testing for the presence of clustering.
-
2.
We propose the methods for taking into account covariates that are known or hypothesized risk factors of disease in constructing hierarchical (in adjusted intensity) clusters of high-risk areas and low-risk areas close within geographic proximity, respectively.
-
3.
The use of indirectly covariate-adjusted expected incidence permits the incorporation of covariates without the need of information on covariate-specific numbers of incidence in each area under study.
-
4.
Two distinct probability models are proposed to assess the deviation between the observed incidence and covariate-adjusted expected incidence in each area.
-
5.
Two distinct neighborhood systems, adjacency-based and distance-based in the definition of close geographical proximity, are used for proposed models.
Instead of dividing the counties into high- and medium-risk categories on the basis of the incidence rates used in the practice and in the existing literature, we propose to divide the counties into high-, medium-, and low-risk categories, then proceed to further differentiate incidence level of counties close within geographic proximity in the high- and low-risk categories respectively with and without the adjustment for confounding variables in a hierarchical manner.
The statistical methods that we propose in this report are not limited to advancing and generating studies of etiology of disease of interest with unknown causes. With the information on hierarchy in adjusted intensity levels, epidemiologists and public health officials can better prioritize the regions with the highest rates for thorough etiologic studies, seeking hidden spatially related risk factors, and precisely moving resources to areas with genuine highest abnormalities.
Methods
In this section, we first introduce the method for differentiating intensity of geographical disease clusters of low incidence as well as testing for the presence of clustering. It is based on the extension of the existing pattern recognition procedure, which focuses on hierarchical clusters of high incidence [7]. Next, we generalize these methods by taking into account covariates that are known or hypothesized risk factors of the disease. We consider a covariate of race for SIDS in this application. Thirdly, we consider two distinct neighborhood systems for the North Carolina counties in the application of the proposed spatial statistical models, which are adjacency-based and distance-based in the definition of geographical proximity. We illustrate proposed statistical methods, using the data on the spatial occurrence of SIDS in North Carolina counties in 1974–1978.
Study population
The data on SIDS patients with a confounding variable of race in 100 North Carolina counties provide an opportunity to illustrate the applications of the methods that we propose for geographical disease anomalies. SIDS is the third leading cause of all infant mortality in the US and remains the leading cause of death in infants aged from 1Â month to 1Â year. Its exact cause remains unknown. The frequency of SIDS appears to be influenced by social, economic, and cultural factors, such as maternal education, race or ethnicity, and poverty. Racial disparity in infants who died of SIDS has persisted. The rate of SIDS in non-Hispanic African American infants and American Indian/Alaskan Native infants remains more than twice that of non-Hispanic white infants in 2016 [11].
The data on the spatial occurrence of SIDS in North Carolina counties over a 4Â year period, from July 01, 1974 to June 30, 1978, is used for illustration of our proposed methods. The information contained in this data set include the number of SIDS and the number of live births during this period for each of the 100 counties of North Carolina, as well as the county-seat locations. The number of live births was stratified into whites and non-whites for each of the 100 counties. The total number of live births was 329,962, in which the numbers of white and non-white births were 224,881 and 105,081, respectively. The total number of SIDS was 667, in which the numbers of white and non-white SIDS were 268 and 399, respectively. The state-wide incidence rate was 2.021 in deaths per 1000 live births. The overall incidence rates for the entire state by race were 1.192 for white children and 3.797 for non-white children per 1000 live births. Non-white SIDS rate was more than 3 times higher than that of whites, with the result that although non-whites accounted for only 31.85% of the live births in the state during the study period, they accounted for 59.82% of all the SIDS cases reported. Details of the data sources and data collection methods have been described elsewhere [4].
Two distinct neighborhood systems for the North Carolina counties systems, distance-based and adjacency-based in the definition of close geographical proximity, are used for proposed models. They were determined by the criteria of being within 30 miles between the seats of 2 counties [4] and of sharing common geographical boundaries between 2 counties [5], respectively. The map on the 100 counties of North Carolina with county names is presented in Fig. 1a.
Existing map-based pattern recognition procedure
The method developed by Mantel [12] was generalized by Cliff and Ord, who proposed the test statistic B1 = (1/2) Σ ω1ij xi xj where xi = 1 if area i is a high-risk area for some disease and 0 otherwise, and where ω1ij = 1 if areas i and j are mutually adjacent geographically and 0 otherwise, ω1ij = ω1ji, ω1ii = 0 [13]. The sum ranges over all pairs of areas. It is an adjacency-based test statistic that measures spatial autocorrelation for binary data and uses the distribution of the number of adjacencies of geographic units. When high-risk areas tend to be geographically adjacent to each other, the value of B1 tends to be large. Using the test statistic B1, one can test the null hypothesis of the random allocation of high-risk areas over the geographical region; that is, high-risk areas do not cluster. Cliff and Ord derived the expressions for the mean and variance of B1 under the assumptions of binomial and hypergeometric distributions [14]. We note that lower p values of B1 indicate high degrees of clustering or tight clustering, which conform to the adjacency-based definition of a cluster rather than being interpreted in the usual sense in this context.
Instead of selecting a specific threshold rate of incidence, the procedure proposes to first list the areas under study in rank order based on the disease intensity rates. It starts with classifying the 2 top ranking areas as high-risk areas and calculates the value of B1. The procedure proceeds, letting the threshold rate of incidence vary downwards continuously, in which case it includes exactly one area with high rate at one time, up to an upper limit, such that at most we include, say, 25% of the total areas under study. Thus, the procedure subsequently includes the area with the 3rd highest rate and the other 2 areas with higher rates as high-risk areas and calculates the corresponding value of B1. The p value is the probability that B1 is equal to or higher than the observed number of adjacencies involved between these 3 areas with the highest disease intensity rates. Therefore, the procedure provides the p value of B1 when the k top ranking areas among all areas under study are classified as high-risk areas for each k where k = 2, 3, 4….. The main feature of the procedure is to determine the hierarchical incidence intensity pattern through the distribution of p values of B1 for k = 2, 3, 4…., which is illustrated in an application to the SIDS data in North Carolina in 1974–1978 [7]. Figure 1a presents the county-specific SIDS incidence intensity map in this period.
Instead of relying on the assumptions associated with the asymptotically normal distribution [14] or using Monte Carlo method [7], we propose to use simulation-based permutations using 1 million replicates based on the exact county boundary map, defined by ω1, to obtain the null distribution of B1. The basic geographic unit used in this report is a county. The distribution was simulated by randomly selecting exactly k counties among the 100 counties of North Carolina 1 million times and counting the number of the adjacent pairs appearing among the k counties for each of the 1 million replicates. The information on sharing common boundaries among North Carolina counties that we use here is available in the literature [5]. This process was applied for k = 2, 3, 4… 25. Each of the 24 distributions of B1 for k = 2, 3, 4… 25 is given in Table 1. We also used simulation-based permutations using 1 million replicates based on the exact district boundary map under study to obtain the null distribution of B1 in a study of dengue fever in Taiwan previously [10].
Hierarchical clusters of neighboring low-risk areas
In the existing map-based pattern recognition procedure, k indicates the number of top ranking counties of North Carolina with the higher intensity rates, classified as high-risk counties. While focusing on geographical disease clusters of incidence paucity, we propose to use k to indicate the number of top ranking counties with the lower intensity rates, classified as low-risk counties. Using the pattern recognition procedure, we can determine the hierarchical (in intensity) spatial clusters of incidence paucity correspondingly.
Methods with adjustment of covariates
We propose to adjust for the effects of confounding variables by enumerating expected incidence (or number) of disease in each county through indirect standardization, conditioning on the total number of disease observed. We define a measure to be the expected number of patients with disease of interest in a spatial unit of the study region, a county of North Carolina in this application, with μ(Ai) = expected number of patients in county Ai. Assuming that the incidence for sub-population j equals Ej in the entire study region and the size of sub-population j in county Ai is aij, the expected incidence in county Ai is equal to the summation of the products of Ej and aij, across all j, μ(Ai) = \({\sum }_{j}{a}_{ij}{E}_{j}\). This indirectly covariate-adjusted expected incidence for a county permits us to incorporate covariates with no need of information on covariate-specific numbers of incidence in every county. With the use of indirect standardization, the generalization to multiple covariates, adjusted for multiple confounding variables, is immediate.
The overall incidence rates for the state of North Carolina by race were 1.192 for white children and 3.797 for non-white children per 1000 live births in 1974–1978. With the information on the white and non-white numbers of live births for each of the 100 North Carolina counties [4], an expression of the race-adjusted expected incidence in county Ai is μ(Ai) = 1.192 × white births + 3.797 × non-white births, which is proportional to the expected number of SIDS.
We propose two models for assessing the discrepancy between the observed incidence and the county-specific expected covariate-adjusted incidence: the standardized morbidity ratio (SMR) and a Poisson model. The SMR, the ratio of the observed number of incidence to the expected number of incidence, provides a direct quantitative measure of the overall discrepancies between the observed incidence and expected covariate-adjusted incidence per county.
Alternatively, we propose a Poisson probability model for assessing the probability of the deviation of a particular frequency to be attributed to sampling fluctuations. We propose to adjust for covariates under the assumption of Poisson random variables, which provide a crude way to account for the unequal variances of the county rates. The rank order is based on the probability associated with the assessment of the discrepancy between the observed incidence and county-specific expected covariate-adjusted incidence in a county. The Poisson distributional assumptions were previously employed by several investigators in studies of SIDS data [5, 6, 8, 9]. The existing literature generally assume a batch of independently and identically distributed Poisson random variables for distinct counties with the null hypothesis of homogeneity of state-wide incidence [5, 6, 9]. In this report, we assume county-specific Poisson probability models whose mean depends on the expected covariate-adjusted incidence for each county.
Using the same expression of the race-adjusted expected incidence μ(Ai) for county Ai, we calculate the probability of departure from expected race-adjusted incidence for each county, based on the assumption of Poisson distribution. It is defined as follows, Ni = the number of incidence in county Ai,
A small value of this probability Di indicates that the SIDS incidence at county Ai is unusually high beyond the effect of race, if the observed incidence large than expected race-adjusted incidence, or that the SIDS incidence at county Ai is unusually low beyond the effect of race, if the observed incidence smaller than expected race-adjusted incidence.
The step-by-step guidelines for the use of the proposed SMR models are:
-
1.
Compute covariate-adjusted expected incidence for each area under study.
-
2.
List the areas with the highest (lowest) rates of SMR in rank order, up to an upper limit, say, 25 percent of the total areas under study.
-
3.
Like the ordinary map-based pattern recognition procedure, start with classifying the 2 top ranking areas as high-risk (low-risk) areas and calculate the value of B1.
-
4.
Proceed successively, including exactly one area with the high (low) rate of SMR next according to the rank order and the other areas with higher (lower) rates as high-risk (low-risk) areas at each step with the use of B1.
-
5.
Determine the areas with the lowest p values of B1 relative to surrounding p values in the ranking.
-
6.
The hierarchical characterization is constructed, based on the use of the inclusion points of these areas determined in the previous step.
The step-by-step guidelines for the use of the proposed Poisson models are:
-
1.
Compute covariate-adjusted expected incidence for each area under study.
-
2.
Calculate the probability of departure from expected covariate-adjusted incidence for each area based on the assumption of Poisson distribution.
-
3.
List the areas with the smallest probabilities in rank order and the observed incidence larger (smaller) than expected incidence, up to an upper limit, say, 25 percent of the total areas under study.
-
4.
Like the ordinary map-based pattern recognition procedure, start with classifying the 2 top ranking areas as high-risk (low-risk) areas and calculate the value of B1.
-
5.
Proceed successively, including exactly one area with the small probability of Poisson distribution next according to the rank order and the other areas with smaller probabilities and the observed incidence larger (smaller) than expected incidence as high-risk (low-risk) areas at each step with the use of B1.
-
6.
Determine the areas with the lowest p values of B1 relative to surrounding p values in the ranking.
-
7.
The hierarchical characterization is constructed, based on the use of the inclusion points of these areas determined in the previous step.
Results
In this section, we first present the analysis of high-risk areas and low-risk areas close within geographic proximity without adjustment of race, respectively. Secondly, we present the analysis with adjustment of race, using the proposed SMR model and Poisson model. Thirdly, in addition to an adjacency-based definition of geographical proximity, we consider and present the analysis based on the use of a distance-based neighborhood system.
Analysis without adjustment for race
We first repeated the analysis, which was previously performed [7]. We present the cluster statistics for 25 counties with the higher rates in Table 2. Our results were slightly different from those presented in Table 3 of their article: (1) the test statistic B1 in Table 2 was smaller than the corresponding one in Table 3 by 1 between Warren (12th in rank) and Lenoir (20th in rank) because the recognition of sharing common boundaries among North Carolina counties was different in certain counties by the authors; (2) the p values of B1 in Table 2 were obtained by the simulation-based permutations using 1 million replicates shown in Table 1, rather than by the 2000 replicates of the Monte Carlo method in their article.
The largest downward peaks in p values relative to surrounding p values occurred at the inclusion points of Bertie (8th in rank), Robeson (14th in rank), Pender (18th in rank), and Wayne (24th in rank) counties, shown in the p value1 column of Table 2. Robeson and Pender are close in location and on the rank scale. So the construction of the hierarchical characterization used the downward peaks at inclusion points of Bertie, Pender, and Wayne counties. Correspondingly, we determined the 3 groups of counties to use in constructing hierarchical clusters of mutually neighboring high-risk counties with 3 different levels of intensity. Level-H1 counties are the 8 top ranking counties; Level-H2, 10 counties ranking from 9 to 18; Level-H3, 6 counties ranking from 19 to 24. The overall incidence of the 8 Level-H1, 10 Level-H2, and 6 Level-H3 counties combined are 5.57, 3.95, and 2.79 per 1000 live births, respectively.
When the level-specific intensity is placed on the map, 3 hierarchical intensity clusters of high SIDS emerge and are respectively located in the northeast (6 counties: 5 Level-H1 and 1 Level-H2 counties) with incidence of 4.98, the south (6 counties: 1 Level-H1 and 5 Level-H2 counties) with incidence of 4.06, and the mid-east (6 counties: 1 Level-H1 and 5 Level-H3 counties) with incidence of 3.09 per 1000 live births, as shown in Fig. 1b. This hierarchical characterization is identical to the one in the previous study [7].
Previously, k indicates the number of top ranking counties of North Carolina with the higher intensity rates, classified as high-risk counties, given in Table 2. While focusing on geographical disease clusters of incidence paucity, k indicates the number of top ranking counties with the lower intensity rates, classified as low-risk counties. The 25 counties with the lowest rates were listed by rank according to their rates in Table 3. There were 13 counties with 0 SIDS but different numbers of live births. Because the larger the population base for the rate, the more stable and reliable the rate will be, we discriminated between these 13 counties according to the numbers of live births: the ones with larger numbers of live births were considered higher in ranking. The original test statistic B1 = (1/2) Σ ω1ij xi xj was revised to be where xi = 1 if area i is a low-risk county for disease and 0 otherwise, and where ω1ij = 1 if counties i and j are mutually adjacent geographically and 0 otherwise, ω1ij = ω1ji, ω1ii = 0. It is noted that, with each k, the test statistic B1 in Table 3 was smaller than the corresponding one in Table 2, indicating that the high-risk counties show higher degrees of clustering geographically than the low-risk counties.
Similarly, the method starts with classifying the 2 top ranking counties as low-risk counties and calculates the value of B1. We proceed successively, including exactly one county with the low rate next according to the rank order and the other counties with lower rates as low-risk counties at each step with the use of B1. We observed the large downward peaks in p values relative to surrounding p values at including Tyrrell (88th in rank), Ashe (79th in rank), and Iredell (77th in rank) counties, shown in the p value1 column of Table 3. Ashe and Iredell are close on the rank scale. So the hierarchical characterization was determined using the downward peaks at inclusion points of Tyrrell and Iredell counties.
The use of the lowest p values at various counties in the rankings means to capture the groups of counties with highly tight clustering. The hierarchical (in intensity) spatial clusters of incidence paucity were optimally determined by using the inclusion points of Tyrrell and Iredell counties. We, therefore, determined the 2 groups of counties to use in constructing hierarchical clusters of mutually neighboring low-risk counties with 2 different levels of intensity. Level-L1 counties are the 13 top ranking counties with 0 SIDS; Level-L2, 11 counties ranking from 87 to 77. The overall incidence of the 13 Level-L1 and 11 Level-L2 counties combined are 0 and 0.81 per 1000 live births, respectively. The 3 hierarchical low-intensity clusters appear respectively in the northwestern region (6 counties: 4 Level-L1 and 2 Level-L2) with incidence of 0.28, the mid-western region (9 counties: 1 Level-L1 and 8 Level-L2) with incidence of 0.70, and the eastern coastal region (3 counties: 3 Level-L1) with incidence of 0 per 1000 live births. Figure 1b presents the county-specific SIDS incidence intensity-level map.
Standardized morbidity ratio with adjustment for race
Table 4 presents the rank order and cluster statistics for 25 counties with the highest rates of SMR. The test statistic B1 in Table 4 was smaller than the B1 in Table 2 for each on rank order, suggesting that the counties with the higher SMR appear relatively disperse, compared with the counties with the higher raw incidence rates. Figure 2a presents the county-specific SMR-adjusted intensity map.
Previously detected geographical high-intensity clusters in the northeast and south appeared less intensive in terms of rank order and smaller in size, after adjusting for race. It indicates that the high percentage of non-white births alone is not sufficient to explain the excess of SIDS risk thoroughly in the northeastern and southern regions. In addition, the previous cluster in the mid-east disappeared after adjusting for race. On the other side, a small area, which was hidden previously, emerged in the southwest, comprising Rutherford (2nd in rank), McDowell (10th in rank), Transylvania (12th in rank), and Henderson (21st in rank) counties. The raw incidence and SMR of these 4 counties combined are 2.88 per 1000 live births and 1.977, respectively. It indicates that an excess of SIDS risk is observed in this region, after adjusting for race. The p values of B1 for the 25 top ranking counties do not appear cyclic over the rates, and no hierarchy in high SMR-adjusted intensity would be recognized.
In the investigation of the geographical disease clustering pattern of low SMR-adjusted intensity, we present the rank order and cluster statistics for 25 counties with the lowest SMR in Table 5. The discrimination between the 13 counties with 0 SIDS was based on the race-adjusted expected numbers of SIDS. The larger the covariate-adjusted expected incidence, the higher in rank order the county will be. The large downward peaks in p values relative to surrounding p values occurred at including Clay (88th in rank) and Davie (79th in rank) counties, shown in the p value1 column of Table 5. We, thus, determined the 2 groups of counties to use in constructing hierarchical clusters of mutually neighboring race-adjusted low-risk counties beyond the effect of race: Level-LS1 counties for the 13 top ranking counties with 0 SIDS; Level-LS2 for the 9 counties ranking from 87 to 79, as shown in Fig. 3b. The geographical disease clustering pattern of low SMR-adjusted intensity, shown in Fig. 2a, b, does not appear very different from the geographical clustering pattern of low incidence intensity without adjustment, shown in Fig. 1a, b. One major reason was that the 13 counties with 0 SIDS were among the 13 top ranking counties in both settings. The use of SMR does not seem to effectively discriminate between the 13 counties with 0 SIDS.
Poisson-model with adjustment for race
We list the counties in rank order based on Di and present the cluster statistics for 25 top ranking counties with the smallest Di and Ni ≥ μ(Ai) in Table 6. The 10 top ranking counties in Table 6 had significantly higher incidence than the expected race-adjusted incidence each at a nominal significance level of 0.05 under the assumption of Poisson distribution, suggesting that the departure from expected race-adjusted incidence for each of these 10 counties is too large to be attributed to chance alone. All but 2 top ranking counties in Table 6 also appear on the list of the 25 top ranking counties with the highest SMR in Table 4, except Robeson (18th in rank) and Wayne (24th in rank) in Table 6 vs. Pender (23rd in rank) and Stanly (25th in rank) in Table 4. So the geographical race-adjusted high-intensity clustering patterns, characterized by the Poisson model and SMR, are similar, shown in Figs. 2a and 3a. The smaller the value of Di with Ni ≥ μ(Ai), the darker in red the county is, shown in Fig. 3a. The conclusion based on the Poisson model was similar to the one based on the higher SMR rates. The p values of B1 for these 25 top ranking counties do not appear cyclic. No hierarchy in high race-adjusted intensity would be recognized beyond the effect of race by the Poisson model.
While the geographical race-adjusted low-intensity clustering pattern based on the Poisson model appears very different from the one based on the low SMR rates. The smaller the value of Di with Ni ≤ μ(Ai), the darker in blue the county is, shown in Fig. 3a. There were only 3 counties (Alexander, Gates, and Macon) with 0 SIDS among the 25 top ranking counties with the smallest Di and Ni ≤ μ(Ai) in Table 7. Although only the 4 top ranking counties had significantly lower incidence than the race-adjusted expected incidence each at a nominal significance level of 0.05, it remained meaningful and useful to search for and determine regions of several (or many) mutually adjacent counties with the low race-adjusted rates geographically, beyond the effect of race. Regions of similar low incidence may reveal information on the presence of protective factors or the absence of risk factors associated with SIDS.
The large downward peaks in p values of B1 relative to surrounding p values occurred at including Alexander (92nd in rank), Stokes (80th in rank), and Johnson (77th in rank) counties, shown in the p value1 column of Table 7. We determined the 3 groups of counties to use in constructing hierarchical clusters of mutually neighboring race-adjusted low-risk counties with 3 different levels of intensity, after adjusting for the effect of race. Level-LP1 counties are the 9 top ranking counties (92nd–100th in rank); Level-LP2, 12 counties ranking from 80 to 91; Level-LP3, 3 counties ranking from 77 to 79. The raw incidence of the 9 Level-LP1, 12 Level-LP2, and 3 Level-LP3 counties are 0.85, 1.24, and1.65 per 1000 live births, respectively. The 3 hierarchical race-adjusted low-intensity clusters appear in the north (3 counties: 2 Level-LP1 and 1 Level-LP2) with raw incidence of 0.96, the mid-west (6 Level-LP1 counties) with raw incidence of 0.71, and the mid-east (10 counties: 1 Level-LP1, 6 Level-LP2, and 3 Level-LP3) with raw incidence of 1.40 per 1000 live births. Figure 3b presents this county-specific intensity-level map, by the Poisson model.
Our analysis showed that the difference in racial distribution of live births across North Carolina explained, to a large extent, the 3 previously identified hierarchical high-intensity clusters in the northeast, south, and mid-east, shown in Fig. 1a, b. None p value of B1 for all k was statistically significant at a nominal significance level of 0.05 in the testing for the presence of clustering of the counties with the higher race-adjusted incidence intensity, characterized by both the models. It was because these 25 top ranking counties with the highest adjusted intensity rates in the settings of either models were allocated sequentially and alternatively across the 3 relatively high adjusted-intensity regions in the northeast, south, and southwest, shown in Figs. 2a and 3a. It indicates that the counties with the higher race-adjusted rates did not show high degree of clustering geographically. No hierarchy in high race-adjusted intensity would be recognized.
Intriguingly, a small region of 4 mutually adjacent counties with the higher race-adjusted rates, which was hidden previously, emerged in the southwest. The combined raw incidence and SMR of this region are 2.88 per 1000 live births and 1.977, respectively. The combined SMR of 1.977 in this small region was higher than the 95 of the 100 counties, shown in Table 4. Unobserved spatially related risk factors may cause the elevated risk in this region.
In contrast, the Poisson model appeared more appropriate than the model based on SMR in the study of geographical race-adjusted low-intensity clustering patterns, particularly at facilitating discrimination between the 13 counties with 0 SIDS but different numbers of live births. In the comparison of Fig. 1a, b with Fig. 3a, b, uneven distribution of the race-specific live births substantially changed the geographical race-adjusted low-intensity clustering patterns, too. The previously detected geographical low-intensity clusters in the northwest and the eastern coast disappeared, indicating that high percentage of white births alone was sufficient to explain the excess of SIDS risk in these 2 regions. The previously detected geographical low-intensity cluster in the mid-west remained, but divided into 2 smaller distinct geographical clusters. A large geographical cluster of 10 mutually adjacent counties with the low race-adjusted rates, which was hidden previously, emerged in the mid-east. It suggests that unidentified spatially related protective factors could explain the unusually low-risk clusters in the mid-west and the mid-east. A summary of hierarchical cluster analysis based on different models is presented in Table 8.
Analysis with different neighborhood systems
The definition of neighborhood systems may govern the analysis outcomes. Because of the irregularity in shape, contour, size, and location among distinct areas under study, special spatial configurations of certain sampled areas may be intuitively clustering, but not all of which are actually adjacent geographically. Due to the inherent irregular nature of most spatial data, we consider two distinct neighborhood systems for the North Carolina counties in illustrating the spatial statistical models that we propose in this report. In addition to the one based on geographical adjacency that we have used previously, we extended the proposed methods by using a distance-based neighborhood system, in which the neighbor of counties is defined with being within 30 miles between the seats of the 2 counties in this application. The neighborhood information based on this criterion is available in the existing literature [4].
We propose the use of a distance-based definition of neighbors as follows: within 30 miles between the seats of the 2 counties, denoted by ω2, in this application. We carried out the analysis with the test statistic B2 = (1/2) Σ ω2ij xi xj, where xi = 1 if county i is a high-risk (or low-risk) county for some disease and 0 otherwise, and where ω2ij = 1 if counties i and j whose seats are closer less than 30 miles and 0 otherwise, ω2ij = ω2ji, ω2ii = 0. The sum ranges over all pairs of counties. Again, we used simulation-based permutations using 1 million replicates based on ω2 and obtained the null distribution of B2. Each of the 24 distributions of B2 for k = 2, 3, 4… 25 is given in Additional file 1. Table S1. The values of the test statistic B2 and the associated p value, denoted by p value2, are presented on the right panel of Tables 2, 3, 4, 5, 6 and7.
The geographical incidence intensity clustering patterns characterized by B2 and p value2 were generally in close agreement with those by B1 and p value1 with and without the adjustment for race. It is not surprising in this particular application as the neighborhood system, defined using the 30-mile criterion, corresponds nearly precisely to the one, defined by those counties mutually sharing common geographical boundaries, in North Carolina county system [4, 5].
Comparison with analysis using spatial scan statistic
Both our proposed statistical methods and the spatial scan statistic allow for confounding variables and are used to identify disease clustering or detect disease clusters in a spatial point process in general. They are sensitive to different respects of spatially characteristic incidence clustering patterns and structured to provide different spatial clustering information. The spatial scan statistic determines the most likely disease clusters and secondary disease clusters based on the maximum likelihood ratio, whose statistical significance is evaluated, using Monte Carlo hypothesis testing. The spatial scan statistic tends to detect relatively broad spatial clusters, and the detected most likely disease clusters may not be the regions with the highest rates. In the analysis of SIDS in North Carolina counties, the spatial scan statistic detected the most likely cluster in the south with incidence of 3.8 and the secondary cluster in the northeast with incidence of 4.1 per 1000 live births, with the state-wide incidence of 2.0, shown in Table 1 [8].
In contrast, our proposed methods are designed to prioritize the counties with the highest or lowest, adjusted or unadjusted, intensity rates in the testing for the presence of spatial clustering. We let the threshold rate of incidence vary downwards continuously, in which case it includes exactly next one area in the rank order at one time, up to a certain upper limit. As shown in Table 2 and Figs. 1a, b, the highest intensive clustering region was recognized in the northeast with incidence of 4.98, and secondary intensive clustering region, in the south with incidence of 4.06 per 1, 00 live births.
After adjusting for race, the spatial scan statistic detected an emerging broad secondary cluster in the west, which was previously hidden and comprised 18 counties, with the SMR of 1.357, presented in Table 2 [8]. In contrast, our method, either based on the Poisson model or SMR, recognized a small high-intensity region, which was previously hidden and emerged in the southwest, comprising only 4 counties: Rutherford, McDowell, Transylvania, and Henderson, shown in Tables 4 and 6 and Figs. 2a and 3a. This small regions in the southwest was a much smaller subset of the broad region identified by the spatial scan statistic and had a higher SMR of 1.977 per 1000 live births.
Discussion
In this paper, we have presented a general framework for differentiating intensity of geographical disease clusters of peak incidence and low incidence in a hierarchical manner with the adjustment for covariates as well as testing for the presence of disease clustering. The first method is structured for recognizing and constructing hierarchical (in intensity) disease clusters of low incidence. The second method generalizes to take into account covariates that are known or hypothesized risk factors of the disease in constructing hierarchical (in adjusted intensity) clusters of high-risk areas and low-risk areas close within geographic proximity, respectively. We formulated the adjustment for covariates by calculating the expected number of cases in each county through indirect standardization. We proposed two probability models, a Poisson-distribution-model and SMR, to facilitate discrimination between the 100 North Carolina counties based on the deviation between the observed incidence and covariate-adjusted expected incidence in each county, through which the hierarchy in adjusted intensity is recognized, beyond the effect of covariates.
The application to the data on North Carolina SIDS, using the proposed methods, shows that the two probability models performed similarly in the geographical race-adjusted intensity clustering analysis of counties with the highest rates. While the analysis was very different in the investigation of the mutually adjacent counties with the lowest adjusted rates. The Poisson model that can account for the unequal variances of the county rates performed better particularly at facilitating discrimination between the 13 counties with zero SIDS but different numbers of live births than the model based on SMR. The hierarchical race-adjusted low-intensity clusters, characterized by the Poisson model, should be more reliable, shown in Tables 6 and 7 and Fig. 3a, b.
With the information on hierarchy in adjusted intensity levels, provided by the application of our proposed methods, epidemiologists can best prioritize the regions with the highest rates within which to conduct thorough etiologic investigations and search for hidden spatially related risk factors. Similar research designs are commonly applied in studies of human genetics, in which a group of affected sibships with extreme traits are used for detecting commonly shared genetic defects of a disease of interest in gene mappings [15]. Meanwhile, public health officials can better prioritize the high-risk regions precisely and promptly move resources to areas with genuine highest abnormalities.
The identification of geographical and temporal disease clusters serves as a preliminary step that expedites subsequent etiological investigation and analysis of epidemicity. Most reports of perceived clusters do not lead to the identification of a common casual exposure for the events of interest [16]. The reasons for this are many. As Rothman and many others pointed out that vast resources spent on the investigation of possible alarms of disease clustering are often in vain. We should not be aiming to detect clustering, but to understand why clusters occur [17]. The four stages in the guidelines for the investigation of disease clusters issued by the US Centers for Disease Control in 1990 are (1) initial contact with and response to the individual who reported the cluster; (2) a preliminary assessment, including evaluations of whether an excess has occurred; (3) a formal feasibility study; and (4) a full etiologic investigation [18]. The ordinary statistical methods for detecting temporal and spatial clustering in disease incidence frequency alone are useful in the second stage. While, the utilities of our proposed statistical methods contribute to the third and fourth stages.
In addition to the focus on peak incidence, we extended the proposed methods to investigate geographical disease clusters of low incidence. We exemplified the utilities of recognizing and constructing the geographical hierarchical (in intensity) disease clusters of low incidence and peak incidence without and with the adjustment for covariates. The studies of incidence paucity and incidence clustering characterize opposite aspects of an observed geographical incidence pattern by using different parts of information from the data. In this report, we show that statistical methods that focus on geographical incidence paucity can be as meaningful and useful in spatial epidemiology and spatial statistics as the methods that focus on peak incidence in space. We articulate the difference in sensitivity, power, and applicability between the studies of incidence paucity and incidence clustering, using a temporal series of data, in our previous articles [1, 2].
Availability of data and materials
Information on data on the spatial occurrence of SIDS in North Carolina counties from July 01, 1974 to June 30, 1978, is available on the paper: Cressie N, Chan NH (1989). Spatial Modeling of Regional Variables. Journal of the American Statistical Association, 84(406):393–401.
References
Wu CC, Grimson RC, Amos CI, Shete S. Statistical methods for anomalous discrete time series based on minimum cell count. Biom J. 2008;50(1):86–96.
Wu CC, Grimson RC, Shete S. Exact statistical tests for heterogeneity of frequencies based on extreme values. Commun Stat Simul Comput. 2010;39(3):612–23.
Atkinson D. Epidemiology of sudden infant death in North Carolina: do cases tend to cluster?. Chapel Hill: University of North Carolina; 1979.
Cressie N, Chan NH. Spatial modeling of regional variables. J Am Stat Assoc. 1989;84(406):393–401.
Cressie N, Read TRC. Do sudden infant deaths come in clusters? Stat Decis. 1985;3(Supplement Issue No. 2):333–49.
Cressie N, Read TRC. Spatial data analysis of regional counts. Biom J. 1989;31:699–719.
Grimson RC, Wang KC, Johnson PWC. Search for hierarchical clusters of disease: spatial patterns of sudden infant death syndrome. Soc Sci Med. 1981;15(D):287–93.
Kulldorff M. A spatial scan statistic. Commun Stat Theory Methods. 1997;26(2):1481–96.
Symons MJ, Grimson RC, Yuan YC. Clustering of rare events. Biometrics. 1983;39(1):193–205.
Lai WT, Chen CH, Hung H, Chen RB, Shete S, Wu CC. Recognizing spatial and temporal clustering patterns of dengue outbreaks in Taiwan. BMC Infect Dis. 2018;18(1):256.
Carlin RF, Moon RY. Risk factors, protective factors, and current recommendations to reduce sudden infant death syndrome: a review. JAMA Pediatr. 2017;171(2):175–80.
Mantel N. The detection of disease clustering and a generalized regression approach. Can Res. 1967;27(2):209–20.
Cliff AD, Ord JK. Spatial autocorrelation. London: Pion Press; 1973.
Cliff AD, Ord JK. Spatial Processes: Models & Applications. Milton Park: Taylor & Francis; 1981.
Risch N, Zhang H. Extreme discordant sib pairs for mapping quantitative trait loci in humans. Science. 1995;268(5217):1584–9.
Rothman KJ, Greenland S. Modern epidemiology, 2nd edn. Philadelphia: Maple Press; 1998.
Rothman KJ. A sobering start for the cluster busters' conference. Am J Epidemiol. 1990;132(1 Suppl):S6–13.
Centers for Disease Control and Prevention (CDC). Guidelines for investigating clusters of health events. Morbidity and mortality weekly report (MMWR), vol. 39. Atlanta: Centers for Disease Control and Prevention (CDC); 1990. p. 1–23.
Acknowledgements
Not applicable.
Funding
The research presented in this manuscript was partially supported by the Taiwan Ministry of Science and Technology grant MOST 109-2118-M-006-005 to C.C.W.
Author information
Authors and Affiliations
Contributions
CCW designed the study, performed analyses, and drafted the manuscript. SS participated in project conception and discussion of results of the original manuscript. Both authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Additional file 1: Table S1.
Frequency distributions of the number of neighbors (seats of counties within 30 miles) simulated on the basis of 1 million random selections in North Carolina counties.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Wu, CC., Shete, S. Differentiating anomalous disease intensity with confounding variables in space. Int J Health Geogr 19, 37 (2020). https://doi.org/10.1186/s12942-020-00231-3
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s12942-020-00231-3