Confidence intervals for rate ratios between geographic units
 Li Zhu^{1}Email authorView ORCID ID profile,
 Linda W. Pickle^{2} and
 James B. PearsonJr.^{2}
DOI: 10.1186/s1294201600735
© The Author(s) 2016
Received: 1 September 2016
Accepted: 30 November 2016
Published: 15 December 2016
Abstract
Background
Ratios of ageadjusted rates between a set of geographic units and the overall area are of interest to the general public and to policy stakeholders. These ratios are correlated due to two reasons—the first being that each region is a component of the overall area and hence there is an overlap between them; and the second is that there is spatial autocorrelation between the regions. Existing methods in calculating the confidence intervals of rate ratios take into account the first source of correlation. This paper incorporates spatial autocorrelation, along with the correlation due to area overlap, into the rate ratio variance and confidence interval calculations.
Results
The proposed method divides the rate ratio variances into three components, representing no correlation, overlap correlation, and spatial autocorrelation, respectively. Results applied to simulated and real cancer mortality and incidence data show that with increasing strength and scales in spatial autocorrelation, the proposed method leads to substantial improvements over the existing method. If the data do not show spatial autocorrelation, the proposed method performs as well as the existing method.
Conclusions
The calculations are relatively easy to implement, and we recommend using this new method to calculate rate ratio confidence intervals in all cases.
Keywords
Spatial autocorrelation Rate ratio Variance Confidence intervals Cancer statistics Linked micromap plotBackground
Ratios of ageadjusted rates are of interest in public health research as a means of comparing rates in a set of geographic units with the rate in the overall area or in an area considered to be a “standard”. They are useful in providing information to the general public on the health condition of the community, and to policy stakeholders on program planning and priority setting. These rates (and ratios) between geographic units are correlated due to two reasons—the first being that each region is a component (subregion) of the overall area; and the second often referred to as the “First Law of Geography” [1], i.e., everything is related to everything else, but near things are more related than distant things. The second source of correlation is called spatial autocorrelation.
Earlier work developed methods that estimated confidence intervals (CI) on ageadjusted rates and ratios of ageadjusted rates between nonoverlapping regions [2] as well as between a subregion and its parent region [3]. The latter took into account the correlation due to overlapping between the two regions, and showed that the F and the normal approximation were more efficient than the gamma approximation and the F interval [4]. Spatial autocorrelation is well known to be present in the distribution of diseases [5, 6] but has not yet been incorporated into rate ratio interval estimates published by the National Cancer Institute.
The Surveillance, Epidemiology, and End Results (SEER) Program of the National Cancer Institute [7] is an authoritative source of information on cancer incidence and survival in the United States. SEER currently collects and publishes cancer incidence and survival data from populationbased cancer registries covering approximately 30% of the US population. Cancer statistics, including rate ratios, are released annually at the national, registry, state and/or county level via the statistical software SEER*Stat [8]. Rate ratios are provided but their interval estimates are computed assuming nonoverlapping regions and no spatial autocorrelation.
In this paper, we present revised confidence intervals that take spatial autocorrelation into account and show that the interval coverage is more accurate and provides a higher statistical power. The proposed method considers both overlapping (nonspatial) and spatial autocorrelation. The three methods referred to throughout the paper are (1) the Tiwari method [3] that considers overlapping between a subregion and its parent region; (2) the nonspatial method which is a special case in our proposal in that spatial autocorrelation is ignored and only overlap is accounted for; and (3) the spatial method that includes both overlap and spatial autocorrelation. Methods (1) and (2) are equivalent, only based on different probability distribution assumptions.
Methods
Ageadjusted rates
Spatial autocorrelation
Suppose \( \{ Z({\mathbf{s}}):{\mathbf{s}} \in {\mathbf{S}}\} \) represents a random process on surface \( {\mathbf{S}} \in {\mathbf{R}}^{{\mathbf{2}}} \), and Z(s _{1}), Z(s _{2}), …, Z(s _{ I }) represents a partial realization of the random process. Z(·) is said to be secondorder stationary if \( E[Z({\mathbf{s}})] = \mu \) for all \( {\mathbf{s}} \in {\mathbf{S}} \) (i.e., the mean of the process does not depend on location) and \( Cov\left[ {Z\left( {s_{i} } \right), Z\left( {s_{{i^{{\prime }} }} } \right)} \right] = C\left( {s_{i}  s_{{i^{{\prime }} }} } \right) \) for all \( s_{i} ,s_{i'} \in {\mathbf{S}} \). That is, the covariance function C(·), a measure of spatial correlation, depends only on the difference between locations s _{ i } and \( s_{{i^{{\prime }} }} \), not on the locations themselves. If \( Var\left[ {Z\left( {s_{i} } \right)  Z\left( {s_{{i^{{\prime }} }} } \right)} \right] = 2\gamma \left( {s_{i}  s_{{i^{{\prime }} }} } \right) \), then Z(·) is said to be intrinsically stationary and the function 2γ(·) is called a variogram. The semivariogram, γ(·), has a value \( r\left( {s_{i}  s_{{i^{{\prime }} }} } \right) \) which is a function of the difference between locations s _{ i } and \( s_{{i^{{\prime }} }} \) [10]. If, in addition, the covariance function C(·) does not depend on the direction between locations s _{ i } and \( s_{{i^{{\prime }} }} \), the process is called isotropic.
Variance calculation
Using notation similar to Tiwari et al. [2] where variances and confidence intervals of ageadjusted rates were derived, let \( {\mathbf{r}} = ({\mathbf{r}}_{1} , \ldots ,{\mathbf{r}}_{\text{I}} )^{{\prime }} \) denote the vector of agespecific rates for the regions 1 through I, and each component \( \text{r}_{\text{i}} = (r_{i1} , \ldots ,r_{iJ} )^{{\prime }} \) represent the rates for the J age groups in region i. Also let \( {\bar{\mathbf{R}}} = (R_{1} , \ldots ,R_{I} ,R_{\varOmega } )^{{\prime }} \) denote the vector of ageadjusted rates for regions 1 through I and the overall ageadjusted rate R _{ Ω }. Tiwari et al. [3] derived confidence intervals (and therefore the relevant variances and covariances) for an ageadjusted rate and of the difference and ratio of two ageadjusted rates, specifically R _{ i } and R _{ Ω }. as above. The calculation took into account the correlation due to the overlap between the subregions and the parent region. The derived 95% CI were shown to be more efficient than previously proposed methods [4]. However, both of these derivations ignored potential spatial autocorrelation among the areaspecific rates. In this report, we will follow the development of Tiwari et al. [3] to derive the variance/covariance matrix for ln (R _{ i }/R _{ Ω }), the logarithm of the rate ratio for region i relative to the overall rate, adding spatial autocorrelation as necessary.
Therefore, we need to compute \( Var({\mathbf{r}}) \), multiply by the factors in the above equation to estimate Var(R _{ Ω }), and then use the components of that result to compute the variance of the logarithm of the rate ratio.
All ageplacespecific rate (λ _{ ij }) terms in the variance calculation will be approximated by the observed rates R _{ ij }.
Results
To explore the impact of spatial autocorrelation in the confidence intervals of rate ratios, we will apply the methods with and without spatial autocorrelation to cancer incidence data from the SEER Program and cancer mortality data from the National Centers for Health Statistics [12]; both incidence and mortality data are available via the SEER*Stat software. The current SEER*Stat software provides ageadjusted rates with associated standard errors, confidence intervals, and betweengeographicarea rate ratios with associated intervals and the p value (to test the rate ratio equals to 1). The rate ratio intervals are calculated using the Tiwari et al. [2] method (nonoverlapping) but when rate ratio between two overlapping regions are requested, an alert message pops up that reads “The algorithms for the confidence intervals and the significance testing assume nonoverlapping groups. Please use caution when interpreting these results.”
All ageadjusted rates in this paper are calculated using the 2000 US Standard Population [13] using the direct method, and the unit of the ageadjusted rates is per 100,000 people at risk. In addition to the real datasets, corresponding simulated rate ratios were created under the assumption that the logarithm of rate ratios comes from a spatial Gaussian process. The variation of the mean vector and variance–covariance structure will be described below. The simulation was implemented using the spam package [14] in R [15] with 10,000 realizations were created for each simulation.
Statelevel data on different cancers
A linked micromap plot [17] reveals the spatial pattern and associations between multiple variables simultaneously. To explore factors that contribute to the higher precision (and shorter confidence intervals) associated with the spatial method, we used a linked micromap plot (Fig. 6) to show the percent of length reduction in the 95% CI of lung cancer rate ratios from the Tiwari method to the spatial method in the US, expressed as percentages of the length of the Tiwari 95% CI for each state. The panels in the plot (from left to the right) are the maps of states with the highest to lowest % Reduction (from top to bottom row), the state names, the values of the % Reduction, the population size in each state and the estimated rate ratio with its 95% CI using the spatial method. The region with the highest % Reduction is in the midwest, and expands to the east, west, and south, and finally to New England, Florida, and the Pacific West. Except for the few states with really large populations (e.g., New York, Texas, Florida, and California), there is a positive correlation between the population size and the % Reduction. In other words, the higher a state’s population size, the larger benefit it gains from the spatial method. Many of the larger states have such precise rate ratio estimates that the CI line is completely covered by the estimate’s dot. The only state that has a larger confidence interval using the spatial method is California, due to a relatively large proportion of the population of the state of the US total and hence a large overlap, as explained above. By adding the spatial correlation, the variance of the logarithm of the US rate, \( \text{var} (\ln (R)) \), increases from 6.4E−6 to 1.0E−4, a 15fold change.
For esophagus and tongue cancer data, spatial autocorrelation is weak and the advantage of the spatial method is minimal. The results of the three measures are very close across the Tiwari, nonspatial, and spatial methods for esophagus and tongue cancer (data not shown).
Statelevel data with varying autocorrelation strength and scale
To further explore the impact of spatial autocorrelation on the confidence intervals of rate ratios, we applied the competing methods to a set of simulated statelevel data. The observed rate ratios of 2004 lung cancer mortality were taken as the mean, and the variance–covariance matrix was set with a variety of autocorrelation strength and scale. As shown in formula (6), for an exponential semivariogram model, spatial autocorrelation depends on the proportion of partial sill to sill, \( \frac{{c{}_{e}}}{{c{}_{0} + c{}_{e}}} \), as well as the range, a _{ e }. The partial sill to sill proportion ranges between 0 and 1, with a higher value representing stronger spatial autocorrelation. The range is the distance between two regions beyond which the observations are considered spatially uncorrelated. Of the 50 US states and District of Columbia, the pairwise distance ranges from the minimum at 25.7 km (between District of Columbia and Maryland) to the median at 1608.0 km (between Florida and Oklahoma), and the maximum at 4984.0 km (between Maine and Hawaii). We set the range a _{ e } in formula (6) at three different values, 500, 1700, and 4000 km, to represent local, regional, and global scale of spatial autocorrelation. The partial sill to sill proportion, \( \frac{{c{}_{e}}}{{c{}_{0} + c{}_{e}}} \), was set at 0.1, 0.5, and 0.9, to represent weak, moderate, and strong spatial autocorrelation. A total of nine simulated datasets were created, and percent of length reduction from nonspatial method to spatial method was calculated for each state and averaged across the whole US.
Percent of length reduction for 95% CI for rate ratios, from nonspatial method to spatial method with varying spatial autocorrelation scale (measured by range in semivariogram) and strength (measured by partial sill to sill proportion in semivariogram)
Scale of spatial autocorrelation measured by range (km)  

500 (local)  1700 (regional)  4000 (global)  
Strength of spatial autocorrelation  
0.1 (weak)  (0, 0.004, 0.095) 0.48  (0.005, 0.039, 0.098) 1.7  (0.029, 0.067, 0.099) 2.7 
0.5 (moderate)  (0, 0.02, 0.47) 2.4  (0.027, 0.19, 0.49) 9.2  (0.14, 0.33, 0.50) 14.6 
0.9 (strong)  (0, 0.036, 0.85) 4.5  (0.048, 0.35, 0.89) 17.7  (0.26, 0.60, 0.89) 29.3 
Countylevel data
A simulation study was performed based on the Kentucky countylevel male lung cancer incidence rates. The simulation study serves two purposes. First, we would like to establish the relationship among the three interwoven factors—county to state rate ratios, county population size, and statistical power in testing the hypothesis of rate ratios equal to 1. Then we would compare the nonspatial and spatial methods. To serve the first purpose, we simulated a set of county rates (according to the population size of each county in Kentucky) with the countytostate rate ratios ranging between 0.2 and 2.0, with an increment of 0.05. This range is broader than that in a realistic situation and should provide a complete picture of the multidimensional relationship between rates, rate ratios, population size, and power. However, this is a hypothetical scenario since it is impossible to observe that all counties have the same rate ratio at a certain fixed level, so the analysis in this scenario only helps with understanding the impact of rate ratio and population size on the statistical power. It does not estimate how much improvement the spatial method will provide in a real world situation. To serve the second purpose, we created the simulated data assuming the hypothesized mean rate ratios and the variance–covariance structure were the same as the observed values. Then we compare the statistical power and coverage probability of the 95% CI between the nonspatial and spatial method.
Power and coverage probability of the nonspatial and spatial methods in the simulation study for Kentucky countylevel male lung cancer incidence data
Power  Coverage probability  

Nonspatial (%)  Spatial (%)  Nonspatial (%)  Spatial (%)  
Simulation under hypothetical fixed rate ratios  
0.50  98.7  98.9  96.7  95.0 
0.75  65.5  71.0  96.7  95.0 
1.0  3.23  4.98  96.8  95.0 
1.25  48.7  55.3  96.7  94.9 
1.50  85.7  88.5  96.8  95.0 
1.75  96.3  97.0  96.8  95.0 
Discussion
Presenting confidence intervals along with point estimates provides a measure of uncertainty of the estimate. Failure to include spatial autocorrelation when it is present can lead to errors in the estimates and subsequent inferential errors. We have extended a previous method of calculating confidence intervals for the rate ratio of a subregion to the full region to include spatial autocorrelation. Tiwari et al. [3] proposed a method to compute a 95% confidence interval around the ratio of an areaspecific rate to the rate for a larger area which includes that area. This method accounts for correlation between the single area rate and that of the larger area, but does not account for likely spatial autocorrelation between rates of neighboring areas. Our method includes components of total variance due to the standard distributional assumption, the correlation between the single area rate and the larger area rate, and spatial autocorrelation among the areas.
The rate ratio is a common measure of the relative ranking of areas, e.g., counties within a state, and is easily computed. The earlier nonoverlapping method of Tiwari et al. [2] is now implemented in SEER*Stat, the National Cancer Institute’s popular software to disseminate cancer statistics. Our goals were to improve upon the Tiwari [3] methods by including spatial autocorrelation and to develop a resulting method that could replace the current method in SEER*Stat. Therefore, we used a similar approach as Tiwari to develop our confidence interval and applied the new method to the same three cancer datasets so as to make the results most comparable.
Several limitations exist in this study. First, isotropy is assumed for spatial autocorrelation. It may be argued that direction as well as distance between regions will have an impact on disease rates. Considering anisotropy will greatly increase the complexity of model and computation. Secondly, the exponential semivariogram model that fit our lung cancer data well may not be the best choice for every outcome variable, although experts believe that the choice of semivariogram model is less important than the inclusion of spatial autocorrelation at all (see p. 379 of [10]).
We were limited in our ability to assess the impact of adding spatial autocorrelation to the rate ratio variance calculations for uncommon and rare cancers like esophagus and tongue cancers. Even when we simulated data with very strong spatial autocorrelation, the resulting data did not show much of a detectable spatial pattern. This is probably due to the small number of cases or the Modifiable Areal Unit Problem (MAUP). Assessing the impact of small numbers or MAUP to spatial autocorrelation is beyond the scope of this paper; our method was developed to account for spatial autocorrelation without regard to how it came about.
Another limitation of our approach is that we have underestimated the uncertainty of the rate ratio variance by assuming no error in the estimation of the covariance parameter estimates. It will be very important to explore approaches that fully account for the uncertainty in the spatial autocorrelation estimation in future work. However, because users are typically not interested in computing a rate ratio for areas with small populations, knowing that the confidence intervals will be large and uninformative in this situation, we suspect that uncertainty in the rate ratio estimate due to the estimation process will be small relative to uncertainty due to unmeasured spatial autocorrelation. Therefore, we believe that while we have shown that our method is superior to the Tiwari et al. method when spatial autocorrelation is present, and equivalent to it when area rates are uncorrelated, any further improvement resulting from a hierarchical (Bayesian) model that can compute the estimation process variation across many replicates will be relatively small. We have initiated a followup study to confirm this belief. It is impractical to implement a Bayesian estimation method that typically requires computation of thousands of replicates in a server or webbased statistical system such as SEER*Stat, and therefore it is important to verify that the method proposed here, one that can easily be implemented, has captured nearly all of the variation in the rate ratio estimate.
Conclusions
We have developed a method that takes into account spatial autocorrelation along with area overlap in calculating variances and confidence intervals of rate ratios between a subregion and the total region. Our variance is comprised of three components, representing no correlation, area overlap, and spatial autocorrelation, respectively. We have shown that calculating the variance of the rate ratio including the possibility of spatial autocorrelation among the areaspecific rates can lead to substantial improvements over the Tiwari method. For U.S. statelevel cancer data, confidence intervals were shorter and power was greater than the Tiwari method. The Tiwari method accounted for correlation due to area overlap but not for spatial autocorrelation among the areas.
Application to simulated statelevel data showed that the advantage of the proposed method is directly related to the strength and scale of the spatial autocorrelation. Improved results were also seen at the county level using simulated data based on population patterns in Kentucky counties. We did note two instances where results were not as good as in the Tiwari method, both where one region’s population constituted a large proportion of the population for the entire aggregated area (California compared to the U.S. and Jefferson County compared to Kentucky). In these cases, the correlation due to the population overlap was much larger than any observed spatial autocorrelation, eliminating any advantage for our method over methods that ignore this source of correlation. Because of the large population size in these regions, though, the variance and interval estimates were already precise, and adding spatial autocorrelation does not practically impact the interval estimates or the statistical power.
One approach to calculating confidence intervals for the rate ratio might be to assess the degree and scale of spatial autocorrelation among the areas and then decide on whether or not to include the spatial autocorrelation. However, these correlations are on a continuum, so it would be difficult to set a cutpoint beyond which the spatial method should be used. We have shown that our method provides the same results as the Tiwari method when there is very little spatial autocorrelation, and it does account for overlap between a subregion and its parent region. Thus for little additional computational cost, we obtain an estimate equal to that currently used for SEER data if areaspecific rates are independent, but in the presence of increasing spatial autocorrelation we can obtain substantial improvements over the existing method. Since the calculations are relatively easy to implement, we recommend using the new method in all cases.
Abbreviation
 CI:

confidence interval
Declarations
Authors’ contributions
LZ, and LP developed the method and study design. JBP developed the computing code for the analysis and mapping. All authors read and approved the final manuscript.
Acknowledgements
The authors acknowledge colleagues Eric J. Feuer and Benmei Liu who have provided comments that improved the quality of this manuscript.
Competing interests
The authors declare that they have no competing interests.
Availability of data and material
The software along with datasets supporting the conclusions of this article are included as supporting materials.
Funding
This work was partly supported by the Statistical Methodology and Applications Branch of the National Cancer Institute through a contract HHSN261201400193P.
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
Authors’ Affiliations
References
 Tobler W. A computer movie simulating urban growth in the Detroit region. Econ Geogr. 1970;46(2):234–40.View ArticleGoogle Scholar
 Tiwari RC, Clegg LX, Zou Z. Efficient interval estimation for ageadjusted cancer rates. Stat Methods Med Res. 2006;15(6):547–69.View ArticlePubMedGoogle Scholar
 Tiwari RC, Li Y, Zou Z. Interval estimation for ratios of correlated ageadjusted rates. J Data Sci. 2010;8:471–82.PubMedPubMed CentralGoogle Scholar
 Fay MP, Feuer EJ. Confidence intervals for directly standardized rates: a method based on the gamma distribution. Stat Med. 1997;16(7):791–801.View ArticlePubMedGoogle Scholar
 Kulldorff M, et al. Breast cancer clusters in the northeast United States: a geographic analysis. Am J Epidemiol. 1997;146(2):161–70.View ArticlePubMedGoogle Scholar
 Pickle L, et al. Atlas of United States mortality. Hyattsville: National Center for Health Statistics; 1996.Google Scholar
 National Cancer Institute. Surveillance Epidemiology and End Results (SEER) Program. 2016. http://seer.cancer.gov/.
 National Cancer Institute. SEER*Stat Software. 2016. http://seer.cancer.gov/seerstat/.
 Fleiss JL, Levin B, Paik MC. Statistical methods for rates and proportions. 3rd ed. New York: Wiley; 2003.View ArticleGoogle Scholar
 Waller LA, Gotway CA. Applied spatial statistics for public health data. Hoboken: Wiley; 2004.View ArticleGoogle Scholar
 Bishop YM, Fienberg SE, Holland PW. Discrete multivariate analysis: theory and practice. Cambridge: MIT Press; 1975.Google Scholar
 Centers for Disease Control and Prevention. Mortality data. http://www.cdc.gov/nchs/deaths.htm.
 Zhu L, et al. Predicting US and statelevel cancer counts for the current calendar year. Cancer. 2012;118(4):1100–9.View ArticlePubMedGoogle Scholar
 Furrer R, Sain SR. spam: a sparse matrix R package with emphasis on MCMC methods for Gaussian Markov random fields. J Stat Softw. 2010;36(10):1–25.View ArticleGoogle Scholar
 R Development Core Team. The R Project for statistical computing. 2016. https://www.rproject.org/.
 Surveillance Epidemiology and End Results (SEER) Program, Mortality—All COD. Aggregated with State, Total U.S. (1969–2004). N.C.f.H.S. (NCHS), editor. Bethesda: National Cancer Institute SEER Program; 2007.
 Pickle LW, Pearson JB, Carr DB. micromapST: exploring and communicating geospatial patterns in U.S. state data. J Stat Softw. 2015;63(2):1–25.Google Scholar
 Christian WJ, et al. Exploring geographic variation in lung cancer incidence in Kentucky using a spatial scan statistic: elevated risk in the Appalachian coalmining region. Public Health Rep. 2011;126(6):789–96.PubMedPubMed CentralGoogle Scholar
 Samui P, Sitharam TG. Application of geostatistical models for estimating spatial variability of rock depth. Engineering. 2011;3:886–94.View ArticleGoogle Scholar