- Research
- Open Access
- Published:

# Robustness of the BYM model in absence of spatial variation in the residuals

*International Journal of Health Geographics*
**volume 6**, Article number: 39 (2007)

## Abstract

### Background

In the context of ecological studies, the Bayesian hierarchical Poisson model is of prime interest when studying the association between environmental exposure and rare diseases. However, adding spatially structured extra-variability in the model fitted to the data when such extra-variability does not exist conditionally on the covariates included in the model (*over-fitting*) may bias the estimation of the ecological association between covariates and relative risks toward the null. In order to investigate that possibility, a simulation study of the impact of introducing unnecessary residual spatial structure in the estimation model was conducted.

### Results

In the case where no underlying extra-variability from the Poisson process exists, the simulation results show that models accounting for structured and unstructured residuals do not underestimate the ecological association, unless covariates have a very strong autocorrelation structure, i.e., 0.98 at 100 km on a territory of diameter 1000 km."

## 1 Background

Ecological regression studies investigate potential association between geographical variation in disease rates (or counts) and environmental covariates. For example, a recent study evaluated the ecological association between indoor radon concentration and acute leukaemia incidence among children [1]. For rare diseases and/or small areas, Bayesian hierarchical Poisson model is commonly used where within-area variability of disease is modelled at the first stage as a Poisson process and ecological relationships between disease and covariates are introduced at the second stage of the hierarchical model. Spatially extra-Poisson variability potentially due to aggregated effect of unknown confounders is commonly taken into account through spatially structured residuals added in the second stage of the model.

In that context, the BYM (Besag, York and Mollié) model [2] is a standard model for estimation of the ecological associations. The overall variability of a health indicator is broken down into a random Poisson component, a spatially structured area-specific random effect and an unstructured random term, across geographic units. It has been extensively shown that not accounting for an actual spatial variability may lead to major biases [3].

Conversely, if the spatial variability of a health indicator is completely explained by that of environmental factors and the other ecological covariates taken into consideration, regression residuals do not have spatial structure. Modelling the spatial structure of residuals could then lead to a biased estimate of the ecological association via a phenomenon of *over-fitting* [4–7].

To the author's knowledge, the quantitative impact of fitting a model including extra-Poisson variability to analyse data generated by a model where such extra-Poisson variability does not exist conditionally on the covariates included in the model (*over-fitting*) has not previously been explicitly and quantitatively investigated. Robustness of residuals modelling as BYM was studied in a different inferential context. In the frame of an extensive investigation of the statistical performances of a number of spatial models, Lawson *et al*. [8] studied the performance of such models on relative risks estimates in the case of mapping modelling, i.e. without covariates, where different true spatial structures of residuals were simulated. The authors showed that BYM model performed well on risks estimations except when the true residuals were resulting from a mixture structure. Different models were compared to detect effect from a putative source on a regular lattice [9]. They concluded in particular that the introduction of a spatially structured area-specific random effect leads to much less bias in the parameter estimate and that BYM model is the least biased. Notably, biases in parameter estimation appear when the random effects are not acounted for. In the present study we focused on ecological association estimate when covariates with spatially structure are introduced. Such covariates are often of interest in epidemiology when environmental exposures are studied. In our work we focus on a particular model with France mainland as study domain. The aim of our study was then to determine the robustness of the BYM model in the absence of residual spatial variation, i.e., the impact on the ecological association estimate on an irregular domain. The estimates performances were discussed and characterized according to the covariates structure. Simulations protocols assumed systematically a Poisson model at the first stage of the hierarchical model and log linear relationship between the incidence of the disease and exposure at the second stage without addition of extra-Poisson residuals. Various spatial structures of exposure were considered. The explained spatial variability is thus fully specified/attributable by the covariate structure.

First, ecological models that do or do not allow spatially structured or unstructured heterogeneity will be considered. Various simulation protocols for parameter values enabling balanced or unbalanced between/within area variability will then be presented. The results of the various simulation protocols will then be considered in terms of their performances with regard to the estimation of ecological associations. The paper will conclude with a discussion.

## 2 Methods

### 2.1 Statistical Models

Let *D* be the study area of interest, partitioned into *m* geographic areas. The data consist in the observed *Y*_{
i
}and expected *E*_{
i
}disease counts for each area *i*, (*i* = 1,...,*m*). Let *X*_{
i
}be an ecological variable of interest in area *i*. The ecological Poisson *M*_{0} model is expressed in hierarchical form as follow:

in which *R*_{
i
}is the relative risk in area *i*. The second stage models the relationship between the relative risk and exposure variable. The ecological model *M*_{0} does not include any spatially structured or unstructured heterogeneity.

In order to account for those variabilities, the BYM model [2] was proposed :

in which *δ*_{
i
}denotes the set of labels of the neighbours of area *i*, *n*_{
i
}is the number of neighbours *i*, *U*_{
i
}(*i* = 1,...,*m*) models the spatially-structured area-specific random effect based on the conditional autoregressive approach CAR [10], and *V*_{
i
}(*i* = 1,...,*m*) is the unstructured random effect. The BYM model is the benchmark parametric model and is widely used in disease-mapping studies mainly because of the flexibility of the residuals.

### 2.2 Design of the simulation study

Processes X and Y are simulated on D under model *M*_{0} accordingly to parameters values. The simulation parameters were selected with reference to the overall variability of the estimated relative risks, thus enabling realistic and reasonable values for relative risks. More precisely, let ${\widehat{R}}_{i}$ = *Y*_{
i
}/*E*_{
i
}be the maximum likelihood estimate of the relative risk for the area. If *X*_{
i
}~ *N*(0, 1), then:

If the relative risks are spatially independent of the expected disease counts *E*_{
i
},

where ${\overline{E}}_{h}$ is the harmonic mean of *E*_{
i
}(*i* = 1,...,*m*). The overall variance may be expressed as:

This variance may be broken down into a *Within area variability* term Wv = 1/${\overline{E}}_{h}$ × exp(-*α* + *β*^{2}) and a *Between area variability* term Bv = *β*^{2}. Let *p* denote the proportion of between area variance, *p* = Bv/*Var*[log(${\widehat{R}}_{i}$)], a high value of *p* corresponds to high between-area variability, that is a high amount of information with which to estimate the ecological link *β*. Hereafter, without any loss of generality, *α* will be considered equal to 0.

The geographic scale unit consisted in the 94 *Departements* of mainland France (Corsica excluded). The expected disease counts (*E*_{
i
}) consist in the expected cases of acute leukaemia in children aged less than 15 years for the period 1990–1998 *in Departement i*. The cases were retrieved from the French National Registry of Childhood Leukaemia and Lymphoma [11]. The expected numbers ranged from 4.2 to 204, with a harmonic mean of 23.35. Scenarios in which the within-area variance was either doubled (${\overline{E}}_{h}$ = 46.6) or divided by 10 (${\overline{E}}_{h}$ = 2.33) were also considered.

Given that *X*_{
i
}~ *N*(0, 1), within-area variance depends on 3 parameters, namely: the harmonic mean of expected disease counts ${\overline{E}}_{h}$, the ecological link *β* and the autocorrelation structure of *X*_{
i
}.

As *X* has a standardized normal distribution, the 2.5 (*p*_{2.5%}) and 97.5 (*p*_{97.5%}) percentiles of the relative risks are under model *M*_{0} exp(*α* ± 1.96*β*) and their ratio *Q* = *p*_{97.5%}/*p*_{2.5%} is exp(2 × *β* × 1.96). The quantile ratios were considered equal to 1.0, 1.5, 2.0 and 3.0, equivalent to no effect, weak, moderate and strong effects, respectively. The corresponding values for *β* were 0.00, 0.12, 0.21 and 0.33. The proportions of between-area variance, *p*, by ecological link *β* and ${\overline{E}}_{h}$ breakdown are summarized in Table 1.

For ${\overline{E}}_{h}$ = 23.35, the between area variance proportion ranges from 0 to 71%, with a balanced case for an ecological link when *β* = 0.21. For ${\overline{E}}_{h}$ = 46.69, *p* ranges from 0 to 83%, while *p* ranges from 0 to 19% when ${\overline{E}}_{h}$ = 2.33.

The autocorrelation of the exposure variable was also modulated. The following exponential autocorrelation structure was considered: cov(*X*_{
i
}, *X*_{
j
}) = *exp*(-*d*(*i*, *j*)*φ*), in which *d*(*i, j*) is the distance between areas *i* and *j*. Let *ρ*_{
xx
}= *exp*(-100*φ*) be the autocorrelation of two areas 100 km distant from each other. The following values for *ρ*_{
xx
}= (0.40, 0.90,0.95, 0.98) were studied. That correlation structure is shown in Figures 1. High values of *ρ*_{
xx
}may mimic a spatial bloc structure.

For each combination of parameters (${\overline{E}}_{h}$, *β*, *ρ*_{
xx
}), 400 replicates of (*X*, *Y*) = ((*X*_{
i
}, *Y*_{
i
}), *i* = 1,...,*N*) were generated using the *M*_{0} model. For each replication, the ecological link was estimated by both models (*M*_{0} and BYM) in a Bayesian framework.

The estimations were made with BRugs [12] software. For each data set, a burn-in of 5000 iterations was used and Bayesian inferences were based on 45000 iterations from Gibbs sampling giving Monte Carlo standard errors of less than 5% of the posterior standard deviation of each parameter [13]. The Monte Carlo standard error is an estimate of the difference between the mean of the sampled values and the true posterior mean. Non-informative priors were chosen for the parameters: *α* ~ *U*(-∞; +∞), *β* ~ *N*(0.0, 1.0*E* + 5), *τ*_{
U
}~ Γ(0.5, 0.0005), *τ*_{
V
}~ Γ(0.5, 0.0005) [14], in which Γ(*a*, *b*) denotes the Gamma distribution with expectation equal to *a/b*.

For each triplet (${\overline{E}}_{h}$, *β*, *ρ*_{
xx
}), the ecological link was estimated using the *M*_{0} and BYM models. Let ${\widehat{\beta}}^{(j)}$ be the posterior mean estimates of *β* at the *j*th replication, ${\sigma}_{{\widehat{\beta}}^{(j)}}$ the posterior standard error of ${\widehat{\beta}}^{(j)}$ and *CI*^{j}(*β*) the 95% credibility interval of *β*. The following criteria were computed for 400 replications (*j* = 1,...,400):

• The empirical mean of the estimated posterior expectations of $\beta \left({\overline{\beta}}_{sim}={\displaystyle {\sum}_{j=1}^{400}{\widehat{\beta}}^{(j)}/400}\right)$

• The empirical mean of the estimated posterior standard deviations of $\beta \left({\overline{\sigma}}_{{\beta}_{sim}}={\displaystyle {\sum}_{j=1}^{400}{\sigma}_{{\widehat{\beta}}^{(j)}}/400}\right)$

• The empirical standard deviation of the 400 estimated posterior means of *β* (*sd*(*β*_{
sim
}))

• The mean relative bias (MB) and its standard deviation (sd(MB)),

• The root mean square error (RMSE),

• The proportion of coverage, *π*(*β*): the percentage of time when *β* lay within its 95% credibility interval

• The proportion of non-coverage 1 - *π*(0): the percentage of time when 0 did not lie within its 95% credibility interval

When *β* ≠ 0, 1 - *π*(0) quantifies the ability of the estimation model to detect the existence of an association, which is analogous to the frequentist power. McNemar's test for comparing proportions from paired data (estimates of *β* from *M*_{0} and BYM based on the same replicated data set) was used to test whether the *π*(*β*) (or 1 - *π*(0)) values were significantly different. The over-fit of the BYM model (compared to the *M*_{0} model) was assessed via that criterion.

## 3 Simulation Results

Simulations results are structured as follow: firstly, we present results for an harmonic mean ${\overline{E}}_{h}$ = 23.35 of expected disease counts equal to those from acute leukaemia in children for 1990–1998 in France and a covariate *X* with null to moderate autocorrelation. Secondly, for the same harmonic mean ${\overline{E}}_{h}$, we study the influence of strong autocorrelations for the covariate X and finally, variations of ${\overline{E}}_{h}$ (smaller and greater than 23.35) are explored.

### 3.1 Moderate covariate autocorrelation

The first scenario considered ${\overline{E}}_{h}$ = 23.35 and the autocorrelation equal to 0.0 or 0.4; the results are shown in Table 2. In that setting, the between-area variance varied from 0 to 71%. In the absence of any spatial structure for *X* (*ρ*_{
xx
}= 0), the estimate of *β* was unbiased, irrespective of the estimation model. The mean bias was less than 1% and the RMSE was less than 0.02 for the four values of the ecological link (for both models). The coverage proportions *π*(*β*) were similar for the two models and greater than 94.5%. The coverage proportion of the BYM model was consistently slightly greater than that of the *M*_{0} model. For *β* = 0, *π*(*β*) was close to 95% and, equivalently, 1 - *π*(0) was close to 5%. The BYM model thus handles the scenario in which the covariate has no spatial structure.

When the autocorrelation was increased to 0.4, the results were similar to those with the previous setting. The mean bias was less than 1%. There was a slight increase in the RMSE but it remained less than 0.02. The *β* coverage proportion with the BYM model was greater than that with the *M*_{0} model. The non-coverage proportion was equal to 1, except when there was no association (*β* = 0). The non-coverage proportions were significantly different when *β* = 0. The proportion the closest to 5% was obtained with the *M*_{0} model. Irrespective of the value of *β*, the variability of *β* was always slightly over-estimated with the BYM model. The coverage proportion of the *M*_{0} model varied from 94.5 to 95.8% (96.8 to 97.5% for the BYM model). In the absence of, or with moderate, autocorrelation, the *over-fitting* effect was not observed. Both models provided an almost unbiased estimate of the ecological link.

### 3.2 Strong covariate autocorrelation

The scenario of strong autocorrelation for *X* was then considered: *ρ*_{
xx
}= 0.90, 0.95, 0.98 at 100 km with ${\overline{E}}_{h}$ = 23.35. The results are shown in Table 3. When *ρ*_{
xx
}= 0.90, the mean bias and the RMSE were low (0.03). The mean bias decreased as the value of the ecological link increased, reflecting an increase in between-area variability. The coverage proportion with the BYM model was higher than the coverage proportion with the *M*_{0} model for all values of *β*. The non-coverage proportions were significantly different for *β* = 0.00 and *β* = 0.12, in favor of the *M*_{0} model. For *ρ*_{
xx
}= 0.95, the bias was still small and the RMSE increased to 0.04. The non-coverage proportions were again significantly different in the cases in which *β* = 0.00 and *β* = 0.12 in favor of the *M*_{0} model. Lastly, for *ρ*_{
xx
}= 0.98 and for the first three values of *β* (*β* = 0.00, 0.12, 0.21), the non-coverage proportions were significantly different, again in favor of *M*_{0}. When the ecological link was null, the non-coverage proportion was again smaller with the BYM model, a consequence of the over-estimation of parameter variability. The *β* coverage proportions were lower for *M*_{0} for 4 values of *β*. When the autocorrelation increased from 0.90 to 0.98, the RMSE increased from 2.99 to 6.48, mainly due to the decrease in independent information. A slight increase was observed for all the other criteria. The bias was weak, resulting in very small RMSE and *sd*(*β*_{
sim
}) in both models. At high autocorrelation values, the overall variability of the estimates increased. There was more variability for each *β* value with the BYM model. This is exemplified by the mean posterior standard deviation, which increased four-fold (for all *β* values) between the first (*ρ*_{
xx
}= 0.00) and last (*ρ*_{
xx
}= 0.98) autocorrelation scenario. This was also the case for *sd*(*β*_{
sim
}).

### 3.3 Variation of (harmonic mean of) expected counts

The next scenario consisted in strong autocorrelation of *ρ*_{
xx
}= 0.95 at 100 km with variation in number of expected disease counts. The results for that scenario are shown in Table 4.

For ${\overline{E}}_{h}$ = 46.6 (and *ρ*_{
xx
}= 0.95), the mean bias and RMSE were smaller than in the scenario in which ${\overline{E}}_{h}$ = 23.3. The *β* coverage rate was greater than 94.5% for both models and the proportion was higher for the BYM model. The non-coverage proportions were significantly different, in favor of the *M*_{0} model, for *β* = 0.12. For *β* = 0, the non-coverage proportion was again smaller with the BYM model.

For ${\overline{E}}_{h}$ = 2.33, the bias increased (up to 6%) and the RMSE was the highest observed in the various cases (14% approx.). The coverage proportion was smaller than in the previous scenario but greater than 92%. The coverage proportion *π*(*β*) was higher with the BYM model than with the *M*_{0} model. While the non-coverage proportions of 0 were close to 1 (except for *β* = 0), the proportions decreased by 16% for *M*_{0} and 13% for BYM for *β* = 0.12. Moreover the non-coverage proportions of 0 for the two models were significantly different and in favor of the *M*_{0} model for *β* = 0.12, 0.21, 0.33.

High autocorrelations thus appear to influence the *over-fitting* effect of the BYM model. The expected disease counts, also modulates the overall accuracy of the estimation. In fact, with highly correlated spatial structure and low disease counts, the bias of the *β* estimate generated by the BYM increases. But, even with a highly autocorrelated covariate and adequate disease counts, when *E*_{
i
}is doubled, the BYM model estimates the ecological link with little bias.

## 4 Discussion

A simulation study was conducted in order to assess estimation performance with respect to the ecological association between covariates and health indicators. Key parameters, such as the ecological link, expected disease counts and autocorrelation strength were selected to ensure that the simulation covered realistic situations. The choice of parameters enabled coverage of balanced and unbalanced between- and within-area variabilities.

For moderate autocorrelation structures, both the Poisson model and the BYM model performed well and the estimation performances were similar. Underestimation of ecological links was only observed for high autocorrelations. Overall, the posterior standard deviation of *β* was slightly over-estimated with the BYM model, resulting in conservative results when the true value of *β* was null.

The expected disease counts are also of interest because, with a high autocorrelation, the underestimation of the BYM model is present. In practice, this worst-case scenario can nonetheless be found. Except for the extreme scenario, strong spatial structure and low disease counts, both models perform well, even with strong spatial structure. As a consequence, the BYM model can be used to estimate ecological associations without fearing underestimation. The simulation results show that models accounting for structured and unstructured residuals do not underestimate materially the ecological association. The rational is the following: not accounting for an actual spatial variability leads to strong bias. Thus from a practical point of view, the BYM model should be preferred to the Poisson if spatial autocorrelation of covariate is suspected. Moreover, autocorrelation structure will be first investigated via Moran's I test [15].

## References

- 1.
Evrard A, Hémon D, Billon S, Laurier D, Jougla E, Tirmarche M, Clavel J: Ecological association between indoor radon concentration and childhood leukaemia incidence in France, 1990–1998. Eur J Cancer Prev. 2005, 14 (2): 147-57. 10.1097/00008469-200504000-00011.

- 2.
Besag J, York Y, Mollié A: Bayesian Image Restoration With Two applications In Spatial Statistics. Ann Inst Statist Math. 1991, 43: 1-59. 10.1007/BF00116466.

- 3.
Salway R: Statistical Issues in the Analysis of Ecological Studies. PhD thesis. 2003, Imperial College School of Medicine

- 4.
Clayton D, L B, C M: Spatial Correlation in Ecological Analysis. International Journal of Epidemiology. 1993, 22: 1193-1202. 10.1093/ije/22.6.1193.

- 5.
Best NG, Arnold RA, Thomas A, Waller LA, Conlon EM: Bayesian models for spatially correlated disease and exposure data. Bayesian Statistics 6. 1999, Oxford University Press, 131-56.

- 6.
Richardson S: Spatial models in epidemiological applications. Highly Structured Stochastic Systems, Volume 27 of Oxford Statistical Science Series. Edited by: Green PJ, Hjort NL, Richardson S. 2003, Oxford: Oxford University Press, 237-259.

- 7.
Wakefield J: Sensitivity analyses for ecological regression. Biometrics. 2003, 59: 9-17. 10.1111/1541-0420.00002.

- 8.
Lawson AB, Biggeri AB, Boehning D, Lesaffre E, Viel JF, Clark A, Schlattmann P, Divino F: Disease mapping models: an empirical evaluation. Disease Mapping Collaborative Group. Statistics in Medicine. 2000, 19 (17–18): 2217-2241. [Comparative Study]

- 9.
Ma B, Lawson AB, Liu Y: Evaluation of Bayesian models for focused clustering in health data. Environmetrics. 2007, [DOI: 10.1002/env.850]

- 10.
Besag J: Spatial interaction and the statistical analysis of lattice systems (with discussion). Journal of Royal Statistical Society, Series B. 1974, 36: 192-236.

- 11.
Clavel J, Goubin A, Auclerc MF, Auvrignon A, Waterkeyn C, Patte C, Baruchel A, Leverger G, Nelken B, Philippe N, Sommelet D, Vilmer E, Bellec S, Perrillat-Menegaux F, Hemon D: Incidence of childhood leukaemia and non-Hodgkin's lymphoma in France: National Registry of Childhood Leukaemia and Lymphoma, 1990–1999. Eur J Cancer Prev. 2004, 13 (2): 97-103. 10.1097/00008469-200404000-00002.

- 12.
Thomas A, O'Hara B: BRugs: OpenBUGS and its R interface BRugs. 2005

- 13.
Roberts G: Markov chain concepts related to sampling algorithms. Markov chain Monte Carlo in Practice. Edited by: Gilks W, Richardson S, Spiegelhalter D. 1996, Chapman and Hall

- 14.
Kelsall JE: Discussion of Bayesian models for spatially correlated disease and exposure data. Bayesian Statistics 6. 1999, Oxford University Press, 131-56.

- 15.
Moran P: Notes on continuous stochastic phenomena. Biometrika. 1950, 37: 17-23.

## Acknowledgements

This work was supported by grants AFSSE (RD2004004) and INSERM-ATC (A03150LS). The first author was supported by INSERM (*poste Blanc*)

## Author information

## Authors’ original submitted files for images

Below are the links to the authors’ original submitted files for images.

## Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

## About this article

#### Received

#### Accepted

#### Published

#### DOI

### Keywords

- Root Mean Square Error
- Autocorrelation Structure
- Coverage Proportion
- Ecological Association
- Posterior Standard Deviation