Robustness of the BYM model in absence of spatial variation in the residuals

Background In the context of ecological studies, the Bayesian hierarchical Poisson model is of prime interest when studying the association between environmental exposure and rare diseases. However, adding spatially structured extra-variability in the model fitted to the data when such extra-variability does not exist conditionally on the covariates included in the model (over-fitting) may bias the estimation of the ecological association between covariates and relative risks toward the null. In order to investigate that possibility, a simulation study of the impact of introducing unnecessary residual spatial structure in the estimation model was conducted. Results In the case where no underlying extra-variability from the Poisson process exists, the simulation results show that models accounting for structured and unstructured residuals do not underestimate the ecological association, unless covariates have a very strong autocorrelation structure, i.e., 0.98 at 100 km on a territory of diameter 1000 km."


Background
Ecological regression studies investigate potential association between geographical variation in disease rates (or counts) and environmental covariates. For example, a recent study evaluated the ecological association between indoor radon concentration and acute leukaemia incidence among children [1]. For rare diseases and/or small areas, Bayesian hierarchical Poisson model is commonly used where within-area variability of disease is modelled at the first stage as a Poisson process and ecological relationships between disease and covariates are introduced at the second stage of the hierarchical model. Spatially extra-Poisson variability potentially due to aggregated effect of unknown confounders is commonly taken into account through spatially structured residuals added in the second stage of the model.
In that context, the BYM (Besag, York and Mollié) model [2] is a standard model for estimation of the ecological associations. The overall variability of a health indicator is broken down into a random Poisson component, a spatially structured area-specific random effect and an unstructured random term, across geographic units. It has been extensively shown that not accounting for an actual spatial variability may lead to major biases [3].
Conversely, if the spatial variability of a health indicator is completely explained by that of environmental factors and the other ecological covariates taken into consideration, regression residuals do not have spatial structure. Modelling the spatial structure of residuals could then lead to a biased estimate of the ecological association via a phenomenon of over-fitting [4][5][6][7].
To the author's knowledge, the quantitative impact of fitting a model including extra-Poisson variability to analyse data generated by a model where such extra-Poisson variability does not exist conditionally on the covariates included in the model (over-fitting) has not previously been explicitly and quantitatively investigated. Robustness of residuals modelling as BYM was studied in a different inferential context. In the frame of an extensive investigation of the statistical performances of a number of spatial models, Lawson et al. [8] studied the performance of such models on relative risks estimates in the case of mapping modelling, i.e. without covariates, where different true spatial structures of residuals were simulated. The authors showed that BYM model performed well on risks estimations except when the true residuals were resulting from a mixture structure. Different models were compared to detect effect from a putative source on a regular lattice [9]. They concluded in particular that the introduction of a spatially structured area-specific random effect leads to much less bias in the parameter estimate and that BYM model is the least biased. Notably, biases in parameter estimation appear when the random effects are not acounted for. In the present study we focused on ecological association estimate when covariates with spatially structure are introduced. Such covariates are often of interest in epidemiology when environmental exposures are studied. In our work we focus on a particular model with France mainland as study domain. The aim of our study was then to determine the robustness of the BYM model in the absence of residual spatial variation, i.e., the impact on the ecological association estimate on an irregular domain. The estimates performances were discussed and characterized according to the covariates structure. Simulations protocols assumed systematically a Poisson model at the first stage of the hierarchical model and log linear relationship between the incidence of the disease and exposure at the second stage without addition of extra-Poisson residuals. Various spatial structures of exposure were considered. The explained spatial variability is thus fully specified/attributable by the covariate structure.
First, ecological models that do or do not allow spatially structured or unstructured heterogeneity will be considered. Various simulation protocols for parameter values enabling balanced or unbalanced between/within area variability will then be presented. The results of the various simulation protocols will then be considered in terms of their performances with regard to the estimation of eco-logical associations. The paper will conclude with a discussion.

Statistical Models
Let D be the study area of interest, partitioned into m geographic areas. The data consist in the observed Y i and expected E i disease counts for each area i, (i = 1,...,m). Let X i be an ecological variable of interest in area i. The ecological Poisson M 0 model is expressed in hierarchical form as follow: in which R i is the relative risk in area i. The second stage models the relationship between the relative risk and exposure variable. The ecological model M 0 does not include any spatially structured or unstructured heterogeneity.
In order to account for those variabilities, the BYM model [2] was proposed : in which δ i denotes the set of labels of the neighbours of area i, n i is the number of neighbours i, U i (i = 1,...,m) models the spatially-structured area-specific random effect based on the conditional autoregressive approach CAR [10], and V i (i = 1,...,m) is the unstructured random effect. The BYM model is the benchmark parametric model and is widely used in disease-mapping studies mainly because of the flexibility of the residuals.

Design of the simulation study
Processes X and Y are simulated on D under model M 0 accordingly to parameters values. The simulation parameters were selected with reference to the overall variability of the estimated relative risks, thus enabling realistic and reasonable values for relative risks. More precisely, let = Y i /E i be the maximum likelihood estimate of the relative risk for the area. If X i ~ N(0, 1), then: If the relative risks are spatially independent of the expected disease counts E i , of p corresponds to high between-area variability, that is a high amount of information with which to estimate the ecological link β. Hereafter, without any loss of generality, α will be considered equal to 0.
The geographic scale unit consisted in the 94 Departements of mainland France (Corsica excluded). The expected disease counts (E i ) consist in the expected cases of acute leukaemia in children aged less than 15 years for the period 1990-1998 in Departement i. The cases were retrieved from the French National Registry of Childhood Leukaemia and Lymphoma [11]. The expected numbers ranged from 4.2 to 204, with a harmonic mean of 23.35. Scenarios in which the within-area variance was either doubled ( = 46.6) or divided by 10 ( = 2.33) were also considered.
Given that X i ~ N(0, 1), within-area variance depends on 3 parameters, namely: the harmonic mean of expected disease counts , the ecological link β and the autocor- As X has a standardized normal distribution, the 2.5 (p 2.5% ) and 97.5 (p 97.5% ) percentiles of the relative risks are under model M 0 exp(α ± 1.96β) and their ratio Q = p 97.5% / p 2.5% is exp(2 × β × 1.96). The quantile ratios were considered equal to 1.0, 1.5, 2.0 and 3.0, equivalent to no effect, weak, moderate and strong effects, respectively. The corresponding values for β were 0.00, 0.12, 0.21 and 0.33. The proportions of between-area variance, p, by ecological link β and breakdown are summarized in Table 1. The autocorrelation of the exposure variable was also modulated. The following exponential autocorrelation structure was considered: cov( The estimations were made with BRugs [12] software. For each data set, a burn-in of 5000 iterations was used and Bayesian inferences were based on 45000 iterations from  (1) β: Ecological link (2) : Harmonic mean of expected disease counts  was used to test whether the π(β) (or 1 -π(0)) values were significantly different. The over-fit of the BYM model (compared to the M 0 model) was assessed via that criterion.

Simulation Results
Simulations results are structured as follow: firstly, we present results for an harmonic mean = 23.35 of expected disease counts equal to those from acute leukae-

Moderate covariate autocorrelation
The first scenario considered = 23.35 and the autocorrelation equal to 0.0 or 0.4; the results are shown in Table  2. In that setting, the between-area variance varied from 0 to 71%. In the absence of any spatial structure for X (ρ xx = 0), the estimate of β was unbiased, irrespective of the estimation model. The mean bias was less than 1% and the RMSE was less than 0.02 for the four values of the ecological link (for both models). The coverage proportions π(β) were similar for the two models and greater than 94.5%. The coverage proportion of the BYM model was consistently slightly greater than that of the M 0 model. For β = 0, π(β) was close to 95% and, equivalently, 1 -π(0) was close to 5%. The BYM model thus handles the scenario in which the covariate has no spatial structure.
When the autocorrelation was increased to 0.4, the results were similar to those with the previous setting. The mean bias was less than 1%. There was a slight increase in the RMSE but it remained less than 0.02. The β coverage proportion with the BYM model was greater than that with the M 0 model. The non-coverage proportion was equal to 1, except when there was no association (β = 0). The noncoverage proportions were significantly different when β = 0. The proportion the closest to 5% was obtained with the M 0 model. Irrespective of the value of β, the variability of β was always slightly over-estimated with the BYM model. The coverage proportion of the M 0 model varied from 94.5 to 95.8% (96.8 to 97.5% for the BYM model).
In the absence of, or with moderate, autocorrelation, the over-fitting effect was not observed. Both models provided an almost unbiased estimate of the ecological link.
High autocorrelations thus appear to influence the overfitting effect of the BYM model. The expected disease counts, also modulates the overall accuracy of the estimation. In fact, with highly correlated spatial structure and low disease counts, the bias of the β estimate generated by the BYM increases. But, even with a highly autocorrelated covariate and adequate disease counts, when E i is doubled, the BYM model estimates the ecological link with little bias.

Discussion
A simulation study was conducted in order to assess estimation performance with respect to the ecological association between covariates and health indicators. Key parameters, such as the ecological link, expected disease counts and autocorrelation strength were selected to ensure that the simulation covered realistic situations. The choice of parameters enabled coverage of balanced and unbalanced between-and within-area variabilities.
For moderate autocorrelation structures, both the Poisson model and the BYM model performed well and the estimation performances were similar. Underestimation of ecological links was only observed for high autocorrelations. Overall, the posterior standard deviation of β was slightly over-estimated with the BYM model, resulting in conservative results when the true value of β was null.
The expected disease counts are also of interest because, with a high autocorrelation, the underestimation of the BYM model is present. In practice, this worst-case scenario can nonetheless be found. Except for the extreme scenario, strong spatial structure and low disease counts, both models perform well, even with strong spatial structure. As a consequence, the BYM model can be used to estimate ecological associations without fearing underestimation. The simulation results show that models accounting for structured and unstructured residuals do not underestimate materially the ecological association. The rational is the following: not accounting for an actual spatial variability leads to strong bias. Thus from a practical point of view, the BYM model should be preferred to the Poisson if spatial autocorrelation of covariate is suspected. Moreover, autocorrelation structure will be first investigated via Moran's I test [15].