We considered the following analyses: 1) calculation of the gross incidence rate at provincial level, as well as of standardized rates to facilitate comparison with other provinces of Apulia and the comparison with the mortality rates observed in Italy and other European nations; the reference period used for these analyses differs from that used for spatial risk estimation, for the reason that we were interested to compare patterns of temporal disease evolution in the province of Lecce with respect to other Apulian provinces as well, a task that requires a larger temporal window and that was possible due to the wide availability of data at provincial level; 2) spatial analysis to build a risk map based on the specification of an area-specific Poisson model, where the high-risk areas are identified on the basis of p-values associated with the null hypothesis of no-increased risk; 3) spatial analysis adjusted for the presence of spatial correlation and extra-Poisson variation in area-specific relative risks, using a Besag-Yorke-Mollié (BYM) model estimated by Markov Chain Monte Carlo (MCMC) methods in Winbugs; 4) adjustment for material deprivation by inserting an ecological covariate in the BYM model to take account of socio-economic score at areal level; 5) disease cluster detection using a new model-based Bayesian paradigm based, on a suitable modification of the BYM model.
The analyses were conducted separately for both sexes: the difference in terms of incidence and/or mortality between males and females is becoming less relevant, as witnessed also by numerous references in literature. The latest data even reduce the importance of smoking as a risk factor in males, emphasizing instead the large increase in the prevalence of female smokers [38]: [39] report data about the "epidemic" of lung cancer mortality among young women in Europe. As we do not have a priori reasons to reject the hypothesis that health effects of other risk factors could be quite different between the two sexes, the adoption of separate analyses appears to be appropriate.
Data
The data at the provincial level were obtained from the Cislaghi Italian mortality atlas based on Istat data [40], considering deaths occurred in Apulia during the period 1981–2001 for the class of disease indicated by code 162 on the ICD-9-CM classification (i.e.: malignant neoplasm of trachea, bronchus and lung, IXth Revision of the International Classification of Causes of Death, published in 1979 by the World Health Organization). To analyze the overall mortality dynamics, the reference period has been divided into seven triennials: 1981–83, 1984–86, 1987–89, 1990–92, 1993–95, 1996–98, 1999–01.
To estimate the spatial distribution of mortality, the number of deaths for lung cancer in each of the 97 municipalities in the province of Lecce was considered in the period between January 1st, 1992 and December 31th, 2001 inclusive. The data were obtained once again from the Cislaghi atlas: the choice of an extensive reference period was due as much to the necessity to not have information too scattered within each stratum in which the dataset was divided, as the impossibility to obtain the data, because of possible identity disclosures if the reference period was too short. It should also be noted that the areas actually analyzed are 96 in total; due to the abovementioned privacy issue, the Cislaghi atlas considers the neighbouring municipalities Racale and Taviano as a single administrative unit. To draw the maps, the areas of these two municipalities were aggregated, creating a single area to which was assigned the name of the Taviano municipality. These operations were carried out on a Geographic Information System (GIS) using the union operation: the population of the new area was simply assumed as the sum of the populations of both the municipalities.
Calculation of mortality rates for provincial data
Referring to data at provincial level, the crude mortality rate (M) was calculated from the following expression
where D
t
is the number of deaths observed for the cause of death considered in the period t, t +k (expressed in years, in our case k = 2 considering three years), while R
t
is the sum of the population at risk in the same period. Exploiting age-specific data in the expression of M
t
(both the numerator as well the denominator), a specific rate Mt, j, for j = 1,..., J, is obtained for each age-group considered: the division originally planned from the Cislaghi atlas includes the classes 0, 1–14, 15–34, 35–54, 55–64, 65–74, 75 +, although the first two have never been taken into account, because no deaths have ever been observed in either of the triennial considered.
To make internal and international comparisons possible, eliminating the influence of age structure on mortality, we calculated the standardized rate for each triennial with the direct method applied to the standard European population [41], i.e.
where w
j
is the relative weight of the j-th age group in the standard European population.
Spatial analysis 1: area-specific Poisson model
Let Y
ij
be the number of deaths observed within the i-th area and the j-th stratum in the period in question (i = 1,..., N, j = 1,..., J). As the analyses were conducted separately for males and females, the strata were constructed by classifying the cases in six broad age groups: 0–24, 25–39, 40–54, 55–69, 70–84, 85+. We can assume that, independently in each area and stratum, disease counts follow a Poisson distribution
where p
ij
is the mortality rate within the i-th area and j-th stratum, while N
ij
is the amount of person-years at risk in the time period considered (estimated with N
ij
= k
y
× n
ij
, where k
y
= 10 and n
ij
is the amount – provided by Istat – of resident population as of January 1st, 1997). Estimation of the N × J probabilities p
ij
is carried out by assuming that the proportionality relationship p
ij
= q
j
× θ
i
holds, where q
j
, j = 1,..., J, is a set of known stratum-specific reference rates, and each parameter θ
i
is an area-specific relative risk [42, 43]. Indirect standardization can account for effects attributable to differences in the confounder-specific populations: it can be carried out collapsing the strata, Y
i
= ∑
j
Y
ij
, to obtain the following saturated Poisson model for disease counts
where E
i
= ∑
j
N
ij
q
j
is the number of expected deaths in the i-th area. The choice of the stratum-specific reference rates q
j
is crucial [44]: we estimated each rate with q
j
= ∑Y
ij
/∑
i
N
ij
(internal standardization, [45, 46]); this approach centres the data with respect to the map, and the areas where there is an excess of risk are those in which the number of observed cases is higher than the number of expected cases.
P-value based maps
The obvious summaries of the Poisson model are given by the maximum likelihood estimates of each area-specific relative risk
, i = 1,..., N. For the calculation of the respective confidence intervals it can be seen that
, hence an estimate of the variance is given by
. Assuming that
has an approximately Gaussian distribution, a first order Taylor approximation leads to
; therefore, an approximate 95% confidence interval for log(θ
i
) is given by
. Back-transforming, we have the following approximate 95% confidence interval for
that may be used to summarize the significance of the statistical hypothesis of no increased risk within the i-th area H0: θ
i
= 1, against the alternative hypothesis H0 : θ
i
> 1 (if
> 1, otherwise the alternative hypothesis is H0 : θ
i
< 1 if
< 1).
The exact area-specific p-values associated to the null hypothesis of no increased risk H0 : θ
i
= 1 are given by
All of these values, for i = 1,..., N may be classified to draw a probability map, attributing to each area a colour level that denotes class membership [47]. At a significance level of α = 0.05, high-risk areas are those in which ρ
i
<α and
> 1 occur simultaneously.
It is worth noting that probability maps may not be very informative, as p-values alone do not give any information about the level of risk.
Spatial analysis 2: the Besag-York-Molliè (BYM) model
Apart the abovementioned issues, outcomes in spatial units are often not independent of each other. Risk estimates of areas that are close to each other will tend to be positively correlated as they share a number of spatially varying characteristics. Ignoring the overdispersion caused by spatial autocorrelation in the residual leads to incorrect inferences: in particular, an extreme value of ρ
i
may be more due to the lack of fit of the saturated Poisson model than to its deviation from the null hypothesis of neutral risk. This effect of overdispersion due to spatial autocorrelation is very strong only for small area (i.e. areas with very low populations), while is negligible for large municipalities [48].
As risk estimates of areas that are close to each other will tend to be positively correlated, if all such characteristics could be properly measured, then the model would be fully specified and residual spatial variation would be fully explained. However, this will never be the case and unmeasured spatial factors will introduce spatial dependence that can be described by introducing into the model suitable random effects. This is the reason why we considered a standard BYM model for a more refined analysis [49, 50]; we briefly recall the formulation of the hierarchical model:
-
1st level – Disease counts are assumed to be distributed as in the saturated Poisson model, and assumed to be conditionally independent given the relative risks θ
i
;
-
2nd level – The linear predictor assumes the following form
where α is a baseline log-relative risk, and v
i
and ε
i
are random effect representing spatial clustering (autocorrelation) and unstructured extra-Poisson variation respectively. Distributional forms commonly adopted at the second level for the two random effects are:
Clustering – v
i
is defined by the CAR (Conditional Autoregressive, [51]) specification
, where
, ∂
i
is the set of all areas neighbouring the i-th area, and n
i
is the number of elements within the set. Hence, the CAR effect describes spatially varying risk factors, based on which neighbouring areas tend to have similar relative risks. The specification of a CAR effect forces the estimates of area-specific log-relative risks toward a local mean (with an obvious smoothing effect and noise reduction of the map);
Heterogeneity – ε
i
is used instead to describe the sources of error not spatially structured: for this reason we hypothesize, as usual, the exchangeable specification
.
-
3rd level – Prior distributions of the parameters of the two random effects must be chosen [52]: let G(a, b) denote the Gamma distribution with expected value a/b and variance a/b2. For each of the two precision parameters,
and
, we set τ
v
~ Gamma(a
v
, b
v
) and τ
ε
~ Gamma(a
ε
, bε). In this paper we used a
v
= 0.5, b
v
= 0.005 for the spatially structured component [53], which corresponds to a diffuse prior that does not artificially force a spatial structure in the log-relative risk estimates when this is not actually present in the data. By simple calculations it can be proven that, with this choice, the standard deviation of spatially structured random effects is a random variable centred around 0.01, and the probability of observing values smaller than 0.01 or larger than 2.5 is equal to 0.01. For the non-spatially structured random effect we set a
ε
= 0.01, b
ε
= 0.01: these values correspond to a non-informative prior, and they were chosen on the ground of an empirical trade-off between the need to not use an overly informative prior (given the previous absence of knowledge on spatial distribution of mortality from lung cancer), and the care taken to avoid deterioration of the convergence of the estimation algorithm, a very common issue when a flat prior is used.
Prior to model estimation we tested SMRs for the presence of overdispersion and spatial autocorrelation, as the use of BYM model needs to be motivated if the evidence for spatial autocorrelation or extra-Poisson variation is not strong. We applied the following battery of tests: Pearson chi-square and Potthoff-Whittinghill's statistics for assessing homogeneity of relative risks [54]; Dean's overdispersion score for testing the presence of extra-Poisson variation versus the null hypothesis of Poisson distribution [55]; Moran's I and Geary's C tests to assess the presence of spatial autocorrelation, accounting for over-dispersion by assuming a negative-binomial distribution as the sampling model needed to compute the null distribution by means of parametric bootstrap [56].
Parameter estimation and disease mapping
Posterior estimates of the parameters were obtained by simulating from the joint posterior by means of a Markov Chain Monte Carlo (MCMC) algorithm, using the OpenBugs 3.0.3 software together with the GeoBugs 1.3 extension [57]; 6,000 burn-in iterations on two parallel chains starting from overdispersed values were simulated: the convergence was checked using the Brooks-Gelman-Rubin diagnostic, summarized with the
coefficient which tends to 1 in case the convergence is achieved [58]. Subsequently, another 3,000 iterations (for each chain) were generated: only one out of each three was considered for estimation, in order to eliminate the serial autocorrelation and to reduce the standard error estimate of the parameters. We monitored and estimated the area-specific relative risks θ
i
by means of the MCMC algorithm described in the previous section. Maps were drawn by dividing the whole range of relative risk posterior estimates in five non-overlapping sub-intervals delimited by quintiles, assigning a suitable colour to each interval and classifying areas accordingly.
Within the context of cluster detection, questions have been raised about the performance of the BYM model in recovering the true risk surface. For this reason, rather than insisting on the interpretation of relative risk posterior estimates, we supplemented our result by monitoring and estimating area-specific posterior probabilities δ
i
= E[I(θ
i
> 1)|Y] = Pr{θ
i
> 1|Y} (here I(•) denotes the event indicator function). Based on those new posterior area-specific summaries, maps were drawn dividing the interval [0,1] in ten equally spaced sub-intervals, and assigning a colour to each area accordingly. The resulting maps are likely to be insufficiently informative on the actual risk level (as well as p-value based maps are), but the may be indeed useful to confirm the presence of "hot-spots" of high-risk areas exploiting results given in a wide simulation experiment based on synthetic data, where a feasible benchmark for the risk calibration problem was provided [59]. The authors, defining three different loss function representing weighted tradeoffs between false positive and false negative, showed that by declaring at "high risk" those areas where
> 0.8, each area with relative risk below 2 was classified as high risk with a probability of at least 75% if the expected cases in each area are between 10 and 20; this probability approximates to 1 for areas with a relative risk of 3 if the expected cases are never less than 5.
Correction for edge effects
Even when disease counts are independent, any smoothing operation applied to SMRs that borrows information from neighbouring areas will induce edge effects, that consists in biased estimates in areas located near the boundary of the investigated region because information on what happen on the other side of the boundary is missing: a larger estimation variance will be also found in boundary areas due to the low proportion of neighbouring cases [60]. Because some putative sources of pollution are located outside the study area, and the estimation of large-scale patterns are likely to be affected by such statistical biases, it was necessary to adjust for edge effects. A classical methods is to employ guard areas, which are areas external to the main study window of interest and are added to the window to provide a guard area [60]: in this case, given the availability of mortality data for the whole Apulia, it was possible to estimate a global disease map in order to assess, without distortions, the presence of spatial patterns originating from the putative sources located around Brindisi and Taranto, and the large-scale structure of disease risk. In order to make comparisons between the two maps feasible, the stratum-specific reference rates q
j
for the Apulia map were set equal to those used for the Lecce map (external standardization): in this way, expected counts for areas located inside the province of Lecce resulted to be the same for both maps. We also considered internal standardization for the Apulia map; for the province of Lecce considered in isolation, this is essentially equivalent to shift upward posterior estimates obtained by external standardization, but it may be undoubtedly useful in order to compare the average disease risk level in the province of Lecce with that occurring in the whole Apulia. Spatial risk estimation was carried out by simulating 12,000 MCMC burn-in iterations, and subsequently another 6,000 iterations (for each chain) considering only one out of each three for estimation.
Spatial analysis 3: accounting for socio-economic factors
As we said, for many diseases there is a strong link between health and material deprivation, and pollution sources tend to be found in deprived areas. Hence, deprivation is a potential confounder, the role of which must be carefully examined during the analysis. This is the reason why we considered an ecological correlation model as well, in which the linear predictor of the BYM model is expanded in the following way
The area-specific deprivation measure x
i
considered here is the Cadum index [61]: this measure is well suited to the information flows available in Italy, and was calculated using Census data provided by Istat for the 1991 census. The amounts taken into consideration in constructing the index are as follows:
-
proportion of population with primary education;
-
proportion of rental housing;
-
proportion of occupied residences without bathroom;
-
proportion of the active workforce unemployed or looking for a first job;
-
proportion of single-parent families with children.
The index in question is constructed by calculating, for each area, the z score of each variable standardized with respect to the national average and to the national standard deviation, and then adding the five scores thus obtained: by design, higher scores are observed in poorer areas. A posterior summary of the importance of Cadum index in explaining the spatial pattern of disease risk can be obtained by monitoring and drawing residual spatial variation exp(v
i
+ ε
i
), an area-specific quantity that is adjusted to eliminate the net effect due to the level of material deprivation in the i-th area: the presence of a weak spatial pattern in residual spatial variation indicates a strong association with deprivation. Alternatively, the ecological coefficient β can be considered significantly positive if its posterior mean is greater then zero and its 95% percent credible interval excludes zero: this means that there is an actual risk increase in those areas where the level of poverty is higher. It is also interesting to note that when a covariate inserted in an ecological BYM model result to be appropriate to model the spatial variation of risk, random effects v
i
and ε
i
may become unidentifiable and care must be taken to avoid poor convergence and invalid inferences if an improper posterior is used [62].
Spatial analysis 4: cluster detection and inference
As we said, several authors noted that the smoothing effect of the BYM model results in a correct estimation of the spatial distribution of disease risk, but it renders almost impossible the detection of localized increases if these are not based on large local excesses [59, 63]. It may be the case that 95% relative risk credible intervals include unity in all areas, even when local risk excesses are actually present in the data. For this reason we considered a version of the BYM model, suitably modified to detect geographical clusters of disease and to confirm results obtained interpreting posterior summaries of the "pure" BYM model: if Δ is a collection of candidate zones, i.e. the set full set of possible clusters given a maximal dimension, conditionally to each cluster Z ∈ Δ the linear predictor is reformulated as
where a cluster-specific effect α
Z
enters the model as the coefficient of the dummy variable I(A
i
∈ Z), which assumes value 1 if area A
i
is in Z and 0 otherwise. In a like vein to the model-based approach introduced in [64], where an overdispersion parameter similar to the general overdispersion parameter in Poisson regression was used to reduce the effect of extra-Poisson variability on cluster location detection, our Bayesian formulation expressly addresses the problem that cluster detection nominal type I error and power of classical Scan Statistics are likely to be appreciably affected by the presence of spatial autocorrelation [65]. We propose the following sequential model-based algorithm for cluster detection (here by "cluster" we mean any collection of neighbouring areas):
-
Given an initial collection Δ of candidate zones, conditionally to Z, we fit our model for all Z ∈ Δ by suitably tuned MCMC simulations. The collection of fitted models is ranked (in terms of parsimony and predictive power) on the ground of the Deviance Information Criterion (DIC, which is a hierarchical modelling generalization of Akaike Information Criterion introduced in [66]: between two competing models the one that has a lower DIC score should be preferred, see the legend of Tab. 3 for further details), and the first cluster Z* is identified accordingly, by means of the cluster-specific indicator variable entering the model that has the lowest DIC value; a non-informative Gaussian prior is usually assumed for α
Z
;
-
Let
be posterior estimate of the cluster-specific effect: a new model containing an additional offset term accounting for the effect due to Z* is considered, treating I(A
i
∈ Z*) as an explanatory variable with known coefficient equal to
. Let Δ* be the collection of zones including Z* and every cluster overlapping with Z*: the previous step is iterated assuming Δ - Δ* as the collection of candidate clusters, and a second optimal cluster is identified in case;
-
The procedure stops when no better data explanation is possible by letting further cluster-specific terms enter the model: this is easily appreciated by means of the sequence of the DIC scores of the best model of each iteration, in the sense that the algorithm ends when such sequence becomes increasing.
It should be noted that cluster location detection and cluster inference are quite distinct: the above-described algorithm reduces the number of feasible clusters by comparing a large number of models on the ground of a data-analytic criterion. Anyway, declaring a cluster Z statistically "significant" is a different task: in our Bayesian approach this occurs when 95% posterior credible interval for the cluster-specific log-relative risk α
Z
excludes zero.