# Geostatistical analysis of disease data: visualization and propagation of spatial uncertainty in cancer mortality risk using Poisson kriging and p-field simulation

- Pierre Goovaerts
^{1}Email author

**5**:7

https://doi.org/10.1186/1476-072X-5-7

© Goovaerts; licensee BioMed Central Ltd. 2006

**Received: **29 December 2005

**Accepted: **09 February 2006

**Published: **09 February 2006

## Abstract

### Background

Smoothing methods have been developed to improve the reliability of risk cancer estimates from sparsely populated geographical entities. Filtering local details of the spatial variation of the risk leads however to the detection of larger clusters of low or high cancer risk while most spatial outliers are filtered out. Static maps of risk estimates and the associated prediction variance also fail to depict the uncertainty attached to the spatial distribution of risk values and does not allow its propagation through local cluster analysis. This paper presents a geostatistical methodology to generate multiple realizations of the spatial distribution of risk values. These maps are then fed into spatial operators, such as in local cluster analysis, allowing one to assess how risk spatial uncertainty translates into uncertainty about the location of spatial clusters and outliers. This novel approach is applied to age-adjusted breast and pancreatic cancer mortality rates recorded for white females in 295 US counties of the Northeast (1970–1994). A public-domain executable with example datasets is provided.

### Results

Geostatistical simulation generates risk maps that are more variable than the smooth risk map estimated by Poisson kriging and reproduce better the spatial pattern captured by the risk semivariogram model. Local cluster analysis of the set of simulated risk maps leads to a clear visualization of the lower reliability of the classification obtained for pancreatic cancer versus breast cancer: only a few counties in the large cluster of low risk detected in West Virginia and Southern Pennsylvania are significant over 90% of all simulations. On the other hand, the cluster of high breast cancer mortality in Niagara county, detected after application of Poisson kriging, appears on 60% of simulated risk maps. Sensitivity analysis shows that 500 realizations are needed to achieve a stable classification for pancreatic cancer, while convergence is reached for less than 300 realizations for breast cancer.

### Conclusion

The approach presented in this paper enables researchers to generate a set of simulated risk maps that are more realistic than a single map of smoothed mortality rates and allow the propagation of cancer risk uncertainty through local cluster analysis. Coupled with visualization and querying capabilities of geographical information systems, animated display of realizations can highlight areas that depart consistently from the general behavior observed across the region, guiding further investigation and control activities.

## Background

Cancer mortality maps are used by public health officials to identify areas of excess and to guide surveillance and control activities. Quality of decision-making thus relies on an accurate quantification of risks from observed rates which can be very unreliable when computed from sparsely populated geographical entities or for diseases with a low frequency of occurrence [1]. Smoothing methods have been developed to improve the reliability of these estimates by borrowing information from neighboring entities [2, 3]. These methods range from simple deterministic techniques (i.e. head banging method [1]) to sophisticated full Bayesian models [4, 5]. Empirical Bayes smoothers [6, 7] and Poisson kriging [8] provide model-based approaches with intermediate difficulty in terms of implementation and computer requirements. Although simulation studies [7, 8] have demonstrated the benefit of smoothing methods for risk prediction, some uncertainty will always be associated with the estimated risk. In Poisson kriging the uncertainty about the risk within a given geographical entity is modeled by computing a minimum error variance (kriging) estimate of the risk and the associated error variance, which can then be combined to derive a Gaussian-type confidence interval. Full Bayes models go one step further and yields the full posterior distribution of the risk while accounting for the uncertainty in the parameters of the model.

Most studies do not make use of the uncertainty measure provided by smoothing methods, and only the map of smoothed rates is reported and used in the analysis. This is unfortunate since all rate smoothers, including Poisson kriging, cause the loss of local details of the spatial variation of the risk. This smoothing potentially affects the subsequent analysis, leading for example to the detection of larger aggregates of low or high cancer risk in local cluster analysis while most spatial outliers are filtered out [9]. To depict the uncertainty attached to risk maps, some authors recommend mapping the 95^{th} percentile range of the posterior distribution of risk values or the probability that the risk in each entity exceeds a specific threshold of interest [4, 10, 11]. Richardson *et al.* [12] proposed to use this probability of exceedence to decide whether an area should be classified as having an excess risk of cancer. They discussed different decision rules or loss functions which represent weighted trade-offs between false-positive results (i.e. declaring an area as having elevated risk when in fact its underlying true risk equals the background level) and false-negative results (i.e. declaring an area to be in the background when in facts its underlying risk is elevated).

A major weakness of the uncertainty measures reported in today's health science literature is that they are area-specific, that is they inform on the uncertainty prevailing over a single geographical entity at a time. Except if the risk values are spatially independent, the probabilities of excess of a specific threshold computed for several entities do not provide a measure of uncertainty about whether these entities jointly exceed that threshold. In addition to a measure of "local" uncertainty, one thus needs to assess the "spatial" uncertainty, that is the uncertainty attached to the spatial distribution of risk values across the study area. The quantification of spatial uncertainty is particularly important for cluster detection, since the focus is on risk values for a group of geographical entities considered simultaneously. This information is not conveyed by a statistic map of the estimated risk and the associated prediction variance.

Spatial uncertainty modeling has been one of the most vibrant areas of research in geostatistics for the last 2 decades [13–15]. Applications, such as modeling the migration of pollutants in the subsurface environment, require measures of multiple-point uncertainty, such as the probability of occurrence of a string of high or low permeability values that may represent a flow path or flow barrier [16]. These joint probabilities are assessed numerically from a set of realizations of the spatial distribution of attribute values over the locations of interest. In other words, the spatial uncertainty is modeled through the generation of a set of equally-probable simulated maps, each consistent with the information available, such as histogram or a spatial correlation function. Stochastic simulation allows generation of maps that reproduce the spatial variability of the data without smoothing effect [17]. The set of simulated maps can also be used to propagate the uncertainty through spatial operators or transfer functions. For example, the set of simulated permeability maps can be fed into a flow simulator, yielding a distribution of response values, such as travel times to the water table [18].

From the user's perspective, it is important to be able to visualize the uncertainty in the spatial model of risk values. Although stochastic simulation offers a way to generate a large number of potential scenarios, the burden of manually scrolling through hundreds of different maps will test the patience of most users and be little informative. The information contained in the set of simulated maps is thus often summarized through a **static display** of probabilities of exceeding particular threshold or some measures of the spread of the posterior distribution [13]. By doing so, one however fails to depict the uncertainty about spatial features and essentially maps the area-specific measures of uncertainty provided by kriging and other smoothing methods. To depict visually the spatial uncertainty, several authors [19, 20] have developed algorithms that show the realizations one at a time in rapid succession, like the frames of an animated cartoon, thereby eliminating the burden of manually displaying each simulated map. Like an animated cartoon, successive realizations must be similar enough to allow the eye to catch gradual changes. Such a similarity can be achieved by ranking the realizations appropriately or by using a simulation algorithm, called p-field simulation [21–23], that generates realizations that are incrementally different. The **animated display** of realizations allows one to distinguish areas that remain stable over all realizations (low uncertainty) from those where large fluctuations occur between realizations (high uncertainty).

This paper presents the first application of geostatistical simulation to model the spatial uncertainty attached to cancer risk values. This approach combines Poisson kriging and p-field simulation to generate multiple realizations of the spatial distribution of cancer risk values. These simulated maps are then fed into a local cluster analysis, allowing one to assess how uncertainty about the spatial distribution of risk values translates into uncertainty about the location of spatial clusters and outliers in cancer risks. The simulation approach is also used to generate a series of realizations that are incrementally different and animations are created using the Space-Time Information System (STIS) technology [24]. A public-domain executable for generating multiple risk maps, along with example datasets, is provided. This novel methodology is applied to the modeling, visualization and propagation of the spatial uncertainty of age-adjusted breast and pancreatic cancer mortality risks for white females in 295 US counties of the Northeast (1970–1994).

## Methods

### Data

### Poisson kriging

The geostatistical methodology for the estimation of risk values from empirical frequencies, and its performance relative to common smoothers, is described in details in Goovaerts [8]. This section briefly reviews its most salient features. For a given number *N* of entities (e.g. counties, states, electoral ward), denote the observed mortality rates as *z*(**u**_{
α
}) = *d*(**u**_{
α
})/*n*(**u**_{
α
}), where *d*(**u**_{
α
}) is the number of recorded mortality cases and *n*(**u**_{
α
}) is the size of the population at risk. These entities are referenced geographically by their centroids (or seats) with the vector of spatial coordinates **u**_{
α
}= (x_{
α
},y_{
α
}). The disease count *d*(**u**_{
α
}) is interpreted as a realization of a random variable *D*(**u**_{
α
}) that follows a Poisson distribution with one parameter (expected number of counts) that is the product of the population size *n*(**u**_{
α
}) by the local risk *R*(**u**_{
α
}).

The risk estimate over a given entity with centroid **u**_{
α
}, and the attached prediction variance, are computed from *K* neighboring observed rates as:

where *C*_{R}(**u**_{i}-**u**_{
α
}) is the covariance between the rate measured at **u**_{i} and the risk estimated at **u**_{
α
}; the spatial proximity is here defined in terms of Euclidian distance between the geographical centroids of the corresponding counties. The kriging weights *λ*_{
i
}(**u**_{
α
}) are the solution of the following system of linear equations, known as the "Poisson Kriging" (PK) system [28, 29]:

where *δ*_{ij} = 1 if **u**_{i} = **u**_{j} and 0 otherwise, and *m** is the population-weighted mean of the rates. The term *μ*(**u**_{
α
}) is a Lagrange parameter that results from the minimization of the estimation variance subject to the unbiasedness constraint on the estimator. The addition of an "error variance" term, *m**/*n*(**u**_{i}), for a zero distance accounts for variability arising from population size, leading to smaller weights for less reliable data (i.e. measured over smaller populations).

The computation of kriging weights and kriging variance requires knowledge of the covariance of the unknown risk, *C*_{
R
}(**h**), or equivalently its semivariogram *γ*_{
R
}(**h**) = *C*_{
R
}(0)-*C*_{
R
}(**h**). Following Monestiez *et al.* [28, 29] the semivariogram of the risk is estimated as:

where N(**h**) is the number of pairs of county centroids separated by a vector **h**. The different pairs [*z*(**u**_{
α
})-*z*(**u**_{
α
}+**h**)] are weighted by the corresponding population sizes to homogenize their variance. A permissible model, *γ*_{
R
}(**h**), is then fitted to the experimental semivariogram, i.e. using weighted least-square regression [30] in this paper.

### Modeling local and spatial uncertainty

The uncertainty about the cancer mortality risk prevailing within the county with centroid **u**_{
α
}can be modeled using the conditional cumulative distribution function (ccdf) of the risk variable *R*(**u**_{
α
}) defined as:

where *G*(.) is the cumulative distribution function of the standard normal random variable. The notation "|(*K*)" expresses conditioning to the local information, say, *K* neighboring observed rates. The function (5) gives the probability that the unknown risk is no greater than any given threshold *r*. It is modeled as a Gaussian distribution with the mean and variance corresponding to the Poisson kriging estimate and variance. Measures of the spread of the distribution (i.e. variance or interquartile range), as well as the probability of exceeding any given threshold, are easily computed and mapped [[13], p. 358–361].

**u**

_{ α }),

*α*= 1,...,

*N*}, stochastic simulation aims to generate a set of

*L*equally-probable realizations of the spatial distribution of risk values, {

*r*

^{(l)}(

**u**

_{ α }),

*α*=1,...,

*N*;

*l*= 1,...,

*L*}, each consistent with the spatial pattern of the risk as modeled using the function

*γ*

_{ R }(

**h**). Simulation of spatial phenomena can be accomplished using a growing variety of techniques that differ in the underlying random function model, the amount and type of information that can be accounted for, and the computer requirements [13]. The main difficulty for the current application is that there is no measured risk data; only a semivariogram indirectly computed from observed rates according to expression (4) is available. The lack of target histogram is not an issue for the p-field simulation approach [21–23]. The basic idea is to generate a realization {

*r*

^{(l)}(

**u**

_{ α }),

*α*= 1,...,

*N*} through the sampling of the set of ccdfs by a set of spatially correlated probability values {

*p*

^{(l)}(

**u**

_{ α }),

*α*= 1,...,

*N*}, known as a probability field or p-field. Since the probability distributions are Gaussian (Equation 5), each risk value is simulated as:

*r*^{(l)}(**u**_{
α
}) = *F*^{-1}(**u**_{
α
}; *p*^{(l)}(**u**_{
α
}) | (*K*)) =
(**u**_{
α
}) + *σ*_{
PK
}(**u**_{
α
})*y*^{(l)}(**u**_{
α
}) (Equation 6)

where *y*^{(l)}(**u**_{
α
}) is the quantile of the standard normal distribution corresponding to the cumulative probability *p*^{(l)}(**u**_{
α
}).

#### Generating the probability field

The *L* sets of random deviates or normal scores, {*y*^{(l)}(**u**_{
α
}), *α* = 1,...*N*}, are generated using non-conditional sequential Gaussian simulation which proceeds as follows (see [13], p. 380 for more details):

1. Define a random path (i.e. using a random number generator) visiting each county centroid **u**_{
α
}only once.

2. At each location **u**_{
α
}determine the mean and variance of the Gaussian probability distribution of *y*-values as:

where *y*^{(l)}(**u**_{i}) are normal scores simulated at locations previously visited along the random path and located within a search radius from **u**_{
α
}, and *C*(**u**_{i}-**u**_{
α
}) is the covariance function of the normal score variable Y for the separation vector **h**_{iα}= **u**_{i}-**u**_{
α
}. In this paper, *C*(**h**) was identified to the covariance function of the risk after rescaling to a unit sill, i.e. *C*(**h**) = 1-*γ*_{
R
}(**h**)/*C*_{
R
}(0). The *λ*_{i} are kriging weights obtained by solving the following system of linear equations (simple kriging, SK):

3. Draw a simulated value from the Gaussian ccdf and add it to the data set. In other words, the simulated value at **u**_{
α
}is *y*^{(l)}(**u**_{
α
}) =
(**u**_{
α
})+*σ*_{
SK
}(**u**_{
α
}) × *G*^{-1} [*p*^{(l)}], where *p*^{(l)} is a random number between 0 and 1.

4. Proceed to the next location along the random path, and repeat the two previous steps.

5. Loop until all *N* locations (i.e. N = 295 here) are simulated.

The procedure is repeated using a different random path and set of random numbers to generate another realization. Note that the mean and variance of the set of simulated normal scores can sometimes deviate substantially from zero and one, respectively. Such discrepancies between model and realization statistics are referred to as ergodic fluctuations [[13], p. 426]. These fluctuations can be important when the range of the semivariogram model is large with respect to the size of the simulated area, in particular when the nugget effect is small [[31, 32] p. 128–133]. In the program **pfield.exe** described below, each set of simulated normal scores is rescaled by its mean and variance to fulfill the underlying assumptions of a standard normal random function.

#### Creating animated displays of realizations

One interesting feature of the p-field simulation approach is its ability to generate easily a series of realizations that are incrementally different. The algorithm, originally proposed by Srivastava [19] for the simulation of raster grids, requires the generation of a probability field that is much larger than the area to be simulated. Multiple realizations are then generated by shifting slightly the probability field before sampling the set of ccdfs. This small shift ensures that, albeit different, the probabilities used to sample the ccdfs from one realization to the next are very close to each other, leading to gradual changes between successive realizations. These realizations can be used as consecutive frames in an animation, enabling a dynamic and evolving display of the uncertainty attached to the spatial distribution of attribute values. A similar approach is here implemented for the simulation of risk values over the non-gridded set of county centroids; see the *Results and Discussion* Section.

### Local cluster analysis and propagation of uncertainty

Once the rates have been geostatistically filtered or simulated, classical Exploratory Spatial Data Analysis (ESDA) techniques can be applied to investigate the existence of local clusters or outliers of high or low cancer risk values. For example, the local Moran test evaluates local clustering or spatial autocorrelation [33]. Its null hypothesis is that there is no association between rates in neighboring geographical units; i.e. counties in this paper. The working (alternative) hypothesis is that spatial clustering exists. For each county, the so-called LISA (Local Indicator of Spatial Autocorrelation) statistic is computed as:

**u**

_{ α }) is the kriged risk for the county being tested, which is referred to as the "kernel" hereafter; (

**u**

_{ j }) are the risk values for the

*J*(

**u**

_{ α }) neighboring counties that are here defined as units sharing a common border or vertex with the kernel

**u**

_{ α }(1-st order queen adjacencies). All values are standardized using the mean

*m*and standard deviation

*s*of the set of risk estimates. Since the standardized values have zero mean, a negative value for the LISA statistic indicates a negative local auto-correlation and the presence of spatial outlier where the kernel value is much lower (higher) than the surrounding values. Cluster of low (high) values will lead to positive values of the LISA statistic.

In addition to the sign of the LISA statistic, its magnitude informs on the extent to which kernel and neighborhood values differ. To test whether this difference is significant or not, a Monte Carlo simulation is conducted, which traditionally consists of sampling randomly and without replacement the global distribution of rates (i.e. sample histogram) and computing the corresponding simulated neighborhood averages. This operation is repeated many times (e.g. *M* = 999 draws) and these simulated values are multiplied by the kernel value to produce a set of *M* simulated values of the LISA statistic at location **u**_{
α
}. This set represents a numerical approximation of the probability distribution of the LISA statistic at **u**_{
α
}, under the assumption of spatial independence. The observed statistic (Equation 10) is compared to the probability distribution, enabling the computation of the probability of not rejecting the null hypothesis. The so-called *p*-value is compared to the significance level *α* chosen by the user and representing the probability of rejecting the null hypothesis when it is true (Type I error). Every county where the *p*-value is lower than the significance level is classified as a significant spatial outlier (HL: high value surrounded by low values, and LH: low value surrounded by high values) or cluster (HH: high value surrounded by high values, and LL: low value surrounded by low values). If the *p*-value exceeds the *α* level, the county is declared non-significant (NS).

Instead of conducting the local cluster analysis on the set of smooth risk estimates, one of the key ideas of this paper is to apply the algorithm to the set of simulated risk maps, yielding a set of classifications of counties into significant clusters and outliers, or non significant units. The ensemble of classified maps can be used to compute the probability that a county belongs to each of the five classes. This county is then allocated to the class with the highest probability of occurrence (maximum likelihood classification).

Any prior information on the spatial pattern of mortality risks could be included directly into the randomization procedure underpinning the test of significance of local cluster analysis using the approach proposed by Goovaerts and Jacquez [9]. In other words, the null hypothesis of spatial randomness and uniform risk, corresponding to the operation of randomization of the outcomes, is replaced by models that are more realistic. Realizations of the so-called "Neutral Models" are also generated using geostatistical simulation. The approach allows the identification of spatial patterns above and beyond that incorporated into the neutral model, enabling, for example, the identification of "hot spots" beyond background variation in a pollutant or the detection of clusters beyond regional variation in the risk of developing cancer [34].

### Software

Poisson kriging, including the estimation and modelling of the semivariogram of the risk, was conducted using the public-domain executable **poisson-kriging.exe** described in Goovaerts [8]. The data and parameter file used for the estimation of pancreatic cancer risks are provided with the paper (additional file 1: pancreas.dat, additional file 2: poisson-kriging.par). Probability fields were generated using a modified version of the code **sgsim.exe** from the public-domain Geostatistical Software Library Gslib [32]. This modified code allows simulation over an irregular grid of county centroids, and the random deviates are directly combined with the output of Poisson kriging to generate simulated risk values. The executable of this p-field code is provided with the paper (additional file 3: **pfield.exe**), along with the parameter file used for pancreatic cancer (additional file 4: pfield.par). Local cluster analysis and animation of results were performed using the Space-Time Information System (STIS) technology [24].

**pfield.par**, used to generate 100 realizations of the spatial distribution of pancreatic cancer mortality risk values over 295 county centroids. The text file includes the following information:

• Name of the text file including the Poisson kriging risk estimates and variances. This file is one of the output files of **poisson_kriging.exe.**

**pfield.par**,

**100pancreas-pfield.out**, is shown in Figure 3.

• Name of the text file including the ID field and spatial coordinates of the centroids of the geographical units to be simulated.

• The column numbers for the observation identification code (i.e. FIP county), and the variables with the spatial coordinates.

• The number of geographical units to simulate.

• The number of realizations to generate.

• Random number seed to initiate the random number generator. This number should be a large odd integer.

• The maximum number of previously simulated values to use in the computation of kriging weights and kriging variance (parameter *K* in Equations 7–9). These values are selected according to their covariance with the node being simulated, which accounts for both distance and direction in presence of anisotropy. Values beyond the range of autocorrelation (i.e. zero covariance) are not selected since their weight is systematically zero in simple kriging.

**poisson_kriging.exe**; see example in Figure 4.

## Results and discussion

### Semivariograms of cancer mortality risk

The semivariograms of the risk are much better structured and have smaller sills than the corresponding semivariograms of the rates. This is expected since the weights in expression (4) attenuate the influence of extreme rates computed from small population sizes and subtraction of the correction term *m** reduces the variance even more. This effect is more pronounced for pancreatic cancer: the relative decrease in sill value is larger than for breast cancer, and the risk semivariogram has a much longer range of autocorrelation than the traditional estimator that is almost pure nugget effect (i.e. no spatial correlation). Since pancreatic cancer is less frequent than breast cancer, its mortality rates are more likely to be impacted by the small number problem and display higher levels of noise. In particular for breast cancer the semivariogram of the risk displays a slight anisotropy, with smaller variability (red curve) along the NE-SW direction for small distances.

### Poisson kriging

*α*= 0.05 was used and the

*p*-values were corrected for multiple testing using the Simes adjustment [36]. The analysis of risk estimates indicates significant low-low clusters (LL) in West Virginia, and part of Pennsylvania for pancreatic cancer. Significant clusters of high risks (HH) are found around New York and Boston for both cancers.

Results of local cluster analysis of mortality rate and risk maps. Number of counties classified as significant spatial clusters or outliers for breast and pancreatic cancers on the original mortality rate maps, kriged risk maps, and on average over the 500 simulated risk maps.

Maps | HH cluster | LL cluster | HL Outlier | LH outlier |
---|---|---|---|---|

Breast cancer | ||||

Mortality rates | 20 | 39 | 2 | 1 |

Risk estimates | 35 | 50 | 0 | 0 |

Simulation | 37.2 | 48.2 | 0.18 | 0.07 |

Pancreatic cancer | ||||

Mortality rates | 5 | 16 | 7 | 0 |

Risk estimates | 42 | 60 | 0 | 0 |

Simulation | 44.5 | 51.0 | 0.24 | 0.08 |

### Stochastic simulation of risk values

A shortcoming of the local cluster analysis in Figure 8 is that it ignores the uncertainty attached to the predicted risks. For example, the large cluster of low mortality risk for pancreatic cancer should be interpreted cautiously because of the large prediction variance in the Southern part of the study area. One notices also the absence of spatial outliers for both cancers which, to some extent, results from the smoothing of high and low values by kriging. This conditional bias (i.e. overestimation of low values and underestimation of high values) might cause some counties to be wrongly classified as non-significant too.

*Methods*Section. Figures 9 and 10 show three realizations and the results of the corresponding local cluster analysis. The simulated maps are more variable than the kriged risk maps of Figure 6, yet they are smoother than the maps of potentially unreliable rates of Figure 1. Differences among realizations depict the uncertainty attached to the risk maps. For example, in Maine and Northwestern part of New-York state the simulated risk for pancreatic cancer takes a wide range of values, emphasizing the lack of reliability of high risk estimates in these regions. Differences between realizations are much smaller for breast cancer, which is expected given the higher frequency of this disease leading to more reliable risk estimates.

*a priori*variance [38], can be identified with the mean of the risk estimates and the sill of the risk semivariogram model, respectively. Figure 11 shows the histogram of these two statistics computed from the set of 500 simulated maps. The

*a priori*variance corresponds to the sill of the model fitted to the directional semivariograms computed for each set of simulated risk values; this statistic is a better measure of dispersion than the sample variance since it accounts for the spatial correlation between risk values (i.e. variance of data separated by a distance larger than the range of autocorrelation). The black dot in the box plot below each histogram is the target value, i.e. the average estimated risk or the sill of the risk semivariogram model. Five vertical lines are the 0.025 quantile, lower quartile, median, upper quartile, and 0.975 quantile of the distribution. The simulated maps reproduce, on average, reasonably well the two target statistics, in particular the target mean. The coefficient of variation is smaller for breast cancer relative to pancreatic cancer. Besides the reproduction of the sill, reproduction of the shape of the target semivariogram model should also be assessed. The directional semivariograms of simulated risk values were computed for each of the 500 realizations, and their average is superimposed on the target model (solid curve) at the bottom of Figure 11. For both cancers the simulated risk maps very well reproduce the spatial anisotropy; the sill is slightly underestimated in the realizations of pancreatic mortality risks, which agrees with the underestimation of the target

*a priori*variance noticed on the histogram. Discrepancies between the realization and target semivariograms (i.e. ergodic fluctuations) in this application are, to a large extent, explained by the large range of autocorrelation (up to 800 km) with respect of the size of the study area, as well as the small nugget effect of the risk semivariogram models. Yet, comparison of Figures 11 and 7 shows that simulated maps better reproduce the spatial variability of risk values, as modeled by the semivariogram function

*γ*

_{ R }(

**h**), than the corresponding kriged maps.

### Propagation of uncertainty through local cluster analysis

The classified maps in the right column of Figures 9 and 10 illustrate how the uncertainty about risk values translates into uncertainty about the results of the local cluster analysis. From one realization to another, the shape and position of local clusters can change substantially. For example, the southern part of the cluster of low risk values detected on the kriged map for pancreatic cancer becomes non significant on some realizations; see Figure 10 (bottom right graph). The location and nature of clusters is much more stable for breast cancer. Table 1 indicates that slightly more counties are classified as significant clusters when the analysis is performed on kriged risks instead of simulated values. A few counties are also now classified as significant outliers, which confirms the potential bias of an analysis based on smoothed rates.

### How many realizations are needed?

### Visualization of spatial uncertainty

## Conclusion

Capitalizing on the abundant geostatistical literature devoted to the modelling of local and spatial uncertainty plus the recent development of Poisson kriging, this paper presented a novel approach, and the corresponding computer code, to generate realizations of the spatial distribution of risk values. P-field simulation proceeds in two steps. First, the local uncertainty about the risk prevailing within each geographical unit is modelled from the observed mortality rates and population at risk using Poisson kriging. The set of local probability distributions is then sampled using a set of spatially correlated probability values generated using non-conditional sequential Gaussian simulation. Through sampling by slowly moving probability fields, the approach also allows the creation of incrementally different realizations that can be used as consecutive frames in an animation, enabling a dynamic and evolving display of the uncertainty attached to the spatial distribution of risk values.

Simulated risk maps reproduce the spatial variability of the risk, thereby overcoming the smoothing effect inherent to all current procedures for stabilization of rate data, including Poisson kriging. In fact, the map of kriged risk is but the average of all simulated risk maps, while the kriging variance corresponds to the variance of the local distribution of simulated risk values. The probability of occurrence of multi-point or spatial features, such as aggregates of counties with low or high cancer mortality risk, can also be assessed numerically from the ensemble of realizations. This information can be conveyed through static maps of spatial clusters and outliers, using a color code indicating the likelihood of the classification, or through the dynamic display of the set of classified maps.

The analysis of breast and pancreatic cancer mortality rates illustrated the potentialities of the approach for modelling and propagating their spatial uncertainty through local cluster analysis. Geostatistical simulation generated risk maps that are more variable than the smooth risk map estimated by Poisson kriging and reproduce better the spatial pattern captured by the risk semivariogram model. Differences between realizations were particularly important for the less frequent pancreatic cancer, reflecting for example the uncertainty attached to higher risk estimates in sparsely populated counties of Maine and North-western New-York. Such uncertainty translates into a lack of reliability of some of the spatial clusters and outliers detected on the map of smoothed rates. The case study showed how geostatistical simulation permits the quantification of the likelihood of spatial clusters and outliers detected using Moran's I. Local cluster analysis of the set of simulated risk maps leads to a clear visualization of the lower reliability of the classification for pancreatic cancer versus breast cancer: only a few counties in the large cluster of low risk detected in West Virginia and Southern Pennsylvania are significant over 90% of all simulations. On the other hand, the cluster of high breast cancer mortality in Niagara county, detected after application of Poisson kriging, is significant for 60% of simulated risk maps. Sensitivity analysis shows that 500 realizations are needed to achieve a stable classification for pancreatic cancer, while convergence is reached after fewer than 300 realizations for breast cancer.

Maps of cancer incidence as well as mortality rates are frequently used as input to disease clustering procedures whose purpose is to identify local areas of excess and deficit. Most users recognize that rates recorded for small populations are uncertain, leading to the routine application of smoothing methods prior to the analysis. Yet, in their interpretation of the risk maps, they tend to forget that the risks themselves are prone to uncertainty and that some features, such as large clusters of low or high risk, might just be artifacts of the smoothing method. Ignoring such uncertainty can lead to misallocation of resources to investigate unreliable clusters of high risk, while areas of real concern might go undetected. Instead of a single, and potentially misleading, map of smoothed rates, the simulation approach provides public health officials with a set of plausible scenarios for the spatial pattern of risk. Those permit investigating the impact of risk uncertainty on alternate intervention and control activities. In the future, the concept of stochastic simulation of mortality risks will be combined with the neutral model methodology [9] to assess geographic clustering using appropriate null hypotheses that account for the spatial correlation and background variation modeled from the observed rates and any ancillary information (i.e. exposure model).

## Declarations

### Acknowledgements

This research was funded by grants 1R43CA110281-01 and R44-CA092807 from the National Cancer Institute. The views stated in this publication are those of the author and do not necessarily represent the official views of the NCI.

## Authors’ Affiliations

## References

- Mungiole M, Pickle LW, Hansen Simonson K: Application of a weighted head-banging algorithm to mortality data maps. Statistics in Medicine. 1999, 18: 3201-3209.PubMedView ArticleGoogle Scholar
- Pickle LW: Spatial analysis of disease. Biostatistical Applications in Cancer Research. Edited by: Beam C. 2002, Boston, Kluwer Academic Publishers, Chapter 7: 113-150.View ArticleGoogle Scholar
- Waller LA, Gotway CA: Applied Spatial Statistics for Public Health Data. 2004, New Jersey: John Wiley and SonsView ArticleGoogle Scholar
- Christensen OF, Waagepetersen R: Bayesian prediction of spatial count data using generalized linear mixed models. Biometrics. 2002, 58: 280-286.PubMedView ArticleGoogle Scholar
- Best N, Richardson S, Thomson A: A comparison of Bayesian spatial models for disease mapping. Statistical Methods in Medical Research. 2005, 14: 35-59.PubMedView ArticleGoogle Scholar
- Clayton DG, Kaldor J: Empirical Bayes estimates of age-standardized relative risks for use in disease mapping. Biometrics. 1987, 43: 671-681.PubMedView ArticleGoogle Scholar
- Marshall RJ: Mapping disease and mortality rates using empirical Bayes estimators. Applied Statistics. 1991, 40 (2): 283-294.PubMedView ArticleGoogle Scholar
- Goovaerts P: Geostatistical analysis of disease data: estimation of cancer mortality risk from empirical frequencies using Poisson kriging. International Journal of Health Geographics. 2005, 4: 31-PubMedPubMed CentralView ArticleGoogle Scholar
- Goovaerts P, Jacquez GM: Accounting for regional background and population size in the detection of spatial clusters and outliers using geostatistical filtering and spatial neutral models: the case of lung cancer in Long Island, New York. International Journal of Health Geographics. 2004, 3: 14-PubMedPubMed CentralView ArticleGoogle Scholar
- Clayton DG, Bernardinelli L: Bayesian methods for mapping disease risk. Geographical and Environmental Epidemiology: Methods for Small-area Studies. Edited by: Elliot P, Cuzick J, English D, Stern R. 1992, Oxford, UK, Oxford University Press, 205-220.Google Scholar
- Johnson GD: Small area mapping of prostate cancer incidence in New York State (USA) using fully Bayesian hierarchical modelling. International Journal of Health Geographics. 2004, 3: 29-PubMedPubMed CentralView ArticleGoogle Scholar
- Richardson S, Thomson A, Best N, Elliot P: Interpreting posterior relative risk estimates in disease-mapping studies. Environmental Health Perspectives. 2004, 112: 1016-1025.PubMedPubMed CentralView ArticleGoogle Scholar
- Goovaerts P: Geostatistics for Natural Resources Evaluation. 1997, New York: Oxford University PressGoogle Scholar
- Chiles JP, Delfiner P: Geostatistics: Modeling Spatial Uncertainty. 1999, New York: WileyView ArticleGoogle Scholar
- Goovaerts P: Geostatistical modelling of uncertainty in soil science. Geoderma. 2001, 103: 3-26.View ArticleGoogle Scholar
- Lemke LD, Abriola LM, Goovaerts P: DNAPL source zone characterization: influence of hydraulic property correlation on predictions of DNAPL infiltration and entrapment. Water Resources Research. 2004, 40 (1): W01511 10.1029Google Scholar
- Goovaerts P: Estimation or simulation of soil properties? An optimization problem with conflicting criteria. Geoderma. 2000, 3–4: 165-186.View ArticleGoogle Scholar
- Goovaerts P: Geostatistical modelling of the spaces of local, spatial and response uncertainty for continuous petrophysical properties. Stochastic Modeling and Geostatistics: Principles, Methods and Case Studies. AAPG Computer Applications in Geology. Edited by: Coburn TC, Yarus JM, Chambers RL. 2005, Tulsa, Oklahoma, The American Association of Petroleum Geologists, 2: 1-21.Google Scholar
- Srivastava M: The visualization of spatial uncertainty. Stochastic Modeling and Geostatistics: Principles, Methods and Case Studies. AAPG Computer Applications in Geology. Edited by: Yarus JM, Chambers RL. 1994, Tulsa, Oklahoma, The American Association of Petroleum Geologists, 339-345.Google Scholar
- Wang L: Animated visualization of multiple simulated realizations and assessment of uncertainty. Report 7 of Stanford Center for Reservoir Forecasting. 1994, Stanford, CA, Stanford Center for Reservoir ForecastingGoogle Scholar
- Froidevaux R: Probability field simulation. Geostatistics Troia '92. Edited by: Soares A. 1993, Dordrecht, The Netherlands, Kluwer Academic Publishers, 2: 73-84.View ArticleGoogle Scholar
- Srivastava M: Reservoir characterization with probability field simulation. SPE Annual Conference and Exhibition. 1992, Washington, D.C., Society of Petroleum Engineers, 24753: 927-938.Google Scholar
- Goovaerts P: Geostatistical modeling of spatial uncertainty using p-field simulation with conditional probability fields. International Journal of Geographical Systems. 2002, 16: 167-178.View ArticleGoogle Scholar
- AvRuskin GA, Jacquez GM, Meliker JR, Slotnick MJ, Kaufmann A, Nriagu JO: Visualization and exploratory analysis of epidemiologic data using using a novel space time information system. International Journal of Health Geographics. 2004, 3: 26-PubMedPubMed CentralView ArticleGoogle Scholar
- Pickle LW, Mungiole M, Jones GK, White AA: Exploring spatial patterns of mortality: the new Atlas of United States mortality. Statistics in Medicine. 1999, 18: 3211-3220.PubMedView ArticleGoogle Scholar
- Grauman DJ, Tarone RE, Devesa SS, Fraumeni JF: Alternate ranging methods for cancer mortality maps. Journal of the National Cancer Institute. 2000, 92 (7): 534-543.PubMedView ArticleGoogle Scholar
- Brewer CA, Pickle L: Evaluation of methods for classifying epidemiological data on choropleth maps in series. Annals of the Association of American Geographers. 2002, 92 (4): 662-681.View ArticleGoogle Scholar
- Monestiez P, Dubroca L, Bonnin E, Durbec JP, Guinet C: Comparison of model based geostatistical methods in ecology: application to fin whale spatial distribution in northwestern Mediterranean Sea. Geostatistics Banff 2004. Edited by: Leuangthong O, Deutsch CV. 2005, Dordrecht, The Netherlands, Kluwer Academic Publishers, 2: 777-786.View ArticleGoogle Scholar
- Monestiez P, Dubroca L, Bonnin E, Durbec JP, Guinet C: Geostatistical modelling of spatial distribution of Balenoptera physalus in the northwestern Mediterranean Sea from sparse count data and heterogeneous observation efforts. Ecological Modelling. 2006,Google Scholar
- Pardo-Iguzquiza E: VARFIT: a Fortran-77 program for fitting variogram models by weighted least squares. Computers and Geosciences. 1999, 25: 251-261.View ArticleGoogle Scholar
- Goovaerts P: Impact of the simulation algorithm, magnitude of ergodic fluctuations and number of realizations on the spaces of uncertainty of flow properties. Stochastic Environmental Research and Risk Assessment. 1999, 13: 161-182.View ArticleGoogle Scholar
- Deutsch CV, Journel AG: GSLIB: Geostatistical Software Library and User's Guide. 1998, New York: Oxford Univ. Press, 2Google Scholar
- Anselin L: Local indicators of spatial association – LISA. Geographical Analysis. 1995, 27: 93-115.View ArticleGoogle Scholar
- Goovaerts P: Detection of spatial clusters and outliers in cancer rates using geostatistical filters and spatial neutral models. geoENV V – Geostatistics for Environmental Applications. Edited by: Renard Ph, Demougeot-Renard H, Froidevaux R. 2005, The Netherlands, Springer-Verlag, 149-160.Google Scholar
- Englund E, Sparks A: Geo-EAS 1.2.1 User's Guide. EPA Report # 60018-91/008. 1988, EPA-EMSL, Las Vegas, NVGoogle Scholar
- Simes RJ: An improved Bonferroni procedure for multiple tests of significance. Biometrika. 1986, 73: 751-4.View ArticleGoogle Scholar
- New York State Department of Health: The New York State Cancer Surveillance Improvement Initiative. 1998, Albany, NYGoogle Scholar
- Journel AG, Huijbregts ChJ: Mining Geostatistics. 1978, London: Academic PressGoogle Scholar

## Copyright

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.