Geostatistical analysis of disease data: visualization and propagation of spatial uncertainty in cancer mortality risk using Poisson kriging and pfield simulation
 Pierre Goovaerts^{1}Email author
Affiliated with
DOI: 10.1186/1476072X57
© Goovaerts. 2006
Received: 29 December 2005
Accepted: 09 February 2006
Published: 09 February 2006
Abstract
Background
Smoothing methods have been developed to improve the reliability of risk cancer estimates from sparsely populated geographical entities. Filtering local details of the spatial variation of the risk leads however to the detection of larger clusters of low or high cancer risk while most spatial outliers are filtered out. Static maps of risk estimates and the associated prediction variance also fail to depict the uncertainty attached to the spatial distribution of risk values and does not allow its propagation through local cluster analysis. This paper presents a geostatistical methodology to generate multiple realizations of the spatial distribution of risk values. These maps are then fed into spatial operators, such as in local cluster analysis, allowing one to assess how risk spatial uncertainty translates into uncertainty about the location of spatial clusters and outliers. This novel approach is applied to ageadjusted breast and pancreatic cancer mortality rates recorded for white females in 295 US counties of the Northeast (1970–1994). A publicdomain executable with example datasets is provided.
Results
Geostatistical simulation generates risk maps that are more variable than the smooth risk map estimated by Poisson kriging and reproduce better the spatial pattern captured by the risk semivariogram model. Local cluster analysis of the set of simulated risk maps leads to a clear visualization of the lower reliability of the classification obtained for pancreatic cancer versus breast cancer: only a few counties in the large cluster of low risk detected in West Virginia and Southern Pennsylvania are significant over 90% of all simulations. On the other hand, the cluster of high breast cancer mortality in Niagara county, detected after application of Poisson kriging, appears on 60% of simulated risk maps. Sensitivity analysis shows that 500 realizations are needed to achieve a stable classification for pancreatic cancer, while convergence is reached for less than 300 realizations for breast cancer.
Conclusion
The approach presented in this paper enables researchers to generate a set of simulated risk maps that are more realistic than a single map of smoothed mortality rates and allow the propagation of cancer risk uncertainty through local cluster analysis. Coupled with visualization and querying capabilities of geographical information systems, animated display of realizations can highlight areas that depart consistently from the general behavior observed across the region, guiding further investigation and control activities.
Background
Cancer mortality maps are used by public health officials to identify areas of excess and to guide surveillance and control activities. Quality of decisionmaking thus relies on an accurate quantification of risks from observed rates which can be very unreliable when computed from sparsely populated geographical entities or for diseases with a low frequency of occurrence [1]. Smoothing methods have been developed to improve the reliability of these estimates by borrowing information from neighboring entities [2, 3]. These methods range from simple deterministic techniques (i.e. head banging method [1]) to sophisticated full Bayesian models [4, 5]. Empirical Bayes smoothers [6, 7] and Poisson kriging [8] provide modelbased approaches with intermediate difficulty in terms of implementation and computer requirements. Although simulation studies [7, 8] have demonstrated the benefit of smoothing methods for risk prediction, some uncertainty will always be associated with the estimated risk. In Poisson kriging the uncertainty about the risk within a given geographical entity is modeled by computing a minimum error variance (kriging) estimate of the risk and the associated error variance, which can then be combined to derive a Gaussiantype confidence interval. Full Bayes models go one step further and yields the full posterior distribution of the risk while accounting for the uncertainty in the parameters of the model.
Most studies do not make use of the uncertainty measure provided by smoothing methods, and only the map of smoothed rates is reported and used in the analysis. This is unfortunate since all rate smoothers, including Poisson kriging, cause the loss of local details of the spatial variation of the risk. This smoothing potentially affects the subsequent analysis, leading for example to the detection of larger aggregates of low or high cancer risk in local cluster analysis while most spatial outliers are filtered out [9]. To depict the uncertainty attached to risk maps, some authors recommend mapping the 95^{th} percentile range of the posterior distribution of risk values or the probability that the risk in each entity exceeds a specific threshold of interest [4, 10, 11]. Richardson et al. [12] proposed to use this probability of exceedence to decide whether an area should be classified as having an excess risk of cancer. They discussed different decision rules or loss functions which represent weighted tradeoffs between falsepositive results (i.e. declaring an area as having elevated risk when in fact its underlying true risk equals the background level) and falsenegative results (i.e. declaring an area to be in the background when in facts its underlying risk is elevated).
A major weakness of the uncertainty measures reported in today's health science literature is that they are areaspecific, that is they inform on the uncertainty prevailing over a single geographical entity at a time. Except if the risk values are spatially independent, the probabilities of excess of a specific threshold computed for several entities do not provide a measure of uncertainty about whether these entities jointly exceed that threshold. In addition to a measure of "local" uncertainty, one thus needs to assess the "spatial" uncertainty, that is the uncertainty attached to the spatial distribution of risk values across the study area. The quantification of spatial uncertainty is particularly important for cluster detection, since the focus is on risk values for a group of geographical entities considered simultaneously. This information is not conveyed by a statistic map of the estimated risk and the associated prediction variance.
Spatial uncertainty modeling has been one of the most vibrant areas of research in geostatistics for the last 2 decades [13–15]. Applications, such as modeling the migration of pollutants in the subsurface environment, require measures of multiplepoint uncertainty, such as the probability of occurrence of a string of high or low permeability values that may represent a flow path or flow barrier [16]. These joint probabilities are assessed numerically from a set of realizations of the spatial distribution of attribute values over the locations of interest. In other words, the spatial uncertainty is modeled through the generation of a set of equallyprobable simulated maps, each consistent with the information available, such as histogram or a spatial correlation function. Stochastic simulation allows generation of maps that reproduce the spatial variability of the data without smoothing effect [17]. The set of simulated maps can also be used to propagate the uncertainty through spatial operators or transfer functions. For example, the set of simulated permeability maps can be fed into a flow simulator, yielding a distribution of response values, such as travel times to the water table [18].
From the user's perspective, it is important to be able to visualize the uncertainty in the spatial model of risk values. Although stochastic simulation offers a way to generate a large number of potential scenarios, the burden of manually scrolling through hundreds of different maps will test the patience of most users and be little informative. The information contained in the set of simulated maps is thus often summarized through a static display of probabilities of exceeding particular threshold or some measures of the spread of the posterior distribution [13]. By doing so, one however fails to depict the uncertainty about spatial features and essentially maps the areaspecific measures of uncertainty provided by kriging and other smoothing methods. To depict visually the spatial uncertainty, several authors [19, 20] have developed algorithms that show the realizations one at a time in rapid succession, like the frames of an animated cartoon, thereby eliminating the burden of manually displaying each simulated map. Like an animated cartoon, successive realizations must be similar enough to allow the eye to catch gradual changes. Such a similarity can be achieved by ranking the realizations appropriately or by using a simulation algorithm, called pfield simulation [21–23], that generates realizations that are incrementally different. The animated display of realizations allows one to distinguish areas that remain stable over all realizations (low uncertainty) from those where large fluctuations occur between realizations (high uncertainty).
This paper presents the first application of geostatistical simulation to model the spatial uncertainty attached to cancer risk values. This approach combines Poisson kriging and pfield simulation to generate multiple realizations of the spatial distribution of cancer risk values. These simulated maps are then fed into a local cluster analysis, allowing one to assess how uncertainty about the spatial distribution of risk values translates into uncertainty about the location of spatial clusters and outliers in cancer risks. The simulation approach is also used to generate a series of realizations that are incrementally different and animations are created using the SpaceTime Information System (STIS) technology [24]. A publicdomain executable for generating multiple risk maps, along with example datasets, is provided. This novel methodology is applied to the modeling, visualization and propagation of the spatial uncertainty of ageadjusted breast and pancreatic cancer mortality risks for white females in 295 US counties of the Northeast (1970–1994).
Methods
Data
Poisson kriging
The geostatistical methodology for the estimation of risk values from empirical frequencies, and its performance relative to common smoothers, is described in details in Goovaerts [8]. This section briefly reviews its most salient features. For a given number N of entities (e.g. counties, states, electoral ward), denote the observed mortality rates as z(u _{ α }) = d(u _{ α })/n(u _{ α }), where d(u _{ α }) is the number of recorded mortality cases and n(u _{ α }) is the size of the population at risk. These entities are referenced geographically by their centroids (or seats) with the vector of spatial coordinates u _{ α }= (x_{ α },y_{ α }). The disease count d(u _{ α }) is interpreted as a realization of a random variable D(u _{ α }) that follows a Poisson distribution with one parameter (expected number of counts) that is the product of the population size n(u _{ α }) by the local risk R(u _{ α }).
The risk estimate over a given entity with centroid u _{ α }, and the attached prediction variance, are computed from K neighboring observed rates as:
where C _{R}(u _{i}u _{ α }) is the covariance between the rate measured at u _{i} and the risk estimated at u _{ α }; the spatial proximity is here defined in terms of Euclidian distance between the geographical centroids of the corresponding counties. The kriging weights λ _{ i }(u _{ α }) are the solution of the following system of linear equations, known as the "Poisson Kriging" (PK) system [28, 29]:
where δ _{ij} = 1 if u _{i} = u _{j} and 0 otherwise, and m* is the populationweighted mean of the rates. The term μ(u _{ α }) is a Lagrange parameter that results from the minimization of the estimation variance subject to the unbiasedness constraint on the estimator. The addition of an "error variance" term, m*/n(u _{i}), for a zero distance accounts for variability arising from population size, leading to smaller weights for less reliable data (i.e. measured over smaller populations).
The computation of kriging weights and kriging variance requires knowledge of the covariance of the unknown risk, C _{ R }(h), or equivalently its semivariogram γ _{ R }(h) = C _{ R }(0)C _{ R }(h). Following Monestiez et al. [28, 29] the semivariogram of the risk is estimated as:
where N(h) is the number of pairs of county centroids separated by a vector h. The different pairs [z(u _{ α })z(u _{ α }+h)] are weighted by the corresponding population sizes to homogenize their variance. A permissible model, γ _{ R }(h), is then fitted to the experimental semivariogram, i.e. using weighted leastsquare regression [30] in this paper.
Modeling local and spatial uncertainty
The uncertainty about the cancer mortality risk prevailing within the county with centroid u _{ α }can be modeled using the conditional cumulative distribution function (ccdf) of the risk variable R(u _{ α }) defined as:
where G(.) is the cumulative distribution function of the standard normal random variable. The notation "(K)" expresses conditioning to the local information, say, K neighboring observed rates. The function (5) gives the probability that the unknown risk is no greater than any given threshold r. It is modeled as a Gaussian distribution with the mean and variance corresponding to the Poisson kriging estimate and variance. Measures of the spread of the distribution (i.e. variance or interquartile range), as well as the probability of exceeding any given threshold, are easily computed and mapped [[13], p. 358–361].
Instead of a unique set of smooth risk estimates { (u _{ α }), α = 1,...,N}, stochastic simulation aims to generate a set of L equallyprobable realizations of the spatial distribution of risk values, {r ^{(l)}(u _{ α }), α =1,...,N; l = 1,...,L}, each consistent with the spatial pattern of the risk as modeled using the function γ _{ R }(h). Simulation of spatial phenomena can be accomplished using a growing variety of techniques that differ in the underlying random function model, the amount and type of information that can be accounted for, and the computer requirements [13]. The main difficulty for the current application is that there is no measured risk data; only a semivariogram indirectly computed from observed rates according to expression (4) is available. The lack of target histogram is not an issue for the pfield simulation approach [21–23]. The basic idea is to generate a realization {r ^{(l)}(u _{ α }), α = 1,...,N} through the sampling of the set of ccdfs by a set of spatially correlated probability values {p ^{(l)}(u _{ α }), α = 1,...,N}, known as a probability field or pfield. Since the probability distributions are Gaussian (Equation 5), each risk value is simulated as:
r ^{(l)}(u _{ α }) = F ^{1}(u _{ α }; p ^{(l)}(u _{ α })  (K)) = (u _{ α }) + σ _{ PK }(u _{ α })y ^{(l)}(u _{ α }) (Equation 6)
where y ^{(l)}(u _{ α }) is the quantile of the standard normal distribution corresponding to the cumulative probability p ^{(l)}(u _{ α }).
Generating the probability field
The L sets of random deviates or normal scores, {y ^{(l)}(u _{ α }), α = 1,...N}, are generated using nonconditional sequential Gaussian simulation which proceeds as follows (see [13], p. 380 for more details):
1. Define a random path (i.e. using a random number generator) visiting each county centroid u _{ α }only once.
2. At each location u _{ α }determine the mean and variance of the Gaussian probability distribution of yvalues as:
where y ^{(l)}(u _{i}) are normal scores simulated at locations previously visited along the random path and located within a search radius from u _{ α }, and C(u _{i}u _{ α }) is the covariance function of the normal score variable Y for the separation vector h _{iα }= u _{i}u _{ α }. In this paper, C(h) was identified to the covariance function of the risk after rescaling to a unit sill, i.e. C(h) = 1γ _{ R }(h)/C _{ R }(0). The λ _{i} are kriging weights obtained by solving the following system of linear equations (simple kriging, SK):
3. Draw a simulated value from the Gaussian ccdf and add it to the data set. In other words, the simulated value at u _{ α }is y ^{(l)}(u _{ α }) = (u _{ α })+σ _{ SK }(u _{ α }) × G ^{1} [p ^{(l)}], where p ^{(l)} is a random number between 0 and 1.
4. Proceed to the next location along the random path, and repeat the two previous steps.
5. Loop until all N locations (i.e. N = 295 here) are simulated.
The procedure is repeated using a different random path and set of random numbers to generate another realization. Note that the mean and variance of the set of simulated normal scores can sometimes deviate substantially from zero and one, respectively. Such discrepancies between model and realization statistics are referred to as ergodic fluctuations [[13], p. 426]. These fluctuations can be important when the range of the semivariogram model is large with respect to the size of the simulated area, in particular when the nugget effect is small [[31, 32] p. 128–133]. In the program pfield.exe described below, each set of simulated normal scores is rescaled by its mean and variance to fulfill the underlying assumptions of a standard normal random function.
Creating animated displays of realizations
One interesting feature of the pfield simulation approach is its ability to generate easily a series of realizations that are incrementally different. The algorithm, originally proposed by Srivastava [19] for the simulation of raster grids, requires the generation of a probability field that is much larger than the area to be simulated. Multiple realizations are then generated by shifting slightly the probability field before sampling the set of ccdfs. This small shift ensures that, albeit different, the probabilities used to sample the ccdfs from one realization to the next are very close to each other, leading to gradual changes between successive realizations. These realizations can be used as consecutive frames in an animation, enabling a dynamic and evolving display of the uncertainty attached to the spatial distribution of attribute values. A similar approach is here implemented for the simulation of risk values over the nongridded set of county centroids; see the Results and Discussion Section.
Local cluster analysis and propagation of uncertainty
Once the rates have been geostatistically filtered or simulated, classical Exploratory Spatial Data Analysis (ESDA) techniques can be applied to investigate the existence of local clusters or outliers of high or low cancer risk values. For example, the local Moran test evaluates local clustering or spatial autocorrelation [33]. Its null hypothesis is that there is no association between rates in neighboring geographical units; i.e. counties in this paper. The working (alternative) hypothesis is that spatial clustering exists. For each county, the socalled LISA (Local Indicator of Spatial Autocorrelation) statistic is computed as:
where (u _{ α }) is the kriged risk for the county being tested, which is referred to as the "kernel" hereafter; (u _{ j }) are the risk values for the J(u _{ α }) neighboring counties that are here defined as units sharing a common border or vertex with the kernel u _{ α }(1st order queen adjacencies). All values are standardized using the mean m and standard deviation s of the set of risk estimates. Since the standardized values have zero mean, a negative value for the LISA statistic indicates a negative local autocorrelation and the presence of spatial outlier where the kernel value is much lower (higher) than the surrounding values. Cluster of low (high) values will lead to positive values of the LISA statistic.
In addition to the sign of the LISA statistic, its magnitude informs on the extent to which kernel and neighborhood values differ. To test whether this difference is significant or not, a Monte Carlo simulation is conducted, which traditionally consists of sampling randomly and without replacement the global distribution of rates (i.e. sample histogram) and computing the corresponding simulated neighborhood averages. This operation is repeated many times (e.g. M = 999 draws) and these simulated values are multiplied by the kernel value to produce a set of M simulated values of the LISA statistic at location u _{ α }. This set represents a numerical approximation of the probability distribution of the LISA statistic at u _{ α }, under the assumption of spatial independence. The observed statistic (Equation 10) is compared to the probability distribution, enabling the computation of the probability of not rejecting the null hypothesis. The socalled pvalue is compared to the significance level α chosen by the user and representing the probability of rejecting the null hypothesis when it is true (Type I error). Every county where the pvalue is lower than the significance level is classified as a significant spatial outlier (HL: high value surrounded by low values, and LH: low value surrounded by high values) or cluster (HH: high value surrounded by high values, and LL: low value surrounded by low values). If the pvalue exceeds the α level, the county is declared nonsignificant (NS).
Instead of conducting the local cluster analysis on the set of smooth risk estimates, one of the key ideas of this paper is to apply the algorithm to the set of simulated risk maps, yielding a set of classifications of counties into significant clusters and outliers, or non significant units. The ensemble of classified maps can be used to compute the probability that a county belongs to each of the five classes. This county is then allocated to the class with the highest probability of occurrence (maximum likelihood classification).
Any prior information on the spatial pattern of mortality risks could be included directly into the randomization procedure underpinning the test of significance of local cluster analysis using the approach proposed by Goovaerts and Jacquez [9]. In other words, the null hypothesis of spatial randomness and uniform risk, corresponding to the operation of randomization of the outcomes, is replaced by models that are more realistic. Realizations of the socalled "Neutral Models" are also generated using geostatistical simulation. The approach allows the identification of spatial patterns above and beyond that incorporated into the neutral model, enabling, for example, the identification of "hot spots" beyond background variation in a pollutant or the detection of clusters beyond regional variation in the risk of developing cancer [34].
Software
Poisson kriging, including the estimation and modelling of the semivariogram of the risk, was conducted using the publicdomain executable poissonkriging.exe described in Goovaerts [8]. The data and parameter file used for the estimation of pancreatic cancer risks are provided with the paper (additional file 1: pancreas.dat, additional file 2: poissonkriging.par). Probability fields were generated using a modified version of the code sgsim.exe from the publicdomain Geostatistical Software Library Gslib [32]. This modified code allows simulation over an irregular grid of county centroids, and the random deviates are directly combined with the output of Poisson kriging to generate simulated risk values. The executable of this pfield code is provided with the paper (additional file 3: pfield.exe), along with the parameter file used for pancreatic cancer (additional file 4: pfield.par). Local cluster analysis and animation of results were performed using the SpaceTime Information System (STIS) technology [24].

Name of the text file including the Poisson kriging risk estimates and variances. This file is one of the output files of poisson_kriging.exe.

Name of the output text file (GeoEAS format [35]) that includes the spatial coordinates of each entity centroid and the simulated values for each realization. The output file produced by the parameter file pfield.par, 100pancreaspfield.out, is shown in Figure 3

Name of the text file including the ID field and spatial coordinates of the centroids of the geographical units to be simulated.

The column numbers for the observation identification code (i.e. FIP county), and the variables with the spatial coordinates.

The number of geographical units to simulate.

The number of realizations to generate.

Random number seed to initiate the random number generator. This number should be a large odd integer.

The maximum number of previously simulated values to use in the computation of kriging weights and kriging variance (parameter K in Equations 7–9). These values are selected according to their covariance with the node being simulated, which accounts for both distance and direction in presence of anisotropy. Values beyond the range of autocorrelation (i.e. zero covariance) are not selected since their weight is systematically zero in simple kriging.

Parameters of the risk semivariogram model: nugget effect, number and type of basic semivariogram models, sill and range(s) of autocorrelation for each model, azimuth of the direction of maximum range if an anisotropic model is fitted. These values are found in the output text file created by poisson_kriging.exe; see example in Figure 4
Results and discussion
Semivariograms of cancer mortality risk
The semivariograms of the risk are much better structured and have smaller sills than the corresponding semivariograms of the rates. This is expected since the weights in expression (4) attenuate the influence of extreme rates computed from small population sizes and subtraction of the correction term m* reduces the variance even more. This effect is more pronounced for pancreatic cancer: the relative decrease in sill value is larger than for breast cancer, and the risk semivariogram has a much longer range of autocorrelation than the traditional estimator that is almost pure nugget effect (i.e. no spatial correlation). Since pancreatic cancer is less frequent than breast cancer, its mortality rates are more likely to be impacted by the small number problem and display higher levels of noise. In particular for breast cancer the semivariogram of the risk displays a slight anisotropy, with smaller variability (red curve) along the NESW direction for small distances.
Poisson kriging
Results of local cluster analysis of mortality rate and risk maps. Number of counties classified as significant spatial clusters or outliers for breast and pancreatic cancers on the original mortality rate maps, kriged risk maps, and on average over the 500 simulated risk maps.
Maps  HH cluster  LL cluster  HL Outlier  LH outlier 

Breast cancer  
Mortality rates  20  39  2  1 
Risk estimates  35  50  0  0 
Simulation  37.2  48.2  0.18  0.07 
Pancreatic cancer  
Mortality rates  5  16  7  0 
Risk estimates  42  60  0  0 
Simulation  44.5  51.0  0.24  0.08 
Stochastic simulation of risk values
A shortcoming of the local cluster analysis in Figure 8 is that it ignores the uncertainty attached to the predicted risks. For example, the large cluster of low mortality risk for pancreatic cancer should be interpreted cautiously because of the large prediction variance in the Southern part of the study area. One notices also the absence of spatial outliers for both cancers which, to some extent, results from the smoothing of high and low values by kriging. This conditional bias (i.e. overestimation of low values and underestimation of high values) might cause some counties to be wrongly classified as nonsignificant too.
Propagation of uncertainty through local cluster analysis
The classified maps in the right column of Figures 9 and 10 illustrate how the uncertainty about risk values translates into uncertainty about the results of the local cluster analysis. From one realization to another, the shape and position of local clusters can change substantially. For example, the southern part of the cluster of low risk values detected on the kriged map for pancreatic cancer becomes non significant on some realizations; see Figure 10 (bottom right graph). The location and nature of clusters is much more stable for breast cancer. Table 1 indicates that slightly more counties are classified as significant clusters when the analysis is performed on kriged risks instead of simulated values. A few counties are also now classified as significant outliers, which confirms the potential bias of an analysis based on smoothed rates.
How many realizations are needed?
Visualization of spatial uncertainty
Conclusion
Capitalizing on the abundant geostatistical literature devoted to the modelling of local and spatial uncertainty plus the recent development of Poisson kriging, this paper presented a novel approach, and the corresponding computer code, to generate realizations of the spatial distribution of risk values. Pfield simulation proceeds in two steps. First, the local uncertainty about the risk prevailing within each geographical unit is modelled from the observed mortality rates and population at risk using Poisson kriging. The set of local probability distributions is then sampled using a set of spatially correlated probability values generated using nonconditional sequential Gaussian simulation. Through sampling by slowly moving probability fields, the approach also allows the creation of incrementally different realizations that can be used as consecutive frames in an animation, enabling a dynamic and evolving display of the uncertainty attached to the spatial distribution of risk values.
Simulated risk maps reproduce the spatial variability of the risk, thereby overcoming the smoothing effect inherent to all current procedures for stabilization of rate data, including Poisson kriging. In fact, the map of kriged risk is but the average of all simulated risk maps, while the kriging variance corresponds to the variance of the local distribution of simulated risk values. The probability of occurrence of multipoint or spatial features, such as aggregates of counties with low or high cancer mortality risk, can also be assessed numerically from the ensemble of realizations. This information can be conveyed through static maps of spatial clusters and outliers, using a color code indicating the likelihood of the classification, or through the dynamic display of the set of classified maps.
The analysis of breast and pancreatic cancer mortality rates illustrated the potentialities of the approach for modelling and propagating their spatial uncertainty through local cluster analysis. Geostatistical simulation generated risk maps that are more variable than the smooth risk map estimated by Poisson kriging and reproduce better the spatial pattern captured by the risk semivariogram model. Differences between realizations were particularly important for the less frequent pancreatic cancer, reflecting for example the uncertainty attached to higher risk estimates in sparsely populated counties of Maine and Northwestern NewYork. Such uncertainty translates into a lack of reliability of some of the spatial clusters and outliers detected on the map of smoothed rates. The case study showed how geostatistical simulation permits the quantification of the likelihood of spatial clusters and outliers detected using Moran's I. Local cluster analysis of the set of simulated risk maps leads to a clear visualization of the lower reliability of the classification for pancreatic cancer versus breast cancer: only a few counties in the large cluster of low risk detected in West Virginia and Southern Pennsylvania are significant over 90% of all simulations. On the other hand, the cluster of high breast cancer mortality in Niagara county, detected after application of Poisson kriging, is significant for 60% of simulated risk maps. Sensitivity analysis shows that 500 realizations are needed to achieve a stable classification for pancreatic cancer, while convergence is reached after fewer than 300 realizations for breast cancer.
Maps of cancer incidence as well as mortality rates are frequently used as input to disease clustering procedures whose purpose is to identify local areas of excess and deficit. Most users recognize that rates recorded for small populations are uncertain, leading to the routine application of smoothing methods prior to the analysis. Yet, in their interpretation of the risk maps, they tend to forget that the risks themselves are prone to uncertainty and that some features, such as large clusters of low or high risk, might just be artifacts of the smoothing method. Ignoring such uncertainty can lead to misallocation of resources to investigate unreliable clusters of high risk, while areas of real concern might go undetected. Instead of a single, and potentially misleading, map of smoothed rates, the simulation approach provides public health officials with a set of plausible scenarios for the spatial pattern of risk. Those permit investigating the impact of risk uncertainty on alternate intervention and control activities. In the future, the concept of stochastic simulation of mortality risks will be combined with the neutral model methodology [9] to assess geographic clustering using appropriate null hypotheses that account for the spatial correlation and background variation modeled from the observed rates and any ancillary information (i.e. exposure model).
Declarations
Acknowledgements
This research was funded by grants 1R43CA11028101 and R44CA092807 from the National Cancer Institute. The views stated in this publication are those of the author and do not necessarily represent the official views of the NCI.
Authors’ Affiliations
References
 Mungiole M, Pickle LW, Hansen Simonson K: Application of a weighted headbanging algorithm to mortality data maps. Statistics in Medicine 1999, 18:3201–3209.View ArticlePubMed
 Pickle LW: Spatial analysis of disease. Biostatistical Applications in Cancer Research (Edited by: Beam C). Boston, Kluwer Academic Publishers 2002, Chapter 7:113–150.
 Waller LA, Gotway CA: Applied Spatial Statistics for Public Health Data New Jersey: John Wiley and Sons 2004.View Article
 Christensen OF, Waagepetersen R: Bayesian prediction of spatial count data using generalized linear mixed models. Biometrics 2002, 58:280–286.View ArticlePubMed
 Best N, Richardson S, Thomson A: A comparison of Bayesian spatial models for disease mapping. Statistical Methods in Medical Research 2005, 14:35–59.View ArticlePubMed
 Clayton DG, Kaldor J: Empirical Bayes estimates of agestandardized relative risks for use in disease mapping. Biometrics 1987, 43:671–681.View ArticlePubMed
 Marshall RJ: Mapping disease and mortality rates using empirical Bayes estimators. Applied Statistics 1991,40(2):283–294.View ArticlePubMed
 Goovaerts P: Geostatistical analysis of disease data: estimation of cancer mortality risk from empirical frequencies using Poisson kriging. International Journal of Health Geographics 2005, 4:31.View ArticlePubMed
 Goovaerts P, Jacquez GM: Accounting for regional background and population size in the detection of spatial clusters and outliers using geostatistical filtering and spatial neutral models: the case of lung cancer in Long Island, New York. International Journal of Health Geographics 2004, 3:14.View ArticlePubMed
 Clayton DG, Bernardinelli L: Bayesian methods for mapping disease risk. Geographical and Environmental Epidemiology: Methods for Smallarea Studies (Edited by: Elliot P, Cuzick J, English D, Stern R). Oxford, UK, Oxford University Press 1992, 205–220.
 Johnson GD: Small area mapping of prostate cancer incidence in New York State (USA) using fully Bayesian hierarchical modelling. International Journal of Health Geographics 2004, 3:29.View ArticlePubMed
 Richardson S, Thomson A, Best N, Elliot P: Interpreting posterior relative risk estimates in diseasemapping studies. Environmental Health Perspectives 2004, 112:1016–1025.View ArticlePubMed
 Goovaerts P: Geostatistics for Natural Resources Evaluation New York: Oxford University Press 1997.
 Chiles JP, Delfiner P: Geostatistics: Modeling Spatial Uncertainty New York: Wiley 1999.
 Goovaerts P: Geostatistical modelling of uncertainty in soil science. Geoderma 2001, 103:3–26.View Article
 Lemke LD, Abriola LM, Goovaerts P: DNAPL source zone characterization: influence of hydraulic property correlation on predictions of DNAPL infiltration and entrapment. Water Resources Research 2004.,40(1): W01511 10.1029
 Goovaerts P: Estimation or simulation of soil properties? An optimization problem with conflicting criteria. Geoderma 2000, 3–4:165–186.View Article
 Goovaerts P: Geostatistical modelling of the spaces of local, spatial and response uncertainty for continuous petrophysical properties. Stochastic Modeling and Geostatistics: Principles, Methods and Case Studies. AAPG Computer Applications in Geology (Edited by: Coburn TC, Yarus JM, Chambers RL). Tulsa, Oklahoma, The American Association of Petroleum Geologists 2005, 2:1–21.
 Srivastava M: The visualization of spatial uncertainty. Stochastic Modeling and Geostatistics: Principles, Methods and Case Studies. AAPG Computer Applications in Geology (Edited by: Yarus JM, Chambers RL). Tulsa, Oklahoma, The American Association of Petroleum Geologists 1994, 339–345.
 Wang L: Animated visualization of multiple simulated realizations and assessment of uncertainty. Report 7 of Stanford Center for Reservoir Forecasting Stanford, CA, Stanford Center for Reservoir Forecasting 1994.
 Froidevaux R: Probability field simulation. Geostatistics Troia '92 (Edited by: Soares A). Dordrecht, The Netherlands, Kluwer Academic Publishers 1993, 2:73–84.
 Srivastava M: Reservoir characterization with probability field simulation. SPE Annual Conference and Exhibition Washington, D.C., Society of Petroleum Engineers 1992, 24753:927–938.
 Goovaerts P: Geostatistical modeling of spatial uncertainty using pfield simulation with conditional probability fields. International Journal of Geographical Systems 2002, 16:167–178.View Article
 AvRuskin GA, Jacquez GM, Meliker JR, Slotnick MJ, Kaufmann A, Nriagu JO: Visualization and exploratory analysis of epidemiologic data using using a novel space time information system. International Journal of Health Geographics 2004, 3:26.View ArticlePubMed
 Pickle LW, Mungiole M, Jones GK, White AA: Exploring spatial patterns of mortality: the new Atlas of United States mortality. Statistics in Medicine 1999, 18:3211–3220.View ArticlePubMed
 Grauman DJ, Tarone RE, Devesa SS, Fraumeni JF Jr: Alternate ranging methods for cancer mortality maps. Journal of the National Cancer Institute 2000,92(7):534–543.View ArticlePubMed
 Brewer CA, Pickle L: Evaluation of methods for classifying epidemiological data on choropleth maps in series. Annals of the Association of American Geographers 2002,92(4):662–681.View Article
 Monestiez P, Dubroca L, Bonnin E, Durbec JP, Guinet C: Comparison of model based geostatistical methods in ecology: application to fin whale spatial distribution in northwestern Mediterranean Sea. Geostatistics Banff 2004 (Edited by: Leuangthong O, Deutsch CV). Dordrecht, The Netherlands, Kluwer Academic Publishers 2005, 2:777–786.View Article
 Monestiez P, Dubroca L, Bonnin E, Durbec JP, Guinet C: Geostatistical modelling of spatial distribution of Balenoptera physalus in the northwestern Mediterranean Sea from sparse count data and heterogeneous observation efforts. Ecological Modelling 2006, in press
 PardoIguzquiza E: VARFIT: a Fortran77 program for fitting variogram models by weighted least squares. Computers and Geosciences 1999, 25:251–261.View Article
 Goovaerts P: Impact of the simulation algorithm, magnitude of ergodic fluctuations and number of realizations on the spaces of uncertainty of flow properties. Stochastic Environmental Research and Risk Assessment 1999, 13:161–182.View Article
 Deutsch CV, Journel AG: GSLIB: Geostatistical Software Library and User's Guide 2 Edition New York: Oxford Univ. Press 1998.
 Anselin L: Local indicators of spatial association – LISA. Geographical Analysis 1995, 27:93–115.View Article
 Goovaerts P: Detection of spatial clusters and outliers in cancer rates using geostatistical filters and spatial neutral models. geoENV V – Geostatistics for Environmental Applications (Edited by: Renard Ph, DemougeotRenard H, Froidevaux R). The Netherlands, SpringerVerlag 2005, 149–160.
 Englund E, Sparks A: GeoEAS 1.2.1 User's Guide. EPA Report # 60018–91/008 EPAEMSL, Las Vegas, NV 1988.
 Simes RJ: An improved Bonferroni procedure for multiple tests of significance. Biometrika 1986, 73:751–4.View Article
 New York State Department of Health: The New York State Cancer Surveillance Improvement Initiative. Albany, NY 1998.
 Journel AG, Huijbregts ChJ: Mining Geostatistics London: Academic Press 1978.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.