- Methodology
- Open Access
Estimating range of influence in case of missing spatial data: a simulation study on binary data
- Kristine Bihrmann^{1}Email author and
- Annette K Ersbøll^{2}
https://doi.org/10.1186/1476-072X-14-1
© Bihrmann and Ersbøll; licensee BioMed Central. 2015
- Received: 10 October 2014
- Accepted: 16 December 2014
- Published: 6 January 2015
Abstract
Background
The range of influence refers to the average distance between locations at which the observed outcome is no longer correlated. In many studies, missing data occur and a popular tool for handling missing data is multiple imputation. The objective of this study was to investigate how the estimated range of influence is affected when 1) the outcome is only observed at some of a given set of locations, and 2) multiple imputation is used to impute the outcome at the non-observed locations.
Methods
The study was based on the simulation of missing outcomes in a complete data set. The range of influence was estimated from a logistic regression model with a spatially structured random effect, modelled by a Gaussian field. Results were evaluated by comparing estimates obtained from complete, missing, and imputed data.
Results
In most simulation scenarios, the range estimates were consistent with ≤25% missing data. In some scenarios, however, the range estimate was affected by even a moderate number of missing observations. Multiple imputation provided a potential improvement in the range estimate with ≥50% missing data, but also increased the uncertainty of the estimate.
Conclusions
The effect of missing observations on the estimated range of influence depended to some extent on the missing data mechanism. In general, the overall effect of missing observations was small compared to the uncertainty of the range estimate.
Keywords
- Range of influence
- Missing data
- Binary data
- INLA
Background
In spatial data, the range of influence refers to the average distance between locations at which the observed outcome is no longer correlated. The range of influence can be estimated from a variogram or based on a regression model with a spatially structured random effect, modelled by a Gaussian field. Traditionally, the regression model has posed a computational challenge in all but very small data sets, due to the need to invert a dense covariance matrix. However, the recent development of the so-called stochastic partial differential equation (SPDE) approach [1] along with the Integrated Nested Laplace Approximation (INLA) [2] approach to Bayesian inference have made these models computationally feasible. The SPDE links the Gaussian field to a Gaussian Markov random field given by a sparse precision matrix, and INLA is a computationally efficient alternative to MCMC for Bayesian inference, particularly since no sampling is required. Regression models with a spatial component could be applied in several different settings where modelling a spatial pattern is of interest. Examples include rainfall in a geographic area and pricing of houses, as well as health, disease, and lifestyle outcomes, for example the spread of an infectious disease.
Many (if not the majority of) studies are subject to missing data. For example, the price of a house is only known if the house has been sold within the study period. Similarly, all studies based on survey data will only include information on those who participated in the survey. This study will focus on the situation where a binary outcome, disease status for example, is missing at some of a given set of locations.
Missing data are often classified in three categories [3]: 1) missing completely at random (MCAR): the probability of data being missing does not depend on the observed or the unobserved data, 2) missing at random (MAR): the probability of data being missing does not depend on the unobserved data, conditional on the observed data, and 3) missing not at random (MNAR): the probability of data being missing does depend on the unobserved data, conditional on the observed data. This categorisation is based on assumptions about the missing data and cannot be tested. It is well-known that missing data can cause biased and/or inefficient estimates of, for example, regression parameters and standard errors - if not handled adequately. A popular tool for handling missing data is multiple imputation (MI) [4], at least if the missing data are not MNAR. In MI, the distribution of the observed data is used to estimate a set of values of the missing data. To our knowledge, the effect of missing data (and in particular missing outcomes) on the estimated range of influence has not been studied.
The objective of this study was to investigate how the estimated range of influence is affected when 1) a binary outcome is only observed at some of a given set of locations, and 2) multiple imputation is used to impute the outcome at the non-observed locations. For simplicity, complete information on covariates was assumed. The investigation was based on the simulation of missing data in a complete data set with known locations.
Results and discussion
The range of influence was estimated to 12.7 km (standard deviation (SD) 4.4) in the complete data set. All the reported analyses included the only significant covariate ((log-) herd size) in the regression model. The estimation was based on a triangulation of the considered region, and a finer triangulation did not change the estimate, but did markedly increase computing time. Reducing the region to make the shape of the area more regular slightly decreased the estimated range of influence. A sensitivity analysis to determine the effect of the prior distribution of the hyperparameters showed no effect on the estimates obtained from the complete data. With missing data, the range estimate did not change when increasing the precision of the prior from 0.00001 to 0.001. In the majority of missing data scenarios, however, a precision of 0.1 produced a range estimate further from the estimate obtained from the complete data, as well as a larger standard deviation, compared to the results obtained with precision 0.001. The maximum difference in median range within scenarios was 1 km, and in most scenarios it was less than 0.5 km. These differences are small considering the uncertainty of the range estimate. They do, however, indicate that a precision of 0.1 was a potentially informative prior when the information within the data was reduced by missing observations. Therefore, a prior with precision 0.001 was preferred in the analyses. As a consequence, however, various computational problems were encountered with 75% missing data. These problems were solved by simply changing the precision to the default value of 0.1. This was done in all analyses of data sets with 75% missing data, and the results may not, therefore, be fully comparable to scenarios with fewer missing observations. The same computational problems were encountered with both 50% and 75% imputed data. In these scenarios, the precision 0.1 was also used. Overall, the variance parameter σ ^{2} (part of the spatial covariance function) was unaffected by the prior distribution.
As part of the INLA procedure used to estimate parameters, eigenvalues of a Hessian matrix are computed. Negative eigenvalues, which may affect the accuracy of the approximation, were seen in three simulation scenarios (MAR0 OR=3, MAR1 OR=3, and MNAR OR=1/3; simulation scenarios are described in the Methods section) when 50% or more of the observations were missing. This happened in only 1 or 2 out of 1000 data sets. In imputed data, the problem was slightly more evident, especially with 75% imputed data where on average it happened in three out of 1000 data sets. Due to the limited extent, however, the problem was disregarded.
Summary of parameter estimates obtained from complete and simulated missing data
Data ^{ 1 } | N | Intercept (SD) | RMeSE ^{ 2 } | Covariate ^{ 3 }(SD) | RMeSE ^{ 2 } | σ ^{ 2 } (SD) | RMeSE ^{ 2 } | Range ^{ 4 } (SD) | RMeSE ^{ 2 } |
---|---|---|---|---|---|---|---|---|---|
Complete | -4.3 (0.31) | - | 0.63 (0.052) | - | 0.49 (0.21) | - | 12.7 (4.4) | - | |
MCAR | |||||||||
5% | 1000 | -4.3 (0.32) | 0.047 | 0.62 (0.053) | 0.0094 | 0.49 (0.21) | 0.028 | 12.8 (4.7) | 0.8 |
10% | 1000 | -4.3 (0.32) | 0.066 | 0.62 (0.055) | 0.013 | 0.48 (0.22) | 0.039 | 12.9 (4.8) | 1.3 |
15% | 1000 | -4.3 (0.33) | 0.087 | 0.63 (0.056) | 0.018 | 0.49 (0.22) | 0.049 | 12.8 (4.9) | 1.4 |
25% | 1000 | -4.3 (0.35) | 0.12 | 0.63 (0.060) | 0.025 | 0.47 (0.24) | 0.069 | 13.1 (5.4) | 2.0 |
50% | 994 | -4.3 (0.42) | 0.21 | 0.63 (0.074) | 0.041 | 0.45 (0.30) | 0.13 | 13.1 (6.8) | 3.7 |
75%^{5} | 940 | -4.3 (0.56) | 0.35 | 0.63 (0.11) | 0.073 | 0.44 (0.48) | 0.22 | 12.9 (10.7) | 5.4 |
MAR0 OR=1/3 | |||||||||
5% | 1000 | -4.2 (0.32) | 0.055 | 0.61 (0.053) | 0.012 | 0.49 (0.21) | 0.033 | 13.0 (4.9) | 1.1 |
10% | 1000 | -4.3 (0.33) | 0.078 | 0.63 (0.056) | 0.017 | 0.48 (0.22) | 0.042 | 12.6 (4.7) | 1.4 |
15% | 1000 | -4.3 (0.34) | 0.097 | 0.64 (0.058) | 0.020 | 0.48 (0.22) | 0.046 | 12.6 (4.8) | 1.6 |
25% | 1000 | -4.3 (0.36) | 0.16 | 0.64 (0.062) | 0.028 | 0.50 (0.25) | 0.061 | 13.1 (5.6) | 2.1 |
50% | 1000 | -4.4 (0.44) | 0.21 | 0.66 (0.074) | 0.045 | 0.55 (0.32) | 0.11 | 13.8 (6.8) | 3.1 |
75%^{ 5 } | 985 | -4.5 (0.58) | 0.33 | 0.67 (0.098) | 0.067 | 0.63 (0.48) | 0.20 | 13.4 (9.0) | 4.1 |
MAR0 OR=3 | |||||||||
5% | 1000 | -4.1 (0.33) | 0.17 | 0.57 (0.058) | 0.048 | 0.47 (0.23) | 0.059 | 13.8 (5.7) | 1.7 |
10% | 1000 | -4.3 (0.33) | 0.078 | 0.62 (0.056) | 0.017 | 0.47 (0.22) | 0.048 | 12.8 (5.0) | 1.4 |
15% | 1000 | -4.2 (0.33) | 0.10 | 0.62 (0.058) | 0.020 | 0.46 (0.23) | 0.062 | 13.1 (5.2) | 1.6 |
25% | 1000 | -4.2 (0.35) | 0.13 | 0.61 (0.062) | 0.028 | 0.44 (0.25) | 0.087 | 13.2 (5.8) | 2.3 |
50% | 987 | -4.2 (0.41) | 0.20 | 0.60 (0.074) | 0.041 | 0.41 (0.31) | 0.13 | 11.9 (7.1) | 3.9 |
75%^{5} | 976 | -4.2 (0.51) | 0.31 | 0.58 (0.10) | 0.066 | 0.35 (0.42) | 0.21 | 10.6 (9.2) | 5.3 |
MAR1 OR=1/3 | |||||||||
5% | 1000 | -4.7 (0.36) | 0.46 | 0.72 (0.061) | 0.093 | 0.50 (0.22) | 0.030 | 12.6 (4.6) | 1.0 |
10% | 1000 | -4.9 (0.38) | 0.65 | 0.75 (0.066) | 0.13 | 0.51 (0.23) | 0.038 | 12.2 (4.5) | 1.1 |
15% | 1000 | -5.1 (0.40) | 0.80 | 0.78 (0.071) | 0.16 | 0.52 (0.24) | 0.047 | 11.8 (4.4) | 1.4 |
25% | 1000 | -5.3 (0.44) | 1.03 | 0.83 (0.079) | 0.21 | 0.53 (0.25) | 0.055 | 11.1 (4.1) | 1.9 |
50% | 1000 | -5.6 (0.55) | 1.36 | 0.89 (0.098) | 0.27 | 0.58 (0.29) | 0.10 | 9.5 (3.5) | 3.3 |
75%^{ 5 } | 1000 | -5.9 (0.73) | 1.57 | 0.94 (0.13) | 0.31 | 0.61 (0.35) | 0.14 | 9.4 (4.0) | 3.5 |
MAR1 OR=3 | |||||||||
5% | 1000 | -4.1 (0.33) | 0.17 | 0.57 (0.058) | 0.048 | 0.47 (0.23) | 0.059 | 13.8 (5.7) | 1.7 |
10% | 1000 | -4.0 (0.33) | 0.28 | 0.54 (0.062) | 0.082 | 0.46 (0.25) | 0.074 | 13.8 (6.1) | 2.0 |
15% | 998 | -3.9 (0.34) | 0.38 | 0.51 (0.065) | 0.11 | 0.46 (0.26) | 0.091 | 13.9 (6.5) | 2.5 |
25% | 985 | -3.7 (0.36) | 0.54 | 0.46 (0.072) | 0.16 | 0.47 (0.31) | 0.10 | 13.7 (6.7) | 2.7 |
50% | 942 | -3.4 (0.40) | 0.84 | 0.35 (0.089) | 0.27 | 0.53 (0.43) | 0.14 | 11.9 (7.5) | 4.1 |
75%^{ 5 } | 932 | -3.2 (0.46) | 1.10 | 0.25 (0.12) | 0.38 | 0.55 (0.61) | 0.22 | 10.3 (7.9) | 4.9 |
MNAR OR=1/3 | |||||||||
5% | 1000 | -4.2 (0.31) | 0.051 | 0.62 (0.052) | 0.0072 | 0.49 (0.21) | 0.019 | 12.8 (4.6) | 0.6 |
10% | 1000 | -4.2 (0.32) | 0.091 | 0.62 (0.053) | 0.012 | 0.49 (0.21) | 0.033 | 12.8 (4.7) | 1.0 |
15% | 1000 | -4.1 (0.32) | 0.14 | 0.62 (0.054) | 0.014 | 0.49 (0.22) | 0.037 | 12.9 (4.8) | 1.2 |
25% | 1000 | -4.0 (0.33) | 0.24 | 0.62 (0.056) | 0.017 | 0.49 (0.23) | 0.056 | 13.0 (5.1) | 1.6 |
50% | 997 | -3.8 (0.37) | 0.52 | 0.61 (0.065) | 0.032 | 0.47 (0.27) | 0.099 | 13.4 (6.1) | 2.6 |
75%^{ 5 } | 977 | -3.4 (0.47) | 0.87 | 0.61 (0.085) | 0.047 | 0.45 (0.38) | 0.18 | 12.7 (8.4) | 4.2 |
MNAR OR=3 | |||||||||
5% | 1000 | -4.4 (0.32) | 0.092 | 0.62 (0.055) | 0.013 | 0.48 (0.22) | 0.040 | 12.6 (4.7) | 1.3 |
10% | 1000 | -4.5 (0.34) | 0.17 | 0.63 (0.058) | 0.023 | 0.48 (0.23) | 0.058 | 12.9 (5.0) | 1.7 |
15% | 1000 | -4.5 (0.35) | 0.26 | 0.63 (0.061) | 0.026 | 0.48 (0.24) | 0.071 | 13.0 (5.4) | 2.2 |
25% | 999 | -4.7 (0.40) | 0.42 | 0.64 (0.068) | 0.039 | 0.46 (0.27) | 0.10 | 12.6 (5.7) | 3.1 |
50%^{ 5 } | 983 | -5.0 (0.53) | 0.69 | 0.64 (0.093) | 0.068 | 0.44 (0.38) | 0.19 | 12.2 (8.2) | 4.9 |
75%^{ 5 } | 905 | -5.3 (0.83) | 1.00 | 0.65 (0.15) | 0.18 | 0.34 (0.62) | 0.44 | 14.8 (19.3) | 6.7 |
Summary of parameter estimates obtained after multiple imputation of simulated missing data
Data ^{ 1 } | N | Intercept (SD) | RMeSE ^{ 2 } | Covariate ^{ 3 } (SD) | RMeSE ^{ 2 } | σ ^{ 2 } (SD) | RMeSE ^{ 2 } | Range ^{ 4 } (SD) | RMeSE ^{1} |
---|---|---|---|---|---|---|---|---|---|
MCAR | |||||||||
5% | 998 | -4.3 (0.32) | 0.046 | 0.62 (0.053) | 0.0095 | 0.49 (0.22) | 0.031 | 13.2 (5.3) | 1.0 |
10% | 998 | -4.3 (0.33) | 0.071 | 0.62 (0.055) | 0.015 | 0.49 (0.23) | 0.042 | 13.4 (5.5) | 1.4 |
15% | 998 | -4.3 (0.33) | 0.090 | 0.62 (0.057) | 0.019 | 0.48 (0.23) | 0.055 | 13.3 (5.8) | 1.6 |
25% | 997 | -4.3 (0.35) | 0.12 | 0.63 (0.060) | 0.025 | 0.46 (0.24) | 0.077 | 13.2 (6.0) | 2.1 |
50%^{ 5 } | 924 | -4.3 (0.38) | 0.22 | 0.62 (0.066) | 0.042 | 0.40 (0.28) | 0.13 | 15.1 (9.2) | 3.6 |
75%^{ 5 } | 571 | -4.2 (0.39) | 0.37 | 0.62 (0.069) | 0.069 | 0.44 (0.38) | 0.18 | 15.2 (15.0) | 4.2 |
MAR0 OR=1/3 | |||||||||
5% | 1000 | -4.3 (0.32) | 0.054 | 0.62 (0.053) | 0.012 | 0.49 (0.22) | 0.038 | 13.4 (5.4) | 1.3 |
10% | 1000 | -4.3 (0.33) | 0.091 | 0.64 (0.056) | 0.021 | 0.47 (0.22) | 0.047 | 13.1 (5.4) | 1.3 |
15% | 1000 | -4.3 (0.33) | 0.10 | 0.64 (0.057) | 0.022 | 0.46 (0.23) | 0.058 | 13.1 (5.6) | 1.5 |
25% | 1000 | -4.3 (0.35) | 0.13 | 0.64 (0.060) | 0.026 | 0.46 (0.24) | 0.064 | 13.9 (6.4) | 2.4 |
50%^{ 5 } | 990 | -4.3 (0.37) | 0.20 | 0.65 (0.064) | 0.041 | 0.42 (0.25) | 0.10 | 14.1 (7.4) | 2.7 |
75%^{ 5 } | 846 | -4.3 (0.39) | 0.31 | 0.65 (0.068) | 0.060 | 0.41 (0.30) | 0.14 | 15.0 (9.9) | 3.6 |
MAR0 OR=3 | |||||||||
5% | 1000 | -4.1 (0.33) | 0.17 | 0.58 (0.056) | 0.047 | 0.46 (0.24) | 0.067 | 14.7 (6.7) | 2.3 |
10% | 1000 | -4.2 (0.33) | 0.084 | 0.61 (0.056) | 0.018 | 0.47 (0.23) | 0.051 | 13.5 (5.9) | 1.6 |
15% | 1000 | -4.3 (0.33) | 0.10 | 0.62 (0.057) | 0.021 | 0.46 (0.25) | 0.076 | 13.4 (6.2) | 1.5 |
25% | 986 | -4.2 (0.34) | 0.13 | 0.61 (0.059) | 0.027 | 0.41 (0.27) | 0.11 | 14.5 (7.3) | 2.7 |
50%^{ 5 } | 821 | -4.0 (0.35) | 0.25 | 0.58 (0.063) | 0.047 | 0.36 (0.29) | 0.15 | 14.9 (11.0) | 3.8 |
75%^{ 5 } | 536 | -4.0 (0.37) | 0.32 | 0.58 (0.067) | 0.069 | 0.44 (0.52) | 0.19 | 16.1 (20.3) | 4.6 |
MAR1 OR=1/3 | |||||||||
5% | 1000 | -4.8 (0.34) | 0.48 | 0.72 (0.059) | 0.096 | 0.50 (0.23) | 0.033 | 12.6 (5.0) | 1.1 |
10% | 1000 | -4.9 (0.38) | 0.61 | 0.75 (0.065) | 0.12 | 0.51 (0.24) | 0.041 | 12.4 (5.1) | 1.2 |
15% | 1000 | -5.1 (0.38) | 0.85 | 0.79 (0.066) | 0.17 | 0.52 (0.25) | 0.052 | 12.1 (5.0) | 1.5 |
25% | 1000 | -5.4 (0.40) | 1.08 | 0.84 (0.071) | 0.22 | 0.53 (0.26) | 0.060 | 11.2 (4.7) | 1.9 |
50%^{ 5 } | 1000 | -5.6 (0.45) | 1.30 | 0.88 (0.080) | 0.26 | 0.54 (0.29) | 0.081 | 10.5 (4.6) | 2.5 |
75%^{ 5 } | 995 | -5.8 (0.51) | 1.52 | 0.93 (0.090) | 0.30 | 0.52 (0.32) | 0.093 | 10.5 (5.2) | 2.8 |
MAR1 OR=3 | |||||||||
5% | 1000 | -4.1 (0.33) | 0.17 | 0.57 (0.055) | 0.051 | 0.46 (0.24) | 0.069 | 14.5 (6.7) | 2.4 |
10% | 990 | -4.0 (0.33) | 0.28 | 0.54 (0.057) | 0.079 | 0.43 (0.25) | 0.087 | 15.4 (8.0) | 2.9 |
15% | 934 | -3.9 (0.33) | 0.39 | 0.52 (0.058) | 0.10 | 0.42 (0.26) | 0.10 | 15.8 (8.7) | 3.4 |
25% | 880 | -3.7 (0.34) | 0.54 | 0.45 (0.058) | 0.17 | 0.44 (0.30) | 0.10 | 14.6 (8.0) | 2.8 |
50%^{ 5 } | 644 | -3.4 (0.34) | 0.85 | 0.36 (0.059) | 0.26 | 0.46 (0.35) | 0.14 | 15.7 (11.2) | 3.7 |
75%^{ 5 } | 566 | -3.1 (0.33) | 1.18 | 0.25 (0.062) | 0.37 | 0.57 (0.50) | 0.16 | 13.6 (12.6) | 3.6 |
MNAR OR=1/3 | |||||||||
5% | 1000 | -4.2 (0.31) | 0.064 | 0.62 (0.052) | 0.0087 | 0.49 (0.22) | 0.026 | 13.2 (5.3) | 0.9 |
10% | 1000 | -4.2 (0.32) | 0.092 | 0.62 (0.053) | 0.011 | 0.49 (0.22) | 0.038 | 13.3 (5.3) | 1.2 |
15% | 1000 | -4.1 (0.32) | 0.17 | 0.62 (0.054) | 0.015 | 0.48 (0.22) | 0.045 | 13.4 (5.6) | 1.4 |
25% | 1000 | -4.1 (0.33) | 0.22 | 0.62 (0.056) | 0.018 | 0.46 (0.23) | 0.064 | 13.7 (6.0) | 1.8 |
50%^{ 5 } | 972 | -3.7 (0.33) | 0.62 | 0.60 (0.057) | 0.033 | 0.40 (0.23) | 0.12 | 14.7 (7.3) | 2.8 |
75%^{ 5 } | 727 | -3.3 (0.34) | 0.98 | 0.59 (0.061) | 0.051 | 0.34 (0.24) | 0.18 | 15.4 (10.1) | 3.9 |
MNAR OR=3 | |||||||||
5% | 1000 | -4.4 (0.32) | 0.097 | 0.63 (0.055) | 0.014 | 0.47 (0.22) | 0.045 | 13.0 (5.3) | 1.3 |
10% | 1000 | -4.5 (0.34) | 0.18 | 0.63 (0.056) | 0.023 | 0.49 (0.24) | 0.057 | 13.6 (5.9) | 1.8 |
15% | 1000 | -4.5 (0.40) | 0.26 | 0.63 (0.061) | 0.028 | 0.47 (0.25) | 0.074 | 13.6 (6.3) | 2.3 |
25% | 983 | -4.7 (0.40) | 0.39 | 0.64 (0.067) | 0.040 | 0.45 (0.29) | 0.11 | 13.3 (7.0) | 3.0 |
50%^{ 5 } | 727 | -5.0 (0.45) | 0.46 | 0.64 (0.080) | 0.046 | 0.50 (0.46) | 0.14 | 13.9 (11.8) | 3.6 |
75%^{ 5 } | 485 | -5.3 (0.55) | 0.98 | 0.63 (0.096) | 0.10 | 1.16 (5.5) | 0.67 | 15.9 (27.3) | 4.5 |
Missing data
The effect of missing observations on the regression parameter estimates was dependent upon the missing data mechanism (Table 1). In the MCAR and MAR0 scenarios, the values of the estimates were not substantially affected. In the MAR1 scenarios, both parameters (intercept and covariate effect) were affected. These results were all expected, since the missing observations were dependent upon the covariate in the MAR1 scenario, whereas the missing observations did not depend upon either the covariate or the outcome in the MAR0 scenario.
In the MNAR scenarios, only the intercept was affected. The covariate effect was not affected, since only the outcome was MNAR.
The variance parameter estimates were all reasonably similar, but with a slight tendency to either increase (especially MAR0 OR=1/3 and MAR1 OR=1/3) or decrease (especially MAR0 OR=3) with more than 50% missing data. Both the standard deviation of each parameter estimate, and the Root Median Squared Error (RMeSE) increased when the number of missing observations was increased, regardless of scenario.
Overall, the most pronounced effect on the range estimate was seen in the MAR1 scenarios, where the missing observations were dependent upon the covariate. The specific effect of missing data on the range estimate in these scenarios is the result of the combination of the covariate and the outcome, as well as their spatial distribution, and the effect might therefore be different in another data set. The range of influence might also actually depend on the covariate, yet a potential explanation for the observed pattern is not obvious. In the MAR0 scenarios, the missing observations were directly related to the spatial structure of the data, and a more distinct effect than the observed might have been expected.
No detectable spatial correlation (range ≥ 75 km) occurred mainly among data sets with 50% and/or 75% missing observations. This is where we would expect that any spatial correlation would be most depleted by the missing data. This happened in a maximum of 95 of 1000 data sets, which was in the MNAR OR=3, 75%-scenario, where observations with a positive outcome status were most likely to be missing and hence only very reduced information about model parameters were contained in the data. The MAR1 OR=1/3 scenario was the only scenario where all data sets displayed a spatial correlation, even with 75% missing data. The MAR1 OR=3 scenario, on the contrary, had more data sets displaying no spatial correlation than any other scenario. This could partly be explained by the changed prevalence in the data (increased prevalence in the MAR1 OR=1/3 scenario and vice versa), but the pattern was not as pronounced when the missing data depended on the outcome itself (MNAR scenarios). This suggests that the observations excluded in the MAR1 OR=3 scenario exhibited the strongest spatial correlation. This would again be related to the specific data set.
The RMeSE of the range increased with an increasing number of missing observations in all scenarios. The increase was especially pronounced with more than 50% missing data. Hence, even though the overall median of the range estimates did not change much, more substantial deviations from the estimate obtained from the complete data did occur within single data sets with more than 50% missing data. In the MCAR 75% scenario, for example, the median range was 0.2 km larger than in the complete data, but the median deviation was 5.4 km.
Imputed data
The number of data sets (N) with detectable spatial correlation in Table 2, was not directly comparable to the corresponding number in Table 1. In multiple imputation, the incomplete data set is substituted by a set of complete data sets. If any of the data sets in such a set did not exhibit spatial correlation (i.e. the range estimate was ≥ 75 km), then the whole set was excluded from Table 2. This was done in order to retain the number of imputations in each incomplete data set. For example, 429 sets of imputed data were excluded in the MCAR 75% scenario. This means that at least 429 of 10000 imputed data sets (10 for each incomplete data set) showed no detectable spatial correlation, as compared to 60 of 1000 incomplete data sets. In this scenario actually 970 of 10000 imputed data sets showed no spatial correlation. Overall, a lack of spatial correlation occurred more frequently with imputed data than with missing data (data not shown), and mainly with imputation of more than 50% missing observations.
Summary of parameter estimates obtained after multiple imputation of 50% simulated missing data
Data ^{ 1 } | N | Intercept (SD) | RMeSE ^{ 2 } | Covariate ^{ 3 }(SD) | RMeSE ^{ 2 } | σ ^{ 2 } (SD) | RMeSE ^{ 2 } | Range ^{ 4 }(SD) | RMeSE ^{2} |
---|---|---|---|---|---|---|---|---|---|
MCAR | 732 | -4.1 (0.33) | 0.28 | 0.61 (0.062) | 0.041 | 0.14 (0.16) | 0.35 | 11.9 (16.3) | 3.5 |
MAR0 | |||||||||
OR=1/3 | 963 | -4.2 (0.33) | 0.19 | 0.64 (0.064) | 0.039 | 0.20 (0.16) | 0.30 | 10.1 (7.4) | 3.2 |
OR=3 | 628 | -4.0 (0.33) | 0.29 | 0.58 (0.063) | 0.049 | 0.12 (0.18) | 0.37 | 11.9 (21.3) | 3.7 |
MAR1 | |||||||||
OR=1/3 | 998 | -5.5 (0.44) | 1.20 | 0.87 (0.079) | 0.25 | 0.33 (0.21) | 0.16 | 9.8 (5.3) | 3.0 |
OR=3 | 625 | -3.3 (0.28) | 1.03 | 0.36 (0.057) | 0.26 | 0.10 (0.18) | 0.40 | 9.3 (20.3) | 4.3 |
MNAR | |||||||||
OR=1/3 | 772 | -3.6 (0.29) | 0.69 | 0.59 (0.054) | 0.037 | 0.15 (0.14) | 0.35 | 13.3 (14.1) | 3.1 |
OR=3 | 636 | -4.6 (0.42) | 0.47 | 0.61 (0.078) | 0.063 | 0.14 (0.22) | 0.35 | 10.0 (16.2) | 4.4 |
Conclusion
This simulation study investigated how the estimated range of influence was affected by missing outcomes in binary spatial data. This is a relevant topic since missing data are a common feature in many analyses. The results showed that the effect on the range estimate was to some extent dependent upon the missing data mechanism. When the missing outcomes were MCAR, MAR depending on a covariate not correlated with the outcome, or even MNAR, the range estimates were consistent with ≤25% missing data. When the missing outcomes were MAR depending on a covariate which correlated with the outcome, the range estimate was affected by even a moderate number of missing observations. In this specific study, however, the considered covariate was possibly also related to the range itself. This added to the complexity of the situation and may have also contributed to the effect of the missing outcomes in this scenario. In general, the overall effect of missing observations was small compared to the uncertainty of the range estimate. Multiple imputation of the missing observations provided a potential improvement in the range estimate in the case of ≥ 50% missing data, but with increased uncertainty of the estimate as a consequence.
The range of influence was estimated in a logistic regression model with a spatially structured random effect, using the recently developed INLA approach to Bayesian inference. Overall, this approach worked very efficiently. The range estimate proved to be sensitive to the prior distribution of the hyperparameters when the amount of missing observations was increased. Various computational problems sometimes encountered with many missing observations (≥50%), could be solved by adding prior information to the hyperparameters. Other possibilities for optimising the INLA procedure do exist and could potentially provide a better solution, especially when working with a specific data set as opposed to the automated analyses of a simulation study.
This study was based on the simulation of missing data in a specific complete data set. The “true” range of influence was defined by this complete data set and was not varied within the simulations. To fully explore a possible dependence on for example the extent of the range and the strength of the correlation, would require completely simulated data sets. This should include the spatial locations of the observations, since different spatial patterns may also influence the effect of missing observations.
Methods
Data
The study was based on a complete data set with simulated missing outcome. All information (outcome, covariates, and locations) was taken from the complete data set, and then some of the observations were defined to be missing, according to different simulation scenarios. Data on Salmonella Dublin in Danish cattle herds were used as the complete data set. These data were available, since Denmark has a mandatory surveillance program on Salmonella Dublin. The Salmonella Dublin infection as such was not of any interest in this study.
For the analysis, all cattle herds located in the southern part of Northern Jutland (region NJS) (Figure 3a) in the last quarter of 2008 were included (N=1597). Four herds had no information on herd size (assumed missing completely at random). Since the focus in this study is on missing outcomes, these herds were excluded; resulting in a total of N=1593 (470 dairy, 1123 non-dairy) herds. Among these, 278 herds (17.4%) had a positive Salmonella Dublin status (Figure 3b). The considered covariates were herd size (Figure 3c) and herd density (Figure 3d). Herd size was log-transformed since the distribution was skewed. Herd size correlated with Salmonella Dublin status (corr=0.35, p <0.0001), whereas herd density and Salmonella Dublin status did not significantly correlate (corr.=0.044, p=0.082). Herd size and herd density were uncorrelated (corr=0.035, p=0.17).
Simulation of missing outcome
Let y _{ i }, i=1,…,1593, denote the Salmonella Dublin status (positive/negative) of herd i. Based on the completely observed data vector y=(y _{ i })_{ i=1,…,1593}, missing observations were generated by simulating a vector M=(M _{ i })_{ i=1,…,1593} with M _{ i }∈{0,1}, i=1,…,1593. If M _{ i }=1, the corresponding observation y _{ i } was set to missing. Scenarios with 5%, 10%, 15%, 25%, 50%, and 75% missing observations were considered. Within each scenario, 1000 replications of M were produced. Through the simulation of M, observations within y were defined to be missing in three different ways: 1) missing completely at random (MCAR), 2) depending on an observed covariate (MAR), and 3) depending on the observation itself (MNAR).
with parameters μ, ν chosen as above.
Statistical model
where Δ _{ rs } denotes the distance between location z _{ r } and z _{ s }, and K _{ λ } is the modified Bessel function of the second kind and order λ. The smoothness parameter λ is typically poorly identified and was fixed at 1, κ is a scaling parameter, and σ ^{2} is the marginal variance. This covariance function was verified as providing a suitable model for the data by fitting it to the sample semivariogram of the residuals obtained from fitting the logistic regression (4) without the GF. Based on the covariance function (5), the range of influence is defined as , as in [1]. This corresponds to the distance at which the spatial correlation is close to 0.1 for all λ.
where the matrix A=(A _{ iv }(z _{ i }))_{ i=1,…,N,v=1,…,V } is the projection from the triangulation vertices to the observation locations (which are not necessarily included as vertices).
Inference
To evaluate the estimates obtained in the simulation study, the Root Median Squared Error (RMeSE), alternative to the traditional Root Mean Squared Error, was calculated as the square root of the median of (estimate with complete data - estimate with missing data) ^{2} within each simulation scenario.
The simulated data were analysed using parallel computing. Analyses were run on an external supercomputing facility (i.e. a cluster of computers), due to the size of the simulation study. Parallel computing could, however, be performed on any standard personal computer with multiple CPU cores. Parallel computing is very useful for simultaneous analysis of multiple data sets, for example in simulation studies or with multiple imputed data. It cannot be used when analysing a single data set.
Parallel computing was performed using the R package parallel. With the chosen triangulation and the required output (e.g. predicted values) it took around 12-15 hours to analyse 1000 data sets (16 cores, 2.66Ghz CPU). R code is supplied as Additional file 1.
Multiple imputation
Imputation of the simulated missing outcome was based on model (7), which was fitted to the available data. The available covariates (herd density and (log-) herd size) were included in the model. A predicted probability was sampled from the posterior distribution, and a binary outcome was then generated based on this probability. This was done at each location where the outcome was not observed, whereby a complete data set was created. This process was repeated to produce a number of imputed data sets corresponding to each incomplete data set. The number of imputed data sets created depended on the amount of missing observations. This was done in an attempt to ensure the same efficiency of the estimates across the simulation scenarios. Classical recommendations [4] suggest that only a small number of imputed data sets are needed, hence 3 data sets were created when 5% of data were missing, 5 data sets were created when 10%, 15%, 25% of data were missing, and 10 data sets were created when 50%, 75% of data were missing. Each imputed data set was analysed individually, and estimates were then combined using Rubin’s rules [3] to obtain the overall estimates corresponding to each incomplete data set. In general, each individual estimate should be approximately Gaussian distributed and otherwise transformed prior to combination [9]. The estimates of the variance parameter σ ^{2} and the range of influence had skewed distributions, and were therefore log-transformed. The combined estimate on the original scale was subsequently obtained using standard theory for the lognormal distribution [10]. Hence, if is the individual estimate obtained as the mean value of the log-transformed posterior distribution, and is the combined estimate obtained from , m=1,…,M, then the combined estimate on the original scale is given by
where . The variance of the combined estimate on the original scale is given by
For comparison, the imputation model was changed from model (7) to a standard logistic regression model without inclusion of a spatial component.
Declarations
Acknowledgments
KB was funded by a PhD grant at the Faculty of Health and Medical Sciences, University of Copenhagen. The authors would like to thank Søren Saxmose Nielsen for revising the manuscript, and Nils Toft for commenting on analyses and revising the manuscript.
Authors’ Affiliations
References
- Lindgren F, Rue H, Lindström J: An explicit link between Gaussian fields and Gaussian Markov random fields: the stochastic partial differentation approach . J R Stat Soc B 2011, 73 Part 4:423–498.View ArticleGoogle Scholar
- Rue H, Martino S, Chopin N: Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations (with discussion) . J R Stat Soc B 2009,71(2):319–392. 10.1111/j.1467-9868.2008.00700.xView ArticleGoogle Scholar
- Little RJA, Rubin DB: Statistical analysis with missing data. Hoboken, New Jersey: Wiley; 2002.View ArticleGoogle Scholar
- Rubin DB: Multiple imputation for nonresponse in surveys. New York: Wiley; 1987.View ArticleGoogle Scholar
- von Hippel PT: Regression with missing ys: An improved strategy for analyzing multiply imputed data . Sociol Methodol 2007, 37:83–117. 10.1111/j.1467-9531.2007.00180.xView ArticleGoogle Scholar
- Ersbøll AK, Nielsen LR: The range of influence between cattle herds is of importance for the local spread of Salmonella Dublin in Denmark . Prev Vet Med 2008, 84:277–290. 10.1016/j.prevetmed.2007.12.005View ArticlePubMedGoogle Scholar
- Tierney L, Kadane J: Accurate approximations for posterior moments and marginal densities . J Am Stat Assoc 1986,393(81):82–86.View ArticleGoogle Scholar
- R Core Team: R: A language and environment for statistical computing . Vienna, Austria: R Foundation for Statistical Computing; 2013. http://www.R-project.org/
- White IR, Royston P, Wood AM: Multiple imputation using chained equations: Issues and guidance for practice . Stat Med 2011, 30:377–399. 10.1002/sim.4067View ArticlePubMedGoogle Scholar
- Aitchison A, Brown JAC: The Lognormal Distribution. London: Cambridge University Press; 1957.Google Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.