Individual level covariate adjusted conditional autoregressive (indiCAR) model for disease mapping
International Journal of Health Geographics volume 15, Article number: 25 (2016)
Abstract
Background
Mapping disease rates over a region provides a visual illustration of underlying geographical variation of the disease and can be useful to generate new hypotheses on the disease aetiology. However, methods to fit the popular and widely used conditional autoregressive (CAR) models for disease mapping are not feasible in many applications due to memory constraints, particularly when the sample size is large. We propose a new algorithm to fit a CAR model that can accommodate both individual and group level covariates while adjusting for spatial correlation in the disease rates, termed indiCAR. Our method scales well and works in very large datasets where other methods fail.
Results
We evaluate the performance of the indiCAR method through simulation studies. Our simulation results indicate that the indiCAR provides reliable estimates of all the regression and random effect parameters. We also apply indiCAR to the analysis of data on neutropenia admissions in New South Wales (NSW), Australia. Our analyses reveal that lower rates of neutropenia admissions are significantly associated with individual level predictors including higher age, male gender, residence in an outer regional area and a group level predictor of social disadvantage, the socioeconomic index for areas. A large value for the spatial dependence parameter is estimated after adjusting for individual and area level covariates. This suggests the presence of important variation in the management of cancer patients across NSW.
Conclusions
Incorporating individual covariate data in disease mapping studies improves the estimation of fixed and random effect parameters by utilizing information from multiple sources. Health registries routinely collect individual and area level information and thus could benefit by using indiCAR for mapping disease rates. Moreover, the natural applicability of indiCAR in a distributed computing framework enhances its application in the Big Data domain with a large number of individual/group level covariates. CI NSW Study Reference Number: 2012/07/410. Dated: July 2012.
Background
The risks of many diseases and health outcomes may vary across geographical locations because of locally varying distributions of socioeconomic, behavioural and environmental risk factors [1]. These spatially correlated risk factors can have important implications for the observed disease rates in small areas. Mapping disease rates over a region offers a visual illustration of geographical variation. These maps are particularly useful for generating new hypotheses through identifying apparently high risk areas or disease clusters [2]. However, producing such maps is complicated by the fact that raw incidence rates are often unstable due to small incidence counts, spatial correlation among rates and also due to the variation in individual patient characteristics [3–5].
Poisson mixed models with conditional autoregressive random effects are commonly used for assessing the relationship between a rare disease outcome and risk factors in the presence of geographical variation [6]. These models can adjust for region-specific spatial random effects for correlated disease rates and for both individual and region-specific covariates. However, fitting such models carries a high computational burden, particularly when the sample size is large and when the number of individual and group level covariates is large. To alleviate such problems, investigators often adjust for the age and sex distribution of the underlying population through calculation of an offset in the model [7]. As a consequence, the effects of age and sex on disease risk cannot be estimated from these models. Moreover, such an approach ignores a large number of potential individual level covariates that may be related to the underlying disease process and are readily available in health registries.
Health registries routinely collect geocoded information relating to the patient’s residence at diagnosis, their sociodemographic status and their clinical characteristics. In addition, information on locally varying socioeconomic, behavioral and environmental risk factors for each area under study can also be obtained from other data sources. For example, in Australia, New South Wales (NSW) cancer registries collect cancer treatment and outcome information for each patient diagnosed with cancer, along with their sociodemographic characteristics. Additionally, a socioeconomic index for areas (SEIFA) and an area specific index for remoteness (ARIA) of each patient’s residence can be obtained from the Census Bureau. Combining these individual and area level characteristics in mapping studies can help researchers and policy makers to understand the relative contribution of both individual and group level covariates to the observed cancer rates. In addition, combining such data can also reduce ecological bias, which occurs when the group level exposure–disease relationship does not reflect the individual level relationship. A reduction in this bias leads to improved inference about both our group and individual level covariates [8, 9]. In this paper we propose a novel approach that enables the study of individual level risk factors in mapping studies.
The aim of our current research is to make use of routinely collected administrative cancer treatment and outcome data to explore the possible geographical variation in the rate of neutropenia admissions corresponding to all cancer types across NSW. Neutropenia is a blood disorder with an abnormally low number of neutrophil granulocytes (a type of white blood cell in the blood), often associated with fever. It is a life threatening complication of cancer chemotherapy and a major cause of morbidity and associated healthcare resource costs. Furthermore, neutropenia results in compromised efficacy due to delays and dose reductions in chemotherapy [10].
NSW is the most populated state in Australia with a population of approximately 7.6 million people. Geographical variations in neutropenia admissions are of particular interest because of the uneven geographical concentration of the population within the state. As a result of this uneven population density, the level of access to health care services is not uniform across the whole region [11]. Moreover, neutropenia incidence might also depend on patient age and cancer type, as treatment modalities often vary across different types of cancer and age groups. Therefore, appropriate analysis of geographical variation of neutropenia admissions requires adjustment for both the patient’s demographic characteristics and covariates reflecting the patient’s geographic location of residence. In our current application, we explore whether there is any spatial variation in the rates of neutropenia admissions after adjusting for patients’ individual and clinical characteristics.
In our proposed method, hereafter known as indiCAR, we incorporate individual level covariate information in a two step iterative procedure following an initialization step. At the initialization step, the individual level outcome data are fitted against individual level covariates with a Poisson generalized linear model (GLM), ignoring random effects and group level covariates. Then, at the first step, the individual level outcome data are aggregated at the area level and fitted via a Poisson generalized linear mixed model (GLMM) against area level covariates, including a conditional autoregressive spatial random effect and an offset calculated from the individual covariate contributions. At the second step, the individual level outcome data are fitted via a Poisson GLM with individual level covariates and a second offset calculated from the contribution of the area specific covariates and random effects obtained in the previous step. Steps 1 and 2 are repeated until convergence.
We evaluate the performance of our indiCAR method through simulation studies and also compare indiCAR to the traditional method of age-sex standardisation [7]. Our simulation results show that the proposed indiCAR approach is able to correctly estimate coefficients associated with both individual and group-level covariates. Simulation studies also reveal that our approach is faster than existing approaches such as hlmer with CAR for fitting spatial random effects when the number of individuals within a group is low, and works for large sample sizes where these other methods fail. We illustrate our proposed indiCAR method using data on neutropenia admissions from the NSW Cancer Institute and conclude with some practical guidelines.
Methods
Data
NSW cancer registries were used to identify patients diagnosed with cancer, associated treatment procedures and comorbidities. Specifically, we used data from the NSW Central Cancer Registry (CCR) linked to NSW Admitted Patient Data Collection (APDC). Detailed descriptions of the data items can be obtained from the Centre for Health Record Linkage (CHeReL http://www.cherel.org.au/masterlinkagekey). Data were checked for consistency across data sources and linked by assigning a unique project person number (PPN) to each patient. Our study population comprises all cancer patients that were diagnosed with cancer and were hospitalized during the period between 2001 and 2009.
Demographic variables including age at diagnosis, gender, residence at diagnosis, postal area of residence, and the ARIA were obtained from the CCR database. The ARIA variable was recorded at individual level rather than postal area level because the ARIA index varies within postal areas. The SEIFA (an index of social disadvantage) and the geocoded shape files for mapping corresponding to 2006 census postal areas were obtained from the Australian Bureau of Statistics (ABS). Individual level clinical characteristics such as type of cancer were also obtained from the CCR. The diagnosis of neutropenia admissions and comorbidity were obtained using data from the APDC. The ICD-10-AM (International Statistical Classification of Diseases and Related Health Problems, Tenth Revision, Australian Modification) code D70 (agranulocytosis) was used to identify admissions with possible neutropenia.
The model
Suppose the total area under study is divided into M contiguous regions and the number of neutropenia admissions for the ith (\(i=1,2,\ldots ,n_j\)) individual in the jth \((j=1,2,\ldots ,M)\) region is denoted by \(y_{ij}\). Let \({{\varvec{Y}}}\) be a vector with elements \(\{y_{ij}\}\) that represents the number of neutropenia admissions for all individuals in the study regions of interest. Similarly, let \({\varvec{X}}=(X_1,X_2,\ldots ,X_p)\) and \({\varvec{U}}=(U_1,U_2,\ldots ,U_q)\) represent individual and area level covariate matrices with dimensions \(n\times p\) and \(M\times q\), respectively, where n is the total sample size, i.e., \(n=\sum \nolimits _{j=1}^M n_j\). We define a replication matrix \({\varvec{Z}}\) of dimension \({n\times M}\) to map group level covariates and random effects to the individual level. Its row for the ith individual of region j has the entry in column k given by
\[ Z_{(ij),k}= {\left\{ \begin{array}{ll} 1, & \text {if } k=j,\\ 0, & \text {otherwise.} \end{array}\right. } \]
Under the above specifications, conditional on the area specific random effect vector, \({\varvec{b}}\), the number of neutropenia admissions for each cancer patient is assumed to be a Poisson random variable with mean \({\varvec{\mu}}\), given by
\[ \log ({\varvec{\mu }})={\varvec{X}}{\varvec{\beta }}+{\varvec{Z}}({\varvec{U}}{\varvec{\gamma }}+{\varvec{b}}) \qquad (1) \]
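A minimal sketch of what the replication matrix does, written in Python with NumPy (an assumption; the paper's own computations use R). The toy arrays `region`, `y` and `v` are hypothetical. It illustrates that multiplying by \({\varvec{Z}}\) replicates a group level vector to individuals, while multiplying by \({\varvec{Z}}^{\mathrm{T}}\) sums individual values within regions, so \({\varvec{Z}}\) never has to be stored as a dense matrix.

```python
import numpy as np

# Hypothetical toy data: 6 individuals living in M = 3 regions (0-based labels).
region = np.array([0, 0, 1, 2, 2, 2])   # region index j of each individual
y = np.array([1, 0, 2, 0, 1, 1])        # individual level admission counts
M = 3
n = region.size

# Dense replication matrix Z (n x M): Z[i, j] = 1 if individual i lives in region j.
Z = np.zeros((n, M))
Z[np.arange(n), region] = 1.0

# Mapping a group level quantity to individuals: Z @ v repeats v[j] for each resident.
v = np.array([10.0, 20.0, 30.0])
replicated = Z @ v                       # identical to v[region]

# Aggregating individual counts to regions: Z.T @ y is a grouped sum.
y_c = Z.T @ y                            # identical to np.bincount(region, weights=y)
```

In practice the index-based forms (`v[region]` and `np.bincount`) avoid allocating the \(n\times M\) matrix altogether.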
where \({\varvec{\beta}}\) and \({\varvec{\gamma }}\) are the vectors of regression coefficients associated with the individual level and group level covariates, respectively. Of course, it is possible to express model (1) by replicating group level covariate data to the individual level and including them within the design matrix, \({\varvec{X}}.\) However, such a formulation often results in high computational burden and a large amount of storage memory allocation. Instead, formulation (1) helps to fit individual and group level data separately in a distributed computing framework as will be shown at the end of the current section.
Many different choices for modelling the random effect \({\varvec{b}}\) are available in the mapping literature (see [6] for a recent review). Among these, the method of Leroux et al. [7] is appealing because it allows varying weights between spatially structured and unstructured variation. Within this framework, the random effect vector \({\varvec{b}}\) has a multivariate normal distribution with mean \({\varvec{0}}\) and a covariance matrix \({\varvec{D}}\) specified through its Moore-Penrose generalized inverse, \({\varvec{D}}^{-}=\sigma ^{-2}\{(1-{\varvec{\lambda }})\varvec{I}+\lambda {\varvec{R}}\}\), where \(\varvec{I}\) is the identity matrix and \({\varvec{R}}\) is the intrinsic autoregression matrix reflecting the neighbourhood structure. Typically, neighbours are those areas which share a common boundary, but distance based neighbourhood structures can also be used [12]. Underlying the Leroux et al. [7] approach is the specification of the generalized inverse of the covariance matrix \({\varvec{D}}\). This formulation therefore avoids inverting the covariance matrix \({\varvec{D}}\). Alternatively, one can restrict \({\varvec{\lambda }}\) to the range (0, 1), thus ensuring that \({\varvec{D}}\) is invertible. The typical element of \({\varvec{R}}\) is given by
\[ R_{jj^{\prime }}= {\left\{ \begin{array}{ll} m_j, & \text {if } j=j^{\prime },\\ -I\{j\sim j^{\prime }\}, & \text {if } j\ne j^{\prime }, \end{array}\right. } \]
where \(m_j\) is the number of neighbours of region j, and \(I\{j\sim j^{\prime }\}\) is an indicator function that takes value 1 if regions j and \(j^{\prime }\) are neighbours and 0 otherwise. The parameters characterising the random effect distribution, \({\varvec{\theta }}= (\sigma ^2>0,{\varvec{\lambda }}\in [0,1])\) quantify overdispersion and spatial dependence respectively. A larger value of \(\lambda \in [0,1]\) indicates a higher degree of spatial correlation among proximal areal units. This specification results in two extreme cases: (1) completely independent random effects when \({\varvec{\lambda }}=0\) and (2) the intrinsic autoregressive model when \({\varvec{\lambda }}=1\) [4]. In cases where \(0< {\varvec{\lambda }}< 1\), a weighted combination of these extreme cases is assumed.
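As an illustration of this neighbourhood structure, the sketch below builds \({\varvec{R}}\) for a small square lattice and forms the Leroux precision matrix \({\varvec{D}}^{-}\). It assumes NumPy, rook (shared border) neighbours and hypothetical helper names `lattice_R` and `leroux_precision`; none of these come from the paper.

```python
import numpy as np

def lattice_R(nrow, ncol):
    """Intrinsic autoregression matrix R for a lattice with shared-border neighbours."""
    M = nrow * ncol
    R = np.zeros((M, M))
    for r in range(nrow):
        for c in range(ncol):
            j = r * ncol + c
            for dr, dc in ((1, 0), (0, 1)):      # look down and right only
                rr, cc = r + dr, c + dc
                if rr < nrow and cc < ncol:
                    k = rr * ncol + cc
                    R[j, k] = R[k, j] = -1.0     # off-diagonal: -I{j ~ j'}
    np.fill_diagonal(R, -R.sum(axis=1))          # diagonal: m_j, the neighbour count
    return R

def leroux_precision(R, sigma2, lam):
    """Generalized inverse D^- = sigma^{-2} {(1 - lambda) I + lambda R}."""
    return ((1.0 - lam) * np.eye(R.shape[0]) + lam * R) / sigma2

R = lattice_R(3, 3)
Q_indep = leroux_precision(R, sigma2=0.16, lam=0.0)   # independent random effects
Q_mixed = leroux_precision(R, sigma2=0.16, lam=0.5)   # weighted combination
Q_icar = leroux_precision(R, sigma2=0.16, lam=1.0)    # intrinsic autoregressive model
```

With \(\lambda =1\) the rows of the precision matrix sum to zero, so it is singular, which is why the specification is stated in terms of a generalized inverse.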
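Since the random effects \({\varvec{b}}\) are unobserved, inference about \({\varvec{\beta}}\), \({\varvec{\gamma }}\) and \({\varvec{\theta }}\) can be made by integrating out the distribution of the random effects \({\varvec{b}}\). The corresponding integrated quasi-likelihood function (see equation (2) of Breslow and Clayton [13]) is proportional to
\[ |{\varvec{D}}|^{-1/2}\int \exp \left\{ -\frac{1}{2}\sum _{j=1}^{M}\sum _{i=1}^{n_j}d(y_{ij},\mu _{ij})-\frac{1}{2}{\varvec{b}}^{\mathrm{T}}{\varvec{D}}^{-}{\varvec{b}}\right\} d{\varvec{b}} \]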
where \(d(Y,{\varvec{\mu }})\) refers to the deviance residual associated with observation Y.
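The maximum likelihood estimates of \({\varvec{\beta}}\), \({\varvec{\gamma }}\) and \({\varvec{\theta }}\) are simply those values which maximize the above quasi-likelihood. However, no simple closed form expression exists for the integral. Instead, Breslow and Clayton [13] proposed the penalized quasi-likelihood (PQL) approach for parameter estimation and inference. The PQL uses Laplace's method for integral approximation and jointly maximizes the following quasi-likelihood function to obtain estimates for \({\varvec{\beta}}\), \({\varvec{\gamma }}\) and \({\varvec{b}}({\varvec{\theta }})\) (see equation (6) of Breslow and Clayton [13])
\[ -\frac{1}{2}\sum _{j=1}^{M}\sum _{i=1}^{n_j}d(y_{ij},\mu _{ij})-\frac{1}{2}{\varvec{b}}^{\mathrm{T}}{\varvec{D}}^{-}{\varvec{b}} \]
Under the above specification the approximate log-likelihood can be expressed, up to an additive constant that does not involve the parameters, as
\[ {\varvec{Y}}^{\mathrm{T}}\log ({\varvec{\mu }})-{\varvec{1}}^{\mathrm{T}}{\varvec{\mu }}-\frac{1}{2}{\varvec{b}}^{\mathrm{T}}{\varvec{D}}^{-}{\varvec{b}} \qquad (3) \]
with \(\log ({\varvec{\mu }})\) as in model (1). Differentiating (3) with respect to \({\varvec{\beta}}\), \({\varvec{\gamma }}\) and \({\varvec{b}}\) using vector matrix calculus [14], we obtain the following score equations
\[ {\varvec{X}}^{\mathrm{T}}({\varvec{Y}}-{\varvec{\mu }})={\varvec{0}} \qquad (4) \]
\[ {\varvec{U}}^{\mathrm{T}}{\varvec{Z}}^{\mathrm{T}}({\varvec{Y}}-{\varvec{\mu }})={\varvec{0}} \qquad (5) \]
and
\[ {\varvec{Z}}^{\mathrm{T}}({\varvec{Y}}-{\varvec{\mu }})-{\varvec{D}}^{-}{\varvec{b}}={\varvec{0}} \qquad (6) \]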
Iteratively reweighted least squares (IRLS) can be applied to solve the above equations for \({\varvec{\beta}}\), \({\varvec{\gamma }}\) and \({\varvec{b}}\). However, high computational costs and memory constraints often make it difficult to apply these iterative procedures to data sets with a very large number of cases. An alternative computational strategy is the Gauss–Seidel algorithm. In this method, at each iteration, one block of parameters is estimated while the other parameters are kept fixed at their current values. The advantage of such an approach is that substantial simplifications can be obtained at each step. Using this approach, we first initialize \({\varvec{\beta}}\) (Step 0) and then alternate between updating \({\varvec{\gamma }}\) and \({\varvec{b}}\) (Step 1) and updating \({\varvec{\beta}}\) (Step 2) in the following procedure:
Step 0
Set the coefficients corresponding to area level covariates, \({\varvec{\gamma }}\), and the random effects, \({\varvec{b}}\), to \({\varvec{0}}\) in Eq. (4). Then we have
\[ {\varvec{X}}^{\mathrm{T}}\{{\varvec{Y}}-\exp ({\varvec{X}}{\varvec{\beta }})\}={\varvec{0}} \]
This equation is the estimating equation for a Poisson generalized linear model [14] and thus can be fitted using the existing glm function in the R statistical computing environment [15]. This gives initial estimates of the regression coefficient \({\varvec{\beta}}\) associated with individual level covariates.
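The initialization step can be sketched as follows. This is an illustrative Python/NumPy version of a Poisson GLM fitted by IRLS, not the R `glm` call used in the paper; the function name `poisson_irls`, the simulated data and the coefficient values are hypothetical. The optional `offset` argument is included because the same routine could serve a later step in which an offset is held fixed.

```python
import numpy as np

def poisson_irls(X, y, offset=None, n_iter=50, tol=1e-10):
    """Fit log(mu) = X @ beta + offset by iteratively reweighted least squares."""
    n, p = X.shape
    offset = np.zeros(n) if offset is None else offset
    beta = np.zeros(p)
    for _ in range(n_iter):
        eta = X @ beta + offset
        mu = np.exp(eta)
        z = eta - offset + (y - mu) / mu      # working response, offset removed
        XtW = X.T * mu                        # X^T W with W = diag(mu)
        beta_new = np.linalg.solve(XtW @ X, XtW @ z)
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if converged:
            break
    return beta

# Hypothetical individual level data: intercept, a binary and a continuous covariate.
rng = np.random.default_rng(1)
n = 2000
X = np.column_stack([np.ones(n), rng.integers(0, 2, size=n), rng.uniform(0.2, 1.0, size=n)])
beta_true = np.array([-1.0, 0.3, 0.5])        # assumed values, for illustration only
y = rng.poisson(np.exp(X @ beta_true))

beta_init = poisson_irls(X, y)                # Step 0: gamma and b set to zero
score = X.T @ (y - np.exp(X @ beta_init))     # left-hand side of the Step 0 equation
```

At convergence `score` is numerically zero, which is exactly the estimating equation of Step 0.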
Step 1
Substituting the current estimated individual level coefficients, \({\widehat{\varvec{\beta}}}\), in Eqs. (5) and (6), some simple algebra gives
\[ {\varvec{U}}^{\mathrm{T}}\{{\varvec{Y}}_c-\exp (\text {O}_1+{\varvec{U}}{\varvec{\gamma }}+{\varvec{b}})\}={\varvec{0}} \]
and
\[ {\varvec{Y}}_c-\exp (\text {O}_1+{\varvec{U}}{\varvec{\gamma }}+{\varvec{b}})-{\varvec{D}}^{-}{\varvec{b}}={\varvec{0}} \]
where \({\varvec{Y}}_c^{\mathrm{T}}= {\varvec{Y}}^{\mathrm{T}}{\varvec{Z}}\) is a vector of aggregated disease counts of length M at the group level and \(\text {O}_1=\log \{{\varvec{Z}}^{\mathrm{T}}\exp ({\varvec{X}}{\widehat{{\varvec{\beta}}}})\}\) is an offset vector of length M.
The above two equations are the well known PQL estimating equations for the Poisson mixed model [13]. Since the outcome \({\varvec{Y}}_c\), offset \(\text {O}_1\), covariate \({\varvec{U}}\) and random effects \({\varvec{b}}\) are all measured at the group level, the group level coefficients \({\widehat{{\varvec{\gamma }}}}\) and random effects \({\widehat{\varvec{b}}}\) can be estimated using the PQL method [7, 13] with only group level data. The detailed procedure is described in Appendix 1.
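The "simple algebra" behind Step 1 is that summing the individual level means within each region gives \(\exp (\text {O}_1+{\varvec{U}}{\varvec{\gamma }}+{\varvec{b}})\). The following sketch checks this numerically in Python with NumPy; all data and parameter values are hypothetical and serve only to illustrate the identity.

```python
import numpy as np

rng = np.random.default_rng(0)
M, n = 4, 30
region = rng.integers(0, M, size=n)
region[:M] = np.arange(M)                 # make sure every region is non-empty

X = rng.normal(size=(n, 2))               # individual level covariates
beta_hat = np.array([0.2, -0.1])          # current individual level estimate (assumed)
U = rng.normal(size=(M, 1))               # group level covariate
gamma = np.array([0.3])
b = rng.normal(scale=0.4, size=M)

# Offset O_1 = log{Z^T exp(X beta_hat)}: log of within-region sums.
O1 = np.log(np.bincount(region, weights=np.exp(X @ beta_hat), minlength=M))

# Individual level means under model (1), then aggregated with Z^T.
mu = np.exp(X @ beta_hat + (U @ gamma + b)[region])
mu_c_direct = np.bincount(region, weights=mu, minlength=M)

# The same aggregated means computed from group level quantities only.
mu_c_offset = np.exp(O1 + U @ gamma + b)
```

Because `mu_c_direct` and `mu_c_offset` agree, Step 1 only needs the M aggregated counts and the M offsets, not the n individual records.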
Step 2
Now substitute the area-specific regression coefficient, \({\widehat{{\varvec{\gamma }}}}\), and random effect, \({\widehat{\varvec{b}}}\), estimated at step 1 in (4). Then we have
\[ {\varvec{X}}^{\mathrm{T}}\{{\varvec{Y}}-\exp ({\varvec{X}}{\varvec{\beta }}+\text {O}_2)\}={\varvec{0}} \]
where \(\text {O}_2={\varvec{Z}}({\varvec{U}}{\widehat{\varvec{\gamma}}}+{\widehat{\varvec{b}}})\) is an offset vector of length n. Under the above specification, the individual level coefficient estimates \({\widehat{{\varvec{\beta}}}}\) can then be updated using ordinary Poisson regression with individual level data.
Steps 1 and 2 are then repeated until the algorithm converges. Estimates obtained by this iterative procedure will be the same, aside from rounding error, as the solution obtained by a standard IRLS algorithm.
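The alternation between the two steps can be sketched as below. This is a simplified illustration in Python with NumPy, not the authors' implementation: it holds \({\varvec{\theta }}=(\sigma ^2,\lambda )\) fixed (so \({\varvec{D}}^{-}\) is a known matrix `Q`) rather than estimating it, places the intercept in \({\varvec{U}}\), and uses plain Newton steps. The function names and all simulated values are hypothetical.

```python
import numpy as np

def fit_beta(X, y, offset, beta, n_iter=25):
    """Poisson GLM with a fixed offset, solved by Newton steps (Steps 0 and 2)."""
    for _ in range(n_iter):
        mu = np.exp(X @ beta + offset)
        step = np.linalg.solve((X.T * mu) @ X, X.T @ (y - mu))
        beta = beta + step
        if np.max(np.abs(step)) < 1e-12:
            break
    return beta

def fit_group(U, y_c, O1, Q, gamma, b, n_iter=25):
    """Penalised Poisson fit of (gamma, b) on aggregated data (Step 1), Q = D^- fixed."""
    M, q = U.shape
    C = np.hstack([U, np.eye(M)])               # design for (gamma, b)
    P = np.zeros((q + M, q + M))
    P[q:, q:] = Q                               # the penalty acts on b only
    delta = np.concatenate([gamma, b])
    for _ in range(n_iter):
        mu_c = np.exp(O1 + C @ delta)
        grad = C.T @ (y_c - mu_c) - P @ delta
        step = np.linalg.solve((C.T * mu_c) @ C + P, grad)
        delta = delta + step
        if np.max(np.abs(step)) < 1e-12:
            break
    return delta[:q], delta[q:]

def indicar_fixed_theta(y, X, region, U, Q, max_iter=200, tol=1e-9):
    M, q = U.shape
    y_c = np.bincount(region, weights=y, minlength=M)               # Y_c = Z^T Y
    beta = fit_beta(X, y, np.zeros(len(y)), np.zeros(X.shape[1]))   # Step 0
    gamma, b = np.zeros(q), np.zeros(M)
    for _ in range(max_iter):
        O1 = np.log(np.bincount(region, weights=np.exp(X @ beta), minlength=M))
        gamma, b = fit_group(U, y_c, O1, Q, gamma, b)               # Step 1
        O2 = (U @ gamma + b)[region]
        beta_new = fit_beta(X, y, O2, beta)                         # Step 2
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if converged:
            break
    return beta, gamma, b

# Hypothetical data on a 5 x 5 lattice with lambda = 0.5 and sigma^2 = 0.16.
rng = np.random.default_rng(2016)
side = 5
M = side * side
rows, cols = np.divmod(np.arange(M), side)
A = (np.abs(rows[:, None] - rows) + np.abs(cols[:, None] - cols) == 1).astype(float)
R = np.diag(A.sum(axis=1)) - A
Q = (0.5 * np.eye(M) + 0.5 * R) / 0.16
region = np.repeat(np.arange(M), rng.integers(30, 61, size=M))
n = region.size
X = np.column_stack([rng.integers(0, 2, size=n) - 0.5, rng.normal(size=n)])
U = np.column_stack([np.ones(M), rng.normal(size=M)])
b_true = np.linalg.solve(np.linalg.cholesky(Q).T, rng.normal(size=M))
eta = X @ np.array([0.3, 0.2]) + (U @ np.array([-1.0, 0.5]) + b_true)[region]
y = rng.poisson(np.exp(eta))

beta_hat, gamma_hat, b_hat = indicar_fixed_theta(y, X, region, U, Q)
```

At convergence the fitted values satisfy all three score equations (4–6) simultaneously, which is the sense in which the backfitted solution matches a joint IRLS fit for the same fixed \({\varvec{\theta }}\).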
Estimation of standard error
The approximate standard error estimates for \({\widehat{{\varvec{\gamma }}}}\) and \({\widehat{\varvec{\beta}}}\) in steps 1 and 2 assume fixed \({{\varvec{\beta}}}\) and fixed \({{\varvec{\gamma }}}\), respectively. Therefore, we recalculate the standard errors of these regression coefficients to account for the variability of the estimated \({\widehat{\varvec{\beta}}}\) and \({\widehat{\varvec{\gamma}}}\). This can be done via the IRLS estimation of score equations (4–6). The IRLS estimation requires us to define a working dependent variable and a weight matrix that are updated at each iteration and solved via Fisher scoring [13].
Let the GLM adjusted dependent variable, \({\varvec{Y}}_{pseudo}\), be
\[ {\varvec{Y}}_{pseudo}={\varvec{X}}{\varvec{\beta }}+{\varvec{Z}}({\varvec{U}}{\varvec{\gamma }}+{\varvec{b}})+{\varvec{W}}^{-1}({\varvec{Y}}-{\varvec{\mu }}) \qquad (7) \]
where \({\varvec{W}}\) is an \(n\times n\) diagonal matrix with diagonal elements \({\varvec{\mu }}\). Harville [16] and Robinson [17] showed that the Fisher scoring corresponding to the score equations (4–6) and GLM dependent variable as in (7) is identical to the normal equations of the best linear unbiased predictors (BLUPs) of \({\varvec{\beta}}\), \({\varvec{\gamma }}\) and \({\varvec{\theta }}\) corresponding to the following linear mixed model
\[ {\varvec{Y}}_{pseudo}={\varvec{X}}{\varvec{\beta }}+{\varvec{Z}}{\varvec{U}}{\varvec{\gamma }}+{\varvec{Z}}{\varvec{b}}+\varvec{\epsilon }_{pseudo} \]
where the pseudoerror \(\varvec{\epsilon }_{pseudo}\sim N(0,{\varvec{W}}^{-1})\). Following [17], the estimated regression coefficients for the fixed effects, \(({\varvec{\beta}}, {\varvec{\gamma }})\), and the BLUP estimate for the random effect \({\varvec{b}}\) can be obtained as
\[ \left( {\begin{array}{c} {\widehat{\varvec{\beta }}}\\ {\widehat{\varvec{\gamma }}} \end{array}}\right) =({\varvec{C}}^{\mathrm{T}}{\varvec{V}}^{-1}{\varvec{C}})^{-1}{\varvec{C}}^{\mathrm{T}}{\varvec{V}}^{-1}{\varvec{Y}}_{pseudo}, \qquad {\widehat{\varvec{b}}}={\varvec{D}}{\varvec{Z}}^{\mathrm{T}}{\varvec{V}}^{-1}\left\{ {\varvec{Y}}_{pseudo}-{\varvec{C}}\left( {\begin{array}{c} {\widehat{\varvec{\beta }}}\\ {\widehat{\varvec{\gamma }}} \end{array}}\right) \right\} \]
where \({\varvec{C}}=[{\varvec{X}},\ {\varvec{Z}}{\varvec{U}}]\) and \({\varvec{V}}={\varvec{Z}}{\varvec{D}}{\varvec{Z}}^{\mathrm{T}}+{\varvec{W}}^{-1}\) is the variance of the pseudoresponse \({\varvec{Y}}_{pseudo}\). Thus, the variance–covariance matrix for the fixed effects \(({\widehat{\varvec{\beta}}}, {\widehat{\varvec{\gamma}}})\) can be estimated by
\[ ({\varvec{C}}^{\mathrm{T}}{\varvec{V}}^{-1}{\varvec{C}})^{-1} \]
Note that Eq. (9) suggests that estimates of the regression coefficients and variance components can be obtained using the Leroux et al. [7] model with appropriate specification of the design matrix (\({\varvec{Z}}\)) associated with spatial random effect (1). Indeed, a backfitting approach such as indiCAR will be effective in situations where memory constraints may prohibit fitting a single model consisting of all individual and group level covariates. A useful feature of our indiCAR method is that we can calculate the above standard error in a distributed computing framework. This is because \({\varvec{V}}^{-1}\) can be expressed as \({\varvec{W}}-{\varvec{W}}{\varvec{Z}}{\varvec{D}}(\varvec{I} + {\varvec{Z}}^{\mathrm{T}}{\varvec{W}}{\varvec{Z}}{\varvec{D}})^{-1}{\varvec{Z}}^{\mathrm{T}}{\varvec{W}}\) [18]. Therefore, the above variance–covariance matrix can be written as
\[ \left( {\begin{array}{cc} a_{11} & a_{12}\\ a_{21} & a_{22} \end{array}}\right) ^{-1} \]
where \(a_{11}={\varvec{X}}^{\mathrm{T}}{\varvec{W}}{\varvec{X}}-{\varvec{X}}^{\mathrm{T}}{\varvec{W}}{\varvec{Z}}{\varvec{D}}(\varvec{I}+{\varvec{Z}}^{\mathrm{T}}{\varvec{W}}{\varvec{Z}}{\varvec{D}})^{-1}{\varvec{Z}}^{\mathrm{T}}{\varvec{W}}{\varvec{X}}\), \(a_{12}={\varvec{X}}^{\mathrm{T}}{\varvec{W}}{\varvec{Z}}{\varvec{U}}-{\varvec{X}}^{\mathrm{T}}{\varvec{W}}{\varvec{Z}}{\varvec{D}}(\varvec{I}+{\varvec{Z}}^{\mathrm{T}}{\varvec{W}}{\varvec{Z}}{\varvec{D}})^{-1}{\varvec{Z}}^{\mathrm{T}}{\varvec{W}}{\varvec{Z}}{\varvec{U}}\), \(a_{21}= a_{12}^{\mathrm{T}}\), and \(a_{22}={\varvec{U}}^{\mathrm{T}}{\varvec{Z}}^{\mathrm{T}}{\varvec{W}}{\varvec{Z}}{\varvec{U}}-{\varvec{U}}^{\mathrm{T}}{\varvec{Z}}^{\mathrm{T}}{\varvec{W}}{\varvec{Z}}{\varvec{D}}(\varvec{I}+{\varvec{Z}}^{\mathrm{T}}{\varvec{W}}{\varvec{Z}}{\varvec{D}})^{-1}{\varvec{Z}}^{\mathrm{T}}{\varvec{W}}{\varvec{Z}}{\varvec{U}}\). Among the various components of the above variance–covariance matrix, \({\varvec{X}}^{\mathrm{T}}{\varvec{W}}{\varvec{X}}\) and \({\varvec{X}}^{\mathrm{T}}{\varvec{W}}{\varvec{Z}}\) are the only terms involving individual level data, and the rest of the terms involve a lower dimension corresponding to the group level data. These components are therefore straightforward to calculate. Hence, upon convergence, calculation of the variance–covariance matrix is also carried out in a distributed computing framework for individual and group-level data separately.
The covariance matrix for \({\widehat{\varvec{b}}}\) was obtained from the Fisher information matrix from step 2 in the usual way, assuming that parameters for the individual and area specific covariates are fixed. Of course there is additional variability due to the fact that the individual and area specific covariate parameters are estimated. However, following Breslow and Clayton [13] we ignore this additional variability when making inference about the parameters which characterise the random effect distribution, \({\widehat{{\varvec{\theta }}}}\). The detailed procedure is given in Appendix 1.
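The matrix identity used for \({\varvec{V}}^{-1}\) can be checked numerically. The sketch below, in Python with NumPy and entirely hypothetical toy inputs, compares the direct \(n\times n\) inverse with the form that only requires solving an \(M\times M\) system, and verifies that the block \(a_{11}\) equals \({\varvec{X}}^{\mathrm{T}}{\varvec{V}}^{-1}{\varvec{X}}\).

```python
import numpy as np

rng = np.random.default_rng(3)
M, n = 4, 12
region = np.repeat(np.arange(M), 3)           # three individuals per region
Z = np.zeros((n, M))
Z[np.arange(n), region] = 1.0

mu = rng.uniform(0.2, 2.0, size=n)            # fitted means, diagonal of W
W = np.diag(mu)
B = rng.normal(size=(M, M))
D = B @ B.T + np.eye(M)                       # arbitrary positive definite covariance
X = rng.normal(size=(n, 2))

# Direct route: build and invert the n x n matrix V = Z D Z^T + W^{-1}.
V = Z @ D @ Z.T + np.diag(1.0 / mu)
V_inv_direct = np.linalg.inv(V)

# Low-dimensional route: only an M x M system is solved.
ZtW = Z.T * mu                                                 # Z^T W, shape M x n
ZtWZ = np.diag(np.bincount(region, weights=mu, minlength=M))   # Z^T W Z, M x M
V_inv_small = W - ZtW.T @ D @ np.linalg.solve(np.eye(M) + ZtWZ @ D, ZtW)

# Block a_11 computed without ever forming V.
XtWZ = X.T @ ZtW.T                                             # X^T W Z, shape p x M
a11 = (X.T * mu) @ X - XtWZ @ D @ np.linalg.solve(np.eye(M) + ZtWZ @ D, XtWZ.T)
```

Only `(X.T * mu) @ X` and `XtWZ` touch the individual records, and both are sums over individuals, so they could be accumulated in chunks.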
In the next section we describe a simulation study to evaluate the performance of our method.
Simulation studies
To evaluate our proposed method we design a simulation study involving 400 regions in a \(20 \times 20\) square lattice grid with varying sample sizes. Specifically, we consider cases with (1) 10–1000 and (2) 10–50 subjects in each area. We declare two regions to be neighbours if they share a common border. The random effects are generated from a multivariate normal distribution with mean 0 and covariance matrix \({\varvec{D}}=[\sigma ^{-2}\left\{ (1-{\varvec{\lambda }})\varvec{I}+{\varvec{\lambda }}{\varvec{R}}\right\} ]^{-1}\). The value of \(\sigma\) is set to 0.4 and five different values of the spatial dependence parameter, \({\varvec{\lambda }}= \{0, 0.25,0.50, 0.75, 0.99\}\), are considered in order to represent different strengths of spatial correlation. We then generate three individual level covariates (one binary, one categorical and one continuous) and one group level covariate. The binary covariate represents the distribution of sex in the area and is generated from a Bernoulli distribution with probability ranging from 0.45 to 0.55 across groups. The categorical variable with six categories is generated to represent the age distribution of the neutropenia admissions data with prespecified probabilities (similar to the neutropenia admissions data). The continuous individual level variable is generated as Uniform (0.2, 1). The group level covariate is generated from a standard normal distribution. The outcome variable is then generated using model (1). The full list of the parameters used to generate data is given in Table 1. The binary and the categorical individual level variables help us to compare our simulation results for the indiCAR with the age-sex adjusted Leroux et al. [7] approach.
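A data set of this kind could be generated along the following lines. This is an illustrative Python/NumPy sketch of scenario (2) with \(\lambda =0.75\); the age category probabilities and all regression coefficients are hypothetical placeholders, since the values actually used are those listed in Table 1.

```python
import numpy as np

rng = np.random.default_rng(15)
side, sigma, lam = 20, 0.4, 0.75
M = side * side

# Neighbourhood structure: regions sharing a border on the 20 x 20 lattice.
rows, cols = np.divmod(np.arange(M), side)
A = (np.abs(rows[:, None] - rows) + np.abs(cols[:, None] - cols) == 1).astype(float)
R = np.diag(A.sum(axis=1)) - A

# Random effects b ~ N(0, D) with D^{-1} = Q = sigma^{-2} {(1 - lam) I + lam R}.
Q = ((1.0 - lam) * np.eye(M) + lam * R) / sigma**2
L = np.linalg.cholesky(Q)                     # Q = L L^T
b = np.linalg.solve(L.T, rng.normal(size=M))  # cov(b) = (L L^T)^{-1} = D

# Scenario (2): between 10 and 50 subjects per area.
n_j = rng.integers(10, 51, size=M)
region = np.repeat(np.arange(M), n_j)
n = region.size

# Individual level covariates.
p_male = rng.uniform(0.45, 0.55, size=M)      # area specific Bernoulli probability
sex = rng.binomial(1, p_male[region])
age_probs = np.array([0.05, 0.10, 0.15, 0.25, 0.25, 0.20])   # hypothetical
age = rng.choice(6, size=n, p=age_probs)
x_cont = rng.uniform(0.2, 1.0, size=n)

# Group level covariate.
u = rng.normal(size=M)

# Design matrix: intercept, sex, five age dummies, continuous covariate.
X = np.column_stack([np.ones(n), sex, np.eye(6)[age][:, 1:], x_cont])
beta = np.array([-1.0, -0.2, 0.1, 0.2, 0.3, 0.4, 0.5, 0.3])  # hypothetical
gamma = 0.2                                                   # hypothetical
y = rng.poisson(np.exp(X @ beta + (gamma * u + b)[region]))   # model (1)
```

Sampling through the Cholesky factor of the precision matrix avoids forming \({\varvec{D}}\) explicitly.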
Results and discussion
In this section we discuss our results obtained from the simulation study and present an application to the neutropenia admissions data. We compare the results obtained by indiCAR to those from the existing Leroux et al. [7] method. When applying indiCAR to the simulated data, we adjust for all individual and areal covariates. However, in the existing Leroux et al. [7] method we were only able to incorporate the binary and categorical variable by calculating offsets based on direct standardization of these covariates.
Simulation results
Table 1 displays the average estimated regression coefficients along with their estimated standard errors for the indiCAR and Leroux et al. [7] methods, from 1000 simulation runs under simulation scenario (1). We estimated two different standard errors of the estimated regression coefficients: (1) empirical standard errors, i.e., the standard deviation of the 1000 simulated regression coefficient estimates, and (2) the average of the model based standard errors. The first column of Table 1 specifies the spatial dependence parameter used in that particular simulation. The next eight columns list the estimated regression coefficients for the individual level covariates using the indiCAR method. The 10th, 11th and 12th columns list the estimated group level regression coefficients, the estimated overdispersion parameters and estimated spatial dependence parameters for the spatial random effect using the indiCAR method. The last three columns list the estimated regression coefficients for the group specific covariate and estimated overdispersion and spatial dependence parameters using the Leroux et al. [7] method. The Leroux et al. [7] method adjusts only for the binary and categorical variables.
As expected, the indiCAR method provides reliable estimates of the individual level and region specific regression parameters and the parameters in the spatial random effect. Although the Leroux et al. [7] method provides similarly reliable estimates of the true region-specific regression parameters, the random effect parameters are slightly biased.
To evaluate the performance of the proposed method under small sample settings, we also conducted simulations with only 10–50 subjects per region as outlined in simulation scenario (2). These results are given in Table 2. As indicated in the table, the proposed method performs very well in this setting, providing reliable estimates of all the parameters. In contrast, the Leroux et al. [7] method provides slightly less efficient estimates of the spatial dependence parameters.
Following reviewer suggestions, we also compared the indiCAR method with three other methods: a group specific random intercept model fitted using (1) the lme4 [19] and (2) the hglm [20] packages in R, and (3) a CAR model implemented using the hlmer function in the hglm R package. The methods were compared in terms of both approximate conditional AIC [21] and computation time. The results are given in Tables 3 and 4. In Table 3, the data were generated with \(\lambda =0\), which means that a random intercept only model is appropriate. In Table 4, the data were generated with \(\lambda =0.75\), which means that a CAR component is necessary for an accurate model fit. Note that the conditional AIC values are approximate as these are calculated ignoring the constant term in the log-likelihood. The hlmer approach is faster when block effects are represented by a random intercept, but is slower for a conditional autoregressive random effect specification. The fitting of hlmer with such a random effect specification is not even feasible for large sample sizes on our standard desktop computer due to large memory requirements. In addition, we note that another R package, sdep, has similar feasibility issues when fitting a conditional autoregressive random effect model for large datasets [22]. Our proposed indiCAR method in general provides lower conditional AIC compared to the other models considered here and is faster than the hlmer approach when using a CAR random effect specification, as we do in our application.
Application to the neutropenia data
We applied our methodology to the data on neutropenia admissions. One of the key objectives of this analysis is to assess the geographical variation of neutropenia admission rates and its association with area level measures of socioeconomic status. Data also includes patient age, gender, year of diagnosis, ARIA, cancer type at diagnosis, number of major comorbidities and geographic location reported via postcode of residence.
Table 5 shows the descriptive statistics for cancer patients treated between years 2001 and 2009 in New South Wales, Australia. The proportion of neutropenia admissions decreases gradually with increasing age (9.2 % for 20–30 years of age to 1.7 % for 80+ years of age). Overall, the rates are similar (≈5 %) across the years 2001–2008 but are considerably lower (3.0 %) in the year 2009. This is likely due to the fact that the data are date limited for those patients diagnosed with cancer and treated with chemotherapy in 2009. Cancer treatment often has a long duration, and subsequent neutropenia admissions may have happened beyond the study period. The proportion of neutropenia is highest (4.9 %) in the major cities followed by inner regional Australia (3.9 %). Among the various types of cancer, the highest proportion of neutropenia admissions is observed for haematological malignant cancer patients (25.0 %) followed by lung (6.2 %) and breast cancer (5.3 %). The proportion of neutropenia admissions is very similar across the various SEIFA index categories.
Table 6 reports the multivariable analysis of neutropenia admissions data using both the indiCAR approach and the Leroux et al. [7] method based on age-sex adjustments. We calculate age-sex adjusted standardized incidence ratios (SIR) by dividing the observed number of neutropenia admissions by the age-sex adjusted expected number of neutropenia admissions [23]. Our results reveal significantly lower rates of neutropenia for patients with higher age, male gender, residence in an outer regional or remote area and higher socioeconomic status. The estimated overdispersion \((\sigma )\) and spatial dependence parameters \((\lambda )\) with indiCAR are 0.204 and 0.992, respectively, compared to 0.210 and 0.989 for the Leroux et al. [7] method. This means that both models identified a very strong spatial correlation in the neutropenia risk.
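As a worked illustration of an age-sex adjusted SIR, the sketch below uses Python with NumPy and made-up counts for three areas and two strata. It assumes internal standardisation, in which the reference stratum rates are pooled over all areas; this is one common convention and is not necessarily the exact procedure of [23].

```python
import numpy as np

# Hypothetical counts: rows are areas, columns are age-sex strata.
cases = np.array([[4, 6],
                  [1, 9],
                  [5, 5]])
patients = np.array([[100, 50],
                     [80, 120],
                     [60, 40]])

# Reference rate per stratum, pooled over all areas.
ref_rate = cases.sum(axis=0) / patients.sum(axis=0)

# Expected admissions per area if each stratum experienced the reference rate.
expected = patients @ ref_rate
observed = cases.sum(axis=1)

# Standardized incidence ratio: observed divided by expected.
sir = observed / expected
```

An SIR above 1 indicates more admissions than expected given the area's age-sex composition. In the offset based approach, `np.log(expected)` would play the role of the model offset.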
Although advanced age has been identified as a significant predictor for neutropenia admissions in previous studies [24], we observed a lower risk of neutropenia admissions associated with increasing age. This might be due to the fact that the current guidelines for prophylactic administration of colony stimulating factor (CSF) already account for age [25]. CSF is an effective treatment strategy to reduce neutropenia.
The relationships of average neutropenia rates with ARIA and with SEIFA are in opposite directions, which is counterintuitive because remote areas in NSW are mostly associated with disadvantaged SEIFA categories. However, the observed contrast in estimated regression coefficients might be due to differences in health care practices. Patients in remote areas are likely to be geographically distant from the treating medical oncologist and hence managed by their primary care physicians. Consequently, these patients may be treated with lower doses of chemotherapy [26]. In contrast, patients in the major cities might receive intensive and aggressive chemotherapy, and are better managed due to the availability of resources. Previous studies also indicate that remoteness has a great effect on the quality of cancer treatment [27] and that it affects treatment choices made by both patients and clinicians [28].
Figure 1 shows the SIR of neutropenia admissions in NSW. Six postal areas in NSW had an estimated SIR >3, as shown in the map. Figure 2 shows that neutropenia rates across NSW exhibit very high spatial dependence. The white region in the map of NSW is the Australian Capital Territory (ACT), which is a distinct territory not included in our dataset. Two other Australian states, Queensland (QLD) and Victoria (VIC), are located to the north-east and south-west of NSW, respectively. The strong spatial correlation after adjusting for individual and group specific covariates indicates that geographical variation of neutropenia might be due to differences in health care practices or access to care across NSW. Further investigation at the hospital level would be needed to provide a comprehensive explanation of these findings. In some cases, a lower spatial random effect might be the result of low numbers of cancer patients being recruited in our study due to a border effect (i.e., patients being admitted for neutropenia in other states or territories: ACT, Victoria or Queensland) or due to areas being dominated by private cancer facilities.
Variation in clinical practice for the management of neutropenia in Australia has been identified in a previous survey [29]. The authors showed that the treatment approach for management of neutropenia varies across oncologists, hematologists and clinicians as well as different sectors of cancer care. Therefore, it might be interesting to explore whether the observed variation is due to variation across different hospitals (for example, metropolitan vs. non-metropolitan hospitals) in NSW or across various healthcare providers. However, relevant data for such an analysis are not collected in the registry, and further exploration is beyond the scope of the present paper.
Our study was based on data linked from a state-based cancer registry and administrative data from the APDC. An advantage of such linked data is that it provides us with a large, population based sample. Registry based analysis is more comprehensive than that based on single centre studies, and provides more complete information than may be obtained from clinical trials, where patient selection and loss to follow-up may affect the validity and generalizability of study findings. However, it is important to keep in mind that the resulting data quality may be inferior to that obtained from prospective studies.
We should note that in some cases, separate admissions for the same individual may be correlated, and thus the Poisson assumptions for the number of admissions may not be appropriate. In such cases, one could fit a subject specific random effect model to the individual level data rather than a generalized linear model [30]. In our application, we do not have such issues, because neutropenia is a very rare event and we do not have any cases with recurrent neutropenia admissions. Therefore, it is suitable to use a Poisson approximation to the Binomial distribution for our dataset.
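The rare-event argument above can be checked numerically. The sketch below compares the Binomial probability mass function with its Poisson approximation (mean \(np\)) for a rare and a common event. The sample size and probabilities are illustrative assumptions, not values from the neutropenia data.

```python
import math


def binom_pmf(k, n, p):
    """Binomial probability of k events in n trials with event probability p."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def poisson_pmf(k, mu):
    """Poisson probability of k events with mean mu."""
    return math.exp(-mu) * mu**k / math.factorial(k)


def max_pmf_gap(n, p):
    """Largest pointwise difference between Binomial(n, p) and Poisson(n * p)."""
    return max(abs(binom_pmf(k, n, p) - poisson_pmf(k, n * p)) for k in range(n + 1))


# Illustrative values only: 50 patients, a rare event versus a common event.
rare_gap = max_pmf_gap(50, 0.004)
common_gap = max_pmf_gap(50, 0.4)
print(f"rare event gap:   {rare_gap:.5f}")
print(f"common event gap: {common_gap:.5f}")
```

The gap is far smaller in the rare-event setting, which is the regime in which the Poisson working model is a reasonable stand-in for the Binomial.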
In our simulations, the estimation of the intercept \(\beta _0\) is biased. This is consistent with the observation of Hodges and Reich [31] that an intercept is poorly identified in a model that includes a spatial random effect. The authors further argued that adding spatially correlated errors can attenuate the fixed effect estimates. However, they only considered one observation per areal unit rather than cases with replicated data such as that in our application. There may be other explanations for the attenuation; for example, Huque et al. [32] argued that such attenuation is likely due to covariate measurement error.
Despite various limitations, indiCAR is a useful addition to the existing methodology for exploring clinical variation across geographical locations. One of the major advantages of our proposed method is the ability to analyze age as a continuous variable rather than grouping it using an arbitrary cutoff. The results of such an analysis are given in Appendix 2, though they are very similar to those using age groups. However, in many applications age grouping might induce residual confounding and result in spurious relationships between age and the outcome variable [33]. In our simulation study, we evaluate our proposed method for a continuous area level covariate; however, the SEIFA index is difficult to interpret as a continuous variable. Therefore, to ease interpretation we treated SEIFA as a categorical variable. We also conducted an analysis of the neutropenia admissions data using the continuous SEIFA index. The results are quite similar and indicate a significant negative relationship between high SEIFA score and neutropenia admissions (result not shown in table).
Conclusions
In this paper we propose a novel method for incorporating individual level covariate information in disease mapping studies. As indicated in our simulation studies, our proposed method yields reliable estimates of individual and area level covariate effects. Our proposed method also has potential for Big Data implementations due to the natural applicability of indiCAR in a distributed computing framework. This could speed up the fitting process and reduce computational costs. Furthermore, indiCAR also provides a framework for fitting correlated Big Data using recently developed statistical methodology for uncorrelated Big Data [34, 35]. Cancer registries routinely collect individual level cancer information and thus could benefit by using our proposed method to incorporate individual level information in the analysis and mapping of disease rates.
Abbreviations
 CAR: Conditional AutoRegressive
 indiCAR: individual level covariate adjusted conditional autoregressive model
 SEIFA: SocioEconomic Index For Areas
 ARIA: Accessibility/Remoteness Index of Australia
 NSW: New South Wales
References
 1.
Elliott P, Wartenberg D. Spatial epidemiology: current approaches and future challenges. Environ Health Perspect. 2004;112(9):998–1006.
 2.
Snow J. On the mode of communication of cholera. London: John Churchill; 1855.
 3.
Clayton D, Kaldor J. Empirical Bayes estimates of age-standardized relative risks for use in disease mapping. Biometrics. 1987;43(3):671–81.
 4.
Besag J, York J, Mollié A. Bayesian image restoration, with two applications in spatial statistics (with discussion). Ann Inst Stat Math. 1991;43(1):1–20.
 5.
Cressie N. Statistics for spatial data. Wiley series in probability and mathematical statistics: applied probability and statistics. New York: Wiley; 1993 (revised edition).
 6.
Lee D. A comparison of conditional autoregressive models used in Bayesian disease mapping. Spat Spatio-temporal Epidemiol. 2011;2(2):79–89.
 7.
Leroux BG, Lei X, Breslow N. Estimation of disease rates in small areas: a new mixed model for spatial dependence. In: Halloran ME, Berry D, editors. Statistical models in epidemiology, the environment, and clinical trials. New York: Springer; 1999. p. 179–91.
 8.
Jackson C, Best N, Richardson S. Improving ecological inference using individual-level data. Stat Med. 2006;25(12):2136–59.
 9.
Haneuse S, Bartell S. Designs for the combination of group- and individual-level data. Epidemiology. 2011;22(3):382–9.
 10.
Cameron D. Management of chemotherapy-associated febrile neutropenia. Br J Cancer. 2009;101(Suppl 1):S18–22.
 11.
Australian Bureau of Statistics. Regional population growth, Australia, 2013–14 (cat. no. 3218.0). Canberra: Australian Bureau of Statistics; 2015.
 12.
Earnest A, Morgan G, Mengersen K, Ryan L, Summerhayes R, Beard J. Evaluating the effect of neighbourhood weight matrices on smoothing properties of conditional autoregressive (CAR) models. Int J Health Geogr. 2007;6(1):54–65.
 13.
Breslow N, Clayton D. Approximate inference in generalized linear mixed models. J Am Stat Assoc. 1993;88(421):9–25.
 14.
Wand M. Vector differential calculus in statistics. Am Stat. 2002;56(1):55–62.
 15.
R Core Team. R: a language and environment for statistical computing. Vienna: R Foundation for Statistical Computing. 2013. http://www.R-project.org/.
 16.
Harville DA. Maximum likelihood approaches to variance component estimation and to related problems. J Am Stat Assoc. 1977;72(358):320–38.
 17.
Robinson GK. That BLUP is a good thing: the estimation of random effects. Stat Sci. 1991;6(1):15–32.
 18.
Henderson HV, Searle SR. On deriving the inverse of a sum of matrices. SIAM Rev. 1981;23(1):53–60.
 19.
Bates D, Maechler M, Bolker B, Walker S. Fitting linear mixed-effects models using lme4. J Stat Softw. 2015;67(1):1–48. doi:10.18637/jss.v067.i01.
 20.
Rönnegård L, Shen X, Alam M. hglm: a package for fitting hierarchical generalized linear models. R J. 2010;2(2):20–8.
 21.
Vaida F, Blanchard S. Conditional Akaike information for mixed-effects models. Biometrika. 2005;92(2):351–70.
 22.
Bivand R, Bernat A, Carvalho M, Chun Y, Dormann C, Dray S, Halbersma R, Lewin-Koh N, Ma J, Millo G, et al. The spdep package. Comprehensive R archive network, version, 0583. 2005. https://r-forge.r-project.org/projects/spdep/. Accessed 23 May 2016.
 23.
Breslow N, Day N. Statistical methods in cancer research. In: Davis W, editor. The design and analysis of cohort studies, vol II. New York: Oxford University Press; 1987.
 24.
Klastersky J, Paesmans M, Rubenstein EB, Boyer M, Elting L, Feld R, Gallagher J, Herrstedt J, Rapoport B, Rolston K, et al. The multinational association for supportive care in cancer risk index: a multinational scoring system for identifying low-risk febrile neutropenic cancer patients. J Clin Oncol. 2000;18(16):3038–51.
 25.
Aapro M, Bohlius J, Cameron D, Dal Lago L, Donnelly JP, Kearney N, Lyman G, Pettengell R, Tjan-Heijnen V, Walewski J, et al. 2010 update of EORTC guidelines for the use of granulocyte-colony stimulating factor to reduce the incidence of chemotherapy-induced febrile neutropenia in adult patients with lymphoproliferative disorders and solid tumours. Eur J Cancer. 2011;47(1):8–32.
 26.
Fox P, Boyce A. Cancer health inequality persists in regional and remote Australia. Med J Aust. 2014;201(8):445–6.
 27.
Jong KE, Smith DP, Xue QY, O’Connell DL, Goldstein D, Armstrong BK. Remoteness of residence and survival from cancer in New South Wales. Med J Aust. 2004;180(12):618–22.
 28.
Nattinger AB, Kneusel RT, Hoffmann RG, Gilligan MA. Relationship of distance from a radiotherapy facility and initial breast cancer treatment. J Natl Cancer Inst. 2001;93(17):1344–6.
 29.
Lingaratnam S, Slavin M, Mileshkin L, Solomon B, Burbury K, Seymour J, Sharma R, Koczwara B, Kirsa S, Davis I, et al. An Australian survey of clinical practices in management of neutropenic fever in adult cancer patients 2009. Intern Med J. 2011;41(1b):110–20.
 30.
Gibbons RD, Hedeker D, DuToit S. Advances in analysis of longitudinal data. Annu Rev Clin Psychol. 2010;6:79.
 31.
Hodges JS, Reich BJ. Adding spatially-correlated errors can mess up the fixed effect you love. Am Stat. 2010;64(4):325–34.
 32.
Huque MH, Bondell HD, Ryan LM. On the impact of covariate measurement error on spatial regression modelling. Environmetrics. 2014;25(8):560–70. doi:10.1002/env.2305.
 33.
Rothman K, Greenland S, Lash T. Modern epidemiology. 3rd ed. Philadelphia: Lippincott, Williams & Wilkins; 2008.
 34.
Lumley T. biglm: bounded memory linear and generalized linear models. R package version 0.8. 2011. https://cran.r-project.org/web/packages/biglm/. Accessed 20 Mar 2016.
 35.
Enea M. speedglm: fitting linear and generalized linear models to large data sets. R package version 0.1. 2012. https://cran.r-project.org/web/packages/speedglm/. Accessed 20 Mar 2016.
Authors' contributions
MHH, CA, RW and LR contributed to the study design; MHH executed the analysis and drafted the manuscript. CA, RW and LR contributed to the interpretation of the results. All authors read and approved the final manuscript.
Acknowledgements
The authors thank the Cancer Institute NSW and the Ministry of Health for making the data available. We gratefully acknowledge the helpful suggestions made by the referees, which have improved the motivation for and content of this paper.
Competing interests
The authors declare that they have no competing interests.
Ethics approval
Ethical approval for this project was received from NSW Population and Health Services Research Ethics Committee (HREC/12/CIPHS/58).
Funding
MHH, CA and LR were supported by the University of Technology Sydney and by the ARC Centre of Excellence for Mathematical and Statistical Frontiers (ACEMS). RW was supported by the Cancer Institute NSW.
Appendices
Appendix 1: Implementation of PQL in step 1
The PQL estimation procedure is an iterative approach: at each iteration one defines a working dependent variable and a weight matrix, which are then updated and solved via Fisher scoring [7, 13]. The detailed procedure has been illustrated elsewhere [7, 13].
The GLM adjusted dependent variable (\({\varvec{Y}}_{cpseudo}\)) at group level is calculated as
where \(\eta _c=g(\mu _c)=\text {O}_1+{\varvec{U}}{\varvec{\gamma }}+{\varvec{b}}\) and \(\text {O}_1=\log \{{\varvec{Z}}^{\mathrm{T}}\exp ({\varvec{X}}{\widehat{\varvec{\beta}}})\}\) is an offset vector with dimension \(M\). The Poisson link \((g(\mu _c)=\log \mu _c)\) and variance function \(V(\mu _c)=\mu _c\) are used. The covariance matrix of \({\varvec{Y}}_{cpseudo}\) is then approximated by
where \({\hat{D}}\) is the covariance matrix of the random effects, \({\varvec{b}}\), evaluated at the current estimate for the variance parameters and \({\hat{W}}_c\) is the \(M\times M\) diagonal matrix with diagonal elements \({\hat{\mu }}_c\). Updated estimates of the fixed effect vector \({\varvec{\gamma }}\) and random effect vector \({\varvec{b}}\) can then be obtained from the solution of the following mixed model equations:
and
The updated estimates of the variance parameters, \(\lambda\) and \(\sigma\) are obtained by a Newton–Raphson iterative procedure as follows:
where \({\varvec{S}}\) is the score vector and \({\varvec{I}}\) is the expected information matrix based on the REML likelihood for \({\varvec{Y}}_{cpseudo}\). Letting \({\varvec{\theta }}=(\theta _1, \theta _2)=(\sigma ,\lambda )\), the expressions for the elements of the score vector and information matrix are given by
and
where \(P= V_c^{-1}-V_c^{-1}{\varvec{U}}({\varvec{U}}^{\mathrm{T}}V_c^{-1}{\varvec{U}})^{-1}{\varvec{U}}^{\mathrm{T}}V_c^{-1}\). The derivatives of \(V_c\) with respect to \(\sigma\) and \(\lambda\) are given below:
where \({\varvec{R}}_{\lambda }=(1-\lambda )\varvec{I}+\lambda {\varvec{R}}\) and \({\varvec{R}}\) is the intrinsic autoregression matrix.
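To make the role of \({\varvec{R}}_{\lambda }\) concrete, the sketch below builds it for a toy map. It assumes the usual form of the intrinsic autoregression matrix in the Leroux et al. [7] model (number of neighbours on the diagonal, −1 for each pair of adjacent areas, 0 otherwise); this appendix does not restate that definition, so treat it as an assumption. The function names and the three-area neighbour list are hypothetical.

```python
def intrinsic_car_matrix(neighbours):
    """Assumed form of R: neighbour counts on the diagonal, -1 for adjacent areas."""
    m = len(neighbours)
    R = [[0.0] * m for _ in range(m)]
    for i, nbrs in enumerate(neighbours):
        R[i][i] = float(len(nbrs))
        for j in nbrs:
            R[i][j] = -1.0
    return R


def r_lambda(R, lam):
    """R_lambda = (1 - lambda) * I + lambda * R, computed element by element."""
    m = len(R)
    return [
        [(1.0 - lam) * (1.0 if i == j else 0.0) + lam * R[i][j] for j in range(m)]
        for i in range(m)
    ]


# Hypothetical map: three areas in a row (0 - 1 - 2), given as neighbour lists.
neighbours = [[1], [0, 2], [1]]
R = intrinsic_car_matrix(neighbours)

for lam in (0.0, 0.5, 1.0):
    print(lam, r_lambda(R, lam))
```

At \(\lambda =0\) the matrix reduces to the identity (no spatial structure), and at \(\lambda =1\) it equals \({\varvec{R}}\) itself, which matches the interpretation of \(\lambda\) as a spatial dependence parameter.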
Repeated iteration of Eqs. (11)–(15) allows us to obtain reliable estimates of the region specific fixed effect and random effect parameters. Convergence is achieved when the change in parameter estimates is less than a prespecified tolerance level (<1e−3 in the simulation study reported). Approximate standard errors for \(\lambda\) and \(\sigma\) are obtained from the above information matrix in the usual way.
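For orientation, the quantities described in this appendix can be written out in the standard PQL and REML forms found in the literature [7, 13]. The block below is a sketch in this appendix's notation and is not a transcription of the paper's numbered equations. It assumes that \({\varvec{Y}}_c\) denotes the vector of observed group level counts, that the random effect covariance has the Leroux form \(D=\sigma ^2{\varvec{R}}_{\lambda }^{-1}\), and that \({\hat{W}}_c\) is held fixed when differentiating \(V_c\); all three are assumptions.

```latex
% Working response and its approximate covariance (Poisson, log link)
\begin{align*}
  \boldsymbol{Y}_{cpseudo} &= \hat{\eta}_c - \mathrm{O}_1
      + \hat{W}_c^{-1}\,(\boldsymbol{Y}_c - \hat{\mu}_c), \\
  V_c &= \hat{W}_c^{-1} + \hat{D}.
\end{align*}

% Solutions of the mixed model equations
\begin{align*}
  \hat{\boldsymbol{\gamma}} &= (\boldsymbol{U}^{\mathrm{T}} V_c^{-1} \boldsymbol{U})^{-1}
      \boldsymbol{U}^{\mathrm{T}} V_c^{-1} \boldsymbol{Y}_{cpseudo}, \\
  \hat{\boldsymbol{b}} &= \hat{D}\, V_c^{-1}
      (\boldsymbol{Y}_{cpseudo} - \boldsymbol{U}\hat{\boldsymbol{\gamma}}).
\end{align*}

% Scoring update for theta = (sigma, lambda), with REML score and expected information
\begin{align*}
  \boldsymbol{\theta}^{(k+1)} &= \boldsymbol{\theta}^{(k)} + \boldsymbol{I}^{-1}\boldsymbol{S}, \\
  S_j &= -\tfrac{1}{2}\,\mathrm{tr}\!\left(P\,\frac{\partial V_c}{\partial \theta_j}\right)
      + \tfrac{1}{2}\,\boldsymbol{Y}_{cpseudo}^{\mathrm{T}}\, P\,
        \frac{\partial V_c}{\partial \theta_j}\, P\, \boldsymbol{Y}_{cpseudo}, \\
  I_{jk} &= \tfrac{1}{2}\,\mathrm{tr}\!\left(P\,\frac{\partial V_c}{\partial \theta_j}\,
        P\,\frac{\partial V_c}{\partial \theta_k}\right).
\end{align*}

% Derivatives of V_c under the assumed form D = sigma^2 R_lambda^{-1}
\begin{align*}
  \frac{\partial V_c}{\partial \sigma} &= 2\sigma\, \boldsymbol{R}_{\lambda}^{-1}, &
  \frac{\partial V_c}{\partial \lambda} &= -\sigma^{2}\, \boldsymbol{R}_{\lambda}^{-1}
      (\boldsymbol{R} - \boldsymbol{I})\, \boldsymbol{R}_{\lambda}^{-1}.
\end{align*}
```

The last derivative follows from \(\partial {\varvec{R}}_{\lambda }/\partial \lambda ={\varvec{R}}-\varvec{I}\) together with the identity \(\partial A^{-1}/\partial \lambda =-A^{-1}(\partial A/\partial \lambda )A^{-1}\). If the paper parameterizes \(D\) differently, the two derivative expressions change accordingly, while the remaining forms are unaffected.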
Appendix 2
See Table 7.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Huque, M.H., Anderson, C., Walton, R. et al. Individual level covariate adjusted conditional autoregressive (indiCAR) model for disease mapping. Int J Health Geogr 15, 25 (2016). https://doi.org/10.1186/s12942-016-0055-7
Keywords
 Covariate adjustment
 Disease mapping
 Geographical variation
 Neutropenia
 Spatial model