International Journal of Health Geographics (Open Access)

A multilevel model for cardiovascular disease prevalence in the US and its application to micro area prevalence estimates

Background: Estimates of disease prevalence for small areas are increasingly required for the allocation of health funds according to local need. Both individual level and geographic risk factors are likely to be relevant to explaining prevalence variations, and in turn relevant to the procedure for small area prevalence estimation. Prevalence estimates are of particular importance for major chronic illnesses such as cardiovascular disease.


Background
Estimates of prevalence of disease and health behaviours for different areas are increasingly required for the equitable allocation of health funds according to local need and to target interventions. As stressed by Bazos et al [1] community health need assessments are ideally based on locally disaggregated (i.e. small area) health status and disease prevalence information. To estimate prevalence in different small areas, a commonly adopted approach involves synthetic estimation whereby prevalence rates for demographic subgroups of the population are obtained (e.g. from national health surveys) and an indicative rate then obtained based on the demographic composition of each area. Thus prevalence of most health conditions varies considerably with age, and often also by sex and race: so a synthetic estimate may be obtained by using age, sex and race specific prevalence rates.
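The synthetic estimation approach described above can be sketched in a few lines. This is a minimal illustration only: the subgroup rates, the population counts and the function name `synthetic_prevalence` are hypothetical and are not taken from any survey.

```python
def synthetic_prevalence(subgroup_rates, area_population):
    """Population-weighted average of subgroup prevalence rates for one area."""
    total = sum(area_population.values())
    if total == 0:
        raise ValueError("area has no population")
    # Expected cases: each subgroup's rate applied to its local head count.
    expected_cases = sum(
        subgroup_rates[group] * count for group, count in area_population.items()
    )
    return expected_cases / total


# Hypothetical national rates keyed by (age band, sex); illustrative values only.
rates = {("18-44", "F"): 0.01, ("45-64", "F"): 0.05, ("65+", "F"): 0.15}
# Hypothetical population counts for one small area.
population = {("18-44", "F"): 500, ("45-64", "F"): 300, ("65+", "F"): 200}

estimate = synthetic_prevalence(rates, population)
```

Because the estimate depends only on the demographic mix, two areas with the same age-sex composition receive the same rate whatever their location or poverty level, which is the limitation the multilevel model is intended to address.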
However, synthetic estimates of this kind do not take account of geographic context, exemplified by interactions between demographic risk factors and geographic location, or by independent effects of geographic variables (e.g. area poverty or urbanity-rurality) on prevalence that remain even after taking account of patient level risk factors. By contrast, the multilevel prevalence model for cardiovascular outcomes proposed here as a basis for small area prevalence estimates incorporates the modifying effects of geographic context as well as patient risk factors.
In the US, a number of population health surveys are carried out and provide cumulative evidence on CVD trends and epidemiology. Thus the National Health Interview Survey (NHIS) for 2005 estimated the prevalence of cardiovascular disease (CVD) at 68 million among adults aged 18 years and over in the US, which includes coronary heart disease, hypertension, stroke, angina pectoris or heart attack. The analysis here is concerned with a positive response to one or more of three questions included in the 2005 Behavioral Risk Factor Surveillance System (BRFSS) survey; these questions encompass the different forms of CVD, namely, had the subject ever been told by a health professional that they had experienced a heart attack, or told they had undergone a stroke, or told they had CHD or angina.
The epidemiology of these conditions differs to some degree, for example in terms of male-female differentials in prevalence and incidence [2], in trends through time [3], and in ethnic group differentials. However, for these and related conditions there is evidence for a role of geographic context, in terms of wide geographic disparities by region, state and urbanity [4][5][6][7][8]. In particular, there is evidence of direct effects of area variables after controlling for person level risk factors, and evidence of interactions between place and person variables. For example, Cubbin et al [9] report higher levels of hypertension and diabetes among African American women living in socioeconomically deprived neighborhoods than among African American women from more affluent neighborhoods, after allowing for individual-level socioeconomic status, while Halverson et al [10] report local clustering of excess CVD mortality after controlling for area population composition. As for place-person interactions, Barnett et al [6] and Casper et al [11] report that ethnic disparities in CHD mortality vary by area of residence.
The prevalence model and small area prevalence estimates described here are based on around 336,000 survey responses, and on a regression analysis relating CVD status both to individual level risk factors and to county level measures of poverty and urban-rural status. The analysis further adjusts for differentiation at US state level in the impact of ethnicity on prevalence. Thus adjustment for geographic context is much more comprehensive than is possible using disease status data from the Health Survey for England, where only broad regional identifiers are available; an example is the work of Congdon [12] on CHD prevalence. One goal of the analysis here is to develop prevalence estimates for micro areas, namely 32,000 ZIP Code Tabulation Areas (ZCTAs) for which certain population tabulations are provided by the US Census Bureau [13]. Inclusion in the prevalence model of patient risk categories such as gender and ethnicity (and interactions between them) therefore requires that such categories are available in these tabulations for micro area populations.

Methods
The regression model for prevalence includes person level attributes (age, gender, ethnicity, education level) that are known to have significant CVD risk gradients. A pronounced gradient in CVD prevalence by age is reported by Neyer et al [14]; thus the MI rate among 18-44 year olds is 0.8%, among 45-64 year olds is 4.8% and among the over 65s is 12.9%. In terms of the main ethnic groups in the US (white non-hispanic, black, hispanic, other), elevated CVD mortality and morbidity for nonwhite groups are reported by Barnett et al [6] and Casper et al [11], though ethnic differentials may to some degree express socioeconomic disadvantage. Certain subgroups, such as black females, have more clearly elevated CVD prevalence [15]. As to education level, Neyer et al [14] report that prevalence of one or more of an MI history or a CHD/angina history decreases with educational attainment: of persons with less than a high school diploma, 9.8% report a history of one or more of the conditions, nearly twice the proportion (5%) among college graduates. Education is interrelated with issues such as linguistic competence and health literacy that affect health status [16], and with health insurance [17].

Methods: Translating Survey Model to Small Area Prevalence Estimates
However, to permit small area (ZCTA) prevalence estimation, inclusion of risk variables (and interactions between them) in the regression model is subject to the constraint that such variables are available both in the BRFSS and in tabulations for ZCTA populations. So an interaction between risk factors requires a matching cross-tabulation in the ZCTA population. Impacts of age group, gender and ethnic group are straightforward to include since they are available as BRFSS variables and in a ZCTA level cross-tabulation of adult populations by gender, ethnicity, and quinquennial age. For particular gender-ethnic-age subgroups, parameters from the survey model (e.g. relative risk for white males aged 65-69) can then be applied to the ZCTA sub-population.
For other person level variables (e.g. education, marital status), either primary ZCTA tabulations are available from the 2000 census, or a restricted cross tabulation (e.g. adult population by education, ethnicity and gender in US census tabulation P148), but not tabulations involving cross-hatching against all other risk factors. A small area prevalence adjustment can be applied only for the main effect of such variables, or for a partial interaction. Thus the BRFSS regression models include gender-education effects, and so gradients in CVD relative risk can be applied to ZCTA male and female adult populations subdivided by education level. Gender-education-ethnic interactions are not adopted as the relevant ZCTA cross tabulation often includes very small numbers.

Methods: The Prevalence Model
The regression involves 129 thousand male and 207 thousand female respondents, and is confined to adults aged 18 and over. Separate regressions are carried out for males and females, in view of evidence of gender effect modification over a range of risk variables [18]. The regression also takes account of varying survey weights $w$ for different respondents to account for differential response between demographic categories and for different sampling rates in different US states. The detailed derivation of weights is discussed in CDC [19] and is based on the inverse of the sampling fraction in each area stratum and age-by-race-by-gender category.
Let $y = 1$ if a subject reports a particular CVD symptom, with $y = 0$ otherwise, and denote $p$ as the probability that a respondent reports a symptom. Then a weighted likelihood [20] over subjects $i$ and gender $r$ ($r = 1$ for males, 2 for females) is used, giving greater weight to undersampled demographic categories or areas, namely

$$L = \prod_{r} \prod_{i} \left[ p_{ir}^{y_{ir}} (1 - p_{ir})^{1 - y_{ir}} \right]^{w_{ir}} \qquad (1)$$

To facilitate a relative risk interpretation for parameters a log link is used in the binary regression [21]; see Appendix 1 for model details. In WinBUGS this requires (a) a model regression statement linking $\log(\pi_{ir})$ to risk factor covariates and any random effects and (b) a statement selecting the minimum of 1 and $\pi_{ir}$ as the actual probability $p_{ir}$ that $y_{ir} = 1$. The occurrence of values $\pi_{ir} > 1$ was confined to the first hundred or so MCMC iterations (depending on how close the starting parameter values are to the posterior means), and thereafter convergence was straightforward.
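A minimal sketch of the weighted likelihood with a capped log link follows. It assumes the usual pseudo-likelihood form, in which each respondent's Bernoulli log-likelihood contribution is multiplied by the survey weight; the function names and the small numeric guard against log(0) are illustrative choices and are not taken from the WinBUGS code used in the paper.

```python
import math


def capped_probability(log_pi):
    """Probability under a log link, taken as the minimum of 1 and exp(log_pi)."""
    return min(1.0, math.exp(log_pi))


def weighted_log_likelihood(y, log_pi, weights):
    """Sum over respondents of w * [y*log(p) + (1 - y)*log(1 - p)]."""
    total = 0.0
    for y_i, log_pi_i, w_i in zip(y, log_pi, weights):
        p = capped_probability(log_pi_i)
        # Illustrative guard: keep p strictly inside (0, 1) so both logs are finite.
        p = min(max(p, 1e-12), 1.0 - 1e-12)
        total += w_i * (y_i * math.log(p) + (1 - y_i) * math.log(1.0 - p))
    return total


# Two hypothetical respondents: a case with weight 2 and a non-case with weight 1.
example_ll = weighted_log_likelihood(
    y=[1, 0],
    log_pi=[math.log(0.1), math.log(0.2)],
    weights=[2.0, 1.0],
)
```

A respondent with weight 2 contributes as if observed twice, which is how undersampled categories gain influence on the estimates.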
Three types of regression model are applied in order to assess geographic effects. The first baseline model (model 1) includes only person level risk variables. It allows first for differential risks of each CVD symptom for black, hispanic and other ethnic groups as against whites as the reference category. Second, it allows differential risk according to education attainment with categories 1 = never attended, elementary only, or some high school; 2 = high school graduate; 3 = some college or technical school; 4 = college graduate (with level 1 as reference category for statistical estimation). Finally, since age gradients are known to vary by ethnic group, differential risks are assumed specific to combinations of age group (12 levels) and the four ethnic groups; the age bands are 18-24,25-29,30-34,..,70-74, and 75+.
The second type of model (model 2) includes geographic effects but without any interaction between area and person attributes (except for gender). Although prevalence is to be estimated for ZCTAs, the ZCTA of residence for BRFSS respondents is not available for confidentiality reasons. However, county and state of residence are provided, and one may model their impact on CVD prevalence. Since there are over 3000 US counties, some counties are sparsely represented in the survey, and so random effects at this level are not adopted. However, county level variables are used as predictors, these being the 2005 percent of population in poverty and a category variable, namely the 9 category rural-urban continuum coding [22] (see Table 1).
Many geographic influences may be unobserved (e.g. various environmental and health behavioral influences) and these are represented in the second and third models by state level random effects. These are modelled using a random effects approach (see Appendix 2) that allows both for spatial correlation between effects for contiguous states and for the presence of spatially isolated states. It is sensible to allow unobserved state influences to be spatially correlated to reflect smoothly varying risk factors in space [23]. However, application of conditional autoregressive spatial schemes [24], with spatial interaction typically based on contiguity of areas, is complicated by the presence of two spatially isolated states (Alaska, Hawaii). A different approach based on Congdon [25] is applied instead, which allows for varying strength in spatial clustering over the mainland states and also encompasses spatially isolated states.
The third model (model 3) allows for area-person interactions, in that state random effects are taken to be ethnicity specific. Differentiation of area effects by ethnicity reflects epidemiological evidence such as that from Casper et al [11] that CVD mortality and prevalence disparities between ethnic groups vary by place of residence. Let $C_i$ and $S_i$ respectively denote the county and state in which subject $i$ is resident. Let $r_i$ denote a subject's gender, $g_i$ denote their ethnic group, $x_i$ denote their age group, and $e_i$ denote their education level. Then the prevalence probability for a subject of gender $r = r_i$ is specified under the full model as

$$\log(\pi_{ir}) = \alpha_r + \beta_{r g_i} + \eta_{r e_i} + \gamma_{r x_i g_i} + \kappa_r\,\mathrm{Pov}_{C_i} + \delta_{r U_i} + \omega_{r S_i g_i}, \qquad p_{ir} = \min(1, \pi_{ir}),$$

where $\alpha_r$ are gender specific intercepts measuring the overall prevalence level, the $\beta_{rg}$ parameters measure varying prevalence by ethnicity, the $\eta_{re}$ terms measure varying prevalence by education, the $\gamma_{rxg}$ measure ethnic specific age gradients, $\kappa_r$ is the coefficient for county poverty $\mathrm{Pov}_{C_i}$, the $\delta_{ru}$ terms reflect the effect of the different categories $u$ in the rural-urban continuum (with $U_i$ the category of county $C_i$), and the $\omega_{rsg}$ terms are state random effects specific for ethnic group.
County poverty rates (for all ages in 2005) are expressed as proportions and range from 0.025 to 0.51, and are centred around the average poverty rate.
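To make the structure of the full model concrete, the sketch below adds the effects on the log scale and caps the resulting probability at 1. The additive form is assumed from the verbal description of the parameters, and every argument name and numeric value is hypothetical rather than an estimate from the paper.

```python
import math


def prevalence_probability(intercept, ethnicity, education, age_ethnicity,
                           poverty_coef, centred_poverty, urban_rural,
                           state_effect):
    """Log-link prevalence: sum the effects on the log scale, then cap at 1."""
    log_pi = (intercept + ethnicity + education + age_ethnicity
              + poverty_coef * centred_poverty + urban_rural + state_effect)
    return min(1.0, math.exp(log_pi))


# Hypothetical subject: baseline risk 5%, education relative risk 0.5,
# all other effects at their reference (zero) values, county at average poverty.
p_example = prevalence_probability(
    intercept=math.log(0.05), ethnicity=0.0, education=math.log(0.5),
    age_ethnicity=0.0, poverty_coef=1.2, centred_poverty=0.0,
    urban_rural=0.0, state_effect=0.0,
)
```

Because the link is logarithmic, each exponentiated coefficient acts as a multiplicative relative risk; here the education effect halves the baseline probability.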

Methods: ZCTA Prevalence Rates
To translate the prevalence model parameters into ZCTA level estimates requires categorisations of the ZCTA populations that match the survey derived individual and geographic risk factors used in the prevalence model. The goal is to obtain ZCTA age-sex-ethnic prevalence rates (and case totals) that reflect not only demographic gradients, but also the impact that the location and socioeconomic character of the ZCTA have on prevalence. Among important socioeconomic influences on disease (including CVD) that are available for ZCTAs in 2000 Census tabulations are education, income, poverty status, and household tenure.
Here education is used as a socioeconomic measure of small area populations because of established CVD prevalence gradients by education level [14], and because it is available both as a BRFSS survey question and in ZCTA census tabulations. Education has been used as a measure of socioeconomic status in other area health studies [26].
Essentially the age-sex-ethnic rates obtained from the survey prevalence model (for the reference education group) are adjusted according to a sex-specific education effect that is also estimated in the model.
Let $C_j$ and $S_j$ respectively denote the county and state in which ZCTA $j$ is located. Let $r$ denote gender, $g$ denote ethnic group and $x$ denote age group. Then given a particular county $C_j$ and state of residence $S_j$, prevalence rates for ZCTA $j$ specific to age-sex-ethnic group, but unadjusted for that ZCTA's education mix, are obtained from the full model as

$$p_u[j, r, x, g] = \exp\left(\alpha_r + \beta_{rg} + \gamma_{rxg} + \kappa_r\,\mathrm{Pov}_{C_j} + \delta_{r U_j} + \omega_{r S_j g}\right)$$

This is the model for the reference education group (namely, the group with less than high school education), for which the education effect is zero.
As described in Appendix 1, the $\beta$ and $\gamma$ parameters represent ethnic and age-ethnic effects for gender $r$; the parameters $\kappa$ and $\delta$ represent county poverty and urban-rural effects, and the $\omega$ parameters are state level random effects. Here $\mathrm{Pov}_{C_j}$ is the centred poverty rate and $U_j$ the rural-urban category of county $C_j$.
To take account of the impact on CVD prevalence of education, let $R_{re} = \exp(\eta_{re})$ be the survey model estimate of CVD relative risk at education level $e$ after controlling for age, ethnicity and geographic effects (county and state effects). The composite relative risk associated with the educational mix in ZCTA $j$ can be represented as a weighted total of the relative risks for each education level, namely

$$R_{jr} = \sum_{e} q_{jre} R_{re}$$

where $q_{jre}$ is the proportion of the adult population of gender $r$ in ZCTA $j$ at education level $e$. Finally, age-sex-ethnic prevalence rates $p_a[j, r, x, g]$ in ZCTA $j$ adjusted for its education mix are obtained as

$$p_a[j, r, x, g] = R_{jr}\, p_u[j, r, x, g]$$

Table 1: Rural-urban continuum coding*

| Code | Description | Type |
|------|-------------|------|
| 1 | Counties in metro areas of 1 million population or more | Metropolitan |
| 2 | Counties in metro areas of 250,000 to 1 million | Metropolitan |
| 3 | Counties in metro areas of fewer than 250,000 | Metropolitan |
| 4 | Urban population of 20,000 or more, adjacent to a metropolitan area | Non-metro |
| 5 | Urban population of 20,000 or more, not adjacent to a metropolitan area | Non-metro |
| 6 | Urban population of 2,500 to 19,999, adjacent to a metropolitan area | Non-metro |
| 7 | Urban population of 2,500 to 19,999, not adjacent to a metropolitan area | Non-metro |
| 8 | Completely rural or less than 2,500 urban population, adjacent to a metropolitan area | Non-metro |
| 9 | Completely rural or less than 2,500 urban population, not adjacent to a metropolitan area | Non-metro |

* see http://www.ers.usda.gov/Briefing/Rurality/RuralUrbCon/
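The education-mix adjustment described in this section can be sketched as follows. The sketch assumes that the composite relative risk is the average of the education-specific relative risks weighted by the ZCTA's education shares, and that it multiplies the rate for the reference education group; the shares, relative risks and function names are hypothetical.

```python
def composite_relative_risk(education_shares, relative_risks):
    """Weighted total of education-specific relative risks for one ZCTA."""
    if abs(sum(education_shares) - 1.0) > 1e-9:
        raise ValueError("education shares must sum to 1")
    return sum(s * rr for s, rr in zip(education_shares, relative_risks))


def adjusted_rate(unadjusted_rate, education_shares, relative_risks):
    """Scale the reference-education rate by the ZCTA's composite relative risk."""
    scaled = unadjusted_rate * composite_relative_risk(education_shares, relative_risks)
    return min(1.0, scaled)


# Hypothetical ZCTA: shares of adults at education levels 1 to 4, and
# hypothetical relative risks with level 1 as the reference (risk 1.0).
shares = [0.2, 0.3, 0.3, 0.2]
risks = [1.0, 0.8, 0.7, 0.4]

composite = composite_relative_risk(shares, risks)
rate = adjusted_rate(0.10, shares, risks)
```

A ZCTA in which every adult is in the reference education category has a composite relative risk of 1, so its rates are left unchanged.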

Results
Estimation of the three models follows the Bayesian method, whereby pre-existing knowledge regarding parameters is expressed in prior densities, and updated or posterior knowledge is obtained by combining the prior densities with the likelihood (1) of the observed data. Estimation uses iterative Markov chain Monte Carlo (MCMC) sampling methods [27], as provided in the WinBUGS program [28]. Goodness of fit is assessed by the Deviance Information Criterion or DIC [29], whereby the average deviance is adjusted to account for model complexity. The DIC is the average deviance plus a complexity measure (the effective number of parameters), with lower DICs representing better fit. Summaries of parameters (means and 95% intervals) are based on the second halves of two chain runs of 5000 iterations, with dispersed initial values. Convergence was achieved in all models using Brooks-Gelman-Rubin criteria [30]. Table 2 summarises the fit of the models, while Tables 3 and 4 show gender-specific estimates of the parameters $\{\alpha_r, \beta_{rg}, \eta_{re}, \kappa_r, \delta_{ru}\}$ from the three models. The DIC criteria in Table 2 show a gain in introducing geographic contextual variables (model 2 vs model 1), and a clear gain also in making state random effects specific to ethnic groups (model 3 vs model 2).
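As an illustration of how the DIC combines fit and complexity, the sketch below uses the standard definition in which the complexity term is the average deviance minus the deviance at the posterior mean of the parameters. The deviance values are hypothetical and the function name is illustrative.

```python
def dic(deviance_samples, deviance_at_posterior_mean):
    """Return (DIC, effective number of parameters) from sampled deviances."""
    mean_deviance = sum(deviance_samples) / len(deviance_samples)
    # Complexity: how much the average deviance exceeds the deviance
    # evaluated at the posterior mean of the parameters.
    effective_parameters = mean_deviance - deviance_at_posterior_mean
    return mean_deviance + effective_parameters, effective_parameters


# Hypothetical deviances from three retained MCMC iterations.
dic_value, p_d = dic([102.0, 98.0, 100.0], 95.0)
```

A more complex model is preferred only if its reduction in average deviance outweighs the increase in the effective number of parameters.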

Results: Person Level Attributes
In terms of person-level attributes, it can be seen from Tables 3 and 4 that there is a steeper educational gradient for females than males. In model 3, the relative risk for female college graduates, $\exp(\eta_{24}) = 0.40$, is under half that of the first education category, those with limited education (elementary education only or did not graduate from high school). Black females also show excess CVD risk (an excess that remains after controlling for socioeconomic and geographic effects), whereas black males do not. However, both males and females in the other ethnic group have elevated risk. The ethnic specific age gradients for males ($\gamma_{1xg}$) and for females ($\gamma_{2xg}$) under model 3 are shown in Figures 1 and 2. The age gradients are presented in the form $\exp(\alpha_r + \beta_{rg} + \gamma_{rxg})$, namely probabilities of CVD caseness by gender, age and ethnicity at reference levels of education and county urbanity and average county poverty. There are cross-over effects between black and white males, with higher rates for black males up to early old age but lower rates thereafter. This reflects a wider finding that blacks "experience heart disease and die of heart-related problems at earlier ages than whites" [31]. For black females prevalence rates exceed those among white females except among the very old.
Probabilities of CVD by gender, age, ethnicity and education at reference levels of county urbanity and average county poverty are obtained as

$$p_{rxge} = \exp(\alpha_r + \beta_{rg} + \gamma_{rxg} + \eta_{re})$$

The overall age adjusted prevalence $p_{rge}$ for ethnic groups $g$ at education level $e$ may be obtained by using age weights $w_x$ for a standard population (e.g. the European Standard Population), namely

$$p_{rge} = \sum_x w_x\, p_{rxge} \Big/ \sum_x w_x$$

Table 5 contains posterior summaries (expressed as percents CVD caseness) of the $p_{rge}$ over the four ethnic groups and four education levels. The widest contrast is among women, exemplified by the rates for white, college-educated women (mean prevalence of 3.0%), as opposed to women of other ethnicity with limited education (mean prevalence of 11.8%). The stronger effect of education on female risk means that the male to female risk ratio is higher for college graduates than those with lesser education.
Tables 3 and 4 show that the county poverty effect is more pronounced for female than male CVD caseness. Whereas all county poverty effects are significant, many of the coefficients for the county urban-rural category are not significant. Significance of urban-rural category differs whether [...]
The absence of clear patterns may be because the association between urban status and health is linked to the uneven distribution of poverty in the US, which tends to be disproportionately concentrated in metropolitan centres as well as in some rural areas [33]. So rural-urban prevalence gradients may be attenuated once poverty levels are controlled for.
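The age adjustment step is a direct standardisation and can be sketched as below. The three age bands, their rates and the weights are hypothetical (they are not the European Standard Population weights), and the weights are normalised inside the function so they need not sum to 1.

```python
def age_standardised_rate(age_specific_rates, standard_weights):
    """Weighted average of age-specific rates using standard population weights."""
    if len(age_specific_rates) != len(standard_weights):
        raise ValueError("one weight is needed per age band")
    total_weight = sum(standard_weights)
    weighted_sum = sum(w * p for w, p in zip(standard_weights, age_specific_rates))
    return weighted_sum / total_weight


# Hypothetical rates for three broad age bands and hypothetical weights.
standardised = age_standardised_rate([0.01, 0.05, 0.15], [50, 30, 20])
```

Using the same weights for every ethnic and education group removes differences in age structure from the comparison of their prevalence rates.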

Results: Geographic Variables
State level random effects are included in both models 2 and 3 (see Appendix 2). A summary expression of unobserved state level influences applicable across all ethnic groups is obtainable from the additive person and area effects model 2 (see Table 6). These are residual relative risks in the form $\exp(\omega_{rs})$ [...] (Tables 3 and 4). For males, there is greater variability in black and hispanic unexplained relative risk than for non-hispanic whites, while for females variability is greatest for hispanic and other ethnicities. To summarise the relative risk patterns, and in particular the location of states with two or more $\rho_{rsg} = \exp(\omega_{rsg})$ significantly above 1, the nine Census Bureau Regional Divisions (listed in Table 6) are used to categorise the states (Table 7). There are consistent patterns, with multiple elevated residual effects tending to occur in the South (South Atlantic, East South Central) and East North Central divisions; this pattern shows similarities with that found by studies such as [8], though here the pattern is one that persists after controlling for important person and county risk factors.

Conclusion
Geographic variations in the prevalence of chronic disease partly reflect the demographic composition of area populations. However, prevalence variations may also show distinct geographic 'contextual' effects that are differentiated between ethnic and other demographic categories. Studies of cardiovascular disease in the US have found major geographic variations that do not seem to be explicable by area demography alone.
The present study has demonstrated by formal modelling methods applied to BRFSS data that improved explanation is obtained by allowing for distinct geographic effects (for counties and states) and for interaction between geographic and person variables. There are significant spatial effects (e.g. county poverty effects, state residual effects) after adjusting for CVD gradients over person level variables, namely age, education, ethnicity.
This has direct implications for an appropriate methodology to estimate prevalence at small area level, with the focus here being ZIP Code Tabulation Areas. Thus, on the basis of the model estimates in the above analysis, prevalence estimates for a ZCTA need to reflect its region of location (e.g. in a South East state as opposed to a northern or mountain state) and the poverty level of the county containing it.
In methodological terms, this paper is distinct in using a log link multilevel binary regression model that takes account of both person level risk factors and the spatial context for a major chronic disease. The use of a log link allows straightforward inferences on relative risks and potentially allows the incorporation into the model of cumulative prior evidence (e.g. on relative CVD risks over ethnic groups). Statewide contextual effects have been represented by a structured random effect that allows for spatial correlation in unobserved risk factors but also extends to include spatially isolated areas (see Appendix 2). In an extended model (model 3) state random effects are taken to be ethnicity specific.

Figure 1: Ethnic specific age gradients, males.

Variations and extensions to the models presented above are possible. One option is state or county averages in the person level variables such as ethnicity and education level (e.g. county percent black or county percent college graduates). This has been proposed as a way of measuring contextual effects [34], though there is likely to be a positive correlation with the already included county poverty rate. Another possibility would be a longitudinal analysis over a sequence of successive surveys, which can indicate whether gradients over person level risk factors are changing, or whether geographic variability is changing.

Figure 2: Ethnic specific age gradients, females.

Appendix 1 Formal statement of model
Let $C_i$, $S_i$ and $U_i$ denote the county, state and (county level) rural-urban category of residence for respondent $i$. Also let $\{x_i, g_i, e_i\}$ denote the age, ethnicity and education level of respondent $i$. Then prevalence models are specific for gender $r$, and one may write prevalence model 3 (with ethnic-specific state effects) as

$$y_{ir} \sim \mathrm{Bin}(1, p_{ir}) \qquad (A1.1)$$

where $\mathrm{Bin}(n, p)$ denotes the binomial density, the param- [...]

Thus excess risk or unduly low risk may reflect geographic variations in prevalence that remain even after the impact of a range of important person and county attributes has been allowed for. Excess risk can be defined in terms of the 95% estimation interval for $\rho_{rsg} = \exp(\omega_{rsg})$ being confined to values above 1.
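The excess risk criterion can be checked directly from posterior samples of a state effect, as in the sketch below. It assumes an equal-tailed 95% interval computed from sorted samples; the quantile rule, the function names and the sample values are illustrative and are not the paper's output.

```python
import math


def credible_interval(samples, level=0.95):
    """Equal-tailed interval from sorted samples (simple index-based rule)."""
    ordered = sorted(samples)
    n = len(ordered)
    lower_index = math.floor((1.0 - level) / 2.0 * (n - 1))
    upper_index = math.ceil((1.0 + level) / 2.0 * (n - 1))
    return ordered[lower_index], ordered[upper_index]


def has_excess_risk(log_effect_samples, level=0.95):
    """True if the interval for exp(effect) lies entirely above 1."""
    relative_risks = [math.exp(w) for w in log_effect_samples]
    lower, _ = credible_interval(relative_risks, level)
    return lower > 1.0


# Hypothetical posterior samples of a log-scale state effect.
clearly_elevated = [0.10 + 0.001 * i for i in range(101)]
straddling_zero = [-0.05 + 0.001 * i for i in range(101)]

flag_elevated = has_excess_risk(clearly_elevated)
flag_straddling = has_excess_risk(straddling_zero)
```

The same check with the upper bound below 1 would identify states with unduly low residual risk.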

Appendix 2 State random effects
The 51 states in the model are the mainland US states (k = 1,.., 49) arranged alphabetically (Alabama to Wyoming, including the District of Columbia), together with Alaska and Hawaii (k = 50, 51). The presence of these two spatially isolated states complicates applications of standard approaches for spatially correlated effects, at least those based on a spatial contiguity matrix. It would still be possible to use a spatial model based on interstate distances, but this means that a spatial decay function in distance has to be specified and its parameters estimated. Here we follow the most common approach to spatial clustering, based on contiguity of areas, with a spatial effect that "should describe the fact that areas close to each other tend to behave similarly" [26].
One option that brings in all 51 states would be to follow the convolution approach of Besag et al [35] and assume there are two effects, one of which follows a conditional autoregressive scheme and applies only to the mainland states (k = 1,.., 49), while the other effect, applying to all 51 states is unstructured in the sense of not incorporating spatial structure.
Thus for states $k = 1, \ldots, 49$ the total state effect would be

$$\eta_k + \omega_k \qquad (A2.1)$$

where $\eta_k$ represents spatially unstructured heterogeneity, and $\omega_k$ represents a conditional autoregressive scheme based on contiguity. The suffix $r$ for gender is omitted for simplicity. Specified conditionally on effects $\omega_{[-k]}$ in the remaining 48 states, one has for mainland states $k = 1, \ldots, 49$

$$\omega_k \mid \omega_{[-k]} \sim N\left(W_k, \frac{\tau_\omega}{L_k}\right) \qquad (A2.2)$$

where $\tau_\omega$ is a variance parameter, $L_k$ is the number of states adjacent to state $k$, and $W_k$ is the average of $\omega_m$ over the $L_k$ states $m$ adjacent to state $k$. For example, $W_1$ (for Alabama) would be an average of the four $\omega$ effects for the contiguous states, Mississippi, Georgia, Florida and Tennessee. The prior for the $\eta_k$ would be over all 51 states, rather than the mainland 49 states, and typically specified as

$$\eta_k \sim N(0, \tau_\eta) \qquad (A2.3)$$

While this approach is an option when a collection of areas includes spatial isolates, it is not used here. The problems that occur with the model (A2.1) include identifiability, since only the total $\eta_k + \omega_k$ is identified by the data, and the heavy (i.e. non-parsimonious) parameterisation. Leroux et al [36] propose an alternative more parsimonious model that uses a single random effect, with a conditional form

$$\omega_k \mid \omega_{[-k]} \sim N\left(\frac{\lambda \sum_{m \sim k} \omega_m}{1 - \lambda + \lambda L_k}, \frac{\tau_\omega}{1 - \lambda + \lambda L_k}\right) \qquad (A2.4)$$

where $m \sim k$ denotes states $m$ adjacent to state $k$. This reduces to a purely spatial model, as in (A2.2), when $\lambda = 1$ and to pure heterogeneity (i.e. no spatial clustering) when $\lambda = 0$. The $\lambda$ parameter can be estimated and provides a measure of spatial dependence actually present in the data.
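The two limiting cases of the single-effect scheme of Leroux et al can be verified numerically. The sketch assumes the conditional form in which the mean is the spatial parameter times the sum of neighbouring effects, divided by one minus the parameter plus the parameter times the number of neighbours, with the variance divided by the same quantity; the function name and the input values are hypothetical.

```python
def leroux_conditional(neighbour_effects, lam, variance):
    """Conditional mean and variance of one area's effect given its neighbours."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    n_neighbours = len(neighbour_effects)
    denominator = 1.0 - lam + lam * n_neighbours
    if denominator <= 0.0:
        raise ValueError("an area without neighbours needs lam < 1")
    mean = lam * sum(neighbour_effects) / denominator
    return mean, variance / denominator


# Hypothetical effects in four neighbouring states.
neighbours = [0.2, 0.4, 0.0, -0.2]

# lam = 1: purely spatial scheme, mean is the neighbourhood average.
spatial_mean, spatial_var = leroux_conditional(neighbours, lam=1.0, variance=2.0)
# lam = 0: pure heterogeneity, neighbours are ignored.
iid_mean, iid_var = leroux_conditional(neighbours, lam=0.0, variance=2.0)
```

With no neighbours and a spatial parameter of 0 the function returns a zero mean and the full variance, which is how a spatially isolated area can be accommodated by this kind of scheme.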
Congdon [25] extends model (A2.4) to allow the spatial dependence parameters to vary by area, and a version of such an approach is used in the CVD prevalence modelling here. This extension allows spatial dependence to vary over sub-regions of the total region or nation being considered, and also allows for spatial outliers, distinct from their neighbours in terms of outcome level such as disease risk. Outliers would have relatively low $\lambda_k$ values, since spatial pooling (towards the neighbourhood average) is contra-indicated by the disparity between an area's risk and that of its neighbours. By contrast, areas surrounded by areas with similar levels of the outcome would have relatively high $\lambda_k$ values, since spatial pooling (towards the neighbourhood average) is supported by the data.
The conditional specification now takes the form [...] This model for spatial effects adapts to spatial outliers by taking $\lambda_k = 0$, so that for the subset of areas which are not connected to other areas one has [...] where $F$ is a symmetric matrix of dimension $G$. Allowing for varying spatial dependence over the entire region/nation being considered, one has [...]
In the application of (A2.5) in model 2, it is assumed that $1/\tau_\omega$ is gamma distributed a priori, namely $1/\tau_\omega \sim \mathrm{Ga}(1, 0.001)$. This is approximately equivalent to assuming $1/\tau_\omega$ to be uniformly distributed while constrained to positive values. Such a choice of gamma prior for $1/\tau_\omega$ follows the strategy of studies such as [35] and [37]. In the application of (A2.9) in model 3, it is assumed that $F$ is Wishart distributed, with $G$ degrees of freedom and an identity scale matrix.
The varying spatial parameters in models 2 and 3 are assumed to be beta distributed, $\lambda_k \sim \mathrm{Be}(\nu_1, \nu_2)$, where $\nu_1$ and $\nu_2$ are positive quantities equal to or exceeding 0.5. Thus $\nu_1 = \nu_2 = 1$ corresponds to a diffuse uniform prior $\lambda_k \sim U(0, 1)$, while more informative priors are obtained for $\nu_1 > 1$ and $\nu_2 > 1$. A baseline is provided when $\nu_1 = \nu_2 = 0.5$, equivalent to a prior sample size of 1. It is assumed that $\nu_1 \sim U(0.5, 5)$ and $\nu_2 \sim U(0.5, 5)$. The average value of the $\lambda_k$ over all contiguous states can be obtained as $\lambda_a = \nu_1/(\nu_1 + \nu_2)$.
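The role of the two beta parameters can be illustrated with a small helper that returns the prior mean of the spatial parameters and the implied prior sample size (the sum of the two parameters). The function name is illustrative, and the lower bound of 0.5 follows the constraint stated above.

```python
def beta_prior_summary(nu1, nu2):
    """Return (prior mean, prior sample size) for a Beta(nu1, nu2) prior."""
    if nu1 < 0.5 or nu2 < 0.5:
        raise ValueError("both parameters are assumed to be at least 0.5")
    return nu1 / (nu1 + nu2), nu1 + nu2


# Baseline case: mean 0.5 with a prior sample size of 1.
baseline_mean, baseline_size = beta_prior_summary(0.5, 0.5)
# A hypothetical prior favouring weak spatial dependence.
weak_mean, weak_size = beta_prior_summary(1.0, 3.0)
```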