
# A multilevel model for cardiovascular disease prevalence in the US and its application to micro area prevalence estimates

*International Journal of Health Geographics*
**volume 8**, Article number: 6 (2009)

## Abstract

### Background

Estimates of disease prevalence for small areas are increasingly required for the allocation of health funds according to local need. Both individual level and geographic risk factors are likely to be relevant to explaining prevalence variations, and in turn relevant to the procedure for small area prevalence estimation. Prevalence estimates are of particular importance for major chronic illnesses such as cardiovascular disease.

### Methods

A multilevel prevalence model for cardiovascular outcomes is proposed that incorporates both survey information on patient risk factors and the effects of geographic location. The model is applied to derive micro area prevalence estimates, specifically estimates of cardiovascular disease for Zip Code Tabulation Areas in the USA. The model incorporates prevalence differentials by age, sex, ethnicity and educational attainment from the 2005 Behavioral Risk Factor Surveillance System survey. Influences of geographic context are modelled at both county and state level, with the county effects relating to poverty and urbanity. State level influences are modelled using a random effects approach that allows both for spatial correlation and spatial isolates.

### Results

To assess the importance of geographic variables, three types of model are compared: a model with person level variables only; a model with geographic effects that do not interact with person attributes; and a full model, allowing for state level random effects that differ by ethnicity. There is clear evidence that geographic effects improve statistical fit.

### Conclusion

Geographic variations in disease prevalence partly reflect the demographic composition of area populations. However, prevalence variations may also show distinct geographic 'contextual' effects. The present study demonstrates by formal modelling methods that improved explanation is obtained by allowing for distinct geographic effects (for counties and states) and for interaction between geographic and person variables. Thus an appropriate methodology to estimate prevalence at small area level should include geographic effects as well as person level demographic variables.

## Background

Estimates of prevalence of disease and health behaviours for different areas are increasingly required for the equitable allocation of health funds according to local need and to target interventions. As stressed by Bazos et al [1] community health need assessments are ideally based on locally disaggregated (i.e. small area) health status and disease prevalence information. To estimate prevalence in different small areas, a commonly adopted approach involves synthetic estimation whereby prevalence rates for demographic subgroups of the population are obtained (e.g. from national health surveys) and an indicative rate then obtained based on the demographic composition of each area. Thus prevalence of most health conditions varies considerably with age, and often also by sex and race: so a synthetic estimate may be obtained by using age, sex and race specific prevalence rates.
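The synthetic estimation step described above can be illustrated with a minimal Python sketch. The subgroup rates, the population counts and the function name `synthetic_estimate` are all hypothetical, chosen only to show the arithmetic; they are not taken from any survey.

```python
# Hypothetical subgroup prevalence rates (e.g. from a national survey) and
# hypothetical population counts for one small area; all values are illustrative.
survey_rates = {
    ("male", "18-44"): 0.01,
    ("male", "45-64"): 0.06,
    ("male", "65+"): 0.20,
    ("female", "18-44"): 0.01,
    ("female", "45-64"): 0.04,
    ("female", "65+"): 0.15,
}

area_population = {
    ("male", "18-44"): 1000,
    ("male", "45-64"): 500,
    ("male", "65+"): 200,
    ("female", "18-44"): 1000,
    ("female", "45-64"): 500,
    ("female", "65+"): 300,
}


def synthetic_estimate(rates, population):
    """Return (expected cases, overall prevalence) for one area."""
    # Apply each subgroup rate to the matching subgroup population and sum.
    cases = sum(rates[group] * count for group, count in population.items())
    total = sum(population.values())
    return cases, cases / total


expected_cases, prevalence = synthetic_estimate(survey_rates, area_population)
print(f"expected cases: {expected_cases:.1f}, prevalence: {prevalence:.3f}")
```

Because the same subgroup rates are applied everywhere, two areas with identical demographic composition receive identical estimates under this approach, whatever their location.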

However, synthetic estimates of this kind do not take account of geographic context, exemplified by interactions between demographic risk factors and geographic location, or by independent effects of geographic variables (e.g. area poverty or urbanity-rurality) on prevalence that remain even after taking account of patient level risk factors. By contrast, the multilevel prevalence model for cardiovascular outcomes proposed here as a basis for small area prevalence estimates incorporates the modifying effects of geographic context as well as patient risk factors.

In the US, a number of population health surveys are carried out and provide cumulative evidence on cardiovascular disease (CVD) trends and epidemiology. Thus the National Health Interview Survey (NHIS) for 2005 estimated the prevalence of CVD at 68 million among adults aged 18 years and over in the US, a figure which includes coronary heart disease (CHD), hypertension, stroke, angina pectoris or heart attack. The analysis here is concerned with a positive response to one or more of three questions included in the 2005 Behavioral Risk Factor Surveillance System (BRFSS) survey; these questions encompass the different forms of CVD, namely, whether the subject had ever been told by a health professional that they had experienced a heart attack, had a stroke, or had CHD or angina.

The epidemiology of these conditions differs to some degree, for example in terms of male-female differentials in prevalence and incidence [2], in trends through time [3], and in ethnic group differentials. However, for these and related conditions there is evidence for a role of geographic context, in terms of wide geographic disparities by region, state and urbanity [4–8]. In particular, there is evidence of direct effects of area variables after controlling for person level risk factors, and evidence of interactions between place and person variables. For example, Cubbin et al [9] report higher levels of hypertension and diabetes among African American women living in socioeconomically deprived neighborhoods as against African American women from more affluent neighborhoods, after allowing for individual-level socioeconomic status, while Halverson et al [10] report local clustering of excess CVD mortality after controlling for area population composition. As for place-person interactions, Barnett et al [6] and Casper et al [11] report that ethnic disparities in CHD mortality vary by area of residence.

The prevalence model and small area prevalence estimates described here are based on around 336,000 survey responses, and on a regression analysis relating CVD status both to individual level risk factors and to county level measures of poverty and urban-rural status. The analysis further adjusts for differentiation at US state level in the impact of ethnicity on prevalence. Thus adjustment for geographic context is much more comprehensive than is possible using disease status data from the Health Survey for England where only broad regional identifiers are available – an example being the work of Congdon [12] on CHD prevalence. One goal of the analysis here is to develop prevalence estimates for micro areas, namely 32000 ZIP Code Tabulation Areas (ZCTAs) for which certain population tabulations are provided by the US Census Bureau [13]. Inclusion in the prevalence model of patient risk categories such as gender and ethnicity (and interactions between them) therefore requires that such categories are available in these tabulations for micro area populations.

## Methods

The regression model for prevalence includes person level attributes (age, gender, ethnicity, education level) that are known to have significant CVD risk gradients. A pronounced gradient in CVD prevalence by age is reported by Neyer et al [14]; thus the myocardial infarction (MI) rate among 18–44 year olds is 0.8%, among 45–64 year olds is 4.8% and among the over 65s is 12.9%. In terms of the main ethnic groups in the US (white non-hispanic, black, hispanic, other) elevated CVD mortality and morbidity for nonwhite groups are reported by Barnett et al [6] and Casper et al [11], though ethnic differentials may to some degree express socioeconomic disadvantage. Certain subgroups, such as black females, have more clearly elevated CVD prevalence [15]. As to education level, Neyer et al [14] report that prevalence of one or more of an MI history or a CHD/angina history decreases with educational attainment: of persons with less than a high school diploma, 9.8% report a history of one or more of the conditions, nearly twice the proportion (5%) among college graduates. Education is interrelated with issues such as linguistic competence and health literacy that affect health status [16], and with health insurance [17].

### Methods: Translating Survey Model to Small Area Prevalence Estimates

However, to permit small area (ZCTA) prevalence estimation, inclusion of risk variables (and interactions between them) in the regression model is subject to the constraint that such variables are available both in the BRFSS and in tabulations for ZCTA populations. So an interaction between risk factors requires a matching cross-tabulation in the ZCTA population. Impacts of age group, gender and ethnic group are straightforward to include since they are available as BRFSS variables and in a ZCTA level cross-tabulation of adult populations by gender, ethnicity, and quinquennial age. For particular gender-ethnic-age subgroups, parameters from the survey model (e.g. relative risk for white males aged 65–69) can then be applied to the ZCTA sub-population.

For other person level variables (e.g. education, marital status), either primary ZCTA tabulations are available from the 2000 census, or a restricted cross tabulation (e.g. adult population by education, ethnicity and gender in US census tabulation P148), but not tabulations involving cross-hatching against all other risk factors. A small area prevalence adjustment can be applied only for the main effect of such variables, or for a partial interaction. Thus the BRFSS regression models include gender-education effects, and so gradients in CVD relative risk can be applied to ZCTA male and female adult populations subdivided by education level. Gender-education-ethnic interactions are not adopted as the relevant ZCTA cross tabulation often includes very small numbers.

### Methods: The Prevalence Model

The regression involves 129 thousand male and 207 thousand female respondents, and is confined to adults aged 18 and over. Separate regressions are carried out for males and females, in view of evidence of gender effect modification over a range of risk variables [18]. The regression also takes account of varying survey weights *w* for different respondents to account for differential response between demographic categories and for different sampling rates in different US states. The detailed derivation of weights is discussed in CDC [19] and is based on the inverse of the sampling fraction in each area stratum and age-by-race-by-gender category.

Let *y* = 1 if a subject reports a particular CVD symptom, with *y* = 0 otherwise, and denote *p* as the probability that a respondent reports a symptom. Then a weighted likelihood [20] over subjects *i* and gender *r* (*r* = 1 for males, 2 for females) is used, giving greater weight to undersampled demographic categories or areas, namely

$$L = \prod_{r}\prod_{i} \left[ p_{ir}^{\,y_{ir}} \left(1 - p_{ir}\right)^{1 - y_{ir}} \right]^{w_{ir}}. \qquad (1)$$

To facilitate a relative risk interpretation for parameters, a log link is used in the binary regression [21] (see Appendix 1 for model details). In WinBUGS this requires (a) a model regression statement linking $\log(p_{ir}^{\ast})$ to risk factor covariates and any random effects and (b) a statement selecting the minimum of 1 and $p_{ir}^{\ast}$ as the actual probability $p_{ir}$ that $y_{ir} = 1$. The occurrence of values $p_{ir}^{\ast} > 1$ was confined to the first hundred or so MCMC iterations (depending on how close the starting parameter values are to the posterior means), and thereafter convergence was straightforward.
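The two ingredients described above, a survey-weighted likelihood and a log link whose probability is capped at 1, can be sketched in Python as follows. This assumes the common pseudo-likelihood form in which each respondent's log-likelihood contribution is multiplied by the survey weight. The function names and the guard constant `eps` are illustrative assumptions, and the sketch is not the WinBUGS code used for the analysis.

```python
import math


def capped_probability(linear_predictor):
    """Log link: p* = exp(linear predictor), then p = min(1, p*)."""
    return min(1.0, math.exp(linear_predictor))


def weighted_log_likelihood(y, eta, weights, eps=1e-12):
    """Survey-weighted Bernoulli log-likelihood for 0/1 responses y.

    eta holds log-scale linear predictors and weights holds survey weights.
    eps is an illustrative guard (an assumption) that keeps log(1 - p)
    finite when p has been capped at exactly 1.
    """
    total = 0.0
    for y_i, eta_i, w_i in zip(y, eta, weights):
        p = capped_probability(eta_i)
        p = min(max(p, eps), 1.0 - eps)
        total += w_i * (y_i * math.log(p) + (1 - y_i) * math.log(1.0 - p))
    return total


# Three hypothetical respondents with probabilities 0.10, 0.05 and 0.20.
y = [1, 0, 0]
eta = [math.log(0.10), math.log(0.05), math.log(0.20)]
weights = [2.0, 1.0, 0.5]
log_lik = weighted_log_likelihood(y, eta, weights)
print(round(log_lik, 4))
```

With a log link the exponentiated coefficients are relative risks rather than odds ratios, which is the interpretation the text relies on; the cap is needed because `exp` of the linear predictor is not bounded above by 1.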

Three types of regression model are applied in order to assess geographic effects. The first baseline model (model 1) includes only person level risk variables. It allows first for differential risks of each CVD symptom for black, hispanic and other ethnic groups as against whites as the reference category. Second, it allows differential risk according to educational attainment with categories 1 = never attended, elementary only, or some high school; 2 = high school graduate; 3 = some college or technical school; 4 = college graduate (with level 1 as reference category for statistical estimation). Finally, since age gradients are known to vary by ethnic group, differential risks are assumed specific to combinations of age group (12 levels) and the four ethnic groups; the age bands are 18–24, 25–29, 30–34, ..., 70–74, and 75+.

The second type of model (model 2) includes geographic effects but without any interaction between area and person attributes (except for gender). Although prevalence is to be estimated for ZCTAs, the ZCTA of residence for BRFSS respondents is not available for confidentiality reasons. However, county and state of residence are provided, and one may model their impact on CVD prevalence. Since there are over 3000 US counties, some counties are sparsely represented in the survey, and so random effects at this level are not adopted. However, county level variables are used as predictors, these being the 2005 percent of population in poverty and a category variable, namely the 9 category rural-urban continuum coding [22] – see Table 1.

Many geographic influences may be unobserved (e.g. various environmental and health behavioral influences) and these are represented in the second and third models by state level random effects. These are modelled using a random effects approach (see Appendix 2) that allows both for spatial correlation between effects for contiguous states and for the presence of spatially isolated states. It is sensible to allow unobserved state influences to be spatially correlated to reflect smoothly varying risk factors in space [23]. However, application of conditional autoregressive spatial schemes [24], with spatial interaction typically based on contiguity of areas, is complicated by the presence of two spatially isolated states (Alaska, Hawaii). A different approach based on Congdon [25] is applied instead, which allows for varying strength in spatial clustering over the mainland states and also encompasses spatial isolates. In model 2, effects of county poverty and urbanity are included together with random effects for the 51 states.

The third model (model 3) allows for area-person interactions, in that state random effects are taken to be ethnicity specific. Differentiation of area effects by ethnicity reflects epidemiological evidence such as that from Casper et al [11] that CVD mortality and prevalence disparities between ethnic groups vary by place of residence. Let $C_i$ and $S_i$ respectively denote the county and state in which subject $i$ is resident. Let $r_i$ denote a subject's gender, $g_i$ denote their ethnic group, $x_i$ denote their age group, and $e_i$ denote their education level. Then the prevalence probability is specified under the full model as

$$p[r_i, g_i, e_i, x_i, C_i, S_i] = \exp\left(\alpha[r_i] + \beta[r_i, g_i] + \eta[r_i, e_i] + \gamma[r_i, x_i, g_i] + \kappa[r_i]\,\mathit{Pov}[C_i] + \delta[r_i, U[C_i]] + w[r_i, S_i, g_i]\right), \qquad (2)$$

where $\alpha_r$ are gender specific intercepts measuring the overall prevalence level, the $\beta_{rg}$ parameters measure varying prevalence by ethnicity, the $\eta_{re}$ terms measure varying prevalence by education, the $\gamma_{rxg}$ measure ethnic specific age gradients, $\kappa_r$ is the coefficient for county poverty, the $\delta_{ru}$ terms reflect the effect of different categories $U$ in the rural-urban continuum, and the $w_{rsg}$ terms are state random effects specific for ethnic group. County poverty rates (for all ages in 2005) are expressed as proportions and range from 0.025 to 0.51, and are centred around the average poverty rate.
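To make the structure of the full model concrete, the following Python sketch evaluates the prevalence probability for one hypothetical respondent of a given gender. All parameter values, category labels and the function name `prevalence` are assumptions made for illustration; none are estimates from the analysis. Reference categories are set to zero, and the probability is capped at 1 as in the estimation procedure.

```python
import math

# Hypothetical parameter values for one gender (r fixed); not estimates.
params = {
    "alpha": -4.0,                                     # intercept
    "beta": {"white": 0.0, "black": 0.1},              # ethnicity effect
    "eta": {1: 0.0, 2: -0.2, 3: -0.3, 4: -0.6},        # education effect
    "gamma": {("65-69", "white"): 1.5,                 # age by ethnicity effect
              ("65-69", "black"): 1.4},
    "kappa": 1.0,                                      # county poverty slope
    "delta": {1: 0.0, 2: 0.05},                        # rural-urban category effect
    "w": {("KY", "white"): 0.1, ("KY", "black"): 0.2}, # state effect by ethnicity
}


def prevalence(params, ethnicity, education, age, poverty_centred, urban_cat, state):
    """Evaluate exp(linear predictor) for the full model, capped at 1."""
    log_p = (
        params["alpha"]
        + params["beta"][ethnicity]
        + params["eta"][education]
        + params["gamma"][(age, ethnicity)]
        + params["kappa"] * poverty_centred   # poverty proportion minus its mean
        + params["delta"][urban_cat]
        + params["w"][(state, ethnicity)]
    )
    return min(1.0, math.exp(log_p))


p = prevalence(params, "black", 2, "65-69",
               poverty_centred=0.05, urban_cat=2, state="KY")
print(round(p, 4))
```

Because every term enters additively on the log scale, each exponentiated term acts as a multiplicative relative risk on the baseline prevalence `exp(alpha)`.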

### Methods: ZCTA Prevalence Rates

To translate the prevalence model parameters into ZCTA level estimates requires categorisations of the ZCTA populations that match the survey derived individual and geographic risk factors used in the prevalence model. The goal is to obtain ZCTA age-sex-ethnic prevalence rates (and case totals) that reflect not only demographic gradients, but also reflect the impact that the location and socioeconomic character of the ZCTA have on prevalence. Among important socioeconomic influences on disease (including CVD) that are available for ZCTAs in 2000 Census tabulations are education, income, poverty status, and household tenure.

Here education is used as a socioeconomic measure of small area populations because of established CVD prevalence gradients by education level [14], and because it is available both as a BRFSS survey question and in ZCTA census tabulations. Education has been used as a measure of socioeconomic status in other area health studies [26]. Essentially the age-sex-ethnic rates obtained from the survey prevalence model (for the reference education group) are adjusted according to a sex-specific education effect that is also estimated in the model.

Let $C_j$ and $S_j$ respectively denote the county and state in which ZCTA $j$ is located. Let $r$ denote gender, $g$ denote ethnic group and $x$ denote age group. Then given a particular county $C_j$ and state of residence $S_j$, prevalence rates for ZCTA $j$ specific to age-sex-ethnic group, but unadjusted for that ZCTA's education mix, are obtained from the full model as

$$p[j, r, x, g] = \exp\left(\alpha[r] + \beta[r, g] + \gamma[r, x, g] + \kappa[r]\,\mathit{Pov}[C_j] + \delta[r, U[C_j]] + w[r, S_j, g]\right). \qquad (3)$$

This is the model for the reference education group (namely, the group with less than high school education). As described in Appendix 1, the *β* and *γ* parameters represent ethnic and age-ethnic effects for gender *r*; the parameters *κ* and *δ* represent county poverty and urban-rural effects, and the *w* parameters are state level random effects.

To take account of the impact on CVD prevalence of education attainment mix, let *π*[*j, r, e*] be the 2000 census data relative proportions at education level *e* in each gender's adult population in ZCTA *j*. Also let

$$\lambda[r, e] = \exp(\eta[r, e]) \qquad (4)$$

be the survey model estimate of CVD relative risk at education level *e* after controlling for age, ethnicity and geographic effects (county and state effects). The composite relative risk associated with the educational mix in ZCTA *j* can be represented as a weighted total of the relative risks for each education level, namely

$$L[j, r] = \sum_{e} \pi[j, r, e]\,\lambda[r, e]. \qquad (5)$$

Finally, age-sex-ethnic prevalence rates $p_a[j, r, x, g]$ in ZCTA $j$ adjusted for its education mix are obtained as

$$p_a[j, r, x, g] = p[j, r, x, g]\,L[j, r]. \qquad (6)$$
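The education adjustment can be sketched in a few lines of Python. The log relative risks, the education shares, the unadjusted rate and the function name `composite_relative_risk` are hypothetical values and names chosen for illustration only.

```python
import math

# Hypothetical log relative risks eta[e] for one gender; level 1 is the reference.
eta = {1: 0.0, 2: -0.2, 3: -0.4, 4: -0.8}

# Hypothetical shares pi[e] of one ZCTA's adult population at each education
# level for that gender; the shares sum to 1.
pi = {1: 0.2, 2: 0.3, 3: 0.3, 4: 0.2}


def composite_relative_risk(shares, log_relative_risks):
    """Weighted total of education-specific relative risks for one ZCTA."""
    return sum(shares[e] * math.exp(log_relative_risks[e]) for e in shares)


composite = composite_relative_risk(pi, eta)

# Hypothetical unadjusted age-sex-ethnic rate for the reference education group.
unadjusted_rate = 0.10
adjusted_rate = unadjusted_rate * composite
print(round(composite, 4), round(adjusted_rate, 4))
```

Since the reference group is the one with the least education and the other levels have relative risks below 1 in this example, the composite factor is below 1 and shrinks the unadjusted rate; a ZCTA whose adults all fall in the reference group would have a factor of exactly 1.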

## Results

Estimation of the three models follows the Bayesian method, whereby pre-existing knowledge regarding parameters is expressed in prior densities, and updated or posterior knowledge is obtained by combining the prior densities with the likelihood (1) of the observed data. Estimation uses iterative Markov chain Monte Carlo (MCMC) sampling methods [27], as provided in the WinBUGS program [28]. Goodness of fit is assessed by the Deviance Information Criterion or DIC [29], whereby the average deviance is adjusted to account for model complexity. The DIC is the average deviance plus a complexity term (the effective number of parameters), with lower DICs representing better fit. Summaries of parameters (means and 95% intervals) are based on the second halves of two chain runs of 5000 iterations, with dispersed initial values. Convergence was achieved in all models using Brooks-Gelman-Rubin criteria [30].
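As a minimal sketch of how the DIC is assembled from MCMC output, the Python function below uses the standard definition in which the complexity term is the mean deviance minus the deviance evaluated at the posterior mean of the parameters. The function name `dic` and the deviance values are hypothetical; WinBUGS reports these quantities directly.

```python
def dic(deviance_samples, deviance_at_posterior_mean):
    """Return (DIC, effective number of parameters, mean deviance).

    deviance_samples: deviance evaluated at each retained MCMC draw.
    deviance_at_posterior_mean: deviance evaluated at the posterior means
    of the parameters.
    """
    mean_deviance = sum(deviance_samples) / len(deviance_samples)
    effective_params = mean_deviance - deviance_at_posterior_mean
    return mean_deviance + effective_params, effective_params, mean_deviance


# Hypothetical deviance draws for one model.
dic_value, p_d, d_bar = dic([1002.0, 1004.0, 1006.0, 1008.0], 1000.0)
print(dic_value, p_d, d_bar)
```

A model with more effective parameters is preferred only if its mean deviance falls by more than the increase in the complexity term.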

Table 2 summarises the fit of the models, while Tables 3 and 4 show gender-specific estimates of the parameters $\{\alpha_r, \beta_{rg}, \eta_{re}, \kappa_r, \delta_{ru}\}$ from the three models. The DIC criteria in Table 2 show a gain in introducing geographic contextual variables (model 2 vs model 1), and a clear gain also in making state random effects specific to ethnic groups (model 3 vs model 2).

### Results: Person Level Attributes

In terms of person-level attributes, it can be seen from Tables 3 and 4 that there is a steeper educational gradient for females than males. In model 3, the relative risk for female college graduates, $\exp(\eta_{24}) = 0.40$, is under half that of the first education category, those with limited education (elementary education only or did not graduate from high school). Black females also show excess CVD risk (an excess that remains after controlling for socioeconomic and geographic effects), whereas black males do not. However, both males and females in the other ethnic group have elevated risk. The ethnic specific age gradients for males ($\gamma_{1xg}$) and for females ($\gamma_{2xg}$) under model 3 are shown in Figures 1 and 2. The age gradients are presented in the form

$$p_{rxg} = \exp(\alpha_r + \beta_{rg} + \gamma_{rxg}), \qquad (7)$$

namely probabilities of CVD caseness by gender, age and ethnicity at reference levels of education and county urbanity and average county poverty. There are cross-over effects between black and white males, with higher rates for black males up to early old age but lower rates thereafter. This reflects a wider finding that blacks "experience heart disease and die of heart-related problems at earlier ages than whites" [31]. For black females prevalence rates exceed those among white females except among the very old.

Probabilities of CVD by gender, age, ethnicity and education at reference levels of county urbanity and average county poverty are obtained as

$$p_{rxge} = \exp(\alpha_r + \beta_{rg} + \eta_{re} + \gamma_{rxg}). \qquad (8)$$

The overall age adjusted prevalence $p_{rge}$ for ethnic groups $g$ at education level $e$ may be obtained by using age weights $w_x$ for a standard population (e.g. the European Standard Population), namely

$$p_{rge} = \frac{\sum_{x} w_x\, p_{rxge}}{\sum_{x} w_x}. \qquad (9)$$
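Direct age standardisation of this kind amounts to a weighted average of age-specific rates. In the Python sketch below, the age bands, rates and weights are hypothetical (they are not the European Standard Population values), and the function name `age_standardised` is an assumption for illustration.

```python
def age_standardised(rates_by_age, weights_by_age):
    """Weighted average of age-specific rates using standard population weights."""
    total_weight = sum(weights_by_age.values())
    weighted_sum = sum(weights_by_age[x] * rates_by_age[x] for x in rates_by_age)
    return weighted_sum / total_weight


# Hypothetical age-specific prevalence rates for one gender-ethnic-education group.
rates = {"18-44": 0.01, "45-64": 0.05, "65+": 0.15}

# Hypothetical standard population weights (counts per 100 people).
standard_weights = {"18-44": 50, "45-64": 30, "65+": 20}

standardised_rate = age_standardised(rates, standard_weights)
print(round(standardised_rate, 4))
```

Dividing by the sum of the weights means the function works whether the weights are supplied as counts or as proportions.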

Table 5 contains posterior summaries (expressed as percentages of CVD caseness) of the $p_{rge}$ over the four ethnic groups and four education levels. The widest contrast is among women, exemplified by the rates for white, college-educated women (mean prevalence of 3.0%), as opposed to women of other ethnicity with limited education (mean prevalence of 11.8%). The stronger effect of education on female risk means that the male to female risk ratio is higher for college graduates than those with lesser education.

### Results: Geographic Variables

Tables 3 and 4 show that the county poverty effect is more pronounced for female than male CVD caseness. Whereas all county poverty effects are significant, many of the coefficients for the county urban-rural category are not significant. Significance of urban-rural category differs whether model 2 or model 3 is considered, and also differs to some extent by gender. Under model 3, male risks are significantly low in the non-metropolitan category "urban population with over 20 thousand or more, adjacent to a metropolitan area", while under model 2, significantly lower risk prevails in both categories of "urban population with over 20 thousand or more". These may be interpreted as categories intermediate between highly metropolitan and highly rural settings, and the lower risks there fit with the view of Ingram & Franco [32] that metropolitan and rural areas tend to have worse health than intermediate area types. However, for females under model 3, counties in smaller metropolitan areas, as well as those with urban populations over 2500 and adjacent to a metropolitan area, have a significantly elevated risk. The absence of clear patterns may be because the association between urban status and health is linked to the uneven distribution of poverty in the US, which tends to be disproportionately concentrated in metropolitan centres as well as in some rural areas [33]. So rural-urban prevalence gradients may be attenuated once poverty levels are controlled for.

State level random effects are included in both models 2 and 3 (see Appendix 2). A summary expression of unobserved state level influences applicable across all ethnic groups is obtainable from the additive person and area effects model 2 – see Table 6. These are residual relative risks in the form

$$\rho_{rs} = \exp(w_{rs}), \qquad (10)$$

over states *s*, and amount to residual effects after controlling for the age, ethnic and educational composition of populations, and also for county poverty and urbanity. High residual relative risks, namely those significantly exceeding 1 (in the sense that the 95% credible interval is confined to values over 1) tend to occur in the South East and South of the US. For males elevated unexplained risks are present in Indiana, Kentucky, Louisiana and Virginia, and for females in Kentucky, Mississippi, Tennessee, Texas and West Virginia. Significantly low relative risks, those significantly under 1, occur for males in California and Colorado, and for females in Colorado, Minnesota, New York and Hawaii.
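The significance rule used here, that a residual relative risk is elevated when its 95% credible interval lies wholly above 1 and low when the interval lies wholly below 1, can be sketched in Python as follows. The posterior draws are artificial, and the function names and the simple order-statistic interval are assumptions made for illustration rather than the method used by WinBUGS.

```python
import math


def credible_interval(samples, level=0.95):
    """Equal-tailed interval from sorted draws, using simple order statistics."""
    ordered = sorted(samples)
    n = len(ordered)
    tail = (1.0 - level) / 2.0
    lower = ordered[int(math.floor(tail * (n - 1)))]
    upper = ordered[int(math.ceil((1.0 - tail) * (n - 1)))]
    return lower, upper


def classify_state_effect(w_samples):
    """Classify a state from posterior draws of its log-scale random effect."""
    relative_risks = [math.exp(w) for w in w_samples]
    lower, upper = credible_interval(relative_risks)
    if lower > 1.0:
        return "elevated"
    if upper < 1.0:
        return "low"
    return "not significant"


# Artificial posterior draws of the state effect for three hypothetical states.
draws_high = [0.10 + 0.001 * k for k in range(101)]
draws_low = [-0.30 + 0.001 * k for k in range(101)]
draws_mixed = [-0.05 + 0.001 * k for k in range(101)]

labels = [classify_state_effect(d) for d in (draws_high, draws_low, draws_mixed)]
print(labels)
```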

When residual state effects are made ethnic-specific in model 3, there are clear contrasts in variability between ethnic groups (see the spatial variance estimates in Tables 3 and 4). For males, there is greater variability in black and hispanic unexplained relative risk than for non-hispanic whites, while for females variability is greatest for hispanic and other ethnicities. To summarise the relative risk patterns, and in particular the location of states with two or more $\rho_{rsg} = \exp(w_{rsg})$ significantly above 1, the nine Census Bureau Regional Divisions (listed in Table 6) are used to categorise the states (Table 7). There are consistent patterns, with multiple elevated residual effects tending to occur in the South (South Atlantic, East South Central) and East North Central divisions; this pattern shows similarities with that found by studies such as [8], though here the pattern is one that persists after controlling for important person and county risk factors.

### Results: ZCTA Prevalence Estimates

As discussed above, the model provides estimates of $p_a[j, r, x, g]$ for approximately 32 thousand ZCTAs in 51 states. These are gender-ethnic-age prevalence rates adjusted for the education mix of each ZCTA. Summary ZCTA prevalence rates for gender-ethnic combinations may then be obtained by applying standard population age weights $w_x$, namely

$$p_a[j, r, g] = \frac{\sum_{x} w_x\, p_a[j, r, x, g]}{\sum_{x} w_x}. \qquad (11)$$

Implications for prevalence levels and prevalence inequalities by state or county can then be assessed by considering relevant subsets of the gender-ethnic rates. Being able to assess small area inequality in health is important in health needs assessment [1].

Thus Table 8 presents female prevalence levels for the three main ethnic groups across the 51 states, obtained by averaging $p_a[j, 2, g]$ within states. Also shown are within state variances and ranges of the ZCTA prevalences. Prevalence levels and within state variability both tend to be higher in southern states such as Alabama, Kentucky, Louisiana, Mississippi, Texas and West Virginia.
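A state-level summary of this kind can be sketched in Python as below. The ZCTA identifiers and rates are hypothetical, the averaging is unweighted, and the use of the population variance is an assumption, since the text does not specify which variance estimator or weighting was used.

```python
import statistics
from collections import defaultdict

# Hypothetical (state, ZCTA identifier, age-standardised prevalence) records
# for one gender-ethnic group.
zcta_rates = [
    ("AL", "zcta_a", 0.08),
    ("AL", "zcta_b", 0.10),
    ("AL", "zcta_c", 0.12),
    ("CO", "zcta_d", 0.04),
    ("CO", "zcta_e", 0.06),
]


def summarise_by_state(records):
    """Return the mean, variance and range of ZCTA rates within each state."""
    grouped = defaultdict(list)
    for state, _zcta, rate in records:
        grouped[state].append(rate)
    return {
        state: {
            "mean": statistics.mean(rates),
            "variance": statistics.pvariance(rates),
            "range": (min(rates), max(rates)),
        }
        for state, rates in grouped.items()
    }


summary = summarise_by_state(zcta_rates)
for state, stats in sorted(summary.items()):
    print(state, stats)
```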

## Conclusion

Geographic variations in the prevalence of chronic disease partly reflect the demographic composition of area populations. However, prevalence variations may also show distinct geographic 'contextual' effects that are differentiated between ethnic and other demographic categories. Studies of cardiovascular disease in the US have found major geographic variations that do not seem to be explicable by area demography alone.

The present study has demonstrated by formal modelling methods applied to BRFSS data that improved explanation is obtained by allowing for distinct geographic effects (for counties and states) and for interaction between geographic and person variables. There are significant spatial effects (e.g. county poverty effects, state residual effects) after adjusting for CVD gradients over person level variables, namely age, education, ethnicity.

This has direct implications for an appropriate methodology to estimate prevalence at small area level, with the focus here being ZIP Code Tabulation Areas. Thus – on the basis of the model estimates in the above analysis – prevalence estimates for a ZCTA need to reflect its region of location (e.g. in a South East state as opposed to a northern or mountain state) and the poverty level of the county containing it.

In methodological terms, this paper is distinct in using a log link multilevel binary regression model that takes account of both person level risk factors and the spatial context for a major chronic disease. The use of a log link allows straightforward inferences on relative risks and potentially allows the incorporation into the model of cumulative prior evidence (e.g. on relative CVD risks over ethnic groups). Statewide contextual effects have been represented by a structured random effect, that allows for spatial correlation in unobserved risk factors but also extends to include spatially isolated areas (see Appendix 2). In an extended model (model 3) state random effects are differentiated by ethnic group, reflecting evidence from other sources that ethnic relativities are not constant geographically.

Variations and extensions to the models presented above are possible. One option is state or county averages in the person level variables such as ethnicity and education level (e.g. county percent black or county percent college graduates). This has been proposed as a way of measuring contextual effects [34], though there is likely to be a positive correlation with the already included county poverty rate. Another possibility would be a longitudinal analysis over a sequence of successive surveys, which can indicate whether gradients over person level risk factors are changing, or whether geographic variability is changing.

## Appendix 1 Formal statement of model

Let $C_i$, $S_i$ and $U_i$ denote the county, state and (county level) rural-urban category of residence for respondent $i$. Also let $\{x_i, g_i, e_i\}$ denote the age, ethnicity and education level of respondent $i$. Then prevalence models are specific for gender $r$, and one may write prevalence model 3 (with ethnic-specific state effects) as

$$y_{ir} \sim \mathrm{Bin}(1, p_{ir})$$

$$\log(p_{ir}) = \alpha_r + \beta_r[g_i] + \eta_r[e_i] + \gamma_r[g_i, x_i] + \kappa_r\,\mathit{Pov}[C_i] + \delta_r[U_i] + w_r[S_i, g_i],$$

where *Bin*(*n, p*) denotes the binomial density, the parameters {*α*, *β*, *δ*, *η*, *κ* } are fixed effects, and the parameters {*γ*,*w*} are random. This model is run separately for males and females.

Since the parameters operate on the log relative risk scale, state level relative risks by ethnic group $\rho_{rsg}$ (after controlling for known person and county attributes) may be obtained by exponentiating the state effect, namely

$$\rho_{rsg} = \exp(w_{rsg}).$$

Thus excess risk or unduly low risk may reflect geographic variations in prevalence that remain even after the impact of a range of important person and county attributes has been allowed for. Excess risk can be defined in terms of the 95% estimation interval for $\rho_{rsg}$ being confined to values above 1.

The baseline model 1 (with person level risk factors only) is

$$\log(p_{ir}) = \alpha_r + \beta_r[g_i] + \eta_r[e_i] + \gamma_r[g_i, x_i].$$

The intermediate model (model 2), including county regression terms and state random effects, but not including area-ethnicity interactions, is

$$\log(p_{ir}) = \alpha_r + \beta_r[g_i] + \eta_r[e_i] + \gamma_r[g_i, x_i] + \kappa_r\,\mathit{Pov}[C_i] + \delta_r[U_i] + w_r[S_i].$$

Thus the unobserved state effects are assumed to be equal across ethnic groups.

For the unknown fixed effects parameters, namely $\{\alpha_r, \beta_{rg}, \eta_{re}\}$ in model 1, and $\{\alpha_r, \beta_{rg}, \eta_{re}, \kappa_r, \delta_{ru}\}$ in models 2 and 3, diffuse normal priors with mean zero and variance 1000 are adopted. Corner constraints are used for the $\beta_{rg}$, $\eta_{re}$ and $\delta_{ru}$ parameters for identifiability, namely $\beta_{r1} = \eta_{r1} = \delta_{r1} = 0$. To pool strength across the age profiles of different ethnic groups, a first order random walk prior is used for the $G$-dimensional vector $\gamma_{rx} = (\gamma_{r1x}, \ldots, \gamma_{rGx})$, $x = 1, \ldots, X$ of age effects across $G$ ethnic groups. This has conditional form

where the *G* × *G* matrix ${\Omega}_{r}^{-1}$ represents covariation between age mortality profiles of ethnic groups. The precision (inverse covariance) matrices Ω_{r} are assigned a Wishart prior with identity scale matrix and *G* degrees of freedom, namely Ω_{r} ~ *Wish*(*I*, *G*).
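For orientation, one common way to write a multivariate first order random walk that is consistent with the definitions above is given here. This is a sketch under the assumption of a forward conditional form; equivalent full-conditional forms exist, and the author's exact expression may differ.

```latex
\gamma_{rx} \mid \gamma_{r,x-1} \sim N_G\!\left(\gamma_{r,x-1},\; \Omega_r^{-1}\right),
\qquad x = 2, \ldots, X,
\qquad \Omega_r \sim \mathrm{Wish}(I, G).
```

Under this form each age band's vector of ethnic-group effects is centred on the previous band's vector, so sparse data for one group at one age borrow strength both from adjacent ages and from the other groups.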

## Appendix 2 State random effects

The 51 states in the model are the mainland US states (*k* = 1,.., 49) arranged alphabetically (Alabama to Wyoming, including the District of Columbia), together with Alaska and Hawaii (*k* = 50, 51). The presence of these two spatially isolated states complicates applications of standard approaches for spatially correlated effects, at least those based on a spatial contiguity matrix. It would still be possible to use a spatial model based on interstate distances, but this means that a spatial decay function in distance has to be specified and its parameters estimated. Here we follow the most common approach to spatial clustering, based on contiguity of areas, with a spatial effect that "should describe the fact that areas close to each other tend to behave similarly" [26].

One option that brings in all 51 states would be to follow the convolution approach of Besag et al [35] and assume there are two effects, one of which follows a conditional autoregressive scheme and applies only to the mainland states (*k* = 1,.., 49), while the other effect, applying to all 51 states, is unstructured in the sense of not incorporating spatial structure.

Thus for states *k* = 1,.., 49 the total state effect would be

$$h_k + w_k, \qquad \text{(A 2.1)}$$

where *h*_{k} represents spatially unstructured heterogeneity, and *w*_{k} represents a conditional autoregressive scheme based on contiguity. The suffix *r* for gender is omitted for simplicity. Specified conditionally on effects *w*_{[-k]} in the remaining 48 states, one has for mainland states *k* = 1,.., 49

$$p(w_k \mid w_{[-k]}) \sim N(W_k, \tau_w / L_k), \qquad k = 1, \ldots, 49, \qquad \text{(A 2.2)}$$

where *τ*_{w} is a variance parameter, *L*_{k} is the number of states adjacent to state *k*, and *W*_{k} is the average of *w*_{m} over states *m* = 1,.., *L*_{k} adjacent to state *k*. For example, *W*_{1} (for Alabama) would be an average of the four *w* effects for the contiguous states, Mississippi, Georgia, Florida and Tennessee. The prior for the *h*_{k} would be over all 51 states, rather than the mainland 49 states, and typically specified as

$$h_k \sim N(0, \tau_h), \qquad k = 1, \ldots, 51,$$

where *τ*_{h} is a variance parameter. Under this convolution approach, for states 50 and 51 (Alaska and Hawaii) the state effect would consist of *h*_{k} only.
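The contiguity-based conditional prior can be sketched in a few lines, using the Alabama example from the text. The adjacency dictionary layout and the numeric effects are made-up illustrations; only the rule (neighbour average as mean, *τ*_{w}/*L*_{k} as variance) comes from the text.

```python
def car_conditional(state, adjacency, w, tau_w):
    """Mean and variance of w[state] given its neighbours under the
    contiguity-based conditional autoregressive scheme.

    adjacency: dict mapping each state to a list of adjacent states.
    w: dict of current state effects.
    tau_w: variance parameter.
    """
    neighbours = adjacency[state]
    n_adjacent = len(neighbours)  # L_k in the text
    if n_adjacent == 0:
        # Spatial isolates such as Alaska and Hawaii have no neighbour average.
        raise ValueError(f"{state} has no neighbours; the CAR prior is undefined")
    neighbour_mean = sum(w[m] for m in neighbours) / n_adjacent  # W_k in the text
    return neighbour_mean, tau_w / n_adjacent


# Alabama's four contiguous states as listed in the text; the effects are invented.
adjacency = {"Alabama": ["Mississippi", "Georgia", "Florida", "Tennessee"]}
w = {"Mississippi": 0.20, "Georgia": 0.10, "Florida": -0.05, "Tennessee": 0.15}
mean, variance = car_conditional("Alabama", adjacency, w, tau_w=0.4)
```

The explicit error for a state without neighbours mirrors the reason Alaska and Hawaii need separate handling in any contiguity-based scheme.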

While this approach is an option when a collection of areas includes spatial isolates, it is not used here. The problems that occur with the model (*A* 2.1) include identifiability, since only the total *h*_{k} + *w*_{k} is identified by the data, and the heavy (i.e. non-parsimonious) parameterisation. Leroux et al [36] propose an alternative more parsimonious model that uses a single random effect, with a conditional form

where *m* ~ *k* denotes states *m* adjacent to state *k*. This reduces to a purely spatial model, as in (*A* 2.2), when *λ* = 1 and to pure heterogeneity (i.e. no spatial clustering) when *λ* = 0. The *λ* parameter can be estimated and provides a measure of spatial dependence actually present in the data.
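The two limiting cases described here can be checked numerically. The sketch below assumes the conditional form usually quoted for the Leroux et al model, with mean *λ* Σ *w*_{m} / (1 - *λ* + *λL*_{k}) and variance *τ*_{w} / (1 - *λ* + *λL*_{k}); this form is an assumption of the example, and the function name is hypothetical.

```python
def leroux_conditional(neighbour_effects, lam, tau_w):
    """Conditional mean and variance of w_k given adjacent states' effects.

    Assumed form:
        mean = lam * sum(w_m) / (1 - lam + lam * L_k)
        var  = tau_w / (1 - lam + lam * L_k)
    where L_k is the number of adjacent states.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    n_adjacent = len(neighbour_effects)
    denom = 1.0 - lam + lam * n_adjacent
    if denom == 0.0:
        # Only happens for an isolate (no neighbours) with lam = 1.
        raise ValueError("a purely spatial prior is undefined for an isolate")
    return lam * sum(neighbour_effects) / denom, tau_w / denom


neighbours = [0.20, 0.10, -0.05, 0.15]
spatial = leroux_conditional(neighbours, lam=1.0, tau_w=0.4)       # neighbour average
unstructured = leroux_conditional(neighbours, lam=0.0, tau_w=0.4)  # mean 0, variance tau_w
mixed = leroux_conditional(neighbours, lam=0.5, tau_w=0.4)
```

With *λ* = 0 the neighbours drop out entirely, which is why an isolate can be accommodated by the same expression.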

Congdon [25] extends model (*A* 2.4) to allow the spatial dependence parameters to vary by area, and a version of such an approach is used in the CVD prevalence modelling here. This extension allows spatial dependence to vary over sub-regions of the total region or nation being considered, and also allows for spatial outliers, distinct from their neighbours in terms of outcome level such as disease risk. Outliers would have relatively low *λ*_{k} values, since spatial pooling (towards the neighbourhood average) is contra-indicated by the disparity between an area's risk and that of its neighbours. By contrast, areas surrounded by areas with similar levels of the outcome would have relatively high *λ*_{k} values, since spatial pooling (towards the neighbourhood average) is supported by the data.

The conditional specification now takes the form

This model for spatial effects adapts to spatial outliers by taking *λ*_{k} = 0, so that for the subset of areas which are not connected to other areas one has

$$w_k \sim N(0, \tau_w).$$

This approach extends to a multivariate random effect *w*_{k} = (*w*_{k1},.., *w*_{kG}) for *G* ethnic groups. With a uniform value of *λ* over areas the conditional mean under the Leroux et al [36] model is

with inverse dispersion matrix (precision matrix)

$$\mathrm{Prec}(w_k \mid w_{[-k]}) = [1 - \lambda + \lambda L_k]\,\Phi,$$

where Φ is a symmetric matrix of dimension *G*. Allowing for varying spatial dependence over the entire region/nation being considered, one has

In the application of (*A* 2.5) in model 2, it is assumed that 1/*τ*_{w} is gamma distributed a priori, namely 1/*τ*_{w} ~ *Ga*(1, 0.001). This is approximately equivalent to assuming 1/*τ*_{w} to be uniformly distributed while constrained to positive values. Such a choice of gamma prior for 1/*τ*_{w} follows the strategy of studies such as [35] and [37]. In the application of (*A* 2.9) in model 3, it is assumed that Φ is Wishart distributed, with *G* degrees of freedom and an identity scale matrix.
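The near-uniformity of the *Ga*(1, 0.001) prior can be checked with a short sketch. It assumes the shape/rate parameterisation (the convention in BUGS-type software); the grid of precision values is arbitrary.

```python
import math


def gamma_density(x, shape, rate):
    """Gamma density in the shape/rate parameterisation (an assumption here)."""
    if x <= 0:
        return 0.0
    return (rate ** shape) * (x ** (shape - 1)) * math.exp(-rate * x) / math.gamma(shape)


# Density of the precision 1/tau_w at a few values under Ga(1, 0.001).
precisions = [0.1, 1.0, 10.0, 100.0]
densities = [gamma_density(p, shape=1.0, rate=0.001) for p in precisions]
flatness = min(densities) / max(densities)  # values near 1 indicate a nearly flat prior
```

Over precisions spanning three orders of magnitude the density changes by roughly 10%, which is the sense in which the prior behaves like a uniform on the positive values.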

The varying spatial parameters in models 2 and 3 are assumed to be beta distributed

$$\lambda_k \sim \mathrm{Be}(\nu_1, \nu_2)$$

where *ν*_{1} and *ν*_{2} are positive quantities equal to or exceeding 0.5. Thus *ν*_{1} = *ν*_{2} = 1 corresponds to a diffuse uniform prior *λ*_{k} ~ *U*(0, 1), while more informative priors are obtained for *ν*_{1} > 1 and *ν*_{2} > 1. A baseline is provided when *ν*_{1} = *ν*_{2} = 0.5, equivalent to a prior sample size of 1. It is assumed that *ν*_{1} ~ *U*(0.5, 5) and *ν*_{2} ~ *U*(0.5, 5). The average value of the *λ*_{k} over all contiguous states can be obtained as

$$\lambda_a = \nu_1 / (\nu_1 + \nu_2).$$
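The hierarchical prior on the spatial dependence parameters can be simulated as a quick check. This is a prior-only sketch with hypothetical function names; it draws from the stated priors and does not reproduce the posterior sampling used in the paper.

```python
import random


def beta_prior_summary(nu1, nu2):
    """Prior mean and prior sample size of lambda_k ~ Be(nu1, nu2)."""
    if nu1 < 0.5 or nu2 < 0.5:
        raise ValueError("the text restricts nu1 and nu2 to be at least 0.5")
    return {"mean": nu1 / (nu1 + nu2), "prior_sample_size": nu1 + nu2}


def draw_lambdas(n_states, rng):
    """Draw nu1, nu2 from U(0.5, 5), then one lambda_k per contiguous state."""
    nu1 = rng.uniform(0.5, 5.0)
    nu2 = rng.uniform(0.5, 5.0)
    lambdas = [rng.betavariate(nu1, nu2) for _ in range(n_states)]
    return nu1, nu2, lambdas


rng = random.Random(2009)  # fixed seed so the sketch is reproducible
nu1, nu2, lambdas = draw_lambdas(49, rng)
average_lambda = beta_prior_summary(nu1, nu2)["mean"]
```

Here `beta_prior_summary(0.5, 0.5)` returns a prior sample size of 1, matching the baseline case described in the text.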

## References

- 1.
Bazos D, Weeks W, Fisher E, DeBlois H, Hamilton E, Young M: The development of a survey instrument for community health improvement. Health Serv Res. 2001, 36: 773-792.

- 2.
Gale C, Martyn C: The conundrum of time trends in stroke. Journal of the Royal Society of Medicine. 1997, 90: 138-143.

- 3.
Lampe F, Morris R, Whincup P, Walker M, Ebrahim S, Shaper A: Is the prevalence of coronary heart disease falling in British men?. Heart. 2001, 86: 499-505. 10.1136/heart.86.5.499.

- 4.
Neyer J, Greenlund K, Denny C: Prevalence of heart disease-United States, 2005. Morb Mortal Wkly Rep. 2007, 56: 113-118.

- 5.
Barnett E, Halverson J: Disparities in premature coronary heart disease mortality by region and urbanicity among black and white adults ages 35–64, 1985–1995. Public Health Rep. 2000, 115: 52-64.

- 6.
Barnett E, Casper M, Halverson J: Men and Heart Disease: An Atlas of Racial and Ethnic Disparities in Mortality. CDC. 2001

- 7.
Eberhardt M, Pamuk E: The importance of place of residence: examining health in rural and nonrural areas. American Journal of Public Health. 2004, 94: 1682-1686. 10.2105/AJPH.94.10.1682.

- 8.
Pickle L, Gillum R: Geographic variation in cardiovascular disease mortality in US blacks and whites. Journal of the National Medical Association. 1999, 91: 545-556.

- 9.
Cubbin C, Hadden W, Winkleby M: Neighborhood context and cardiovascular disease risk factors: the contribution of material deprivation. Ethnicity and Disease. 2001, 11: 687-700.

- 10.
Halverson J, Barnett E, Casper M: Geographic disparities in heart disease and stroke mortality among black and white populations in the Appalachian region. Ethnicity and Disease. 2002, 12 (S3): 82-91.

- 11.
Casper M, Barnett E, Halverson J, Elmes G, Braham V, Majeed Z, Bloom A, Stanley S: Women and Heart Disease: An Atlas of Racial and Ethnic Disparities in Mortality. 2000, Morgantown, WV: Office for Social Environment and Health Research, West Virginia University, 2

- 12.
Congdon P: Estimating the Prevalence of Coronary Heart Disease in Local Areas: Integrating Information from Health Surveys and Area Mortality. Health & Place. 2008, 14: 59-75. 10.1016/j.healthplace.2007.04.003.

- 13.
Grubesic T, Matisziw T: On the use of ZIP codes and ZIP code tabulation areas (ZCTAs) for the spatial analysis of epidemiological data. Int J Health Geogr. 2006, 5: 58. 10.1186/1476-072X-5-58.

- 14.
Neyer J, Greenlund K, Denny C, Keenan N, Labarthe D, Croft J: Prevalence of heart disease United States, 2005. Morbidity and Mortality Weekly Report. 2007, 56: 113-118.

- 15.
American Heart Association: Heart Disease and Stroke Statistics. 2008, http://www.americanheart.org

- 16.
Yancey C, Benjamin E, Fabunmi R, Bonow R: Discovering the full spectrum of cardiovascular disease: Minority Health Summit 2003. Circulation. 2005, 111: e140-e149. 10.1161/01.CIR.0000157744.30181.FF.

- 17.
Mensah G, Mokdad A, Ford E, Greenlund K, Croft J: State of disparities in cardiovascular health in the United States. Circulation. 2005, 111: 1233-1241. 10.1161/01.CIR.0000158136.76824.04.

- 18.
Cabrera C, Wilhelmson K, Allebeck P, Wedel H, Steen B, Lissner L: Cohort differences in obesity-related health indicators among 70-year olds with special reference to gender and education. Eur J Epidemiol. 2003, 18: 883-890. 10.1023/A:1025687102375.

- 19.
Centre for Disease Control. 2009, http://www.cdc.gov/brfss/technical_infodata/weighting.htm

- 20.
Graubard B, Korn E, Midthune D: Testing goodness-of-fit for logistic regression with survey data. Proceedings of the Section on Survey Research Methods, American Statistical Association. 1997, 170-174.

- 21.
Robbins A, Chao S, Fonseca V: What's the Relative Risk? A Method to Directly Estimate Risk Ratios in Cohort Studies of Common Outcomes. Annals of Epidemiology. 2002, 12: 452-454. 10.1016/S1047-2797(01)00278-2.

- 22.
Cossman R, Cossman J, Cosby A, Reavis R: Reconsidering the Rural-Urban Continuum in Rural Health Research: A Test of Stable Relationships Using Mortality as a Health Measure. Population Research and Policy Review. 2008, 27: 459-476. 10.1007/s11113-008-9069-6.

- 23.
Richardson S, Monfort C: Ecological correlation studies. Spatial Epidemiology Methods and Applications. Edited by: Elliott P, Wakefield J, Best N, Briggs D. 2000, Oxford University Press

- 24.
Rasmussen S: Modelling of discrete spatial variation in epidemiology with SAS using GLIMMIX. Computer Methods and Programs in Biomedicine. 2004, 76: 83-89. 10.1016/j.cmpb.2004.03.003.

- 25.
Congdon P: A spatially adaptive conditional autoregressive prior for area health data. Statistical Methodology. 2008, 5: 552-563. 10.1016/j.stamet.2008.02.005.

- 26.
Catelan D, Biggeri A, Lagazio C: On the clustering term in ecological analysis: how do different prior specifications affect results?. Statistical Methods and Applications. 2008,

- 27.
Gelfand A, Smith A: Sampling based approaches to calculate marginal densities. J Amer Statist Assoc. 1990, 85: 398-409. 10.2307/2289776.

- 28.
Lunn D, Thomas A, Best N, Spiegelhalter D: WinBUGS a Bayesian modelling framework: concepts, structure, and extensibility. Statistics and Computing. 2000, 10: 325-337. 10.1023/A:1008929526011.

- 29.
Spiegelhalter D, Best N, Carlin B, van der Linde A: Bayesian measures of model complexity and fit. J Roy Stat Soc B. 2002, 64: 583-639. 10.1111/1467-9868.00353.

- 30.
Brooks S, Gelman A: Alternative methods for monitoring convergence of iterative simulations. Journal of Computational and Graphical Statistics. 1998, 7: 434-455. 10.2307/1390675.

- 31.
Holmes J, Arispe I, Moy E: Heart Disease and Prevention Race and Age Differences in Heart Disease Prevention, Treatment, and Mortality. Medical Care. 2005, 43: I-33-I-41. 10.1097/00005650-200503001-00006.

- 32.
Ingram D, Franco S: NCHS Urban-Rural Classification Scheme for Counties. 2006, Hyattsville, MD: National Center for Health Statistics

- 33.
Auchincloss A, Hadden W: The health effects of rural-urban residence and concentrated poverty. J Rural Health. 2002, 18: 319-336. 10.1111/j.1748-0361.2002.tb00894.x.

- 34.
Mellor J, Milyo J: Individual health status and racial minority concentration in US states and counties. Am J Public Health. 2004, 94: 1043-1048. 10.2105/AJPH.94.6.1043.

- 35.
Besag J, York J, Mollie A: Bayesian image restoration, with two applications in spatial statistics. Ann Inst Statist Math. 1991, 43: 1-59. 10.1007/BF00116466.

- 36.
Leroux B, Lei X, Breslow N: Estimation of disease rates in small areas: a new mixed model for spatial dependence. Statistical Models in Epidemiology, the Environment and Clinical Trials. Edited by: Halloran M, Berry D. 1999, Springer-Verlag: New York, 135-178.

- 37.
Gschlößl S, Czado C: Modelling count data with overdispersion and spatial effects. Statistical Papers. 2008, 49: 531-552. 10.1007/s00362-006-0031-6.

## Additional information

### Competing interests

The author declares that they have no competing interests.

## Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

## About this article

### Keywords

- Behavioral Risk Factor Surveillance System
- Spatial Outlier
- Geographic Effect
- County Poverty
- Mainland State