Study setting and population
Data came from the Heart and Vascular Health (HVH) study, an ongoing population-based case-control study in the Puget Sound region of Washington State. Subjects lived in King, Kitsap, Pierce, Snohomish, and Thurston counties; King County, the most populous of these, contains the City of Seattle. Although much of the land area included is rural, 97 percent of our study population lived in non-rural areas (defined as a residential density ≥ 96.5 units/km² [250 units/mi²]).
The HVH study was designed to investigate pharmacological and genetic influences on cardiovascular disease, but we used data on 1,608 control participants to examine the effects of the built environment on walking for exercise. The controls from this study were a stratified random sample of 30- to 79-year-old members of Group Health, a large health maintenance organization serving approximately 500,000 Washington State residents. Participants gave informed consent, and the human subjects review committees at Group Health and the University of Washington approved all study procedures.
Only controls were included in this analysis, to limit possible recall bias or confounding by preclinical cardiovascular disease. Participants were also excluded if they had a documented history of myocardial infarction, stroke, congestive heart failure or angina, or if they reported fair or poor health prior to their reference date. These exclusions were designed to identify a healthy population in which physical activity might be important for primary prevention of disease, while excluding those with major health limitations that could influence both place of residence and physical activity patterns.
We randomly assigned each participant a reference date within the year of selection as a control (1995 to 2001). Information preceding the reference date was collected from medical records and telephone interviews; the reference date was used by the original study to ensure comparable data quality for myocardial infarction cases and frequency matched controls. Telephone interviews took place from 1995 to 2004, an average of about two years (standard deviation: 0.7 years) after the assigned reference date; 76 percent of eligible, contacted controls agreed to participate in a telephone interview. Compared with participants who allowed us only to examine their medical record, participants completing the telephone interview were more likely to have treated hypertension, treated diabetes, or a body mass index above 30 and less likely to be residents of King County (chi-squared test p < 0.05).
Physical activity and participant characteristics
The telephone interview included questions on physical activity derived from the Minnesota Leisure-Time Physical Activity questionnaire. The original questionnaire has high test-retest reliability for physical activity over the past year (with a one-month interval between tests), but it was modified for our study. Participants in the HVH study were asked to report the frequency and duration of their participation in 26 types of physical activity, including "walking for exercise", for a one-month period before their reference date. Frequency and average duration were used to estimate the minutes per week spent walking for exercise. Previous studies have found that data from this questionnaire on physical activity or walking for exercise are associated with incident myocardial infarction in this study population [46, 50], which suggests the modified questionnaire has predictive validity and relevance to cardiovascular health.
The telephone interview also included questions on the participant's race, general health status (classified as excellent, very good, good, fair, or poor), smoking status, employment status, education, and income. Data from Group Health medical and pharmaceutical records were used to assess whether each participant had treated hypertension or treated diabetes. Measured height and weight were taken from the medical record and used to calculate body mass index (weight in kilograms divided by the square of height in meters). Obesity was defined as a body mass index above 30.
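The body mass index calculation and obesity cutoff described above can be illustrated with a minimal sketch. The function names are hypothetical and are not taken from the study's own analysis code.

```python
def body_mass_index(weight_kg: float, height_m: float) -> float:
    """Return weight in kilograms divided by the square of height in meters."""
    if weight_kg <= 0 or height_m <= 0:
        raise ValueError("weight and height must be positive")
    return weight_kg / height_m ** 2


def is_obese(weight_kg: float, height_m: float, threshold: float = 30.0) -> bool:
    """Apply the definition in the text: a body mass index above 30."""
    return body_mass_index(weight_kg, height_m) > threshold
```

For example, a participant weighing 81 kg with a height of 1.80 m has a body mass index of 25 and would not be classified as obese.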
Addresses and geocoding
Residential addresses were obtained from Group Health's archived end-of-year membership files for the December before each participant's reference date. An automated process in Maptitude software, version 4.7 (Caliper Corporation, Newton/MA, 2004), successfully geocoded 97 percent of addresses, and an additional two percent were geocoded following manual cleaning of the address data. Participants were excluded if they had no address or only a Post Office box available (n = 79); an address that could not be geocoded (n = 4); or an address located outside of the five-county study area (n = 72).
One-kilometer airline buffers (circles with a one-kilometer radius surrounding each address) were created using ArcView 3.2 (ESRI, Redlands/CA, 1999). Airline buffers based on Euclidean distance were used instead of network buffers based on empirical evidence from the same geographic region and the high permeability of urban environments to pedestrians. One-kilometer buffers were selected because of the relatively small territory typically covered on foot [8, 29] and the lack of correlation between perceived and objective measures of the built environment beyond one kilometer [20, 42].
Addresses were also allocated to census block groups, census tracts, and ZIP codes using a point-in-polygon joining process. Census block groups in the US contain approximately 1,000 residents, census tracts 4,000 residents, and ZIP codes 30,000 residents.
Built environment data
For each of the five study counties, digital maps of street networks, parks, and tax parcels (defined as buildings or units of land that are taxed or exempt from taxation) were obtained through the Washington State Geospatial Data Archive, county agencies, or cities (sidewalks, for King County only). The built environment data sources used were produced between 1998 (the midpoint of the study period) and 2005; although data from 1998 were sought in all cases, more recent data were used for several built environment characteristics because older data had not been archived, were of poor quality, or did not exist for a given county.
Residential density was calculated as housing units per square kilometer, with a housing unit defined as a house, apartment, mobile home, or other dwelling intended for occupancy as separate living quarters. Residential density of each one-kilometer buffer was estimated using an area-weighted average of densities from census block groups intersecting or contained in the buffer. For example, a subject might have 30 percent of their one-kilometer buffer in census block group A, and 70 percent in census block group B. The estimated density for the one-kilometer buffer would then be 0.3 * (density of A) + 0.7 * (density of B). As a measure of connectivity, block size was calculated using local street maps. For sidewalk availability, the total length of sidewalk-lined streets within each one-kilometer buffer was calculated. Sidewalk data were only available for King County.
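The area-weighted average used for residential density can be sketched as follows. This is an illustration only: the function name and the input layout (one pair of overlap area and block group density per intersecting block group) are assumptions, and the overlap areas would in practice come from a GIS intersection of the buffer with block group polygons.

```python
def area_weighted_density(pieces):
    """Estimate buffer density from (overlap_area, block_group_density) pairs.

    overlap_area is the part of the buffer lying in a block group, in any
    consistent unit; density is housing units per square kilometer.
    """
    total_area = sum(area for area, _ in pieces)
    if total_area <= 0:
        raise ValueError("total overlap area must be positive")
    # Each block group contributes in proportion to its share of the buffer.
    return sum(area * density for area, density in pieces) / total_area


# Worked example from the text: 30 percent of the buffer in block group A
# and 70 percent in block group B (the two densities are made-up values).
example = area_weighted_density([(0.3, 1000.0), (0.7, 2000.0)])
```

With the illustrative densities of 1,000 and 2,000 units/km², the estimate is 0.3 * 1000 + 0.7 * 2000 = 1,700 units/km².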
We estimated proximity to several potential walking destinations (grocery stores, schools, restaurants and bars, banks, grocery-restaurant-retail complexes, office complexes, school-church combinations, fitness facilities, and parks), calculating the distance to the closest destination of each type and the number of destinations of each type within one kilometer. For the destination combinations (grocery-restaurant-retail complexes, office complexes, and school-church combinations), the area of the nearest one was also calculated. Park access was measured as the proportion of the one-kilometer buffer covered by parks. With the exception of parks, which were identified using digital maps of parks in each county, destinations were identified using tax parcel land use codes. The categorization of the land use codes differed by county, but consistent rules were applied to categorize land uses across counties.
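The two proximity measures for a single destination type can be sketched as below. The sketch assumes straight-line (Euclidean) distance, in keeping with the airline buffers described earlier, and assumes coordinates in a projected system measured in meters; the function name and inputs are hypothetical.

```python
import math


def nearest_and_count(home, destinations, radius_m=1000.0):
    """Return (distance to closest destination, count within radius_m).

    home and each destination are (x, y) pairs in meters. The distance is
    None when there are no destinations of this type.
    """
    distances = [math.hypot(x - home[0], y - home[1]) for x, y in destinations]
    nearest = min(distances) if distances else None
    count = sum(1 for d in distances if d <= radius_m)
    return nearest, count


# Three hypothetical grocery stores around a home at the origin.
nearest, count = nearest_and_count((0.0, 0.0), [(300.0, 400.0), (0.0, 1000.0), (2000.0, 0.0)])
```

Here the closest store is 500 m away and two stores fall within the one-kilometer radius.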
Built environment characteristics were tested as predictors of walking for exercise. All participants were included in analyses of logistic models predicting some walking versus no walking, and those who walked were included in linear models to predict amount of walking (average minutes per week). Time spent walking for exercise was log-transformed to moderate the effects of skewness and heteroscedasticity.
We tested single built environment characteristics and models using multiple built environment characteristics to predict walking. Some built environment characteristics may be associated with walking in our sample by chance alone, raising concerns about multiple comparisons. If we fit a model to our data, and then tested the model using the same data, our estimates of model fit would be artificially high because any chance associations unique to our data would be incorporated into our model. This would overestimate our ability to predict walking in a different sample of individuals from the same population. A holdout approach was used to avoid this bias [44, 45]. Models developed in a training set were tested in a validation set, with estimates of model fit based on the validation set considered to be more accurate.
The training set (a stratified random sample of 2/3 of participants) and validation set (the remaining 1/3 of participants) were similar with regard to demographic, socioeconomic, health, and built environment characteristics. The random sampling was stratified by King County residence, because we decided a priori to separately create and evaluate models for the subset that lived in King County, in addition to pooled models for the entire region. More than half of area residents and a majority of our study participants (58%) lived in King County.
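A stratified 2/3 versus 1/3 split of the kind described above might be implemented along the following lines. This is a sketch under stated assumptions, not the study's procedure: the function name, the fixed seed, and the rounding of stratum sizes are all choices made for illustration.

```python
import random


def stratified_split(participants, stratum_of, train_fraction=2 / 3, seed=0):
    """Randomly split participants into training and validation sets,
    sampling separately within each stratum (e.g. King County residence)."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    strata = {}
    for person in participants:
        strata.setdefault(stratum_of(person), []).append(person)
    training, validation = [], []
    for key in sorted(strata):
        members = list(strata[key])
        rng.shuffle(members)
        n_train = round(len(members) * train_fraction)
        training.extend(members[:n_train])
        validation.extend(members[n_train:])
    return training, validation


# Hypothetical participants: (id, lives_in_king_county).
people = [(i, i < 60) for i in range(90)]
train, valid = stratified_split(people, stratum_of=lambda p: p[1])
```

Because sampling is done within strata, the proportion of King County residents is the same in both sets (here 40 of 60 in training and 20 of 30 in validation).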
Built environment characteristics were modeled within categories or log-transformed in order to reduce the influence of outliers. Proximity to destinations of each type was categorized as within 500 m, 500 m to 1000 m, or more than 1000 m away. Density, connectivity, sidewalk availability, and park access were log-transformed. Regression models were used to calculate the predicted probability of walking for exercise or predicted minutes per week of walking for exercise. These predicted variables were proportional to the linear predictors from the corresponding models: a constant (alpha) added to the sum, over built environment characteristics, of the product of each characteristic (x_i) and the corresponding slope parameter (beta coefficient, β_i): predicted minutes/week of walking = α + Σ_i x_i β_i. Slope parameters were estimated from training set data.
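The distance categories and the linear predictor can be sketched as follows. The handling of distances of exactly 500 m or 1000 m is an assumption, since the text does not say which category they fall in, and the function names are hypothetical.

```python
def proximity_category(distance_m: float) -> str:
    """Assign a distance in meters to one of the three categories in the text.

    Assumption: exactly 500 m and exactly 1000 m go in the middle category.
    """
    if distance_m < 500:
        return "within 500 m"
    if distance_m <= 1000:
        return "500 m to 1000 m"
    return "more than 1000 m"


def linear_predictor(alpha, x, beta):
    """Return alpha plus the sum of x_i * beta_i over all characteristics."""
    if len(x) != len(beta):
        raise ValueError("x and beta must have the same length")
    return alpha + sum(xi * bi for xi, bi in zip(x, beta))
```

In a holdout analysis, alpha and beta would be estimated in the training set and then applied unchanged to the characteristics of validation set participants.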
In addition, models were created using the Walkable and Bikeable Community (WBC) study model components: residential density; household and average block size; sidewalk availability; number of schools, restaurants or bars, grocery stores, and grocery-restaurant-retail complexes; distance to the closest restaurant or bar; distance to the closest grocery store; and area of the closest office complex [27, 42]. We evaluated regression models with slope parameters for these 11 characteristics based on our study's training set or on the WBC study data [27, 42] (reanalyzed with exclusions, adjustments, and regression techniques parallel to those used for the present study).
For logistic regression models, model fit was evaluated using Hosmer-Lemeshow tests and C-statistics (equal to the area under the receiver operating characteristic curve). Under the null hypothesis, the logistic model predicts walking no better than expected by chance, and one would expect a C-statistic of 0.5; a model with perfect prediction would lead to a C-statistic of 1.0. Predictive utility of linear models was assessed through the percent of variation explained: R-squared × 100 percent.
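The C-statistic can be read as the proportion of walker and non-walker pairs in which the walker received the higher predicted probability, with ties counted as one half. The sketch below computes it directly from that definition; it is an illustration with a hypothetical function name, not the routine used in the original analysis.

```python
def c_statistic(outcomes, scores):
    """Concordance (area under the ROC curve) for binary outcomes.

    outcomes: 1 for some walking, 0 for no walking.
    scores: predicted probabilities (or linear predictors) from the model.
    """
    walkers = [s for y, s in zip(outcomes, scores) if y == 1]
    non_walkers = [s for y, s in zip(outcomes, scores) if y == 0]
    if not walkers or not non_walkers:
        raise ValueError("both outcome groups must be present")
    concordant = 0.0
    for w in walkers:
        for n in non_walkers:
            if w > n:
                concordant += 1.0
            elif w == n:
                concordant += 0.5  # tied pairs count as half
    return concordant / (len(walkers) * len(non_walkers))
```

A model that assigns every participant the same score yields 0.5, and a model that ranks every walker above every non-walker yields 1.0, matching the interpretation given above.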
Unadjusted models were compared with models adjusted for age, sex, self-reported health status, income, and education. For adjusted models, missing values for income (10 percent) and education (less than one percent) were estimated through multiple imputation. Because unadjusted and adjusted models were similar, we have presented the unadjusted models in our tables. All regression models were run using robust variance estimates in Stata 8.2 (StataCorp, College Station/TX, 2003), and variance estimates accounted for clustering within county of residence.
Intra-class correlation coefficients (ICCs) were used to evaluate how characteristics varied between versus within ZIP codes, census tracts, and census block groups. These ICCs can be interpreted as the maximum proportion of variation explained at the given group level. If a characteristic was constant within each group, the only variation would be between groups and the ICC would be 1.0. In contrast, if the characteristic was randomly distributed with respect to group, the ICC would be close to zero. These estimates were based on one-way analysis of variance (ANOVA) models. Continuous variables were log-transformed to more closely meet the normality assumption of the ANOVA model. The ANOVA ICC estimator was also used for dichotomous variables, for which the ICC estimation remains asymptotically valid and unbiased.
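As an illustration, the commonly used one-way ANOVA estimator of the ICC for groups of unequal size is sketched below. The text does not give the exact formula applied in the study, so this form, the function name, and the input layout (one list of values per group) are assumptions.

```python
def anova_icc(groups):
    """One-way ANOVA estimator of the intra-class correlation coefficient.

    groups: list of lists, one list of values per group (e.g. per census tract).
    """
    groups = [g for g in groups if g]  # ignore empty groups
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    if k < 2 or n_total <= k:
        raise ValueError("need at least two groups and within-group replication")
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    # Adjusted average group size, which allows for unequal group sizes.
    n0 = (n_total - sum(len(g) ** 2 for g in groups) / n_total) / (k - 1)
    denominator = ms_between + (n0 - 1) * ms_within
    if denominator == 0:
        raise ValueError("no variation in the data")
    return (ms_between - ms_within) / denominator
```

When values are constant within each group the estimate is 1.0. Because it is a difference of mean squares, this estimator can fall below zero in a given sample when groups are more alike than expected by chance.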