U.S. census unit population exposures to ambient air pollutants

Background Progress has been made recently in estimating ambient PM2.5 (particulate matter with aerodynamic diameter < 2.5 μm) and ozone concentrations using various data sources and advanced modeling techniques, which resulted in gridded surfaces. However, epidemiologic and health impact studies often require population exposures to ambient air pollutants to be presented at an appropriate census geographic unit (CGU), where health data are usually available to maintain confidentiality of individual health data. We aim to generate estimates of population exposures to ambient PM2.5 and ozone for U.S. CGUs. Methods We converted 2001-2006 gridded data, generated by the U.S. Environmental Protection Agency (EPA) for CDC's (Centers for Disease Control and Prevention) Environmental Public Health Tracking Network (EPHTN), to census block group (BG) based on spatial proximities between BG and its four nearest grids. We used a bottom-up (fine to coarse) strategy to generate population exposure estimates for larger CGUs by aggregating BG estimates weighted by population distribution. Results The BG daily estimates were comparable to monitoring data. On average, the estimates deviated by 2 μg/m3 (for PM2.5) and 3 ppb (for ozone) from their corresponding observed values. Population exposures to ambient PM2.5 and ozone varied greatly across the U.S. In 2006, estimates for daily potential population exposure to ambient PM2.5 in west coast states, the northwest and a few areas in the east and estimates for daily potential population exposure to ambient ozone in most of California and a few areas in the east/southeast exceeded the National Ambient Air Quality Standards (NAAQS) for at least 7 days. Conclusions These estimates may be useful in assessing health impacts through linkage studies and in communicating with the public and policy makers for potential intervention.


Background
Air pollution monitoring data has customarily been compiled and maintained by the EPA and/or state and local agencies. These data have been used in several studies that found ambient air pollutants associated with mortality [1][2][3][4] and morbidity [5][6][7][8][9]. However, air monitoring sites are typically sparsely located in very limited geographic areas -only 20% of U.S. counties have at least one monitoring station for PM 2.5 -and the temporal resolution and type of pollutants measured vary by station (e.g., PM 2.5 data is only available about every 3-6 days). Thus, studies based on monitoring data were usually limited to high population density areas such as cities or urban/suburban centers, where most monitoring stations are located.
To expand geographic coverage and increase temporal resolution of air pollution data, several studies have recently estimated ambient air pollution concentrations using various data sources and advanced modeling techniques [10][11][12][13]. Thus, areas with very sparse or no monitoring data now have gridded data with a variety of spatial (e.g., 4 km, 36 km) and temporal (e.g., hourly, daily) resolutions. However, these data have not been widely accepted by health researchers partly because studies of possible effects of ambient air pollutants on human health often require population exposures to ambient air pollutants to be presented at certain census geographic levels (e.g., census tract, county), where health data are usually available to maintain confidentiality of individual health data [14,15]. Other socioeconomic and demographic data are also routinely collected at such geographic resolutions [16].
Ideally, concentration should be presented at the finest CGU possible, at which air pollution concentration may approximate the potential population exposure to a certain kind of ambient air pollutant, whereas actual population exposure may be close to zero in certain places where few people live (e.g., mountains), no matter how high the concentration of pollutants. From a public health perspective, it is the exposure that makes people sick. The goal of this study is, therefore, to estimate CGU population exposures to ambient PM 2.5 and ozone. Two major steps are taken to achieve this goal: 1) estimate BG daily ambient PM 2.5 and ozone concentrations from the gridded data and conduct data comparisons against ground-based monitoring values; and 2) aggregate BG concentrations to generate population exposure estimates for larger CGUs using BG population as a weighting factor. We choose BG (instead of census block) as the basic unit because BG is the lowest CGU where population data are available on an annual basis. BGs generally contain between 600 and 3,000 people, with an optimum population size of 1,500 [17].

Materials and methods
Data source: gridded PM 2.5 and ozone concentrations Gridded PM 2.5 (μg/m 3 ) and ozone (ppb) concentrations were obtained using a hierarchical Bayesian model developed by the EPA for CDC's EPHTN [12], which provide 24-hour maximum PM 2.5 and 8-hour maximum ozone concentrations on a daily basis (2001)(2002)(2003)(2004)(2005)(2006). The model uses source-based Community Multiscale Air Quality (CMAQ) model outputs and monitoring data. It accounts for spatial and temporal dependencies of air pollutants through a hierarchical Bayesian approach. The spatial resolution of data was inherited from CMAQ modeling outputs. CMAQ considered information about emission inventories, meteorological information, and land use. The detailed information about CMAQ and monitoring data can be obtained from http://epa.gov/asmdnerl/CMAQ [18] and http://airnow. gov [19], respectively. The model resulted in two sets of gridded data: 36 km grid-cells for the contiguous U.S. and 12 km grid-cells for an eastern portion of the U.S., which includes the Northeast census region and the South Atlantic and East South Central divisions of the South census region (excluding part of south Florida) and part of Arkansas and Louisiana; portions of the Midwest census region, which includes the entire East North Central division and part of Minnesota, Iowa and Missouri (http://www.census.gov/geo/www/us_regdiv. pdf) [20]. The gridded data fill "holes" in both time (when data are missing on certain days) and space (locations where data are not available). Information on CDC's ongoing EPHTN has been described elsewhere [21,22] and is also available from http://www.cdc.gov/ ephtracking [23].

and ozone concentrations
We used a distance-weighting method to estimate BG daily PM 2.5 and ozone concentrations for all U.S. BGs based on 36 km-gridded data (12 km-gridded data for an eastern portion of the U.S.). Empirical studies, which compared different methods of areal interpolation, suggested that distance-weighting was an appropriate method in calculating population exposure estimates [10,24]; distanceweighting relaxes the homogeneity assumption associated with area-weighting method and overcomes bias introduced by the equal contribution assumption associated with internal or nearest neighboring method.
To make the calculations, the following steps were taken. First, the distance between the centroid of each BG and the corresponding four nearest grids (centroids) were calculated using the newly developed GEODIST function available in SAS software, version 9.2 [25]. The GEODIST function uses the Vincenty distance formula to compute the geodetic distance between any two arbitrary latitude and longitude coordinates in terms of degrees or in radians [26]. The Vincenty-based computation used by the GEODIST function is more accurate than the most commonly used method of the Haversine distance formula [27]. In this study, we used degrees in the GEODIST function. Each BG is associated with the nearest four neighboring grids for both 36 km and 12 km data. Figure 1 demonstrates some possible spatial Figure 1 The demonstration of spatial relationships between BGs and grids. The four nearest neighboring grids (g) for BGs 1-6: BG1: g5, g6, g9, and g10; BG2: g5, g6, g9, and g10; BG3: g2, g3, g6, and g7; BG4: g3, g6, g7, and g11; BG5: g6, g7, g10, and g11; BG6: g7, g8, g11, and g12. relationships between BGs and grids. In urban areas, many BGs are located within a 36 km or 12 km grid.
Second, BG daily PM 2.5 and ozone concentrations were calculated using the inverse of the squared distance as a weighting factor. The inverse of the squared distance is the most commonly used format of distance-weighting, which gives higher weight to closer observations [10,24]. The weight w i for one of the four neighboring grids nearest to a BG centroid was calculated as where d i is the distance between a grid centroid and a BG centroid. Two separate estimates were derived for the U.S. (from 36km gridded data) and an eastern portion of the U.S. (from 12km gridded data). Daily PM 2.5 or ozone concentration μ i for a BG was estimated as where P i is the corresponding neighboring grid's concentration measure for PM 2.5 or ozone. The variance δ i 2 associated with a BG was estimated as where δ i is the standard error associated with BG estimate μ i ; and σ i is the standard error associated with the original grid's concentration measure P i .
Third, the derived 2001-2006 BG daily estimates for PM 2.5 and ozone were compared with monitoring data observed at ground stations within each BG boundary. The comparison was restricted to the area equivalent to 12 km grid-cell coverage (i.e, an eastern portion of the U.S.) for simplicity of having estimates from both 36 km-and 12 km-gridded data. The number of monitoring sites in each BG ranges from 0 to 2. The comparison was conducted for those BGs containing 1 or 2 monitoring sites. The majority of BGs contained only one monitoring site (e.g., 1055 BGs contains one versus 26 BGs contains two PM 2.5 monitoring sites). We calculated two statistics for data comparison: mean absolute deviation (MAD), an intuitive measure of absolute fit; and correlation coefficient (R), a measure of relative fit. MAD measures the average absolute deviations of the estimates from their corresponding observed data [28]. In time series analysis, MAD measures the average absolute deviation of observations from their forecasts.

Estimating daily population exposures to ambient PM 2.5 and ozone for larger CGUs
Unlike concentration, exposure comes with population; no population, no exposure. Population exposure could substantially differ from concentration itself depending on population distribution within a CGU. Therefore, when estimating population exposure for a CGU, population distribution needs to be factored in. Using BG as a base, we aggregated BG concentration to generate larger CGU population exposure measure by taking into account the population location of BG. This method has been used in generating population exposure to ambient air pollutant for larger graphic areas based on estimates at a fine spatial resolution [8,29,30]. The daily population exposure to ambient air pollutant (PE) at a larger CGU, such as census tract, was estimated as where n is the number of BGs within a CGU and pop i is the BG population; similarly the standard error (ψ) associated with the population exposure to ambient air pollutant (PE) at a CGU was estimated as where δ i is the standard error associated with BG estimate, μ i . In this study, we generated daily population exposure estimates (PEs and their standard errors ψs) (2001-2006) for census tract, county, state and the U.S. accordingly by aggregating BG daily estimates weighted by BG population distribution across corresponding larger CGUs.
We mapped the 98 th percentiles of 2006 daily population exposures to ambient PM 2.5 and ozone for census tract and county to demonstrate geographic variation in population exposures to ambient PM 2.5 and ozone and highlight where severe population exposures to these two ambient pollutants could potentially occurs. The 98 th percentile of 2006 daily population exposure estimate shows the seventh-highest daily population exposure (i.e., 2% of 365 days equals to~7 days) that the population in a CGU has experienced in that year. We grouped population exposure to ambient air pollutants into five categories. The cut point for the second highest category is adjusted to match the NAAQS (daily 24hour standard of 35 μg/m 3 for PM 2.5 and daily 8-hour standard of 75 ppb for ozone) [31] and the highest one and the lowest three categories are set at equal lengths of 10 units above or below the standards (e.g., cut points for the highest ones are set as 45 for PM 2.5 and 85 for ozone). ArcGIS software was used in mapping [32]. Similarly, we mapped the 90 th percentiles of 2006 daily population exposures to ambient PM 2.5 and ozone for census tract and county with five manually grouped categories (the highest category is set to match the NAAQS) to allow a spatial pattern to emerge. The 90 th percentile of 2006 daily population exposure estimate corresponds to the thirty fifth-highest daily population exposure (i.e., 10% of 365 days equals to~35 days) that the population in a CGU has experienced in that year.
We further calculated the number of days when PM 2.5 or ozone concentration exceeded the NAAQS for each BG. The populations at risk for larger CGUs were calculated by aggregating BG populations with exposure to ambient PM 2.5 or ozone exceeding the NAAQS.

Results
Data comparison: BG estimates against ground-based monitoring data The mean of BG 24-hour maximum PM 2.5 and 8-hour maximum ozone estimate for an entire year across the U.S. is 13 μg/m 3 and 44 ppb, respectively. We presented two comparison statistics between BG daily estimates and ground-based monitoring data in an eastern portion of the U.S., by year, in Table 1 along with number of monitoring sites (M) and number of observations (N). On average, the estimates deviated by 2 μg/m 3 (for PM 2.5 ) and 3 ppb (for ozone) from their corresponding observed values. The MADs of BG estimates based on the 36 km-gridded data were similar to those resulted from 12 km-gridded data. Although the former was slightly smaller than the later, the difference between the two was very small. Similar patterns were observed for correlation coefficient R (Table 1). In addition to MAD, we compared other distributional statistics of absolute deviation between predicted and observed (i.e., minimum, median, maximum, and the 5 th , 10 th , 90 th , and 95 th percentiles); and we observed little discrepancy by season, by year, or by urban vs. rural status between the two (Additional file 1).
Estimating CGU population exposures to ambient PM 2.5 and ozone Figure 2 shows the 98 th percentiles of 2006 daily potential population exposure to ambient PM 2.5 and ozone for census tracts and counties. The patterns showed by census tract (upper two panels) were similar to those captured at county level (lower two panels) for most of the U.S., especially the eastern U.S., for both PM 2.5 and ozone. However, there were visible differences between patterns revealed at census tract level and at county level for west coast areas (e.g., California) for both PM 2.5 and ozone ( Figure 2). Daily potential population exposure to ambient PM 2.5 was the lowest in the southwest (< 15 μg/m 3 ) except California; such exposure increased when moving east; whereas the highest daily potential population exposure to ambient PM 2.5 occurred in west coast and northwest areas, which exceeded the NAAQS of 35 μg/m 3 for 7 days. There were some obvious locations in the east where the daily potential population exposure to ambient PM 2.5 was above the standard for 7 days (Figure 2, left two panels). Daily potential population exposure to ambient ozone was the lowest in the northwest and increased when moving south and southeast (or southwest). Daily potential population exposure to ambient ozone in most of California and a few areas in the east/southeast U.S. exceeded the standard of 75 ppb for 7 days (Figure 2, right two panels). The patterns showed by the 90 th percentiles were similar to those observed in the 98 th percentiles for both PM 2.5 ( Figure 3, left two panels) and ozone ( Figure 3, right two panels). to both ambient PM 2.5 (0.77 million) and ozone (6 million). In addition to California, Oregon, Idaho, and Washington also experienced 4 weeks exposure to ambient PM 2.5 greater than the standard, whereas Texas experienced 4 weeks exposure to ambient ozone exceeding the standard (Table 2).

Discussion
The study has three important results. First, daily BG estimates for ambient PM 2.5 and ozone concentration were comparable to data observed at monitoring sites, which suggested that inverse-distance weighing was an appropriate method to generate estimates for BGs from gridded data. The second important result was that we generated daily potential population exposure estimates, for both PM 2.5 and ozone, for various CGUs from BG to state and the U.S. Such population exposure estimates for small areas such as census tracts and counties are very valuable for conducting health impact studies. Moreover, this result highlights the need for investigation and intervention in places with higher estimated daily potential population exposures (not concentrations) and/or longer duration. The geographical patterns of PM 2.5 and ozone found (especially at census tract level) were generally consistent with the ranking of most polluted cities (by year round particle pollution and ozone, respectively), provided by the American Lung Association -available at http:// www.stateoftheair.org/ [33]. The highest daily potential population exposure to ambient PM 2.5 in the west coast and northwest U.S. may be largely contributed by organic carbon due to high biomass burning such as wildfires, waste burning, and woodstoves [34,35], though nitrate, sulfate, or crustal material may also represent substantial components of PM 2.5 for the western U.S. [36]. The higher daily potential population exposure to PM 2.5 in other areas and ozone in general may mainly occur in those megacities or large metropolitan areas where ozone precursors such as volatile organic compounds and oxides of nitrogen produced by heavy traffic (also contribute to organic carbon and nitrite for PM 2.5 ) and electric utilities and industrial boilers (also contribute to sulfate and nitrite for PM 2.5 ) are concentrated [36,37].
The third important result was that we generated population at risk for each CGU from BG to state and the U.S. based on the NAAQS for PM 2.5 and ozone. This result provides a hierarchical structure that links hazardous pollution to population affected at different geographic levels. For example, population at risk presented at the state level could be easily traced back to specific CGUs, where information on potential population exposures to ambient air pollutants and population size is needed at smaller CGUs. Such detailed information on potential population exposure level and size of population affected could be used to facilitate communications among public health professionals and/or policy makers across different levels of jurisdiction and help them prioritize resources based on size of population affected and duration of exposures to ambient air pollutants.
There are several limitations. First, we assumed independence among the nearest four grids. This could potentially underestimate the standard errors associated with BG estimates. Second, we used the BG centroid to represent the entire BG area, which on average  contains about 39 census blocks [38]. However, in reality, ambient PM 2.5 or ozone may vary within a BG.
Although we thought to convert gridded concentration data to census blocks (the smallest CGU in the U.S.), we were limited to BGs because population data were not available on an annual basis at block level to allow us to generate potential population exposure estimates for larger CGUs. Third, like other studies, we could not account for net population gain or loss for a BG on a daily basis due to population movement across BGs. An additional limitation was associated with the uncertainty of 36 km-versus 12 km-gridded data. For example, BG estimates from 36 km-grids were slightly more approximate to ground monitoring data than those estimated from 12 km-grids. This may be explained by different sets of input variables included in 36 km-versus 12 km-CMAQ modeling system [18]. We compared 12 km-and 36 km-gridded data against values observed at the nearest monitoring site within specific grids (in an eastern portion of the U.S.) and the comparison statistics (e.g., MAD and R) showed the same pattern as in Table 1 (data not shown): 36 kmgridded data were more approximate to the observed values than 12 km-gridded data. Thus, interpretations of results found must be considered in the context of the limitations of this study.

Conclusions
We presented a method to allocate gridded data to BGs based on spatial proximities between BGs and their four nearest grids. We used a bottom-up (fine to coarse) strategy to generate CGU population exposures to ambient air pollutants based on BG estimates. Given that BG concentration derived from inverse-distance weighting was comparable to the ground-based monitoring data, using BG as a building block not only provided comparable population exposure estimates across CGUs, but also guaranteed that patterns shown at different geographic levels were consistent, with finer geographic resolution showing more detailed location for potential population exposures to ambient air pollutants. These estimates may be useful in communicating to the general public about the amount and duration of potential population exposures to ambient air pollutants and size of population affected for various geographic levels.

Additional material
Additional file 1: Distribution of absolute deviation between BG daily estimates and ground-based monitoring data in an eastern portion of the U.S. (A: PM 2.5 estimated from 36 km-grid; B: PM 2.5 estimated from 12 km-grid; C: Ozone estimated from 36 km-grid; D: Ozone estimated from 12 km-grid). The file contains distributional statistics of absolute deviation between predicted and observed by season, by year, and by urban/rural status. In addition to mean (MAD), it contains minimum, median, maximum, and the 5 th , 10 th , 90 th , and 95 th percentiles of absolute deviation between the two.
Abbreviations BG: census block group; CDC: Centers for Disease Control and Prevention; CGU: census geographic unit; CMAQ: Community Multiscale Air Quality; EPA: U.S. Environmental Protection Agency; EPHTN: Environmental Public Health Tracking Network; MAD: mean absolute deviation; NAAQS: National Ambient Air Quality Standards; PM 2.5 : particulate matter with aerodynamic diameter < 2.5 μm; R: correlation coefficient.
Authors' contributions YH was responsible for study design, data analyses, result interpretation and manuscript writing. HF was responsible for study design and result interpretation. MM was responsible for result interpretation and manuscript writing. JQ was responsible for study design, data analyses, and result interpretation. All authors read and approved the final manuscript.