Use of community-level data in the National Children’s Study to establish the representativeness of segment selection in the Queens Vanguard Site

Background The WHO Multiple Exposures Multiple Effects (MEME) framework identifies community contextual variables as central to the study of childhood health. Here we identify multiple domains of neighborhood context, and key variables describing the dimensions of these domains, for use in the National Children’s Study (NCS) site in Queens. We test whether the neighborhoods selected for NCS recruitment are representative of the whole of Queens County, and whether there is sufficient variability across neighborhoods for meaningful studies of contextual variables. Methods Nine domains (demographic, socioeconomic, households, birth-related, transit, playground/greenspace, safety and social disorder, land use, and pollution sources) and 53 indicator measures of the domains were identified. Geographic information systems were used to create community-level indicators for US Census tracts containing the 18 study neighborhoods in Queens selected for recruitment, using US Census, New York City Vital Statistics, and other sources of community-level information. Mean and inter-quartile range values for each indicator were compared for tracts in recruitment and non-recruitment neighborhoods in Queens. Results Across the nine domains, except in a very few instances, the NCS segment-containing tracts (N = 43) were not statistically different from the 597 populated tracts in Queens not containing portions of NCS segments; variability in most indicators was comparable in tracts containing and not containing segments. Conclusions In a diverse urban setting, the NCS segment selection process succeeded in identifying recruitment areas that are, as a whole, representative of Queens County, for a broad range of community-level variables.


Background
The National Children's Study (NCS) is a prospective cohort study designed to identify preventable causes of childhood disease in the United States, with the full cohort to include 100,000 children enrolled from 105 counties (or groups of counties) across the country. A major premise of the NCS is that findings could be extrapolated to represent the American experience, and inform public policy [1][2][3][4]. Seven "pilot" or Vanguard Centers began recruitment in 2009, and Duplin County, NC and Queens County, NY were the first to enumerate and screen potential subjects residing within predetermined geographic areas, referred to as segments. These segments were selected to produce a representative subsample of the county that would, given estimated recruitment rates, result in recruitment into the study of approximately 1,000 mothers giving birth over a four-year period.
The World Health Organization (WHO) has identified neighborhood contextual exposures as a central element in its Multiple Exposures Multiple Effects (MEME) framework for studying childhood health [5]. Multiple neighborhood contextual characteristics have been shown to affect a range of developmental and health outcomes across childhood and adolescence, with cognitive functioning being one of the most widely investigated. The socioeconomic composition of neighborhood residents is associated with cognitive functioning [6][7][8][9][10][11], and there is some evidence that this effect differs by race and ethnicity [8,9]. As is the case with cognitive functioning, school achievement has also been associated with neighborhood socioeconomic status (SES) [12], and gender-specific effects have been shown [13]. In addition to effects on cognitive function, a growing literature demonstrates neighborhood effects on both physical and mental health and behavior. Proximity to and quality of parks, playgrounds, and recreational facilities have been associated with physical activity, health behaviors and body size [14][15][16][17][18][19][20][21][22][23][24]. Similarly, indices of neighborhood walkability, such as population density and land use, are associated with physical activity through walking and active travel among youth [25,26]. Some studies demonstrate that physical deterioration (e.g. graffiti, litter or abandoned buildings) is associated with lower physical activity, higher rates of overweight in children, and lower parental support for children playing in local playgrounds [27][28][29][30]. Other important associations between neighborhood characteristics and physical outcomes include traffic-related respiratory symptoms [31] and injuries [32]. Among the mental health and behavioral outcomes influenced by neighborhood conditions (primarily SES-related) are psychological distress [33], substance use [34], and behavioral problems [10,35,36].
Consistent with the WHO MEME framework, one of the goals of the NCS is to understand how neighborhood environments influence child development and disease risk [1]. However, there are ethical and logistical challenges to the achievement of this goal within the context of a national, multi-site study such as the NCS, which is coordinated by a central data center. Outside of data collected nationally by the Census Bureau through the American Community Survey, the availability and quality of geo-spatially aligned data describing neighborhood contexts varies tremendously across cities in the United States. In addition, our experience has been that negotiating licenses with local governmental agencies for geo-spatial data and the sharing of geo-spatial data is often facilitated by relationship building, trust, involvement in the community and personal connections, suggesting a substantial role for local research teams in the acquisition of geo-spatial data. It is also common that licenses for such data specify that the data not be further shared with other groups, such as the NCS data center.
One approach to conducting neighborhood health studies within the NCS would be for the central NCS data repository to release to authorized investigators analytical data sets with the residential longitude and latitude of the study subjects so the investigators could create their own neighborhood context variables. This would provide the investigators with the flexibility of creating their own neighborhood definitions (e.g. to use administrative units such as postal codes or radial buffers around the address) for study subjects and of using locally available geo-spatial data to create neighborhood measures, but could compromise the confidentiality of the study subjects. Alternatively the data center could centrally perform geo-processing functions as requested by NCS investigators and provide analytical data sets that include neighborhood context variables but not residential identifiers. For analyses using unique or not universally available geo-spatial data sources, the logistics of the NCS data center sourcing and centrally negotiating data license agreements could be a serious barrier to research. Thus, strategies to protect subject confidentiality, efficiently acquire geo-spatial data and adhere to data licensing agreements will be needed to support analyses of neighborhood health effects. Other large scale national studies in the United States have taken a variety of approaches to these issues, and we suggest that a working group of interested parties be formed to study how neighborhood effect studies can best be conducted within the NCS.
From an international perspective, efforts to establish procedures for the conduct of neighborhood effects research within the NCS should take account of international efforts to coordinate the conduct of new large scale birth-cohort studies [37]. The World Health Organization (WHO) is currently working to strengthen international cooperation in the conduct of birth cohort studies, with a focus on harmonizing disease outcome, biomarker and exposure measures so that study data may be pooled [37]. Under WHO's MEME framework, the measurement of community contextual exposures is one of the "four ingredients required for the monitoring of children's environmental health" [38]. The development of compatible methods to define and measure neighborhood contexts across birth cohort studies, and cross-cultural research to identify contextual constructs that are salient across cultures and regions, are areas that warrant consideration within WHO birth cohort coordination activities.
Early work from Queens has described extant data sources at the national, county and local level that can be used to estimate chemical exposures for the children enrolled in the NCS [39]. Following the WHO MEME framework, we here broaden the discussion of neighborhood environment to include the social and built environment of the Queens Vanguard site recruitment segments and Queens as a whole [1,40]. Our goal here is to identify multiple domains of neighborhood context and key variables describing the dimensions of these domains. Important considerations for conducting neighborhood health studies are whether the neighborhoods are representative of the larger area to which study results will be generalized, and whether there is sufficient variability across neighborhoods for comparisons of contextual variables to be meaningful. Consideration and analysis of these issues for neighborhood effects studies should be part of the WHO efforts to coordinate and harmonize new birth cohort studies within the MEME framework [37,38]. Here we compare the Queens segment areas to the whole of Queens to determine if the segments are representative of Queens County on the selected indicator variables and assess the extent to which the segments vary in neighborhood conditions.

Methods
Based on the literature, researchers with the Columbia University Built Environment and Health Research Group (BEH), the Columbia Children's Center for Environmental Health and the NCS Queens Vanguard site identified nine major domains of social and built environment contexts of interest. The domains of interest and key indicator variables are described in Table 1. Available geo-spatial data from the Census and other sources have been gathered by BEH, and subsequently cleaned and geo-processed for use with the Queens Vanguard segments. Some of the data sources are available nationally, while others are unique to NYC (see Table 1).
The overall strategy for segment selection in the NCS has been reported previously [41]. The Queens Vanguard Center comprises 18 geographic areas, referred to as segments, which are noncontiguous, relatively homogeneous areas from which study subjects are recruited. Historical birth counts from the New York City (NYC) and New York State (NYS) Vital Statistics Registries (2000-2004) at the census tract level and NYC Housing Department data were used to predict births within census blocks. Census blocks were chosen to be representative in terms of race/ethnicity, poverty status, age distribution and foreign born status of women of child bearing age. These blocks were then combined to achieve eighteen segments that would produce 250 live births per year. The eighteen segments were selected in a two-phase stratified sampling approach that attempted to equalize the probability of selection of segments with diverse sociodemographic and other characteristics. The segment boundaries were guided by boundaries of historical neighborhoods as catalogued by the NYC Department of City Planning, and by examination of proposed segment maps to ensure that selected boundaries did not cross major roadways, parks or other entities around which communities are formed. Between March and August 2008, all dwelling units (DU) within the segments were identified (N = 11,116), and to date 44 newly constructed DUs have been included in the sample, resulting in a total of 11,160 households [2].
Summary statistics were generated for these neighborhood context variables. Queens NCS segments were then compared to the remainder of Queens County to determine the degree to which segment selection (based on relatively few birth and demographic variables) yielded areas that were representative of Queens as a whole. Mean, median, quartile, minimum and maximum values were calculated for each variable, and segmented and non-segmented areas were compared using t-tests and non-parametric tests.
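The comparison described above can be illustrated with a minimal Python sketch. It is not the authors' analysis code. The function names are hypothetical, the t statistic uses the unequal-variance (Welch) form as an assumption because the text does not say which t-test variant was applied, and the Mann-Whitney p-value uses a normal approximation without tie correction.

```python
import math
import statistics
from statistics import NormalDist


def summarize(values):
    """Mean, median, quartiles, minimum and maximum for one indicator."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "mean": statistics.fmean(values),
        "median": q2,
        "q1": q1,
        "q3": q3,
        "min": min(values),
        "max": max(values),
    }


def welch_t(a, b):
    """Welch's t statistic for two independent samples (assumed variant)."""
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    standard_error = math.sqrt(var_a / len(a) + var_b / len(b))
    return (statistics.fmean(a) - statistics.fmean(b)) / standard_error


def mann_whitney_u(a, b):
    """Mann-Whitney U for sample a and a two-sided p-value.

    The p-value uses the normal approximation with no tie correction,
    which is only reasonable for moderately large samples.
    """
    # Count pairs where the value from a exceeds the value from b; ties count 1/2.
    u = sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in a for y in b)
    n1, n2 = len(a), len(b)
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean_u) / sd_u
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return u, p_value


# Hypothetical indicator values for a handful of tracts (illustration only).
segment_tracts = [12.0, 15.5, 9.8, 14.1, 11.3]
other_tracts = [13.2, 10.4, 16.0, 12.7, 9.1, 14.8]
print(summarize(segment_tracts))
print(welch_t(segment_tracts, other_tracts))
print(mann_whitney_u(segment_tracts, other_tracts))
```

In practice an established statistics package would normally be used for the p-values; the hand-written versions here only make the two kinds of test explicit.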
To preserve the confidentiality of the study subjects during the recruitment phase of the NCS, the locations of the Queens segments are not disclosed. Thus summary statistics for Census tracts that include Census blocks that are part of the segments were calculated and compared to summary statistics for Census tracts that do not include segment Census blocks (see Figures 1 and 2). To ensure the stability of summary statistics calculated at the tract level, tracts containing a total population less than 500 were excluded from analysis (n = 75); one additional tract consisted predominantly of institutionalized individuals and was unlikely to include children, and this tract was also excluded from analysis. The remaining 640 tracts included in the analysis contained a total population of 2,225,761, or 99.9% of the population of Queens. A total of 43 tracts with a population of 168,503 contained NCS segments; the remaining 597 tracts, with a population of 2,057,258, did not contain NCS segments. Analyses in this paper did not use human subject data.
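The tract-level exclusion and grouping rules above can be sketched as follows. This is an illustrative Python sketch, not the authors' geo-processing workflow: the `Tract` record, its field names and the toy tract values are hypothetical. Only the 500-person threshold and the reported totals come from the text.

```python
from dataclasses import dataclass

MIN_POPULATION = 500  # tracts below this total population were excluded


@dataclass
class Tract:
    tract_id: str            # hypothetical identifier
    population: int
    institutional: bool      # predominantly institutionalized residents
    has_segment_block: bool  # any Census block in the tract belongs to an NCS segment


def split_tracts(tracts):
    """Apply the exclusions, then split into segment and non-segment tracts."""
    eligible = [
        t for t in tracts
        if t.population >= MIN_POPULATION and not t.institutional
    ]
    segment = [t for t in eligible if t.has_segment_block]
    non_segment = [t for t in eligible if not t.has_segment_block]
    return segment, non_segment


# Toy data for illustration only; these are not real Queens tracts.
toy_tracts = [
    Tract("A", 3200, False, True),
    Tract("B", 410, False, False),   # excluded: population under 500
    Tract("C", 2800, True, False),   # excluded: institutional tract
    Tract("D", 5100, False, False),
]
segment, non_segment = split_tracts(toy_tracts)
print([t.tract_id for t in segment], [t.tract_id for t in non_segment])

# Consistency check on the totals reported in the text.
assert 43 + 597 == 640
assert 168_503 + 2_057_258 == 2_225_761
```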

Results
Descriptive statistics comparing Census tracts containing portions of NCS segments (N = 43) with those tracts not containing portions of the NCS segments (N = 597) are shown in Table 2. To preserve anonymity of the segments, only means and inter-quartile ranges are reported in Table 2. Across the nine domains characterizing Queens communities, the NCS segment-containing tracts were, as a group, quite similar to the tracts in Queens not containing portions of NCS segments. More specifically, of the 53 community indicators representing these nine domains, a statistically significant difference (p < 0.05) was found for only 7 indicators, using either a non-parametric (Mann-Whitney U) or a parametric (t-test) statistical test. The indicators with statistically significant differences were as follows: NCS segment-containing tracts had a higher proportion of Asian and Pacific Islanders, a smaller proportion of individuals reporting that they were members of two or more races, a lower proportion of female residents, a smaller proportion of teen mothers, a smaller percentage of low-birth weight births, fewer bicyclists injured in car accidents, and a lower proportion of tract area within a ¼ mile of a pollution point source.
Because of the relatively large number of tests involved in these comparisons, many of the seven statistically significant differences are probably not 'significant' in the sense that they indicate that segment tracts are not 'representative' of non-segment tracts. No adjustment was made for multiple comparisons; given that 53 tests of each type were performed, approximately 3 indicators (53 × 0.05 ≈ 2.7) would be expected to differ significantly at the 0.05 level for each type of test purely by chance, which could account for a sizable share of the seven significant differences observed.
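The multiple-comparisons argument above can be made concrete with a short calculation. The sketch below is illustrative and assumes the 53 tests are independent, which is a simplification: many indicators are correlated, and the parametric and non-parametric tests were run on the same data. The function names are hypothetical.

```python
from math import comb

N_TESTS = 53   # indicators tested with each type of test
ALPHA = 0.05   # nominal significance level


def expected_false_positives(n_tests, alpha):
    """Expected number of tests significant by chance when all nulls are true."""
    return n_tests * alpha


def prob_at_least(k, n_tests, alpha):
    """P(X >= k) for X ~ Binomial(n_tests, alpha), assuming independent tests."""
    return sum(
        comb(n_tests, i) * alpha**i * (1 - alpha) ** (n_tests - i)
        for i in range(k, n_tests + 1)
    )


print(expected_false_positives(N_TESTS, ALPHA))  # about 2.65 per type of test
print(prob_at_least(1, N_TESTS, ALPHA))          # chance of at least one false positive
```

Under these assumptions, a few nominally significant indicators per type of test are expected even if segment and non-segment tracts do not differ at all.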
Variability in measures among tracts is arguably as important as central tendency with respect to 'representativeness'. If the tracts containing NCS segments were systematically less variable than the tracts not containing segments with respect to the community context indicators, the NCS segments could not be considered representative of Queens as a whole, even if the mean level of the indicators was comparable. However, an examination of the ratio of inter-quartile ranges (IQR ratio) for the two groups suggests that the two sets of tracts are in general comparable in terms of variability: as can be seen in Table 2, 25 indicators had an IQR ratio within 20% of 1 (equal variability); an additional 23 had an IQR ratio no more than 1.5 and no less than 0.5; and only 4 indicators had a ratio less than 0.5 or greater than 1.5 (two indicators had inter-quartile ranges of zero, so that the IQR ratio could not be calculated).
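The IQR ratio and the bands used above can be sketched in Python as follows. This is an illustration under stated assumptions: the quartile method (`inclusive`), the treatment of a zero denominator as undefined, the function names and the band labels are all choices made for this sketch. The paper does not specify its quartile method.

```python
import statistics


def iqr(values):
    """Inter-quartile range (third quartile minus first quartile)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1


def iqr_ratio(segment_values, non_segment_values):
    """IQR of segment tracts divided by IQR of non-segment tracts.

    Returns None when the denominator is zero, since the ratio is undefined.
    """
    denominator = iqr(non_segment_values)
    if denominator == 0:
        return None
    return iqr(segment_values) / denominator


def classify(ratio):
    """Place a ratio into the bands described in the text."""
    if ratio is None:
        return "undefined"
    if 0.8 <= ratio <= 1.2:
        return "within 20% of 1"
    if 0.5 <= ratio <= 1.5:
        return "between 0.5 and 1.5"
    return "below 0.5 or above 1.5"


# Hypothetical indicator values (illustration only).
segment_values = [1, 2, 3, 4, 5]
non_segment_values = [2, 4, 6, 8, 10]
ratio = iqr_ratio(segment_values, non_segment_values)
print(ratio, classify(ratio))
```

A ratio above 1 indicates more variability among segment tracts, and a ratio below 1 indicates more variability among non-segment tracts.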
To provide a sense of the geographic variability of these indicators, two maps are provided, one showing the distribution of percent low birth weight (<2500 g) in Queens Census tracts (Figure 1), the other showing the distribution of percent foreign born individuals (Figure 2). The maps also show the approximate size of an average NCS segment.

Discussion
Previous smaller, longitudinal birth cohorts, both in the US and internationally, have made enormous contributions to our understanding of how maternal nutrition, environmental exposures and social circumstances shape child health and development [42][43][44][45][46][47]. The NCS is designed to expand upon this prior work, at a scale that will allow for analyses of interactions between environmental pollutants, genetics, neighborhood effects and social forces [1]. This scale will facilitate the identification of determinants of childhood disease and the characterization of susceptible sub-populations of children that require a higher level of protection or interventions. Applying the WHO MEME framework to the NCS, we described several domains of neighborhood context variables that may be important determinants of child health and development. While not exhaustive, these domains represent areas of concern identified in the literature, including our own studies of neighborhood effects on health. The variables highlighted here as measures of these domains do not represent the full breadth of dimensions for these domains, but they do represent the key elements, and were selected because of the availability of geo-spatial data sets with sufficient spatial resolution to characterize Census tracts.
The analyses presented here document that at the tract level, the Queens NCS segments are representative of Queens overall for a large number of neighborhood level variables. The few differences identified are compatible with chance associations arising across a large number of comparisons, and there are no readily apparent processes to causally explain the differences. The block groups comprising the NCS segments were selected based on a relatively small number of socio-demographic and vital statistics. However, these variables appear to be correlated with a larger number of other socio-demographic and urban design variables, such that the tracts including portions of the NCS segments are very similar to tracts not including portions of the NCS segments. In addition, these analyses suggest that the Queens segments include a substantial amount of variation in neighborhood context variables for many domains of interest in neighborhood health studies.
Figure legend (partial): the circles in the legend of the maps represent the total area of the 18 segments (middle circle, 6.18 km²) and the average area of a Queens NCS segment (smallest circle, 0.34 km²).
Census tracts in NYC are sufficiently small that they are likely to be representative of the segments, but in recognition of disclosure risks, are sufficiently large that tabulated summary statistics will not reveal the location of the segments. Furthermore, since women and children living in the segments are likely to experience social and environmental conditions in areas adjacent to the segments, the use of Census tracts encompassing the segments partially accounts for this spatial spillover effect [48]. The analyses show that results of neighborhood context studies derived from the Queens site will likely be generalizable to Queens as a whole.
Measures of many social and economic constructs can be developed from national American Community Survey and Economic Census data and used across NCS sites, leveraging the substantial between-site variation in neighborhood contexts. In addition, other geo-spatial data-sets are often available at the municipal or county level, providing each NCS site with unique spatial data and opportunities to perform neighborhood context analyses. Geo-spatial data sets developed by NYC agencies have been extremely useful in studies of adult health in NYC [49][50][51][52][53] and provide unique opportunities for studying the effects of neighborhood built and social environments on child health and development [49,54].
The issue of differences in availability and quality of neighborhood contextual data across regions is amplified when one considers WHO's efforts to harmonize and coordinate birth cohort studies internationally. Part of [...] Similarly, cross-cultural research on how the concepts of "neighborhood-level" or "community-level" are defined or are salient needs to be undertaken [40,48,55,56]. Current literature on neighborhood effects commonly uses administrative boundaries (e.g. postal codes or Census tracts) or radial or street network buffers centered on a subject's home to define neighborhoods [48,51,55]. However, just as individuals' conceptualizations of neighborhood can vary within a single region, there are likely to be substantial differences in how individuals define neighborhoods across international contexts [57][58][59]. Furthermore, while the definition of a neighborhood used in research should represent the geographic scale over which a neighborhood-level phenomenon is thought to causally influence health, the geographic scale across which social and physical contexts affect health may vary across cultures.

Conclusions
The WHO MEME framework identifies neighborhood and community contexts as one of the four key indicators of children's environmental health [38]. In applying the MEME framework to the NCS, we have identified multiple domains of neighborhood context and key variables describing the dimensions of these domains that can be used in the National Children's Study (NCS) site in Queens and many of which can be used throughout the NCS. We show that the selection of block groups to form the NCS segments in Queens using a short list of neighborhood contextual indicators (race/ethnicity, poverty status, age distribution and foreign born status of women of child bearing age) produced segments that are representative of Queens County, across many neighborhood variables. The segments also show a substantial amount of variation in neighborhood contextual variables for several domains of interest for neighborhood health studies. This suggests that unbiased studies of contextual and individual level risk factor effects on child health outcomes can be conducted within the Queens site. From a larger perspective, the NCS presents a valuable opportunity for conducting studies of the role of neighborhood context on child development and health. The development of strategies to conduct neighborhood health research and protect subject confidentiality, efficiently acquire geospatial data and adhere to data licensing agreements should be a priority. These issues plus the development of an understanding of the content validity of neighborhood contextual measures and the very meaning of "neighborhood" across cultures and regions should be a priority for WHO efforts to coordinate and harmonize data across birth cohorts [37].
Table 2 footnotes:
1 A full description of each indicator, including the data source, is provided in Table 1. Except for indicators with small values, results are rounded for display purposes to the nearest whole number; original values (with decimals) were used for purposes of statistical testing.
2 Inter-quartile range ratio: the ratio of the inter-quartile range (segmented tracts divided by non-segmented tracts); a value greater than 1 indicates more variability in segmented tracts than in non-segmented; a value less than 1 indicates more variability in non-segmented tracts.