An evaluation of edge effects in nutritional accessibility and availability measures: a simulation study

Background This paper addresses the statistical use of accessibility and availability indices and the effect of study boundaries on these measures. The measures are evaluated via an extensive simulation based on cluster models for local outlet density. We define outlet to mean either food retail store (convenience store, supermarket, gas station) or restaurant (limited service or full service restaurants). We designed a simulation whereby a cluster outlet model is assumed in a large study window and an internal subset of that window is constructed. We performed simulations on various criteria including one scenario representing an urban area with 2000 outlets as well as a non-urban area simulated with only 300 outlets. A comparison is made between estimates obtained with the full study area and estimates using only the subset area. This allows the study of the effect of edge censoring on accessibility measures.

Results The results suggest that considerable bias is found at the edges of study regions, in particular for accessibility measures. Edge effects are smaller for availability measures (when not smoothed) and also for short range accessibility.

Conclusions It is recommended that any study utilizing these measures should correct for edge effects. The use of edge correction via guard areas is recommended and the avoidance of large range distance-based accessibility measures is also proposed.


Introduction
With an increasing interest in the influence of environmental contexts on health behaviors and outcomes, spatial accessibility and availability indices are increasingly applied in epidemiologic studies, including those focusing on the built food environment [1][2][3][4][5][6]. Commonly used availability measures include the number or density of outlets, stores, or restaurants in a given location or within a fixed distance of a location. For accessibility, commonly used measures are distance-based, on the assumption that increased distance acts as a deterrent and reduces the frequency of use of the resource. Frequently, arbitrary administrative boundaries such as Census tracts or block groups are used in lieu of neighborhoods, without consideration that resources beyond a given boundary are likely to affect behavior within a spatial unit. Specifically, the effect of edge censoring on such indices has never been fully evaluated. Edge effects occur when the study boundary affects the estimation of a measure, and they can induce biases that affect inferences made from the measures [7]. Consider a distance-based measure such as distance to the nearest supermarket. For locations near the study boundary, the nearest supermarket may lie outside the study area, which introduces bias into the measured distance to the nearest supermarket.
When many observations are close to external boundaries, this effect can be significant. It has been demonstrated that such edge effects can affect the analysis of small area health data [8][9][10]. This is a form of spatial censoring, where data points outside the study area are not observed. This study evaluates edge effect bias via simulation in applications where accessibility and availability measures are used and recommends approaches to correct or allow for edge effects. This paper is structured as follows. We first outline the measures of interest followed by the simulation design and finally provide results both in terms of contoured maps of error as well as distance profiles of bias.

Background to Availability and Accessibility Measures
This study evaluated several accessibility and availability measures. Our choice of measures includes those commonly found in the literature of the built food environment [11][12][13][14][15][16][17][18]. Each measure is available at a spatial location within a study area. We define that location as s, which represents the Cartesian coordinates of the location.

Availability Measures (CI and $\sqrt{CI}$)
The simplest availability measure we examined is the cumulative index (CI), the count of outlets at a location (or within a pre-defined distance of a location such as a distance buffer, a Census tract, or block group). Hence for a spatial location (s), this is defined as $CI(s) = n(s)$. If we index the location as the i th site then $CI_i = n_i$. This measure of availability is frequently used [11][12][13][14][15][16][17][18]. Simple derivatives of this index include density measures, either relative to population [16,19,20,21,22] or to area [20,23]. The variance-stabilized form of this count, $\sqrt{CI_i}$, is often used to regularize the variability, and is helpful when there is a need to fit a linear regression model to the square root of the count data [24]. An underlying limitation of the CI is that the spatial unit defines the perimeter of a "neighborhood", i.e. constrains the availability measure to have a "local" nature.
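As an illustration of the count-based definition above, the following minimal sketch computes CI for a circular distance buffer together with its square root. The function name `cumulative_index`, the buffer radius, and the toy coordinates are assumptions made for this example and are not taken from the study.

```python
import math

def cumulative_index(location, outlets, radius):
    """Count outlets whose Euclidean distance to `location` is at most `radius`."""
    x, y = location
    return sum(1 for ox, oy in outlets if math.hypot(ox - x, oy - y) <= radius)

# Hypothetical outlet coordinates on a unit square (illustration only).
outlets = [(0.10, 0.10), (0.12, 0.15), (0.50, 0.50), (0.90, 0.20)]
site = (0.10, 0.12)

ci = cumulative_index(site, outlets, radius=0.05)  # count within the buffer
sqrt_ci = math.sqrt(ci)                            # variance-stabilized form
```

A tract or block-group version would replace the distance test with a point-in-polygon test, which is outside the scope of this sketch.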
These measures can be computed for a variety of spatial unit sizes. Ultimately the spatial distribution of outlets (or stores or restaurants) is a point process over the study area that may be described by density estimation [25] to provide smoothed local estimates of the density of points. Hence CI is a crude form of a local estimator of density when divided by area. Counts thus are aggregations of outlet locations and maps of counts are smoothed maps of density.
Edge effect censoring can arise with availability measures when counts of outlets are smoothed. For example, averaging of counts within an area will depend on the neighborhood used for the averaging. If part of a neighborhood lies outside the area then some bias will occur in the calculation of the average count near the edge. This is true also for density estimation of point location events [25].

Accessibility indices (C p , distance to the nearest outlet)
Often distance-based measures are used to express the idea that potential access to resources diminishes with distance. The distance measured could be road network distance or based on some other relevant distance metric (e.g. Euclidean). The C p measure provides cumulative evidence for accessibility at a spatial location, and can be calculated for special cases such as CP to the nearest outlet, CP for a specified distance buffer, and CP total (calculated over the entire study region). A related measure is the distance to the nearest outlet itself: D i = d i . Both C p (nearest) and distance to nearest outlet (D i ) can be extended to include a variety of closeness ('distance to') metrics: nearest, second nearest, third nearest, and the 'sum of distances to' these. For example, we could specify a cumulative distance to the 3 nearest outlets, or we could calculate the cumulative opportunity index for the 2 closest outlets to a location.
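The closeness metrics described above (distance to the nearest outlet and the sum of distances to the k nearest outlets) can be sketched as follows. This is a minimal example assuming Euclidean distance; the helper names and the toy coordinates are hypothetical, and the C p index itself is not reproduced here because its defining formula does not appear in this text.

```python
import math

def sorted_distances(location, outlets):
    """Euclidean distances from `location` to every outlet, in ascending order."""
    x, y = location
    return sorted(math.hypot(ox - x, oy - y) for ox, oy in outlets)

def distance_to_nearest(location, outlets):
    """D_i: distance from the location to its nearest outlet."""
    return sorted_distances(location, outlets)[0]

def cumulative_distance(location, outlets, k):
    """Sum of distances to the k nearest outlets."""
    return sum(sorted_distances(location, outlets)[:k])

# Hypothetical outlets around a site at the origin (illustration only).
outlets = [(0.0, 0.3), (0.4, 0.0), (0.0, -0.5), (1.0, 1.0)]
site = (0.0, 0.0)

d_nearest = distance_to_nearest(site, outlets)
d_three = cumulative_distance(site, outlets, k=3)
```

Replacing `math.hypot` with a road network distance lookup would give the network-based variant mentioned in the text.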
Clearly, with C p measures, the smaller the area (A) the more local the measure. One unfortunate feature of the C p is that for larger buffers accessibility is averaged over areas that are distant from the location, leading to oversmoothing of the measure. Hence it is likely to be more informative to use smaller distance buffers in studies of food access.
Edge effect censoring arises with accessibility measures as measures of distance are only available within the study area. This not only potentially skews the distance distribution but also assumes a travel route to food outlets that may not be relevant for any given individual. When a fixed distance buffer is employed and distances are cumulated within the buffer, then the degree of censoring will increase with buffer size. For availability measures these considerations seem less relevant as distance is not usually included in these measures.

Simulation Study Design
We wish to quantify edge effect bias for these accessibility and availability measures calculated in two spatial environments. We therefore conducted a simulation study to address the nature of the spatial variation of these measures. This study was motivated by and is part of a larger effort on characterizing the built food environment in an eight county region in South Carolina [26].
As is common in the evaluation of distance-dependent spatial processes [25], we first defined a unit square study area. This choice makes the evaluation nondimensional, so it can be carried out without distance scaling. The effects of scaling of distance are addressed later. A mesh grid placed over the unit square defines grid cells. Uniformly distributed points placed within these grid cells represent s location points. To assess the effects of edges, we partitioned the study area in two: an internal area and an external guard area. The union of these areas forms the complete study area (figure 1), where the external guard area is bounded by the dashed and solid black lines.
Outlets are then simulated based on the model assumptions below. The accessibility measures are then computed for the complete study area. A second set of measures is then computed using only the internal area. Hence for all s location points within the internal area there will be two sets of measures: one computed over the entire study area and the other using only the internal area. The effect of censoring at the edges is captured by this design: comparison of the two sets of measures allows us to evaluate the degree of bias attributable to edge censoring.
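The two-pass design can be illustrated with a single location and the distance to the nearest outlet. In this minimal sketch the guard width of 0.1, the helper names, and the coordinates are assumptions chosen for illustration; the measure is computed once with all outlets and once with only the outlets inside the internal area.

```python
import math

GUARD = 0.1  # assumed width of the external guard band on the unit square

def in_internal(point, guard=GUARD):
    """True if the point lies inside the internal area (away from the outer edge)."""
    x, y = point
    return guard <= x <= 1 - guard and guard <= y <= 1 - guard

def nearest_distance(location, outlets):
    """Euclidean distance from the location to its nearest outlet."""
    x, y = location
    return min(math.hypot(ox - x, oy - y) for ox, oy in outlets)

# Hypothetical outlets: the first one lies in the guard band.
outlets = [(0.05, 0.50), (0.60, 0.50), (0.95, 0.95)]
site = (0.15, 0.50)  # internal location close to the internal boundary

full = nearest_distance(site, outlets)                    # uses every outlet
internal_only = [o for o in outlets if in_internal(o)]
censored = nearest_distance(site, internal_only)          # guard outlets unseen
```

Here the censored distance is larger than the full-area distance because the true nearest outlet sits just outside the internal boundary, which is the kind of discrepancy the design is meant to expose.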

Model Assumptions
The simulation design is partially based on characteristics of the local food environment and also more general considerations of applicability to a variety of food environment scenarios. To this end we examined outlet densities in an eight county urban and rural area of South Carolina [26]. Large cities are absent, and the average characteristics of outlet density and its variation between rural and urban areas are highlighted. Here we define 'outlet' to mean either food retail store (convenience store, supermarket, gas station) or restaurant (limited service or full service restaurants). Initial simulations considered total stores and restaurants and assumed an outlet density with mean 14.8 and standard deviation of 13.5 per census tract. These summary values correspond to the South Carolina study which identified 2219 food outlets covering 150 census tracts.
We assumed that the study area was divided into a fine tract grid and then we uniformly distributed 400 location points across the unit square grid. Accessibility and availability measures were calculated from the uniformly distributed s location points to outlets in tracts. The outlet densities in our study area [26] suggest overdispersion relative to a Poisson distribution, and initially we examined simulations where outlets were assumed to have a negative binomial distribution in small areas. This however proved to be too simplistic and did not reflect the clustered nature of the outlet distribution. It is often the case that outlets are found in different clustered arrangements in the food environment and so our simulation would be more appropriate if spatial clustering was included in the design.
To accomplish this we designed cluster simulations where a fixed number of cluster centers are assumed and then clustering of outlets around these centers is specified by the parameter j. The locations of the cluster centers were randomly simulated using a uniform distribution. To then simulate outlets using this clustering process, we simulated potential outlet locations s* also from a uniform distribution. For each candidate we then calculated a quantity that depends on |s - x j |, the Euclidean distance between location point s and cluster center x j , and on that basis accepted or rejected s* as an outlet location. Note that these forms are closely related to spatial cluster processes [27]. The cluster centers are fixed in the simulation and outlets are simulated around the centers to mimic aggregation of outlets. While it is clear that in some real cases clusters of outlets occur as linear features related to road systems, it is considerably more difficult to obtain generalizable simulation results from linear features. We believe that clustering modeled around centers can act as an adequate approximation to the real aggregation found, but this assumption has yet to be formally evaluated.
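One way such a cluster simulation could be set up is sketched below. The acceptance rule used in the study is not reproduced in this text, so the Gaussian kernel, the use of the maximum over centers as an acceptance probability, and the names `simulate_clustered_outlets` and `spread` (standing in for the clustering parameter) are all assumptions of this example, not the authors' method.

```python
import math
import random

def simulate_clustered_outlets(n_outlets, n_centers, spread, seed=0):
    """Rejection-sample outlets clustered around uniformly placed centers.

    ASSUMPTION: a candidate is accepted with probability equal to the largest
    Gaussian kernel value exp(-d^2 / (2 * spread)) over all cluster centers.
    """
    rng = random.Random(seed)
    centers = [(rng.random(), rng.random()) for _ in range(n_centers)]
    outlets = []
    while len(outlets) < n_outlets:
        sx, sy = rng.random(), rng.random()  # uniform candidate s*
        accept_prob = max(
            math.exp(-((sx - cx) ** 2 + (sy - cy) ** 2) / (2 * spread))
            for cx, cy in centers
        )
        if rng.random() < accept_prob:
            outlets.append((sx, sy))
    return centers, outlets

# Non-urban style run: few centers, tight clustering, 300 outlets.
centers, outlets = simulate_clustered_outlets(n_outlets=300, n_centers=5, spread=0.005)
```

A smaller `spread` concentrates accepted points more tightly around the centers, while a larger value with more centers gives a more diffuse pattern.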
We then used different parameters in the clustering process to distinguish between urban and non-urban areas. We assume there are generally more outlets in urban areas as compared to non-urban areas, and we expect there to be more cluster centers in the urban areas but that the outlets are not as tightly aggregated around each cluster center. The cluster centers could represent a large urban development or shopping area, but we would also expect some locations of outlets to be in the general urban area and not just around the big developments. In contrast, we expect fewer cluster centers in the non-urban areas and that these centers would represent "small" or "large" towns within the non-urban areas. We also expect that the outlets will be more tightly clustered around these cluster centers, and that very few outlets will be in the areas outside the cluster centers. Therefore, we specify a smaller j = 0.005 to represent tighter clustering and fewer total outlets in the non-urban areas as compared to j = 0.01 and more outlets in the urban simulations. Figure 2 displays examples of both urban and nonurban simulations of outlets using clustering along with the edge effects boundary. This figure illustrates there will be outlets excluded by edge effects, which will create bias in accessibility and availability measures. We expect differences in bias between urban and non-urban areas, as more outlets are excluded in the urban simulation scenario due to the clustering simulation and total number of outlets in the study area.

Bias and variability
To assess edge effect bias and variability, we calculated the percentage error and absolute bias for each accessibility and availability measure considered. For each s location point within the internal boundary area, both quantities compare the spatial measure calculated for the entire grid (internal area + external guard area) with the same measure calculated using only those outlets inside the edge effect boundary (internal area). The percentage error is given by the following formula:

$$\text{Percentage error for } s = \frac{\text{measure}(s)_{\text{total area}} - \text{measure}(s)_{\text{internal area}}}{\text{measure}(s)_{\text{total area}}} \times 100$$

Tables 1 and 2 display the minimum, median, and maximum values of the median absolute bias among locations that are a specified distance from the edge effects boundary. Table 1 is for an urban simulation with 2000 total outlets and Table 2 is for a non-urban simulation with only 300 outlets. Regardless of simulation scenario, as the distance to the edge effects boundary increases, the median absolute bias falls to zero for every index except CP total. Even at small distances, all indices except CP total have little to no median absolute bias. The absolute bias for CP total is much larger in the urban simulation than in the non-urban simulation, which is intuitive given the large number of outlets located in the external guard area in the urban scenario.

Figure 3 displays the median percentage error for various accessibility and availability measures as a function of how far the location is from the edge guard boundary, for an urban simulation with 2000 outlets. Indices that involve only the nearest few outlets, such as CP for the 3 nearest outlets and distance to the nearest outlet, had median percentage errors equal to zero even at small distances from the edge. CI showed edge effects only at the locations closest to the guard area, which is expected since the total number of outlets for a location depends on the number of outlets in that particular grid cell. Therefore only grid cells divided by the edge boundary would be affected for this count measure. The poorest performing accessibility statistic in terms of median percentage error was CP total. The percentage error is higher at locations closest to the edge boundary; however, we still find median errors of 30% at distances farthest from the boundary. Since CP total is a cumulative measure over the entire study area, errors are to be expected, but they are alarmingly high in this urban simulation.

Figure 4 displays median percentage errors for CI, CP total, CP for the nearest 3 outlets, and distance to the nearest outlet for a non-urban simulation with only 300 outlets. Median percentage errors for CI range from 0% to over 60% for locations closest to the edge boundary, but fewer median percentage errors differ from 0% in the non-urban simulation than in the urban scenario. This is attributed to fewer outlets being located in the external guard area for the non-urban simulation. Errors for CP to the nearest 3 outlets as well as distance to the nearest outlet are slightly higher in the non-urban scenario. Since there are only 300 total outlets in this simulation, if one of the outlets is located in the external guard area, the next closest outlet may be farther away than it would be in an urban environment. We find a similar trend in the percentage errors for CP total in the rural scenario; however, the errors are generally smaller than those seen in the urban environment. Since there are fewer outlets in the rural simulation overall, as well as in the external guard area, the cumulative CP total is not as affected by edge effects as it is in an urban area.
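The percentage error and its summary by distance from the edge boundary can be sketched as follows. The record layout, the toy values, and the handling of a zero total-area measure are assumptions made for this example; they are not taken from the study's tables or figures.

```python
from statistics import median

def percentage_error(total_area_value, internal_value):
    """(total - internal) / total * 100 for one location.

    ASSUMPTION: when the total-area measure is zero the error is reported as
    0.0 if both values agree, and as NaN otherwise.
    """
    if total_area_value == 0:
        return 0.0 if internal_value == 0 else float("nan")
    return (total_area_value - internal_value) / total_area_value * 100

# Hypothetical records: (distance to edge boundary, total-area value, internal-only value).
records = [(0.01, 10, 6), (0.01, 8, 6), (0.05, 10, 9), (0.05, 5, 5), (0.20, 4, 4)]

errors_by_distance = {}
for dist, total, internal in records:
    errors_by_distance.setdefault(dist, []).append(percentage_error(total, internal))

# Median percentage error at each distance from the boundary.
median_error = {d: median(errs) for d, errs in errors_by_distance.items()}
```

For a count-type measure the internal-only value can never exceed the total-area value, so the error is non-negative; for a distance-type measure, censoring can only lengthen the distance, so the same formula yields zero or negative values.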

Mapped Results and Error Profiles
We can also present these edge effect percentage errors in contour plots, as shown in figure 5 for an urban environment. Once again, we see larger edge effects in areas closer to the edge boundary, and errors for CP total are the highest among the indices. Similarly, figure 6 displays contour plots of the median percentage error over 500 simulations for the non-urban scenario with only 300 outlets. We again see increased errors in locations closest to the boundary edge, but this time we also see increased errors around the cluster center locations. These cluster centers could represent small or large town environments, and there are few if any outlets located in areas outside of these small town developments.

Discussion and Conclusions
This paper highlights the importance of edge effects in the analysis of nutritional environment measures. These effects have been of some concern for spatial analysts [7,9,10]. Our simulations demonstrated two sources of bias in analysis results due to edge effects. First, areas close to external boundaries will have additional bias and variance attributable to censoring at the edge. Second, the edge effect can have an overall effect on measure estimation across the map. This means that accessibility measures will be most affected, as they use distances as a surrogate for access. Availability measures are less likely to be affected as they are simply local counts of outlets (unless smoothing has taken place). The median percentage error showed very small or no edge effects for the spatial accessibility measures CP to the nearest 1, 2, and 3 outlets, as well as for the distance to the nearest outlet, in both urban and non-urban simulations. However, CP total is greatly affected by edge boundaries regardless of whether the location is close to the boundary edge or not, with over 25% error observed close to the edges and only a marginal decrease to just under 20% at the center of the region. This error is much larger for urban areas than rural areas (see Figures 3 and 4). This suggests that CP total should be avoided as a measure of choice because of this edge distortion. Among availability measures, the CI index is greatly affected only at locations next to the edge boundary and is generally robust. If smoothing of CI were performed (e.g., by density estimation or non-parametric regression) then the smoothed estimates would have edge effects. Remedies for edge effects are available and usually involve some form of weighting system for edge areas.
Guard areas, either external or internal, are useful. External guard areas would be ideal if the extra information is available, as they allow the full estimation of internal measures. An internal guard area is always available in any study, but it limits the usefulness of edge areas, as they will be used only for the estimation of non-edge areas. Weighting based on proximity to the boundary is also possible, as a compromise between internal guard areas and no compensation. This study indicates that considerable bias arises in the estimates at or close to boundaries. The use of guard areas is therefore recommended in any study, and the size of such areas should be chosen carefully.
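An internal guard area can be sketched as a simple filter on the locations that are reported, while all outlets in the window still contribute to each measure. The guard width of 0.10 on a unit square, the function name `outside_guard`, and the coordinates are assumptions chosen for illustration.

```python
def outside_guard(location, guard=0.10):
    """True if the location lies inside the reporting area, i.e. at least
    `guard` away from every edge of the unit-square study window."""
    x, y = location
    return guard <= x <= 1 - guard and guard <= y <= 1 - guard

# Hypothetical study locations on the unit square.
locations = [(0.05, 0.50), (0.50, 0.50), (0.92, 0.30), (0.15, 0.85)]

# Measures are still computed from every outlet in the window, but only
# locations outside the guard band are kept for reporting.
reported = [p for p in locations if outside_guard(p)]
```

The trade-off noted in the text is visible here: locations inside the guard band are dropped from the reported results even though their surrounding outlets are still used.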
The implication of this edge effect is clear. When CP measures are used, it is more robust to use short to medium range measures (1st to 3rd nearest) than to use CP total. In fact CP total is by far the worst measure for edge bias: it has large edge effects, while the CI and short range CP measures have relatively minor effects. Confining the study to reporting of internal areas is important, and so we would recommend that short range measures be used with a guard area of around 10% of the study window, this being the approximate cut off for the effects for short range measures.

A further set of measures that combine accessibility with availability are gravity measures. These composite measures use distance friction modified by a measure of attraction (such as sales volume or floor space of the outlet). Usually they are defined as a ratio of the form g/d, where g is the measure of attraction of the outlet and d is the distance to the outlet. It is beyond the scope of this study to evaluate these measures. However, it is clear that the behavior of distance-based measures at or near boundaries is likely to be found for gravity measures as well, in that large distance-based gravity measures will have greater edge biases.

Some limitations and caveats should also be mentioned. First, in our simulation study we considered only a variety of clustered outlet distributions. However, outlets may congregate in more arbitrary clusters or associations (e.g. in linear strip malls or in isolated locations). In addition, the assumption of a Euclidean distance measure may be criticized. This is reasonable in a simulation as we cannot hope to represent the arbitrary network distances of real outlet attraction paths. The statistics we have examined are invariant to these transformations of metrics.
In general, our test statistics and our Monte Carlo limits are robust to scale change and to ranges of configurations which at least mimic the marginal properties of real outlet configurations. Thus we believe that the results are generalizable to both different spatial scales and distributions. A further limitation is that we restricted our study to spatial summary measures and did not pursue the application of geostatistical methods to the fields of measures. This decision was made for two pragmatic reasons: summary measures are commonly used and so are more likely to benefit from edge effect evaluation; and geostatistical methods are more difficult to apply, making comparisons of fields between spatial sites harder.