### Theoretical bases of spatial dependency

Determinants of health operate at multiple levels, including intrapersonal, interpersonal, organizational, community and society levels [2]. It is the interplay of determinants at all levels that affects an individual's health. Geographical analyses focus on risk factors at the community level, especially the geographical aspects of communities, namely neighbourhoods, and their interrelations with lower-level risks in influencing health outcomes. Neighbourhoods refer to spatial areas that represent individuals' places of residence and the surrounding environments where people live and interact with each other. Influenced by personal choices or external forces, people with common socio-economic status, ethnicity or demographic characteristics tend to live close to each other within a city. They consider certain surrounding areas as their neighbourhoods or their communities that provide safe places and serve as important sources of support and sociability [3].

Although it is argued that individuals are isolated in modern industrialized urban life and local neighbourhoods lack social ties to provide solidarity and assistance [4], neighbourhoods are still functioning as containers or constraints that limit local residents' access to service facilities, resources, education and job opportunities, constrain their interactions and even the selection of marriage partners [5], and affect and regulate their behaviours or beliefs through common norms. Within their neighbourhood, residents are continually involved in neighbourhood affairs, activities, and interactions through community-level organizations and day-to-day communications. In addition, potential environmental risks, such as air pollution, may also affect the health conditions of residents living in nearby neighbourhoods. Due to these impacts, homogeneities may be found within neighbourhoods or surrounding areas, in terms of living conditions, socio-economic status, health-related behaviours, psycho-social status, and general health. These geographically associated neighbourhood characteristics provide directions for public health interventions to address population health issues.

Neighbourhood characteristics (or collectives) are often associated with individual-level characteristics (or members), but are not necessarily based on individual characteristics [6]. Some collective characteristics can be obtained by aggregating information from individual members, such as the average income of an area. Some are based on the relationships between members, such as the social network of a neighbourhood. Others are the characteristics of the collective itself, such as health care provisions or policies. However, the collective properties at the neighbourhood level are sometimes difficult to obtain, or obtain directly. Samples for most health surveys are individual-based, rather than collectivity-based. Some neighbourhood properties are multidimensional and hard to measure, such as social capital, which includes aspects of social interactions, social network, trust, and mutual benefits. Some other properties may be affected by spatial and social distances and therefore may be hard to divide among neighbourhoods, such as health provisions.

In the absence of directly measured neighbourhood-level variables or indicators, an alternative method to obtain health-related neighbourhood properties is through the aggregation of individual-level characteristics in surveys so that common characteristics may be found to represent collective properties. For example, although social capital is a collective property of neighbourhoods, it may be represented by local residents' average sense of belonging to local communities.

### Spatial interpolation

The spatial dependencies or homogeneities of neighbourhood characteristics legitimate the use of spatial interpolation methods to obtain small-area variables. Based on spatial dependencies, spatial interpolation methods predict values for specified spatial locations using a limited number of sample data points at nearby locations. These methods can be grouped into deterministic and stochastic methods.

Deterministic methods interpolate values based on either the extent of similarity, such as the Inverse Distance Weighting (IDW) method, or the degree of smoothing as in the use of radial basis functions, such as spline and multiquadric functions. The methods can be either global or local, and can be exact (such as IDW, and spline) or inexact (such as polynomial interpolators). On the other hand, stochastic interpolation techniques, such as kriging, quantify the spatial autocorrelation among sample points based on the entire dataset within the study area, and account for the spatial configuration of the sample points around the prediction locations for interpolation. Although kriging is sometimes described as a local interpolation method since technically more weights are given to neighbouring observations, stochastic interpolation methods are by nature global interpolation techniques. The discussion and analysis in this paper focus on two commonly adapted interpolators, namely IDW and kriging.

A widely used deterministic interpolation method is Inverse Distance Weighting (IDW). It is a local exact interpolator that interpolates values based only on the surrounding measured values of the interpolating location and functions of the inverse distances between the interpolating location and locations of the surrounding sample. Shepard's IDW interpolation method [

7] seeks to find an interpolated value

*z* at a given point

x based on samples

*z*_{
i
}*= z(x*_{
i
}*)* for

*i = 0,1, ..., n*. This method can be written as:

$z(x)={\displaystyle \sum _{i=0}^{n}\frac{{w}_{i}(x)}{{\displaystyle {\sum}_{i=0}^{n}{w}_{i}(x)}}{z}_{i}}$

where ${w}_{i}(x)=\frac{1}{d{(x,{x}_{i})}^{p}}$ is the weighting function denoted by the inverse Euclidean distance between the interpolating point *x* and the neighbouring data point *x*_{
i
}, raised to the power *p*.

Modifications to the weighting function for interpolation are also suggested. One modification is Liszka's method [8], which can be expressed as ${w}_{i}(x)=\frac{1}{{(d(x,{x}_{i})+{\delta}^{2})}^{p}}$, where *δ*^{
2
}is a constant dependent on the measurement error to control the smoothness of interpolation. The interpolated surface becomes smoother with the increase of the tuning parameter *δ*^{
2
}. The other modification is the Lukaszyk-Karmowski metric, which is defined by ${w}_{i}(x)=\frac{1}{{D}_{RD}{(x,{x}_{i})}^{p}}$, where *D*_{
RD
}*(x,x*_{
i
}*)* is the probability metric of random values *x* and *x*_{
i
}assuming they have uniform and Dirac delta distribution respectively [9]. This approach is physically based, allowing the real uncertainty in the location of the sample point to be considered.

The IDW interpolation depends on the selection of the power value, *p*, and the neighbourhood search strategy. It is an exact interpolator, where the interpolated surface goes through the sample points and maximum/minimum values in the surface only occur at sample points. IDW assumes that the spatial process at the interpolating locations is being driven by the local variation only, which can be captured through the surrounding sample points. The output values therefore are sensitive to local spatial clustering and the presence of outliers. Health related variables that are determined by local sources of environmental risks, such as pollution from local industries, may fall into this category.

On the other hand, stochastic methods, such as kriging [

10,

11], interpolate values not only based on the surrounding data values, but also based on the overall autocorrelation calculated by applying statistical models to all the known data points. Because of this, not only do stochastic methods have the capability of producing a prediction surface, but they also provide some measure of the certainty or accuracy of the predictions. The kriging method can be similarly constructed as:

$\widehat{z}(x)={\displaystyle \sum _{i=0}^{n}{w}_{i}(x)z({x}_{i})}$

where the interpolated value

$\widehat{z}$ at point

*x* is the weighted sum of the neighbouring observed values

*z*_{
i
}*= z(x*_{
i
}*)* with weights

*w*_{
i
}*(x), i = 1, ..., n*, chosen such that the kriging variance is minimized, subject to the unbiasedness condition:

$E[\widehat{z}(x)-z(x)]={\displaystyle \sum _{i=0}^{n}{w}_{i}(x)\mu ({x}_{i})-\mu (x)}=0$

where *μ(x) = E[z(x)]* is the expected value of *z(x)*. This means that the interpolated value at a given location is the sum of two components: an unknown underlying surface defined by *μ(x)*, plus some additional noise.

Based on different assumptions on *μ(x)* and the unbiasedness condition for calculating the weights, different types of kriging apply, including simple kriging, ordinary kriging, universal kriging, IRFk-kriging, indicator kriging, disjunctive kriging, lognormal kriging, and co-kriging.

The most commonly used type of kriging is ordinary kriging. It assumes a wide sense stationary process with a constant but unknown mean, *μ(x) = μ*, over the study region. Similar with the concepts of the IDW interpolation, the weights, *w*_{
i
}*(x)*, decline as distances between the points at which the surface is being estimated and the locations of the data points increase. The difference is that, instead of using the local inverse distance function, the kriging method uses the globally calculated semi-variogram [12] to calculate the weights.

The semi-variogram is a function describing the degree of spatial dependence of a variable

*z(x)*. It is usually represented as a graph that models the autocorrelation among sample locations to identify how values vary with distance of separation. If there are enough observations and there is no directional effect, a semi-variogram can be empirically estimated by a semi-variogram cloud,

*γ(h)*, which is defined as

$r(h)=\frac{1}{2}\frac{1}{{n}_{h}}{\displaystyle \sum _{i=0}^{{n}_{h}}{(z({x}_{i}+h)-z({x}_{i}))}^{2}}$

where h is lag distance, *n*_{
h
}is the number of paired observations at the distance h, and z is the observed value at a particular location. The semivariance at lag distance h, *γ(h)*, is half the variance of the differences *z*(*x*_{
i
}+ *h*) - *z*(*x*_{
i
}), which is equivalent to the whole variance of z-values at distance *h*[13]. If a directional effect exists, the selection of paired observations to construct the semi-variogram is determined not only by their Euclidian distances but also by their spatial directions.

The semi-variogram can then be plotted against a number of lags. To ensure validity for kriging, the *γ(h)* are often approximated by model functions, such as exponential, spherical, and Gaussian functions. The parameters that affect a typical fitted semi-variogram function are nugget, range, and sill.

An appropriately fitted semi-variogram should be able to reveal the real scale-dependent spatial correlation. In addition to function and parameter selections, the choice of lag sizes and number of lags also affects the fit of the selected function to the empirical semi-variogram. Excessively large lag sizes may mask short-range autocorrelation while excessively small lag sizes may be associated with small sample sizes within each lag to achieve representative averages of *γ(h)*. The number of lags should also be selected carefully so that the range of the semi-variogram function is less than the maximum estimating distance, obtained by multiplying the lag size by the number of lags. If the range of the fitted semi-variogram model is small relative to the extent of the empirical semi-variogram, the lag size or number of lags can be reasonably decreased to reduce estimating time without sacrificing general accuracy.

Once the semi-variogram model is fit, it can then be used to calculate the weights, *w*_{
i
}*(x)*, for kriging. Since kriging is estimated using neighbourhood points of the predicting location, the selection of neighbourhood structure or number of neighbourhood points also has impact on the kriging results.

The assumption of ordinary kriging that the study region has a wide sense stationary process with a constant mean is largely consistent with current regional-level health status in developed countries, such as Canada. Due to the already advanced medical care and the fact that chronic diseases are the leading cause of mortality and morbidity, health status and life expectancies remain relatively stable in most developed societies. Neighbourhood differences in health-related status may be seen as spatially autocorrelated random effects over or below the average health status. It may therefore appropriate to use ordinary kriging for interpolating most of the CCHS variables. However, as mentioned earlier, for variables that are potentially influenced by one or several common sources of risks, a globally applicable semi-variance function of spatial dependency may not exist. In this case, local estimators, such as IDW, may be more appropriate to use. Values at unknown locations can be estimated locally using only nearby samples and distance decay factions. In addition, if a recognizable spatial trend exists, for example, a distance decayed environmental influence on health from a common source of risk or potential rural-urban differences for neighbourhood incomes, other kriging methods, such as universal kriging, may be applied to model the trend as a polynomial surface and perform kriging on the residuals. Thus, the selection of interpolation methods and parameters are determined by the spatial processes of specific variables. In this paper, the effects of the IDW and kriging interpolation methods and different choices of parameters are compared to seek well-performing solutions for CCHS data interpolation.

Using a cross validation method, which takes a data point out of the fitting and then predicts its value and compares the prediction to its actual value, the accuracy and unbiasedness of different interpolation methods or different selection of parameters of the same method can be compared. In particular, the root-mean-square prediction error, which quantifies the root mean square difference between predicted and measured values at sample locations, can be compared between different models as a way of choosing one model over another or adjusting parameter values. Since neighbourhood-level socio-economic status (such as income, education, and occupation) can be easily obtained from census data, it is also feasible to compare directly the interpolated CCHS socio-economic status variables, such as household income, with the corresponding variables from the Census to demonstrate the accuracy of the spatial interpolation using CCHS data or other potential data sources.

Once an effective interpolation method and its corresponding parameters are determined after comparison, a two-step procedure is taken to obtain desired neighbourhood-level variables. First, a smoothed surface with interpolated values at regular grid points over the study area is created using the selected method and parameters. Second, based on defined neighbourhood-level or small-area level boundaries, the interpolated surface is aggregated to obtain values for each neighbourhood or small area unit. Neighbourhoods often do not have clearly defined boundaries. Different people may perceive different geographical division of neighbourhoods based on their own experience. Neighbourhood boundaries are also changing over time based on changing social and spatial processes. In reality, most definitions of neighbourhood use proxy bureaucratic boundaries, such as census tracts or census dissemination areas, which are also changing from census to census. The inconsistency of neighbourhood boundaries over time is an obstacle, in health studies, to comparison, trend analysis, and prediction. The advantage of this two-step approach is that the interpolated surface in the first step allows data to be aggregated to any desired small area level units. Once neighbourhood boundaries are determined based on specific studies, data obtained at different periods of time can be interpolated and aggregated to the same areal units using the proposed steps, making temporal or longitudinal studies possible.

A case study is conducted in the city of Windsor and Essex County in Ontario, Canada to illustrate the procedures of determining the appropriate interpolation method and data handling processes, and to demonstrate the use of the interpolated results for analyzing neighbourhood impacts on health, specifically on low birth weight (LBW).