Geocoding accuracy and the recovery of relationships between environmental exposures and health

Background: This research develops methods for determining the effect of geocoding quality on relationships between environmental exposures and health. The likelihood of detecting an existing relationship (statistical power) between measures of environmental exposures and health depends not only on the strength of the relationship but also on the positional accuracy and completeness of the geocodes from which the measures of environmental exposure are made. This paper summarizes the results of simulation studies conducted to examine the impact of inaccuracies of geocoded addresses generated by three types of geocoding processes: a) addresses located on orthophoto maps; b) addresses matched to TIGER files (U.S. Census or their derivative street files); and c) addresses from E-911 geocodes (developed by local authorities for emergency dispatch purposes).

Results: The simulated odds of disease using exposures modelled from the highest quality geocodes could be sufficiently recovered using other, more commonly used, geocoding processes such as TIGER and E-911; however, the strength of the odds relationship between disease and exposures modelled at geocodes generally declined with decreasing geocoding accuracy.

Conclusion: Although these specific results cannot be generalized to new situations, the methods used to determine the sensitivity of results can be used in new situations. Estimated measures of positional accuracy must be used in the interpretation of results of analyses that investigate relationships between health outcomes and exposures measured at residential locations. Analyses similar to those employed in this paper can be used to validate interpretation of results from empirical analyses that use geocoded locations with estimated measures of positional accuracy.

geocoded data [1]. Attempts to establish relationships between environmental exposures and health depend on the accuracy of the geocodes. When health outcomes are sensitive to the magnitude of the exposures in question, any loss of accuracy can cause a loss in the ability to establish relationships between the two. Geocoding quality has become an issue in epidemiological and environmental health studies [1][2][3][4][5][6][7][8][9][10][11][12][13][14][15]. Different studies use different criteria to judge the quality of geocodes [1], although two measures of quality are widely recognized: positional accuracy and completeness in ascertaining a geocode for a given address. In most studies, the severity of these problems is related to the process that generates the geocodes.
The motivating question for the research in this paper is: how do errors in geocodes affect estimates of the relationship between environmental exposures and health outcomes? Statistical power in a model measuring the relationship between exposures and health is computed for different geocoding processes. The results are intended to help researchers decide whether a geocoding method under consideration in an environmental health study is adequate for risk assessment. A second motivating question asks whether it is possible to know the level of geocoding accuracy that is needed to establish the health risk of environmental contaminants in an area. We assume that the contaminant locations can be measured precisely and that the locations of persons exposed to the contaminants are subject to uncertainty. Our approach is similar to that taken by Rull and Ritz [16], who measured the loss in relationships between exposure and health outcomes due to exposure misclassification. We focus on a particular and common cause of exposure misclassification: the geometric inaccuracy of the geocodes. In this research, we analyze the positional inaccuracy of rural geocodes. There is evidence that rural geocodes are susceptible to larger inaccuracies than urban geocodes [4,10]. Geocoding inaccuracies are therefore a more pressing problem in rural geocoding than in urban geocoding, although the method described in this paper can easily be adapted to urban situations.

Overview
We use an experimental method to determine the effect of geocoding inaccuracy on the ability to recover relationships between environmental exposures and health. In our experiments, hypothetical risk models are used to simulate health outcomes for a given spatial pattern of environmental contaminants and a given spatial pattern of exposed individuals. For the given spatial pattern of contaminants, we generate health data for hypothetical individuals living at known address locations in Carroll County, Iowa. The address locations used to calculate the environmental contaminant values and subsequently generate the expected health outcomes are highly accurate geographic locations obtained by geocoding the residential structure corresponding to each address as recognized on a properly registered orthophoto map. This geocoding process is abbreviated as $G_o$. We then ask how this known relationship compares with estimated relationships between environmental exposures and health outcomes based on two other methods for geocoding the addresses. One method uses the emergency responders' geocoding process, $G_E$ (E-911 geocoding), and the other uses the well-known automated address-matching approach using TIGER line files from the U.S. Census ($G_T$, with and without offset). TIGER is an acronym for Topologically Integrated Geographic Encoding and Referencing. In the experiments described in more detail below, measures of exposures are degraded because of geocoding errors in the locations of individuals. The effect of these errors is assessed by examining the accuracy of the resulting odds ratio estimates. It is not our objective in this study to determine which geocoding process is optimal; such an analysis could be a natural extension of this work. In this study we develop methods to study the effect of geocoding inaccuracy on the relationships between environmental exposure and health, and we illustrate them using the three exemplar geocoding processes.
In the next section, we discuss the theoretical framework underlying our approach.

Theoretical Framework
While it is possible to apply the method outlined in this research to aggregated health/environmental data (e.g. aggregated at the level of the Census tract), we confine this discussion to the use of individual-level address data. We assume that the dataset consists of N unique addresses, with one individual resident at each address. Like Armstrong et al. [17], we let the N×5 matrix $X = [I, A, W, Z, L]$ denote the environmental epidemiological data, where $I_{N\times 1}$ is a vector of unique identifiers for each record and $A_{N\times 1}$ is a vector of corresponding addresses. The vector $W_{N\times 1}$ provides the health statuses for individuals, $Z_{N\times 1}$ gives the environmental exposures, either measured or modelled, and $L_{N\times P}$ contains other covariate information where P possible covariates are available. Following our earlier definition [1], a geocoding process G is used to assign geographic coordinates $(U_i, V_i)$ to the ith address, so that $G(A_i) = (U_i, V_i)$. Different geocoding processes could yield different coordinates for the same address. However, since measured/modelled exposures are assumed known at every location, we can define a GIS (Geographical Information Systems) model m that maps coordinates $(U_i, V_i)$ to exposure $Z_i$; i.e. $m(U_i, V_i) = Z_i$. Hence, the contaminant value is a function of the geocoding process G, since

$$Z_i = m(G(A_i)). \tag{1}$$

In addition, the expected health effect $E(W_i)$ can be modelled as a function of $Z_i$ and covariates $L_i$ as $E(W_i) = g(Z_i, L_i)$, where g( ) is often a linear or logistic regression model. A simpler approach is to model health outcomes as a function of the environmental contaminant only; i.e. $E(W_i) = g(Z_i)$. It can thus be seen that the model relating W to the contaminant is a function of the geocoding process as well:

$$E(W_i) = g(m(G(A_i))). \tag{2}$$

Note that given a function g, a GIS contaminant model m, and known $A_i$'s, the left-hand side of equation (2) can be simulated from the right-hand side.
$W_i$ can often be represented by a binary variable. For example, in population-based studies cases could be coded as 1s and controls as 0s. Alternatively, if the study design is a proportionate incidence or proportionate mortality study, then a certain ICD-9 (International Classification of Diseases, Version 9) code can be coded as 1 and all other ICD-9 codes as 0. In such instances, we can express $W_i$ as a Bernoulli random variable where

$$W_i \sim \mathrm{Bernoulli}(\pi_i)$$

and $\pi_i$ is the probability that individual i has the disease condition ($W_i = 1$).
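The probability function of $W_i$ is:

$$P(W_i = w_i) = \pi_i^{w_i} (1 - \pi_i)^{1 - w_i}, \quad w_i \in \{0, 1\}.$$

The relationship between $\pi_i$, $Z_i$ and $W_i$ is usually modelled using the logistic function as:

$$\ln\left(\frac{\pi_i}{1 - \pi_i}\right) = \beta_0 + \beta_1 Z_i. \tag{4}$$

The contaminant values Z are continuous, and the associated model parameter $\beta_1$ is interpreted as follows: every unit increase in exposure to the contaminant Z multiplies the odds of disease by $e^{\beta_1}$. The quantity $e^{\beta_0}$ is interpreted as the prevalence of disease among unexposed (Z = 0) individuals; strictly, it is the odds of disease among the unexposed, which approximates the prevalence when the disease is rare.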
From equations (1) and (4), we can write:

$$\ln\left(\frac{\pi_i}{1 - \pi_i}\right) = \beta_0 + \beta_1\, m(G(A_i)). \tag{5}$$

From (5), we see that for a given address $A_i$, relationship $\beta_1$, base prevalence $e^{\beta_0}$ and GIS model m, the probability of disease for a person varies as a function of the geocoding process G. Conversely, if by some means the exact probability of disease for a person at address $A_i$ were known, then the estimated disease odds for exposure, or $\beta_1$, would vary from one geocoding process to another. If the exact probabilities were calculated using a gold standard or exact geocoding process G, then the extent to which the odds ratio $e^{\beta_1}$ under G and the corresponding odds ratio estimated under another geocoding process G' agree would reflect the quality of the geocoding process G'. This odds ratio can thus be used as a means of exploring the quality of one geocoding process with respect to another.
With reference to equation (4), under the null hypothesis there is no relationship between exposure Z and health outcomes; that is, $\beta_1 = 0$. The odds ratio of disease for exposure is therefore 1.
Under the non-informative alternative hypothesis, the odds ratio of disease is different from one. While an exposure to contamination usually increases the odds of disease and we can expect the odds ratio to be greater than one, we allow for the possibility of it being less than one; i.e. a two-sided alternative hypothesis of

$$H_1: \beta_1 \neq 0 \quad (e^{\beta_1} \neq 1).$$

For a given alternative, sample size N, and Type-I error probability α, there is a direct relationship between statistical power 1 - B (where B is the Type-II error probability) and the variance of Z. Again from equation (5) (and (1)), if everything other than the geocoding process is kept fixed, then power varies with the type of geocoding used. If the 'exact probabilities' were calculated using a 'gold standard' or 'exact' geocoding process G and any other geocoding process were used to detect the relationship $\beta_1$, then power would vary from one geocoding process to another. Equations are available [18] for calculating sample size or power in the situation where Z is t or normally distributed. Unfortunately, environmental contaminants are rarely found to be distributed normally. As an alternative, simulation methods can be used to ascertain power. A simple procedure is followed in this paper: a) A known relationship between Z and W is simulated using exposures calculated at the gold-standard geocodes. b) All model parameters other than G remain constant, and an effort is made to estimate the relationship between Z and W. The extent to which the estimated relationship varies from the true relationship, as G varies, is a measure of the decline in the quality of G. In this study, we apply this procedure to a real situation occurring in Carroll County, Iowa. The three types of geocoding processes examined are typical of those that are used for counties in the Midwestern U.S.
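The simulation-based power procedure can be illustrated with a minimal sketch. It is not the R program used in this study: the Newton-Raphson fit, the two-sided Wald test at α = 0.05, the function names, and the toy exposure data are all assumptions made for illustration.

```python
import math
import random


def fit_logistic(z, w, iters=15):
    """Fit logit(pi) = b0 + b1*z by Newton-Raphson; return (b0, b1, se_b1)."""
    b0 = b1 = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for zi, wi in zip(z, w):
            eta = max(-30.0, min(30.0, b0 + b1 * zi))  # guard against overflow
            p = 1.0 / (1.0 + math.exp(-eta))
            v = p * (1.0 - p)
            g0 += wi - p            # score for b0
            g1 += (wi - p) * zi     # score for b1
            h00 += v                # information matrix entries
            h01 += v * zi
            h11 += v * zi * zi
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1, math.sqrt(h00 / det)


def simulated_power(z_true, z_observed, beta0, beta1, n_sims=200, seed=1):
    """Share of simulations in which a two-sided Wald test rejects beta1 = 0."""
    rng = random.Random(seed)
    probs = [1.0 / (1.0 + math.exp(-(beta0 + beta1 * z))) for z in z_true]
    rejections = 0
    for _ in range(n_sims):
        # a) simulate outcomes from exposures at the gold-standard geocodes
        w = [1 if rng.random() < p else 0 for p in probs]
        # b) re-estimate the relationship with exposures from another geocoding process
        _, b1_hat, se = fit_logistic(z_observed, w)
        if abs(b1_hat / se) > 1.96:
            rejections += 1
    return rejections / n_sims


# Toy data (hypothetical): skewed exposures and a noisy copy standing in for degraded geocodes.
toy_rng = random.Random(0)
z_gold = [toy_rng.expovariate(1.0) for _ in range(400)]
z_degraded = [z + toy_rng.gauss(0.0, 1.0) for z in z_gold]
print(simulated_power(z_gold, z_gold, math.log(0.075), 0.6, n_sims=50))
print(simulated_power(z_gold, z_degraded, math.log(0.075), 0.6, n_sims=50))
```

Holding the simulated outcomes fixed and changing only the exposure vector passed to the fit mirrors the idea that everything other than G is kept constant.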

Address Geocoding
The data we wish to develop consist of residential sites and associated contaminant values. Three geocoding processes were used to develop these datasets:

a) Address-matching using TIGER line files ($G_T$)
These are geocodes in which addresses are matched to Census street centerline files. Centerline files are produced by the U.S. Census and were available to us from the website of ESRI (Environmental Systems Research Institute) [19]. For this research, Census 2000 TIGER line files are used. Addresses were matched to the street centerline files using the GIS package ArcGIS 9.1 [20]. TIGER geocodes are placed by the software on the centerline by interpolating the location on the basis of the street address. An end offset of 3% and side offsets of 0, 200 (60.96), 400 (121.92), 600 (182.88) and 800 (243.84) feet (meters) were applied to the TIGER geocodes. Throughout this paper we refer to the TIGER geocoding process $G_T$ as one process which includes geocoding with and without offsets.
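As an illustration of address-range interpolation with end and side offsets, the following sketch places a house number along a straight street segment. It is a generic reading of the technique, not the ArcGIS algorithm: the function name, the way the end offset squeezes positions away from the segment ends, and the explicit `side` argument are assumptions.

```python
import math


def interpolate_address(house_no, from_no, to_no, start, end,
                        end_offset_frac=0.03, side_offset=0.0, side="left"):
    """Place an address on a street segment by linear interpolation (planar coordinates)."""
    # Fraction of the way along the address range (midpoint if the range is degenerate).
    t = 0.5 if to_no == from_no else (house_no - from_no) / (to_no - from_no)
    # Assumed end-offset rule: keep points away from both ends of the segment.
    t = end_offset_frac + t * (1.0 - 2.0 * end_offset_frac)
    (x0, y0), (x1, y1) = start, end
    dx, dy = x1 - x0, y1 - y0
    length = math.hypot(dx, dy)
    x, y = x0 + t * dx, y0 + t * dy
    # Unit normal pointing to the left of the direction of travel.
    nx, ny = -dy / length, dx / length
    sign = 1.0 if side == "left" else -1.0
    return x + sign * side_offset * nx, y + sign * side_offset * ny


# House number 150 in the range 100-200 on a 1,000 ft east-west segment, 200 ft side offset.
print(interpolate_address(150, 100, 200, (0.0, 0.0), (1000.0, 0.0), side_offset=200.0))
```

The interpolated point can differ substantially from the true residence when houses are unevenly spaced along a road or set far back from it, which is the kind of error examined here.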

b) E-911 geocoding ($G_E$)
E-911 geocodes are a promising means of accurately geocoding rural addresses [21]. For the purpose of emergency services dispatch, all discrete addresses are geocoded so that they may be located in response to a 911 telephone call requesting assistance. In geocoding addresses in this Iowa county, this location was defined as that which would most enable an emergency responder to find the person who had requested the service. Specifically, the location is the geographic coordinates at which the emergency responder would leave the public road and join the private road leading up to the property from which the call was made. These geocodes were obtained as a GIS layer file from the Carroll County GIS coordinator. The data are current as of June 2006. No offsets are used with the E-911 geocodes.

c) Orthophoto map-based geocoding ($G_o$)
Using visual identification, the E-911 rural addresses were 'enhanced' to a location centered on the residence location related to the address. This task was accomplished with the aid of orthophoto maps of the study area at resolutions of 6 inches (15.2 cm) per pixel and two feet (61 cm) per pixel, current as of 2002. Figure 1 displays the locations of the geocoded addresses over Carroll County. This dataset was provided by the Carroll County GIS office. A GIS data layer indicating the parcel to which a particular property belonged (and which is used by the county assessor's office for tax assessment) was overlaid on the orthophoto map and E-911 address layers, to confirm that the geocode was being assigned to the correct address in the few cases when visual identification could not unambiguously identify the E-911 rural address with the related property.
Parcel geocoding was not considered a reliable geocoding method in these analyses. The median parcel size for the properties of both farm and non-farm residences in rural Carroll County is 1,618,703 square feet or 179,856 square yards (150,382 square meters), so that a geocode placed at the centre of a square parcel of this size would have a median error of approximately 671 feet (204 m). This error can be reduced with the help of ancillary knowledge of the location of the residence. Since the likely source of this knowledge would be an orthophoto image, anyone possessing this source would be better advised to extract the location of the residence as we have done in this work. Figure 2 illustrates these geocodes. It shows the seven locations that we consider for an address. The geocode on the public road leading to the property is the E-911 location and the geocode on the residence is the orthophoto geocode. In this case, the two geocodes are approximately 550 feet (168 m) apart. The TIGER geocodes, of which there are five, have varying degrees of accuracy in this example, with some of the TIGER offset geocodes having better accuracy than E-911.
We started with a comprehensive dataset of 2,516 addresses representing all rural addresses in Carroll County. All addresses that are located outside the legal (incorporated) boundaries of towns are considered rural. For each address an E-911 geocode is available. The E-911 geocodes therefore have 100% completeness. Since the orthophoto geocodes are enhanced from the E-911 geocodes, all addresses have an orthophoto geocode. Of the 2,516 addresses, 14 were found to be duplicates and eliminated. A further 69 addresses were found to have been erroneously coded as rural and were removed. The remaining 2,443 addresses were geocoded to TIGER street centerline files. A minimum match score of 100% was used and no manual interactive matching was used, because the purpose of this research is to show the effects of typical differences in locations between "perfectly geocoded" residences according to currently accepted geocoding processes (automated TIGER, E-911) and ground truth as exemplified by the orthophoto-determined locations. Of the 2,443 addresses, 1,581 were geocoded with a 100% match score to the TIGER street centerline files, indicating a match rate of 64.7%. Our results represent a conservative view of the difference between TIGER, E-911 geocoded locations and ground-truth locations. Clearly, addresses that could not be geocoded accurately from the TIGER file would represent a systematically larger error than those studied here, and bias would be introduced by any attempt to interactively geocode the unmatched addresses.

Figure 1. Locations of the geocoded rural addresses in Carroll County, Iowa. Legend: orthophoto geocoded locations; streets.
This research thus utilizes the 'incomplete' [22] set of 1,581 addresses. For each of these 1,581 addresses, three geocodes (E-911, orthophoto and TIGER) are available. The next step is calculating contaminant values (Z). These are calculated using the geocodes and a GIS model m.

Contaminant value calculation
In this research we utilize CAFOs (Concentrated Animal Feeding Operations) as the disease-causing contaminant source. CAFOs have been suspected as possible sources of disease-causing effluents in rural areas of the U.S. [23,24]. Exposure to air from swine CAFOs has been suspected to increase the risk of eye irritation, headaches, nausea and a variety of respiratory and gastrointestinal disorders [23,25,26]. CAFO air is considered to hold elevated levels of H2S, ammonia and suspended particles. Very few studies have attempted to look at the health effects of CAFOs, making them an interesting source of pollution to study.
In this study we attempt to work with the relationship between these contaminants and asthma. The study can be generalized to any other respiratory disorder like asthma that has an odds elevation and disease base prevalence similar to the ones assumed in this study, and that has a suspected relationship with one or more of the contaminants.

Figure labels: TIGER geocodes with offsets; property for the address being geocoded.
The locations of 55 CAFOs in Carroll County for which permits had been issued by the state were obtained as a GIS layer file. A plume dispersal model based on AERMOD (AMS/EPA Regulatory Model) [27] was used to model the contaminant dispersed from each CAFO. The contaminant modelled is a generic "conservative" contaminant, which means that the contaminant is non-reactive in nature. Our model can therefore apply to any and all of H2S, ammonia and suspended particles. The model is a Gaussian dispersal model which accounts for prevailing wind direction. The input variables to the model are the wind direction, the wind speed and the height of the stack. Meteorological data, averaged over five years, are from the National Weather Service Station at Sioux Falls, while the height of the stack is approximated at 5 meters. The model was used to determine time-averaged (five years) relative concentrations of an air contaminant dispersed from a CAFO. The model was realized with a combination of MATLAB [28] and Excel VBA (Visual Basic for Applications) [29] programs. The MATLAB program calculates the plume from a single CAFO and outputs the result as a digital file containing a 25 meter fine grid over a 1 kilometre square CAFO pollution plume footprint. The contaminant value at each grid point is provided in the digital file. The Excel VBA program uses this plume output together with the locations of CAFOs and geocoded addresses to calculate the contaminant value at each address location. This program can calculate the contaminant value at any location in the County, be it an address location or any other chosen location. The resulting table of contaminant values at each geocoded address is the input data for the simulation step discussed in the Simulation section below.
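The lookup of contaminant values from a single-source plume grid can be sketched as follows. This is not the MATLAB/VBA implementation: the plume here is a toy distance-decay kernel rather than AERMOD output, and centring the footprint on each CAFO, nearest-cell lookup, and summing contributions across CAFOs are assumptions made for illustration.

```python
import math

CELL = 25.0  # grid spacing in metres, as stated in the text


def toy_plume(n=41):
    """Hypothetical stand-in for the dispersion output: an n x n grid centred on the source."""
    c = n // 2
    return [[1.0 / (1.0 + math.hypot((col - c) * CELL, (row - c) * CELL))
             for col in range(n)] for row in range(n)]


def contaminant_at(location, cafos, plume):
    """Sum the plume value of every CAFO at one location (zero outside a footprint)."""
    n = len(plume)
    c = n // 2
    total = 0.0
    for cx, cy in cafos:
        # Nearest grid cell for the offset between the location and this CAFO.
        col = c + round((location[0] - cx) / CELL)
        row = c + round((location[1] - cy) / CELL)
        if 0 <= row < n and 0 <= col < n:
            total += plume[row][col]
    return total


plume = toy_plume()
cafos = [(0.0, 0.0), (100.0, 0.0)]          # hypothetical CAFO coordinates in metres
print(contaminant_at((50.0, 0.0), cafos, plume))
```

Because the same function is evaluated at whichever coordinates a geocoding process returns, it plays the role of the GIS model m in the framework above.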
For the purposes of visualization, contaminant values were also computed for a 50 meter fine grid and the values were contoured in ArcGIS [20] to produce a surface representation. A small part of the resulting map is shown in Figure 3. In the next section we discuss the computer simulation which generates the modelled relationships and tests their strength in the presence of geocoding error. The simulation was performed using the R statistical software on a standard Pentium desktop.

Simulation
The simulation methodology consists of the following 8 steps: 1) Assume that one individual resides at each address. Simulate probabilities of disease $\pi_{N\times 1}$ for N = 1,581 individuals using equation (5) and a specific geocoding process G, as:

$$\pi_i = \frac{\exp(\beta_0 + \beta_1 Z_i)}{1 + \exp(\beta_0 + \beta_1 Z_i)}, \quad Z_i = m(G(A_i)).$$

We take $\beta_0 = \ln(0.075)$. This implies that the simulated prevalence of disease among unexposed individuals is approximately 7.5% (odds of 0.075). Further take $\beta_1 = \ln(1.2)/\mathrm{Interdecile}(Z)$, where Interdecile(Z) is the difference between the 90th and 10th percentiles of Z. This implies that a person at the 90th percentile of the contaminant distribution Z has an odds ratio of 1.2 compared to a person at the 10th percentile of the contaminant distribution. The 7.5% disease prevalence is consistent with reported population estimates for asthma, and the exposure odds ratio value of 1.20 is consistent with available risk estimates [30][31][32].
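The parameterisation in step 1 can be checked with a short sketch. The function names and the linear-interpolation percentile rule are assumptions; the base odds of 0.075 and the odds ratio of 1.2 across the interdecile range are taken from the text.

```python
import math


def percentile(values, q):
    """Percentile by linear interpolation between order statistics, q in [0, 100]."""
    s = sorted(values)
    pos = (len(s) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def disease_probabilities(z, base_odds=0.075, odds_ratio=1.2):
    """Return (beta0, beta1, probabilities) for exposures z under the logistic model."""
    interdecile = percentile(z, 90) - percentile(z, 10)
    beta0 = math.log(base_odds)
    beta1 = math.log(odds_ratio) / interdecile
    probs = [1.0 / (1.0 + math.exp(-(beta0 + beta1 * zi))) for zi in z]
    return beta0, beta1, probs


# Hypothetical exposures 0..100, so the 10th and 90th percentiles are 10 and 90.
beta0, beta1, probs = disease_probabilities(list(range(101)))
odds = [p / (1.0 - p) for p in probs]
print(odds[90] / odds[10])   # odds ratio across the interdecile range
print(probs[0])              # probability of disease at Z = 0
```

With these parameters the odds at Z = 0 equal 0.075, so the corresponding probability is 0.075/1.075, slightly below 7.5%.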

Results
We define a geocoding error as the distance between the orthophoto geocode and another geocode (E-911, TIGER) for a given address. Analyses of the TIGER ($G_T$) and E-911 geocoding errors showed a median difference of 693 feet (211.23 m) for TIGER geocodes and 151 feet (46 m) for E-911 geocodes. Table 1 summarizes the errors between the orthophoto geocodes and other geocodes. Note that the median error seems to be minimized at around a 400 feet (122 m) offset for the TIGER geocodes. The largest errors with TIGER geocoding are in the range of 8 miles, caused by addresses in one part of the county being wrongly matched to a TIGER line file in another part of the county. These matching errors can be contrasted with the more frequent, but smaller, offset errors. [3]. Contaminant values at E-911 geocoded locations and orthophoto map-based locations of addresses were highly correlated (Table 2). Figures 4 and 5 display the variation in errors with contaminant values. Note that while both E-911 (Figure 4) and TIGER (Figure 5) geocoding have larger errors with increasing contaminant values, errors at smaller values seem to be more pronounced with TIGER geocoding. The outliers are addresses that are erroneously geocoded closer to the CAFOs than their true location. In fact, these figures demonstrate that TIGER geocoding errors tend to introduce a pronounced positive bias in the contaminant values at address locations.
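The error measure defined above can be computed as in the following sketch, which assumes that both sets of geocodes are given in a projected coordinate system in metres; the function name and the sample coordinates are hypothetical.

```python
import math
import statistics

FEET_PER_METRE = 3.28084


def geocoding_errors(reference, candidate):
    """Planar distance between the reference geocode and another geocode, per address."""
    return [math.hypot(xr - xc, yr - yc)
            for (xr, yr), (xc, yc) in zip(reference, candidate)]


# Hypothetical coordinates for three addresses (metres).
orthophoto = [(0.0, 0.0), (0.0, 0.0), (0.0, 0.0)]
e911 = [(3.0, 4.0), (0.0, 10.0), (6.0, 8.0)]
errors = geocoding_errors(orthophoto, e911)
median_m = statistics.median(errors)
print(median_m, median_m * FEET_PER_METRE)
```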
One exploratory method of comparing the effect of errors in contaminant values from geocoding errors is to calculate the attenuation of odds ratios [16]. The odds of disease at a geocoded location for an address can be calculated as a function of the contaminant value as, for example, $\exp(\beta_0 + \ln(1.2)\, Z_O/\Delta)$, where $Z_O$ is the contaminant value calculated using the orthophoto geocode for an address and Δ is the interdecile range of $Z_O$. Similarly, the odds value $\exp(\beta_0 + \ln(1.2)\, Z_E/\Delta)$ would represent the odds calculated using the E-911 geocode. $Z_O - Z_E$ would represent the bias or error in calculating the contaminant value, and this bias would in turn affect the ratio of the two odds, $\exp(\ln(1.2)\,(Z_O - Z_E)/\Delta)$. The bias introduced by the error in contaminant values from geocoding inaccuracies could cause the odds of disease to be either greater or less than what we would expect if the geocodes were accurate and there were no modelling error in the contaminant values. The ratio of odds calculated in the no-error situation to that calculated with error would be 1 if this error were equal to zero, or so small that the ratio is equal to 1 when rounded to two decimal places. To study the extent of the bias, odds ratios were calculated as odds(disease | $Z_O$)/odds(disease | $Z_E$) and odds(disease | $Z_O$)/odds(disease | $Z_T$). The results can be seen in Figures 6 and 7. In either of these figures, the more unbiased a given geocoding process is, the more we would expect the data points to cluster at odds ratio (OR) = 1. A larger proportion of the odds were unbiased (OR = 1) with E-911 geocoding (80.00%) than with TIGER geocoding with 0 offset (59.00%). Adding offset to the TIGER geocodes did not substantially improve the proportion of unbiased odds, with the mean being around 60%. There was also a larger bias towards detecting an association (OR < 1) with TIGER geocodes (≈20%) than with E-911 geocodes (10%). This is consistent with the observations made earlier from Figures 4 and 5.
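The odds-ratio comparison can be sketched as follows. Because the intercept cancels in the ratio, only the contaminant difference and the interdecile range are needed. The function names and toy values are assumptions; the rounding to two decimal places follows the text.

```python
import math


def odds_ratios_vs_reference(z_ref, z_other, delta, odds_ratio=1.2):
    """odds(disease | z_ref) / odds(disease | z_other) for each address."""
    beta1 = math.log(odds_ratio) / delta
    return [math.exp(beta1 * (zr - zo)) for zr, zo in zip(z_ref, z_other)]


def share_unbiased(ratios):
    """Proportion of addresses whose ratio equals 1 after rounding to two decimals."""
    return sum(1 for r in ratios if round(r, 2) == 1.00) / len(ratios)


# Hypothetical contaminant values at orthophoto and alternative geocodes; delta = 2.
z_orthophoto = [1.0, 2.0, 3.0, 5.0]
z_alternative = [1.0, 4.0, 1.0, 5.001]
ratios = odds_ratios_vs_reference(z_orthophoto, z_alternative, delta=2.0)
print(ratios, share_unbiased(ratios))
```

A ratio below 1 arises when the alternative geocode yields a higher contaminant value than the orthophoto geocode, and a ratio above 1 when it yields a lower one.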
We tested the robustness of the simulation by changing the value of the simulated odds. The results of this sensitivity analysis are summarized in Table 3 and Table 4. This was done for the 100% sample of 1,581 address locations. Different values of simulated odds do not cause large differences in bias. The simulation program was tested with odds values of 1.01, 1.15, 1.2, 1.5 and 2.0 (with 1.2 being the value used in our main analyses). All these odds were recovered with reasonable accuracy by the geocoding processes (Table 3). However, as we might expect, varying the odds value does have an effect on power (Table 4). An odds of 1.01 is successfully detected in only around 5% of the simulations by the various geocoding processes. In contrast, an odds of 2.00 is detected with a power of 100%.
The relationships (odds) are recovered with almost no error across different sample sizes and geocoding processes (Table 5). For a given sample size, the power is greater when E-911 geocodes are used than when TIGER geocodes are used. TIGER geocoding provides very low power for the most part and needs more than twice the sample size of E-911 or orthophoto geocoding to achieve the same power (Figure 8). The biased contaminant values (Figures 4, 5, 6, 7) at the TIGER geocodes contribute to this result. Table 6 compares odds recovered with varying TIGER offsets. Note that adding offset to the TIGER geocodes does increase power. The best power is obtained by using offsets of 400 (121.92), 600 (182.88) or 800 (243.84) feet (meters), and the differences in power between the three are small. This can partly be explained by the fact that the median TIGER error is around 700 feet (213 m). It is therefore possible that the 'optimal' TIGER offset is around this value. The higher offsets also result in an odds ratio which is slightly biased upward relative to the true odds ratio. This could be because adding offset to the TIGER geocodes moves the address locations closer to the CAFOs, which are almost always located at an offset from the main street.

Discussion
This paper investigated the degree to which the recovery of a known relationship between environmental exposure and health is affected by the geocoding quality of the subjects of the research. Power analyses showed that the quality associated with different geocoding processes affected the ability to recover the relationships. As with all power analyses, the size of the sample as well as the variability in the contaminant surface and the location of the sample in relation to this surface also affected the ability to recover the relationship. Because state or local regulations often control the locations of CAFOs relative to the residences of people, the number of people living in areas of high exposure to CAFO contaminants is limited which, in turn, limits the ability to detect health effects in natural experiments as in this research [33]. Another limitation of this study is the spatial structure of the contamination surface. The structure of the surface we use is limited to the source of pollutants and the GIS model. A different source and a different model would result in a surface with a different structure. Thus the results obtained in these analyses are specific to the particular contaminant examined. Nevertheless, the methods used in this paper can be used for any contaminant surface of interest. The generality of the results described here lies in the methods of conducting the kind of analyses we have described rather than the specific results.

Figure 4. Relationship between error in contaminant values at TIGER geocodes with true contaminant value.
The methods used in this paper can be adapted to other situations where the effect of environmental contaminants on health is the subject of study. Because linked social-spatial data [34][35][36] increase the risk of identifying the subjects of the research, institutions often limit the quality of the geocoding in order to mask the identity of the subjects. Such masks will severely limit the ability to recover relationships between contaminant exposures and health, especially when the health effects of such contaminants are sensitive to changes in short distances from the sources of exposure [17]. Pursuing such research in rural areas is doubly difficult because the most commonly used spatial mask which moves the location of the respondent from their true location to a masked location is most effective when used in urban areas where the number of other people with whom the respondent could be linked by location is large [35]. Also, large inaccuracies often occur in some geocoding processes in rural areas. It is easier to capture the variability in a contaminant surface in an urban area with the relatively dense settlement pattern of people there.
Our results suggest that studies of relationships between environmental contaminants and health may be better designed by using spatial sampling procedures that identify locations of residences that equalize the number of subjects for different estimated levels of the contaminant load. Random samples of subjects are unlikely to have such characteristics and power analyses based on such samples will be less effective. With the widespread availability in the U.S. and elsewhere of E-911 or similar master address lists, and the availability as in this study of spatially modelled contaminant surfaces, determining such spatially stratified random samples that parsimoniously identify respondent locations will improve the quality of analyses of effects of contaminants on health.
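One way to draw a sample that equalizes subjects across contaminant levels is sketched below. The equal-width exposure bins, the function name, and the toy exposures are assumptions; the text does not prescribe a specific sampling algorithm.

```python
import random


def exposure_stratified_sample(exposures, n_levels=4, per_level=5, seed=0):
    """Return indices of a random sample with (up to) per_level subjects per exposure level."""
    rng = random.Random(seed)
    lo, hi = min(exposures), max(exposures)
    width = (hi - lo) / n_levels or 1.0   # fall back to 1.0 if all exposures are equal
    levels = [[] for _ in range(n_levels)]
    for i, z in enumerate(exposures):
        k = min(int((z - lo) / width), n_levels - 1)
        levels[k].append(i)
    sample = []
    for members in levels:
        # Sparse high-exposure levels contribute everyone they have.
        sample.extend(rng.sample(members, min(per_level, len(members))))
    return sample


# Hypothetical modelled exposures: most residences have low contaminant values.
exposures = [0.1] * 80 + [3.0] * 10 + [6.0] * 6 + [9.9] * 4
print(sorted(exposure_stratified_sample(exposures)))
```

In contrast to a simple random sample, which would be dominated by low-exposure residences, this design retains the few high-exposure residences that carry most of the information about the exposure-response relationship.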
A common problem faced by researchers of this subject is that they cannot know a priori whether the quality of the geocoding process they have used is adequate for the purpose of finding a relationship between contaminant values and health. This study is a model of how they might proceed to determine the ability of their proposed research to determine the health effects of the contaminant they are studying, by performing the same experiments described in our study. In these experiments they would control the size of the sample, the location characteristics of their sample, and the degradation of the geocoding quality of the locations they examine. Some of the studies of geocoding quality include maps of expected geocoding error rates. These too, when available, can be incorporated in these experiments. We expect that software that automates such experiments will become available in the future. It is needed and could be produced.

Relationship between error in contaminant values at E-911 geocodes with true contaminant value.
Analyses to predict the ability to detect relationships between contaminant values at given locations and health will generally need to incorporate known demographic covariates that are also predictive of a health effect. Power analyses can be designed to incorporate covariates. A recurring question in geographic information science is whether particular geospatial databases are sufficiently accurate for the purpose for which they are used. Determining the "fitness-for-use" of a geospatial data set is difficult and has been the subject of research in GIScience [37][38][39][40][41][42]. An interesting case in point is a study by Lewis et al. [43], which estimated the effect of road traffic exposures on the prevalence of asthma in a sample of 11,562 UK children. The geocode used was the UK postal code, which places each child in a relatively large area; the distance to the nearest main road was computed from the spatial centroid of that area. Because of the errors in these estimates of distance from the child's home to the nearest main road, errors in exposure estimates were large, and probably large enough to call into question the study's conclusion that asthma prevalence was not associated with proximity of the home to a main road.
Although spatial databases are becoming more accurate as GIS technology improves and efforts are made to improve the accuracy of geographic base maps, it is accepted that no single level of accuracy will meet the requirements of every purpose for which spatial data is used. For each use, there are accuracy requirements and the question we asked is which of three widely used measures of location is adequate for the purpose of assessing whether a relationship exists between exposure to environmental contaminants and health. While research in geocoding accuracy and environmental health problems has often focussed on the effect of inaccuracies on an observed prevalence or relationships [14,44,45], this is to our knowledge the first time the effect of geocoding inaccuracies on assessing the strength of an existing relationship has been addressed. Consideration of such inaccuracies in epidemiologic studies of environmental exposures can greatly improve confidence in the validity and accuracy of results.

Figure 6 Variation in odds ratios in simulated disease from exposure to contaminant calculated to Orthophoto geocode and exposure to contaminant calculated to TIGER geocode, with error in contaminant calculation at a TIGER geocode.

Conclusion
An experimental method to investigate the effect of geocoding accuracy is proposed in this paper. The method of accuracy assessment takes into consideration the 'purpose of use' of the geocodes in an environmental health context. Since a goal of such research is to examine relationships between health and exposure, the proposed method focuses on estimation of disease risk in the presence of modelling errors introduced through geocoding inaccuracies. We examine three widely used geocoding processes. Health data are simulated using known odds from exposure to a contaminant. The contaminant values are calculated using a gold standard geocode. These odds are then detected using contaminant values calculated using two other (apart from the gold standard) geocodes. Of the three geocoding processes studied, all were successfully able to recover the simulated odds, though the strength of the relationship varied from process to process. In these analyses E-911 geocoding came out superior to TIGER geocoding (with and without offset). More research is required to decide on an 'optimal geocode', since we have not evaluated all possible offsets of TIGER geocoding, E-911 with offsets and other geocoding processes such as GPS based or parcel based geocoding. Sensitivity analyses show relative robustness of the model at recovering the simulated odds.

Figure 7 Variation in odds ratios in simulated disease from exposure to contaminant calculated to Orthophoto geocode and exposure to contaminant calculated to E-911 geocode, with error in contaminant calculation at an E-911 geocode.
While the specific results obtained in this research may not be generalized to other situations, the method can be applied in any situation where issues of geocoding accuracy are in question in an environmental health context.