Study Data Sources
The data used for this study consisted of address data from five different sources. Three came from two ongoing epidemiological case-control studies of risk factors for disease. The first was a study of Parkinson's Disease (PD), which contributed 17,471 addresses; cases had been diagnosed with PD, and control subjects were recruited using mailings based on tax-assessor parcel records. The second was a study of prostate cancer, which contributed 2,417 addresses; cases came from the population-based cancer registry, and controls again came from tax-assessor record-based mailings. The prostate cancer subjects provided both residential and occupational historical datasets, containing 1,944 and 473 addresses, respectively. In both the PD and prostate cancer studies, the purpose of obtaining geocodable data was to determine proximity to areas of pesticide and other environmental exposures, so obtaining accurate geocodes was essential. In both studies, participants were interviewed to obtain lifetime residential histories and, in the prostate study, lifetime occupational address histories as well. These data are from individuals living in Kern, Tulare, and Fresno counties within the State of California, representing the full spectrum of the rural-urban continuum, as well as areas with varying population densities and mixes of residential and commercial locations.
The two remaining data sources were address lists of 418 hospitals and 2,011 radiation treatment centers in the State of California. These were being geocoded for use in a variety of studies assessing the effect of distance to service provider on health outcomes (e.g., whether living farther from a radiation treatment center is associated with a later stage of cancer at diagnosis) and would also be useful for such tasks as emergency routing. The hospitals and radiation treatment centers and their addresses were obtained from a comprehensive state-wide listing used for hospital registration purposes [44].
The hospital data are representative of large commercial health care facilities. These locations typically consist of one or more large buildings (and associated parking lots) on large campuses and/or parcels of land. The radiation treatment center data are representative of small commercial health facilities. These locations range from large standalone treatment facilities similar to (or contained within) the large hospitals in the Hospitals dataset to small retail-type offices located in standalone buildings or commercial strip malls.
Overall, these data sources are representative of those used in many health GIS studies. The data from the case-control studies are typical of those used in epidemiological studies of disease risk factors, where participants have the opportunity to describe their residential locations. By contrast, the hospital and radiation facility data come 'as is', but are useful as proxies for geographical measures of access to care [45–47].
Study Data Quality
The address information contained within each of the datasets used for this study varied in quality. For this study we defined the quality of the address data within a dataset in terms of how well the whole set of data could be geocoded. We defined "very good quality" to be locational data in which each record corresponded to postal address data that included the standard postal address fields such as street address, city, state, and United States Postal Service (USPS) ZIP code, e.g., "3620 S. Vermont Ave, Los Angeles, CA, 90089". If the street address was additionally parsed into its separate components (with "3620", "S.", "Vermont", and "Ave" representing the number, pre-directional, name, and suffix portions of the postal street address, respectively), the assessment was upgraded to "excellent".
The quality of the data was considered "good" when the information contained in the street address portion of each record included non-postal address data such as named facilities, e.g., "Cardinal Gardens", county names, and relative locational descriptions, e.g., "just down the road from Exposition Park". The assessment was downgraded to "poor" for those cases in which data transpositions occurred, data elements were omitted, and/or completely invalid data began to appear in place of an address. The data were considered "very poor" when the address data were not separated into sub-components and some or most of the locational data did not describe the locations at all.
Each of the datasets was characterized as one of these classes based on the addresses it contained. The quality of the hospital and radiation treatment center address data can be characterized as very good because all records contain a full (non-parsed) postal street address, city, state, and USPS ZIP code. The quality of the PD address data source can be characterized as poor because, although 64% of the records include data in the street address fields, the address data in these fields are quite poor. The quality of the prostate residential address data can be characterized as fair because over half (56%) of the records include a full street address, and the quality of these address data is also fair. The prostate occupational data can be characterized as very poor because none of the records contained any information in any of the address fields; instead they contain a single attribute called "AddrNotes", which can best be described as a free-text description of where the individual was employed. Roughly 70% of these descriptions pertain to an address in the form of just a business name and city.
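Completeness figures of the kind quoted above (for example, the share of records with data in the street address fields) can be computed with a short script. The following is a minimal sketch, assuming each record is a dictionary and that the street address is stored under a hypothetical key named `street`; neither assumption reflects the actual schema of the study databases.

```python
from typing import Iterable, Mapping, Optional


def street_address_completeness(
    records: Iterable[Mapping[str, Optional[str]]],
    field: str = "street",  # hypothetical field name, not from the study
) -> float:
    """Return the fraction of records whose street field is non-blank."""
    rows = list(records)
    if not rows:
        return 0.0
    # A field counts as populated only if it holds non-whitespace text.
    filled = sum(1 for row in rows if (row.get(field) or "").strip())
    return filled / len(rows)
```

Note that this measures only whether a field is populated. Judging whether the populated values are usable addresses, as in the class definitions above, still requires inspection of the data.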
Study Batch Geocoding Engine
All records were initially geocoded with a geocoding engine built and maintained by the USC GIS Research Laboratory. This service is hosted at USC [43] and is freely available to any researcher who wishes to use it. The exact details of the geocoder configuration can be found on the USC site [43], and we will only provide a brief overview of the main components here.
The implementation of the geocoder used for this study performed strictly deterministic feature matching with attribute relaxation using the following reference data sources: 2005 US Census Bureau TIGER/Line files [48]; US Census Bureau Cartographic Boundary Files [49], including the County, Minor Civil Division, Place, and ZIP Code Tabulation Area (ZCTA) layers; and Los Angeles County Assessor's parcel boundaries (LA Assessor Files) [50]. The linear interpolation techniques described in [51] were used for the TIGER/Line street segments, Green's Theorem was used to obtain the geometric center of parcels, and the geographic centroids of the US Census Bureau Cartographic Boundary File features (County, etc.) were used directly. The address parser used to identify the components of an input address was not USPS CASS-certified. Because of the advanced feature matching, feature interpolation, and additional reference data sources implemented in the USC geocoder, the possible geocode qualities used in this study are based on an augmented version of the NAACCR GIS Coordinate Quality Codes [52], shown in Table 1.
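The two geometric steps named above can be illustrated briefly. The sketch below is not the USC geocoder's code. It assumes planar coordinates, a straight street segment with a known address range, and a simple (non-self-intersecting) parcel ring; the function names and parameters are hypothetical. The centroid uses the standard shoelace formulas, which follow from applying Green's Theorem to the area integrals. The interpolation techniques in [51] may also account for street side and offsets, which this sketch omits.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def interpolate_address(number: int, low: int, high: int,
                        start: Point, end: Point) -> Point:
    """Place a house number proportionally along a straight segment."""
    if low == high:
        t = 0.5  # single-number range: fall back to the midpoint
    else:
        t = (number - low) / (high - low)
    t = min(1.0, max(0.0, t))  # clamp numbers outside the range to an endpoint
    return (start[0] + t * (end[0] - start[0]),
            start[1] + t * (end[1] - start[1]))


def polygon_centroid(ring: List[Point]) -> Point:
    """Area-weighted centroid of a simple polygon (shoelace formulas)."""
    if len(ring) > 1 and ring[0] == ring[-1]:
        ring = ring[:-1]  # drop the repeated closing vertex
    n = len(ring)
    twice_area = 0.0
    cx = cy = 0.0
    for i in range(n):
        x0, y0 = ring[i]
        x1, y1 = ring[(i + 1) % n]
        cross = x0 * y1 - x1 * y0
        twice_area += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    if twice_area == 0:
        raise ValueError("degenerate polygon with zero area")
    # Cx = sum / (6 * A) and A = twice_area / 2, hence the factor of 3.
    return (cx / (3.0 * twice_area), cy / (3.0 * twice_area))
```

For an irregular parcel the area-weighted centroid can differ noticeably from the average of the vertices, which is why the area-based formulation is used here.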
Manual intervention and re-geocoding interface
As part of this study a manual intervention and re-geocoding interface was developed as a web-based application. This interface is hosted at USC [43] and is freely available to all researchers who wish to use it. The service allows a user to upload and interactively process a database of geocoded records in batch, securely over a 128-bit encrypted Secure Sockets Layer (SSL) connection in their web browser. The main interface, shown in Figure 1, consists of map, displayed points, record navigation, record selection, and re-geocoding panels. The map panel is an implementation of the Google Maps API, as is the geocoder used for re-geocoding [53]. Technical details on how the Google Maps API can be used to display data are available in [54] as well as from the Google web site [53]. According to the online documentation, the Google Maps API geocoder is based on the Tele Atlas US road network [55]. Other than this detail, little documentation is available describing the components of the Google Maps API geocoder, although a substantial amount of user discussion in online newsgroups suggests that it is a probabilistic matching system.
To process a database of geocoded records, a user first uploads a database and maps the fields in their database to the fields the service is expecting. Next, the user selects the type of geocode qualities they wish to work on (or alternatively selects all records), and clicks the "get records" button to display the first set of records from the database. The user can navigate backwards and forwards through the page of records displayed or through different pages of records within the database using the navigation panel. When a record is actively being displayed, the original geocoded point associated with it is displayed on the map. The user can utilize the built-in Google Maps functionalities, including zooming, panning, and the selection of several different data layers to view, e.g., satellite/aerial imagery, street networks, or terrain models.
The correction of a geocode creates another point on the map using one or more of the available options: clicking a location on the map, dragging an existing point, or entering information into one of the geocode boxes and trying to re-geocode it. Performing any one of these options creates a new point, adds it to the panel of displayed points, and adds it to the map. When a point is created by either clicking on the map or dragging an existing point to a new location, the user is prompted to indicate the new accuracy level and provide a rationale for how and why they placed the point where they did, as shown in Figure 2.
If the new point is created by re-geocoding using the Google Maps geocoder, the system records the level of accuracy returned from the Google Maps API as a result. The Google Maps geocoder panels are separated into several input fields to obtain information about which portion of the address the user attempted to re-geocode, as shown in Figure 3. When a record is selected for processing, the original address is automatically filled in the "original address" field, so that the user can submit it for simple re-geocoding as quickly as possible. Alternatively, if the user determines that the original address is something other than a street address, they can enter it into one of the other fields so that the system can keep track of the type of information they attempted to re-geocode, e.g., street intersection, named place, etc. The user can also build an address for re-geocoding from the components of the complete address associated with the record (street address, city, state, USPS ZIP code) by clicking on the address matrix portion of the re-geocoding panel, as shown in Figure 4. Using this, the user can quickly select individual components or different combinations of the original address data for re-geocoding. This is useful in the case of transposed address components and/or extraneous data that need not be included in the re-geocoding query: the user can re-submit portions of the address for re-geocoding without typing anything, again increasing the speed with which they can work on their datasets. If the information they attempt to re-geocode using the Google Maps geocoder returns multiple ambiguous matches, all of the matches are added to the displayed points and map panels.
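The combinations that the address matrix makes available can be sketched as follows. This is an illustration only, not the interface's implementation: in the tool the user chooses the components by clicking, whereas the hypothetical function `candidate_queries` below simply enumerates every non-empty combination of the populated components, largest first. The component names used as dictionary keys are likewise assumptions.

```python
from itertools import combinations
from typing import Dict, List


def candidate_queries(components: Dict[str, str]) -> List[str]:
    """Build query strings from every non-empty subset of address parts.

    Components keep their original order within each query, and blank
    components are skipped so they never appear in a query.
    """
    parts = [value.strip() for value in components.values()
             if value and value.strip()]
    queries = []
    # Try the most complete combinations first, then progressively fewer parts.
    for size in range(len(parts), 0, -1):
        for combo in combinations(parts, size):
            queries.append(", ".join(combo))
    return queries
```

With three populated components this yields seven queries, starting with the full address and ending with the single components, which mirrors the way a user might drop an extraneous or transposed element and try again.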
After creating one or more new points, the user selects the "corrected" one and stores it in the database to be associated with the record. This stores the new point and all relevant metadata about its creation with the record in the database. The correction process is repeated for each record in the database until each of the records that warranted checking has been processed. Two sets of records were skipped altogether: those for which the quality of the geocode was sufficiently high to begin with that no correction was needed, and those for which correction was not possible.
Manual intervention and re-geocoding protocol
The users of the system were directed to always use the same protocol when attempting to correct and/or re-geocode an input record. Each user was assigned to work on a different level of original geocode accuracy, with multiple users being assigned the same one in some cases. The geocodes with the lowest accuracy were attempted first. For each record, the users were instructed to first attempt to simply re-geocode the original address. If this was successful in improving the accuracy to a sufficient level, they would store the re-geocoded point as the corrected one. If not, the user would then perform background research using online searches based on information associated with the record to determine what a true corrected address should be. These searches were also performed within the interface, so the queries a user performed to obtain more information were also associated as metadata with the corrected record.
The goal was to geocode every record to building centroid accuracy. If the correct building could be unambiguously determined either through a subjective process, e.g., there was only a single building on the street, or through evidence found online, e.g., pictures or descriptions of locations, a corrected geocode was placed at the centroid of the roof centerline, either by clicking on the map or dragging an existing point. The users were directed to place the point as close to the centroid of the building as they could, no matter the size of the building. Any information used to determine this location (i.e., the rationale) was then recorded when the user was prompted to indicate the new geocode's accuracy. If an exact building centroid could not be determined, the users were instructed to attempt to obtain geocodes following the hierarchy defined in Table 1.
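The decision order in this protocol can be summarized in a short sketch. Everything named here is hypothetical: the `Correction` fields, the numeric `quality_rank` scale (0 standing in for building centroid, larger values for coarser levels of the hierarchy in Table 1), and the two callables that stand in for the Google Maps re-geocoding step and the manual research step. In the study these steps were carried out by people using the interface, not by code.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Correction:
    point: Tuple[float, float]   # (longitude, latitude)
    quality_rank: int            # assumed scale: 0 = building centroid, larger = coarser
    method: str                  # e.g. "regeocode", "click", "drag"
    rationale: str = ""          # why the point was placed where it was
    searches: List[str] = field(default_factory=list)  # queries kept as metadata


Step = Callable[[str], Optional[Correction]]


def work_queue(records: List[Tuple[str, int]]) -> List[Tuple[str, int]]:
    """Order (record_id, quality_rank) pairs so the coarsest geocodes come first."""
    return sorted(records, key=lambda r: r[1], reverse=True)


def correct_record(address: str, regeocode: Step, research: Step,
                   sufficient_rank: int = 0) -> Optional[Correction]:
    """Try simple re-geocoding first; fall back to background research."""
    first = regeocode(address)
    if first is not None and first.quality_rank <= sufficient_rank:
        return first
    # None from the research step marks the record as non-correctable.
    return research(address)
```

Keeping the rationale and search queries on the stored correction is what later allows a reviewer to check that the protocol was followed.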
Study user group
The user group that participated in this study and worked with the interactive tool to perform the manual intervention consisted of four full-time paid staff members and three volunteer graduate students. The quality of the work was assessed and verified collectively by two of the authors of this report (DG and MC) by randomly checking 1% of the processed results both during and after the study. This verification consisted of: 1) visually inspecting the placement of the resulting geocode location; 2) a logical check to determine that the correct process and protocol were followed in the determination of how and why the geocode was placed where it was in the case of a successful correction; and 3) a logical check that non-correctable records were in fact non-correctable.
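Drawing a random 1% verification sample of this kind is straightforward to script. The sketch below is one possible approach, not the procedure the authors used: the function name, the choice to round the sample size up, and the minimum of one record are all assumptions.

```python
import math
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def audit_sample(record_ids: Iterable[T], fraction: float = 0.01,
                 seed: Optional[int] = None) -> List[T]:
    """Draw a simple random sample of processed records for manual checking."""
    ids = list(record_ids)
    if not ids:
        return []
    # Assumption: round up and always check at least one record.
    k = min(len(ids), max(1, math.ceil(len(ids) * fraction)))
    rng = random.Random(seed)  # a fixed seed makes the draw reproducible
    return rng.sample(ids, k)
```

Sampling without replacement, as `random.Random.sample` does, ensures that no record is checked twice within a single draw.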