A generic method for improving the spatial interoperability of medical and ecological databases

Ghenassia, A.; Beuscart, J. B.; Ficheur, G.; Occelli, F.; Babykina, E.; Chazard, E.; Genin, M.

doi:10.1186/s12942-017-0109-5

Methodology
Open access
Published: 03 October 2017

A. Ghenassia (ORCID: orcid.org/0000-0001-9115-0941)^1,3, J. B. Beuscart^1, G. Ficheur^1,3, F. Occelli^2, E. Babykina^1, E. Chazard^1,3 & M. Genin^1

International Journal of Health Geographics volume 16, Article number: 36 (2017)

Abstract

Background

The availability of big data in healthcare and the intensive development of data reuse and georeferencing have opened up perspectives for health spatial analysis. However, fine-scale spatial studies of ecological and medical databases are limited by the change of support problem and thus a lack of spatial unit interoperability. The use of spatial disaggregation methods to solve this problem introduces errors into the spatial estimations. Here, we present a generic, two-step method for merging medical and ecological databases that avoids the use of spatial disaggregation methods, while maximizing the spatial resolution.

Methods

Firstly, a mapping table is created after one or more transition matrices have been defined. The latter link the spatial units of the original databases to the spatial units of the final database. Secondly, the mapping table is validated by (1) comparing the covariates contained in the two original databases, and (2) checking the spatial validity with a spatial continuity criterion and a spatial resolution index.

Results

We used our novel method to merge a medical database (the French national diagnosis-related group database, containing 5644 spatial units) with an ecological database (produced by the French National Institute of Statistics and Economic Studies, and containing 36,594 spatial units). The mapping table yielded 5632 final spatial units. The mapping table’s validity was evaluated by comparing the number of births in the medical and ecological databases in each final spatial unit. The median [interquartile range] relative difference was 2.3% [0; 5.7]. The spatial continuity criterion was low (2.4%), and the spatial resolution index was greater than for most French administrative areas.

Conclusions

Our innovative approach improves interoperability between medical and ecological databases and facilitates fine-scale spatial analyses. We have shown that disaggregation models and large aggregation techniques are not necessarily the best ways to tackle the change of support problem.

Background

In the field of epidemiology, the term “spatial analysis” refers to the description and analysis of the spatial distribution of healthcare phenomena, such as the incidence or prevalence of disease or healthcare consumption across geographic areas [1,2,3,4,5]. Although spatial analysis can be applied to point data, geostatistical data and aggregated data, most of the data for spatial analysis in the field of health are aggregated because they ensure that the patients’ data remain confidential. By definition, these so-called ecological studies use data that have been aggregated into administrative spatial units, such as counties, provinces and states. These analyses require two categories of aggregated data. The first category is related to how the events (e.g. the cases of disease or surgical acts) are counted within each spatial unit in the study area. The second category is related to the descriptive ecological data on the source population and the living environment within these spatial units, such as the socio-economic level, the employment rate, housing conditions and environmental quality. For example, a spatial analysis of the incidence of Crohn’s disease in northern France examined correlations between two data sources: all new cases of Crohn’s disease recorded in the EPIMAD register for each district (canton), and the characteristics of each of these districts in terms of the underlying population and the living environment. By combining these two sources, the investigators were able to (1) calculate the incidence of Crohn’s disease for each canton, and (2) evaluate the influence of the living environment and the population’s socio-economic level [6, 7].

Spatial analysis in healthcare is attracting growing interest because of improvements in statistical analysis, the development of information technology tools, and the emergence of disease registries [8,9,10,11,12,13,14]. More recently, the availability of big data in healthcare [15,16,17] and the intensive development of data reuse [18, 19] and georeferencing [20, 21] have opened up new perspectives for describing healthcare consumption or disease prevalence/incidence over large geographical areas—even whole countries—and analyzing their ecological determinants (such as socio-economic factors) [22, 23].

However, the correlation of big data and ecological data over large areas is complicated by the problem of database interoperability [24,25,26]. In the specific setting of spatial analysis, interoperability is based on the smallest possible spatial reference unit, which acts as a link between the medical database and the ecological database. In the absence of this link, the data must be aggregated on a larger scale, which limits the precision of the results [27,28,29]. In fact, the quality and relevance of the conclusions of a spatial analysis depend on the concordance between the spatial resolution and the nature of the phenomenon studied. The use of aggregated data induces an ecological bias that fades (but does not disappear) when the spatial resolution is increased [30]. Moreover, a finer-scale analysis enables the assessment of more local phenomena, such as the impact of sources of pollution [31]. However, larger spatial units may be more appropriate if the underlying disease pathways involve larger-scale phenomena. The availability of fine-scale data provides an opportunity to use the scale that best matches the study’s goal.

Poor interoperability between medical databases and ecological databases thus appears to be a major limitation for fine-scale spatial analyses of large geographical areas. However, the interoperability problem should not limit the choice of the most appropriate scale. This interoperability problem has been highlighted (for example) for National Health Service data in the UK, Statewide Planning and Research Cooperative System data from New York State in the USA, and the French national diagnosis-related group database (Programme Médicalisé des Systèmes d’Information, PMSI) [27, 32, 33].

Two ways of tackling the interoperability problem have been suggested: spatial disaggregation and spatial aggregation. The first approach consists in creating a mapping table that adopts the finest scale; consequently, the data aggregated on a larger scale are disaggregated into spatial units at the finest scale. However, this necessitates the use of complex statistical models for spatial disaggregation (such as areal interpolation models) to estimate the variables’ values on a smaller scale. Hence, these procedures can lead to errors in the spatial estimation, which are especially large when the spatial units of origin are on very different scales (e.g. by going from the state scale to the town scale) [26, 34]. The second approach (aggregation methods) consists in creating a mapping table that links the spatial units of one or both databases to a larger scale. In a simple, particular case, the data from one of the two databases are aggregated to the spatial scale of the other database. However, in the most frequent case, the spatial units of the two databases are aggregated into a larger spatial unit that covers them both. Although most studies use administrative spatial units as the larger spatial unit, this is not necessarily the finest and/or most appropriate scale for use. Consequently, aggregation methods markedly decrease spatial resolution (e.g. by going from the town scale to the county scale), and may lead to an increase in the ecological bias [27,28,29].

The primary objective of the present study was to develop and characterize a generic method for building a mapping table between a medical database and an ecological database. The method maximizes the spatial resolution and avoids the use of spatial disaggregation techniques, thus enabling the choice of the most appropriate scale for the phenomenon being studied. By way of an illustrative example, we applied this method to the interoperability of the above-mentioned PMSI medical database and the socio-economic data produced by the French National Institute of Statistics and Economic Studies (Institut National de la Statistique et des Études Économiques, INSEE).

The generic method

This section describes the generic method for improving the spatial interoperability of medical and ecological databases. The different steps in this generic method are summarized in Fig. 1.

Data and objectives

Let us consider two distinct databases: a medical database that describes patients and healthcare events, and an ecological database that describes the population. The present method considers the following conditions of application:

1. The medical database is organized on the scale of the individual. Each individual is attached to a spatial ID Spatial_Id_Medical, which corresponds to the spatial unit SU_medical. A variable characterizes each healthcare event.
2. The ecological database is organized on the scale of the spatial unit SU_eco, which has a unique spatial ID Spatial_Id_Eco.
3. The spatial units SU_medical and SU_eco differ, as do the spatial IDs Spatial_Id_Eco and Spatial_Id_Medical.

The objective of our method is to build a mapping table that enables the creation of a final database comprising both medical and ecological data from the above-mentioned databases on the scale of the spatial unit SU_analysis and with a unique spatial ID called Spatial_Id_Analysis. The medical database must be aggregated for the variable characterizing the healthcare event on the scale of the spatial unit SU_medical (Fig. 2). An example showing how the final spatial analysis database is built is provided in Additional file 1.
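The aggregation of the medical database to the scale of SU_medical can be illustrated with a minimal Python sketch. The record layout (a spatial ID paired with an event flag) and the sample IDs are hypothetical and are not taken from the databases described in this article.

```python
from collections import Counter

# Hypothetical individual-level medical records:
# (Spatial_Id_Medical, was the healthcare event observed?)
medical_records = [
    ("M01", True),
    ("M01", False),
    ("M01", True),
    ("M02", True),
    ("M03", False),
]


def aggregate_events(records):
    """Count healthcare events per Spatial_Id_Medical."""
    counts = Counter()
    for spatial_id, has_event in records:
        # Adding 0 still registers the unit, so units without events are kept.
        counts[spatial_id] += int(has_event)
    return dict(counts)


events_by_unit = aggregate_events(medical_records)
```

Keeping units with zero events matters in practice, because a missing unit and a unit with no event are not the same thing once the counts are joined to the ecological data.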

Construction rules

1. The direction of the relationship. When spatial units differ in size (i.e. SU_medical ≠ SU_eco), the two databases can only be aligned after the data have been aggregated. Count data are aggregated by calculating a sum, whereas continuous variables or proportions can be aggregated by calculating a median, mean or weighted mean. The larger of the two spatial units is then chosen as SU_analysis. The reverse process requires the use of a disaggregation method, leading to a loss of precision [34, 35].
2. Transition matrices M_1, …, M_p. A transition matrix is a tool for linking an original spatial ID to a final spatial ID.

A mapping table for the Spatial_Id_Medical and Spatial_Id_Eco IDs can be built by using p transition matrices (p ≥ 1). For example, a transition matrix makes it possible to associate each town’s spatial ID with the spatial ID of the state to which it belongs. However, in more complex situations, there may be no direct way of linking the two spatial IDs. Thus, two or more matrices are required, leading to the creation of at least one temporary spatial ID Spatial_Id_Temp. The mapping table yields p + 1 spatial IDs, where Spatial_Id_1 corresponds to Spatial_Id_Eco and Spatial_Id_{p+1} corresponds to Spatial_Id_Medical. The transition matrices are based on a detailed assessment of the Spatial_Id_Medical and Spatial_Id_Eco IDs. It is then necessary to describe all the equivalence situations for each transition matrix. One or several Spatial_Id_j can correspond to one or several Spatial_Id_{j+1} (1 ≤ j < p + 1). The various, mutually exclusive equivalence situations for a given transition matrix M_k (1 ≤ k ≤ p) are shown in Fig. 3.
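The construction rules can be sketched in Python under some explicit assumptions. Here, each transition matrix is represented as a set of (source ID, target ID) pairs, and each final unit is taken to be a group of IDs that are linked directly or indirectly (a connected component). This grouping is one plausible reading of the rules above rather than the authors' published implementation, and all the IDs, counts and function names are hypothetical.

```python
from collections import defaultdict


def compose(first, second):
    """Chain two transition matrices given as sets of (source, target) pairs."""
    by_source = defaultdict(set)
    for mid, target in second:
        by_source[mid].add(target)
    return {(src, tgt) for src, mid in first for tgt in by_source.get(mid, ())}


def build_mapping_table(eco_to_medical):
    """Assign one Spatial_Id_Analysis to each group of linked IDs."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for eco_id, med_id in eco_to_medical:
        # Tag the IDs so that ecological and medical codes never collide.
        parent[find(("eco", eco_id))] = find(("med", med_id))

    labels, table = {}, []
    for eco_id, med_id in sorted(eco_to_medical):
        root = find(("eco", eco_id))
        analysis_id = labels.setdefault(root, f"A{len(labels) + 1}")
        table.append((eco_id, med_id, analysis_id))
    return table


def aggregate_counts(counts_by_id, id_to_analysis):
    """Aggregate count data to SU_analysis by summation."""
    totals = defaultdict(int)
    for spatial_id, count in counts_by_id.items():
        totals[id_to_analysis[spatial_id]] += count
    return dict(totals)


# Hypothetical transition matrices: Eco -> Temp (M_1) and Temp -> Medical (M_2).
m1 = {("C1", "T1"), ("C2", "T1"), ("C3", "T2"), ("C4", "T3"), ("C4", "T4")}
m2 = {("T1", "P1"), ("T2", "P2"), ("T3", "P2"), ("T4", "P3")}

mapping_table = build_mapping_table(compose(m1, m2))
eco_to_analysis = {eco_id: analysis_id for eco_id, _, analysis_id in mapping_table}
analysis_counts = aggregate_counts({"C1": 10, "C2": 5, "C3": 7, "C4": 3}, eco_to_analysis)
```

In this toy example, C4 is linked to two medical units (P2 and P3), so C3, C4, P2 and P3 all fall into the same final unit. Continuous variables or proportions would need a mean, median or weighted mean instead of the sum used in `aggregate_counts`.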

Validation

Validation of the mapping table

After the final database has been built, it is necessary to validate the quality of the interface between the medical database and the ecological database. We used the following approach: (1) identification of the set of variables shared by the medical database and the ecological database; (2) choice of the variables that display the best exhaustiveness and reliability; and (3) comparison of these variables in the two databases on the scale of the SU_analysis spatial unit.
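Step (3) can be sketched as follows. The text does not give a formula for the comparison, so the sketch assumes a relative difference expressed as a percentage of the ecological value. That choice, the function names and the sample counts are all hypothetical.

```python
from statistics import quantiles


def relative_differences(medical, ecological):
    """Relative difference (%) per SU_analysis, using the ecological value as reference."""
    diffs = {}
    for unit, eco_value in ecological.items():
        if unit not in medical or eco_value == 0:
            continue  # units that cannot be compared are skipped
        diffs[unit] = 100 * abs(medical[unit] - eco_value) / eco_value
    return diffs


def median_iqr(values):
    """Return the median and the (first quartile, third quartile) pair."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return q2, (q1, q3)


# Hypothetical counts of a shared variable in each final spatial unit.
births_medical = {"A1": 102, "A2": 50, "A3": 196, "A4": 80, "A5": 33}
births_eco = {"A1": 100, "A2": 50, "A3": 200, "A4": 75, "A5": 30}

diffs = relative_differences(births_medical, births_eco)
median_diff, (q1_diff, q3_diff) = median_iqr(list(diffs.values()))
```

Summarizing the per-unit differences with a median and interquartile range keeps a few poorly matched units from dominating the assessment, although such units are still worth inspecting individually.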

Spatial validation

In spatial terms, the final purpose of the mapping table is to create a background map on the scale of the SU_analysis spatial unit. In order to check the quality of the selected spatial unit (SU_analysis), it is necessary to evaluate spatial continuity and the decline in spatial resolution.

Spatial continuity is defined as the ability to move from any one point to another point without leaving the spatial unit considered. In other words, a spatially continuous unit has a single boundary [36,37,38]. A spatial unit that does not meet this condition is referred to as discontinuous or fragmented. Most studies of putative links between a health outcome and environmental factors rely on the use of aggregated data. These data are frequently represented by the centroid of each spatial unit. However, in the case of discontinuous spatial units, the centroid may be outside the spatial unit. Hence, an error in the data’s spatial location (due to fragmented spatial units) might affect the findings and result in an erroneous conclusion [36,37,38]. In order to control for this eventuality, spatial continuity is evaluated by determining the fragmentation of the spatial units, defined as the number of discontinuous SU_analysis as a proportion of the total number of SU_analysis [37, 38]. This index can be calculated using geographical information systems, such as QGIS and ArcGIS [39, 40].
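The fragmentation index can be approximated without a geographical information system if an adjacency list of the initial spatial units is available. The sketch below assumes that each initial unit is itself a single polygon and that a final unit is continuous when its member units form one connected block. This is a simplified proxy for the GIS-based calculation, and the IDs and adjacency data are hypothetical.

```python
from collections import deque


def is_continuous(members, adjacency):
    """True if the member units form a single connected block."""
    members = set(members)
    start = next(iter(members))
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for neighbour in adjacency.get(current, ()):
            if neighbour in members and neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen == members


def fragmentation(units, adjacency):
    """Proportion of final units that are discontinuous."""
    discontinuous = sum(not is_continuous(m, adjacency) for m in units.values())
    return discontinuous / len(units)


# Hypothetical adjacency between initial units (symmetric).
adjacency = {
    "C1": {"C2"},
    "C2": {"C1", "C3"},
    "C3": {"C2", "C4"},
    "C4": {"C3", "C6"},
    "C5": set(),
    "C6": {"C4"},
}
# Hypothetical composition of each SU_analysis.
units = {"A1": {"C1", "C2"}, "A2": {"C3", "C5"}, "A3": {"C4"}, "A4": {"C6"}}

index = fragmentation(units, adjacency)
```

Here only A2 is fragmented, because C3 and C5 share no border, so one unit out of four is discontinuous.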

Spatial resolution is defined as the surface area of the smallest spatial unit in a given data set; it corresponds to the level of detail within the data. Aggregation of spatial units decreases the spatial resolution and thus the quality of the analysis. For example, the spatial resolution decreases if (for a given geographical zone) the data for a town are aggregated with data for the region as a whole. The decline in spatial resolution can initially be evaluated visually. The background map for SU_analysis is compared with the background map for the smallest spatial unit in the initial databases, in order to identify any obviously aberrant aggregates. The decline in spatial resolution can then be measured by calculating the ratio between the median surface area of SU_analysis and that of the smallest spatial unit in the initial databases (SU_initial = SU_eco or SU_medical). This ratio must also be calculated for other administrative reference units whose surface area is known. These ratios are then compared: a lower index of decline corresponds to a spatial unit with a higher spatial resolution.

$$\frac{{{\text{SU\_analysis}}}}{{{\text{SU\_initial}}}}\quad {\text{versus}}\quad \frac{{{\text{SU\_reference1}}}}{{{\text{SU\_initial}}}}\quad {\text{versus}}\quad \frac{{{\text{SU\_reference2}}}}{{{\text{SU\_initial}}}}$$

For example, reference units 1 and 2 could be the county and the state for the USA, or the canton and the département for France. This index can also be calculated from census data on the number of inhabitants.
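The comparison of ratios can be sketched in a few lines of Python. The surface areas below are invented for illustration only and do not correspond to any real spatial units; the function name is also hypothetical.

```python
from statistics import median


def resolution_decline(areas_candidate, areas_initial):
    """Ratio of the median surface area of a candidate unit to that of SU_initial."""
    return median(areas_candidate) / median(areas_initial)


# Hypothetical surface areas (km²) for each type of spatial unit.
areas_initial = [10, 12, 15, 20, 8]
candidates = {
    "SU_analysis": [30, 36, 48],
    "SU_reference1": [120, 150, 90],
    "SU_reference2": [1200, 600],
}

indices = {
    name: resolution_decline(areas, areas_initial)
    for name, areas in candidates.items()
}
# The lowest index of decline identifies the unit with the highest resolution.
best_unit = min(indices, key=indices.get)
```

The same function could be applied to population counts instead of surface areas, in line with the census-based variant mentioned above.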

Application of the generic method: an illustrative example based on French databases

Data sources and objectives

In this section, the generic method is applied to a pair of French medical and ecological databases.

1. The medical database is the PMSI. Collection of these data has been approved by the French National Data Protection Commission (Commission Nationale de l’Informatique et des Libertés; authorization 1754053). The database is compiled and released by France’s Technical Agency for Information on Hospitalization (Agence Technique de l’Information sur l’Hospitalisation, ATIH). The database contains a summary of each inpatient stay in France, including the ICD-10 diagnostic code, the medical procedures performed (coded according to the French CCAM classification) and the patient’s age, gender, and unique identifier. Each patient is localized by his/her place of residence, which is only characterized by the PMSI spatial ID (Spatial_Id_PMSI) in the spatial unit SU_PMSI. There were 5644 distinct SU_PMSIs in France in 2014, which were characterized by a mean surface area of 97.37 km² and a mean population of 11,174.
2. The ecological database was produced by the INSEE [41]. The INSEE acts as France’s census office, and collects a vast range of demographic, social, economic and housing-related data. Most of the data are publicly available on the INSEE website. The data are summarized for various spatial units: the commune, the canton, the département and the région (in increasing hierarchical order; see Additional File 2 for details). Most frequently, the data are summarized on the scale of the commune (SU_INSEE), which is characterized by the spatial ID Spatial_Id_INSEE. In 2014, there were 36,594 communes (SU_INSEE) in France.
3. The spatial units SU_PMSI and SU_INSEE differ, as do the IDs Spatial_Id_PMSI and Spatial_Id_INSEE.

The goal of our method is to create a mapping table for the IDs Spatial_Id_PMSI and Spatial_Id_INSEE, in order to build a final database that includes both medical data from the PMSI and ecological data from the INSEE. The PMSI medical database provides information on each hospital stay for each patient, which is aggregated for each Spatial_Id_PMSI spatial unit. In this illustrative example, the healthcare event of interest is an in-hospital birth. This event was detected by screening for (1) hospital admissions from home, (2) a patient age of 7 days or less, (3) admissions from another hospital with a bodyweight below 2500 g, and (4) admissions from another hospital with a patient age below 30 days.