An open source software for fast grid-based data-mining in spatial epidemiology (FGBASE)
© Baker and Valleron; licensee BioMed Central Ltd. 2014
Received: 12 August 2014
Accepted: 24 September 2014
Published: 30 October 2014
Examining whether disease cases are clustered in space is an important part of epidemiological research. Another important part of spatial epidemiology is testing whether patients suffering from a disease are more, or less, exposed to environmental factors of interest than adequately defined controls. Both approaches involve determining the number of cases and controls (or population at risk) in specific zones. For cluster searches, this often must be done for millions of different zones. Doing this by calculating distances can lead to very lengthy computations. In this work we discuss the computational advantages of geographical grid-based methods, and introduce an open source software (FGBASE) which we have created for this purpose.
Geographical grids based on the Lambert Azimuthal Equal Area projection are well suited for spatial epidemiology because they preserve area: each cell of the grid has the same area. We describe how data is projected onto such a grid, as well as grid-based algorithms for spatial epidemiological data-mining. The software program that we have developed (FGBASE) implements these grid-based methods.
The grid-based algorithms are extremely fast, particularly for cluster searches. When applied, as an example, to a cohort of French Type 1 Diabetes (T1D) patients, the grid-based algorithms detected potential clusters in a few seconds on a modern laptop. This compares very favorably to an equivalent cluster search using distance calculations instead of a grid, which took over 4 hours on the same computer. In the case study we discovered 4 potential clusters of T1D cases near the cities of Le Havre, Dunkerque, Toulouse and Nantes. One example of environmental analysis with our software was to study whether a significant association could be found between T1D and distance to vineyards with heavy pesticide use. None was found. In both examples, the software facilitates the rapid testing of hypotheses.
Grid-based algorithms for mining spatial epidemiological data provide advantages in terms of computational complexity, thus improving the speed of computations. We believe that these methods and this software tool (FGBASE) will lower the computational barriers to entry for those performing epidemiological research.
Examining whether disease cases are clustered in space is an important part of epidemiological research (see [1–6]). Another important part of spatial epidemiology is testing whether patients suffering from a disease are more, or less, exposed to some environmental factors of interest (see [7–11]). Cluster search fits into a hypothesis-free approach, while testing the effect of specific environmental factors fits into a hypothesis-driven approach. Both these approaches are important: the hypothesis-driven approach generally provides higher statistical power since fewer statistical tests are performed, whereas the hypothesis-free approach can discover associations which researchers did not think of. Searching for clusters consists simply of locating regions where the number of cases (relative to controls or relative to the population at risk) is larger than would be expected by chance. This is a data-driven approach in the sense that the researcher does not formulate any a priori hypothesis regarding the data. Instead the data is examined to see if it exhibits any patterns, which may not have been immediately visible. If such patterns are discovered, they may suggest further specific explorations, and help identify unexpected environmental risk factors.
Testing the effect of specific environmental factors requires the researcher to specify environmental entities (factories, waste dumps, freeways, etc.). The disease impact of exposure to these spatial entities is often quantified using odds ratios and, when possible, dose-response effects, which may suggest a causal relationship. Because the researcher must formulate an a priori hypothesis, namely the existence of a link between a set of spatial entities and the disease, this fits into a hypothesis-driven approach. After the hypothesis is chosen, the data is examined to see whether or not it supports the hypothesis.
As the denominator of the likelihood ratio is the same for all zones, only the numerator matters when looking for the most likely clusters.
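And if the proportion of cases inside the zone $z$ does not exceed the proportion outside it, then $\sup_{p>q} L(z, p, q)$ is equal to $p^{C}(1-p)^{N-C}$, where $C$ is the total number of cases, $N$ is the total number of individuals (cases and controls) and $p = C/N$.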
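As an illustration, the log of the numerator of the likelihood ratio can be computed for a single zone as in the sketch below. It assumes the standard Bernoulli form of Kulldorff's spatial scan statistic; the function and argument names (`log_likelihood_numerator`, `c`, `n`, `C`, `N`) are chosen here for illustration and are not taken from FGBASE.

```python
import math


def _xlogy(x: float, y: float) -> float:
    """Return x * log(y), using the convention 0 * log(0) = 0."""
    return 0.0 if x == 0 else x * math.log(y)


def log_likelihood_numerator(c: int, n: int, C: int, N: int) -> float:
    """Log of sup_{p>q} L(z, p, q) for one zone z (Bernoulli model, assumed).

    c, n : cases and individuals (cases + controls) inside the zone
    C, N : cases and individuals over the whole study area
    """
    # Value taken when the zone shows no excess of cases.
    null_value = _xlogy(C, C / N) + _xlogy(N - C, 1 - C / N)
    if n == 0 or n == N:
        return null_value
    p_in = c / n                   # proportion of cases inside the zone
    q_out = (C - c) / (N - n)      # proportion of cases outside the zone
    if p_in <= q_out:
        return null_value
    return (_xlogy(c, p_in) + _xlogy(n - c, 1 - p_in)
            + _xlogy(C - c, q_out) + _xlogy((N - n) - (C - c), 1 - q_out))
```

Because the denominator is the same for every zone, ranking zones by this value is equivalent to ranking them by the likelihood ratio itself.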
Description of FGBASE
"Data driven", where the epidemiologist searches for clusters of cases, which can later be interpreted. In this case, the data consists of the positions of the patients and the positions of the controls (or the density of the underlying at-risk population).
"Hypothesis driven", where the epidemiologist wants to test whether cases of a disease of interest and controls differ in their exposure to geographically specified environmental factors. In this case the data are the positions of the patients and of the controls on the map, and the positions of the environmental points of interest.
The principle of the method is to use a high-resolution grid to avoid computationally costly distance calculations, as outlined in the introduction.
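One common way to exploit such a grid, shown here as a minimal sketch, is to count points per cell once and then use a summed-area table so that the number of points in any rectangular zone is obtained with four lookups and no distance calculation. This is an assumption about how a grid can be used, not a description of FGBASE's internal implementation; all names are hypothetical.

```python
from typing import Iterable, List, Tuple

Cell = Tuple[int, int]  # (column, row) index of a grid cell


def count_grid(cells: Iterable[Cell], width: int, height: int) -> List[List[int]]:
    """Number of points falling in each cell, stored as grid[row][col]."""
    grid = [[0] * width for _ in range(height)]
    for col, row in cells:
        grid[row][col] += 1
    return grid


def prefix_sums(grid: List[List[int]]) -> List[List[int]]:
    """Summed-area table S, where S[r][c] is the sum of grid[0:r][0:c]."""
    height, width = len(grid), len(grid[0])
    S = [[0] * (width + 1) for _ in range(height + 1)]
    for r in range(height):
        for c in range(width):
            S[r + 1][c + 1] = grid[r][c] + S[r][c + 1] + S[r + 1][c] - S[r][c]
    return S


def rect_count(S: List[List[int]], col0: int, row0: int, col1: int, row1: int) -> int:
    """Points in the rectangle of cells [col0, col1] x [row0, row1], bounds inclusive."""
    return (S[row1 + 1][col1 + 1] - S[row0][col1 + 1]
            - S[row1 + 1][col0] + S[row0][col0])
```

With one table for cases and one for controls, the counts needed for each candidate zone cost a constant number of operations, whatever the size of the zone.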
Representation of geographical information in FGBASE
Choosing a geographical grid standard
Although the software will run with any type of grid, the use of equal area grids based on the Lambert Azimuthal Equal Area (LAEA) projection system is encouraged. This type of grid is recommended by Directive 2007/2/EC, which was adopted by the European Parliament in March 2007. This directive aims to establish an Infrastructure for Spatial Information in the European Community (INSPIRE) for environmental policies. The INSPIRE directive recommends the use of the Lambert Azimuthal Equal Area (ETRS89-LAEA) projection for pan-European spatial analysis and reporting when true area representation is required. For each European country, population grids of this type can be downloaded from the European Commission’s Eurostat website.
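The projection of a geographical position onto such a grid can be sketched as follows. The code applies the standard ellipsoidal Lambert Azimuthal Equal Area forward formulas with the ETRS89-LAEA Europe parameters (EPSG:3035: origin 52°N 10°E, false easting 4 321 000 m, false northing 3 210 000 m, GRS80 ellipsoid). It assumes that WGS84 and ETRS89 coordinates can be treated as identical at this resolution and that grid cells are 1 km wide; the function names are hypothetical and this is not FGBASE's own code.

```python
import math

# ETRS89-LAEA Europe (EPSG:3035) parameters on the GRS80 ellipsoid.
A = 6378137.0                      # semi-major axis in metres
F = 1 / 298.257222101              # flattening
LAT0, LON0 = math.radians(52.0), math.radians(10.0)
FALSE_E, FALSE_N = 4321000.0, 3210000.0
E2 = 2 * F - F * F                 # eccentricity squared
E1 = math.sqrt(E2)


def _q(phi: float) -> float:
    """Auxiliary function used to obtain the authalic latitude."""
    s = math.sin(phi)
    return (1 - E2) * (s / (1 - E2 * s * s)
                       - math.log((1 - E1 * s) / (1 + E1 * s)) / (2 * E1))


QP = _q(math.pi / 2)
RQ = A * math.sqrt(QP / 2)         # radius of the authalic sphere
BETA0 = math.asin(_q(LAT0) / QP)   # authalic latitude of the origin
D = (A * math.cos(LAT0) / math.sqrt(1 - E2 * math.sin(LAT0) ** 2)) / (RQ * math.cos(BETA0))


def to_laea(lon_deg: float, lat_deg: float):
    """Project longitude/latitude in degrees to (easting, northing) in metres."""
    beta = math.asin(_q(math.radians(lat_deg)) / QP)
    dlam = math.radians(lon_deg) - LON0
    b = RQ * math.sqrt(2 / (1 + math.sin(BETA0) * math.sin(beta)
                            + math.cos(BETA0) * math.cos(beta) * math.cos(dlam)))
    easting = FALSE_E + b * D * math.cos(beta) * math.sin(dlam)
    northing = FALSE_N + (b / D) * (math.cos(BETA0) * math.sin(beta)
                                    - math.sin(BETA0) * math.cos(beta) * math.cos(dlam))
    return easting, northing


def to_cell(lon_deg: float, lat_deg: float, cell_size_m: float = 1000.0):
    """Index (column, row) of the grid cell containing the position."""
    easting, northing = to_laea(lon_deg, lat_deg)
    return math.floor(easting / cell_size_m), math.floor(northing / cell_size_m)
```

For example, `to_laea(0.15368, 49.4943)` gives approximately (3 610 000, 2 979 000) m, which agrees with the first corner listed for cluster candidate 1 when grid coordinates are read in kilometres.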
Representing the points of interest in the hypothesis driven approach
The grid-based approach is well suited for examining the effect of specific environmental factors, when these environmental factors can be located at specific points on the map. These can be isolated points (such as factories or waste dumps). More generally, curves and surfaces can be modeled as collections of points through sampling. This enables the grid-based approach to be applied to the study of the possible environmental hazards associated with a highway, a river, or specific agricultural cover using pesticides.
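A minimal sketch of this sampling for a curve is given below: a polyline (for instance a highway) whose vertices are already projected to metres is sampled at regular intervals, and each sample is mapped to its grid cell. The 1 km cell size and 100 m sampling step are illustrative assumptions, and `polyline_cells` is a hypothetical helper rather than an FGBASE function.

```python
import math
from typing import List, Set, Tuple

Point = Tuple[float, float]   # projected (easting, northing) in metres
Cell = Tuple[int, int]        # (column, row) index of a grid cell


def polyline_cells(vertices: List[Point], cell_size_m: float = 1000.0,
                   step_m: float = 100.0) -> Set[Cell]:
    """Grid cells touched by points sampled every step_m metres along a polyline."""
    def cell_of(x: float, y: float) -> Cell:
        return math.floor(x / cell_size_m), math.floor(y / cell_size_m)

    # A single vertex is treated as an isolated point.
    cells: Set[Cell] = {cell_of(*v) for v in vertices[:1]}
    for (x0, y0), (x1, y1) in zip(vertices, vertices[1:]):
        length = math.hypot(x1 - x0, y1 - y0)
        n_steps = max(1, math.ceil(length / step_m))
        for i in range(n_steps + 1):
            t = i / n_steps
            cells.add(cell_of(x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
    return cells
```

A step much smaller than the cell size keeps the risk of skipping a cell that the curve only clips at a corner low; surfaces can be handled in the same spirit by sampling points over their interior.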
Using FGBASE to test specific environmental factors in the hypothesis-driven approach
Using FGBASE to find clusters of cases using the data-driven option
In this option, the locations of cases are examined in the context of either a set of control locations or an underlying population density. A cluster search is performed to determine areas where the number of cases is much larger than would be expected by chance. Two datasets are necessary: (i) locations of cases and (ii) locations of controls or appropriate population density data. For example, for all European countries, population grids can be downloaded from the European Commission’s Eurostat website. (For non-European countries a solution could be to use the LandScan™ population density distribution, but the user must ensure that the cases and controls are projected to the same grid as the one used by LandScan™.)
A case study with FGBASE: patients with type 1 diabetes (T1D)
The development of FGBASE was driven by a research program searching for possible environmental factors of Type 1 Diabetes (T1D). Indeed, in several European countries, the incidence of T1D continues to rise rapidly and has doubled since the 1980s in children aged less than 5 years. The reason for this cannot be genetic, as the observed variations occurred over a short period during which the genetic structure of the population did not change. Numerous case-control studies of environmental associations with T1D have examined specific candidate factors ([20–23]), but no single factor has gained further credit in the causation of T1D ([24, 25]).
The population used in this example consists of the participants of "Isis-Diab", an ongoing prospective cohort of T1D patients recruited since 2007 by the Isis-Diab Network, which is composed of 99 diabetic centers covering almost all French regions (see description and list of participating centers in Additional file 1: Table S1). The main objective of the Isis-Diab program is the exploration of environmental and gene-environment interactions in patients with T1D. The inclusion criterion for the current study was T1D occurring in children aged less than 15 years. T1D was defined according to the American Diabetes Association, and by positive autoantibodies to GAD, insulin, and/or IA2. All studied patients were born in France. The data consist of clinical, genomic, and environmental exposure data. Here, in the example shown to illustrate the "hypothesis driven" option of FGBASE, we do not consider the clinical and genetic characteristics of the patients and we focus on a single source of pollution described in the next section.
The environmental exposures analyzed
In France, all polluting industrial facilities are registered and must provide data on the polluting emissions for which they are responsible. The data are held in the IREP database, which contains comprehensive data (n = 93 809) on polluting emissions regarding 159 registered chemicals. The data set consists of three database tables. The first table contains a list of industrial entities such as factories (total: n = 12 173). Each entity comes with a numerical identifier (entityID), as well as its geographical coordinates. The second table provides a list of 160 chemicals, each one having a numerical identifier (chemID). The third table contains a comprehensive list of chemical emissions over a period of 10 years (from 2000 to 2009). Each emission is provided with its date, its entityID and its chemID. This data set naturally fits into the framework of spatial entities grouped into classes. Here the classes are the registered chemicals (m = 160), and each class is associated with the entities that have emitted that chemical. The entities are mapped to the grid using their geographical coordinates.
"Hypothesis driven" option: In the hypothesis driven option, cases must be compared to controls. Here we have defined "virtual controls" that are randomly taken over France in places of comparable density to those of cases. A strength of this method is that the definition of controls, and the sampling issues, can be addressed by replicating the algorithm as many times as needed. In the example, we chose to take 100 series of 4507 virtual controls, each series being compared to the cases. The algorithm will test each class separately, one after the other. This ensures stronger statistical power but is subject to the classical false discovery issues.
Candidate clusters of T1D cases discovered by FGBASE using cases from the Isis-Diab cohort

Cluster candidate 1 (next to Le Havre)
LAEA-ETRS89 grid coordinates of the rectangular zone: [(3610,2979), (3610,2981), (3608,2981), (3608,2979)]
WGS84 coordinates of the rectangular zone: [(0.15368,49.4943), (0.150047,49.5121), (0.122639,49.5097), (0.126282,49.4919)]

Cluster candidate 2 (next to Dunkerque)
LAEA-ETRS89 grid coordinates of the rectangular zone: [(3774,3126), (3774,3130), (3783,3130), (3783,3126)]
WGS84 coordinates of the rectangular zone: [(2.19376,50.9823), (2.18771,51.018), (2.31538,51.0265), (2.32132,50.9908)]

Cluster candidate 3 (next to Toulouse)
LAEA-ETRS89 grid coordinates of the rectangular zone: [(3627,2313), (3627,2315), (3631,2315), (3631,2313)]
WGS84 coordinates of the rectangular zone: [(1.40999,43.5677), (1.40733,43.5856), (1.45649,43.5897), (1.45914,43.5718)]

Cluster candidate 4 (next to Nantes)
LAEA-ETRS89 grid coordinates of the rectangular zone: [(3446,2746), (3446,2750), (3453,2750), (3453,2746)]
WGS84 coordinates of the rectangular zone: [(-1.59501,47.2094), (-1.60299,47.2449), (-1.51148,47.2548), (-1.5035,47.2192)]

Number of cases
Number of controls
log of the numerator of Kulldorff’s likelihood ratio:
Results with the hypothesis driven option
Using the hypothesis driven option and correcting for multiple comparisons, the polluting chemicals in the IREP dataset did not show a statistically significant association with T1D.
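The repeated drawing of virtual controls used by the hypothesis driven option can be sketched as below. The sketch interprets "places of comparable density" as drawing grid cells with probability proportional to their population count, which is an assumption made here for illustration and not necessarily the sampling rule used by the authors. The names `draw_virtual_controls` and `draw_series` are hypothetical.

```python
import random
from typing import Dict, List, Tuple

Cell = Tuple[int, int]  # (column, row) index of a grid cell


def draw_virtual_controls(population: Dict[Cell, float], n_controls: int,
                          rng: random.Random) -> List[Cell]:
    """Draw control locations, each cell weighted by its population (assumed rule)."""
    cells = list(population)
    weights = [population[cell] for cell in cells]
    return rng.choices(cells, weights=weights, k=n_controls)


def draw_series(population: Dict[Cell, float], n_controls: int, n_series: int,
                seed: int = 0) -> List[List[Cell]]:
    """Independent series of virtual controls, reproducible for a given seed."""
    rng = random.Random(seed)
    return [draw_virtual_controls(population, n_controls, rng)
            for _ in range(n_series)]
```

With the figures of the case study, `draw_series(population, 4507, 100)` would produce the 100 series of 4507 virtual controls, each of which is then compared to the cases.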
Comparisons with the SaTScan™ software
SaTScan™ is a software program which implements the spatial scan statistic. It is very widely used (the SaTScan™ user guide lists over a hundred public health papers which have used it to obtain results in a wide range of studies). Despite its quality and wide user base, SaTScan™ has a certain number of drawbacks which warrant the existence of alternative software tools such as FGBASE. The first drawback is that while SaTScan™ is freely downloadable, it is not open source. This strongly limits its customizability, as the users cannot add and modify features to fit their needs. For example, we (the authors) have benefited from being able to customize the source code of the open source genomic analysis tool PLINK to add certain statistical tests which we needed but which were not implemented in the original software. Another advantage of open source software is that, over time, user scrutiny of the source code reduces the number of bugs and security flaws in the software. The benefits of open source software in the area of geographical information systems have also been studied.
A second drawback of SaTScan™ is that it uses circles. From the SaTScan™ user guide: "With latitude/longitude coordinates, what planar projection is used? No projection is used. SaTScan™ draws perfect circles on the spherical surface of the earth". As discussed in this paper, the use of circles and distance calculations is computationally much slower than using grid projections. Increasing the speed of SaTScan™ through the use of cloud services has been proposed; however, using faster algorithms is clearly a preferable (and less expensive) solution.
A third limitation of SaTScan™ is the lack of cartographic output. While this can be addressed through the use of an external macro, an integrated map such as the one available in FGBASE increases the ease of use when viewing clusters. Finally, a fourth limitation of SaTScan™ is that it addresses only cluster searches and not the testing of environmental factors. By combining both types of analyses (cluster searches and testing of environmental factors) in a single software tool, FGBASE facilitates the interpretation of clusters in terms of environmental factors (both clusters and environmental factors are displayed on the same map).
Types of environmental factors handled by FGBASE
The "hypothesis driven" grid-based algorithm described in this paper and implemented in the companion software (FGBASE) is well suited to environmental factors that can be located at specific points. Curves and surfaces are also well handled, by sampling them as collections of points. The key aspect is that the environmental factors studied should be of a binary nature: at any given location the factor should be either present or absent. This is the case for factories, power lines, highways and fields of specific crops. This is not the case for factors of a continuous nature such as temperature, atmospheric concentrations of a given gas or particle, or ultraviolet index.
Grid-based algorithms for mining spatial epidemiological data provide advantages in terms of computational complexity and improve the speed of computations. This work starts by examining suitable geographical grids, and how epidemiological data is projected to such a grid. Based on this framework, data-mining algorithms are introduced which enable both a data-driven approach and a hypothesis-driven approach. These algorithms enable rapid discovery of clusters of cases as well as the testing of specific environmental factors. A new open-source software tool (FGBASE) implementing these algorithms is presented together with a case study of its use on the "Isis-Diab" cohort of French T1D cases. We hope that these methods and this software tool (FGBASE) will lower the computational barriers to entry for those performing epidemiological research.
FGBASE can be accessed at http://www.fgbase.org.
This research was supported by grant MRES-PRNPE-1-CVS-014 from the French Ministry of Environment, and by the Programme Hospitalier de Recherche Clinique of the French Ministry of Health (AOM08049), Inserm, and NovoNordisk Laboratory. We acknowledge the participation of the pediatricians of the diabetes centers contributing to the Isis-Diab cohort (see list of participating centers in Additional file 1: Table S1). We gratefully thank Sofia Meurisse for the geocoding of patients. We thank the patients and parents who participated in the study. Finally we would like to thank the two anonymous reviewers for their helpful comments and suggestions.
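With a binary factor, exposure reduces to a presence/absence test on the grid, which can then feed a standard odds ratio. The sketch below assumes that an individual is "exposed" when a factor cell lies within a square neighbourhood of a given radius (in cells) around the individual's cell, and uses Woolf's approximation for the 95% confidence interval. Both choices, and all names, are illustrative assumptions rather than FGBASE's documented behaviour.

```python
import math
from typing import List, Set, Tuple

Cell = Tuple[int, int]  # (column, row) index of a grid cell


def is_exposed(cell: Cell, factor_cells: Set[Cell], radius: int) -> bool:
    """True if a factor cell lies within `radius` cells (square neighbourhood)."""
    col, row = cell
    return any((col + dc, row + dr) in factor_cells
               for dc in range(-radius, radius + 1)
               for dr in range(-radius, radius + 1))


def odds_ratio(cases: List[Cell], controls: List[Cell],
               factor_cells: Set[Cell], radius: int = 1):
    """Odds ratio of exposure (cases vs controls) with a Woolf 95% interval."""
    a = sum(is_exposed(cell, factor_cells, radius) for cell in cases)     # exposed cases
    b = sum(is_exposed(cell, factor_cells, radius) for cell in controls)  # exposed controls
    c = len(cases) - a                                                    # unexposed cases
    d = len(controls) - b                                                 # unexposed controls
    if min(a, b, c, d) == 0:
        raise ValueError("odds ratio undefined when a cell of the 2x2 table is empty")
    estimate = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(estimate) - 1.96 * se)
    upper = math.exp(math.log(estimate) + 1.96 * se)
    return estimate, lower, upper
```

Repeating the computation for increasing radii gives a simple way to look for a dose-response pattern with distance, again without computing any distance explicitly.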
1. Kulldorff M: A spatial scan statistic. Commun Stat-Theor Methods. 1997, 26 (6): 1481-1496. 10.1080/03610929708831995.
2. Kulldorff M, Nagarwalla N: Spatial disease clusters: detection and inference. Stat Med. 1995, 14 (8): 799-810. 10.1002/sim.4780140809.
3. Waller LA, Gotway CA: Applied spatial statistics for public health data. 2004, Hoboken, New Jersey: John Wiley & Sons.
4. Lawson AB, Denison DG: Spatial cluster modelling. 2010, CRC Press.
5. Lawson A, Biggeri A, Böhning D, Lesaffre E, Viel J-F, Bertollini R: Disease mapping and risk assessment for public health. 1999, John Wiley & Sons.
6. Assuncao R, Costa M, Tavares A, Ferreira S: Fast detection of arbitrarily shaped disease clusters. Stat Med. 2006, 25 (5): 723-742. 10.1002/sim.2411.
7. Bithell J: Statistical methods for analysing point-source exposures. Geographical and Environmental Epidemiology: Methods for Small Area Studies. 1992, USA: Oxford University Press, 221-230.
8. Bithell JF, Stone RA: On statistical methods for analysing the geographical distribution of cancer cases near nuclear installations. J Epidemiol Community Health. 1989, 43 (1): 79-85. 10.1136/jech.43.1.79.
9. Diggle PJ: A point process modelling approach to raised incidence of a rare phenomenon in the vicinity of a prespecified point. J R Stat Soc A Stat Soc. 1990, 153: 349-362. 10.2307/2982977.
10. Stone RA: Investigations of excess environmental risks around putative sources: statistical problems and a proposed test. Stat Med. 1988, 7 (6): 649-660. 10.1002/sim.4780070604.
11. Jacquez GM, Greiling DA: International Journal of Health Geographics. Int J Health Geogr. 2003, 2: 4. 10.1186/1476-072X-2-4.
12. Kulldorff M: Information Management Services, Inc. SaTScan™ v8.0: Software for the spatial and space-time scan statistics. 2009, [http://www.satscan.org/].
13. Price RC, Pettey W, Freeman T, Keahey K, Leecaster M, Samore M, Tobias J, Facelli JC: SaTScan on a Cloud: On-Demand Large Scale Spatial Analysis of Epidemics. Online J Public Health Inform. 2010, 2 (1).
14. Fleming DM, Schellevis FG, Falcao I, Alonso TV, Padilla ML: The incidence of chickenpox in the community. Lessons for disease surveillance in sentinel practice networks. Eur J Epidemiol. 2001, 17 (11): 1023-1027. 10.1023/A:1020066806544.
15. Annoni A, Bernard L, Lillethun A, Ihde J, Gallego J: Short Proceedings of the 1st European Workshop on Reference Grids. 2004.
16. Epstein PR: Climate change and infectious disease: stormy weather ahead?. Epidemiology. 2002, 13 (4): 373-375. 10.1097/00001648-200207000-00001.
17. GDAL - Geospatial Data Abstraction Library: Version 1.10.1. 2014, Open Source Geospatial Foundation, [http://www.gdal.org/].
18. Dobson JE, Bright EA, Coleman PR, Durfee RC, Worley BA: LandScan: a global population database for estimating populations at risk. Photogramm Eng Remote Sens. 2000, 66 (7): 849-857.
19. Patterson CC, Dahlquist GG, Gyürüs E, Green A, Soltész G: Incidence trends for childhood type 1 diabetes in Europe during 1989–2003 and predicted new cases 2005–20: a multicentre prospective registration study. Lancet. 2009, 373 (9680): 2027-2033. 10.1016/S0140-6736(09)60568-7.
20. Mohr S, Garland C, Gorham E, Garland F: The association between ultraviolet B irradiance, vitamin D status and incidence rates of type 1 diabetes in 51 regions worldwide. Diabetologia. 2008, 51 (8): 1391-1398. 10.1007/s00125-008-1061-5.
21. Knip M, Virtanen SM, Seppä K, Ilonen J, Savilahti E, Vaarala O, Reunanen A, Teramo K, Hämäläinen A-M, Paronen J: Dietary intervention in infancy and later signs of beta-cell autoimmunity. N Engl J Med. 2010, 363 (20): 1900-1908. 10.1056/NEJMoa1004809.
22. Karlén J, Faresjö T, Ludvigsson J: Could the social environment trigger the induction of diabetes related autoantibodies in young children?. Scand J Public Health. 2012, 40 (2): 177-182. 10.1177/1403494811435491.
23. Hober D, Alidjinou EK: Enteroviral pathogenesis of type 1 diabetes: queries and answers. Curr Opin Infect Dis. 2013, 26 (3): 263-269. 10.1097/QCO.0b013e3283608300.
24. Forlenza GP, Rewers M: The epidemic of type 1 diabetes: what is it telling us?. Curr Opin Endocrinol Diabetes Obesity. 2011, 18 (4): 248-251. 10.1097/MED.0b013e32834872ce.
25. Nokoff N, Rewers M: Pathogenesis of type 1 diabetes: lessons from natural history studies of high‒risk individuals. Ann N Y Acad Sci. 2013, 1281 (1): 1-15. 10.1111/nyas.12021.
26. American Diabetes Association: Diagnosis and classification of diabetes mellitus. Diabetes Care. 2009, 32 (Suppl 1): S62-S67.
27. D’Agostino RBSR, Massaro JM, Sullivan LM: Non-inferiority trials: design concepts and issues - the encounters of academic consultants in statistics. Stat Med. 2003, 22 (2): 169-186.
28. Kulldorff M: SaTScan™ User Guide. 2006, [http://www.satscan.org/].
29. Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MA, Bender D, Maller J, Sklar P, De Bakker PI, Daly MJ: PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet. 2007, 81 (3): 559-575. 10.1086/519795.
30. Hoepman J-H, Jacobs B: Increased security through open source. Commun ACM. 2007, 50 (1): 79-83. 10.1145/1188913.1188921.
31. Steiniger S, Hay GJ: Free and open source geographic information tools for landscape ecology. Ecol Inform. 2009, 4 (4): 183-195. 10.1016/j.ecoinf.2009.07.004.
32. Abrams AM, Kleinman KP: A SaTScan™ macro accessory for cartography (SMAC) package implemented with SAS® software. Int J Health Geogr. 2007, 6 (1): 6. 10.1186/1476-072X-6-6.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.