Spatial analyzes of HLA data in Rio Grande do Sul, south Brazil: genetic structure and possible correlation with autoimmune diseases

Background HLA genes are the most polymorphic of the human genome and have distinct allelic frequencies in populations of different geographical regions of the world, serving as genetic markers in ancestry studies. In addition, specific HLA alleles may be associated with various autoimmune and infectious diseases. The bone marrow donor registry in Brazil is the third largest in the world, and it counts with genetic typing of HLA-A, -B, and -DRB1. Since 1991 Brazil has maintained the DATASUS database, a system fed with epidemiological and health data from compulsory registration throughout the country. Methods In this work, we perform spatial analysis and georeferencing of HLA genetic data from more than 86,000 bone marrow donors from Rio Grande do Sul (RS) and data of hospitalization for rheumatoid arthritis, multiple sclerosis and Crohn’s disease in RS, comprising the period from 1995 to 2016 obtained through the DATASUS system. The allele frequencies were georeferenced using Empirical Bayesian Kriging; the diseases prevalence were georeferenced using Inverse Distance Weighted and cluster analysis for both allele and disease were performed using Getis-Ord Gi* method. Spearman’s test was used to test the correlation between each allele and disease. Results The results indicate a HLA genetic structure compatible with the history of RS colonization, where it is possible to observe differentiation between regions that underwent different colonization processes. Spatial analyzes of autoimmune disease hospitalization data were performed revealing clusters for different regions of the state for each disease analyzed. The correlation test between allelic frequency and the occurrence of autoimmune diseases indicated a significant correlation between the HLA-B*08 allele and rheumatoid arthritis. Conclusions Genetic mapping of populations and the spatial analyzes such as those performed in this work have great economic relevance and can be very useful in the formulation of public health campaigns and policies, contributing to the planning and adjustment of clinical actions, as well as informing and educating professionals and the population. Electronic supplementary material The online version of this article (10.1186/s12942-018-0154-8) contains supplementary material, which is available to authorized users.


Background
Harboring more than 200 genes spread over a 3.6 Mb region, the Major Histocompatibility Complex (MHC) is the region of the human genome most enriched for open reading frames [1]. MHC genes, or HLA (Human Leukocyte Antigen) genes in humans, are the most polymorphic loci of the human genome [2], showing different allelic frequencies in populations from different geographic regions around the world [3][4][5]. Due to their high genetic variability and strong linkage disequilibrium, HLA genes have been used in studies of genetic ancestry and demography [6]. Due to their major role in immune response, the loci of the HLA system are the primary determinants of tolerance or rejection in organ and hematopoietic stem cell transplantation (HSCT) [7]. HSCT from bone marrow is clinically indicated for the treatment of disorders of the hematopoietic system or the immune system and in cases of malignant bone marrow diseases and disseminated solid tumors. Leukemia is the leading indication for allogeneic HSCT (72%), followed by lymphoproliferative diseases (15%), non-malignant diseases (12%) and solid tumors (0.6%) [8,9].
In addition, specifc HLA alleles have already been associated with various autoimmune and infectious diseases [10,11]. As a class, the overall cumulative prevalence for all autoimmune diseases (AD) is 5.0%, being 3.0% for males and 7.1% for females [12]. Rheumatoid arthritis (RA) is the most prevalent AD (0.5-1%) [13], being the HLA-DRB1 is the principal locus contributing to disease susceptibility, with an estimated contribution of 30-50% to overall susceptibility to RA [14,15]. Other AD, such as Celiac Disease, Type 1 Diabetes Mellitus, Ankylosing spondylitis, Multiple sclerosis and Crohn's disease also presented HLA genes associated with its susceptibility [see [16] and [17] for further review]. Thus, knowledge of HLA diversity at the population level is important to guide public health policies focused on AD and to improve bone marrow transplantation programs.
The use of the geographic information system (GIS)a toolkit for capturing, storing, transforming, analyzing and presenting spatial data-has been a powerful tool in assessing and monitoring public health in different populations around the world [18,19]. GIS-based data contributes to the improvement of health-related services for the population, since health data combined with geographic information allow researchers to analyze the spatial variation of diseases, mortality, morbidity, access to health care systems and social or environmental determinants for health outcomes [20,21]. The transformation of detailed data into maps can facilitate communication of the geographical distribution of health challenges in different communities and identify areas for intervention [18,22].
The Brazilian Bone Marrow Donor Registry (REDOME, in Portuguese) is the third largest bank of bone marrow donors in the world, with more than 4 million donors registered to date. The state of Rio Grande do Sul (RS), in southern Brazil, has the fourth largest number of registered donors in Brazil, with approximately 300,000 individuals. This register contains information of HLA-A, -B and -DRB1 genotypes, city of residence of the donor as well as ethnicity by self-declaration based on skin color. In this work, we used GIS tools to evaluate the spatial correlation between immune system alleles (from HLA-A, -B and -DRB1 loci) and occurrence of AD in Rio Grande do Sul, based on data information available from governmental health agencies.

Sample
We analyzed a dataset containing 97,292 potential bone marrow donors residing in the state of Rio Grande do Sul who voluntarily registered in REDOME between January 2008 and December 2012. Rio Grande do Sul, the southernmost state of Brazil, is the fourth largest state of the country with more than 11 million inhabitants distributed in 497 cities [23], 439 of which are represented in the dataset. At the time of registration in REDOME, the individuals declared their ethnicity by auto-perception based on skin color, following Brazilian Institute of Geography and Statistics (IBGE) standards. Only municipalities with 50 or more registered donors were included in the analysis. For self-reported white individuals (or Euro-descendants, EURD), 120 cities distributed across all regions of the state cities met the sample size criteria (Additional file 1), totaling 86,672 individuals ( Fig. 1). On the other hand, only 19 cities had more than 50 selfreported black (or Afro-descendants, AFRD) individuals, scattered across the state and, therefore, excluded from further analyses.
Genotyping for HLA-A, -B and -DRB1 loci was performed at the Hospital das Clínicas de Porto Alegre (HCPA) (Luminex LABType SSO system; One Lambda, Inc., Canoga Park, CA). Due to the high polymorphism of HLA genes, complete identification of each allele is only possible through sequencing-based typing. Thus, Luminex genotyping identifies only "low resolution" allelic groups. Because they are closely located on the same chromosome, HLA alleles segregate in linkage blocks, known as haplotypes.
Information on AD were available in DATASUS (Department of Informatics of the Unified Health System, in Portuguese), a database established by the Brazilian Ministry of Health since 1991 that contains health information and statistics from all municipalities in Brazil and that is publicly available through online access

Statistical analyses
Allele and haplotype frequency estimations and tests of Hardy-Weinberg equilibrium (HWE) were performed using the GENE[RATE] tools as described elsewhere [7,[24][25][26]. Principal component analysis (PCA) was done for each locus using Rstudio (v0.98.1103) and the genetic structure was measured using the synthetic genetic structure (SPC) measure proposed by Xue et al. [27], as follows: where PC is the component score and W is the proportion (weight) of the component contribution. All components with an eigenvalue greater than 1 were included in the SPC calculation, following the Kaiser criterion [28]. Hospitalization data for RA, multiple sclerosis (MS), Crohn's disease (CD) and leukemia for each city, comprising the period from January 1995 to December 2016, were obtained through the DATASUS system (tabnet.datasus.gov.br/). RA, MS and CD are the only AD recorded in DATASUS. The number of hospitalizations of each disease in each city was adjusted by the number of inhabitants and used as an indicator of disease prevalence (disease index, DI). Spearman's correlation test between each allele and each disease was performed using IBM SPSS software, Version 20.0 (IBM Corp., Armonk, NY). The result obtained in the Spearman correlation test was submitted to the multiple comparison test FDR (false discovery rate) in Rstudio (v0.98.1103) using the stats (3.3.0) package.
Allele and haplotype frequency, SPC data as well as positive and statistically significant alleles × diseases (hereafter A*D) in the Spearman's correlation test were spatially interpolated using the Empirical Bayesian Kriging method (EBK). For each interpolation, scatterplots were performed for the observed and predicted values and calculated their respective coefficients of determination (R 2 ), Spea r man's correlation coefficient (ρ), Spearman's coefficient of determination (ρ 2 ) and the root mean square error (RMSE). The P-values were adjusted by FDR for α = 0.05.
Cluster maps for A*D showing positive and statistically significant correlation in the Spearman test were generated through the Hot-Spot analysis using the Getis-Ord Gi* method [29,30] based on the following formula: where DI is the disease index, DImax is the maximum disease index, DImin is the minimum disease index, AF is the allelic frequency, AFmax is the maximum allelic frequency and AFmin is the minimum allele frequency. All spatial analyses were performed in ArcGis v10.3.

Results
For all cities, allele frequencies did not show deviations from the Hardy-Weinberg equilibrium (Additional file 2). Considering the whole state, the most frequent alleles for each locus were HLA-A*02 (27.6%), HLA-B*35 (12.4) and HLA-DRB1*07 (13.4%), with substantial allele frequency variation among cities (Additional files 2, 3). Five haplotypes reached frequencies above 2% in at least one city. Haplotype A*01 ~ B*08 ~ DRB1*03 presented the highest frequency considering the entire state (3.7%). Following allele frequencies, there was substantial variation in haplotype frequencies among cities (Additional files 4, 5). Figure 2 shows the spatial HLA genetic structure based on SPC in Rio Grande do Sul. HLA-A and HLA-B loci have a very similar structure, showing a higher differentiation between the Southwest and Metropolitan regions in relation to the Central and Northwest regions. The HLA-DRB1 locus presents a slightly different structure, with a higher differentiation in the Northeast. The combined data for the three loci, shows a very similar structure compared to HLA-A and HLA-B. Figure 3 shows the distribution of the DI, while disease prevalence is shown, for each city, in Additional file 6. The small town of União da Serra, located in the Northeast region of the state, has a population of approximately 1500 inhabitants, which is equivalent to 0.014% of the total population of the state of Rio Grande do Sul. However, this city responded to 0.162% of all hospitalizations for RA from January 1995 to December 2016 (62 hospitalizations events). Thus, when considering the number of hospitalizations by the number of inhabitants in relation to the totals for the state, União da Serra is the municipality with the highest prevalence of RA with a DI 11.6 × higher than expected. Similarly, the town of São Sepé, in the Center-East region of the state, was the city with the highest prevalence for MS, with a DI almost 9 × higher than expected. On its turn, the town of São Pedro da Serra, in the Metropolitan region, had the highest prevalence for CD, with a DI 9.5 × higher than expected. For leukemia, the towns of Pouso Novo, in the Center-East region, Vista Alegre and Três Arroios, both in the Northwest region, had a DI 4.5 × higher than the expected. For all diseases (CD, MS, RA, and leukemia) there was strong evidence for spatial clusters in DI (Fig. 4, Additional file 7, P < 3 × 10 −5 in all cases). Different regions appeared as hot-spots for different diseases. The Center-East and Northeast regions behaved as hotspots for RA and MS, while the Metropolitan and the Northwest regions were cold-spots. CD had a hot-spot cluster in the Metropolitan region and in a small area in the Northeast, while for leukemia there was a hot-spot in the extreme North of the state. Table 1 shows the Spearman correlation index (ρ) for each allele and each disease tested. Most of the statistically significant correlations found were negative.

Discussion
This is the first study to perform spatial analysis of HLA genetic structure, correlating HLA population genetics data with epidemiological data on AD. Figure 1 shows the HLA structure of the bone marrow donor population in Rio Grande do Sul based on the principal component analysis (PCA) of HLA allele frequencies. Visually, HLA-A, HLA-B and the combined data for the three loci showed a very similar structure, presenting a higher differentiation between the Southwest and Metropolitan regions in relation to the Central and Northwest regions.
PCA is a very useful tool in the investigation of population structure, but sampling strategy and the amount of data may impact its results [31]. In this study, only selfreported white individuals were included due to sample size limitations (only 19 cities had more than 50 selfreported black individuals, with little coverage in the state). It is unlikely that this had a major impact on the characterization of HLA genetic structure in Rio Grande do Sul as a whole, given that more than 80% of the population of Rio Grande do Sul is self-declared white [23], and more than 90% of REDOME donors in Rio Grande do Sul declare themselves white at the time of registration. However, given that there are differences in AD prevalence between black and white individuals [32, 33, see 34 for review], an important step forward would be characterizing geographic clusters of AD in the black population of this state and its relationship with the clusters identified in this study.
Among the classic HLA genes, HLA-A is more sensitive to demographic processes, such as genetic drift, because it is less affected by balancing selection [35,36]. In this sense, the differentiation between the Southwest, Southeast and the Metropolitan regions, on one hand, compared to the Central and Northwest regions, on the other hand (Fig. 1), may mirror the colonization history of Rio Grande do Sul (Additional file 11). In these former regions, Portuguese and Spanish individuals were the major settlers since the early eighteenth century, with the later arrival of African slaves, mostly in the Pelotas (Southeast) region. On the other hand, Germans (1824), Poles (1871) and Italians (1875) were major ethnicities settling the Central and Northern regions [37,38].
Specific spatialization and interpolation techniques may influence the geographic trends shown by the data. In this study, allele and haplotype frequencies as well as A*D positive and statistically significant correlations were spatialized by the EBK method (Additional files 3,5). This method was chosen because we had only 120 points to represent the 496 municipalities of Rio Grande do Sul. Kriging is a probabilistic predictor, thus assuming a statistical model for the data, being able to quantify the uncertainty associated with the values predicted from the standard errors. This method uses a semivariograma function of distance and direction separating two locations-to quantify the spatial dependence on the data. EBK differs from classical kriging by using many semivariogram models rather than using only a single model. For each repetition, the semivariogram is used to simulate a new set of values at the input sites; then the simulated data are used to estimate a new semivariogram and its weight. Thus, predicted values and standard errors are inferred for the non-sampled regions using these weights [39].
All interpolated maps showed correlation between observed and interpolated values. The lowest correlation coefficients were typically observed in alleles with low frequencies, and where the sampling is consequently smaller. It is important to note that the alleles that showed a positive and significant correlation with AD (HLA-A*29, HLA-B*08, HLA-B*38, HLA-DRB1*01 e HLA-DRB1*03) presented a correlation coefficient for interpolation ranging from 0.513 to 0.982 and maximum RMSD of 0.017 (Additional file 9). These values indicate that the interpolation method and the analyses performed are consistent.
The Hot-Spot analysis (Getis-Ord Gi*) revealed geographic clusters of AD (RA, MS and CD) and leukemia in Rio Grande do Sul, indicating that neighbor regions should have similar disease prevalence (Additional file 7). On the other hand, our analysis also revealed spatial clusters of A*D, even though both spatial clusters had little overlap (Figs. 4, 5). A genetic cluster can be defined as a group of genetically divergent individuals that arises when gene flow is impeded by physical or cultural barriers [40]. Evolutionary forces such as the founder's effect and low immigration may reinforce genetic backgrounds that pre-dispose to some genetic conditions. One interpretation for the little overlap between DI and A*D is that while DI spatial clustering is dominated by shared environmental and genetic (non-HLA) affecting disease status, A*D spatial clusters indicate a more important role for the common HLA genetic background (through specific "risk" alleles) for these diseases. As a result, cities having a high frequency of HLA-B*08, for example, will have a higher chance of having high DI for RA even if this city is distant from the DI spatial cluster disconsidering HLA information. AD are heterogeneous in regard to prevalence, clinical manifestations, and pathogenesis, being caused by an immune response against constituents of the body's own tissues. Specific HLA alleles can predispose to several AD [10,11]. Indeed, some of the positive and significant correlations between HLA alleles and AD found in our study have already been described in case-control studies. Han et al. [41] established a relationship between HLA-B*08 and RA subtype anti-citrullinated-protein-autoantibodynegative (ACPA − or seronegative) in a study involving 2406 ACPA − case and 13,930 control individuals. Alsaied et al. [42] found an association between HLA-DRB1*03 and juvenile RA in Kuwaiti Arab children, and Manivel et al. [43] established an association between HLA-DRB1*03 and RA subtype anti-CII (anti fibrillar collagen type II) in the Swedish population. On the contrary, Lysandropoulos et al. [44] tested the relation between MS and HLA-B*08, but the result was inconclusive. Concerning CD, Goyette et al. [45] found a significant association with DRB1*01, but there are no other studies correlating CD and HLA-A*29 and HLA-B*38. Differently from our findings, Konda Mohan et al. [46] ] and Bizzari et al. [47] indicated a protective role for HLA-DRB1*03 for RA in Indian and Arabic populations, respectively. These results may indicate that some relationships between AD and HLA background may be population-specific, which highlights the potential of spatial analyses to identify small-scale A*D clusters in populations from a similar background.
Nonetheless, some limitations of this study should be taken into account: the bone marrow donor individuals are not the same reported in the DATASUS system for the mentioned diseases, in addition to the already mentioned limiting number of cities having enough sample size for allele frequency analysis. Besides, the data used in the DATASUS system refers to the number of hospitalizations for each disease and, because we use data of chronic disease, the same person may hospitalize more than once for the same condition. However, spatially studies can serve at least as preliminary models of genetic × disease interaction to guide further investigations and promote public health actions.
Understanding the demographic processes that affect the genetic diversity of human populations at a spatial scale can be useful in public health policies in the present. The study of the HLA diversity at the population level is invaluable in disease-association studies and in the effectiveness of bone marrow transplantation programs. Thus, the results presented in this study, such as the heterogeneous genetic structure and the A*D spatial correlations, demonstrate the importance of the integrated use of large databases with spatial-specific analysis approaches, and may indicate the need to implement space-specific interventions to guide policy planning and decision making in public health.
Despite all the potential use of GIS, this tool is still underutilized in public health centers around the world. Georeferencing is an essential first step in making it possible to analyze public health data geographically [48]. Through the georeferencing of public health data it is possible to perform a spatial analysis for public health systems [49]. The correct use of GIS can inform and educate professionals and the public, give more power to decision making at all levels, assist in planning and adjusting clinical and cost-effective actions, monitor and analyze changes in health levels and exposure to disease [50].

Conclusions
In this study, we used GIS tools to evaluate the spatial correlation between HLA alleles and occurrence of AD in Rio Grande do Sul, based on data available from governmental health agencies. To the best of our knowledge, this is the first study that investigates the spatial correlation between genetic data and AD occurrence. The results presented in this study highlights the potential of spatial analyses to identify the interaction between alleles and diseases in populations from a similar background. The use of information from large databases such as REDOME and DATASUS together with georeferencing tools can help in the identification of useful markers in population genetics that may confer resistance or susceptibility to diseases. Genetic mapping of populations and the spatial analyzes such as those performed in this work have great economic relevance and can be very useful in the formulation of public health campaigns and policies, contributing to the planning and adjustment of clinical actions, as well as informing and educating professionals and the population.