Obtaining district-level health estimates using geographically masked location from Demographic and Health Survey data

Background Demographic and Health Survey (DHS) data are an important source of maternal, newborn, and child health as well as nutrition information for low- and middle-income countries. However, DHSs are often unavailable at the administrative unit that is most interesting or useful for program planning. In addition, the location of DHS survey clusters are geomasked within 10 km, and prior to 2009, may have crossed district boundaries. We aim to use DHS surveyed information with these geomasked coordinates to estimate district assignments for use in health program planning and evaluation. Methods We developed three methods to assign a district to a geomasked survey cluster in two DHS surveys from Malawi: 2000 and 2004. Method A assigns districts of origin in proportion to the likelihood that results from repeated simulated geomasking, allowing more than one possible district of origin. Method B assigns a single district of origin which contains the greatest proportion of simulated geomasked survey clusters. Method C maps the geomasked survey cluster’s location to a district polygon. We used these method assignments to estimate a selection of commonly used coverage indicators for each district. We compared the district coverage estimates, confidence intervals, and concordance correlation coefficients, by each of the methods, to those which used validated district assignments in 2004, and we looked at coverage change from 2000 to 2004. Results The methods we tested each approximated the validated estimates in 2004 by confidence interval comparison and concordance correlation coefficient. Estimated agreement for method A was between .14 and .98, for method B the estimated agreement was between .97 and .99, and for method C the agreement ranged from .93 to .99 when compared with the validated district assignments. Therefore, we recommend the protocol which is the simplest to implement—method C—overlaying geomasked survey cluster within district polygon. Conclusions Using geomasked survey clusters from DHSs to assign districts provided district level coverage rates similar to those using the validated surveyed locations. This method may be applied to data sources where survey cluster centroids are available and where district level estimates are needed for program implementation and evaluation in low- and middle-income settings. This method is of special interest to those using DHSs to study spatiotemporal trends as it allows for the utilization of historic DHS data where geomasking hinders the generation of reliable subnational estimates of health in areas smaller than the first-order administrative unit (ADM1).


Background
There is potential for maternal, newborn, and child health, and nutrition (MNCH&N) programs to be informed and evaluated using household surveys [1], while health information systems are scaled-up to adequate quality in low-and middle-income countries (LMICs) [2]. The Demographic and Health Survey (DHS) program in particular provides systematic technical expertise to Ministries of Health in LMICs to design and implement household surveys for nationally representative health information [3]. Collectively, DHS represents an invaluable resource for international health organizations, researchers, and policy makers, and has been utilized in many contexts since the DHS's inception in 1984 [4][5][6].
The DHS program includes the collection of standard MNCH&N coverage indicators, which are often used at the national level for accountability purposes, even though health system requirements are known to vary considerably within countries [7]. As a result, program implementation, intervention coverage, and policy decisions are increasingly important in sub-national areas [8]. In Malawi, for example, the two most recent DHSs, in 2010 and 2015-2016, were sampled to offer reliable coverage estimates at the second-order administrative unit (ADM2), or districts [9]. In most cases, however, DHSs are designed to be representative at the first-order administrative unit (ADM1), often the province or region.

DHS geographic masking
In many surveys, DHS collects a coordinate for the centroid of the survey cluster, also referred to as the primary survey cluster defined by enumeration areas from a country's most recent census [10]. This coordinate is geographically masked to protect the identity of those who participate. The geomasking procedure randomly geomasks the centroid location up to 10 km prior to public release in a two-step process. First, an angle, and second, a distance, are randomly chosen. In urban areas, this distance is between 0 and 2 km, while in rural areas, the distance chosen is between 0 and 5 km for 99% of survey clusters, and between 0 and 10 km for a random 1% of survey clusters [11]. Official census and DHS criteria for urban-rural distinction is country-specific and may be based on population size or infrastructure [12,13].
This geomasking process produces an approximate location of each household while preserving privacy of survey participants. Surveyed household locations can be used to assess spatial variation in health outcomes [14], determine the predictive value of geographic factors on health [15], and link household and facility surveys [16]. In practice, even when finer resolution of surveyed locations is of interest, the geomasking protocol is often ignored, and coordinates are used as given [17]. There has been some analysis from within the DHS program to address the sensitivity in statistical applications involving spatial geomasking. In addition, the data management protocol for DHSs was updated in 2009 so that survey clusters are not geomasked across district boundaries [11]. However, for surveys prior to 2009, it is possible that the geomasking process would place the geographic coordinates of a survey cluster centroid in a district other than the one from which it was sampled, thus misidentifying the district of those who were surveyed.
Our objective, as part of the National Evaluation Platform project [18], is to use household survey data with geomasked coordinates, publically provided by DHS, to calculate indicator coverage estimates for each district by associating survey clusters inside a district, and then using household data within a district to calculate estimates and variances of coverage indicators widely used in the MNCH&N community. We chose two surveys prior to 2009, to introduce the potential that survey clusters may have been displaced across district lines. To our knowledge there has not yet been an analysis of how best to handle DHS geomasking with a statistical approach to determine the district identity of survey clusters, with the aim of identifying, summarizing, and utilizing district level MNCH&N data in program evaluation or health system planning. We tested three methods which approximate the district location of each survey cluster, in the Malawi 2000 and 2004 DHSs and make recommendations for using district level estimates and examining district-level trends in coverage based on these results.

Methods
We aimed to systematically estimate the area from which a survey cluster may have originated. It was not possible to recover a geomasked location, however, we specified the possible original districts, given the publicly available location of the geomasked survey cluster. Because area is increasing at distances further from the origin, the DHS geomasking process is not purely area-based, as in a dart board. A dart board would generate locations evenly distributed throughout an area, e.g., within 2 or 5 km. The DHS geomasking algorithm generates locations that are more likely to be in the area closer to the origin than the administrative boundary limit. To simulate all the locations from which a survey cluster could originate, we first selected a random distance, up to the 5 km maximum in rural areas, and 2 km maximum in urban areas, and then selected a random angle, illustrated in Fig. 1. For each survey cluster, we repeated this process 1000 times. We then used district boundaries to identify districts from which the survey cluster could possibly have been located prior to DHS geomasking. We ignored the 1% chance that a rural cluster is geomasked by more than 5 km. We used these simulated locations in two distinct ways to estimate the district of origin. In the first method, method A, we used the respective frequencies of each simulated district, represented as a fraction of the total 1000 simulated locations for each survey cluster. For example, if simulations identified four districts as possible districts of origin (Fig. 1a), each of these districts was assigned the proportion of simulations out of 1000 that resulted from the simulation process (Fig. 1b). In method B, we selected the district with the greatest respective likelihood, out of all possible districts of origin. For method C, we selected the second order administrative unit which contained the geomasked survey cluster, as the district of origin.

Analysis
For each of three methods, we used the resulting district assignments to estimate three coverage indicators: the proportion of households with piped water, the proportion of children under five who were moderately or severely stunted, and the proportion of infants up to 6 months old who were exclusively breastfed. We chose these measurements to cover a variety of contexts, and for variety in the expected estimate precision. The number of households, not children under 5 years, nor infants under 6 months, are fixed by survey design [19]. Therefore, all or nearly all households in this survey have information on water source, however, not all households have information on children under 5 years of age, and even fewer households have information on infants under 6 months. Thus, we expected stunting to be less precise than the piped water estimate and we expected exclusive breastfeeding to yield the least precise estimate of the three indicators.
We used the open source software R version 3.5.1 to assign districts for all three methods. For method A, we accounted for differences in probability of survey cluster selection by taking the product of the district likelihood and the survey cluster sampling weights [19]. Standard errors and confidence intervals were calculated using the Taylor linearized variance estimation [20]. Coverage estimates and their approximate standard errors were obtained using Stata 13. Source code to replicate methods and analyses is publicly available (see "Availability of data and materials").
For the 2004 DHS survey, we also had validated district locations of each survey cluster by verbal communication from the Nation Statistical Office in Malawi. While these validated district locations are not available publicly, we were able to use them to get validated district estimates. We then compared the resulting district estimates of each method to the validated district estimates [20]. We were not able to obtain validated district locations for the 2000 DHS survey in Malawi, and so we do not include a comparison to validated estimates for 2000. Fig. 1 a The simulation process to identify possible districts of origin (green, yellow, purple, and pink areas), for a hypothetical survey cluster from a Demographic and Health Survey. The maximum displacement area with a radius of two km in an urban area or five km in a rural area (black circle) surrounds the mapped sampling unit location (black dot). First, a distance is chosen (white solid arrow), and then, an angle is chosen, (white broken arrow). b This process is simulated 1000 times to estimate relative likelihoods for possible districts of origin. 200 simulations (white circles) are pictured, method A assigns all four districts non-zero probability proportions: 0.13, 0.24, 0.33, 0.29 (green, yellow, blue, and pink, respectively), while method B identifies the blue district, and method C identifies the yellow district as the district of origin We used geomasked latitude and longitude coordinates, obtained from the geographic dataset for the Malawi 2000 and 2004 DHS surveys; additionally, we used second-order administrative boundaries in Malawi, from the Food and Agriculture Organization (FAO) of the United Nations within the Food Security for Action Programme [21], which has updated and archived all countries, and every year from 1990 to 2014. Employing these two sources of geospatial information, geomasked coordinates and administrative district boundaries, we pursued the three aforementioned methods to identify the district of origin for the survey clusters in Malawi's 2000 and 2004 DHSs.
We included 26 districts in our analysis, out of 28 districts in Malawi at the time of the 2004 survey. We combined Neno with Mwanza in all analyses, as these districts were not separately distinguished in 2000. In addition, the island district of Likoma contained only one survey cluster in both 2000 and 2004, so we excluded Likoma from our analysis. We compared the number of survey clusters assigned to each district in three different district assignment methods as well as the validated districts. Using each method, we compared estimated coverage and confidence intervals, at the second-order administrative unit level, for three indicators against estimates based on Malawi's National Statistics Office-confirmed validated second-order administrative units. We graphically examined estimates, described differences between estimated coverage by districts in Malawi, and assessed agreement between coverage estimates using the concordance correlation coefficient, where we had validated district assignments in 2004 [22]. This statistic ranges from − 1 to 1, where values close to 1 indicate stronger agreement [21]. We also compared the estimated district-level trends in the above described coverages for three district assignment methods A, B, and C. We assume the most reliable trend estimate is provided by the district assignment method that has the highest agreement with the validated district in their estimate from the 2004 DHS.

Results
The number of actual survey clusters per district, as well as those estimated for methods A, B, and C, are shown for 2000 and 2004, in Table 1. In many cases, geomasked survey clusters had no district borders within 5 km of a rural survey cluster, or 2 km of an urban survey cluster, and thus were associated with only one district. In 2000, the number of survey clusters largely differed by 1 to 2 between method B and method C with the exception of: Phalombe, Nsanje, Machinga, Chitipa, Chikwawa, and Mangochi, which were consistent. In 2004, Phalombe, Ntchisi, Blantyre, Mwanza, Chitipa, and Rumphi had the same number of survey clusters although more often, the number of survey clusters differed by 1 to 3 among method B, C, and the validation method. The number of survey clusters in each district using method A is accounted for by assessing both the number of survey clusters that have a geomasked radius contained entirely within district borders, as well the partial survey clusters which resulted when geomasking simulations had more than one possible district of origin.

Estimate variability by method
For the 2004 survey, we plotted coverage estimates using districts approximated with methods A, B, and C against the estimates employing validated district assignments (Fig. 2) as proportions, from zero to one. Estimates appear stable for household piped water and methods B and C. We also examined approximate 95% confidence for coverage estimates, shown in Additional file 1: Tables S1-S3. All three methods produced confidence intervals that overlapped with the validation, in each district, for each of the three indicators that we tested with the exception of stunting in Chiradzulu, Lilongwe, Mulange, Mwanza, Nkhotakota, and Thyolo, for method A. The estimated proportion of households with piped water is shown in Table 2. In 24 of the 26 districts, estimates for this indicator were within ± 0.02 of the validation estimate. In the remaining two districts, Karonga and Nkhotakota, estimates did not differ among methods by more than ± 0.03. Moderate stunting estimates (Table 3) varied from the validation method by as much as ± 0.35 in Chiradzulu, ± 0.17 in Lilongwe, ± 0.22 in Mulanje, ± 0.32 Mwanza, ± 0.26 in Nkhotakota, and ± 0.19 in Thyolo, in method A. Exclusive breastfeeding among infants under 6 months ( Table 4) produced estimates that varied by as much as ± 0.30 from the validation in Chiradzulu and ± 0.47 in Mwanza, ± 0.23 in Nkhotakota, and ± 0.32 in Mulanje, method A.
Agreement between 2004 coverage estimates and the validated district estimates was described using the concordance correlation coefficient. Strong agreement was indicated for methods B and C between estimates and validated districts, and for method A in the coverage of piped water. For piped water, estimated agreement with the validated district assignments was 0.98 (95% confidence interval .

Estimate variability by district
We found broadly consistent estimates for a single district across different methods of district assignment. However, we did observe differences in coverage estimates depending on district. Coverage of piped water was lowest in Nsanje 0.00 (95% confidence not estimable), Machinga 0.00 (0.00, 0.01), and Ntchisi 0.00 (0.00, 0.01) and was highest in Blantyre 0.16 (0.09, 0.23) (  (Table 4). All three indicators showed real, actual variation between districts. Coverage levels are mapped for these indicators using method C and estimates that resulted using validated districts in Fig. 3. When examining the district-level trend in coverage estimates between 2000 and 2004, there are apparent differences between district assignment methods for all coverages considered, especially for moderate stunting and exclusive breastfeeding. The estimated district-level trend was most different for the % with piped water in Rumphi district, ranging from a 1% decrease using method B and a 9% decrease using method C, and in Karonga district, ranging from a 6% increase with method A to a 12% increase with method C. For estimating the percent of children under five with moderate stunting, there were more extreme differences between district assignment methods, especially for Method A. In Mchinji district, method A estimated a 10% increase in stunting, while

Table 1 Number of primary survey clusters per district, per method, in increasing order of district area (km 2 ) in 2000 and 2004
*Method A accounts for survey clusters per district using two columns: a survey cluster's maximum displacement area may be entirely within a district's boundaries, or, a survey cluster may have more than one possible district of origin, in which case partial survey clusters are counted

Discussion
We developed three methods to estimate where survey clusters of DHS surveys originated. Using each method, we estimated three coverage indicators which are commonly used in health system planning and program evaluation. We then compared the district-level estimates, confidence intervals, and agreement that came from each of the methods to the values from validated district locations in a single survey year, and compared estimated district-level trends over time between two surveys (Fig. 4).
Where we had validated district locations for comparison, we found that simulating possible district of origin, which is required for method A and method B, is not an improvement on using the geomasked coordinates directly. Methods A and B require more computational processing and analytical capacity than method C, which is simpler to implement and disseminate in low resources settings. We found that many survey clusters in the Malawi 2004 survey could be confirmed to have been geomasked within original district borders, where no district boundary was within the geomasking limit. Finding multiple possible districts of origin, as for methods A and B becomes irrelevant for surveys in 2009 or later, when DHS amended their geomasking process so that survey clusters are geomasked within district borders. Still we expect that these methods will be useful for estimating trends in survey data where an initial survey was conducted prior to 2009. We saw considerable variability in the trends in district-level estimates between these methods of district assignment, so the choice between methods would generally result in meaningful differences in estimated districtlevel trends. Although we were not able to estimate validated district-level trends over the period 2000-2004, we recommend using trend estimates using method C based on the high agreement between method C and validated district estimates in 2004 and the high availability of this method (Figs. 5, 6).
There are inherent limitations to using complex surveys such as the DHS that are designed for representation at the national or regional level for describing smaller areas such as districts. Household surveys are undertaken with great care and expense so that the best quality information is collected with pre-specified precision due to sampling variability. Optimal DHS sample size requires a trade-off between the resources available and the desired survey precision [23]. None of the methods considered allow that ideal and planned precision be maintained for district level estimation, where information is necessarily less enriched. In these instances, how much actual data is in each district may be of interest, since there is potential to have none, one, or few, survey clusters when these methods are used. For example, exclusive breastfeeding in neonates may result in uninformative estimates for many districts due to too few survey clusters. If sample sizes are too small, users may choose to group districts together or to consider alternate indicators.
These methods can be applied to countries whose DHSs sampled the population to be representative of making assumptions about sensitivity to geomasked locations. The intention for districts assigned using these methods is to estimate coverage, and the precision of coverage, for district populations. District assignments using these methods cannot necessarily be extrapolated when geomasked locations are being used for other purposes, such as estimating distances between households and health facilities. However, district assignments offer information about districts, even where survey design is not specific to the sampling of districts. Similarly, DHSs offer much information about subpopulations such as young infants, even though sampling design is not based on the size of those subpopulations. Another limitation of our methods is that we have not included the district population in any of our potential assignment methods, which may improve the usability of our partial district assignment method A or the most probably district for method B. Incorporating population would require reliable district population estimates at the same time of survey, which may not be available for some surveys. We expect, however, that it is unlikely either methods A or B would be significantly improved here beyond the high agreement between method C and the validated district estimates.

Conclusions
We can use geomasked geolocations from DHSs to describe coverage in districts in LMICs, nearly as if the DHS had provided the validated district of surveyed households. Comparing districts assigned this way to the validated districts, we found coverage estimates and confidence intervals that result from each method are effectively the same as the coverage estimates and confidence intervals that result from the validated assignments.
District data is necessary to better implement health programs, as well as to identify gaps in data where more information is needed. It is possible, with additional research, that these methods could be carried beyond descriptive analyses, and incorporated into hypothesis testing, predictive modeling, and statistical comparisons. Modeled and imputed data can be examined against direct evidence from each district to identify districts without data, where modelling and imputing remain necessary. We advocate ongoing investment in obtaining high-quality MNCH&N data in hard-to-reach subnational areas in LMICs. In the meantime, we recommend that governments, policy makers, implementers, and evaluators access district data for planning, implementing, and evaluating MNCH&N programs in LMICs.
Additional file 1: Table S1-S3. Household piped water, moderate stunting, and exclusive breastfeeding point estimates and 95% CIs, by district and method, in 2000 and 2004.