Ethics approval
This study was approved by the Institutional Review Board of the National Institute of Health in Mozambique.
Study setting
Mozambique is a southern East African country with some of the lowest rankings for health and development globally. The World Bank classifies Mozambique as a low-income country, with gross domestic product per capita in 2016 of US $1200 [17]. According to the 2015 Human Development Index, Mozambique ranks 181st out of 188 countries [18]. The country has 26.5 million inhabitants, with more than half of the population under the age of 18 and 45% under age 15. It is also among the eight countries with the highest HIV prevalence, with 13.2% of the adult population infected [19]. Mozambique has made great strides in decreasing under-5 and infant mortality over the past decade, although decreases have not been uniform across the country [20]: areas in the center and north generally have higher mortality rates and lower development indicators. Most (> 65%) of the population of Mozambique lives in rural areas, most often in groupings of households aggregated by a kinship system forming a homestead [21]. These homesteads are usually easy to identify as they are separated from other household groupings by significant distances or physical barriers such as fences or mud/concrete walls. Among countries in sub-Saharan Africa, only Swaziland, Lesotho, and Malawi have a higher percentage of their populations living in rural areas. Rural areas thus often have clusters of buildings surrounded by open fields for subsistence farming; often each cluster of buildings represents a homestead with multiple generations cohabitating.
After gaining independence from Portugal in 1975, Mozambique endured a 16-year civil war that displaced millions and led to near-complete destruction of essential infrastructure. The civil war ended in 1992 and was followed by a period of stability and peace until 2012/2013, when an insurgency by the RENAMO political group (Mozambican National Resistance; Portuguese: Resistência Nacional Moçambicana) restarted, primarily affecting Sofala and Manica provinces in central Mozambique (see Fig. 1 for map of provinces). This insurgency against the country’s ruling FRELIMO (Mozambique Liberation Front; Portuguese: Frente de Libertação de Moçambique) has links to the decades-long Mozambican civil war and includes nighttime raids on cities and villages, as well as violent conflicts targeting main transport corridors. The fighting between RENAMO and FRELIMO intensified in 2014, leaving parts of Sofala and Manica provinces unsafe to survey in the present investigation.
Background on aim of survey and sample size calculations
The overall purpose of this study was to conduct a community survey to evaluate the impact of a 7-year health system strengthening intervention occurring in Sofala Province [16], using Manica Province as an evaluative control. In order to collect essential endline data for this impact evaluation, we conducted the following steps: (1) collaborative digitization and georeferencing of all visible buildings in Sofala and Manica Provinces using remote satellite imagery to serve as an estimate of population distribution for survey sampling; (2) development of a probability proportional to size sampling frame using digitized buildings as a proxy for population distribution; (3) field implementation of house-to-house survey activities; and (4) survey fidelity checks and the development of survey weights for data analysis. These steps are described in detail below, along with initial data on the performance of the field implementation procedures and the final sample drawn.
We based our sample size on the number of households visited for the standard DHS in Mozambique. The standard DHS in 2011 aimed to visit 1300 households in Sofala and 1200 in Manica [22]. To ensure our sample would exceed these numbers, we targeted a minimum of 1500 households in each province. Thus, we selected 88 grids within each province to reach 1760 households, anticipating ongoing conflict and transportation issues could preclude the inclusion of some clusters/households.
Mapping buildings in Sofala and Manica provinces using satellite imagery
Health Alliance International (HAI) contracted the Humanitarian OpenStreetMap Team (HOT), a non-governmental organization (NGO) and global mapping community, to digitize and georeference all visible buildings in Sofala and Manica Provinces to serve as an estimate of population distribution for survey sampling. HOT worked with a team of 20 trained mappers based in Dar es Salaam, Tanzania, who used the Java OpenStreetMap Editor to trace polygons on all buildings in the two provinces visible in satellite imagery. Each mapper traced polygons in a given work area, and this work was then reviewed by an expert quality-control supervisor before being confirmed and uploaded into the OpenStreetMap database. These mapping activities started in February 2015 and were completed in May 2015, with a total of 1,610,902 buildings digitized (685,189 Sofala; 925,713 Manica). Basemap satellite imagery was obtained from Microsoft Bing and Mapbox Satellite, with all basemap imagery being from 2013 to 2016. Interested parties can view the final mapped buildings at www.openstreetmap.org, and export the most up-to-date OpenStreetMap project data for Mozambique at: https://download.geofabrik.de/africa/mozambique.html. Also, see Fig. 2 for examples of digitized building polygons in Beira City, Sofala, Mozambique, as well as a more rural area of Chibabava district, Sofala, Mozambique. Customized thematic maps for Figs. 3, 4 and 5 were created in ArcMap 10.5.
Developing primary sampling units for probability proportional to size sampling
Once the building digitization was complete, we exported the shapefiles (including building shapes for Sofala and Manica) from the OpenStreetMap platform. We then used R 3.2.3 (Comprehensive R Archive Network) to overlay a grid on Sofala and Manica Provinces using the GridFilter command explained at http://bit.ly/2xCI5Ge (see Additional file 1 for raw R code). The spatial resolution we chose for the primary sampling unit “grid” (PSU) was 0.02 decimal degrees, with each box being 2.106 km across using the Universal Transverse Mercator Projected Coordinate System 37S (UTM PCS 37S) for Mozambique. This resolution balanced the granularity of the sampling frame against the need to minimize the number of grid PSUs containing no buildings or fewer than 20 buildings, which was the number of households to be selected from each PSU. We then tabulated the number of buildings within each PSU and exported the final database file, pairing each PSU with a building count, using R. This resulted in 28,391 total PSUs (14,764 Sofala; 13,627 Manica). Even at this resolution, 12,178 PSUs (7004 Sofala; 5174 Manica) contained no buildings. Thus, the final number of PSUs for probability proportional to size (PPS) sampling was 16,213 (7760 Sofala; 8453 Manica). PPS sampling here means that the probability of selecting each PSU “grid” cell was directly proportional to the number of buildings contained within that cell. Of these PSUs included in the final sample, the mean building count was 99.3 (SD = 341.9), with a median building count of 44. See Fig. 3 for a thematic heatmap of these PSUs colored by the number of buildings in each 2.1 × 2.1 km grid cell.
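The tabulation of buildings per grid cell can be illustrated with a minimal Python sketch. This is not the R code in Additional file 1: it assumes each building is reduced to a (longitude, latitude) centroid, that the 0.02 decimal degree grid is anchored at (0, 0), and that the function names and example coordinates are hypothetical.

```python
import math
from collections import Counter

CELL_SIZE_DEG = 0.02  # PSU grid resolution reported in the text


def cell_index(lon, lat, cell_size=CELL_SIZE_DEG):
    """Return the (column, row) index of the grid cell containing a point.

    Assumption: the grid is anchored at longitude 0, latitude 0.
    """
    return (math.floor(lon / cell_size), math.floor(lat / cell_size))


def count_buildings_per_cell(centroids, cell_size=CELL_SIZE_DEG):
    """Tabulate building counts per grid cell from (lon, lat) centroids."""
    return Counter(cell_index(lon, lat, cell_size) for lon, lat in centroids)


# Hypothetical building centroids (lon, lat), invented for illustration only.
centroids = [
    (34.8451, -19.8312),
    (34.8459, -19.8305),
    (34.8703, -19.8311),
]
counts = count_buildings_per_cell(centroids)
```

Cells without any building never appear in `counts`, which mirrors the removal of empty PSUs from the sampling frame.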
Carrying out probability proportional to size sampling
Due to ongoing violent civil conflict, a number of subdistricts in both Sofala and Manica Provinces were unsafe for travel or survey research in September, 2016 when activities were to be launched. This forced our team to exclude all PSUs lying within the subdistricts of Marromeu, Chupanga, Inhamitanga, Inhaminga, Gorongosa-Sede, Mixungue, Divinhe, and Machanga in Sofala and Mandie, Nhamassonge, Nhacafula, Nhacolo, Mungari, Buzua, Dacata, Guro-Sede, Macossa, Nhamangua, Nhampassa, Choa, Catadica, Nguawala, and Chiurairue in Manica prior to drawing our final PPS sample (see Fig. 4 for the locations of these excluded subdistricts). This was unfortunate but necessary to maintain the safety of our research teams. Survey implementation could not be delayed due to the need to measure key indicators as close to the end of our intervention as possible. Intervention activities ended in September 2015, and thus our survey already had a year’s delay between survey initiation and the end of intervention activities.
We then used the SamplePPS command in Stata 14 (available from the Boston College Statistical Software Components archive) which draws a random sample from a current dataset with probabilities proportional to size. In our case, the PSUs were our aforementioned grid, and our size was the number of buildings in each grid. See Additional file 2 for detailed Stata 14 code for PPS sampling. We sampled 88 PSUs within each Province with replacement. A number of the highest density PSUs were sampled multiple times, such as those in Beira and Chimoio cities (see Fig. 4).
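As an illustration of PPS sampling with replacement, the following Python sketch draws PSUs independently with probability proportional to their building counts. It is a sketch under stated assumptions and not a reproduction of the Stata routine in Additional file 2, whose internal algorithm may differ; the PSU identifiers, building counts, and function name are hypothetical.

```python
import random
from collections import Counter


def sample_pps_with_replacement(building_counts, n_draws, seed=None):
    """Draw n_draws PSUs, each draw proportional to the PSU's building count.

    Assumption: draws are independent, so a PSU can be selected repeatedly.
    """
    rng = random.Random(seed)
    psu_ids = list(building_counts)
    weights = [building_counts[psu] for psu in psu_ids]
    return rng.choices(psu_ids, weights=weights, k=n_draws)


# Hypothetical frame: PSU id -> number of digitized buildings.
building_counts = {"A": 500, "B": 44, "C": 20, "D": 99}
draws = sample_pps_with_replacement(building_counts, n_draws=88, seed=1)
times_selected = Counter(draws)  # dense PSUs tend to be drawn several times
```

Because draws are made with replacement, high-density PSUs can appear multiple times in `draws`, as was observed for the PSUs in Beira and Chimoio.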
Field implementation
Prior to initiating field procedures, each supervisor entered the Global Positioning System (GPS) location of the center of each PSU their team was responsible for into OpenStreetMap or Google Maps and determined the optimal transport route to this point. The center of each PSU was calculated using ArcMap 10.5. Many of our sampled PSUs were in very rural areas of central Mozambique with few or no paved roads, many of which are inaccessible for many months of the year due to rains, failed bridges, or other infrastructural challenges. Thus, supervisors had detailed discussions with HAI expert staff drivers to determine individual logistics plans to visit each PSU.
In OpenStreetMap, the supervisor also noted the location of the households closest to the central GPS point. Once arriving at a given PSU, field teams used GPS location on tablet-based Android survey devices and the GPS navigation application Sygic to identify the geographic center of the PSU. Sygic is a tablet or phone-based navigation software that can operate without an active internet connection (more information: https://www.sygic.com/gps-navigation). In rural areas the closest houses to this central GPS location were sampled first, with subsequent households being those whose front door was closest to the front door of the initial household sampled following methods outlined in the World Health Organization’s (WHO) Expanded Programme on Immunization methods [23]. In urban areas, or areas where there were multiple houses equidistant from the starting center GPS location, these households were numbered, with one household randomly selected as the starting household. If the GPS location for the center of the PSU was in a creek, or a lake, supervisors observed this by plotting in OpenStreetMap prior to initiating field activities, and visually identified the household closest to this GPS location to serve as the starting sampling unit. If the closest building to the center of a given PSU was a commercial building, field teams excluded this building from consideration and traveled to the closest residential building. If an apartment complex was selected, the research teams randomly sampled one household every two floors.
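The rural household ordering described above can be sketched in Python as follows. The sketch assumes household locations are available as (latitude, longitude) pairs, uses great-circle (haversine) distance as a stand-in for walking distance between front doors, and ranks subsequent households by distance to the initial household, following the wording above. It does not cover the random choice among equidistant households or the handling of commercial buildings and apartment complexes. All names and coordinates are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    dlat, dlon = lat2 - lat1, lon2 - lon1
    h = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def visit_order(center, households, n_target=20):
    """Return household ids in visiting order.

    The starting household is the one closest to the PSU center; the rest
    are ranked by distance to that starting household (an assumption).
    """
    start = min(households, key=lambda h: haversine_km(center, households[h]))
    rest = sorted(
        (h for h in households if h != start),
        key=lambda h: haversine_km(households[start], households[h]),
    )
    return [start] + rest[: n_target - 1]


# Hypothetical PSU center and household coordinates (lat, lon).
center = (-19.8300, 34.8500)
households = {
    "h1": (-19.8301, 34.8501),
    "h2": (-19.8310, 34.8510),
    "h3": (-19.8400, 34.8600),
    "h4": (-19.8303, 34.8502),
}
order = visit_order(center, households, n_target=3)
```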
Approximately 1 week prior to visiting PSUs, research teams contacted community leaders to notify the population that a survey would be undertaken. When research teams arrived at PSUs, they first visited these community leaders, who often traveled with the teams to the sampled households. This led to near-universal consent to participate in the survey and helped ensure that sampled participants were home and prepared to answer survey questions.
When arriving at a household, research teams first asked to talk with the self-appointed “head of household”. This individual was asked to complete initial informed consent and answer questions related to the general household questionnaire (household demographics, assets, etc.). If the “head of household” was not present, the teams asked to interview another adult (≥ 15 years) who was comfortable answering questions on behalf of the household. After completing the household survey, this individual completed the full adult questionnaire. All individuals surveyed were administered questionnaires in a private area of their homestead. After surveying the “head of household” or household proxy, research teams were instructed to select a woman with a child < 5, both of whom were present in the household, and administer the Child and Women’s Health surveys, including anthropometry and questions on birth history and maternal/child health. If no such woman was available, research teams were to administer the Women’s Health survey to another mother of a child < 5; if no mother of a child < 5 was available, this survey was administered to a mother of a child ≥ 5; and if there were no mothers present, they were to administer the Women’s Health survey to another woman of reproductive age (15–49 years old). If there were no further women of reproductive age in the household, research teams were to administer the Adult survey to any other adults (≥ 15 years) in the household, attempting to maximize gender diversity. That is, if the teams had interviewed two women of reproductive age, they would interview a male adult. If the teams had interviewed a man as “head of household” and a woman of reproductive age, and there were both men and women not 15–49 years old, they would randomly select the next interviewee. A maximum of three individuals were interviewed per household. If no one was present at the sampled household, teams would move to the next household in the sampling procedure.
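The priority order for choosing the Women’s Health survey respondent can be expressed as a small Python sketch. It covers only that one step of the protocol, and the data model is an assumption made for illustration: the `Woman` fields, the tie-breaking rule (first listed woman wins), and the example household are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Woman:
    name: str
    age: int
    youngest_child_age: Optional[int] = None  # None if she has no children
    child_present: bool = False  # is her child under 5 present in the household?


def priority(w: Woman) -> Optional[int]:
    """Return the selection tier (lower is preferred), or None if ineligible."""
    if w.youngest_child_age is not None and w.youngest_child_age < 5:
        return 0 if w.child_present else 1  # mother of a child < 5
    if w.youngest_child_age is not None:
        return 2  # mother of a child >= 5
    if 15 <= w.age <= 49:
        return 3  # other woman of reproductive age
    return None


def select_respondent(women_present: List[Woman]) -> Optional[Woman]:
    """Pick the highest-priority woman; ties go to the first listed (assumption)."""
    ranked = [(priority(w), i, w) for i, w in enumerate(women_present)
              if priority(w) is not None]
    return min(ranked, key=lambda t: t[:2])[2] if ranked else None


# Hypothetical household members present at the time of the visit.
women = [
    Woman("a", 30),
    Woman("b", 28, youngest_child_age=7),
    Woman("c", 22, youngest_child_age=2, child_present=False),
]
chosen = select_respondent(women)
```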
Field implementation progressed in three research teams, each with 3–5 research assistants, one supervisor, and one car. A figure explaining sampling populations, sampling criteria, and survey modules applied to each population group can be seen in Fig. 5.
Data management and collection
Data were collected on Samsung tablets using Open Data Kit (ODK) software (https://opendatakit.org/). Data were directly transferred from ODK to a REDCap database [24] through a cloud server in real time. Household questionnaires were adapted from the Mozambique DHS with additional modules to estimate the burden of cardiovascular disease, mental health conditions, alcohol abuse and epilepsy, as well as a general disability module (see Additional file 3 for full survey materials).
Sampling weight calculation
To calculate sampling weights, we calculated both the probability of selecting each PSU (Eq. 1 below) and the probability of household selection within each PSU (Eq. 2 below).
The probability of each cluster i being sampled in each province h is given by:
$$P_{1ih} = \frac{{\# buildings_{i} \times \# clusters\;to\;be\;sampled_{h} }}{{\# total\;buildings_{h} }}$$
(1)
The probability of each household j being sampled in each cluster i is given by:
$$P_{2ji} = \frac{\# households\;to\;select\;per\;cluster}{{\# buildings_{i} }}$$
(2)
If the number of buildings in a PSU was larger than the sampling interval (the total number of buildings in the province divided by the number of clusters to be sampled), the probability of selecting the PSU was set to 1. Additionally, if the number of buildings in a PSU was less than 20, the probability of selecting households was also set to 1. The overall basic weight of household sampling was the inverse of the probability of selection (Eq. 3 below).
Overall sampling weight of the household is given by:
$$W_{j} = \frac{1}{{\left( {P_{1ih} \times P_{2ji} } \right) }}$$
(3)
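As a worked illustration of Eqs. 1 to 3, the Python sketch below computes the basic household weight, including the two capping rules. The province total of 100,000 buildings is a hypothetical figure chosen for easy arithmetic, and the function names are not taken from the study code.

```python
def psu_probability(buildings_i, clusters_sampled_h, total_buildings_h):
    """Eq. 1, capped at 1 when the PSU exceeds the sampling interval."""
    return min(1.0, buildings_i * clusters_sampled_h / total_buildings_h)


def household_probability(buildings_i, households_per_cluster=20):
    """Eq. 2, capped at 1 when the PSU has fewer buildings than the target."""
    return min(1.0, households_per_cluster / buildings_i)


def sampling_weight(buildings_i, clusters_sampled_h, total_buildings_h,
                    households_per_cluster=20):
    """Eq. 3: inverse of the overall probability of selection."""
    p1 = psu_probability(buildings_i, clusters_sampled_h, total_buildings_h)
    p2 = household_probability(buildings_i, households_per_cluster)
    return 1.0 / (p1 * p2)


# Hypothetical province: 100,000 buildings, 88 clusters to be sampled.
TOTAL_BUILDINGS = 100_000
N_CLUSTERS = 88

w_typical = sampling_weight(100, N_CLUSTERS, TOTAL_BUILDINGS)  # no cap applies
w_small = sampling_weight(10, N_CLUSTERS, TOTAL_BUILDINGS)     # P2 capped at 1
w_large = sampling_weight(2000, N_CLUSTERS, TOTAL_BUILDINGS)   # P1 capped at 1
```

When neither cap applies, the building count cancels and the weight reduces to the total number of buildings divided by (clusters × households per cluster), so such households carry equal weights within a province.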
Out of 176 PSUs (88 PSUs per province), 23 PSUs were excluded because of ongoing regional conflict. As a non-response adjustment, the sampling weights of these 23 PSUs were redistributed to the other PSUs within the same province, consistent with the stratification of PSUs at the provincial level.
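One common way to carry out such a redistribution is to scale up the weights of the retained PSUs in a province so that they sum to the original provincial total. The Python sketch below shows this ratio adjustment as an assumption about the general technique; the exact procedure used in the study is not specified here, and the PSU identifiers and weights are hypothetical.

```python
def redistribute_weights(weights, excluded):
    """Scale retained PSU weights so they sum to the original province total.

    weights: dict of PSU id -> weight for a single province.
    excluded: set of PSU ids that could not be surveyed.
    """
    total = sum(weights.values())
    retained = {psu: w for psu, w in weights.items() if psu not in excluded}
    factor = total / sum(retained.values())  # non-response adjustment factor
    return {psu: w * factor for psu, w in retained.items()}


# Hypothetical province with three sampled PSUs, one of them excluded.
province_weights = {"p1": 50.0, "p2": 50.0, "p3": 100.0}
adjusted = redistribute_weights(province_weights, excluded={"p3"})
```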