Open-source environmental data as an alternative to snail surveys to assess schistosomiasis risk in areas approaching elimination

Background: Although the presence of intermediate snails is a necessary condition for local schistosomiasis transmission to occur, using them as surveillance targets in areas approaching elimination is challenging because the patchy and dynamic quality of snail host habitats makes collecting and testing snails labor-intensive. Meanwhile, geospatial analyses that rely on remotely sensed data are becoming popular tools for identifying environmental conditions that contribute to pathogen emergence and persistence. Methods: In this study, we assessed whether open-source environmental data can be used to predict the presence of human Schistosoma japonicum infections among households with a similar or improved degree of accuracy compared to prediction models developed using data from comprehensive snail surveys. To do this, we used infection data collected from rural communities in Southwestern China in 2016 to develop and compare the predictive performance of two Random Forest machine learning models: one built using snail survey data, and one using open-source environmental data. Results: The environmental data models outperformed the snail data models in predicting household S. japonicum infection with an estimated accuracy and Cohen’s kappa value of 0.89 and 0.49, respectively, in the environmental model, compared to an accuracy and kappa of 0.86 and 0.37 for the snail model. The Normalized Difference in Water Index (NDWI) within half to one kilometer of the home and the distance from the home to the nearest road were among the top performing predictors in our final model. Homes were more likely to have infected residents if they were further from roads, or nearer to waterways. Conclusion: Our results suggest that in low-transmission environments, investing in training geographic information systems professionals to leverage open-source environmental data could yield more accurate identification of pockets of human infection than using snail surveys. 
Furthermore, the variable importance measures from our models point to aspects of the local environment that may indicate increased risk of schistosomiasis. For example, households were more likely to have infected residents if they were further from roads or were surrounded by more surface water, highlighting areas to target in future surveillance and control efforts.

Meanwhile, in more densely vegetated areas of China, other environmental characteristics have been identified as strong predictors of snail habitats, including elevation, humidity, annual average precipitation, vegetation type, the normalized difference vegetation index (NDVI) and distance to the nearest waterbody (6, 10, 19). There is considerable variability in key predictors of snail habitat when comparing across different ecosystem types, even within regions of China. For example, one study spanning several hundred kilometers along the Yangtze River in Anhui Province indicated that distance to the nearest river was the most important predictor of snail habitats in marshland ecosystems, whereas 100-m resolution summaries of mean temperature and annual precipitation were the most important predictors within the hilly regions of the Province (19). Notably, across smaller geographic areas, factors like ecosystem type and climatic conditions would not be expected to vary meaningfully, highlighting the ongoing need to identify metrics that can predict schistosomiasis at fine spatial scales.
Fewer studies have demonstrated the use of environmental characteristics to directly predict human schistosomiasis risk (9, 12, 22). A recent study in Senegal found that measures of vegetation and water contact area were better predictors of S. haematobium reinfection in children in a highly endemic region than measures from snail surveys (12). Similarly, studies of S. japonicum infection in China have found that measures of vegetation and proximity to rivers were predictive of human infection clusters (9, 22). In all three studies, the models were designed to identify infections at the village scale. Further research is needed to identify higher resolution environmental proxies of human infection in low transmission settings. Moreover, localized investigations of the suitability of different environmental measures for predicting human infection across a range of settings are needed, as snail habitat preferences, suitability and transmission risks may vary substantially from ecosystem to ecosystem (22-25).
As China approaches schistosomiasis elimination goals, the perceived payoff of comprehensive infection and snail surveys will decrease, making it likely that resources will be diverted to other priorities in the coming decades. To avoid a resurgence in schistosomiasis, it is crucial to develop cost-effective, low-labor surveillance techniques that can pinpoint, at fine geographic scales, areas of high infection risk in areas approaching elimination.
Precision risk mapping can enable targeting of resources to high-risk areas for testing, treatment or transmission-blocking interventions. The proliferation of high-resolution, open-source geospatial data products offers an opportunity to develop new methods for mapping schistosomiasis risk in areas where control programs have reduced but not fully eliminated schistosomiasis.
Our study aimed to determine whether open-source environmental data that are freely available and less time- and labor-intensive to collect than snail survey data can directly predict household S. japonicum infection distribution with a similar or improved degree of accuracy compared with data obtained during snail surveys. To do this, we developed and compared two models for predicting household S. japonicum infection among rural farming communities in Sichuan Province, China. In our first model, we used geocoded snail survey data to build a set of predictors and determine how well the proximity and density of snail habitat relative to the location of the home predict household S. japonicum infection status. In the second model, we drew on freely available, open-source environmental data to create a set of measures characterizing local surface water and vegetation density in the area surrounding the home in the months prior to infection surveys. By comparing the ability of these two models to predict fine-scale geographic patterns of human S. japonicum infection, our study provides valuable information on the utility of each of these surveillance techniques for identifying potential high and low risk households in communities where low levels of persistent S. japonicum infection are obstructing elimination goals. Furthermore, by comparing the relative predictive performance of a range of measures of snail habitat and environmental conditions at different proximities to the home, this study sheds light on those characteristics of the local environment that can best be leveraged for predicting household S. japonicum infection risk and targeted in local prevention and control efforts.

Setting and village selection
This study was conducted in 2016 in two counties located in the hilly regions of Sichuan Province, China. County surveillance records were used to select ten villages at high risk of reemergent or ongoing S. japonicum transmission. We conducted a census in each village, attempted to geocode the location of all households using handheld Global Positioning System (GPS) devices, and surveyed each household for S. japonicum infection, as described below. The number of households in the selected villages ranged from 19 to 75, with between 50 and 250 residents in each village at the time of data collection. We restricted this analysis to households for which: a) GPS coordinates were successfully recorded, and b) at least one household resident was tested for S. japonicum infection during the 2016 infection surveys. Of the 463 households identified during the census, a total of 283 households (61.1%) had both GPS and infection survey data and are therefore included in this analysis. See Figure 1 for details on household exclusion and inclusion.

Data collection and sources
Human infection data were collected in July 2016 as a part of ongoing research efforts in the region assessing persistent schistosomiasis hotspots. All village residents over the age of five were invited to participate in the study. Each participating individual was asked to provide three stool samples on consecutive days. All samples were labelled with the date of collection and participant ID numbers and stored in a cooler or cool room (ideally <10°C) until they could be transported to the central laboratory for processing. Samples were examined using the miracidium hatching test, following standard protocols (26). In brief, for each sample, 30 grams of stool was suspended in water (pH range of 6.8-7.2), strained to remove large particles (80 head nylon mesh), strained again to retain the remaining solids (280 head nylon mesh), and then suspended in water at room temperature (28-30°C). At 2, 4 and 8 hours after suspension, the samples were examined for the presence of miracidia for at least 2 minutes each time. An individual was considered positive if any of the three hatch tests were positive.
The habitats of O. hupensis snails in the present study were determined during a national survey on O. hupensis that was conducted in 2016. Snail habitats were first identified by trained professionals from county anti-schistosomiasis control stations using historical records that date back to the 1950s. Historical habitats were digitized by global positioning and geographic information systems (GIS). Surveys of Oncomelania snails were then conducted in the field via transect walks between the months of April and October 2016 using standard systematic sampling methods (27). Briefly, each historic, existing or suspected snail environment was divided into sampling frames set every 5-10 meters along the water line for linear features (e.g., ditches, rivers, etc.) and every 5-10 meters along the periphery of polygon features (e.g., flooded paddy fields, ponds, etc.), with parallel lines extending from each to form a set of sampling frames of 25-100 m² covering each site. The majority of existing or suspected sites were characterized by shallow, stagnant or moving water (e.g., a stream, pond, rice paddy or irrigation ditch), as these conditions are the preferred habitat of the amphibious freshwater O. hupensis snails (28). For each site, 20% of the sampling frames were randomly selected to be investigated on foot for the presence of snails. The digitized maps were updated using handheld GPS devices to document present and absent habitat locations, shapes, and whether and how historic habitats had been destroyed or changed (e.g., land use change via urbanization). For each site, the environment type was recorded as either a polygon or line feature in the dataset. Polygons most frequently represented rice paddy fields in the area, though they occasionally represented other habitat sites such as a small pond, a dry field or a beach.
Meanwhile, line features most commonly represented dirt or concrete irrigation ditches used for flooding rice paddies, though they could also correspond to streams or other narrow waterways. For the purposes of this analysis, all polygons are referred to as "fields", while all lines are referred to as "ditches".
Data on waterbodies, waterways and roads in Sichuan Province, China were obtained from the OpenStreetMap project, downloaded on November 11th, 2021 (29). The OpenStreetMap project draws on the knowledge of local communities of mappers to build a database detailing roads, waterways, transportation and other built and natural environment features (30). OpenStreetMap contributors use aerial imagery, handheld GPS devices, and field maps, both to generate the data and to verify the accuracy of the open data on a regular basis (30). The roads data from the OpenStreetMap project range from national freeways and motorways down to gravel tracks and paths, while the waterways and waterbodies include permanent water features such as large rivers, streams, canals, lakes and reservoirs. Details on the OpenStreetMap data used in this analysis can be found at: https://download.geofabrik.de/osm-data-in-gis-formatsfree.pdf (31).
Elevation data were obtained from the Japan Aerospace Exploration Agency Earth Observation Research Center's (JAXA EORC) Advanced Land Observing Satellite (ALOS) global digital surface model, which has a horizontal resolution of approximately 30 meters (32). To calculate indices of vegetation and waterbody coverage, the U.S. Geological Survey's (USGS) Earth Resources Observation and Science (EROS) Center's image library for Landsat-8, Collection 1 was accessed from the USGS Earth Explorer website (https://earthexplorer.usgs.gov/) to obtain data on surface reflectance bands 2-5, as well as the QA band (33). The Landsat-8 satellite repeats its orbital pattern every 16 days (34), resulting in a total of 12 available observations across 2016 that occurred prior to our July infection surveys, which were downloaded for use in this study. The National Aeronautics and Space Administration's (NASA) preprocessed Moderate Resolution Imaging Spectroradiometer (MODIS) Terra satellite imagery database was also accessed to obtain 250-meter resolution data on vegetation at 16-day intervals at all available timepoints in 2016 prior to the infection surveys (35).

Variable definitions and generation
Outcome variable
S. japonicum infection survey results from the ten study villages were aggregated to the household level and spatially joined to the geographic location of the home. To avoid issues with multicollinearity resulting from residents of the same household having the same values for all environmental predictors used in this analysis, the outcome was a binary measure of household infection status, with "0" indicating no infections detected among participating household members, and "1" indicating that one or more household members tested S. japonicum positive. Each household was represented in the geospatial dataset as a point feature.
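The two-step rule described above (an individual is positive if any hatch test is positive; a household is positive if any tested member is positive) can be sketched as follows. This is an illustrative sketch only: the record layout, identifiers and function name are hypothetical and are not taken from the study's data files.

```python
from collections import defaultdict


def household_infection_status(hatch_results):
    """Collapse individual hatch-test results to a binary household outcome.

    hatch_results: iterable of (household_id, person_id, [test1, test2, test3])
    where each test is True (miracidia seen) or False.
    Returns {household_id: 0 or 1}.
    """
    status = defaultdict(int)
    for household_id, _person_id, tests in hatch_results:
        # An individual is positive if any of the three hatch tests is positive.
        person_positive = any(tests)
        # A household is positive if one or more tested members is positive.
        status[household_id] = max(status[household_id], int(person_positive))
    return dict(status)


# Hypothetical records for two households.
records = [
    ("HH01", "P1", [False, False, False]),
    ("HH01", "P2", [False, True, False]),
    ("HH02", "P3", [False, False, False]),
]
household_status = household_infection_status(records)
```

Here `household_status` maps HH01 to 1 and HH02 to 0, so each household contributes a single row to the modelling dataset regardless of how many residents were tested.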

Explanatory variables
Model 1: Snail survey data
Using the snail survey data collected in 2016, predictors were generated to assess how the household's position in relation to surrounding snail habitats influences household-level S. japonicum infection risk (Table 1). The geocoded snail habitat data were divided into two categories: present snail habitat sites, and absent snail habitat sites. Present snail habitat sites were those sites where one or more snails were identified during the survey period, while absent snail habitat sites were those where snails were not found during the 2016 survey. The data were further grouped into "ditches" (i.e., line features deemed suitable for snail habitation) and "fields" (i.e., polygon features deemed suitable for snail habitation), resulting in four snail habitat categories: present ditches, present fields, absent ditches, and absent fields.

Table 1 note: For the snail survey data models, present sites are those where at least one snail was found, while absent sites are those where no snails were found during the 2016 snail surveys. For the environmental data models, NDVI and NDWI were calculated using Landsat-8, Collection 1 satellite data collected on January 23rd, February 8th, and April 28th (dates where there was < 30% cloud cover). Pre-processed EVI data from NASA's Moderate Resolution Imaging Spectroradiometer (MODIS) Terra satellite were averaged across a total of 12 observations occurring at 16-day intervals between January 1st and July 10th, 2016.
Using ArcGIS Pro software (36), three different buffer sizes (0.25, 0.5 and 1.0 kilometer (km) radius length) were generated and applied to each household location using the "Buffer" analysis tool. These buffer radius lengths were defined such that the largest buffer (1 km) generally spanned the entire village area for a centrally located household, whereas the smallest buffer (0.25 km) spanned the immediate surroundings of a given household. The "Summarize Within" analysis tool was used to calculate the total length (km) of present ditches and absent ditches that fell inside each household buffer area. This step was repeated for the present fields and absent fields, calculating the total area of fields (km²) encapsulated by each household buffer. The "Near" analysis tool was then used to calculate the geodesic distance (in meters) between each household point and the nearest present ditch, absent ditch, present field, and absent field. The "Join Field" data management tool was used to join all of the newly created variables summarizing the length and area of ditches and fields into a single table, which was then exported to the project's geodatabase for use in R. Including separate measures for the distance to the nearest ditch and nearest field, as well as the total length and area of ditches and fields within a given buffer area, helped us determine whether these features varied in their relative predictive capacity or in their respective spatial scales (i.e., buffer sizes) of influence.
Open-source environmental and remotely sensed data were compiled to create a geospatial dataset containing a range of hypothesized environmental (built and natural) predictors of household S. japonicum infection. Potential environmental predictors were selected if they were 1) previously identified or hypothesized in the literature to serve as predictors of schistosomiasis infection or snail habitat sites; and 2) made publicly available at a 250-meter resolution or finer for the entire study area. Most of the predictors represented natural features of the environment (e.g., waterbodies, elevation, vegetation indices, etc.). Human-made environmental features like roads were also included, as the relative remoteness or connectedness of a given household was hypothesized to be a factor associated with schistosomiasis infection status. Roads and waterways from the OpenStreetMap project were included as line features, while waterbodies were water features coded as polygons. The "Near" analysis tool was then used to calculate the geodesic distance (km) between each household and the nearest road, waterway, or waterbody.
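As a rough illustration of a nearest-feature distance, the sketch below computes the great-circle (haversine) distance from a household to the closest vertex of a line feature. It is not the ArcGIS "Near" tool: haversine is a spherical approximation of geodesic distance, measuring to vertices rather than to the line itself can overstate the distance, and the coordinates are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius (spherical approximation)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearest_feature_km(household, feature_vertices):
    """Distance (km) from a household (lat, lon) to the closest feature vertex."""
    lat, lon = household
    return min(haversine_km(lat, lon, v_lat, v_lon) for v_lat, v_lon in feature_vertices)


# Hypothetical household and road vertices (not study coordinates).
home = (30.0, 104.0)
road_vertices = [(30.0, 104.01), (30.02, 104.0)]
distance_to_road_km = nearest_feature_km(home, road_vertices)
```

In this toy case the nearest vertex lies roughly one kilometer east of the home. Repeating the call with waterway or waterbody vertices would give the other two distance predictors.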
Prior studies have suggested that elevation is negatively associated with the presence of O. hupensis snails (17,37). We used 30-meter resolution elevation data from JAXA EORC's ALOS satellite to extract the elevation (m) value that corresponded with each household point location.
The presence of either water or vegetation can provide opportunities for water contact and has the potential to affect human infection risk. In this study, we use the NDWI (38) to estimate water content across the study area, and the NDVI (39) and the Enhanced Vegetation Index (EVI) (40) to describe vegetation health and density in the study area. The NDWI identifies water features and distinguishes them from soil and vegetation surfaces (38). The NDVI is chlorophyll-sensitive and provides a measure of crop and vegetation health, while the EVI is more sensitive to canopy variations and performs particularly well in high biomass regions (41). As such, the two vegetation measures complement each other and are frequently used jointly in vegetation studies (41). Whereas NDWI and NDVI were calculated from Landsat-8, Collection 1 data, pre-processed EVI data at 250-meter resolution were downloaded directly from NASA's MODIS data library (35).
A total of 12 Landsat-8 satellite images were collected between January and July of 2016, all of which were examined and processed to remove all medium to high confidence clouds, cloud shadows or other sources of terrain occlusion. To do this, the QA band files made available by USGS for each of the Landsat-8 observations (34) were used in ArcGIS Pro with the "Remap" raster function to recode all grid cells corresponding with terrain occlusion as "No Data", while all clear or low confidence cloud cells were set to equal 1. The "Clip" raster function was then used to remove all cells obscured by cloud cover from each corresponding Landsat-8 image. Overall, cloud cover over the study area was high between January and July 2016, ranging from 9.97% to 100%, with an average cloud coverage of 66.13% across the 12 satellite observations made within that period. To assess the extent to which the removal of cloud cover and terrain occlusion would result in missing data for each household at each time point, the "Raster to Point" tool was used to convert the cloud-corrected satellite data to a grid of points, and the "Extract Multi Values to Points" geoprocessing tool was used to extract the data corresponding to each household's point location. A total of 3 of the 12 Landsat-8 collections had < 30% cloud coverage and had cloud-corrected data available for between 98.5% and 100% of households. The remaining 9 collections had cloud cover ranging from 33% to 100%, which resulted in data for between 0% and 65% of households. As such, we restricted our use of the Landsat data to the 3 collections with < 30% cloud coverage (collected on January 23rd, February 8th, and April 28th of 2016).
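The screening logic in this paragraph (mask occluded cells, check how many households retain data, keep only scenes under the cloud threshold) can be sketched in a few lines. The cloud fractions, household values and function names below are hypothetical illustrations, and the QA decoding itself is assumed to have been done already.

```python
def mask_occluded(values, occluded):
    """Replace occluded cells with None (the 'No Data' step)."""
    return [None if hidden else v for v, hidden in zip(values, occluded)]


def household_coverage(value_by_household):
    """Fraction of households with a non-missing extracted value."""
    total = len(value_by_household)
    present = sum(v is not None for v in value_by_household.values())
    return present / total if total else 0.0


def usable_scenes(cloud_fraction_by_date, max_cloud=0.30):
    """Dates whose scene-level cloud fraction is below the threshold."""
    return sorted(d for d, frac in cloud_fraction_by_date.items() if frac < max_cloud)


# Hypothetical scene-level cloud fractions (not the study's values).
cloud_fractions = {
    "2016-01-23": 0.12,
    "2016-02-08": 0.20,
    "2016-03-11": 0.80,
    "2016-04-28": 0.25,
}
selected_dates = usable_scenes(cloud_fractions)

# Hypothetical reflectance values extracted at three households for one scene.
masked = mask_occluded([0.11, 0.09, 0.14], [False, True, False])
coverage = household_coverage(dict(zip(["HH01", "HH02", "HH03"], masked)))
```

With these inputs, three dates pass the 30% threshold and two of the three households keep a value for the example scene.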
Using ArcGIS Pro's "Raster Calculator" Image Analyst tool, the NDWI on January 23rd, February 8th and April 28th were each calculated from the cloud-corrected Green and Near Infrared (NIFR) Landsat Surface Reflectance bands (bands 3 and 5 in Landsat-8, respectively), using the following formula developed by McFeeters (1996) (38):

$$\mathrm{NDWI} = \frac{\mathrm{Green} - \mathrm{NIFR}}{\mathrm{Green} + \mathrm{NIFR}}$$

NDVI on January 23rd, February 8th and April 28th was calculated from the cloud-corrected Red and NIFR Landsat Surface Reflectance bands (bands 4 and 5 in Landsat-8, respectively), using the following formula (39):

$$\mathrm{NDVI} = \frac{\mathrm{NIFR} - \mathrm{Red}}{\mathrm{NIFR} + \mathrm{Red}}$$

The "Mosaic to New Raster" tool was then used to calculate an estimate of the average NDWI and NDVI across the three time points with sufficient data coverage.
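As a minimal sketch of the two normalized-difference indices described above, the functions below compute NDWI and NDVI for a single cell and average a cell across dates. The reflectance values are hypothetical, and skipping missing dates when averaging is an assumption about how cloud-masked cells are handled, not a documented detail of the study's workflow.

```python
def normalized_difference(a, b):
    """(a - b) / (a + b), or None if an input is missing or the sum is zero."""
    if a is None or b is None or (a + b) == 0:
        return None
    return (a - b) / (a + b)


def ndwi(green, nifr):
    """McFeeters NDWI from Green (band 3) and near-infrared (band 5)."""
    return normalized_difference(green, nifr)


def ndvi(red, nifr):
    """NDVI from Red (band 4) and near-infrared (band 5)."""
    return normalized_difference(nifr, red)


def mean_ignoring_missing(values):
    """Average across time points, skipping masked (None) observations."""
    present = [v for v in values if v is not None]
    return sum(present) / len(present) if present else None


# Hypothetical surface reflectance for one cell on one date.
cell_ndwi = ndwi(green=0.08, nifr=0.12)  # water index, negative over dry land
cell_ndvi = ndvi(red=0.06, nifr=0.12)    # vegetation index

# Hypothetical NDWI for the same cell on three dates, one of them masked.
cell_mean_ndwi = mean_ignoring_missing([-0.20, None, -0.18])
```

The example cell gives an NDWI of about -0.2 and an NDVI of about 0.33, and its three-date mean NDWI is computed from the two unmasked dates only.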
EVI is calculated based on Blue, Red and NIFR Reflectance bands, as well as a gain factor (G), a soil adjustment factor (L), and two coefficients (C_1 and C_2) used to correct for aerosol scattering, as shown in the following formula (39, 41):

$$\mathrm{EVI} = G \times \frac{\mathrm{NIFR} - \mathrm{Red}}{\mathrm{NIFR} + C_1 \times \mathrm{Red} - C_2 \times \mathrm{Blue} + L}$$

All 12 of the pre-processed, cloud-corrected EVI measures at 16-day intervals for the period between January and July 2016 were downloaded from NASA's MODIS Terra Satellite Imagery database (35), and were combined into a single, average EVI measure using the "Mosaic to New Raster" tool in ArcGIS Pro.
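For readers who want to see the EVI calculation spelled out, the sketch below evaluates the standard EVI expression for one cell. The default coefficients (G = 2.5, C1 = 6, C2 = 7.5, L = 1) are those used in the standard MODIS EVI product; the text above does not state them, so treat them as an assumption here. The reflectance values are hypothetical, and in the study itself EVI was downloaded pre-processed rather than computed.

```python
def evi(blue, red, nifr, G=2.5, C1=6.0, C2=7.5, L=1.0):
    """Enhanced Vegetation Index for one cell.

    G: gain factor; C1, C2: aerosol correction coefficients;
    L: soil adjustment factor. Defaults follow the standard MODIS product
    (an assumption; not stated in the text).
    """
    denominator = nifr + C1 * red - C2 * blue + L
    if denominator == 0:
        return None
    return G * (nifr - red) / denominator


# Hypothetical surface reflectance values for a vegetated cell.
example_evi = evi(blue=0.04, red=0.06, nifr=0.30)
```

With these inputs the denominator is 1.36 and the EVI is about 0.44, which falls in the 0.2 to 0.8 range generally associated with healthy vegetation.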
Each of the final NDWI, NDVI and EVI measures was converted to a grid of points using the "Raster to Multi Points" tool. As was done for the snail data, three different buffer sizes were generated around each household point (0.25, 0.5 and 1.0 km radius), and the "Summarize Within" analysis tool was then used to calculate the average NDWI, NDVI and EVI in the 0.25, 0.5 and 1 km area surrounding each household. See Table 1 for a summary of all predictors included in the environmental predictors model.
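The buffer summaries can be pictured as averaging all grid points that fall within a given radius of the home. The sketch below does this with planar coordinates in kilometers (i.e., it assumes the points are already in a projected coordinate system); the grid values and names are hypothetical and this is not the ArcGIS "Summarize Within" implementation.

```python
import math

BUFFER_RADII_KM = (0.25, 0.5, 1.0)


def buffer_mean(home_xy, grid_points, radius_km):
    """Mean index value of grid points within radius_km of the home.

    grid_points: iterable of (x_km, y_km, value); missing values are None.
    """
    hx, hy = home_xy
    inside = [
        value
        for x, y, value in grid_points
        if value is not None and math.hypot(x - hx, y - hy) <= radius_km
    ]
    return sum(inside) / len(inside) if inside else None


# Hypothetical NDWI grid points at increasing distance east of one home.
grid = [(0.1, 0.0, -0.10), (0.4, 0.0, -0.20), (0.9, 0.0, -0.30), (1.5, 0.0, 0.50)]
ndwi_by_buffer = {r: buffer_mean((0.0, 0.0), grid, r) for r in BUFFER_RADII_KM}
```

Each household therefore receives one value per index and buffer size, which is how the same raster yields separate 0.25, 0.5 and 1 km predictors.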

Analysis
To assess how well snail survey and environmental data predict household S. japonicum infection status, and to compare the performance of a model built on snail survey data with one built exclusively on open-source environmental data, a Random Forest (RF) machine learning approach was used. After the snail habitat dataset and the environmental predictors dataset were each generated in ArcGIS Pro, the R-ArcGIS Bridge from the 'arcgisbinding' R package was used to transfer data between ArcGIS and RStudio for the RF analysis (42). Each dataset was split 75/25 for training and validation, respectively. For each training dataset, we oversampled the minority class to correct for class imbalance in our outcome variable (13.8% of households were S. japonicum positive). In total, three different balanced training datasets were generated for the snail data, and three for the environmental data, yielding a total of six balanced datasets that were used for RF model training. This approach allowed us to assess the stability of model performance metrics and variable importance rankings in light of our oversampling approach. The 'caret' package in R was used to perform a 10-fold cross-validation process to tune each model, helping to determine the optimal maximum node size to use and the number of variables to try at each branch. For each RF model, we specified 5000 trees per forest, as a high number of trees is recommended to help stabilize variable importance rankings (43).
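The split-then-oversample step can be sketched as below. This is a plain-Python illustration, not the R code used in the study: the simple random split, the seed values and the row layout are assumptions, and the count of 39 positive households is simply the whole number that reproduces the reported 13.8% of 283.

```python
import random


def split_train_validation(rows, train_fraction=0.75, seed=0):
    """Shuffle rows and split them into training and validation sets."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def oversample_minority(rows, label_key="infected", seed=0):
    """Resample the minority class with replacement until classes are equal."""
    rng = random.Random(seed)
    positives = [r for r in rows if r[label_key] == 1]
    negatives = [r for r in rows if r[label_key] == 0]
    minority, majority = sorted([positives, negatives], key=len)
    if not minority:
        return list(rows)  # nothing to resample from
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    balanced = majority + minority + extra
    rng.shuffle(balanced)
    return balanced


# Hypothetical dataset: 283 households, 39 positive (about 13.8%).
households = [{"id": i, "infected": int(i < 39)} for i in range(283)]
train, validation = split_train_validation(households)

# Three balanced training sets from three different resampling seeds.
balanced_sets = [oversample_minority(train, seed=s) for s in (1, 2, 3)]
```

Only the training rows are resampled; the validation set keeps its original class imbalance, which is why the kappa statistic and the No Information Rate matter when the models are evaluated.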
The reserved validation data were used to test each model and calculate performance statistics (accuracy, Cohen's kappa statistic, area under the receiver operating characteristic curve (ROC AUC), sensitivity, specificity, positive predictive value (PPV) and negative predictive value (NPV)). To compare performance between models, the best model was defined as the one with the highest kappa value, followed by accuracy and ROC AUC, respectively. The kappa statistic was selected as our main metric for indicating model performance because our reserved validation datasets had a high degree of class imbalance (13.8% of households were S. japonicum positive), and the kappa statistic was developed to help correct for bias related to over-rewarding the prediction of the majority class (44). Model accuracy was also compared to the No Information Rate (NIR), which indicates the accuracy that would be expected if the majority class were predicted every time (NIR = 0.859). A high NIR value results when there is a high degree of class imbalance for the outcome of interest, as was the case in this study. Finally, in the event of a tie in the kappa and accuracy of two models, the ROC AUC was used to select a final, top performing model.
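To make the relationship between these metrics concrete, the sketch below computes them from a 2x2 confusion matrix. The counts in the example are hypothetical: they are not reported in the text, but they are one set of whole numbers for a 71-household validation set that is consistent with an NIR of 0.859.

```python
def performance(tp, fp, fn, tn):
    """Classification metrics from confusion-matrix counts.

    tp/fn: infected households predicted positive/negative;
    fp/tn: uninfected households predicted positive/negative.
    """
    n = tp + fp + fn + tn
    accuracy = (tp + tn) / n
    # Agreement expected by chance, from the row and column totals.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n**2
    return {
        "accuracy": accuracy,
        "kappa": (accuracy - expected) / (1 - expected),
        "nir": max(tp + fn, fp + tn) / n,  # always predict the majority class
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }


# Hypothetical validation counts: 10 infected and 61 uninfected households.
metrics = performance(tp=5, fp=3, fn=5, tn=58)
```

With these counts accuracy is about 0.89 while kappa is only about 0.49, showing how a model can barely exceed the NIR in accuracy yet still demonstrate agreement beyond chance.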
To determine which predictors were the most influential in predicting S. japonicum infection status in our models, the mean decrease in accuracy (MDA) values of predictors were visualized in variable importance plots for each model. For each of the three environmental data models and three snail data models, the top ten predictors indicated by the model's MDA plots were given a score of 10 to 1 (10 being the score of the top predictor). Variable scores were then summed across the three models to create a three-model summary score of 0 to 30, 30 being the highest score possible, while a score of 0 indicates that the variable was never ranked among the top ten predictors. Simple logistic regression models and lowess plots were examined to determine the direction of association between household S. japonicum infection status and each predictor.
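The three-model summary score is a simple rank-sum, sketched below. The variable names and rankings are hypothetical placeholders rather than the study's MDA results, and the rankings are truncated to three entries for brevity.

```python
from collections import Counter


def summary_scores(rankings, top_n=10):
    """Sum rank-based scores across models.

    rankings: one list per model, ordered from highest to lowest MDA.
    The top predictor scores top_n, the next top_n - 1, and so on;
    variables outside the top_n in a model score 0 for that model.
    """
    scores = Counter()
    for ranking in rankings:
        for position, variable in enumerate(ranking[:top_n]):
            scores[variable] += top_n - position
    return scores


# Hypothetical MDA rankings from three balanced-data models.
model_rankings = [
    ["ndwi_0.5km", "road_distance", "ndwi_1km"],
    ["ndwi_0.5km", "ndwi_1km", "road_distance"],
    ["ndwi_0.5km", "road_distance", "ndwi_1km"],
]
scores = summary_scores(model_rankings)
```

A variable ranked first in all three models reaches the maximum of 30, and looking up a variable that never appears in a top ten (for example `scores["elevation"]`) returns 0.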

Prediction mapping
Using the top performing RF model, a map of the predicted probability of S. japonicum infection across the entire study area was generated. Within ArcGIS Pro, the "Raster to Point" tool was used to generate a grid of points covering the entire study area surface. The grid dataset was then exported to R using the R-ArcGIS Bridge to calculate the predicted probability of infection at each point across the study area. These predicted probabilities were added to the grid dataset and exported back to ArcGIS Pro for mapping. Finally, the "Point to Raster" tool was used to transform the predicted probabilities into a raster surface, using the "Mean" method for the cell assignment type. All analyses were conducted in ArcGIS Pro 2.8.3 and R version 4.1.2 using RStudio (36, 42).

Results
Village-level S. japonicum infection prevalence (n=10) ranged from 0% to 27.1%, while the number of infections per household ranged from 0 to 3, with a mean of 0.16 (Standard Deviation (SD)=0.44) infections per household across the 283 households included. A total of 4,896 historical or current snail habitat sites were identified in the study area, of which 1,092 (22.30%) were found to contain one or more snails. None of the snails identified during the snail surveys were found to be infected with S. japonicum. In total, 1,485 (30.33%) of the surveyed sites were categorized as fields. Within 1 km of the home, the total area of fields (present or absent) ranged from 0 to 0.19 km², while the average was 0.06 km² (SD=0.06) for absent fields, and 0.04 km² (SD=0.06) for present fields. A total of 3,413 (69.7%) sites were categorized as ditches. The total length of ditches within 1 km of the home ranged from 0 to 7.31 km, with an average length of 1.74 km (SD=1.70) for present ditches, and 2.22 km (SD=1.35) for absent ditches. Figure 2 shows the snail survey data and village prevalence in the study area and an example of the geographic distribution of household infections in relation to the snail habitat sites within one village.
On average, the homes in our study villages were located closer to a road (0.36 km) than to a waterbody (2.11 km) or waterway (3.02 km). The mean elevation of households in the study villages was 573 m. Surface water in the area surrounding the home was generally low. NDWI values can range from -1 to 1, with a value of < 0 indicating a surface with little to no water content, though a threshold of >0.3 has been proposed as a reasonable value to use for identifying waterbodies (45). In our study, the mean NDWI within 1 km of the home was -0.19 (SD=0.01). Similarly, the NDVI and EVI range from -1 to 1, with lower values indicating more barren landscapes. Values lower than 0.1 for NDVI represent low vegetation areas (e.g., rocks, sand or snow), while values greater than 0.6 correspond with temperate and tropical forests (46). For the EVI, values between 0.2 and 0.8 are generally used to indicate healthy vegetation (47). The average NDVI and EVI within 1 km of the home were 0.18 (SD=0.02) and 0.40 (SD=0.02), respectively. Table 2 provides summary statistics for the household predictors included in this analysis.

[...] (Figure 3). Despite being outperformed in all other metrics, each snail model had a higher ROC AUC than the environmental data models, with the best performing snail model producing a kappa, accuracy and AUC of 0.37 [...]

Due to the high degree of imbalance between the outcome classes across the study period, the Cohen's kappa statistic is a useful metric for our models, as it helps to correct bias that results when rewarding the prediction of the majority class. The kappa statistic and accuracy of the environmental models indicated strong predictive performance. The accuracy of all three open-source environmental data models was 0.89, slightly higher than the NIR of 0.86.
The kappa statistic for all three environmental data models was 0.49, indicating the predictive capacity of the environmental models was "Moderate" (0.41-0.60) when using the Landis & Koch (1977) benchmarks (44). The ROC AUC for the environmental models ranged from 0.78 to 0.80. While the sensitivity and PPV for the environmental predictor models were still relatively low (sensitivity: 0.5; PPV: 0.63), the specificity (0.95) and NPV (0.92) were very high for all three models.

Variable Importance
The mean NDWI within 0.5 km of the home was the top predictor in all open-source environmental data models, resulting in a three-model summary score of 30 (Table 4 and Figure 4). Distance to the nearest road and the mean NDWI within 1 km of the home were the next most important predictors, each with a summary score of 23. EVI within 1 km and 0.5 km of the home were also ranked in the top five predictors, followed by NDVI at 0.5 km and NDVI at 1 km, which were ranked 6th and 7th, respectively. None of the variables that used a 0.25 km buffer around the home was ranked in the top 50% of predictors, nor was elevation, the distance to waterbodies or waterways, or the number of people tested per household.
Figure 4. For each of the three models generated with the snail data and the environmental data, variable importance was determined using Mean Decrease in Accuracy (MDA). Each variable is assigned one color across all three models such that color can be used to highlight major shifts in variable importance ranks between models.

Table 4. [...] obtained for each by oversampling the minority outcome class. These balancing repetitions were used to assess the stability of model performance metrics and variable importance rankings that result from using an oversampling approach to create a balanced training dataset. After tuning each model using ten-fold cross-validation, the final models were run on the reserved testing data to generate model performance metrics and variable importance summaries (indicated by the Mean Decrease in Accuracy (MDA)). The ten predictors with the highest MDA in each model were given a score of 10 to 1 (10 being the score of the predictor with the highest MDA). Variable scores were then summed across the three models to create a three-model summary score of 30 to 0, 30 being the highest score possible (ranked first in all three models), while a score of 0 indicates that the variable was not ranked in the top ten in any of the three models. In this table, the top ~50% of predictors (determined by the three-model summary score) are shown above the dotted line in black, while those that were in the bottom 50% are below the dotted line and shown in gray.
The total length of all absent ditches (i.e., ditches where no snails were found) within 1 km of the home was the top predictor for all three snail models, followed by the distance to the nearest absent field, the distance to the nearest present field, and the distance to the nearest present ditch, respectively (Table 4, Figure 4). As with the environmental data models, none of the variables that used the smallest buffer size (0.25 km) around each home to summarize the snail habitat were ranked among the top 50% of predictors in the three-model summary score.

Logistic Regressions and Predictions
In our simple logistic regression analyses for the environmental predictors, we found that the distance to the nearest road was the only predictor ranked among the top 50% of predictors that was also significantly (p-value <0.05) associated with household S. japonicum infection status (Table 5). For each 1 km increase in the distance between the home and the nearest road, the log odds of household infection increased by 1.30 (standard error = 0.60, p-value = 0.03). NDWI and EVI within 0.5 km and 1 km of the home were positively associated with household infection status, whereas NDVI was negatively associated with infection status. The logistic regression results suggest a nonlinear relationship between household infection status and NDWI, NDVI, and EVI within 0.5 to 1 km of the home. An exploratory examination of lowess plots between the log odds of household infection status and NDWI, NDVI and EVI suggested threshold points, as the approximate lower quartile of NDWI (~< -0.2) and the upper quartile of NDVI (~>0.2) were associated with a lower log odds of infection than other values (data not shown). Although neither distance to the nearest waterway nor elevation was ranked among the top 50% of the environmental predictors, both were strongly negatively associated with household infection status (p-value <0.01).
For the snail predictors data (top) and the environmental predictors data (bottom), simple logistic regression models were run to determine the direction of association with household S. japonicum infection status. Each predictor was scaled to make a one-unit change represent meaningful incremental changes.
The units used for each snail variable are as follows: for the distance to the nearest present ditch, absent ditch, present field and absent field, the unit of change was 1 km; for the total present ditch length and total absent ditch length within 0.25 km, 0.5 km and 1 km of the home, the unit of change was 1 km; for the area of present fields and area of absent fields within 0.25 km, 0.5 km and 1 km of the home, the unit of change was 0.1 km²; for the number of people tested in the home, the unit of change was 1 person. The units used for each environmental variable are as follows: for NDWI, NDVI and EVI, the unit of change was 0.1 (index range of -1 to +1); for the number of people tested in the home, the unit of change was 1 person.
For the snail predictors, the distance from the home to the nearest field where snails were present, the total length of present ditches within 0.5 km of the home, and the total area of absent fields within 0.5 km of the home were among the top 50% of predictors and were also linearly associated with household S. japonicum infection status (p-value <0.01). For each 1 km increase in the distance between a field where snails were present and the home, the log odds of household infection increased by 0.92 (SE=0.33, p=0.005). Likewise, for each 0.1 km² increase in fields where no snails were found within 0.5 km of the home, the log odds of household infection decreased by 2.78 (SE=1.05, p=0.008). For every 1 km increase in the length of ditches where snails were found within a 1 km radius of the home, the log odds of household infection increased by 0.78 (SE=0.26, p=0.002). Of those snail variables that were significantly (p<0.05) associated with household infection, fields were associated with a lower risk of infection, while ditches were associated with an increased risk of infection.
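Because the coefficients above are reported on the log-odds scale, it can help to convert them to odds ratios. The sketch below does this for the four coefficients quoted in the text, and adds an approximate 95% confidence interval and a two-sided p-value. It assumes a Wald (normal) approximation, which the text does not confirm was the method used; the function name `wald_summary` is hypothetical, and because the inputs are rounded, the recomputed p-values may differ slightly from those reported.

```python
import math

def wald_summary(beta, se, z_crit=1.96):
    """Odds ratio, approximate 95% CI, and two-sided Wald p-value
    for a logistic regression coefficient on the log-odds scale."""
    z = beta / se
    # Two-sided p-value under a standard normal: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return {
        "odds_ratio": math.exp(beta),
        "ci_low": math.exp(beta - z_crit * se),
        "ci_high": math.exp(beta + z_crit * se),
        "p_value": p_value,
    }

# (log odds, standard error) as reported in the text; labels are descriptive only.
reported = {
    "distance to nearest road, per 1 km": (1.30, 0.60),
    "distance to nearest present field, per 1 km": (0.92, 0.33),
    "absent field area within 0.5 km, per 0.1 km^2": (-2.78, 1.05),
    "present ditch length, per 1 km": (0.78, 0.26),
}

for label, (beta, se) in reported.items():
    s = wald_summary(beta, se)
    print(f"{label}: OR={s['odds_ratio']:.2f} "
          f"(95% CI {s['ci_low']:.2f} to {s['ci_high']:.2f}), p={s['p_value']:.3f}")
```

An odds ratio above 1 corresponds to a positive log-odds coefficient (higher odds of household infection per unit increase), and an odds ratio below 1 to a negative one.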
Given that the kappa and accuracy of the three final environmental models were higher than the kappa and accuracy of all the snail habitat models, the environmental model with the highest ROC AUC (Model 1; see Table 3) was used as our final prediction model. Using the final model, we generated a prediction surface for the entire study area to illustrate the predicted probability of infection across different landscapes within the study area (Figure 5). The predicted probability of infection for the study area ranged from 0.2% to 89.6%.

Discussion
In this study, we set out to gain a better understanding of the strengths and limitations of on-the-ground surveillance as compared to remote sensing and open-source environmental data for identifying pockets of schistosomiasis in a region approaching elimination. In our analysis, we found that the open-source environmental data models outperformed the snail data models in predicting household S. japonicum infection status in rural farming communities in Sichuan, China. Across our models, the sensitivity, specificity, NPV, PPV, kappa and overall accuracy of the environmental data models were higher than those of the snail data models. This has important implications. Whereas snail surveys are labor-intensive and time-consuming pursuits, the data from the environmental predictors models are readily available and free to download from the OpenStreetMap Contributors (30). Although investing in training geographic information systems (GIS) professionals with access to and proficiency in software designed for performing geospatial analyses would inevitably involve upfront costs, once obtained, this highly skilled and specialized workforce could repeat this analysis in other areas to fine-tune these models for use in other geographic regions and contexts and evaluate the generalizability of the findings presented in this study. What's more, the analytical methods presented in this study could also be broadly applied by GIS professionals to a range of emerging or reemerging diseases across different landscapes and ecosystems in China and beyond in order to improve the broader understanding of the environmental conditions that can promote or interrupt the transmission of environmentally mediated diseases.
As more locations across China approach elimination goals and S. japonicum becomes increasingly rare, intensive prevention and control programs and their S. japonicum-dedicated teams are likely to be phased out in favor of targeted surveillance and response methods. It is therefore becoming increasingly important to explore a range of lower-input alternatives to large-scale snail surveys for monitoring S. japonicum risk in the years to come. In this study, the low false positive rate of our environmental models (specificity = 0.951) suggests that open-source environmental data can serve as an effective alternative to large-scale snail surveys for ruling in the possibility of S. japonicum infection at fine spatial scales in areas on the verge of elimination. This is useful in the context of resource-limited control programs, in that it can serve as a first step in identifying areas where infections are likely to be present (and, conversely, ruling out areas where infections are unlikely to be found). This can enable resources such as infection screening, prophylaxis and improved sanitation to be directed to areas that are predicted to have a high infection probability, and avoid diverting resources to regions where infections are unlikely to be present.
When looking at the relative variable importance of the open-source environmental data models, the three-model summary score highlighted NDWI within 0.5 km of the home, distance to the nearest road, and NDWI within 1 km of the home as the first, second and third best environmental predictors of household S. japonicum infection, respectively. Homes that were further from a road were significantly more likely to have one or more S. japonicum infections. This finding is consistent with the results of other studies, which have suggested that schistosomiasis infection risk is higher in areas that are further from a city (48, 49), a phenomenon potentially related to lower access to healthcare in more remote locations, as has been suggested elsewhere (50). In this study, NDWI was positively associated with infection risk, with some evidence of a threshold effect. This suggests that residents in homes situated in areas with more surface water nearby (within 1 km) have a greater risk of S. japonicum infection, an association that could be due to increased opportunities for human exposure to schistosomes through water contact, as has been previously found (9, 19, 49, 51-54). In a similar vein, we found that homes that were closer to waterways, as well as those at lower elevations, were more likely to have S. japonicum infection than those that were further from waterways or situated at higher elevations. In the case of elevation, the negative association with household S. japonicum infection could potentially be linked to water accumulation at lower elevations, or a greater risk of encountering O. hupensis snails at lower elevations, as has been found in other studies (17, 28, 37).
Taken together, the fact that the distance to the nearest waterway and household elevation were strong predictors of household infection, and that NDWI within 0.5 km and NDWI within 1 km of the home were the first and third best predictors in our RF models, is consistent with what is known about the important role of water in the schistosomiasis transmission cycle and highlights the utility of using measures of surface water accumulation as a simple means of schistosomiasis risk characterization and surveillance.
In this study, NDVI was negatively associated with household infection such that households in the highest quartile of NDVI (~>0.2) had lower infection risk. Meanwhile, both the highest and lowest quartiles of EVI had lower infection risk than the middle two quartiles. Given that the EVI is particularly sensitive to canopy health (38), while NDVI tends to measure lower-lying crop health (41), our findings suggest that areas with moderate levels of canopy cover and lower levels of crop vegetation are at higher risk of S. japonicum infection. As far as we are aware, these results have not been replicated elsewhere, warranting further investigation into this phenomenon and its potential underlying mechanisms.
While the models using snail survey data did not perform as well as the open-source environmental data models, we identified a few key predictors that shed light on the relationship between snail habitat and human infections. First, proximity to and the total length of ditches in the area surrounding the home (0.5-1 km radius) were consistently among the top predictors of household S. japonicum infection and generally followed the anticipated direction of association. For example, in the case of ditches where snails were present, our simple logistic regression models suggest that homes that were closer to, or had a greater density of, ditches where snails were present were more likely to have one or more residents with S. japonicum infection. This aligns with our expectations, as more snail habitat sites near the home would be expected to correspond with an increasing number of opportunities to encounter infected snails and become infected. Second, we found surprising evidence that fields may be protective against S. japonicum infection: a greater density of fields near the home where snails were present or absent, and proximity to fields where snails were present, were all associated with decreased household S. japonicum infection risk. While determining why this might be the case was beyond the scope of this study, we hypothesize that it is related to a lower overall density of snails across fields, as compared to ditches where snails are likely more compactly situated.
We assessed which spatial scales were most relevant to household S. japonicum infection risk by applying three different buffer sizes (0.25 km, 0.5 km, 1 km) around the home to summarize each of our four main snail habitat predictor categories (present fields, absent fields, present ditches, absent ditches), and our three environmental indexes measuring surface water and vegetation (NDWI, NDVI, EVI). For all models, only those predictors that used a 0.5 or 1 km buffer were among the top 50% of predictors according to our three-model summary score. Although other studies from China have also highlighted the importance of aggregated or village-scaled measures of S. japonicum risk (55, 56), this is an important consideration in light of a push for precision mapping of schistosomiasis. We found that some of the strongest predictors of a high-resolution outcome (household-level infection) were characteristics of the neighborhood rather than the area immediately surrounding the home. Overall, this highlights the important role that spatial scales can play when assessing predictors of environmentally mediated diseases like schistosomiasis. As a result, we suggest that future studies and interventions focused on the proximity of the home to snail habitats or high-risk environmental features will benefit from considering a range of potential scales of influence, rather than focusing solely on the immediate surroundings of the home.
There are a few limitations to this analysis that warrant further discussion. First, we had a relatively small sample size (N = 283 households), and many predictors in each of our models (N = 17 in the snail data models, and N = 14 in the open-source environmental data models). While RF models are well recognized for being robust to small sample sizes and large predictor sets (57), smaller samples result in reduced power to detect rare events and an increased risk that the sample is unrepresentative of the underlying population. We compensated for this, in part, by running multiple models and summarizing broad-scale trends in performance and variable rankings that held across multiple iterations of model building. Additionally, because our focus was on fine-scale prediction, the relatively small study area of interest (~700 km²) resulted in the exclusion of certain predictors (e.g., temperature and precipitation). With a maximum distance of <25 km between any two households, weather would not be expected to vary substantially, and therefore was not considered in this study. Furthermore, the relatively small geographic area of this study tends to restrict the generalizability of our findings to those areas with similar ecosystems and climates (hilly regions of China), and those that use similar snail survey techniques to those used in this study.
Another limitation in this analysis was the class imbalance for the outcome variable, as misclassification rates tend to increase when using RF models to predict outcomes that do not have roughly equal numbers of observations within each category (58). Overall, 39/283 (13.8%) households had one or more cases of schistosomiasis. To account for the high degree of class imbalance in our outcome, we oversampled the minority class in the training datasets. However, for the reserved validation dataset, the class imbalance remained, resulting in inflated accuracy measures. As such, we recommend that readers prioritize the kappa statistic over the accuracy measure when considering the performance of our models, as this was developed to help correct for bias due to class imbalance (44). Another noteworthy limitation in this analysis is that the variable importance measures were likely impacted by the high degree of correlation between some of our predictors (e.g., two different measures of vegetation, or the three different spatial scales used to develop predictors), as the variable importance rankings that are used in RF models become less reliable when predictors are highly correlated with one another (59). As such, the relative rankings of predictors should be interpreted with caution, and readers should instead look at broad-scale trends in predictor rankings (e.g., ranked in the top 50% of predictors, versus the bottom 50% of predictors). Finally, it is worth noting that snail survey data are inherently incomplete, as it is inevitable that not every snail is going to be detected in each field, ditch or other environmental feature, and snail surveys provide only a snapshot of highly dynamic snail populations. The snail surveys were conducted using standard protocols (27), meaning that the data on snail habitats are likely to represent typical data for the area, making our analysis an assessment of the predictive capacity of real-world snail data.
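To illustrate why accuracy can look inflated under class imbalance while kappa does not, the sketch below computes both from a 2x2 confusion matrix. The function name `binary_metrics` is hypothetical, and the example is a thought experiment only: it applies a trivial "always predict uninfected" rule to the overall study prevalence (39 of 283 households), not to the reserved testing data, whose counts are not given in the text.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, Cohen's kappa, sensitivity and specificity from confusion-matrix counts."""
    n = tp + fp + fn + tn
    accuracy = (tp + tn) / n
    # Agreement expected by chance, from the marginal totals of predictions and truth.
    p_yes = ((tp + fp) / n) * ((tp + fn) / n)
    p_no = ((fn + tn) / n) * ((fp + tn) / n)
    p_e = p_yes + p_no
    kappa = (accuracy - p_e) / (1 - p_e) if p_e < 1 else 0.0
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return {
        "accuracy": accuracy,
        "kappa": kappa,
        "sensitivity": sensitivity,
        "specificity": specificity,
    }

# Thought experiment: label every household "uninfected" at the study's
# overall prevalence of 39 infected households out of 283.
always_negative = binary_metrics(tp=0, fp=0, fn=39, tn=244)
print(always_negative)
```

In this thought experiment the trivial rule is right for 244 of 283 households (accuracy of about 0.86) yet has a kappa of 0, because all of its agreement with the truth is what chance alone would produce.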

Conclusion
In this study, we compared the use of labor-intensive snail survey data with that of open-source environmental data for developing prediction models aimed at predicting household infection status among rural farming communities in China. Overall, we found that freely available environmental data can be used to predict household infection status among rural farming communities in Sichuan Province, China, with high accuracy. Furthermore, the open-source environmental data ultimately outperformed the snail habitat data, suggesting that, prior to conducting comprehensive snail surveys, the overarching goal of the surveys ought to be considered to determine whether less resource-intensive methods might be suitable. Not only has this analysis helped to improve our understanding of where and when transmission is most likely to be occurring in the study area, but it has also highlighted specific aspects of the local environment that are associated with household infection (for example, homes that are furthest from roads, or those surrounded by more surface water), which can become the target of future surveillance and control efforts. Ultimately, by expanding the current body of knowledge on the utility of using open-source environmental data for predicting infection risks as well as some of the limitations and uses of snail survey data in the context of household risk characterization, this study provides valuable insight on priority locations and corresponding tailored control activities that can be used to maximize the impact of surveillance and intervention efforts in areas approaching elimination.

Figure 1
Depiction of household exclusion and inclusion.

Figure 2
Map of the study villages.

Figure 4
Variable importance plots for the snail and environmental data models.

Figure 5
Prediction map showing the probability of S. japonicum infection using the top-performing environmental data model.
The final top performing model was defined as the one with the highest kappa, accuracy, and receiver operating characteristic (ROC) area under the curve (AUC), respectively. Model performance metrics (Cohen's kappa and accuracy) highlighted that the open-source environmental data models outperformed the snail data models. The top performing environmental data model was used to create a prediction surface of the probability of S. japonicum infection across the entire study area.