The use of models to predict the distribution of species is common in ecology, and novel approaches to building these models, such as random forest, have become more widely available in recent years. We used two methods to predict the probability of larval An. gambiae s.l. habitat across the landscape and over time, and the random forest method produced more accurate models than the logistic regression method. This may be due to differences between the two methods in the predicted heterogeneity of larval habitats at fine scales, which can be compared across the entire 10 by 10 km study site (Figure 6). Predictions from the random forest model are more fragmented, with high-probability locations lying closer to low-probability locations than in the estimates from the logistic regression model. The general pattern is similar for the predictions of both models at broad scales (Figure 6). However, the fine-scale heterogeneity in the random forest estimates more closely reflects the actual distribution of larval habitat on the ground, where larval An. gambiae s.l. habitats occur as many small patches rather than one large, continuous patch.
The most important landscape variables for predicting larval habitat presence in these models were TWI and distance to the nearest stream. In the 10 by 10 km random forest model, the mean decreases in the Gini impurity criterion for TWI and distance to the nearest stream were much larger than those for LULC and soil (Figure 5), indicating a stronger association with the prediction of habitat presence. In general practice, high-quality soil and LULC data can be difficult to acquire. Given limited resources, our data suggest it is possible to build reasonably accurate larval habitat models without these two landscape variables. Nonetheless, soil and LULC do show an association with habitat presence according to the logistic regression models presented here.
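To make the importance measure above concrete, the following minimal sketch shows how a decrease in Gini impurity is computed for a single binary split. It is an illustration only, not the analysis code used in this study: the function names and the eight-quadrat toy data are hypothetical, and a random forest averages such decreases over many splits and many trees.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of a list of class labels (0 for a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def gini_decrease(labels, left_mask):
    """Impurity of the parent node minus the size-weighted impurity
    of the two child nodes produced by a binary split."""
    left = [y for y, m in zip(labels, left_mask) if m]
    right = [y for y, m in zip(labels, left_mask) if not m]
    n = len(labels)
    weighted = (len(left) / n) * gini_impurity(left) \
        + (len(right) / n) * gini_impurity(right)
    return gini_impurity(labels) - weighted


# Hypothetical toy data: habitat presence (1) or absence (0) in 8 quadrats.
habitat = [1, 1, 1, 0, 0, 0, 0, 0]
# Hypothetical splits: one on a TWI cutoff, one on a soil class.
high_twi = [True, True, True, False, False, False, False, False]
clay_soil = [True, False, True, False, True, False, True, False]

twi_gain = gini_decrease(habitat, high_twi)
soil_gain = gini_decrease(habitat, clay_soil)
print(twi_gain, soil_gain)
```

In this toy example the TWI split separates presence from absence perfectly and therefore yields a much larger decrease than the soil split, which is the kind of contrast summarized by a variable importance plot.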
An important question in the application of predictive larval habitat models is whether models parameterized with data for habitat locations in one season are applicable to another season. The creation of larval An. gambiae s.l. habitats (which are temporary, small bodies of standing water) depends on rainfall, which varies seasonally across the range of the species complex. One strategy to model differences between seasons is to account for variation in precipitation. In the random forest models, accumulated precipitation was less important than TWI and distance to the nearest stream, but it was a more important predictor variable than soil and LULC (Figure 5). Additionally, we found more larval habitats in months with more precipitation compared to the same area in months with less precipitation (Figure 4). Thus, including accumulated precipitation in our models improved the accuracy of larval habitat location predictions. These results should be interpreted with caution given the use of a single location about 40 km from the study site as the source of precipitation data. Daily precipitation totals can be spatially heterogeneous at that scale. Despite this limitation, it is clear from previous work that variation in precipitation influences larval An. gambiae s.l. habitats [12, 16, 19, 20]. However, the relationship may be more complex than it first appears. For example, it may not be linear. Rather, the number of larval habitats may increase monotonically with accumulated precipitation up to a threshold, after which more of the water on the landscape flows as surface sheet or channeled water, which is unsuitable aquatic habitat for An. gambiae s.l. larvae. Additionally, different habitat types may respond differently to increasing accumulated precipitation. Standing water forming in drainage channels and stream bed pools may be described better by a threshold relationship than the water filling burrow pits, hoof prints and tire tracks, because the former develops from channel and sheet water made stationary by diminished water flows, whereas the latter forms from water accumulating in various catchments not associated with channels. These additional factors may explain some of the uneven residual error seen in Figure 4, where the 4 months in the red box, which fall above the fitted regression line, had more larval habitats at lower accumulated precipitation than the 3 months in the blue box, which fall below the fitted regression line.
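The threshold relationship hypothesized above can be sketched as a simple piecewise-linear function. This is a conceptual illustration under stated assumptions, not a model fitted in this study: the function name, the linear form on either side of the threshold, and all parameter values are hypothetical.

```python
def expected_habitats(precip_mm, threshold_mm, slope_up, slope_down=0.0):
    """Hypothetical piecewise-linear response of habitat count to
    accumulated precipitation.

    Below the threshold the count rises linearly with precipitation.
    Above it the count stays flat (slope_down = 0) or declines
    (slope_down > 0) as standing water is replaced by flowing water.
    The result is never allowed to drop below zero.
    """
    if precip_mm <= threshold_mm:
        return slope_up * precip_mm
    peak = slope_up * threshold_mm
    return max(0.0, peak - slope_down * (precip_mm - threshold_mm))


# Assumed values for illustration only.
for mm in (50, 100, 140, 400):
    print(mm, expected_habitats(mm, threshold_mm=100,
                                slope_up=0.5, slope_down=0.25))
```

Under this sketch, channel-associated habitats might be given a positive slope_down, whereas habitats that fill independently of channels might be given slope_down = 0 or no threshold at all.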
The n-day cumulative precipitation measure used for each modeling approach within each dataset was selected according to the criteria outlined in the methods to maximize the predictive power of each model. However, comparing across modeling approaches within each dataset, the cumulative precipitation measures were highly correlated (Table 1). Thus, the choice between 21-day cumulative precipitation and 30-day cumulative precipitation, for example, may be less important in general practice than using either measure instead of the daily precipitation total (referred to as 0-day in Table 1). Comparing across datasets, which differed in temporal scale, the 6-day cumulative precipitation and the 30-day cumulative precipitation are only moderately correlated. Their differences in terms of model fit (BIC) could reflect temporal differences in hydrology on this landscape, but they may also reflect a limitation of the 10 by 10 km data collection (see below).
A counterintuitive result of this study was that the odds of larval habitat presence decreased with increasing cumulative 6-day precipitation in the best logistic regression model of the 10 by 10 km data. Most likely this reflects a limitation of the 10 by 10 km data collection rather than the true influence of precipitation on larval habitat presence, given the range of cumulative 6-day precipitation over the 49-day period (1.5 mm to 51.1 mm; Figure 3). The sampling strategy for those data was designed to capture variation in landscape variables over space. While precipitation varied among the days of the ground surveys, we were not able to capture that variation over the full range of values for the landscape variables. Instead, the effect of accumulated precipitation in this particular model may indicate some other property that differed between the quadrats sampled on days of higher and lower accumulated precipitation. Alternatively, the temporal scale over which larval habitats respond to variation in accumulated precipitation may be closer to monthly than daily. That is, ground surveys conducted at monthly intervals in the same area may be more likely to differ from one another than daily surveys conducted within a month in the same area. As noted above, this result may also reflect the use of a single location as the source of all precipitation data.
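For readers who wish to derive comparable predictors, the n-day cumulative precipitation measures discussed above can be computed from a daily series as a trailing sum. The sketch below assumes that the n-day value is the survey day's total plus the totals of the n preceding days, so that n = 0 reduces to the daily total (the 0-day measure in Table 1). That window definition, the function name, and the example series are assumptions made for illustration.

```python
def cumulative_precip(daily_mm, n_days):
    """Trailing n-day cumulative precipitation for each day in a series.

    Assumed definition: the value for day i is the sum of the totals
    for day i and the n_days days before it. Days without a complete
    window are returned as None rather than as a partial sum.
    """
    out = []
    for i in range(len(daily_mm)):
        if i < n_days:
            out.append(None)
        else:
            out.append(sum(daily_mm[i - n_days:i + 1]))
    return out


# Hypothetical daily totals in mm.
daily = [0.0, 5.0, 2.0, 0.0, 10.0, 1.0]
print(cumulative_precip(daily, 0))  # same as the daily totals
print(cumulative_precip(daily, 2))
```

Because adjacent windows share most of their days, measures built this way with similar n will tend to be highly correlated.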
In addition to the use of precipitation data from one location, there were other limitations to this study. First, we did not account for spatial autocorrelation in the logistic regression models. Doing so may have slightly widened the confidence intervals associated with the parameters of those models, but it is unlikely to have changed the model comparisons or accuracy evaluations presented here. Previous studies modeling An. gambiae s.l. larval habitat locations have found similar results for logistic regression models with and without parameters accounting for spatial autocorrelation [19–21]. Second, there were additional variables we could have included in our analysis, such as a model-based wetness index (MWI) or the normalized difference vegetation index (NDVI). MWI are similar to TWI, but MWI use simulations of distributed catchment models to account for differences between groundwater gradients and surface gradients, thereby creating more accurate topographic data. We used TWI here because it has performed well in other models of Anopheles larval habitats [19–21] and is easier to implement than MWI. While our models using TWI showed high accuracy, further studies comparing the use of MWI and TWI in larval habitat modeling are needed. NDVI has also been associated with the distribution of malaria [46, 47], although some studies have found contradictory results [17, 48, 49]. NDVI is an indirect measure of available moisture, but NDVI values are additionally influenced by vegetation type and phenology. Thus, we used accumulated precipitation as a measure of available moisture.
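As a reminder of why TWI is comparatively easy to implement, its standard definition is ln(a / tan β), where a is the specific catchment area (upslope contributing area per unit contour width) and β is the local slope. The sketch below computes that quantity for a single cell. The function name and the small minimum slope, used here to avoid division by zero on flat cells, are assumptions for illustration and do not describe the GIS workflow used in this study.

```python
import math


def topographic_wetness_index(specific_catchment_area_m, slope_radians,
                              min_slope=1e-3):
    """Standard TWI, ln(a / tan(beta)), for a single grid cell.

    specific_catchment_area_m: upslope area per unit contour width (m).
    slope_radians: local slope angle in radians.
    min_slope: assumed floor on the slope so flat cells stay finite.
    """
    slope = max(slope_radians, min_slope)
    return math.log(specific_catchment_area_m / math.tan(slope))


# Hypothetical cells: a gentle valley bottom and a steep upper slope.
valley = topographic_wetness_index(5000.0, math.radians(1.0))
ridge = topographic_wetness_index(20.0, math.radians(20.0))
print(valley, ridge)
```

Cells with large contributing areas and gentle slopes receive high values, which is consistent with the low-lying locations where standing water tends to collect.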
Finally, the models developed here exclusively used physical and environmental factors as predictor variables, but the formation of larval An. gambiae s.l. habitats also depends on human behavior. For example, landowners in Asembo create small drainage channels around fields. Standing water left behind in the channels creates habitats for An. gambiae s.l. larvae. These drainage channels are often located in low-lying agricultural areas, and therefore our models were able to predict the locations of most of them. However, drainage channels are not found in all low-lying agricultural areas, probably in part because of individual variation in landowner decision-making. Larval habitats formed from burrow pits and aggregations of hoof prints are also subject to variation in human behavior. While our models were able to correctly predict the locations of most of these habitats, interactions between the physical landscape and human behavior likely account for some of the locations identified incorrectly by the models.
The sampling designs of these two datasets allowed us to address two complementary goals. The monthly surveys in Aduoyo-Miyare and Nguka captured variation in precipitation across both dry and rainy seasons in the same landscape. This provided a stronger logical basis for inferences about the relationship between seasonal variation in precipitation and variation in the location and number of larval habitats. The small spatial extent of Aduoyo-Miyare and Nguka made monthly surveys more feasible, but it also limited the applicability of the model results across a larger area. Conversely, limiting the ground surveys of the 31 quadrats from the 10 by 10 km study site to one season likely impeded our ability to infer much about the effect of precipitation from these data. On the other hand, concentrating our sampling effort to increase replication across space in the 31 quadrats captured more variation in landscape variables, allowing us to apply the results of models based on these data to a larger area.
As a general application, the spatially stratified sampling strategy used in the 10 by 10 km site could serve as a framework for creating predictive larval habitat models for larval control. Targeted larval control is often cited as a useful application of predictive larval habitat models [20, 21], and we agree that there is potential for this application. For example, malaria control programs could identify areas suited to environmental management, such as filling in burrow pits and engineering drainage channels to drain more completely. Additionally, allowing larvicide application crews to focus on areas with a higher probability of larval habitat presence would reduce the time, and therefore the cost, of larviciding. However, models fitted to data from a single geographic location may have limited generalizability. Malaria control programs could overcome this limitation by using spatially stratified random samples, repeated across a variable landscape, to build models that are useful over larger areas.