
Table 4 Summary of variable importance rankings for the snail and environmental data models

From: Open-source environmental data as an alternative to snail surveys to assess schistosomiasis risk in areas approaching elimination

  1. After dividing the snail habitat data (top) and the environmental data (bottom) 75:25 for training and validation, three balanced training datasets were obtained for each by oversampling the minority outcome class. These balancing repetitions were used to assess how stable the model performance metrics and variable importance rankings were under an oversampling approach to creating a balanced training dataset. After tuning each model using ten-fold cross-validation, the final models were run on the reserved testing data to generate model performance metrics and variable importance summaries, with importance indicated by the Mean Decrease in Accuracy (MDA). The ten predictors with the highest MDA in each model were given scores from 10 down to 1, with 10 assigned to the predictor with the highest MDA. Variable scores were then summed across the three models to create a three-model summary score ranging from 0 to 30. A score of 30 is the highest possible (ranked first in all three models), while a score of 0 indicates that the variable was not ranked in the top ten in any of the three models. In this table, the top ~50% of predictors (determined by the three-model summary score) are shown above the dotted line in black, while those in the bottom 50% are shown below the dotted line in gray.
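The scoring scheme described in this note can be illustrated with a minimal Python sketch. It is not the authors' code: the function names, the predictor names (`v0` to `v11`), and the MDA values are hypothetical, and ties in MDA are assumed to be broken by input order, which the note does not specify.

```python
def top_ten_scores(mda):
    """Score the ten predictors with the highest MDA in one model.

    The predictor with the highest MDA gets 10, the next 9, down to 1.
    Predictors outside the top ten get no score from this model.
    """
    ranked = sorted(mda, key=mda.get, reverse=True)[:10]
    return {name: 10 - i for i, name in enumerate(ranked)}


def summary_scores(models):
    """Sum per-model scores into a three-model summary score (0 to 30)."""
    totals = {name: 0 for mda in models for name in mda}
    for mda in models:
        for name, score in top_ten_scores(mda).items():
            totals[name] += score
    return totals


# Hypothetical MDA values for 12 predictors (illustration only).
model_a = {f"v{i}": 12.0 - i for i in range(12)}
model_b = dict(model_a, v0=11.0, v1=12.0)  # v1 outranks v0 in this model
model_c = dict(model_a)

totals = summary_scores([model_a, model_b, model_c])
ranking = sorted(totals, key=totals.get, reverse=True)
```

In this toy input, `v0` is ranked first in two models and second in one, so its summary score is 10 + 9 + 10 = 29, while `v10` and `v11` never reach the top ten and score 0.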