Most population modelling methods essentially involve some form of redistribution of aggregate census counts using ancillary datasets at finer spatial detail that are known to influence human population distribution. Here, the district level population count data were redistributed at a finer spatial scale using all the available information contained in the datasets described above. Specifically, the Africover LC dataset was first adapted to accommodate the more precise and detailed mapping and locational information on settlements provided by the Landsat-derived settlement polygons, the settlement points and the refugee camps, all described above. Next, LC specific weights were derived based on information on population sizes from the settlement points and from the detailed Kenyan census data where the same LC classes exist (as described in [12]). These calculated densities were then used as weightings to redistribute, by settlement and LC type, the population that was not accounted for by existing settlement population size data. More detail on the full process is given below.
Land cover data refinement
The Africover urban class, which typically overestimates settlement extent [11, 12], was first removed and the surrounding classes were expanded equally to fill the remaining space. The settlement location data and the Landsat-derived settlement polygons were then used to refine the 'urban area', 'rural settlement', 'refugee camp' and 'industrial area' classes. Given the clustered nature of populations across Somalia, ensuring that all known settlements were identified and mapped using information from all available datasets was an important step.
1. Urban areas
The urban class refinement was mainly based on settlement locations classified as 'towns'. Town extents were mapped in three different ways according to available data: (i) the Landsat-derived settlement polygons were mapped when town location points could be mapped unambiguously onto a polygon, (ii) information on population size was used to provide an estimate of town extent, where just a single georeferenced location existed, and (iii) an average town extent of 1.04 km² - which corresponds to the average size of Landsat-derived settlement polygons intersecting with towns - was used for towns where just a single point existed and the population size was unknown. An urban extent map was derived from these town extents.
2. Rural settlements
A method similar to the one above was used to produce a rural settlement layer, based on rural settlement points: (i) the Landsat-derived settlement polygons were used when settlements could unambiguously be mapped onto a polygon, (ii) information on population size was used to provide an estimate of settlement extent where just a single georeferenced location existed, and (iii) a settlement extent of 10,000 m² (i.e. one pixel) was used for settlements where just a single point existed and the population size was unknown. We did not use the average size of rural settlements because only 2.7% of rural settlement points intersected Landsat-derived settlement polygons (in contrast to 69% for towns). This suggests that only the biggest settlements were detected in the Landsat-derived settlement polygon database, so an average size based on these polygons would likely be overestimated.
3. IDP camps
The 'refugee camp' class mapping was mainly based on settlement locations classified as 'IDP camp': (i) information on population size was used to provide an estimate of IDP camp spatial extent, and (ii) an average IDP camp extent of 0.04 km² - which was calculated based on the UN data for the Afgooye corridor - was used for IDP camps where the population size was unknown. IDP camp extents were assembled to form a 'refugee camp' map.
4. Industrial areas
The Landsat-derived industrial area delimitations were used to define an industrial area map.
The urban area, rural settlement, refugee camp and industrial area maps were all overlaid onto the Africover dataset, and the land cover classes beneath them were replaced to produce a refined LC dataset.
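The overlay step can be illustrated with a minimal sketch on a toy grid. The class codes, the grid values and the order in which layers overwrite one another are hypothetical assumptions made for illustration; the text does not state which layer takes precedence where extents overlap.

```python
# Hypothetical class codes; the real Africover legend uses different codes.
URBAN, RURAL, CAMP, INDUSTRIAL = 101, 102, 103, 104


def overlay(base, layers):
    """Return a copy of `base` where cells covered by a mask take that layer's code.

    `layers` is a list of (code, mask) pairs. Later layers overwrite earlier
    ones, which is an assumed precedence rather than one stated in the text.
    """
    refined = [row[:] for row in base]
    for code, mask in layers:
        for i, mask_row in enumerate(mask):
            for j, covered in enumerate(mask_row):
                if covered:
                    refined[i][j] = code
    return refined


# Toy 2 x 3 land cover grid (1 = cropland, 2 = shrubland, purely illustrative).
base = [[1, 1, 2],
        [2, 2, 2]]
urban_mask = [[1, 0, 0],
              [0, 0, 0]]
rural_mask = [[0, 0, 0],
              [0, 1, 0]]

refined = overlay(base, [(RURAL, rural_mask), (URBAN, urban_mask)])
```

The original grid is left untouched, so the refined and unrefined land cover can be compared cell by cell.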
Land cover specific population densities
Relative per LC class population densities were defined for each class of the refined LC dataset. The average population densities in urban areas and rural settlements were calculated based on the Landsat-derived settlement polygons combined with settlement population counts. Average population densities of 18,302 people/km² and 2,990 people/km² were calculated for urban areas and rural settlements, respectively. Typical population density in refugee camps was calculated based on available data for the Afgooye corridor. The Afgooye corridor is divided into 5 sub-areas, for which the UN OCHA estimated the surface area covered by IDP camps [18] and the UNHCR estimated population sizes [17]. From these data, we calculated an average population density of 77,199 people/km² in IDP camps. Zeros were attributed to classes with no human habitation such as water bodies, industrial areas and sand beaches.
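As a small illustration of how a class density can be obtained from paired population and area estimates, the sketch below pools the sub-areas (total population divided by total area). The input numbers are invented, and the text does not say whether the reported averages were pooled in this way or computed as a mean of per-sub-area densities, so the pooling choice is an assumption.

```python
def pooled_density(populations, areas_km2):
    """Return people per km2 as total population over total area."""
    total_area = sum(areas_km2)
    if total_area <= 0:
        raise ValueError("total area must be positive")
    return sum(populations) / total_area


# Invented figures for two sub-areas; not the Afgooye corridor data.
example_populations = [10000, 20000]
example_areas_km2 = [0.2, 0.2]

example_density = pooled_density(example_populations, example_areas_km2)
```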
The average population densities of the remaining LC classes were derived from the Kenyan census data, where significantly more accurate and detailed data on population distribution were available. The Kenyan Enumeration Area (EA) census data [19], which contain 46,034 EAs and have an ASR of just 3.21 km, provided a valuable dataset for calculating more accurate relative per LC class population densities than could be obtained from existing Somalia data. Moreover, all the Africover LC types found in Somalia are also present in Kenya. The average population density of a given LC class was calculated based on EAs in which that LC class covers the majority of pixels, as outlined in [12] and [13]. As shown in [13], the extrapolation of LC specific population densities to neighbouring regions had a limited impact on population distribution model accuracies in Kenya. However, although the relative values of the population densities derived from Kenya are informative, the absolute population density values can vary notably from one country to another. Population densities derived from Kenya are expected to be overestimated because small settlements were not distinguished from major Africover classes in Kenya. Moreover, populations are much more clustered across the whole of Somalia due to the arid environment. We therefore varied the population densities derived from Kenya by scaling them by a sequence of weightings between 0 and 1 (with an increment of 0.01), while keeping the weights derived from Somalia data fixed. We tested the accuracy of the population data produced from each population density table by comparing predicted population with the observed population in towns and settlements from the location dataset with known populations. This provided a test of the partitioning of populations between settlements/towns and other LC classes. The root mean square error (RMSE) was extracted for each population dataset. The LC specific population density table that produced the lowest RMSE was selected for the final population distribution model.
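The selection procedure can be sketched as a simple sweep over the scaling factor. This is a minimal illustration for a single hypothetical district, not the procedure as implemented: the class areas, the Kenya-derived density of 50 people/km² and the synthetic "observed" values are all invented, and allocation in proportion to area times density is an assumption about how the weightings were applied.

```python
import math


def rmse(predicted, observed):
    """Root mean square error between two equal-length sequences."""
    return math.sqrt(
        sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(observed)
    )


def allocate(total, areas, densities):
    """Split `total` across classes in proportion to area x density."""
    weights = [a * d for a, d in zip(areas, densities)]
    denom = sum(weights)
    return [total * w / denom for w in weights]


def best_scale(total, areas, fixed, kenya, observed_settled):
    """Try scales 0.00, 0.01, ..., 1.00 and return (scale, rmse) with lowest RMSE.

    `fixed` holds the settlement-class densities, which stay constant and come
    first in `areas`; `kenya` holds the densities that are scaled.
    """
    best = None
    for step in range(101):
        scale = step / 100
        densities = fixed + [scale * d for d in kenya]
        predicted = allocate(total, areas, densities)[:len(fixed)]
        err = rmse(predicted, observed_settled)
        if best is None or err < best[1]:
            best = (scale, err)
    return best


# One hypothetical district: urban, rural settlement, and one other LC class.
areas = [1.0, 2.0, 100.0]      # km2, invented
fixed = [18302.0, 2990.0]      # people/km2, values reported in the text
kenya = [50.0]                 # people/km2, invented

# Synthetic "observed" settlement populations generated with a scale of 0.25,
# so the sweep should recover 0.25.
observed = allocate(10000, areas, fixed + [0.25 * 50.0])[:2]
chosen_scale, chosen_rmse = best_scale(10000, areas, fixed, kenya, observed)
```

Iterating over integer steps and dividing by 100 avoids the accumulated rounding error that repeatedly adding 0.01 would introduce.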
Population distribution modelling
The per-LC class densities defined above were used as weightings to reallocate populations within Somali districts. Per-pixel population densities were adjusted to match the total population estimated by the UNDP (2005) in the administrative units that they belonged to. An estimate of population in 2010 was produced based on UN rural and urban growth rates for the 2005-2010 period, using the following equation: $P_{2010} = P_{2005} e^{rt}$, where $P_{2010}$ is the required 2010 population within a pixel, $P_{2005}$ is the population within the same pixel at year 2005, $t$ is the number of years between year 2005 and 2010, and $r$ is the average growth rate for rural pixels (2.21%) and urban pixels (4.17%) - these growth rates were taken from the UN World Urbanization Prospects Database, 2007 version [20].
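The two steps described in this paragraph, weighting within a district and projecting forward with exponential growth, can be sketched as follows. The district total, the per-pixel weights and the urban flags are invented for illustration, and the function names are hypothetical; only the two growth rates and the growth equation come from the text.

```python
import math

RURAL_RATE = 0.0221  # average growth rate for rural pixels, from the text
URBAN_RATE = 0.0417  # average growth rate for urban pixels, from the text


def redistribute(district_total, weights):
    """Allocate a district total to its pixels in proportion to their weights."""
    denom = sum(weights)
    if denom == 0:
        raise ValueError("district has no pixels with a non-zero weight")
    return [district_total * w / denom for w in weights]


def project(pop_2005, is_urban, years=5):
    """Apply P_2010 = P_2005 * exp(r * t), choosing r by pixel type."""
    rate = URBAN_RATE if is_urban else RURAL_RATE
    return pop_2005 * math.exp(rate * years)


# Four pixels of one hypothetical district; a zero weight marks an
# uninhabited class such as a water body.
weights = [18302.0, 2990.0, 0.0, 10.0]
urban_flags = [True, False, False, False]

pixels_2005 = redistribute(50000, weights)
pixels_2010 = [project(p, u) for p, u in zip(pixels_2005, urban_flags)]
```

With t = 5, these rates correspond to multipliers of roughly 1.117 for rural pixels and 1.232 for urban pixels.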
Comparison with existing datasets
Accuracy assessment of large-scale population datasets is always challenging because all available geographically specific datasets are used to produce the population dataset, leaving little independent data for testing. However, simple comparison tests with existing gridded population datasets were undertaken. The 2008 version of LandScan [21] and the 2000 beta version of the Global Rural Urban Mapping Project (GRUMP) [22] are the most widely used population datasets, and were acquired and compared to the newly created dataset (AfriPop). Given the differing spatial resolutions, the tests should not be considered as formal accuracy assessments, but merely informative comparisons. To make the comparisons possible, population datasets were adjusted to the same year using UN growth rates [20] and resampled to 100 m spatial resolution. Different methods were used to compare the AfriPop, GRUMP and LandScan datasets. Firstly, predicted population totals per district were compared to the UNDP population estimates for the year 2005. The three population datasets were adjusted to 2005 for this calculation. The AfriPop dataset was unsurprisingly near perfect here, as the population data were matched to UNDP population estimates in the modelling procedure. However, our aim was to observe how far the GRUMP and LandScan datasets were from these most contemporary estimates. Root mean square errors (RMSE) were extracted and differences in population estimates per district were mapped. Secondly, we measured grid-based differences between datasets, as described in Sabesan et al. [23]. Per-pixel absolute differences were mapped and plotted to explore tendencies in these differences. Thirdly, we compared the numbers of people predicted in towns and settlements with known population size. In order to allow the calculation of population predicted in small settlements (smaller than 1 km), the LandScan and GRUMP datasets were resampled to 100 m for this comparison. Pearson correlation coefficients and RMSE between predicted and observed population in towns and settlements were extracted. Finally, we tested the impact of the choice of population dataset on estimates of the population at risk (PAR) of Plasmodium falciparum (Pf) malaria in Somalia. The AfriPop, LandScan and GRUMP datasets were overlaid on the map of Pf malaria endemicity classes for the year 2007 (figure 1) produced by Hay et al. [2] and PAR estimates were extracted.
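The comparison statistics named above can be sketched in a few lines. This is an illustrative sketch only: the sample values and endemicity class labels are invented, and the PAR function assumes the population and endemicity grids have already been aligned and flattened to matching lists of pixels.

```python
import math


def rmse(predicted, observed):
    """Root mean square error between predicted and observed populations."""
    return math.sqrt(
        sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(observed)
    )


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def par_by_class(population, endemicity):
    """Sum pixel populations within each endemicity class."""
    totals = {}
    for pop, cls in zip(population, endemicity):
        totals[cls] = totals.get(cls, 0.0) + pop
    return totals


# Invented settlement populations: observed versus predicted by a dataset.
observed = [1000.0, 2000.0, 3000.0]
predicted = [1100.0, 1900.0, 3200.0]
settlement_rmse = rmse(predicted, observed)
settlement_r = pearson(predicted, observed)

# Invented pixel populations and hypothetical endemicity labels.
par = par_by_class([10.0, 20.0, 30.0], ["low", "high", "low"])
```

Running the same three functions on each population dataset in turn gives directly comparable figures per dataset.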