How many suffice? A computational framework for sizing sentinel surveillance networks
International Journal of Health Geographics volume 12, Article number: 56 (2013)
Abstract
Background
Data from surveillance networks help epidemiologists and public health officials detect emerging diseases, conduct outbreak investigations, manage epidemics, and better understand the mechanics of a particular disease. Surveillance networks are used to determine outbreak intensity (i.e., disease burden) and outbreak timing (i.e., the start, peak, and end of the epidemic), as well as outbreak location. Networks can be tuned to preferentially perform these tasks. Given that resources are limited, careful site selection can save costs while minimizing performance loss.
Methods
We study three different site placement algorithms: two algorithms based on the maximal coverage model and one based on the K-median model. The maximal coverage model chooses sites that maximize the total number of people within a specified distance of a site. The K-median model minimizes the sum of the distances from each individual to the individual’s nearest site. Using a ground truth dataset consisting of two million de-identified Medicaid billing records representing eight complete influenza seasons and an evaluation function based on the Huff spatial interaction model, we empirically compare networks against the existing Iowa Department of Public Health influenza-like illness network by simulating the spread of influenza across the state of Iowa.
Results
We show that it is possible to design a network that achieves outbreak intensity performance identical to the status quo network using two fewer sites. We also show that if outbreak timing detection is of primary interest, it is actually possible to create a network that matches the existing network’s performance using 59% fewer sites.
Conclusions
By simulating the spread of influenza across the state of Iowa, we show that our methods are capable of designing networks that perform better than the status quo in terms of both outbreak intensity and timing. Additionally, our results suggest that network size may only play a minimal role in outbreak timing detection. Finally, we show that it may be possible to reduce the size of a surveillance system without affecting the quality of surveillance information produced.
Background
Although facilities location algorithms were originally used to help firms decide where to build new retail outlets or distribution centers [1], these algorithms have also been used for decades to help allocate healthcare resources. In the United States (U.S.), for example, the Emergency Medical Services (EMS) Act of 1973 required that 95% of service requests be served within 30 minutes in a rural area and within 10 minutes in an urban area [2]. More recently, investigators have studied how to locate EMS facilities to aid in large-scale emergencies such as earthquakes or terrorist attacks [3]. In addition to improving responses to healthcare problems, facilities location algorithms have been used to place preventive healthcare services [4] and also to design healthcare systems in developing countries [5]. In previous work, we have shown how to apply facilities location algorithms to design disease surveillance networks [6] and primary stroke center networks [7].
We focus on outpatient influenza surveillance in this paper. The Centers for Disease Control and Prevention (CDC) currently collects different types of influenza-related information [8]. Although these different systems (Table 1) are in some sense complementary, they were not originally developed to optimize detection of influenza cases in any systematic way (i.e., using an explicit optimization criterion, such as maximizing population coverage or minimizing average distance to population elements). Indeed, these systems were in many cases “networks of convenience”.
Surveillance network design has recently been improved using a data-driven approach, incorporating weekly statewide data, hospitalization data, and Google Flu Trends data [9]. Although such methods may provide for better networks in certain instances, many large and populous regions of the world in critical need of surveillance lack the requisite data for such analysis (e.g., poor/untrustworthy records, lack of reasonable influenza activity estimates, lack of Google Flu Trends data in India, China, and all of Africa). Additionally, Google Flu Trends does not track influenza activity perfectly and can differ dramatically from CDC data [10]. Thus, more traditional approaches based on facilities location algorithms that require only population data are still the method of choice for surveillance network design in many regions of the world.
Surveillance networks are used to determine not just outbreak location, but also outbreak intensity (i.e., disease burden) and outbreak timing (i.e., the start, peak, and end of the epidemic). Using networks to detect these factors of disease spread is not new; however, to our knowledge, no other study has examined the implications of designing networks that are tuned to preferentially perform one of these three tasks. Clearly, if one were primarily interested in outbreak intensity or fine-grained outbreak location information, one would want to incorporate as many sites as possible. But given that resources are inevitably limited, careful site selection can save costs while minimizing performance loss; knowing the primary detection task is an important first step in designing more efficient and/or effective networks.
In this paper, we examine site placement for an influenza-like illness (ILI) sentinel surveillance network in the state of Iowa. Iowa is a state in the U.S., roughly 310 miles by 199 miles (500 kilometers by 320 kilometers) in area, populated by approximately three million people. In Iowa, ILI is the major form of outpatient surveillance for influenza activity. ILI is a collection of symptoms that indicate a possible influenza infection (e.g., cough, fever, sore throat). Only laboratory tests can confirm actual influenza. The Iowa Department of Public Health (IDPH) maintained 19 ILI sentinel sites in 2007, composed of primary care facilities and test laboratories selected strictly on a volunteer basis. We analyze and compare several algorithmic surveillance site placement techniques using Iowa as a test environment, specifically in terms of detecting outbreak intensity and outbreak timing. We examine the proportion of cases detected by the different placement methods under explicit probabilistic detection assumptions. We compare these results against the number of cases that would have been detected by the 2007 IDPH network under identical assumptions. We then use statistical correlation as a means to study outbreak timing. We demonstrate how we can dramatically reduce the size of the surveillance network while still successfully detecting the start, peak, and end of the outbreak.
Methods
Online tool
We have developed a web-based calculator that provides a simple user interface for public health officials to determine the best site placement for every state in the U.S. [11]. This web application takes as input a list of possible candidate site locations (by ZIP code; there are 935 in Iowa) and, if the user is extending an existing network, a list of any preselected site locations. The user chooses an algorithm and provides any additional parameters specific to the algorithm as well as the total number of sites required. The application then selects a set of sites and overlays the results on a map. Population coverage statistics are also shown. The calculator is capable of designing networks in every state in the U.S. and currently uses 2010 U.S. Census population data. Iowa population distribution by ZIP code is presented in Figure 1.
The methods in this paper operate at the ZIP code level; each surveillance site is represented by the ZIP code in which it resides. Because the ZIP code is an integral part of a site’s address, we can determine the location (i.e., latitude and longitude), and also population, without geocoding the address; we simply consult a lookup table. More fine-grained population data may certainly be used (e.g., block- or tract-level), but addresses must be geocoded in those cases to determine location and population. Our abstraction does not preclude network design in the case where multiple sites are located in the same ZIP code.
Algorithms
The web-based calculator supports three different network design algorithms: two algorithms based on the maximal coverage model and one based on the K-median facilities location model.
Maximal coverage model
The maximal coverage model (MCM) considers each site as having a fixed coverage radius. For example, given that surveillance sites are typically primary care facilities, it may be reasonable to assume that a site may serve patients who live within a 30-minute driving radius of the site (indeed, this is the radius of coverage we use in our simulations). The resulting optimization problem can be stated informally as follows: given a geographic population distribution and a radius of coverage for each site, we wish to choose the sites that maximize the total number of people within the specified distance of a site [12]. Because the problem is nondeterministic polynomial-time hard (NP-hard) to solve exactly (i.e., it is typically infeasible to compute the optimal solution), we instead implement a greedy approximation algorithm that provides a $(1-\frac{1}{e})$-approximation of the optimal solution [13]. This approximation algorithm guarantees a rapid solution that is “close enough” to optimal for use in practice.
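The greedy heuristic described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the calculator's implementation: the function name `greedy_max_coverage`, the toy populations, and the `covers` mapping (which areas lie within each candidate site's coverage radius) are all hypothetical.

```python
from typing import Dict, List, Set


def greedy_max_coverage(
    population: Dict[str, int],
    covers: Dict[str, Set[str]],
    k: int,
) -> List[str]:
    """Pick up to k sites, each time adding the site that covers the most
    people who are not yet covered by an already chosen site."""
    chosen: List[str] = []
    covered: Set[str] = set()
    for _ in range(k):
        best_site, best_gain = None, 0
        for site, areas in covers.items():
            if site in chosen:
                continue
            # Marginal gain: population of areas this site would newly cover.
            gain = sum(population[a] for a in areas - covered)
            if gain > best_gain:
                best_site, best_gain = site, gain
        if best_site is None:  # no remaining site adds any coverage
            break
        chosen.append(best_site)
        covered |= covers[best_site]
    return chosen


# Hypothetical toy data: four areas and three candidate sites.
population = {"A": 500, "B": 300, "C": 200, "D": 100}
covers = {
    "s1": {"A", "B"},
    "s2": {"B", "C"},
    "s3": {"C", "D"},
}
selected = greedy_max_coverage(population, covers, k=2)
```

In this toy instance the first pick is `s1` (800 people), after which `s3` adds more uncovered people (300) than `s2` (200), which illustrates why the greedy rule scores marginal rather than total coverage.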
Note that the standard MCM formulation places no restrictions on the number of cases a site can serve (or in this case, detect). In the real world, however, surveillance sites cannot detect an infinite number of cases, as each site will have some established natural limit, for example, in terms of the number of patients it can serve. Such site capacity constraints are explicitly modeled in the capacitated MCM formulation where each site is endowed with some intrinsic integer capacity. Each person inside the radius of a site S_{ i } is then uniquely counted against that site’s capacity. Once a site’s capacity is exhausted, it may become appropriate to place another site S_{ j } near S_{ i } notwithstanding overlapping site radii. For example, using the standard noncapacitated MCM formulation, sites are preferentially placed in very dense urban areas, often with several hundred thousand people within a single site’s coverage radius. The capacitated model would instead deploy multiple surveillance sites to high density locations in order to account for each site’s intrinsically limited surveillance capacity.
Figure 2 shows how 19 sites chosen using the noncapacitated MCM compare against the 19 sites used by the IDPH.
K-median model
The K-median model (sometimes also referred to as the P-median model, as in [14]) minimizes the sum of the distances from each individual to their nearest site (a more formal specification is found in [15]). Like the maximal coverage problem, the K-median problem is also NP-hard [16], so an approximation algorithm is once again in order. Here, we use a simple greedy algorithm, although there are more complicated approximation algorithms that can generate slightly better solutions (e.g., [14, 17]).
Note that there is a fundamental difference between the maximal coverage model and the K-median model: the K-median model has no explicit notion of population coverage; hence no radius of coverage is involved. By definition, every person in the selected geography is “covered”, although the “quality” of his or her coverage (in terms of travel distance) will vary. For this reason, our web-based calculator always claims 100% coverage when sites are placed using the K-median model.
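One plausible reading of the simple greedy algorithm for this model is to add sites one at a time, always choosing the candidate that most reduces the population-weighted sum of distances. The sketch below follows that reading; the function name `greedy_k_median`, the three areas on a line, and their populations are hypothetical and serve only as an illustration.

```python
from typing import Dict, List, Tuple


def greedy_k_median(
    population: Dict[str, int],
    dist: Dict[Tuple[str, str], float],
    candidates: List[str],
    k: int,
) -> List[str]:
    """Add sites one at a time, each time choosing the candidate that gives
    the smallest population-weighted sum of distances to the nearest site."""
    chosen: List[str] = []
    nearest = {area: float("inf") for area in population}

    def total_cost(extra: str) -> float:
        # Cost of the network if candidate `extra` were added.
        return sum(
            pop * min(nearest[area], dist[(area, extra)])
            for area, pop in population.items()
        )

    for _ in range(min(k, len(candidates))):
        remaining = [c for c in candidates if c not in chosen]
        best = min(remaining, key=total_cost)
        chosen.append(best)
        for area in population:
            nearest[area] = min(nearest[area], dist[(area, best)])
    return chosen


# Hypothetical toy data: three areas on a line; each area is also a candidate.
position = {"A": 0.0, "B": 1.0, "C": 10.0}
population = {"A": 100, "B": 100, "C": 50}
dist = {(a, c): abs(position[a] - position[c]) for a in position for c in position}
chosen = greedy_k_median(population, dist, candidates=["A", "B", "C"], k=2)
```

Unlike the coverage sketch, no radius appears anywhere: every area contributes to the objective, weighted by how far its residents are from the closest chosen site.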
Validation
We can evaluate these different methods empirically by simulating the spread of influenza across the state of Iowa and calculating the probability of each case being detected by any surveillance site. Because our simulations are based on a historical record of actual influenza-related cases, we can make meaningful comparisons between the performance of algorithmically derived surveillance networks and the existing IDPH network.
Medicaid dataset
We use a dataset consisting of two million de-identified Medicaid billing records representing eight complete influenza seasons from July 2000 to June 2008. Medicaid is a U.S. federal health insurance program for people and families with low incomes. These records comprise all of the Iowa Medicaid records from this time period that contain any one of 30 pre-specified ICD-9 codes that have been previously associated with influenza [18]. Note that we use ICD-9-coded data as a proxy measure for influenza activity because laboratory-based influenza data were not available for the state of Iowa. A look at a seven-day moving average graph of the dataset in Figure 3 clearly shows the well-established seasonal influenza peak that occurs each winter [19].
Each record consists of an anonymous unique patient identifier, the ICD-9 diagnosis billing code, the date the case was recorded, the claim type (I: inpatient, O: outpatient, and M: medical), the patient ZIP code, age, gender, and provider ZIP code (Table 2). The dataset is very complete; of the two million total entries, only 2,500 entries are dropped due to an erroneous or missing field (e.g., a patient ZIP code of 99999). A second influenza-specific subset of the original data can be defined by selecting only three of the original 30 ICD-9 codes that diagnose laboratory-verified influenza (i.e., 487: influenza; 487.1: influenza with other respiratory manifestations; and 487.8: influenza with other manifestations). These three ICD-9 codes constitute approximately 30,000 entries, or about 4,000 per year. When all 30 ICD-9 codes are considered, the disease seems to never disappear (Figure 3); even during the summer, there are several thousand cases. This might be attributed to the fact that many of the 30 ICD-9 codes present in our expanded dataset include codes that represent diseases and symptoms seen year-round (e.g., cough and acute nasopharyngitis).
The current diagnosis billing code standard is ICD-10, which provides for more diagnostic granularity than ICD-9. Although our data do not use this new standard, no significant changes would need to be made to the methods used in this paper for validation; only careful selection of ICD-10 codes that correspond to cases of interest is required.
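A seven-day moving average of daily case counts can be computed as in the following minimal sketch. It assumes a trailing window (each value averages the current day and the six days before it); whether the figure uses a trailing or a centered window is not stated, and the function name `moving_average` and the sample counts are hypothetical.

```python
from typing import List


def moving_average(counts: List[float], window: int = 7) -> List[float]:
    """Trailing moving average; the first value covers days 0..window-1."""
    if window < 1:
        raise ValueError("window must be at least 1")
    averages: List[float] = []
    running = 0.0
    for i, count in enumerate(counts):
        running += count
        if i >= window:
            running -= counts[i - window]  # drop the day leaving the window
        if i >= window - 1:
            averages.append(running / window)
    return averages


# Hypothetical daily case counts for eight consecutive days.
smoothed = moving_average([1, 2, 3, 4, 5, 6, 7, 8], window=7)
```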
Simulation
We treat the Medicaid dataset as a proxy of the record of all ILI cases that occurred in Iowa between 2000 and 2008. The probability of case detection is determined by the Huff model, a probabilistic model often used in geography literature to analyze and understand aggregate consumer behavior [20]. Here, we use the Huff model to determine where people might seek care based on distance to the provider and the provider’s perceived “attractiveness”. More formally, the probability H_{ ij } that case i is detected by surveillance site j is given by
$$H_{ij} = \frac{A_j^{\alpha} D_{ij}^{-\beta}}{\sum_{k=1}^{n} A_k^{\alpha} D_{ik}^{-\beta}},$$
where A_{ j } is the attractiveness of site j, D_{ ij } is the distance from case i to site j, α is the attractiveness enhancement parameter, β is the distance decay parameter, and n is the total number of surveillance sites.
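As an illustration, the sketch below computes these probabilities for a single case, assuming the standard Huff form in which the weight of site j is its attractiveness raised to α divided by its distance raised to β, normalized over all n sites. The function name `huff_probabilities` and the example numbers are hypothetical, and distances are assumed to be strictly positive.

```python
from typing import List


def huff_probabilities(
    attractiveness: List[float],
    distances: List[float],
    alpha: float,
    beta: float,
) -> List[float]:
    """Huff probabilities of one case being seen at each of n sites.

    attractiveness[j] is A_j and distances[j] is D_ij for the case i under
    consideration; all distances must be strictly positive.
    """
    weights = [a ** alpha / d ** beta for a, d in zip(attractiveness, distances)]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical example: two sites, the second less attractive but closer.
probs = huff_probabilities([100.0, 50.0], [10.0, 5.0], alpha=1.0, beta=2.0)
```

With these inputs the weights are 100/10² = 1 and 50/5² = 2, so the nearer site receives two thirds of the probability even though it is half as attractive; raising β strengthens this distance effect, while raising α favors the more attractive site.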
We use the Huff model because it gives us a way of balancing the “attractiveness” of a site against the distance a patient may be from the site. Although we could use the great-circle distance formula (i.e., geodesic distance on the surface of a sphere) to approximate road distance [21], we instead created a driving distance matrix using Microsoft’s Bing Maps API so that our measurements of travel time are as accurate as possible. D_{ ij } is measured as driving distance in miles.
The challenge of properly setting appropriate values for the attractiveness, attractiveness enhancement parameter, and distance decay parameter remains. One solution, and the one adopted in this work, is to estimate the attractiveness of a site from the number of cases seen at that site in the Medicaid dataset. Since we have a comprehensive set of Medicaid cases on which we use the Huff model, we can fit appropriate values of α and β from the dataset. Although a number of parameter estimation methods have been proposed (e.g., [22–26]), we present a method that uses a metaheuristic global optimization algorithm called harmony search (HS) [27] to determine the two parameters. HS has been applied to a variety of problems, including other parameter estimation problems, and it often outperforms other commonly used search algorithms, such as simulated annealing, tabu search, and evolutionary algorithms (e.g., [28–35]). We treat our parameter estimation problem as a maximization problem, where the goal is to select values of α and β that produce the maximal average number of Medicaid cases “correctly” located using the Huff model; a case is “correctly” located if a number selected at random in the range [0,1) is less than the Huff probability, H_{ ij }. Case count is averaged over 50 replicates.
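The fitness being maximized can be sketched as follows. The sketch assumes that each case has already been assigned one Huff probability (here taken to be the probability for the site at which the case was actually recorded, which is an interpretation rather than something stated explicitly), and it uses Python's standard `random` module in place of whatever random number generator was actually used. The function name `average_detected` and the seed are hypothetical.

```python
import random
from typing import List


def average_detected(
    huff_probs: List[float],
    replicates: int = 50,
    seed: int = 0,
) -> float:
    """Average number of cases 'correctly' located over several replicates.

    A case counts as located when a uniform draw from [0, 1) falls below
    its Huff probability.
    """
    rng = random.Random(seed)
    total = 0
    for _ in range(replicates):
        total += sum(1 for p in huff_probs if rng.random() < p)
    return total / replicates


# Hypothetical example: ten cases with assorted Huff probabilities.
fitness = average_detected([0.9, 0.8, 0.5, 0.5, 0.3, 0.2, 0.1, 0.7, 0.6, 0.4])
```

A search procedure would call a function of this kind once per candidate pair (α, β), after recomputing the Huff probabilities for that pair, and keep the pair with the highest average.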
We use an open source Python implementation of HS called pyHarmonySearch [36]. α and β are both allowed to vary in the range (0, 20]. We set max_imp to 100, hms to 20, hmcr to 0.75, par to 0.5, and mpap to 0.25. We ran a total of 20 HS iterations. For the full dataset, the best solution gave us a fitness of 1,032,762.2 cases correctly detected (out of two million total cases) with α = 17.998 and β = 19.769. For the influenza-specific dataset, the best solution had a fitness of 15,141.14 cases (out of 30,000 total cases) with α = 19.114 and β = 19.479.
Results
We simulate influenza spread considering both the entire dataset and the influenza-specific dataset. Because our simulations are stochastic, results are produced by averaging over 50 replicates. Placement algorithms design networks by selecting sites from an IDPH-provided set of 117 candidate sites spread across the state of Iowa. In addition to the MCM and K-median location-allocation models, our analysis considers surveillance networks designed by selecting sites uniformly at random. Results are reported for each network size by averaging over 50 randomly generated networks.
Outbreak intensity
One way of comparing the quality of two different surveillance networks is to compare the accuracy of their respective measures of outbreak intensity: here the percentage of cases correctly detected by each network using the Huff model. In each graph, the performance of the existing IDPH-selected sites is shown as a single data point at n = 19. As seen in Figures 4 and 5, sites generated by the capacitated and noncapacitated MCM (MCM-C and MCM-NC, respectively) tend to perform best, followed closely by the K-median model. Performance improves as network size grows. Unsurprisingly, selecting sites uniformly at random results in worse outbreak intensity detection than preferentially selecting sites.
It seems particularly appropriate to consider the performance of networks of size 19, since this is the number of surveillance sites in the existing IDPH network. At n = 19 for the full dataset, we see that all methods, except K-median and random selection, outperform the existing network. As seen in Figure 4, the existing IDPH network detects approximately 24.2% (±0.02%) of all cases using the full dataset. At n = 19, MCM-C detects approximately 27.4% (±0.02%) of cases, MCM-NC detects approximately 28.5% (±0.02%), K-median detects approximately 22.2% (±0.01%), while a random network detects 13.7% of cases on average (5.2% lower bound, 28.9% upper bound). MCM-NC is capable of more efficient detection than the existing network with only 17 sites. For the influenza-specific dataset, as seen in Figure 5, all three algorithmic site placement methods outperform the existing sites. Here, it only takes 12 sites selected using the K-median model to match the outbreak intensity detection of the existing sites. In other words, in the state of Iowa, a network can be designed that detects outbreak intensity as well as the existing network with two fewer sites when considering the full gamut of possible influenza-related ICD-9 codes. However, if we only consider direct diagnoses of influenza, the network can consist of 37% fewer sites. This practically significant result indicates that preferentially selecting sites can yield more efficient surveillance networks with less overhead cost.
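For reference, the two reductions quoted above follow from the network sizes, taking the 19 sites of the existing network as the baseline:

$$19 - 17 = 2 \text{ fewer sites}, \qquad \frac{19 - 12}{19} \approx 0.37 = 37\% \text{ fewer sites}.$$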
Outbreak timing
In addition to outbreak intensity, a sentinel surveillance network should be able to detect outbreak timing, or the temporal start, peak, and end of a disease season. Intuitively, when attempting to maximize outbreak intensity detection (as well as outbreak location detection), increasing the number of surveillance sites will improve the quality of detection. However, it is not clear that there is an inherent benefit of having more sites when looking at outbreak timing. We would like to explore just how few sites are necessary in order to still accurately detect the timing of a disease season.
A surveillance network will necessarily detect fewer cases than actually occurred among a population; yet, if the surveillance network detects cases temporally in sync with this ground truth, then the disease curve should increase and decrease in proportion with it. We use the Pearson product-moment correlation coefficient (often abbreviated Pearson’s r) to correlate each detected time series with the ground truth dataset in order to quantify outbreak timing detection quality [37]. Correlation coefficients range from -1 to 1. Values above 0.5 and below -0.5 are often interpreted to indicate strong positive and negative correlation, respectively, although these limits are not hard and greatly depend on the context [38]. This method for measuring outbreak timing does not require that we explicitly define the start, peak, or end of a disease season; we simply correlate the simulated disease curves with the ground truth disease curves.
Figures 6 and 7 compare the outbreak timing detection capabilities of the algorithmic placement methods and the existing sites using the full dataset and influenza-specific dataset, respectively. In Figure 6, at n = 19, we see similar outbreak timing performance among all placement methods, with all networks achieving correlation coefficients of at least 0.98 (indicating very strong positive correlation with ground truth). It only takes six algorithmically placed sites in order to detect outbreak timing at least as well as the existing network, while a network containing only two well-placed sites is capable of achieving a 0.9 correlation coefficient. Even networks with as few as one site are able to achieve correlations of at least 0.67. When the set of ICD-9 codes is restricted to the influenza-specific dataset, as in Figure 7, outbreak timing quality is only slightly reduced. It takes 14 sites to match the performance of the existing network, but it only takes six sites to achieve correlation of at least 0.9. These practically significant findings suggest that it may be possible to drastically reduce the size of a network if the metric of primary interest is outbreak timing detection.
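The correlation step can be illustrated with a small, self-contained sketch. The function name `pearson_r` and the weekly counts are hypothetical; the example network detects exactly 20% of the cases in every week, so its curve rises and falls in proportion with the ground truth.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson product-moment correlation between two equal-length series."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("series must have equal length of at least 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return covariance / (spread_x * spread_y)


# Hypothetical weekly case counts: ground truth and a network that sees 20%.
ground_truth = [10, 40, 90, 40, 10]
detected = [2, 8, 18, 8, 2]
r = pearson_r(ground_truth, detected)
```

Because the coefficient is unaffected by rescaling, a network that captures a small but constant share of cases still correlates almost perfectly with the ground truth, which is why few sites can suffice for timing even when they miss most cases.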
Conclusions
Disease surveillance is critical in epidemiological studies and in the realm of public health policy. Using a publicly available web-based surveillance site placement calculator and three different algorithmic surveillance site placement methods, we compared the performance of networks generated by the calculator with the volunteer-based network maintained by the IDPH.
The major contribution of this paper is the exploration of two metrics on which a surveillance network can be optimized: outbreak intensity and outbreak timing. Sites chosen using either MCM variant consistently outperform the baseline IDPH network both in terms of outbreak intensity and timing. Furthermore, we found that preferential selection of sites can yield networks capable of achieving outbreak intensity and timing performance in line with the current IDPH network, requiring, in some cases, only a fraction of the number of sites. We found that, at least in the state of Iowa, the number of sites chosen seems not to matter for outbreak timing detection. This implies that using just a few strategically placed surveillance sites (e.g., Des Moines, Cedar Rapids, Davenport, Sioux City, and Iowa City – the five most populous cities in Iowa) may suffice to reliably and accurately determine the onset, peak, and end of the influenza season in Iowa.
It is important to recognize that although we analyze and compare networks using a dataset of confirmed Medicaid influenza-related cases, network design is accomplished only considering population data. This means that our surveillance network design methods can be used in any location in the world where population data are available.
In practice, surveillance site recruitment, especially in locations where such involvement is voluntary, may prove difficult. This realization opens a new dimension for optimization: cost. Each site brings some inherent cost to the system; the cost may be a monetary value (e.g., incentives), man-hours required for reporting, or some other measure. That is, the real-world optimization problem may actually need to be multidimensional. For example, the maximal coverage model may need to be minimal cost, maximal coverage in practice. This direction for future work requires careful consideration when deriving site costs. Additionally, in areas where surveillance site participation is voluntary, a site selected by the methods presented in this paper may decline or hesitate to join the network. The greedy algorithms used here allow for public health officials to rank site importance since, by definition, the most important sites are selected first. This can allow for an adjustment in resource allocation to incentivize important, but unwilling, sites.
In the future, we will look more closely at the problem of selecting the ICD-9 codes worth considering for validation. Here, we only consider two sets of ICD-9 codes: the entire set of all 30 influenza-related ICD-9 codes provided in our Medicaid dataset and an influenza-specific ICD-9 code subset containing only direct diagnoses of influenza (i.e., 487.x ICD-9 codes). One possible approach is to apply machine learning techniques typically used for feature selection to the problem of finding which ICD-9 codes should be used for validation. We will also examine other states exhibiting different population density and geographic characteristics from Iowa, and, eventually, nationwide and worldwide surveillance networks. Ultimately, our goal is to use computational methods to reliably advise public health officials how many surveillance sites suffice and where to place them in order to meet their specific needs.
There are several limitations of our work. First, it is important to recognize that all surveillance networks have difficulty making conclusions about uncovered areas. Our methods focus primarily on densely populated regions, so less densely populated regions may be left uncovered. Second, this paper focuses on the state of Iowa in the U.S., which is a relatively simple state geographically and geologically. A more geographically or geologically diverse state such as Colorado with its natural east-west Rocky Mountain division may provide different obstacles in site placement. Third, our placement models ignore demographics, so it is possible the resulting networks are sampling some demographics more than others or possibly missing some demographics altogether. Moreover, the Medicaid data used in our simulations represent a particular demographic of Iowa: people and families with low incomes (these data, however, are complete with respect to that particular demographic). Fourth, all calculations consider the population of a ZIP code to be concentrated at the centroid of that ZIP code. In reality, populations are usually distributed in some fashion across the entire ZIP code region. Additionally, while our simplifying site-as-ZIP code abstraction may be reasonable for less densely populated regions, such as Iowa, it may prove to be problematic in more densely populated regions. A final limitation to our work is that we use administrative data (ICD-9 codes) as a proxy for influenza activity. We would rather use actual ILI data or laboratory-based data, but these data sources were not available across the state.
Our web-based tool can aid public health officials in designing an effective disease surveillance system. We studied two metrics by which a surveillance network may be evaluated: outbreak intensity and outbreak timing. By simulating the spread of influenza across the state of Iowa, we show that the sites our tool selects perform better than the status quo in terms of both metrics. Additionally, we offer new insights that suggest that network size may only play a minimal role in outbreak timing detection. Finally, we show that it may be possible to reduce the size of a surveillance system without affecting the quality of surveillance information the system is able to produce.
Abbreviations
CDC: Centers for Disease Control and Prevention
EMS: Emergency medical services
HS: Harmony search
IDPH: Iowa Department of Public Health
ILI: Influenza-like illness
MCM: Maximal coverage model
MCM-C: Capacitated maximal coverage model
MCM-NC: Noncapacitated maximal coverage model
NP-hard: Nondeterministic polynomial-time hard
Pearson’s r: Pearson’s product-moment correlation coefficient
U.S.: United States.
References
1. Cooper L: Location-allocation problems. Oper Res. 1963, 11 (3): 331-343. 10.1287/opre.11.3.331. [http://pubsonline.informs.org/doi/abs/10.1287/opre.11.3.331]
2. Daskin MS, Dean LK: Location of health care facilities. Operations Research and Health Care Volume 70. Edited by: Brandeau ML, Sainfort F, Pierskalla WP. 2005, US: Springer, 43-76. [http://link.springer.com/chapter/10.1007%2F1402080662_3]
3. Jia H, Ordóñez F, Dessouky M: A modeling framework for facility location of medical services for large-scale emergencies. IIE Trans. 2007, 39: 41-55. 10.1080/07408170500539113. [http://www.tandfonline.com/doi/abs/10.1080/07408170500539113]
4. Verter V, Lapierre SD: Location of preventive health care facilities. Ann Oper Res. 2002, 110: 123-132. 10.1023/A:1020767501233. [http://link.springer.com/article/10.1023%2FA%3A1020767501233]
5. Rahman Su, Smith DK: Use of location-allocation models in health service development planning in developing nations. Eur J Oper Res. 2000, 123 (3): 437-452. 10.1016/S0377-2217(99)00289-1. [http://linkinghub.elsevier.com/retrieve/pii/S0377221799002891]
6. Polgreen PM, Chen Z, Segre AM, Harris ML, Pentella MA, Rushton G: Optimizing influenza sentinel surveillance at the state level. Am J Epidemiol. 2009, 170 (10): 1300-1306. 10.1093/aje/kwp270. [http://aje.oxfordjournals.org/content/170/10/1300.short]
7. Leira EC, Fairchild G, Segre AM, Rushton G, Froehler MT, Polgreen PM: Primary stroke centers should be located using maximal coverage models for optimal access. Stroke. 2012, 43 (9): 2417-2422. 10.1161/STROKEAHA.112.653394. [http://stroke.ahajournals.org/content/43/9/2417]
8. CDC: Overview of influenza surveillance in the United States. 2012, [http://www.cdc.gov/flu/weekly/overview.htm]
9. Scarpino SV, Dimitrov NB, Meyers LA: Optimizing provider recruitment for influenza surveillance networks. PLoS Comput Biol. 2012, 8 (4): e1002472. 10.1371/journal.pcbi.1002472. [http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1002472]
10. Butler D: When Google got flu wrong. Nature. 2013, 494 (7436): 155-156. 10.1038/494155a. [http://www.nature.com/news/whengooglegotfluwrong1.12413]
11. University of Iowa site placement calculator. [http://compepi.cs.uiowa.edu/~gcfairch/siteplacement/]
12. Church R, ReVelle C: The maximal covering location problem. Pap Reg Sci. 1974, 32: 101-118. 10.1007/BF01942293. [http://onlinelibrary.wiley.com/doi/10.1111/j.14355597.1974.tb00902.x/abstract]
13. Cornuejols G, Fisher ML, Nemhauser GL: Location of bank accounts to optimize float: an analytic study of exact and approximate algorithms. Manage Sci. 1977, 23 (8): 789-810. 10.1287/mnsc.23.8.789. [http://www.jstor.org/stable/10.2307/2630709]
14. Densham PJ, Rushton G: A more efficient heuristic for solving large p-median problems. Pap Reg Sci. 1992, 71 (3): 307-329. 10.1007/BF01434270. [http://www.springerlink.com/content/q1g128582lx06862/]
15. Church R: Location modelling and GIS. Geographical Information Systems: Principles, Techniques, Management and Applications, 2 edition. Edited by: Longley PA, Goodchild MF, Maguire DJ, Rhind DW. 1999, New York: John Wiley & Sons, Inc., 293-303. [http://www.amazon.com/dp/0471321826]
16. Kariv O, Hakimi SL: An algorithmic approach to network location problems. II: The p-medians. SIAM J Appl Math. 1979, 37 (3): 539-560. 10.1137/0137041. [http://www.jstor.org/stable/10.2307/2100911]
17. Teitz MB, Bart P: Heuristic methods for estimating the generalized vertex median of a weighted graph. Oper Res. 1968, 16 (5): 955-961. 10.1287/opre.16.5.955. [http://www.jstor.org/stable/10.2307/168488]
18. Marsden-Haug N, Foster VB, Gould PL, Elbert E, Wang H, Pavlin JA: Code-based syndromic surveillance for influenza-like illness by international classification of diseases, ninth revision. Emerg Infect Dis. 2007, 13 (2): 207-216. 10.3201/eid1302.060557. [http://wwwnc.cdc.gov/eid/article/13/2/060557_article.htm]
19. Lofgren E, Fefferman NH, Naumov YN, Gorski J, Naumova EN: Influenza seasonality: underlying causes and modeling theories. J Virol. 2007, 81 (11): 5429-5436. 10.1128/JVI.01680-06. [http://jvi.asm.org/content/81/11/5429.short]
 20.
Huff DL: A probabilistic analysis of shopping center trade areas. Land Econ. 1963, 39: 8190. 10.2307/3144521. [http://www.jstor.org/stable/10.2307/3144521] []
 21.
Boscoe FP, Henry KA, Zdeb MS: A nationwide comparison of driving distance versus straightline distance to hospitals. Prof Geographer. 2012, 64 (2): 188196. 10.1080/00330124.2011.583586. [http://www.tandfonline.com/doi/abs/10.1080/00330124.2011.583586] []
 22.
Batty M, Mackie S: The calibration of gravity, entropy, and related models of spatial interaction. Environ Plann. 1972, 4 (2): 205233. 10.1068/a040205. [http://envplan.com/abstract.cgi?id=a040205] []
 23.
Haines GHJr, Simon LS, Alexis M: Maximum likelihood estimation of centralcity food trading areas. J Mark Res. 1972, 9 (2): 154159. 10.2307/3149948. [http://www.jstor.org/stable/3149948] []
 24.
Hodgson MJ: Toward more realistic allocation in location  allocation models: an interaction approach. Environ Plann A. 1978, 10 (11): 12731285. 10.1068/a101273. [http://envplan.com/abstract.cgi?id=a101273] []
 25.
Haining RP: Estimating spatialinteraction models. Environ Plann A. 1978, 10 (3): 305320. 10.1068/a100305. [http://www.envplan.com/abstract.cgi?id=a100305] []
 26.
Nakanishi M, Cooper LG: Parameter estimation for a multiplicative competitive interaction model: least squares approach. J Mark Res. 1974, 11 (3): 303311. 10.2307/3151146. [http://www.jstor.org/stable/3151146] []
 27.
Geem ZW, Kim JH, Loganathan GV: A new heuristic optimization algorithm: harmony search. Simulation. 2001, 76 (2): 6068. 10.1177/003754970107600201. [http://sim.sagepub.com/cgi/doi/10.1177/003754970107600201] []
 28.
Geem ZW: Optimal cost design of water distribution networks using harmony search. Eng Optimization. 2006, 38 (3): 259277. 10.1080/03052150500467430. [http://www.tandfonline.com/doi/abs/10.1080/03052150500467430] []
 29.
Mahdavi M, Fesanghary M, Damangir E: An improved harmony search algorithm for solving optimization problems. Appl Math Comput. 2007, 188 (2): 15671579. 10.1016/j.amc.2006.11.033. [http://www.sciencedirect.com/science/article/pii/S0096300306015098] []
 30.
Omran MGH, Mahdavi M: Globalbest harmony search. Appl Math Comput. 2008, 198 (2): 643656. 10.1016/j.amc.2007.09.004. [http://www.sciencedirect.com/science/article/pii/S0096300307009320] []
 31.
Kim JH, Geem ZW, Kim ES: Parameter estimation of the nonlinear Muskingum model using harmony search. J Am Water Resour Assoc. 2001, 37 (5): 11311138. 10.1111/j.17521688.2001.tb03627.x. [http://doi.wiley.com/10.1111/j.17521688.2001.tb03627.x] []
 32.
Vasebi A, Fesanghary M, Bathaee SMT: Combined heat and power economic dispatch by harmony search algorithm. Int J Electrical Power Energy Syst. 2007, 29 (10): 713719. 10.1016/j.ijepes.2007.06.006. [http://www.sciencedirect.com/science/article/pii/S0142061507000634] []
 33.
Geem ZW: Harmony search algorithm for solving Sudoku. KnowledgeBased Intelligent Information and Engineering Systems. Edited by: Apolloni B, Howlett RJ, Jain L. 2007, Berlin, Heidelberg: Springer, 371378. [http://link.springer.com/chapter/10.1007/9783540748199_46] []
 34.
Geem ZW, Choi JY: Music composition using harmony search algorithm. Applications of Evolutionary Computing. Edited by: Giacobini M. 2007, Berlin, Heidelberg: Springer, 593600. [http://link.springer.com/chapter/10.1007/9783540718055_65] []
 35.
Geem ZW: Optimal scheduling of multiple dam system using harmony search algorithm. Computational and Ambient Intelligence. Edited by: Sandoval F, Prieto A, Cabestany J, Graña M. 2007, Berlin, Heidelberg: Springer, 316323. [http://link.springer.com/chapter/10.1007/9783540730071_39] []
 36.
pyHarmonySearch. [https://github.com/gfairchild/pyHarmonySearch] []
 37.
Pearson K: Mathematical contributions to the theory of evolution. III. Regression, heredity, and panmixia. Philos Trans R Soc London. Ser A, Containing Papers of a Mathematical or Physical Character. 1896, 187: 253318. 10.1098/rsta.1896.0007. [http://www.jstor.org/stable/90707] []
 38.
Cohen J: The significance of a product moment r. Statistical Power Analysis for the Behavioral Sciences, 2 edition. 1988, Lawrence Erlbaum Associates, Inc., 75107.
Acknowledgements
We would like to thank Pete Damiano at the University of Iowa Public Policy Center for access to the Medicaid data.
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
GF wrote all of the software, drafted the manuscript, and developed the correlation method for quantifying outbreak timing detection. PMP provided medical guidance in understanding influenza’s spread across Iowa as well as the initial research directions based on the maximal coverage model. EF provided statistical guidance and help in understanding and processing data. GR provided geographic guidance and suggested the use of the K-median model as well as the Huff model. AMS helped coordinate project goals and ideas and guided validation efforts. All authors edited the manuscript. All authors read and approved the final manuscript.
Keywords
 Influenza
 Outbreak intensity
 Outbreak timing
 Disease surveillance
 Maximal coverage model
 K-median model
 Huff model
 Harmony search
 Medicaid
 Simulation