The Centers for Disease Control and Prevention (CDC) defines surveillance as the ongoing, systematic collection, analysis, interpretation, and dissemination of data about a health-related event for use in public health action to reduce morbidity and mortality and to improve health [1]. To control and prevent disease, it is important to be vigilant for infectious disease outbreaks and for geographic areas of notably high chronic disease incidence. Indeed, this is a primary aim of public health surveillance, and it explains in part why surveillance plays an integral role in public health practice [2].
When caring for a single patient, the clinician understandably desires as much diagnostic information as possible, and at the highest possible level of precision. Analogously, a public health professional is concerned with diagnosing a public ailment, and should similarly desire all available information with the greatest possible level of precision. Thus it is noteworthy, in the context of public health surveillance, that for reasons of privacy, information is sometimes destroyed or intentionally degraded before being proffered to the analyst.
The argument to protect patient data for reasons of privacy could also be used to shield these data from clinicians. In a clinical setting, we choose not to protect the privacy of the patient by hiding relevant information from the clinician, because it is patently silly to do so. However, a similarly framed argument is often used to obscure population-level data, even when addressing matters of concern to public health.
We argue that one important reason to retain specific information such as precise location is that the "requisite" aggregation for privacy necessarily reduces the power available for outbreak detection. Against this cost, and the other difficulties that aggregation creates for spatial analysis [3], must be weighed its benefit: aggregation does indeed make it more difficult to identify individual patients. This is crucial if the data are made publicly available or if there are other reasons to safeguard privacy, but it also makes an already challenging surveillance task even more difficult.
A growing body of literature addresses statistical protection of privacy and its effects on analysis of surveillance data. Cox has written a useful survey of the general problem of confidentiality within small geographic areas, and the impacts of privacy concerns on public health policy and practice [4].
Armstrong et al. thoroughly discuss the design and implementation of several different approaches to protect privacy in the context of spatial analyses [5]. Importantly, the methods were evaluated both on their impact on analysis and on their effectiveness in preserving confidentiality. Yet the restriction of the quantitative assessment to the Cuzick-Edwards test statistic [6], which is no longer commonly used for spatial surveillance [7, 8], limits the application of this knowledge to a surveillance setting. Further, data with exact locations were not considered for this evaluation.
Waller and colleagues have written extensively on factors that may influence the power of cluster detection methods. For example, they have studied the effects of geographic scale on focused tests of clustering [9, 10], and the importance of cluster location amidst a heterogeneous underlying population [11]. Notably, this group has investigated more than one statistical method, using several different measures for evaluation. However, these studies generally use focused tests of clustering, where a putative exposure source has been identified a priori, whereas surveillance purposes typically require a general test of clustering [12].
Just as we trust clinicians and hospital personnel with sensitive and confidential information, so too, one can argue, we should find trustworthy individuals to handle surveillance data responsibly.
Informatics-based approaches offer a potential compromise in the trade-off between privacy and surveillance utility. For example, development of automated surveillance algorithms might allow sensitive data to be analyzed without human intervention [13]. But in order to evaluate the benefit that such an approach might provide, we must first better understand the costs in performance that the obfuscation or destruction of information may cause.
We reported briefly [14] that there is an undesirable loss of power to detect disease outbreaks when the spatial information provided is degraded from a continuous scale of measurement to a coarser, aggregate level. For example, often only a patient's ZIP code is available to a surveillance system, instead of the patient's listed residential address. Similar results have appeared in contemporaneous work [15], and a recent paper by the same group further confirms this basic premise [16]. However, those studies compared exact locations with only a single level of aggregation.
In our present work, we add to these previous results by considering multiple levels of aggregation. Using synthetic data, we systematically quantify the loss of cluster detection performance as a function of spatial resolution, while limiting confounding influences from a variety of complex factors that affect spatial analyses. We may interpret these results relative to geographic scales we might encounter while surveilling a large metropolitan city. In this way, we attempt to clarify the price one pays for aggregation, and in turn to better inform future policy decision-makers.
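To make the notion of degraded spatial resolution concrete, the following is a minimal sketch, not the procedure used in this study. It assumes a hypothetical study region tiled by square grid cells, with each exact case location replaced by the center of its cell. The function names, the region size, and the cell sizes are all illustrative assumptions. The sketch reports how far, on average, case locations are displaced at each of several aggregation levels.

```python
import math
import random


def aggregate_to_grid(points, cell_size):
    """Replace each (x, y) location with the center of the square cell containing it."""
    aggregated = []
    for x, y in points:
        cx = (math.floor(x / cell_size) + 0.5) * cell_size
        cy = (math.floor(y / cell_size) + 0.5) * cell_size
        aggregated.append((cx, cy))
    return aggregated


def mean_displacement(points, cell_size):
    """Average distance between each exact location and its aggregated location."""
    moved = aggregate_to_grid(points, cell_size)
    total = sum(math.dist(p, q) for p, q in zip(points, moved))
    return total / len(points)


# Hypothetical synthetic cases, uniform over a 16 x 16 region (arbitrary units).
rng = random.Random(0)
cases = [(rng.uniform(0, 16), rng.uniform(0, 16)) for _ in range(1000)]

# Several aggregation levels, from fine to coarse (illustrative values only).
for cell_size in (1, 2, 4, 8):
    print(cell_size, round(mean_displacement(cases, cell_size), 3))
```

The displacement grows with cell size, which is one simple way to see what information is lost at each level. It is not itself a measure of detection power, which depends on the cluster detection method applied to the aggregated data.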