Multiple Testing
Martin Kulldorff, Harvard Medical School
13 July 2010

This is a really nice paper, and it is interesting to see how the different methods compare for this data set. The author provides a clear example of the important differences between global clustering tests and cluster detection tests.

For the hypothesis testing part of the kernel intensity function method, it seems that one statistical test is performed for each of the 40x40=1600 grid points (p8). If so, the method does not adjust for the multiple testing inherent in the many cluster locations evaluated. At the alpha=0.05 level, one would expect 0.05*1600=80 'statistically significant' grid points by chance alone, which is somewhat fewer than the 110 that were found according to figure 5. Whether the difference between 110 and 80 is statistically significant is hard to tell, since the 1600 tests are highly correlated when the grid points are close to each other. What it does mean, though, is that with this approach any data set generated under the null hypothesis will have many 'statistically significant' clusters that are not actually statistically significant.

With the spatial scan statistic there are even more potential clusters considered, but the method adjusts for the multiple testing. That is, if the data set was generated under the null hypothesis, the probability of seeing one or more statistically significant clusters anywhere on the map is 0.05. The lack of adjustment for multiple testing explains why the kernel-based method has 'statistically significant' clusters while the other methods do not (p14). While the kernel approach is useful for descriptive purposes and the test based on 'the sum of squared log ratios of kernel intensity functions' (p8) is a nice global clustering test, the method should not be used to evaluate the statistical significance of local clusters.
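The arithmetic above can be illustrated with a small simulation. This is only a sketch under simplifying assumptions: it is neither the paper's kernel method nor the spatial scan statistic as implemented in SaTScan. Each of the 1600 grid points is given an independent standard normal test statistic under the null hypothesis (the real grid tests are correlated), and the adjusted cutoff is a generic Monte Carlo quantile of the map-wide maximum. All names (`null_map`, `max_stat_critical_value`, and so on) are hypothetical.

```python
import random

N_GRID = 40 * 40   # 1600 grid points, as on p8
Z_CRIT = 1.6449    # one-sided 5% critical value of a standard normal


def null_map(rng):
    """One map under the null: an independent N(0, 1) statistic per grid point.

    Assumption: independence. The kernel-based grid tests are correlated.
    """
    return [rng.gauss(0.0, 1.0) for _ in range(N_GRID)]


def max_stat_critical_value(rng, n_reps=499, alpha=0.05):
    """Approximate (1 - alpha) quantile of the map-wide maximum under the null."""
    maxima = sorted(max(null_map(rng)) for _ in range(n_reps))
    return maxima[int((1 - alpha) * n_reps)]


rng = random.Random(2010)
n_maps = 400
adj_crit = max_stat_critical_value(rng)

total_unadjusted = 0   # pointwise 'significant' grid points, summed over maps
maps_unadjusted = 0    # maps with at least one pointwise 'significant' point
maps_adjusted = 0      # maps with at least one point above the adjusted cutoff
for _ in range(n_maps):
    stats = null_map(rng)
    hits = sum(1 for s in stats if s > Z_CRIT)
    total_unadjusted += hits
    maps_unadjusted += hits > 0
    maps_adjusted += max(stats) > adj_crit

mean_false_positives = total_unadjusted / n_maps
fwer_unadjusted = maps_unadjusted / n_maps
fwer_adjusted = maps_adjusted / n_maps

print("mean 'significant' points per null map:", mean_false_positives)
print("share of null maps with any unadjusted hit:", fwer_unadjusted)
print("share of null maps with any adjusted hit:", fwer_adjusted)
```

Under these assumptions the unadjusted rule should flag about 80 grid points per null map, and essentially every null map should contain at least one such point, whereas the cutoff based on the map-wide maximum should flag any point in only about 5% of null maps.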
Competing interests
I developed the spatial scan statistic and the SaTScan software used in this paper.