In this section we first present a review of Q-statistics, and extend them to provide global, local and focused tests that account for risk factors and covariates. We next describe an experimental data set for bladder cancer in southeastern Michigan, and apply these new methods to this dataset to illustrate the approach.

### Q-statistics

Jacquez et al. [4] developed global, local and focused tests for case-control clustering of residential histories. Readers unfamiliar with Q-statistics may wish to refer to that original work. We now briefly present these techniques and then extend them to account for risk factors and covariates.

Define the coordinate **u**_{i,t}= {*x*_{i,t}, *y*_{i,t}} to indicate the geographic location of the *i*^{th} case or control at time *t*. Residential histories can then be represented as the set of space-time locations:

**L**_{
i
}= {**u**_{i0}, **u**_{i1}, ..., **u**_{
iT
}} (Equation 2)

This defines individual *i* at location **u**_{i0 }at the beginning of the study (time 0), and moving to location **u**_{i1 }at time *t* = 1. At the end of the study individual *i* may be found at **u**_{iT}. *T* is defined to be the number of unique location observations on all individuals in the study. Define a case-control identifier, *c*_{
i
}, to be

${c}_{i}=\{\begin{array}{ll}1\hfill & \text{ifandonlyif}i\text{isacase}\hfill \\ 0\hfill & \text{otherwise}\hfill \end{array}\left(\text{Equation}3\right)$

Define n_{a} to be the number of cases and n_{b} to be the number of controls. The total number of individuals in the study is then *N* = *n*_{a} + *n*_{b}. Let *k* indicate the number of nearest neighbours to consider when evaluating nearest neighbour relationships and define a nearest neighbour indicator to be:

${\eta}_{i,j,k,t}=\{\begin{array}{ll}1\hfill & \text{ifandonlyif}j\text{isa}k\text{nearestneighboro}fi\text{attime}t\hfill \\ 0\hfill & \text{otherwise}\hfill \end{array}\left(\text{Equation}4\right)$

We define a binary matrix of *k*^{th} nearest neighbour relationships at a given time *t* as:

${\eta}_{k,t}=\left[\begin{array}{ccccc}0& {\eta}_{1,2,k,t}& .& .& {\eta}_{1,N,k,t}\\ {\eta}_{2,1,k,t}& 0& .\\ .& .& .\\ .& .& {\eta}_{N-1,N,k,t}\\ {\eta}_{N,1,k,t}& .& .& {\eta}_{N,N-1,k,t}& 0\end{array}\right]\left(\text{Equation}5\right)$

This matrix enumerates the *k* nearest neighbours for each of the *N* individuals. The entries of this matrix are 1 (indicating that *j* is a *k* nearest neighbour of *i* at time *t*) or 0 (indicating *j* is not a *k* nearest neighbour of *i* at time *t*). It may be asymmetric about the 0 diagonal since nearest neighbour relationships are not necessarily reflexive. Since two individuals cannot occupy the same location, we assume at any time *t* that any individual has *k* unique *k*-nearest neighbours. The row sums thus are equal to *k* (η_{i,•,k,t}= *k*) although the column sums vary depending on the spatial distribution of case control locations at time *t*. The sum of all the elements in the matrix is *Nk*. There exists a 1 × *T* + 1 vector denoting those instants in time when the system is observed and the locations of the individuals are recorded. We can then consider the sequence of *T* nearest neighbour matrices defined by

= {η

_{k,t}∀

*t* = 0..

*T*} (Equation 6)

This defines the sequence of *k* nearest neighbour matrices for each unique temporal observation recorded in the data set, and quantifies how spatial proximity among the *N* individuals changes through time.

Alternative specifications of the proximity metric may be used – the metrics do not have to be nearest neighbour relationships in order for the Q-statistics to work. In this study we prefer to use nearest neighbour relationships because they are invariant under changing population densities, unlike geographic distance and adjacency measures. There also is some evidence that nearest neighbour metrics are more powerful than distance- and adjacency-based measures [28]. Still, one then may be faced with the question of "how many nearest neighbours (*k*) should I consider"? In certain instances one may have prior information that suggests that clusters of a certain size should be expected, and this can serve as a guide to specification of *k*. When prior information is lacking one may wish to explore several levels of *k*. In these instances Tango [29, 30] advocates using the minimum p-value obtained under each level of *k* as the test statistic. In this paper we explore sensitivity by varying the number of nearest neighbors from *k* = 1,..10, 15, 25, 50 and 75. This allows us to evaluate how sensitive cluster location and strength is to the number of nearest neighbours. We then use concordance of results across different levels of *k* within the framework of strong inference to reach conclusions regarding clustering. Those concerned with strict statistical inference may wish to specify a single level of *k a priori* in order to avoid multiple testing, or to employ the min(p) approach of Tango.

Jacquez et al. (4) defined a spatially and temporally local case-control cluster statistic:

${Q}_{i,k,t}={c}_{i}{\displaystyle \sum _{j=1}^{N}{\eta}_{i,j,k,t}}{c}_{j}\left(\text{Equation}7\right)$

This is the count, at time *t*, of the number of *k* nearest neighbors of case *i* that are cases, and not controls. When *i* is a control *Q*_{i,k,t}= 0.

To determine whether there is statistically significant case clustering of residential histories throughout the study area and when the entire study time period is considered (a spatially and temporally global test) we use:

${Q}_{k}={\displaystyle \sum _{t=0}^{T}{Q}_{k,t}}\left(\text{Equation}8\right)$

This is the sum, over all *T*+1 time points, of the temporally local and spatially global statistic ${Q}_{k,t}={\displaystyle \sum _{i=1}^{N}{Q}_{i,k,t}}$. This will tell us whether there is global clustering of residential histories when all of the residential histories over the entire study period are considered simultaneously. Once global clustering is assessed, we next use Jacquez et al.'s *Q*_{i,k}to identify local clusters of residential histories.

${Q}_{i,k}={\displaystyle \sum _{t=0}^{T}{Q}_{i,k,t}}\left(\text{Equation}9\right)$

For the *i*^{th} residential history, this is the sum, over all *T*+1 time points, of the local spatial cluster statistic *Q*_{i,k,t}. It is the number of cases that are *k*-nearest neighbors of the *i*^{th} residential history (a case), summed over all *T*+1 time points. It will be large when cases tend to cluster around the *i*^{th} case through time. This statistic will be evaluated for each of the cases to identify those cases with low p-values. Notice the local statistics are a decomposition of the global statistic into local contributions, and the sum of the local statistics is equal to the global statistic.

We use the statistic *Q*_{F,k}to determine whether bladder cancer cases cluster near the business addresses of industries known to emit bladder cancer carcinogens. This will allow us to evaluate whether there was statistically significant clustering about a given industry *F* (e.g. a specific metal-plating business) over the life of its operation. Suppose that one suspects that the cases may be clustering about a specific focus defined by the business address history:

**L**_{
F
}= {**u**_{F,0}, **u**_{F,1},.., **u**_{F,T}} (Equation 10)

A test for spatial clustering of cases about the focus *F* at a given time *t* is then:

${Q}_{F,k,t}={\displaystyle \sum _{j=1}^{N}{\eta}_{F,j,k,t}}{c}_{j}\left(\text{Equation}11\right)$

Here η_{F,j,k,t}is the nearest neighbor index indicating at time *t* whether the *j*^{th} individual is a *k*^{th} nearest neighbor of the geographic location of the focus defined by **u**_{F,t}. The statistic *Q*_{F,k,t}is the count of the number of *k*-nearest neighbors about the focus at time *t* that are cases. We use this statistic to evaluate clustering about the address histories of specific industries. We sum this statistic over all industries considered and the entire study period to obtain a global measure of focused clustering. We call this statistic ${Q}_{k}^{F}$ and use it to assess whether there is focused clustering when we consider all industries simultaneously.

We employ the duration-weighted versions of the above Q-statistics as presented in the Appendix to Jacquez et al. [4]. Jacquez et al. [4] also defined spatially and temporally local Q-statistics for individuals for evaluating those places of residence and intervals of time for which case clustering occurred. In this publication our focus is on the life course, and we leave further demonstration of the more ephemeral spatially and temporally local statistics for another paper.

### Randomization accounting for covariates and risk factors

In the absence of knowledge of other risk factors and covariates, simple randomization may be used when evaluating the statistical significance of the above statistics. This is accomplished by holding the location histories for the cases and controls constant, and by then sprinkling the case-control identifiers at random over the residential histories. This corresponds to a null hypothesis in which the probability of an individual being declared a case (*c*_{i} = 1) is proportional to the number of cases in the data set, or:

$p({c}_{i}=1|{H}_{0,I})=\frac{{n}_{1}}{{n}_{0}+{n}_{1}}\left(\text{Equation}12\right)$

Here *n*_{1} is the number of cases and *n*_{0} is the number of controls, and *H*_{0,I}indicates a null hypothesis corresponding to Goovaerts and Jacquez's [14] type I neutral model of spatial independence. This null hypothesis assumes the risk of being declared a case is the same over all of the *N* case and controls.

#### Logistic model of the probability of being a case

In order to provide a more realistic null hypothesis we make the probability of being declared a case a function of the covariates and risk factors. Logistic models are used for binary response variables. Let **x** denote the vector of covariates and risk factors. Further, let p=Pr(*c* = 1|**x**) denote the response probability to be modeled, which is the probability of person *i* being a case. The linear logistic model is then:

logit(*p*) = log(*p*/1 - *p*) = α + **β'x** + ε_{
i
} (Equation 13a)

and the equation for predicting the probability of being a case given the vector of covariates and risk factors for the *i*^{th} individual is:

$\widehat{p}({c}_{i}=1|{x}_{i})=\frac{{e}^{\alpha +{\beta}^{\prime}{x}_{i}}}{1+{e}^{\alpha +{\beta}^{\prime}{x}_{i}}}\left(\text{Equation}13\text{b}\right)$

Here the logit function is the natural log of the odds, α is the intercept parameter, and **β** is the vector of regression (slope) coefficients. One then fits the regression model to the vector of covariates and risk factors to calculate the intercept and slope parameters.

#### Randomization accounting for risk factors and covariates

We use approximate randomization to evaluate the probability of a given Q-statistic under the null hypothesis that the probability of being a case is a function of the covariates and risk factors specified in Equation 13b. To evaluate the reference distribution for a given Q-statistic we follow these steps.

Step 1. Calculate statistic (Q*) for the observed data. This may be any one of the global, local or focused Q-statistics calculated from the observations.

Step 2. Sprinkle the case-control identifier *c*_{i} over the residential histories of the participants in a manner consistent with the desired null hypothesis, and conditioned on the observed number of cases. Assume we have *n*_{a} cases, *N* participants and that P_{i} is the probability of the *i*'th participant being a case. Notice the P_{i} are provided by the logistic equation.

Step 2.1 Rescale the P_{i} as follows: ${{P}^{\prime}}_{i}={P}_{i}/{\displaystyle \sum _{i=1}^{N}{P}_{i}}$

Step 2.2 Map the ${{P}^{\prime}}_{i}$ to the interval 0 .. 1. For example, assume we have N = 2 participants, n_{a} = 1 case and that P_{1} = .7 and P_{2} = .8. ${{P}^{\prime}}_{1}$ then maps to the interval [0 .. .7/1.5) and ${{P}^{\prime}}_{2}$ maps to the interval [0.7/1.5 .. 1.5/1.5).

Step 2.3 Allocate a case by drawing a uniform random number from the range [0..1). Set the case identifier equal to 1 (*c*_{i} = 1) where *i* is the identifier corresponding to the study participant whose interval for ${{P}^{\prime}}_{i}$ contains the random number.

Step 2.4 Rescale as shown in Step 2.1 but not including the probability for the participant whose case identifier was assigned in step 2.3.

Step 2.5 Repeat Steps 2.2–2.4 until all of the n_{a} case identifiers are assigned.

Step 2.6 Set the remaining *N* - *n*_{a} case identifiers to 0, these are the controls.

Notice steps 2.1–2.6 result in 1 realization of the distribution of case-control identifiers.

Step 3. Calculate Q for the realization from Step 2.

Step 4. Repeat steps 2–3 a specific number of times (we used 999) accumulating the reference distribution of Q under the null hypothesis.

Step 5. Compare Q* to this reference distribution to evaluate the statistical probability of observing Q* given the known risk factors and covariates.

### Statistical analyses

The Q-statistic for examining space-time clustering was computed using TerraSeer's STIS software [31]. To account for covariates and risk factors in the Q-statistic, we conducted unconditional logistic regression analysis using "proc logistic" in Statistical Analysis System^{®} (version 8.0; SAS Institute, Inc., Cary, NC). The following covariates and risk factors have been summarized by Silverman et al [25] as being significant for bladder cancer and were included in the logistic regression model. The variables used in the model were defined as follows.

**Age**: Participant's age at time of interview

**Gender**: 1 = Male, 2 = Female

**Education**: 1 = <8 years, 2 = 8–11 years, 3 = 12 years or high school graduate, 4 = post high school training, 5 = some college, 6 = college graduate, 7 = postgraduate education

**Race**: 1 = white, 2 = black, 3 = other

**Number of Cigarettes Smoked**: 0 = never smoked, 1 = smoked < 10 cigarettes daily, 2 = smoked 11–20 cigarettes daily, 3 = smoked 21–30 cigarettes daily, 4 = smoked > 30 cigarettes daily

The parameter estimates of the model were used to estimate a probability of being a case for each participant and included in the covariate-adjusted analysis of the Q-statistic in the STIS software.