Lecture 15: Cox model complications
1 Stratified Cox model
Sometimes the Cox model won’t be sufficiently flexible for our modeling needs. The issue is that although the baseline time-varying hazard is an unknown, unspecified function, it is assumed to describe the baseline hazard for all individuals in the population. This often won’t hold for heterogeneous populations. The solution is to use a stratified Cox model, which allows the baseline hazard to depend on a known stratum. For example, suppose the time course of a disease is known to vary by treatment; then we would want a model that can replicate that pattern. For patient \(i\) in treatment group \(j\), let the function \(j[i]: \mathbb{Z}^+ \to \mathbb{Z}^+\) be the map between patient index and treatment group (AKA stratum) membership. This is just a fancy way to say that there is a vector with \(n\) elements where each element records the stratum of the \(i^\mathrm{th}\) individual. Let there be \(J\) treatment groups (i.e., strata). The hazard for individual \(i\) is then \[\lambda_i(t) = \lambda_{j[i]}(t) \exp(\mathbf{z}_i^T \boldsymbol{\beta}).\]
We can use the results from our derivation of the Cox partial likelihood to aid in our derivation of the stratified Cox model’s likelihood.
The joint likelihood for the observed data \(\{(t_{i}, \delta_{i}, \mathbf{z}_i, j[i]), i = 1, \dots, n\}\) is \[\begin{align} L(\{\lambda_j(t)\}, \boldsymbol{\beta}) = \prod_{i=1}^n \left( \lambda_{j[i]}(t_i) \exp(\mathbf{z}_i^T \boldsymbol{\beta})\right)^{\delta_i}\exp \left(-\exp(\mathbf{z}_i^T \boldsymbol{\beta})\int_0^\infty Y_i(u) \lambda_{j[i]}(u) du \right) \end{align}\] We can rewrite this as a product over \(j\) and an inner product over the units \(i\) such that \(j[i] = j\): \[\begin{align} L(\{\lambda_j(t)\}, \boldsymbol{\beta}) = \prod_{j=1}^J \prod_{i \mid j[i] = j} \left( \lambda_{j}(t_i) \exp(\mathbf{z}_i^T \boldsymbol{\beta})\right)^{\delta_i}\exp \left(-\exp(\mathbf{z}_i^T \boldsymbol{\beta})\int_0^\infty Y_i(u) \lambda_{j}(u) du \right) \end{align}\] Let each \(\Lambda_j(t)\) be an unspecified right-continuous non-decreasing step function that jumps at the event times \(\{t_i \mid j[i] = j, \delta_i = 1\}\): \[\begin{align} L(\{\lambda_j(t)\}, \boldsymbol{\beta}) & = \left(\prod_{j=1}^J \prod_{i \mid j[i] = j} \left(\lambda_{j}(t_i) \exp(\mathbf{z}_i^T \boldsymbol{\beta})\right)^{\delta_i}\right)\exp(-\sum_{j=1}^J\textstyle\int_0^\infty \sum_{i \mid j[i] = j} \exp(\mathbf{z}_i^T \boldsymbol{\beta})Y_i(u) \lambda_{j}(u) du) \\ & = \left(\prod_{j=1}^J \prod_{i \mid j[i] = j} \left(\lambda_{j}(t_i) \exp(\mathbf{z}_i^T \boldsymbol{\beta})\right)^{\delta_i}\right)\exp(-\textstyle\sum_{j=1}^J\sum_{k \mid j[k] = j, \delta_k = 1} \left(\sum_{i \mid j[i] = j} \exp(\mathbf{z}_i^T \boldsymbol{\beta})Y_i(t_k)\right)\lambda_{j}(t_k)) \end{align} \tag{1}\] Solving the score equations for Equation 1 gives an expression for \(\hat{\lambda}_j(t_k)\) at each event time \(t_k\): \[\begin{align} \hat{\lambda}_j(t_k) = \frac{1}{\sum_{i \in R(t_k) \mid j[i] = j} \exp(\mathbf{z}_i^T \boldsymbol{\beta})} \end{align}\] Plugging this back into the equations above gives \[\begin{align} L(\boldsymbol{\beta}) = \prod_{j=1}^J \prod_{i \mid j[i] = j} \left( \frac{\exp(\mathbf{z}_i^T \boldsymbol{\beta})}{\sum_{k \in R(t_i) \mid j[k] = j} \exp(\mathbf{z}_k^T \boldsymbol{\beta})} \right)^{\delta_i}\exp \left(-\sum_{i=1}^n \delta_i \right) \end{align}\] Thus the partial likelihood factors over the strata \(j\); equivalently, the log partial likelihood is additive over \(j\). The likelihood enforces that information about \(\boldsymbol{\beta}\) is shared across all units, but the risk set to which each failure is compared is limited to individuals within the same stratum.
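To make the structure of this likelihood concrete, here is a minimal Python sketch (hypothetical data, a single scalar covariate) that evaluates the stratified log partial likelihood. Each event is compared only to at-risk units in its own stratum, while a single \(\boldsymbol{\beta}\) is shared across strata.

```python
import math

def stratified_cox_logpl(data, beta):
    """Log partial likelihood for a stratified Cox model.

    data: list of (t, delta, z, stratum) tuples; beta: scalar coefficient.
    Each event contributes z_i*beta minus the log of the within-stratum
    risk-set sum of exp(z_k*beta).
    """
    logpl = 0.0
    for t_i, delta_i, z_i, s_i in data:
        if delta_i == 0:
            continue  # censored observations contribute no factor
        # risk set: same stratum, still at risk at t_i (t_k >= t_i)
        denom = sum(math.exp(z_k * beta)
                    for t_k, _, z_k, s_k in data
                    if s_k == s_i and t_k >= t_i)
        logpl += z_i * beta - math.log(denom)
    return logpl

# hypothetical data: two strata, events marked by delta = 1
data = [(2.0, 1, 0.5, 1), (3.0, 0, -0.2, 1),
        (1.5, 1, 1.0, 2), (4.0, 1, 0.0, 2)]
print(stratified_cox_logpl(data, beta=0.3))
```

With \(\beta = 0\) each event contributes minus the log of its within-stratum risk-set size, which is a quick sanity check on the risk-set logic.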
2 Including time-dependent variables in the Cox model
Let’s say we now have a variable \(W_i(t)\) that changes with \(t\) and we wish to include this variable in our Cox model. There is no mathematical reason we cannot include this variable in our model. The hazard for an individual is now: \[\lambda_i(t) = \lambda_0(t) \exp(\mathbf{z}_i^T \boldsymbol{\beta} + \gamma W_i(t))\] where \(\gamma\) is the coefficient for the time-varying covariate. The first thing to note is that this is no longer a proportional hazards model, because \(\lambda_i(t)/ \lambda_0(t)\) is not constant in time. Instead: \[\lambda_i(t)/ \lambda_0(t) = \exp(\mathbf{z}_i^T \boldsymbol{\beta} + \gamma W_i(t)).\]
Now we construct the partial likelihood. \[\begin{align} PL(\boldsymbol{\beta}) & = \prod_{i=1}^n \left(\frac{\exp(\mathbf{z}_i^T \boldsymbol{\beta} + \gamma W_i(t_i))}{\sum_{j \in R(t_i)} \exp(\mathbf{z}_j^T \boldsymbol{\beta} + \gamma W_j(t_i))}\right)^{\delta_i} \end{align}\] The complication arises in the denominator, where we need to know the values of \(W_j\) at time \(t_i\).
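As a sketch of what this partial likelihood demands of the data, the following Python snippet (hypothetical data, with each \(W_j\) represented as a callable so it can be evaluated anywhere) computes it directly. The denominator evaluates every at-risk subject's \(W_j\) at the failure time \(t_i\), which is exactly the complication just noted.

```python
import math

def tv_cox_logpl(data, beta, gamma):
    """Log partial likelihood with a time-varying covariate.

    data: list of (t, delta, z, W) tuples, where W is a callable t -> float
    giving the covariate path for that subject.
    """
    logpl = 0.0
    for t_i, delta_i, z_i, W_i in data:
        if delta_i == 0:
            continue
        num = z_i * beta + gamma * W_i(t_i)
        # the denominator needs every at-risk W_j at the failure time t_i
        denom = sum(math.exp(z_j * beta + gamma * W_j(t_i))
                    for t_j, _, z_j, W_j in data if t_j >= t_i)
        logpl += num - math.log(denom)
    return logpl

# hypothetical covariate paths as simple functions of time
data = [(1.0, 1, 0.2, lambda t: 2.0 - t),
        (2.0, 0, -0.1, lambda t: 1.0 + 0.5 * t),
        (3.0, 1, 0.4, lambda t: 0.5)]
print(tv_cox_logpl(data, beta=0.1, gamma=0.2))
```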
This isn’t an issue for variables that are exogenous, or determined outside of the patient’s survival process. One example: \(W_i(t)\) is the dose of a medicine administered as part of a clinical trial. This variable is controlled by investigators and is hypothetically known at every time point for every patient.
But for variables that are related to the patients’ survival processes, we might not know these variables at every point. Let’s say we’re running an influenza vaccine efficacy trial and we are measuring antibody levels at regular visits after administration of vaccines. Our primary outcome is influenza infection. Suppose that patient \(i\) becomes infected on day \(t_i\). What values should we use for \(W_i(t_i)\)? What about \(W_j(t_i)\)?
Let’s say that we have \(W_i(t_i - 1)\) for the \(i^\mathrm{th}\) participant, and we have \(W_j(t_i - 1)\) and \(W_j(t_i + 1)\) for participants \(j \in R(t_i)\). The value for \(W_i(t_i)\) should be \(W_i(t_i - 1)\) because we have no other choice. As for participants \(j \in R(t_i)\), we should use \(W_j(t_i - 1)\), though it may be tempting to use \(W_j(t_i + 1)\). Using values from the future will bias our estimates of \(\gamma\) because they do not accurately represent the relative risk for unit \(i\) at time \(t_i\). In fact, using values from the past will also bias our estimates of \(\gamma\), for the same reason. Solutions to this problem are discussed in Section 2.2.
We can still use the Breslow estimator for the cumulative baseline hazard: \[\begin{align} \hat{\Lambda}_0(t) = \sum_{i \mid t_i \leq t, \delta_i = 1} \frac{1}{\sum_{j \in R(t_i)} \exp(\mathbf{z}_j^T \boldsymbol{\beta} + \gamma W_j(t_i))} \end{align}\]
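A minimal Python sketch of this estimator, using the same hypothetical data layout as above (tuples of `(t, delta, z, W)` with `W` a callable covariate path):

```python
import math

def breslow_cumhaz(data, beta, gamma, t):
    """Breslow estimator of the cumulative baseline hazard at time t.

    Sums, over event times t_i <= t, the reciprocal of the risk-set sum
    of exp(z_j*beta + gamma*W_j(t_i)).
    """
    total = 0.0
    for t_i, delta_i, _, _ in data:
        if delta_i == 1 and t_i <= t:
            denom = sum(math.exp(z_j * beta + gamma * W_j(t_i))
                        for t_j, _, z_j, W_j in data if t_j >= t_i)
            total += 1.0 / denom
    return total

data = [(1.0, 1, 0.2, lambda t: 2.0 - t),
        (2.0, 0, -0.1, lambda t: 1.0 + 0.5 * t),
        (3.0, 1, 0.4, lambda t: 0.5)]
# with beta = gamma = 0 this reduces to the Nelson-Aalen estimator:
# 1/3 (event at t=1, three at risk) + 1 (event at t=3, one at risk)
print(breslow_cumhaz(data, beta=0.0, gamma=0.0, t=3.0))
```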
2.1 Representing time-varying covariates in survival data
The way we represent time-varying covariates in a dataset is somewhat different from what we’re used to with simpler survival data. The reason for this is that every individual has a different number of time-varying covariate values, depending on their time to failure or censoring. Thus a dataset in wide format might look like Table 1.
| id | time | status | age | atime1 | atime2 | atime3 | a1 | a2 | a3 |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 100 | 0 | 45 | 0 | 60 | 90 | 3.8 | 3.4 | 2.9 |
| 2 | 80 | 1 | 65 | 0 | 60 | NA | 2.8 | 2.4 | NA |
An alternative is called the counting process representation. The transformed data are shown in Table 2.
| id | obs | time | age | a | start | stop | status |
|---|---|---|---|---|---|---|---|
| 1 | 1 | 0 | 45 | 3.8 | 0 | 60 | 0 |
| 1 | 2 | 60 | 45 | 3.4 | 60 | 90 | 0 |
| 1 | 3 | 90 | 45 | 2.9 | 90 | 100 | 0 |
| 2 | 1 | 0 | 65 | 2.8 | 0 | 60 | 0 |
| 2 | 2 | 60 | 65 | 2.4 | 60 | 80 | 1 |
These sorts of data can be fitted in the survival package using the following command: `coxph(Surv(start, stop, status) ~ age + a, data = data)`.
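To illustrate the transformation from Table 1 to Table 2, here is a Python sketch; the function name and dict layout are invented for illustration (in R this reshaping is typically done with helpers from the survival package itself).

```python
def to_counting_process(row):
    """Expand one wide-format row (Table 1) into counting-process rows (Table 2).

    row: dict with keys id, time, status, age, atime1..atime3, a1..a3,
    where missing measurements are None.
    """
    atimes = [row[f"atime{k}"] for k in (1, 2, 3) if row[f"atime{k}"] is not None]
    avals = [row[f"a{k}"] for k in (1, 2, 3) if row[f"a{k}"] is not None]
    out = []
    for obs, (start, a) in enumerate(zip(atimes, avals), start=1):
        # each interval runs until the next measurement, or until follow-up ends
        stop = atimes[obs] if obs < len(atimes) else row["time"]
        # the event indicator belongs only to the interval ending in failure
        status = row["status"] if stop == row["time"] else 0
        out.append({"id": row["id"], "obs": obs, "start": start, "stop": stop,
                    "age": row["age"], "a": a, "status": status})
    return out

# subject 2 from Table 1
wide = {"id": 2, "time": 80, "status": 1, "age": 65,
        "atime1": 0, "atime2": 60, "atime3": None,
        "a1": 2.8, "a2": 2.4, "a3": None}
for r in to_counting_process(wide):
    print(r)
```

Applied to subject 2, this reproduces the two (start, stop, status) intervals shown in Table 2.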
2.2 Addressing the problem with time-varying covariates
This discussion follows (Tsiatis and Davidian 2004). Imagine we had the following set of random variables for each participant in a clinical trial: \[\{(T_i = \min(X_i, C_i), \delta_i, \mathbf{z}_i, W_i(u) \mid 0 \leq u \leq T_i), i = 1, \dots, n\}.\]
If we observed this process for each individual, we could fit a Cox PH model as above: \[\begin{align} PL(\boldsymbol{\beta}) & = \prod_{i=1}^n \left(\frac{\exp(\mathbf{z}_i^T \boldsymbol{\beta} + \gamma W_i(t_i))}{\sum_{j \in R(t_i)} \exp(\mathbf{z}_j^T \boldsymbol{\beta} + \gamma W_j(t_i))}\right)^{\delta_i} \end{align}\] where we know \(W_i(u)\) exactly for each individual for their entire at risk period.
This is obviously not realistic; typically participants will have intermittent measurements of biomarkers. An example might be CD4 counts, a measure of the concentration of T-cells in the blood, which is an important measurement for those with HIV. A low CD4 count can be an indication that an individual is at risk for AIDS; in fact a CD4 count of below 200 is one of the diagnostic criteria for AIDS.
Furthermore, we typically will not observe any underlying “true” values of \(W_i(u)\), but instead we’ll observe some noisy proxy for that variable, say \(\tilde{W}_i(u)\). The solution to this is to model \(W_i(u)\) as an unknown parameter that we learn about via observations of \(\tilde{W}_i(u)\) at different time points: \[\begin{align*} \tilde{W}_i(u_j) & = W_i(u_j) + \varepsilon_{i}(u_j)\\ \varepsilon_{i}(u_j) & \sim F \\ W_i(u_j) & = \alpha_{i0} + \alpha_{i1} u_j \\ (\alpha_{i0}, \alpha_{i1}) & \sim G \end{align*}\] Then the hazard ratio for \(X_i\) would be: \[\begin{align*} \lambda_i(t) & = \lambda_0(t) \exp(\gamma (\alpha_{i0} + \alpha_{i1} t) + \mathbf{z}_i^T \boldsymbol{\beta}) \end{align*}\] The key limitation for survival analysis is that we need access to \(W_i(u)\) for all \(u\) prior to censoring or failure. This is only possible via a model for \(W_i(u)\). One could imagine a nonparametric model taking the place of the simple linear model employed above. Despite its simplicity, the model above is investigated in (Tsiatis and Davidian 2004).
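A minimal sketch of this idea in Python, done in two stages for simplicity: fit the linear growth model \(W_i(u) = \alpha_{i0} + \alpha_{i1} u\) to each subject's noisy measurements by ordinary least squares, then plug the fitted line into the relative risk. (The function names and data are hypothetical; a two-stage fit like this is a simplification of the joint-modeling approach discussed in Tsiatis and Davidian.)

```python
import math

def fit_line(times, w_obs):
    """Ordinary least squares for (alpha0, alpha1) in W(u) = alpha0 + alpha1*u."""
    n = len(times)
    tbar = sum(times) / n
    wbar = sum(w_obs) / n
    sxx = sum((t - tbar) ** 2 for t in times)
    sxy = sum((t - tbar) * (w - wbar) for t, w in zip(times, w_obs))
    alpha1 = sxy / sxx
    alpha0 = wbar - alpha1 * tbar
    return alpha0, alpha1

def relative_risk(z, alpha0, alpha1, t, beta, gamma):
    """exp(gamma*(alpha0 + alpha1*t) + z*beta), the modeled hazard ratio."""
    return math.exp(gamma * (alpha0 + alpha1 * t) + z * beta)

# hypothetical noisy CD4-like measurements at clinic visits for one subject
a0, a1 = fit_line([0, 30, 60, 90], [500, 460, 430, 380])
print(relative_risk(z=1.0, alpha0=a0, alpha1=a1, t=45, beta=0.2, gamma=-0.001))
```

The fitted line supplies a value of \(W_i(u)\) at any \(u\), which is exactly what the partial likelihood needs; the cost is that the answer is only as good as the assumed linear model for the covariate path.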