Lecture 13: Cox model
1 Cox Proportional Hazards Model
The Cox proportional hazards model is one of the most widely used statistical models. It is a semiparametric model for the hazard rate: we avoid specifying a parametric form for the baseline, time-varying hazard rate, while specifying a parametric model for the influence of covariates on the hazard rate: \[\lambda(t \mid \mathbf{z}) = \lambda_0(t) \exp(\mathbf{z}^T \boldsymbol{\beta})\]
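As a small numerical illustration (all coefficients and covariates below are made up), the proportional-hazards form implies that the hazard ratio between two units is \(\exp((\mathbf{z}_1 - \mathbf{z}_2)^T \boldsymbol{\beta})\), free of both \(t\) and \(\lambda_0\):

```python
import numpy as np

# Illustrative values only: beta and the covariate vectors are hypothetical.
beta = np.array([0.5, -1.0])
z1 = np.array([1.0, 0.0])
z2 = np.array([0.0, 0.0])

def hazard(lambda0_t, z, beta):
    """Cox model hazard: lambda(t | z) = lambda0(t) * exp(z^T beta)."""
    return lambda0_t * np.exp(z @ beta)

# The ratio of two units' hazards does not depend on t or on lambda0:
for lam0 in (0.1, 1.0, 7.3):
    ratio = hazard(lam0, z1, beta) / hazard(lam0, z2, beta)
    assert np.isclose(ratio, np.exp((z1 - z2) @ beta))
```

This is the sense in which the covariate effect is parametric while the baseline hazard is left unspecified.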
In the following section we’ll derive the likelihood for the model in a similar way to our NPMLE derivation of the hazard rate.
1.1 Cox model likelihood derivation and the Breslow estimator without ties
Suppose we have the standard survival-analysis triplet of observable random variables for each unit \(i\) under study: \[\{(T_i = \min(X_i, C_i), \Delta_i = \mathbbm{1}\left(X_i \leq C_i\right), \mathbf{z}_i), i = 1, \dots, n\}\] where \(X_i\) is the absolutely continuous time to failure for unit \(i\), \(C_i\) is the absolutely continuous time to censoring, and \(\mathbf{z}_i\) is a length-\(p\) vector of time-invariant covariates that we are conditioning on.
As stated above, we assume that the hazard function for the distribution of \(X_i\) is: \[\lambda_i(t) = \lambda_0(t) \exp(\mathbf{z}_i^T \boldsymbol{\beta})\] and we’ll leave the function \(\lambda_0(t)\) unspecified. As in our lecture on the Kaplan-Meier estimator, we’ll derive an estimator for \(\lambda_0(t)\) at the event times \(\{t_i, i = 1, \dots, n\}\). Let \(\Lambda(t)\) be the right-continuous-with-left-hand-limits (càdlàg for short, from the French continue à droite, limite à gauche) cumulative hazard function with mass points at \(t_i\). Note that \(\lambda(t_i) = \Lambda(t_i) - \Lambda(t_i-)\), and we define the integral \[\int_B F(t) \mathrm{d}\Lambda(t) = \sum_{i \mid t_i \in B} F(t_i) \lambda(t_i).\] This is a nonparametric estimator because as the data grow, so too does the dimension of the parameter space. This is akin to the Theoretical note in (Klein, Moeschberger, et al. 2003), and an exercise in (Aalen 1988).
The joint likelihood for the model for set of observed data \(\{(t_i, \delta_i, \mathbf{z}_i), i = 1, \dots, n\}\), which is assumed to have no ties in event times, is: \[\begin{align} L(\lambda_0, \boldsymbol{\beta}) = \prod_{i=1}^n \left(\lambda_0(t_i) \exp(\mathbf{z}_i^T \boldsymbol{\beta})\right)^{\delta_i} \exp(-\exp(\mathbf{z}_i^T \boldsymbol{\beta})\textstyle\int_0^{t_i} \lambda_0(u) du). \end{align} \tag{1}\] Recall the definition of \(Y_i(u)\): \[Y_i(u) = \mathbbm{1}\left(t_i \geq u\right).\] The function is left continuous with a jump from 1 to 0 at \(t_i\). Then we can rewrite the model as \[\begin{align} L(\lambda_0, \boldsymbol{\beta}) & = \prod_{i=1}^n \left(\lambda_0(t_i) \exp(\mathbf{z}_i^T \boldsymbol{\beta})\right)^{\delta_i} \exp(-\exp(\mathbf{z}_i^T \boldsymbol{\beta})\textstyle\int_0^\infty Y_i(u) \lambda_0(u) du) \\ & = \left(\prod_{i=1}^n \left(\lambda_0(t_i) \exp(\mathbf{z}_i^T \boldsymbol{\beta})\right)^{\delta_i}\right)\exp(-\textstyle\int_0^\infty \sum_{i=1}^n \exp(\mathbf{z}_i^T \boldsymbol{\beta})Y_i(u) \lambda_0(u) du) \\ & = \left(\prod_{i=1}^n \left(\lambda_0(t_i) \exp(\mathbf{z}_i^T \boldsymbol{\beta})\right)^{\delta_i}\right)\exp(-\textstyle\sum_{j=1}^n \left(\sum_{i=1}^n \exp(\mathbf{z}_i^T \boldsymbol{\beta})Y_i(t_j)\right)\lambda_0(t_j)) \end{align} \tag{2}\] Let’s fix a value of \(\boldsymbol{\beta}\) and compute the NPMLE for \(\lambda_0(t_j)\). 
The log-likelihood for \(\lambda_0\) is \[\begin{align*} \ell(\lambda_0, \boldsymbol{\beta}) & = \sum_{i=1}^n \delta_i \log\left(\lambda_0(t_i) \exp(\mathbf{z}_i^T \boldsymbol{\beta})\right)-\textstyle\sum_{j=1}^n \left(\sum_{i=1}^n \exp(\mathbf{z}_i^T \boldsymbol{\beta})Y_i(t_j)\right)\lambda_0(t_j) \end{align*}\] Differentiating with respect to \(\lambda_0(t_j)\), provided \(\delta_j = 1\), gives \[\begin{align*} \frac{\partial}{\partial \lambda_0(t_j)} \ell(\lambda_0, \boldsymbol{\beta}) = \frac{1}{\lambda_0(t_j)} - \sum_{i=1}^n \exp(\mathbf{z}_i^T \boldsymbol{\beta})Y_i(t_j) \end{align*}\] We note that the second derivative with respect to \(\lambda_0(t_j)\) is strictly negative for all positive \(\lambda_0(t_j)\). Setting this expression equal to zero and solving for \(\hat{\lambda}_0(t_j)\) gives the unique NPMLE \[\begin{align} \hat{\lambda}_0(t_j) = \frac{1}{\sum_{i=1}^n \exp(\mathbf{z}_i^T \boldsymbol{\beta})Y_i(t_j)} \end{align}\] The denominator can be simplified if we define the set \(R(t_j) = \{i \mid Y_i(t_j) = 1, i = 1, \dots, n\}\). This is called the risk set at time \(t_j\). \[\begin{align} \hat{\lambda}_0(t_j) = \frac{1}{\sum_{i \in R(t_j)} \exp(\mathbf{z}_i^T \boldsymbol{\beta})} \end{align}\]
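A minimal numerical sketch of this profile estimate at a fixed \(\boldsymbol{\beta}\); the data values below are hypothetical:

```python
import numpy as np

# Hypothetical observed data: times, event indicators, covariates.
times = np.array([2.0, 3.0, 5.0, 7.0])
events = np.array([1, 0, 1, 1])
Z = np.array([[0.0], [1.0], [1.0], [0.0]])
beta = np.array([0.3])

def baseline_hazard_at(tj, times, Z, beta):
    """NPMLE: lambda0_hat(t_j) = 1 / sum_{i in R(t_j)} exp(z_i^T beta)."""
    at_risk = times >= tj            # membership in the risk set R(t_j)
    return 1.0 / np.exp(Z[at_risk] @ beta).sum()

# Risk set at t = 5.0 consists of the units with t_i >= 5.
lam_hat = baseline_hazard_at(5.0, times, Z, beta)
```

Note that at \(\boldsymbol{\beta} = \mathbf{0}\) this reduces to \(1 / |R(t_j)|\), the jump of the Nelson-Aalen estimator.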
Let’s substitute this NPMLE into the likelihood in Equation 2. Since \(\delta_j \in \{0, 1\}\) and \(\lambda_0\) jumps only at event-of-interest times, the estimator can equivalently be written as \[\begin{align} \hat{\lambda}_0(t_j) = \frac{\delta_j}{\sum_{i \in R(t_j)} \exp(\mathbf{z}_i^T \boldsymbol{\beta})} \end{align}\] Substituting this back into the last line in Equation 2 gives \[\begin{align} L(\boldsymbol{\beta}) & = \left(\prod_{i=1}^n \left(\frac{\exp(\mathbf{z}_i^T \boldsymbol{\beta})}{\sum_{j \in R(t_i)} \exp(\mathbf{z}_j^T \boldsymbol{\beta})}\right)^{\delta_i} \right)\exp\left(-\sum_{j=1}^n \delta_j \frac{\left(\sum_{i=1}^n \exp(\mathbf{z}_i^T \boldsymbol{\beta})Y_i(t_j)\right)}{\sum_{i \in R(t_j)} \exp(\mathbf{z}_i^T \boldsymbol{\beta})}\right)\\ & = \left(\prod_{i=1}^n \left(\frac{\exp(\mathbf{z}_i^T \boldsymbol{\beta})}{\sum_{j \in R(t_i)} \exp(\mathbf{z}_j^T \boldsymbol{\beta})}\right)^{\delta_i} \right)\exp\left(-\sum_{j=1}^n \delta_j\right) \end{align} \tag{3}\] Then we get the final term for the Cox model partial likelihood: \[\begin{align} L(\boldsymbol{\beta}) & \propto \prod_{i=1}^n \left(\frac{\exp(\mathbf{z}_i^T \boldsymbol{\beta})}{\sum_{j \in R(t_i)} \exp(\mathbf{z}_j^T \boldsymbol{\beta})}\right)^{\delta_i}\\ & =\left(\prod_{i \mid \delta_i = 1} \frac{\exp(\mathbf{z}_i^T \boldsymbol{\beta})}{\sum_{j \in R(t_i)} \exp(\mathbf{z}_j^T \boldsymbol{\beta})}\right) \end{align} \tag{4}\] where we note that maximizing Equation 4 will maximize Equation 3.
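The log of the partial likelihood in Equation 4 is straightforward to evaluate directly. A sketch on hypothetical, untied data:

```python
import numpy as np

def cox_log_partial_likelihood(beta, times, events, Z):
    """Log of Equation 4: for each event, z_i^T beta minus the
    log of sum_{j in R(t_i)} exp(z_j^T beta)."""
    eta = Z @ beta                               # linear predictors z_i^T beta
    ll = 0.0
    for i in range(len(times)):
        if events[i] == 1:
            at_risk = times >= times[i]          # risk set R(t_i)
            ll += eta[i] - np.log(np.exp(eta[at_risk]).sum())
    return ll

# Hypothetical data with no tied event times.
times = np.array([1.0, 2.0, 4.0, 6.0])
events = np.array([1, 1, 0, 1])
Z = np.array([[0.5], [-0.5], [1.0], [0.0]])

# At beta = 0 each event term is -log |R(t_i)|; risk-set sizes here are 4, 3, 1.
val = cox_log_partial_likelihood(np.array([0.0]), times, events, Z)
```

In practice this concave function is maximized with Newton-type iterations, exactly as for a parametric likelihood.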
Let \(\hat{\boldsymbol{\beta}}\) be the maximizer of the partial likelihood \(L(\boldsymbol{\beta})\). Then the estimator for the cumulative hazard is defined as: \[\begin{align} \hat{\Lambda}_0(t) = \sum_{t_i \mid \delta_i = 1, t_i \leq t} \frac{1}{\sum_{j \in R(t_i)} \exp(\mathbf{z}_j^T \hat{\boldsymbol{\beta}})} \end{align}\] This is known as the Breslow estimator for the cumulative hazard.
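A sketch of the Breslow estimator on hypothetical data; note that at \(\hat{\boldsymbol{\beta}} = \mathbf{0}\) each event contributes \(1/|R(t_i)|\), recovering the Nelson-Aalen estimator:

```python
import numpy as np

def breslow_cumhaz(t, beta_hat, times, events, Z):
    """Breslow estimator: sum over event times t_i <= t of
    1 / sum_{j in R(t_i)} exp(z_j^T beta_hat)."""
    risk_scores = np.exp(Z @ beta_hat)
    H = 0.0
    for i in range(len(times)):
        if events[i] == 1 and times[i] <= t:
            H += 1.0 / risk_scores[times >= times[i]].sum()
    return H

# Hypothetical data; events at t = 1, 2, 6 with risk-set sizes 4, 3, 1.
times = np.array([1.0, 2.0, 4.0, 6.0])
events = np.array([1, 1, 0, 1])
Z = np.array([[0.5], [-0.5], [1.0], [0.0]])
H5 = breslow_cumhaz(5.0, np.array([0.0]), times, events, Z)  # 1/4 + 1/3
```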
1.2 Alternative view of the Cox model
The final form of the partial likelihood for the Cox model is: \[\begin{align} L(\boldsymbol{\beta}) & \propto \left(\prod_{i \mid \delta_i = 1} \frac{\exp(\mathbf{z}_i^T \boldsymbol{\beta})}{\sum_{j \in R(t_i)} \exp(\mathbf{z}_j^T \boldsymbol{\beta})}\right) \end{align}\] If we multiply the numerator and denominator by \(\lambda_0(t_i)\) (essentially multiplying by \(1\) many times), we get: \[\begin{align} L(\boldsymbol{\beta}) & \propto \left(\prod_{i \mid \delta_i = 1} \frac{\lambda_0(t_i) \exp(\mathbf{z}_i^T \boldsymbol{\beta})}{\sum_{j \in R(t_i)} \lambda_0(t_i) \exp(\mathbf{z}_j^T \boldsymbol{\beta})}\right) \end{align}\] If we write out the hazard function explicitly we get the following probability distribution: \[\begin{align} L(\boldsymbol{\beta}) & \propto \left(\prod_{i \mid \delta_i = 1} \frac{\lim_{dt \searrow 0} P(t_i \leq X_i < t_i + dt \mid X_i \geq t_i)}{\sum_{j \in R(t_i)} \lim_{dt \searrow 0} P(t_i \leq X_j < t_i + dt \mid X_j \geq t_i)}\right) \end{align}\] Under the assumption of independent event times, we can interpret each factor as the probability that it is the \(i^\mathrm{th}\) participant who fails just after time \(t_i\), conditional on exactly one failure occurring just after \(t_i\) among those surviving to \(t_i\). Finally, under noninformative censoring, this is \[\begin{align} L(\boldsymbol{\beta}) & \propto \left(\prod_{i \mid \delta_i = 1} \frac{\lim_{dt \searrow 0} P(t_i \leq X_i < t_i + dt \mid X_i \geq t_i, C_i \geq t_i)}{\sum_{j=1}^n \lim_{dt \searrow 0} P(t_i \leq X_j < t_i + dt \mid X_j \geq t_i, C_j \geq t_i)}\right) \end{align}\] where the sum in the denominator can run over all \(j = 1, \dots, n\) because units with \(C_j < t_i\) or \(X_j < t_i\) are no longer at risk and contribute \(0\).
The no-ties assumption is innocuous for absolutely continuous failure times, since ties then occur with probability zero. In practice, however, times are recorded on a discrete grid and ties do occur in the data.
1.2.1 Data with ties
The simplest approximation when there are ties in the data is to treat the tied times as distinct but mismeasured. Let \(\{\tau_j, j = 1, \dots, r\}\) be the distinct times at which failures occur, so that \(r \leq \sum_i \delta_i\). We can collapse the dataset to \[\{(\tau_j, d_j = \sum_{i\mid t_i = \tau_j} \delta_i, \mathbf{s}_j = \sum_{i\mid t_i = \tau_j}\mathbf{z}_i), j = 1, \dots, r\}.\] We still need access to the risk sets \(R(\tau_j)\) and the individual covariates \(\mathbf{z}_i\). The partial likelihood can be written in terms of the new dataset: \[\begin{align} L(\boldsymbol{\beta}) & \propto \prod_{i=1}^n \left(\frac{\exp(\mathbf{z}_i^T \boldsymbol{\beta})}{\sum_{j \in R(t_i)} \exp(\mathbf{z}_j^T \boldsymbol{\beta})}\right)^{\delta_i}\\ & =\prod_{j=1}^r \prod_{i \mid t_i = \tau_j} \left(\frac{\exp(\mathbf{z}_i^T \boldsymbol{\beta})}{\sum_{k \in R(t_i)} \exp(\mathbf{z}_k^T \boldsymbol{\beta})}\right)\\ & = \prod_{j=1}^r \frac{\exp(\mathbf{s}_j^T \boldsymbol{\beta})}{\left(\sum_{k \in R(\tau_j)} \exp(\mathbf{z}_k^T \boldsymbol{\beta})\right)^{d_j}} \end{align}\] This is Breslow’s approximation for handling ties.
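A sketch of this tied-data (Breslow-style) partial likelihood, in log form, on hypothetical data with a tie at \(t = 2\):

```python
import numpy as np

def breslow_ties_log_pl(beta, times, events, Z):
    """Log of the tied-data partial likelihood: sum over distinct failure
    times tau_j of s_j^T beta - d_j * log(sum_{k in R(tau_j)} exp(z_k^T beta))."""
    eta = Z @ beta
    ll = 0.0
    for tau in np.unique(times[events == 1]):    # distinct failure times tau_j
        dead = (times == tau) & (events == 1)
        d_j = dead.sum()                         # number of tied failures d_j
        s_j = Z[dead].sum(axis=0)                # covariate sum s_j
        ll += s_j @ beta - d_j * np.log(np.exp(eta[times >= tau]).sum())
    return ll

# Hypothetical data: two units tied at t = 2, one failure at t = 3.
times = np.array([2.0, 2.0, 3.0, 5.0])
events = np.array([1, 1, 1, 0])
Z = np.array([[1.0], [0.0], [0.5], [0.0]])

# At beta = 0 the terms are -2 log 4 and -log 2.
val = breslow_ties_log_pl(np.array([0.0]), times, events, Z)
```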
This is not quite right because we’ve ignored the fact that when failure time is continuous, the true failure times are strictly ordered. Let \(\tau_j\) be a time interval in which several units failed, and let the set of units that fail at \(\tau_j\) be \(D(\tau_j)\), defined as: \[D(\tau_j) = \{i \mid t_i = \tau_j, i = 1, \dots, n\}.\] The proper way to handle ties is to average over all possible orderings of the tied failure times.
1.2.1.1 Exact handling of ties
This section follows (Kalbfleisch and Prentice 2002). Let the indices of the set of units failing at time \(\tau_j\) be \(\{1, \dots, d_j\}\). Let \(Q_j\) be the set of permutations of \(D(\tau_j)\) and let \(P = (p_1, \dots, p_{d_j})\) be an element of \(Q_j\). Finally, let the extended at-risk set be \(R(\tau_j, P, m) = R(\tau_j) \setminus \{p_1, \dots, p_{m-1}\}\), where for \(m = 1\) no units are removed. For a single term in the likelihood at time \(\tau_j\), we need to replace the term \[\begin{align} \prod_{i \mid \delta_i = 1, t_i = \tau_j} \frac{\exp(\mathbf{z}_i^T \boldsymbol{\beta})}{\sum_{k \in R(\tau_j)} \exp(\mathbf{z}_k^T \boldsymbol{\beta})} \end{align}\] by its average over all permutations of \(D(\tau_j)\). We have no extra information with which to weight the orderings, so we give them all equal weight \(\frac{1}{d_j!}\). This average is \[\begin{align} \frac{\exp(\mathbf{s}_j^T \boldsymbol{\beta})}{d_j!} \sum_{P \in Q_j} \prod_{m=1}^{d_j} \frac{1}{\sum_{k \in R(\tau_j, P, m)} \exp(\mathbf{z}_k^T \boldsymbol{\beta})} \end{align}\] Not surprisingly, this calculation is prohibitively computationally intensive; the expensive part is the sum over the \(d_j!\) permutations.
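A brute-force sketch of this average over orderings for a single tied time \(\tau_j\), taking the relative risks \(\exp(\mathbf{z}_k^T \boldsymbol{\beta})\) as inputs; the loop over \(d_j!\) permutations is exactly the expensive part:

```python
import itertools
import math
import numpy as np

def exact_tie_term(risks_at_risk, risks_dead):
    """Exact tied-time term: exp(s_j^T beta) / d_j! times the sum over
    permutations P of prod_m 1 / sum_{k in R(tau_j, P, m)} exp(z_k^T beta).

    risks_at_risk: exp(z_k^T beta) for every unit in R(tau_j)
    risks_dead:    the subset of those values for the d_j units failing at tau_j
    """
    total = sum(risks_at_risk)                   # sum over the full risk set R(tau_j)
    numer = np.prod(risks_dead)                  # exp(s_j^T beta)
    acc = 0.0
    for perm in itertools.permutations(risks_dead):
        denom, term = total, 1.0
        for r in perm:
            term /= denom
            denom -= r                           # drop the unit that just failed
        acc += term
    return numer * acc / math.factorial(len(risks_dead))

# d_j = 2 with relative risks a = 1, b = 2 and risk-set total c = 6 (hypothetical).
val = exact_tie_term([1.0, 2.0, 3.0], [1.0, 2.0])
```

For \(d_j = 2\) this reproduces \(\frac{1}{2}\frac{ab}{c(c-a)} + \frac{1}{2}\frac{ab}{c(c-b)}\), matching the worked example below.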
Let’s look at a simple example to see one way we might approximate this average:
Example 1. Suppose we have \(d_j = 2\) for some \(\tau_j\). Let the risk set at \(\tau_j\) be \(R(\tau_j)\), and let \(i\) and \(m\) be the indices such that \(t_i = t_{m} = \tau_j\). Let \(c = \sum_{k \in R(\tau_j)} \exp(\mathbf{z}_k^T \beta)\), \(a = \exp(\mathbf{z}_i^T \beta)\), and \(b = \exp(\mathbf{z}_m^T \beta)\). If the true ordering is \(t_i < t_m\), the terms in the Cox model should be: \[\frac{a}{c}\times\frac{b}{c - a}\] and if \(t_m < t_i\), we would instead have \[\frac{b}{c}\times\frac{a}{c - b}\] Because we have no information about which ordering is more probable, we weight them equally with \(\frac{1}{2}\): \[\frac{1}{2}\frac{a}{c}\times\frac{b}{c - a} + \frac{1}{2}\frac{b}{c}\times\frac{a}{c - b} = \frac{ab}{2c}\left(\frac{1}{c - a} + \frac{1}{c - b}\right)\] We can then approximate this expression by replacing each depleted denominator with the average depletion: \[\frac{ab}{2c}\left(\frac{1}{c - a} + \frac{1}{c - b}\right) \approx \frac{ab}{2c}\left(\frac{2}{c - (a + b)/2}\right) = \frac{ab}{c\left(c - (a + b)/2\right)}\]
Noting that there are \(d_j!\) terms in the sum, Efron’s approximation replaces the sum by: \[\begin{align} \frac{d_j!}{\prod_{m=0}^{d_j-1}\left(\sum_{k \in R(\tau_j)} \exp(\mathbf{z}_k^T \boldsymbol{\beta}) - \frac{m}{d_j} \sum_{k \in D(\tau_j)} \exp(\mathbf{z}_k^T \boldsymbol{\beta})\right)} \end{align}\] so the likelihood contribution at \(\tau_j\) becomes \(\exp(\mathbf{s}_j^T \boldsymbol{\beta}) \big/ \prod_{m=0}^{d_j-1}\left(\sum_{k \in R(\tau_j)} \exp(\mathbf{z}_k^T \boldsymbol{\beta}) - \frac{m}{d_j} \sum_{k \in D(\tau_j)} \exp(\mathbf{z}_k^T \boldsymbol{\beta})\right)\). The intuition is that we approximate the depletion of the risk set by removing, at each step, the average relative risk of the units failing at time \(\tau_j\).
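A sketch comparing Efron’s likelihood term with the exact average from Example 1, using the hypothetical values \(a = 1\), \(b = 2\), \(c = 6\):

```python
def efron_denominator(total, dead_risks):
    """prod_{m=0}^{d_j - 1} (total - (m / d_j) * sum of the dead units' risks)."""
    d_j = len(dead_risks)
    dead_total = sum(dead_risks)
    denom = 1.0
    for m in range(d_j):
        denom *= total - (m / d_j) * dead_total
    return denom

# Hypothetical relative risks: tied units a and b, risk-set total c.
a, b, c = 1.0, 2.0, 6.0
efron = (a * b) / efron_denominator(c, [a, b])   # Efron term at tau_j: ab / (c * (c - (a+b)/2))
exact = 0.5 * (a / c) * (b / (c - a)) + 0.5 * (b / c) * (a / (c - b))
gap = abs(efron - exact)                         # the approximation error is small here
```

With these values the Efron term is \(2/27 \approx 0.0741\) versus the exact \(0.075\), illustrating that the approximation is cheap (a single product over \(d_j\) factors) yet close.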
1.3 Interpretation of the Cox regression model
The key idea of the Cox regression model is that we can maximize the partial likelihood without specifying any form for the baseline hazard rate \(\lambda_0(t)\). Informally, this allows us to use standard asymptotic tests and confidence intervals for \(\boldsymbol{\beta}\) without worrying about the infinite-dimensional (i.e., unknown function) baseline hazard rate. Thus all the asymptotic likelihood techniques we developed for parametric models carry over to the Cox model.