Missing data lecture 13: Causal inference
Setup
We’re often interested in making causal statements about the processes we study. Suppose we’re interested in whether a year-long jobs training program for disadvantaged workers leads to lower rates of unemployment for these workers in the year following the program. Suppose we had post-program employment records for these workers, as well as records for workers with comparable backgrounds in the same labor markets who had putative access to the training program. This is a simplification of a question that was examined in Ashenfelter (1978). We’d probably compare the observed employment status in the year following the training program \(Y_{i}\) for participants to nonparticipants. Let’s identify the units \(i=1, \dots, n\) in the training program as those with \(W_i = t\), and those not in the program as those with \(W_i = c\). Let \(n_t\) be the number of people in the training program and \(n_c\) the number who aren’t. The simple comparison of mean employment is: \[ \hat{\tau}^\text{dif} = \frac{1}{n_t} \sum_{i=1}^{n} Y_i\ind{W_i = t} - \frac{1}{n_c} \sum_{i=1}^{n} Y_i\ind{W_i = c} \] We’d be tempted to ascribe causality to the comparison: “Participation in a jobs program decreased unemployment by 0.05 percentage points”, but we’ve learned that one should only say something like: “Participation in a jobs program predicted a decrease in unemployment of 0.05 percentage points”. Why can’t we use the causal language we’d prefer to use? When can we do so?
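As a quick sketch, the difference-in-means contrast can be computed directly. The data below are made up for illustration (not Ashenfelter’s):

```python
import numpy as np

# Hypothetical toy data: Y is an employment indicator for the year after the
# program; W marks trainees ('t') and comparison workers ('c').
Y = np.array([1, 0, 1, 1, 0, 1, 0, 0])
W = np.array(['t', 't', 't', 't', 'c', 'c', 'c', 'c'])

def tau_dif(Y, W):
    """Difference in mean outcomes between the trainee and comparison groups."""
    return Y[W == 't'].mean() - Y[W == 'c'].mean()

print(tau_dif(Y, W))  # 0.75 - 0.25 = 0.5
```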
Core principles
Potential outcomes and causal effects
The key idea, presented in Imbens and Rubin (2015), is that the effect of a cause is really a statement about the comparison of two outcomes for a single individual that correspond to different actions taken. In the job training example, there are two actions: participating in a job training program, and not participating in a job training program, which we denote \(W_i \in \{t,c\}\) with \(c\) indicating lack of participation. The outcome would be the employment status the year after each action; \(Y_{i}^{W=t}\) would be the employment status after participating in the job training program, while \(Y_{i}^{W=c}\) would be employment status if one did not participate in the jobs program. These variables are called potential outcomes because they exist without regard to \(W_i\), the treatment actually chosen by the participants. The causal effect of the action is defined as the difference between the potential outcomes, namely employment status if one had participated in the program and employment status if one had not participated in the program: \[ \tau_i = Y_{i}^{W=t} - Y_{i}^{W=c} \]
We can write the different scenarios for each outcome to get the values for \(\tau_i\):
| \(\tau_i\) | \(Y_i^{W=t}\) | \(Y_i^{W=c}\) | Description |
|---|---|---|---|
| 0 | 0 | 0 | Always unemployed, no causal effect |
| 1 | 1 | 0 | Training led to employment |
| -1 | 0 | 1 | Training led to unemployment |
| 0 | 1 | 1 | Always employed, no causal effect |
Note that these causal effects do not depend on which treatment was actually chosen, because they compare two variables that exist regardless of the treatment assignment. The idea of potential outcomes is called the Rubin Causal Model (RCM); counterfactual outcome is another term for these variables. In the RCM, these are considered fixed for each individual.
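A minimal sketch of the table above (Python; the four hypothetical units match the four rows):

```python
import numpy as np

# Hypothetical potential outcomes, one unit per row of the table above.
Y_t = np.array([0, 1, 0, 1])  # Y_i^{W=t}: employment if trained
Y_c = np.array([0, 0, 1, 1])  # Y_i^{W=c}: employment if not trained

tau = Y_t - Y_c  # individual causal effects
print(tau.tolist())  # [0, 1, -1, 0], matching the table rows
```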
Stable Unit Treatment Value Assumption
Crucially, this individual-level causal effect is also not estimable, because we only get to see one outcome, namely the one corresponding to the treatment the individual actually received. The data we would see for one such set of individuals is
| \(W_i\) | \(Y_i^{W=t}\) | \(Y_i^{W=c}\) | \(Y_{i(0)}\) |
|---|---|---|---|
| \(c\) | ? | 0 | 0 |
| \(t\) | 0 | ? | 0 |
| \(t\) | 1 | ? | 1 |
| \(c\) | ? | 1 | 1 |
The Stable Unit Treatment Value Assumption (SUTVA) has two components. The first, consistency, means that we assume the following: \[ P(Y_{i(0)} = Y_i^{W=t} \mid W_i=t) = 1, \, P(Y_{i(0)} = Y_i^{W=c} \mid W_i=c) = 1 \]
This has several implications, all of which are important to chew through:
We assume that everyone receives the same type of treatment in the same way. This would be violated for instance if some job trainees received online instruction while others received in-person classes.
We assume that the act of treatment does not change the observation. This is a well-known problem in psychology studies, namely that when people are aware that they are being watched they will change their behavior, also known as the Hawthorne effect.
The second assumption is the no-interference assumption, which states that the potential outcomes for a single unit do not depend on treatment assignment of other units. This is the sort of assumption that would fail in a vaccine trial that measured, say, symptomatic disease caused by an infectious pathogen in households. In this setting, a vaccine may make participants less infectious, thus a person’s disease status would depend on the vaccination status of those around them.
Both of these assumptions make it possible to define an individual’s observed outcome, \(Y_{i(0)}\) as a function of their treatment assignment \(W_i\) and their potential outcomes alone: \[ Y_{i(0)} = \ind{W_i=t} Y_i^{W=t} + \ind{W_i=c} Y_i^{W=c} \] This implies that each unit is missing an observation, which we’ll denote as \(Y_{i(1)}\), in keeping with our missing data notation: \[ Y_{i(1)} = \ind{W_i=c} Y_i^{W=t} + \ind{W_i=t} Y_i^{W=c} \]
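These two identities can be sketched numerically. Here the full set of potential outcomes is hypothetical, since in practice one member of each pair is never observed:

```python
import numpy as np

# Hypothetical complete potential outcomes; the observed column below matches
# the four-unit table from the previous section.
Y_t = np.array([1, 0, 1, 0])
Y_c = np.array([0, 0, 1, 1])
W = np.array(['c', 't', 't', 'c'])

treated = (W == 't').astype(int)             # indicator 1{W_i = t}
Y_obs = treated * Y_t + (1 - treated) * Y_c  # Y_{i(0)}: the observed outcome
Y_mis = (1 - treated) * Y_t + treated * Y_c  # Y_{i(1)}: the missing outcome
print(Y_obs.tolist(), Y_mis.tolist())
```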
Causal inference as a missing data problem
We can invert these relationships to make the missingness more explicit:
\[ Y_i^{W=t} = \begin{cases} Y_{i(0)} & \text{if } W_i = t\\ Y_{i(1)} & \text{if } W_i = c\\ \end{cases},\quad Y_i^{W=c} = \begin{cases} Y_{i(1)} & \text{if } W_i = t\\ Y_{i(0)} & \text{if } W_i = c\\ \end{cases} \] Thus, if we wanted to predict individual causal effects, we would need to impute the missing observation depending on which treatment group each unit was assigned to.
Missingness mechanisms
We know from our missing data material that if we want to make inferences about model parameters under missing data, we need to be in the MCAR or MAR setting. We can think of \(W_i\) as the causal version of \(M_i\). In the missing data realm, we would have \(M_{it}\) and \(M_{ic}\), with \(1\)s representing missingness of \(Y_{it}\) or \(Y_{ic}\), respectively. These two indicators are redundant for causal inference, however, because they are constrained so that \(M_{it} + M_{ic} = 1\) for every \(i\). Thus, we can write our missing data mechanism as a mechanism for \(\mathbf{W}\), the set of assignments for all \(n\) units under study. Let \(Y_{(0)}\) be the observed data for all participants, and \(Y_{(1)}\) be the missing values for all units. MCAR would be the following, with a slight change of notation: \[ P(\mathbf{W} = \tilde{w} \mid Y_{(0)} = y_{(0)}, Y_{(1)} = y_{(1)}) = P(\mathbf{W} = \tilde{w} \mid Y_{(0)} = y^\star_{(0)}, Y_{(1)} = y^\star_{(1)}), \forall y_{(0)}, y_{(1)}, y^\star_{(0)}, y^\star_{(1)} \] while MACAR (missing always completely at random, i.e., the same condition holding for every possible assignment vector \(w\)) would be:
\[ P(\mathbf{W} = w \mid Y_{(0)} = y_{(0)}, Y_{(1)} = y_{(1)}) = P(\mathbf{W} = w \mid Y_{(0)} = y^\star_{(0)}, Y_{(1)} = y^\star_{(1)}), \forall w, y_{(0)}, y_{(1)}, y^\star_{(0)}, y^\star_{(1)} \] One way to ensure this is to randomly assign units to treatment. If units are not assigned randomly to treatment, it may be that we can condition on covariates for every individual, arranged into an \(n\times p\) matrix \(\mathbf{X}\), so that
\[ P(\mathbf{W} = \tilde{w} \mid Y_{(0)} = y_{(0)}, Y_{(1)} = y_{(1)}, \mathbf{X}) = P(\mathbf{W} = \tilde{w} \mid Y_{(0)} = y^\star_{(0)}, Y_{(1)} = y^\star_{(1)}, \mathbf{X}), \forall y_{(0)}, y_{(1)}, y^\star_{(0)}, y^\star_{(1)} \] However, if we don’t control the treatment assignment, as in the jobs training program, we likely have MNAR missingness:
\[ P(\mathbf{W} = \tilde{w} \mid Y_{(0)} = \tilde{y}_{(0)}, Y_{(1)} = y_{(1)}, \mathbf{X}) \neq P(\mathbf{W} = \tilde{w} \mid Y_{(0)} = \tilde{y}_{(0)}, Y_{(1)} = y^\star_{(1)}, \mathbf{X}) \text{ for some } y_{(1)} \neq y^\star_{(1)} \]
For instance, given that the job training program is a one-year program, people who decide to select into the program might have fewer employment prospects than the people who don’t select into the program.
Assignment mechanisms
The causal term of art for a missingness mechanism is an “assignment mechanism.”
As we showed above, we can write \(Y_{i(0)}\) and \(Y_{i(1)}\) in terms of the assignment \(W_i\) and the potential outcomes \(Y_i^{W=t}\) and \(Y_i^{W=c}\), so we can write \[ P(\mathbf{W} = w \mid Y_{(0)} = \tilde{y}_{(0)}, Y_{(1)} = y_{(1)}, \mathbf{X} = x) \] as \[ P(\mathbf{W} = w \mid Y^{W=t} = y_{t}, Y^{W=c} = y_{c}, \mathbf{X} = x) \] where we let \(Y^{W=t}\) represent the \(n\)-vector of all unit potential outcomes under treatment and \(Y^{W=c}\) the \(n\)-vector of all unit potential outcomes under control, and the vectors \(y_t\) and \(y_c\) range over the space of binary \(n\)-vectors, \(\{0,1\}^n\).
This probability distribution over vectors \(w \in \{t,c\}^n\) is called the assignment mechanism.
Definition 1 (Assignment mechanism) Let \(P(\mathbf{W} = \tilde{w} \mid \mathbf{X} = \mathbf{x}, Y_{(0)} = y_{(0)}, Y_{(1)} = y_{(1)})\) be a row-exchangeable probability mass function for \(\mathbf{W}\), a length-\(n\) vector of treatment assignments taking values in \(\{t,c\}^n\).
By its definition as a PMF we have
\[ \sum_{w \in \{t,c\}^n} P(\mathbf{W} = w \mid Y^{W=t} = y_{t}, Y^{W=c} = y_{c}, \mathbf{X} = x) = 1 \]
for all \(y_{t}, y_{c}, x\).
Row-exchangeability means that reordering the units (together with their covariates and potential outcomes) doesn’t change the PMF’s value. Note that this doesn’t mean the assignments are independent across units.
The unit assignment probability is the marginal probability for treatment for the \(i^\mathrm{th}\) unit:
\[ p_i(x, y_{t}, y_{c}) = \sum_{w \in \{t,c\}^n \mid w_i = t} P(\mathbf{W} = w \mid Y^{W=t} = y_{t}, Y^{W=c} = y_{c}, \mathbf{X} = x) \]
We can define something called the propensity score:
Definition 2 The propensity score is the average probability of treatment among units with covariate value \(X_i = x\):
\[ e(x) = \frac{1}{N(x)} \sum_{i \mid X_i = x} p_i(x,y_{t},y_{c}) \] where \(N(x)\) is the number of units with \(X_i = x\). If \(N(x) = 0\) we set \(e(x)=0\).
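A sketch of this definition, with hypothetical unit assignment probabilities \(p_i\) and a discrete covariate:

```python
import numpy as np

# Hypothetical unit assignment probabilities p_i and covariate values X_i.
X = np.array([0, 0, 1, 1, 1])
p = np.array([0.2, 0.4, 0.5, 0.5, 0.8])

def e(x):
    """Propensity score: average p_i among units with X_i = x (0 if none)."""
    mask = X == x
    return p[mask].mean() if mask.any() else 0.0

print(e(0), e(1))  # (0.2 + 0.4)/2 and (0.5 + 0.5 + 0.8)/3
```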
A typical assignment mechanism restriction is that of individualistic assignment:
Assumption 1 (Individualistic assignment) The unit assignment probability depends only on unit \(i\)’s potential outcomes and covariates:
\[ P(W_i = t \mid Y^{W=t} = y_{t}, Y^{W=c} = y_{c}, \mathbf{X} = x) = P(W_i = t \mid Y_i^{W=t} = y_{it}, Y_i^{W=c} = y_{ic}, \mathbf{X}_i = x_i) \]
Another common assumption about the assignment mechanism is probabilistic assignment.
Assumption 2 (Probabilistic assignment) All individuals have positive probability of being treated or untreated.
\[ 0 < P(W_i = t \mid Y^{W=t} = y_{t}, Y^{W=c} = y_{c}, \mathbf{X} = x) < 1, \forall x, y_t, y_c \]
Finally, we can define the most important assignment mechanism restriction, which is called unconfoundedness:
Assumption 3 (Unconfounded Assignment) \[ P(\mathbf{W} = w \mid Y^{W=t} = y_{t}, Y^{W=c} = y_{c}, \mathbf{X} = x) = P(\mathbf{W} = w \mid Y^{W=t} = y^\star_{t}, Y^{W=c} = y^\star_{c}, \mathbf{X} = x) \] for all \(w, x, y_t, y_c, y^\star_t, y^\star_c\).
Note that this definition is just MACAR conditional on covariates, but for treatment assignment.
We can relax this assumption in the following way, as shown in Rubin (1978):
Assumption 4 (Ignorable Treatment Assignment) \[ P(\mathbf{W} = w \mid Y_{(0)} = y_{(0)}, Y_{(1)} = y_{(1)}, \mathbf{X} = x) = P(\mathbf{W} = w \mid Y_{(0)} = y_{(0)}, Y_{(1)} = y^\star_{(1)}, \mathbf{X} = x) \] for all \(w, y_{(0)}, y_{(1)}, y^\star_{(1)}, x\).
Note that this is a definition akin to MAR for missingness, where we moved back to missing data notation to highlight the potential dependence on observed outcomes but not unobserved outcomes.
Example 1 (Sequential randomized trial) Suppose we have \(3\) units which we label \(i=1,2,3\), and thus we have \(\mathbf{W} = (W_1, W_2, W_3)^T\). Let \(X_i = i\). In this example let the active treatment, denoted above as \(W_i = t\), be denoted as \(W_i = 1\) and let \(W_i = 0\) denote control. The sample space for \(\mathbf{W}\) has \(2^3\) elements and is: \[ \mathbf{W} \in \left\{ \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 \\ 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 0 \\ 1 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 \\ 1 \\ 0 \end{pmatrix}, \begin{pmatrix} 0 \\ 0 \\ 1 \end{pmatrix}, \begin{pmatrix} 1 \\ 0 \\ 1 \end{pmatrix}, \begin{pmatrix} 0 \\ 1 \\ 1 \end{pmatrix}, \begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix} \right\} \] The sampling routine is as follows:
Draw \(W_1 \sim \text{Bern}(0.5)\)
Draw \(W_2 = 1 - W_1\)
\(W_3\) is the treatment for which the observed causal effect is greatest, breaking ties towards active treatment: \[ W_3 = \begin{cases} 1 & \text{if } Y_1^{W=1} \geq Y_2^{W=0} \text{ and } W_1 = 1 \\ 1 & \text{if } Y_2^{W=1} \geq Y_1^{W=0} \text{ and } W_1 = 0 \\ 0 & \text{if } Y_1^{W=1} < Y_2^{W=0} \text{ and } W_1 = 1 \\ 0 & \text{if } Y_2^{W=1} < Y_1^{W=0} \text{ and } W_1 = 0 \end{cases} \] This is a case in which the treatment assignment is ignorable because it depends only on observed outcomes, not on unobserved outcomes.
Note that the first two steps in the treatment assignment scheme mean that \(P(\mathbf{W} = w \mid Y^{W=1} = y^{W=1}, Y^{W=0} = y^{W=0}) = 0\) for any \(w \in \{0,1\}^3\) with \(w_1 = w_2\).
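A sketch of this assignment mechanism in code (Python; the potential outcomes passed in are hypothetical, and only the ones observable given \(W_1\) actually enter the rule):

```python
import random

# Sequential three-unit trial: W1 is a fair coin, W2 is the opposite arm, and
# W3 goes to whichever arm looked better for the first two units, with ties
# broken toward active treatment.
def assign(Y1_t, Y1_c, Y2_t, Y2_c, rng=random):
    W1 = rng.randint(0, 1)
    W2 = 1 - W1
    if W1 == 1:  # observed: unit 1 under treatment, unit 2 under control
        W3 = int(Y1_t >= Y2_c)
    else:        # observed: unit 2 under treatment, unit 1 under control
        W3 = int(Y2_t >= Y1_c)
    return (W1, W2, W3)

print(assign(1, 0, 1, 0, random.Random(0)))
```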
Definition 3 (Completely randomized experiment (CRE)) For a given number of individuals in the active treatment group, \(n_t\), a completely randomized experiment has the following assignment mechanism:
\[ P(\mathbf{W} = w \mid Y^{W=1} = y^{W=1}, Y^{W=0} = y^{W=0}, \mathbf{X} = x) = \begin{cases} \binom{n}{n_t}^{-1} & \text{if } \sum_{i=1}^n w_i = n_t \\ 0 & \text{otherwise} \end{cases} \]
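One way to sample from a CRE is to randomly permute a fixed vector of \(n_t\) ones and \(n - n_t\) zeros, which makes every assignment with \(\sum_i w_i = n_t\) equally likely (Python sketch):

```python
import math
import random

def cre_assignment(n, n_t, rng=random):
    """Draw one assignment vector from a CRE: a random permutation of
    n_t ones (treated) and n - n_t zeros (control)."""
    w = [1] * n_t + [0] * (n - n_t)
    rng.shuffle(w)
    return w

# Every w with sum(w) = n_t has probability 1 / C(n, n_t); for n = 4 and
# n_t = 2 that is 1/6 across the 6 possible assignment vectors.
print(cre_assignment(4, 2), 1 / math.comb(4, 2))
```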
Causal estimands vs. estimators
Typically, we want to define an estimable quantity that is related to the population of interest. In causal inference, the population of interest tends to be the sample at hand, which is assumed to arise from an infinite superpopulation. In causal inference, we know that causal effects are comparisons of unit-level potential outcomes, so averages of these values would make for natural causal estimands.
One would be the finite-sample mean of the individual causal effects: \[ \tau_{\text{fs}} = \frac{1}{n} \sum_{i=1}^n (Y^{W=1}_{i} - Y^{W=0}_{i}) \] Starting with the contrast from above, written in the \(\{0,1\}\) treatment labels: \[ \hat{\tau}^\text{dif} = \frac{1}{n_t} \sum_{i=1}^{n} Y_i\ind{W_i = 1} - \frac{1}{n_c} \sum_{i=1}^{n} Y_i\ind{W_i = 0} \] we can show that this is an unbiased estimator for \(\tau_{\text{fs}}\) when the experiment is completely randomized.
\[ \begin{align} \hat{\tau}^\text{dif}& = \frac{1}{n_t} \sum_{i=1}^n \ind{W_i = 1}Y_{i(0)} - \frac{1}{n_c} \sum_{i=1}^n \ind{W_i = 0}Y_{i(0)}\\ & = \frac{1}{n_t} \sum_{i=1}^n \ind{W_i = 1}Y^{W=1}_{i} - \frac{1}{n_c} \sum_{i=1}^n \ind{W_i = 0}Y^{W=0}_{i} \tag{Consistency}\\ & = \frac{1}{n} \sum_{i=1}^n \frac{\ind{W_i = 1}Y^{W=1}_{i}}{n_t/n} - \frac{1}{n} \sum_{i=1}^n \frac{\ind{W_i = 0}Y^{W=0}_{i}}{n_c/n} \\ \end{align} \] Given that we have a CRE, we can take the expectation of both sides with respect to \(\mathbf{W}\), conditional on the potential outcomes, to get
\[ \begin{align} &\Exp{\hat{\tau}^\text{dif} \mid Y^{W=1}, Y^{W=0}} \\ & \quad = \frac{1}{n} \sum_{i=1}^n\lp \frac{\Exp{\ind{W_i = 1}\mid Y^{W=1}, Y^{W=0}}Y^{W=1}_{i}}{n_t/n} - \frac{\Exp{\ind{W_i = 0}\mid Y^{W=1}, Y^{W=0}} Y^{W=0}_{i}}{n_c/n}\rp \\ & \quad = \frac{1}{n} \sum_{i=1}^n \lp\frac{\Exp{\ind{W_i = 1}}Y^{W=1}_{i}}{n_t/n} - \frac{\Exp{\ind{W_i = 0}} Y^{W=0}_{i}}{n_c/n}\rp \tag{Unconfoundedness} \\ & \quad = \frac{1}{n} \sum_{i=1}^n \lp Y^{W=1}_{i} - Y^{W=0}_{i} \rp \tag{CRE probabilities}\\ & \quad = \tau_\text{fs} \end{align} \] The second line uses our MACAR (unconfoundedness) assumption, which randomization satisfies; the third uses the fact that under a CRE each unit is treated with probability \(\Exp{\ind{W_i = 1}} = n_t/n\).
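We can check this unbiasedness numerically by enumerating every CRE assignment for a small hypothetical set of potential outcomes (Python sketch):

```python
import itertools
import numpy as np

# Hypothetical potential outcomes for n = 4 units.
Y_t = np.array([1, 0, 1, 1])  # Y_i^{W=1}
Y_c = np.array([0, 0, 1, 0])  # Y_i^{W=0}
n, n_t = 4, 2
tau_fs = (Y_t - Y_c).mean()

# Enumerate all C(n, n_t) assignments with n_t treated units and average the
# difference-in-means estimate over them.
estimates = []
for treated in itertools.combinations(range(n), n_t):
    w = np.zeros(n, dtype=int)
    w[list(treated)] = 1
    y_obs = w * Y_t + (1 - w) * Y_c
    estimates.append(y_obs[w == 1].mean() - y_obs[w == 0].mean())

print(np.mean(estimates), tau_fs)  # both 0.5: the estimator is unbiased
```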
Conditional variance of \(\hat{\tau}^\text{dif}\)
Let’s work through the case of \(n=2\) with \(n_t = 1\) and \(n_c = 1\). The estimator takes one of two values depending on \(W_1\): \[ \hat{\tau}^\text{dif} = \begin{cases} Y_1^{W=1} - Y_2^{W=0} & \text{if } W_1 = 1 \\ Y_2^{W=1} - Y_1^{W=0} & \text{if } W_1 = 0 \end{cases} \]
We can write this in one line: \[ \hat{\tau}^\text{dif} = W_1 \lp Y_1^{W=1} - Y_2^{W=0} \rp + (1 - W_1)\lp Y_2^{W=1} - Y_1^{W=0} \rp \]
Rewriting this in terms of \(D_1 = 2 W_1 - 1\), so that \(W_1 = (D_1 + 1) / 2\), leads to: \[ \begin{aligned} \hat{\tau}^\text{dif} & = \frac{D_1 + 1}{2} \lp Y_1^{W=1} - Y_2^{W=0} \rp + \frac{1 - D_1}{2}\lp Y_2^{W=1} - Y_1^{W=0} \rp \\ & = \frac{1}{2}\lp Y_1^{W=1} - Y_1^{W=0} + Y_2^{W=1} - Y_2^{W=0} \rp \\ & \quad + \frac{D_1}{2}\lp Y_1^{W=1} + Y_1^{W=0} - (Y_2^{W=1} + Y_2^{W=0}) \rp \end{aligned} \] The expectation of \(D_1\) is zero, so this shows that the mean of \(\hat{\tau}^{\text{dif}}\) conditional on the potential outcomes is the finite-sample causal effect \(\tau_\text{fs}\).
The variance is more complicated. The variance of \(D_1\) is 1, so the conditional variance of \(\hat{\tau}^{\text{dif}}\) is \[ \Var{\hat{\tau}^\text{dif} \mid Y^{W=1}, Y^{W=0}} = \frac{1}{4}\lp Y_1^{W=1} + Y_1^{W=0} - (Y_2^{W=1} + Y_2^{W=0}) \rp^2 \]
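A quick numerical check of this formula for two hypothetical units, enumerating both values of \(W_1\) directly:

```python
import numpy as np

# Two hypothetical units with made-up potential outcomes.
Y_t = np.array([3.0, 1.0])  # Y_i^{W=1}
Y_c = np.array([2.0, 0.0])  # Y_i^{W=0}

est = np.array([Y_t[0] - Y_c[1],   # W_1 = 1: unit 1 treated, unit 2 control
                Y_t[1] - Y_c[0]])  # W_1 = 0: unit 2 treated, unit 1 control
var_enum = est.var()  # each case has probability 1/2 under this design
var_formula = 0.25 * (Y_t[0] + Y_c[0] - (Y_t[1] + Y_c[1])) ** 2
print(var_enum, var_formula)  # 4.0 4.0
```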