Missing data lecture 14: Propensity score methods
Set up
Despite the importance of randomized trials, there are many causal queries we may wish to make for which a randomized trial is infeasible or unethical. In this case, if we can find an observational dataset in which unconfounded treatment assignment holds, then we may be able to do causal inference.
The sampling properties of the estimators we will cover in this lecture are derived from a somewhat different perspective than in the last lecture. In the previous lectures we focused on inference under the randomization distribution of the treatment assignment. In this lecture, we will consider the sample at hand to have been drawn from a broader, often infinite, super-population, and we will derive the sampling properties of our estimators under repeated samples from that super-population.
This lecture is nearly verbatim from Imbens and Rubin (2015).
Super population vs. finite sample inference
In the previous lecture we investigated the properties of the difference-in-means estimator, which was shown to be unbiased for the finite-sample average causal effect: \[ \tau_{\text{fs}} = \frac{1}{n}\sum_{i=1}^n (Y_i(1) - Y_i(0)) \] In the super-population perspective, we will instead target the super-population average treatment effect: \[ \tau_{\text{sp}} = \ExpA{Y_i(1) - Y_i(0)}{\text{sp}} \]
The generative model is something like: \[ \begin{aligned} (Y_i(0), Y_i(1), X_i) & \overset{\text{iid}}{\sim} F,\, i = 1, \dots, n\\ \mathbf{W} \mid (\mathbf{Y}(0), \mathbf{Y}(1), \mathbf{X}) & \sim P_{\mathbf{W} \mid \mathbf{Y}(0), \mathbf{Y}(1), \mathbf{X}} \end{aligned} \] In the setting of a randomized trial, the experimenter controls the assignment mechanism, but does not control the sampling stage (though we could imagine exclusion criteria in an RCT that exclude groups based on pretreatment covariates).
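As a concrete (entirely hypothetical) instance of this two-stage model, here is a minimal numpy sketch, assuming a linear outcome model and a logistic, covariate-dependent assignment mechanism; none of these specific functional forms come from the text:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000

# Stage 1: draw (Y_i(0), Y_i(1), X_i) iid from the super-population F.
X = rng.normal(size=n)
Y0 = X + rng.normal(size=n)          # untreated potential outcome
Y1 = X + 2.0 + rng.normal(size=n)    # treated potential outcome (unit effect ~ 2)

# Stage 2: assign treatment given the drawn sample; here assignment
# depends on X only, so unconfoundedness given X holds by construction.
e = 1.0 / (1.0 + np.exp(-X))         # P(W_i = 1 | X_i)
W = rng.binomial(1, e)

# Only one potential outcome is observed per unit.
Y = np.where(W == 1, Y1, Y0)
```

The observed data are \((Y_i, W_i, X_i)\); the unobserved potential outcome for each unit is the "missing data" of the lecture title.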
In this expression, the expectation is taken over repeated sampling from the super-population. Note that the expectation of \(\tau_{\text{fs}}\) over the super-population equals \(\tau_{\text{sp}}\).
It’s important to keep in mind that the treatment assignment is still random; the super-population sampling simply adds another level of randomness.
Let \[\sigma_t^2 = \VarA{Y_i(1)}{\text{sp}},\] \[\sigma_c^2 = \VarA{Y_i(0)}{\text{sp}},\]
\[\sigma_{tc}^2 = \VarA{Y_i(1) - Y_i(0)}{\text{sp}} = \ExpA{(Y_i(1) - Y_i(0) - \tau_{\text{sp}})^2}{\text{sp}}\] Note that \(\VarA{\tau_{\text{fs}}}{\text{sp}} = \VarA{\overline{Y(1)} - \overline{Y(0)}}{\text{sp}} = \sigma^2_{tc} / n\), where \(\overline{Y(1)}\) and \(\overline{Y(0)}\) are the sample means of the potential outcomes over all \(n\) units.
The total variance (over both the super-population sampling distribution and the finite sample distribution of the treatment assignment given the fixed potential outcomes) of the difference in means estimator: \[ \hat{\tau}^{\text{dif}} = \bar{Y}_t - \bar{Y}_c \] has a nice expression: \[ \VarA{\hat{\tau}^{\text{dif}}}{(W, Y(0), Y(1))} = \frac{\sigma_c^2}{n_c} + \frac{\sigma_t^2}{n_t} \] In what comes next, when we use \(\Var{Y}\) or \(\Exp{X}\), we mean the variance or expectation over the joint distribution of \(\mathbf{W}, \mathbf{Y}(0), \mathbf{Y}(1)\).
This comes from the following: \[ \begin{aligned} \Var{\hat{\tau}^{\text{dif}}} & = \Exp{(\bar{Y}_t - \bar{Y}_c - \Exp{\bar{Y}_t - \bar{Y}_c})^2}\\ & = \Exp{\left(\bar{Y}_t - \bar{Y}_c - \ExpA{\ExpA{\bar{Y}_t - \bar{Y}_c \mid \mathbf{Y}(0),\mathbf{Y}(1)}{\mathbf{W} \mid \mathbf{Y}(0),\mathbf{Y}(1)}}{\text{sp}}\right)^2}\\ & = \Exp{\left(\bar{Y}_t - \bar{Y}_c - \ExpA{\tau_{\text{fs}}}{\text{sp}}\right)^2}\\ & = \Exp{\left(\bar{Y}_t - \bar{Y}_c - \tau_{\text{sp}}\right)^2}\\ & = \Exp{\left(\bar{Y}_t - \bar{Y}_c - \tau_{\text{fs}} + \tau_{\text{fs}} - \tau_{\text{sp}}\right)^2}\\ & = \Exp{\left(\bar{Y}_t - \bar{Y}_c - \tau_{\text{fs}}\right)^2} + \Exp{\left(\tau_{\text{fs}} - \tau_{\text{sp}}\right)^2}\\ & \quad + 2\Exp{\left(\bar{Y}_t - \bar{Y}_c - \tau_{\text{fs}}\right)\left(\tau_{\text{fs}} - \tau_{\text{sp}}\right)} \end{aligned} \] The cross term vanishes: \(\tau_{\text{fs}}\) is fixed given \(\mathbf{Y}(0), \mathbf{Y}(1)\), and the difference in means is unbiased for \(\tau_{\text{fs}}\) under randomization, so \[ \begin{aligned} \ExpA{\ExpA{\left(\bar{Y}_t - \bar{Y}_c - \tau_{\text{fs}}\right)\left(\tau_{\text{fs}} - \tau_{\text{sp}}\right)}{\mathbf{W} \mid \mathbf{Y}(0), \mathbf{Y}(1)}}{\text{sp}} & = \ExpA{\left(\tau_{\text{fs}} - \tau_{\text{sp}}\right)\ExpA{\bar{Y}_t - \bar{Y}_c - \tau_{\text{fs}}}{\mathbf{W} \mid \mathbf{Y}(0), \mathbf{Y}(1)}}{\text{sp}} \\ & = \ExpA{\left(\tau_{\text{fs}} - \tau_{\text{sp}}\right)\left(\ExpA{\bar{Y}_t - \bar{Y}_c}{\mathbf{W} \mid \mathbf{Y}(0), \mathbf{Y}(1)} - \tau_{\text{fs}}\right)}{\text{sp}} \\ & = \ExpA{\left(\tau_{\text{fs}} - \tau_{\text{sp}}\right)\left(\tau_{\text{fs}} - \tau_{\text{fs}}\right)}{\text{sp}} \\ & = 0 \end{aligned} \] Thus our variance is the sum of two terms
\[ \begin{aligned} \Var{\hat{\tau}^{\text{dif}}} & = \Exp{\left(\bar{Y}_t - \bar{Y}_c - \tau_{\text{fs}}\right)^2} + \Exp{\left(\tau_{\text{fs}} - \tau_{\text{sp}}\right)^2}\\ & = \Exp{\left(\bar{Y}_t - \bar{Y}_c - \tau_{\text{fs}}\right)^2} + \ExpA{\left(\tau_{\text{fs}} - \tau_{\text{sp}}\right)^2}{\text{sp}}\\ \end{aligned} \]
The second term, \(\ExpA{\left(\tau_{\text{fs}} - \tau_{\text{sp}}\right)^2}{\text{sp}}\), is the variance of \(\tau_{\text{fs}}\) over the super-population, and so equals \(\sigma^2_{tc} / n\).
The first part, \(\Exp{\left(\bar{Y}_t - \bar{Y}_c - \tau_{\text{fs}}\right)^2}\), can be decomposed using the law of iterated expectations: \[ \begin{aligned} \Exp{\left(\bar{Y}_t - \bar{Y}_c - \tau_{\text{fs}}\right)^2} & = \ExpA{\ExpA{\left(\bar{Y}_t - \bar{Y}_c - \tau_{\text{fs}}\right)^2}{\mathbf{W} \mid \mathbf{Y}(0), \mathbf{Y}(1)}}{\text{sp}} \\ & = \ExpA{\frac{S^2_t}{n_t} + \frac{S^2_c}{n_c} - \frac{S^2_{tc}}{n}}{\text{sp}} \\ & = \frac{\sigma^2_t}{n_t} + \frac{\sigma^2_c}{n_c} - \frac{\sigma^2_{tc}}{n} \end{aligned} \] where \(S^2_t, S^2_c, S^2_{tc}\) are the finite-sample variances of \(Y_i(1)\), \(Y_i(0)\), and \(Y_i(1) - Y_i(0)\); the inner expectation is the randomization variance from the previous lecture.
Combining the two results, the \(\sigma^2_{tc}/n\) terms cancel, giving the expression for the super-population variance: \[ \Var{\hat{\tau}^{\text{dif}}} = \frac{\sigma^2_c}{n_c} + \frac{\sigma^2_t}{n_t} \]
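A quick Monte Carlo check of this result, under an assumed Gaussian super-population with known \(\sigma^2_c = 1\), \(\sigma^2_t = 1.25\), \(\sigma^2_{tc} = 0.25\) (a sketch; the specific generative choices are not from the text):

```python
import numpy as np

rng = np.random.default_rng(1)
n, n_t = 100, 50
n_c = n - n_t
tau_sp = 2.0
reps = 4000

est = np.empty(reps)
for r in range(reps):
    # Draw a fresh sample from the super-population each replication.
    Y0 = rng.normal(size=n)                            # sigma_c^2 = 1
    Y1 = tau_sp + Y0 + rng.normal(scale=0.5, size=n)   # sigma_t^2 = 1.25
    # Completely randomized assignment with n_t treated units.
    W = np.zeros(n, dtype=int)
    W[rng.permutation(n)[:n_t]] = 1
    Y = np.where(W == 1, Y1, Y0)
    est[r] = Y[W == 1].mean() - Y[W == 0].mean()

theory = 1.0 / n_c + 1.25 / n_t   # sigma_c^2 / n_c + sigma_t^2 / n_t
empirical = est.var()
```

The empirical variance of \(\hat{\tau}^{\text{dif}}\) over replications, which averages over both the sampling and the assignment randomness, should track the theoretical total variance.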
Balancing scores
A balancing score \(b(X_i)\) is such that \(W_i \indy X_i \mid b(X_i)\).
We will show that the propensity score \(e(X_i) = P(W_i = 1 \mid X_i)\) is a balancing score, i.e., that \(P(W_i = 1 \mid X_i, e(X_i)) = P(W_i = 1 \mid e(X_i))\).
Proof. \[ \begin{aligned} P(W_i = 1 \mid X_i, e(X_i)) & = P(W_i = 1 \mid X_i) \\ & = e(X_i) \end{aligned} \]
Follows from the fact that \(e(X_i)\) is a function of \(X_i\), so conditioning on \(X_i\) already carries all the information in \(e(X_i)\).
Now we need to show that \(P(W_i = 1 \mid e(X_i)) = e(X_i)\).
\[ \begin{aligned} P(W_i = 1 \mid e(X_i)) & = \ExpA{W_i \mid e(X_i)}{W_i} \\ & = \ExpA{\ExpA{W_i \mid X_i, e(X_i)}{W_i \mid X_i, e(X_i)}}{X_i \mid e(X_i)}\\ & = \ExpA{\ExpA{W_i \mid X_i}{W_i \mid X_i, e(X_i)}}{X_i \mid e(X_i)}\\ & = \ExpA{e(X_i)}{X_i \mid e(X_i)}\\ & = e(X_i) \end{aligned} \]
Now that we’ve established that \(e(X_i)\) is a balancing score, we show that under unconfounded treatment assignment, \(W_i \indy Y_i(0), Y_i(1) \mid b(X_i)\) for any balancing score \(b(X_i)\).
Proof. We need to show that \(P(W_i = 1 \mid Y_i(0), Y_i(1), b(X_i)) = P(W_i = 1 \mid b(X_i))\). We have the following assumption, that \(W_i \indy Y_i(0), Y_i(1) \mid X_i\), and that by the definition of a balancing score, \(W_i \indy X_i \mid b(X_i)\). \[ \begin{aligned} P(W_i = 1 \mid Y_i(0), Y_i(1), b(X_i)) & = \Exp{W_i \mid Y_i(0), Y_i(1), b(X_i)} \\ & = \ExpA{\ExpA{W_i \mid X_i, Y_i(0), Y_i(1), b(X_i)}{W_i \mid Y_i(0), Y_i(1), b(X_i), X_i}}{X_i \mid b(X_i)} \\ & = \ExpA{\ExpA{W_i \mid X_i, b(X_i)}{W_i \mid Y_i(0), Y_i(1), b(X_i), X_i}}{X_i \mid b(X_i)} \\ & = \ExpA{\ExpA{W_i \mid b(X_i)}{W_i \mid Y_i(0), Y_i(1), b(X_i), X_i}}{X_i \mid b(X_i)} \\ & = \ExpA{P(W_i = 1 \mid b(X_i))}{X_i \mid b(X_i)} \\ & = P(W_i = 1 \mid b(X_i)) \end{aligned} \]
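The balancing property can also be seen numerically. Below is a hedged simulation sketch (two covariates and a logistic assignment mechanism, all assumed for illustration): marginally, \(X_1\) is imbalanced across treatment groups, but within narrow strata of the propensity score the imbalance essentially disappears:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 200_000
X1, X2 = rng.normal(size=n), rng.normal(size=n)
e = 1.0 / (1.0 + np.exp(-(X1 + X2)))   # propensity depends on both covariates
W = rng.binomial(1, e)

# Unconditionally, X1 differs across treatment groups (confounding).
raw_diff = X1[W == 1].mean() - X1[W == 0].mean()

# Within narrow strata of e(X), X1 should be balanced: W ⊥ X | e(X).
cuts = np.quantile(e, np.linspace(0, 1, 21))
strata = np.clip(np.searchsorted(cuts, e, side="right") - 1, 0, 19)
diffs, weights = [], []
for j in range(20):
    idx = strata == j
    diffs.append(X1[idx & (W == 1)].mean() - X1[idx & (W == 0)].mean())
    weights.append(idx.mean())
balanced_diff = np.dot(weights, diffs)   # near zero up to stratum coarseness
```

The residual within-stratum imbalance shrinks as the strata narrow, which is the practical content of the balancing-score result.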
Propensity score estimation
For this section, we’re going to need a little more notation. Let \[ \begin{aligned} \mu_c(x) & = \ExpA{Y_i(0) \mid X_i = x}{\text{sp}} \\ \mu_t(x) & = \ExpA{Y_i(1) \mid X_i = x}{\text{sp}} \\ \sigma^2_c(x) & = \VarA{Y_i(0) \mid X_i = x}{\text{sp}} \\ \sigma^2_t(x) & = \VarA{Y_i(1) \mid X_i = x}{\text{sp}} \\ \tau_{\text{sp}}(x) & = \ExpA{Y_i(1) - Y_i(0) \mid X_i = x}{\text{sp}} \end{aligned} \]
There are two causal estimands that are typically of interest (though there are of course others that could be of interest):
\[ \begin{aligned} \tau_{\text{sp}} & = \ExpA{\tau_\text{sp}(X_i)}{\text{sp}} \\ \tau_{\text{cond}} & = \frac{1}{n} \sum_{i=1}^n \tau_{\text{sp}}(X_i) \end{aligned} \]
The first is the population average treatment effect, which requires inference on two pieces: \(\tau_{\text{sp}}(x)\) and the covariate distribution \(f_\text{sp}(x)\). The second is the conditional average treatment effect, which substitutes the empirical distribution of the sample, \(\mathbb{P}_n(x)\), for \(f_\text{sp}(x)\).
The nonparametric variance lower bound for estimating \(\tau_{\text{sp}}\) is, asymptotically, \[ \begin{aligned} \VarA{\hat{\tau}_{\text{sp}}}{\text{sp}} \geq \frac{1}{n} \ExpA{\frac{\sigma^2_c(X_i)}{1 - e(X_i)} + \frac{\sigma^2_t(X_i)}{e(X_i)} + (\tau_\text{sp}(X_i) - \tau_\text{sp})^2}{\text{sp}} \end{aligned} \] This shows that poor covariate overlap between treated and control units (propensity scores near 0 or 1) inflates the attainable asymptotic variance of our estimators.
On the other hand, the nonparametric variance lower bound for \(\hat{\tau}_{\text{cond}}\) is \[ \begin{aligned} \VarA{\hat{\tau}_{\text{cond}}}{\text{sp}} \geq \frac{1}{n} \ExpA{\frac{\sigma^2_c(X_i)}{1 - e(X_i)} + \frac{\sigma^2_t(X_i)}{e(X_i)}}{\text{sp}} \end{aligned} \] which omits the treatment-effect-heterogeneity term.
Importantly, the super-population variance bound depends on which estimand we choose.
Strategies for estimating causal effects
There are three strategies our text lays out for estimating causal estimands. They are:
Model-based inference
Weighting estimators
Blocking estimators
Matching estimators
The text argues against the first two and makes a case for the second two. The argument against model-based inference is straightforward.
The case against model-based inference
Suppose we use a regression estimator to learn the distribution of \(Y_i(1) \mid X_i\) within the treated units and \(Y_i(0) \mid X_i\) within the control units with population parameters \(\beta_t, \beta_c\) respectively.
Using OLS on each group yields estimators \(\hat{\beta}_t\) and \(\hat{\beta}_c\). The estimand is \(\tau_{\text{sp}}\), or \[ \ExpA{ \frac{1}{n} \sum_{i=1}^n (Y_i(1) - Y_i(0)) }{\text{sp}} \] In practice, we can estimate \(Y_i(0)\) for those in the treated group by imputing the conditional mean of their outcome with \(\hat{Y}_i(0) = X_i \hat{\beta}_c\), and in the control group, we can similarly estimate \(\hat{Y}_i(1) = X_i \hat{\beta}_t\). Then the OLS estimator for \(\tau_\text{sp}\) is: \[ \hat{\tau}^{\text{ols}}_\text{sp} = \frac{1}{n} \sum_{i=1}^n \left[ W_i(Y_i - \hat{Y}_i(0)) + (1 - W_i) (\hat{Y}_i(1) - Y_i) \right] \]
We can rewrite this estimator as a weighted average of the estimate of the treatment effect in the treated group and in the control group: \[ \hat{\tau}^{\text{ols}}_\text{sp} = \frac{n_t}{n} \hat{\tau}^{\text{ols}}_t + \frac{n_c}{n} \hat{\tau}^{\text{ols}}_c \] where \[ \hat{\tau}^{\text{ols}}_t = \frac{1}{n_t}\sum_{i \mid W_i = 1} (Y_i - X_i \hat{\beta}_c) \] and \[ \hat{\tau}^{\text{ols}}_c = \frac{1}{n_c}\sum_{i \mid W_i = 0} (X_i \hat{\beta}_t - Y_i) \]
We can write \[ \begin{aligned} \hat{\tau}^{\text{ols}}_t & = \bar{Y}_t - \bar{X}_t \hat{\beta}_c \\ & = \bar{Y}_t - \bar{X}_t \hat{\beta}_c - \bar{Y}_c + \bar{X}_c \hat{\beta}_c \\ & = \bar{Y}_t - \bar{Y}_c - (\bar{X}_t - \bar{X}_c) \hat{\beta}_c \end{aligned} \] using \(\bar{Y}_c = \bar{X}_c \hat{\beta}_c\), which holds because OLS residuals in the control group average to zero (when the regression includes an intercept). Similarly, using \(\bar{Y}_t = \bar{X}_t \hat{\beta}_t\), \[ \begin{aligned} \hat{\tau}^{\text{ols}}_c & = \bar{X}_c \hat{\beta}_t - \bar{Y}_c \\ & = \bar{X}_c \hat{\beta}_t - \bar{Y}_c + \bar{Y}_t - \bar{X}_t \hat{\beta}_t \\ & = \bar{Y}_t - \bar{Y}_c - (\bar{X}_t - \bar{X}_c) \hat{\beta}_t \end{aligned} \]
The issue with both of these estimators is the terms \((\bar{X}_t - \bar{X}_c) \hat{\beta}_t\) and \((\bar{X}_t - \bar{X}_c) \hat{\beta}_c\). This wasn’t an issue when we had random treatment assignment because \(\ExpA{\bar{X}_t - \bar{X}_c }{W} = 0\), but when covariates are imbalanced, our estimates extrapolate across groups and become sensitive to the linearity assumption.
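The decomposition above is an exact algebraic identity, which a short simulation can confirm; this is a sketch on assumed simulated data (linear outcomes, logistic assignment), not the text's own example:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
x = rng.normal(size=n)
e = 1.0 / (1.0 + np.exp(-x))
W = rng.binomial(1, e)
Y = 1.0 + 2.0 * x + 2.0 * W + rng.normal(size=n)   # true effect = 2

X = np.column_stack([np.ones(n), x])               # design matrix with intercept
bt, *_ = np.linalg.lstsq(X[W == 1], Y[W == 1], rcond=None)   # beta_t from treated
bc, *_ = np.linalg.lstsq(X[W == 0], Y[W == 0], rcond=None)   # beta_c from controls

# Imputation form: fill in the missing potential outcome for each unit.
Yhat0, Yhat1 = X @ bc, X @ bt
tau_ols = np.mean(W * (Y - Yhat0) + (1 - W) * (Yhat1 - Y))

# Weighted-average form from the decomposition.
n_t, n_c = W.sum(), n - W.sum()
tau_t = Y[W == 1].mean() - (X[W == 1] @ bc).mean()
tau_c = (X[W == 0] @ bt).mean() - Y[W == 0].mean()
tau_weighted = (n_t / n) * tau_t + (n_c / n) * tau_c
```

Here the linear model is correctly specified, so the estimator is close to the truth; the text's warning is about what happens when it is not.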
The case against weighting
We encountered weighting estimators earlier in the course when discussing estimation of a population mean under nonresponse.
The same issues that we discussed in that lecture will apply here, but in a different form.
The result \[ \begin{aligned} \ExpA{\frac{W_i Y_i}{e(X_i)}}{W, X, Y_i(0), Y_i(1)} & = \ExpA{\ExpA{\frac{W_i Y_i(1)}{e(X_i)}}{W_i \mid X, Y_i(0), Y_i(1)}}{X, Y_i(0), Y_i(1)} \\ & = \ExpA{\frac{e(X_i)\, Y_i(1)}{e(X_i)}}{X, Y_i(0), Y_i(1)} \\ & = \Exp{Y_i(1)} \end{aligned} \] (using \(W_i Y_i = W_i Y_i(1)\) and unconfoundedness, so that \(\ExpA{W_i}{W_i \mid X, Y_i(0), Y_i(1)} = e(X_i)\)) suggests the Horvitz–Thompson-style estimator: \[ \hat{\tau}^\text{ht}_{\text{sp}} = \frac{1}{n} \sum_{i=1}^n \frac{W_i Y_i}{\hat{e}(X_i)} - \frac{1}{n} \sum_{i=1}^n \frac{(1 - W_i) Y_i}{1 - \hat{e}(X_i)} \] The issue is that when treatment isn’t randomly assigned, one will likely have some units where \(\hat{e}(X_i)\) is near 1 or 0, leading to large weights and large variance.
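A minimal sketch of the Horvitz–Thompson-style estimator on simulated data; for simplicity the true propensity score is used in place of an estimate \(\hat{e}(X_i)\), and the generative model is assumed for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000
x = rng.normal(size=n)
e = 1.0 / (1.0 + np.exp(-x))          # true propensity score (stands in for e-hat)
W = rng.binomial(1, e)
Y0 = x + rng.normal(size=n)
Y1 = x + 2.0 + rng.normal(size=n)     # true ATE = 2
Y = np.where(W == 1, Y1, Y0)

# Inverse-propensity-weighted (Horvitz-Thompson-style) estimator.
tau_ht = np.mean(W * Y / e) - np.mean((1 - W) * Y / (1 - e))
```

With heavier-tailed covariates, \(e(X_i)\) would get much closer to 0 or 1 and the weights \(1/e\), \(1/(1-e)\) would blow up the variance, which is exactly the concern raised above.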
The argument for blocking
The algorithm is as follows:
Construct the estimator for the propensity score \(\hat{e}(X_i)\)
Create \(J\) strata based on propensity scores such that \(B_i(j) = 1\) if \(b_{j-1} < \hat{e}(X_i) < b_j\) and \(0\) otherwise.
Estimate causal effects within each stratum: \[ \hat{\tau}_j = \frac{1}{\sum_{i \mid B_i(j) = 1} W_i} \sum_{i \mid B_i(j) = 1} Y_i W_i - \frac{1}{\sum_{i \mid B_i(j) = 1} (1 - W_i)} \sum_{i \mid B_i(j) = 1} Y_i (1 - W_i) \]
Estimate the overall effect as \(\hat{\tau} = \sum_{j=1}^J \frac{\sum_{i=1}^n B_i(j)}{n} \hat{\tau}_j\)
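The algorithm above can be sketched in a few lines of numpy; this is a hedged illustration on simulated data, using the true propensity score in place of step 1's estimate and quantile cut points for the strata:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 50_000
x = rng.normal(size=n)
e = 1.0 / (1.0 + np.exp(-x))           # stands in for the estimated e-hat of step 1
W = rng.binomial(1, e)
Y = x + 2.0 * W + rng.normal(size=n)   # constant effect of 2, confounded by x

naive = Y[W == 1].mean() - Y[W == 0].mean()   # biased upward by confounding

# Step 2: form J strata from quantiles of the propensity score.
J = 10
cuts = np.quantile(e, np.linspace(0, 1, J + 1))
strata = np.clip(np.searchsorted(cuts, e, side="right") - 1, 0, J - 1)

# Steps 3-4: difference in means within each stratum, averaged with weights n_j / n.
tau_hat = 0.0
for j in range(J):
    idx = strata == j
    tau_j = Y[idx & (W == 1)].mean() - Y[idx & (W == 0)].mean()
    tau_hat += idx.mean() * tau_j
```

The blocked estimate removes most of the confounding bias that the naive difference in means carries; a small residual bias remains because the propensity score varies within each stratum.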
Matching
This is the simplest method to explain: for every treated unit in the analysis, find one or more untreated units with the same (or similar) covariate values. The approach is simple, but runs into trouble when we have many covariates, since exact or close matches become rare.
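A minimal 1-nearest-neighbor matching sketch on a single simulated covariate (matching with replacement; all generative choices assumed for illustration). Note this targets the effect on the treated, which coincides with the ATE here because the simulated effect is constant:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 2_000
x = rng.normal(size=n)
e = 1.0 / (1.0 + np.exp(-x))
W = rng.binomial(1, e)
Y = x + 2.0 * W + rng.normal(size=n)   # constant effect of 2, confounded by x

treated = np.where(W == 1)[0]
control = np.where(W == 0)[0]

# For each treated unit, find the control unit with the closest covariate value
# (matching with replacement: a control may be reused for several treated units).
dist = np.abs(x[treated][:, None] - x[control][None, :])
matches = control[dist.argmin(axis=1)]

tau_att = np.mean(Y[treated] - Y[matches])
```

With many covariates the pairwise distances concentrate and close matches disappear, which is the curse-of-dimensionality problem alluded to above; matching on the (scalar) propensity score is one response.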
Mixtures of methods
The ultimate recommendation is to use a combination of approaches, such as blocking plus regression adjustment within blocks, to obtain efficient and less biased estimators.