gen_data <- function(n = 200) {
  # Latent covariates Z ~ N(0, I_4)
  Z <- matrix(rnorm(n * 4), n, 4)
  # True outcome model: E[y] = 210 + 27.4 z1 + 13.7 (z2 + z3 + z4)
  y_hat <- 210 + 27.4 * Z[, 1] + 13.7 * rowSums(Z[, 2:4])
  y <- y_hat + rnorm(n)
  # True probability of being observed, P(M = 0 | Z)
  pi_y <- plogis(-Z[, 1] + 0.5 * Z[, 2] - 0.25 * Z[, 3] - 0.1 * Z[, 4])
  m <- rbinom(n, 1, 1 - pi_y)
  # The analyst only ever sees nonlinear transformations X of Z
  X <- Z
  X[, 1] <- exp(Z[, 1] / 2)
  X[, 2] <- Z[, 2] / (1 + exp(Z[, 1])) + 10
  X[, 3] <- (Z[, 1] * Z[, 3] / 25 + 0.6)^3
  X[, 4] <- (Z[, 2] + Z[, 4] + 20)^2
  # Mask the outcome for the missing cases (m = 1)
  y_o <- y
  y_o[m == 1] <- NA
  df <- data.frame(
    y = y_o,
    m = m,
    x1 = X[, 1],
    x2 = X[, 2],
    x3 = X[, 3],
    x4 = X[, 4]
  )
  return(df)
}
HW 1
1 Question 1
In this question, you’ll be replicating some of the results of the Kang and Schafer (2007) paper and expanding on them. In the paper, the authors test the robustness of IPW, AIPW, and regression estimators via a simulation study. The authors simulate a survey where an outcome \(y_i\) is measured for \(n\) individuals, but a substantial number of observations (\(\sim 50\%\)) are missing. The target estimand is the population mean of \(y_i\), which is simulated to be \(210\). The \(y_i\) are simulated via a linear model conditional on covariates \(Z_{i1}, Z_{i2}, Z_{i3}, Z_{i4}\); the missingness indicators are simulated via a logistic regression model that is also linear in the \(Z\) predictors.
The catch is that the analyst is assumed to have access only to complex transformations of the predictors, in the form of \(X_{i1}, X_{i2}, X_{i3}, X_{i4}\). These transformed predictors can be used in a linear model to produce fits that are “good enough,” meaning that they predict the outcomes reasonably well and show no obvious signs of lack of fit.
The operating characteristics, namely bias, variance, and MSE, are measured for several different estimators that involve linear models and logistic regressions for \(y_i \mid x_i\) and \(m_i \mid x_i\). Let \(e(x_i, \hat{\gamma})\) be the fitted probability of being observed, \(P(M_i = 0 \mid X_i = x_i)\), from the missingness model, and let \(\mu(x_i, \hat{\beta})\) be the fitted value from a linear regression of \(y_i\) on \(x_i\).
An IPW estimator: \[ \bar{y}^{\text{IPW}} = \frac{\sum_{i \mid m_i = 0} y_i e(x_i, \hat{\gamma})^{-1}}{\sum_{i \mid m_i = 0} e(x_i, \hat{\gamma})^{-1}} \]
A regression estimator: \(\bar{y}^{\text{reg}} = \frac{1}{n} \sum_{i=1}^n \mu(x_i, \hat{\beta})\)
An AIPW estimator: \(\bar{y}^{\text{AIPW}} = \bar{y}^{\text{reg}} + \frac{\sum_{i \mid m_i = 0}\hat{\epsilon}_i e(x_i, \hat{\gamma})^{-1}}{\sum_{i \mid m_i = 0} e(x_i, \hat{\gamma})^{-1}}\)
Here \(\hat{\epsilon}_i = y_i - \mu(x_i, \hat{\beta})\) is the residual from the outcome regression, computed for the observed cases (those with \(m_i = 0\)).
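To make the notation concrete, here is a minimal sketch of how the three estimators map to R, assuming df is one dataset from gen_data, fit_y is an lm fit to the observed cases, and fit_m is a binomial glm for \(P(M_i = 0 \mid X_i)\) (the object names are illustrative, not part of the assignment):

e_hat  <- predict(fit_m, newdata = df, type = "response")  # fitted P(m = 0 | x)
mu_hat <- predict(fit_y, newdata = df)                     # fitted E[y | x], all n rows
obs    <- df$m == 0
w      <- 1 / e_hat[obs]                       # inverse-probability weights

y_ipw  <- sum(df$y[obs] * w) / sum(w)          # IPW (Hajek form)
y_reg  <- mean(mu_hat)                         # regression estimator
eps    <- df$y[obs] - mu_hat[obs]              # residuals on observed cases
y_aipw <- y_reg + sum(eps * w) / sum(w)        # AIPW correction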
I want you to roughly recreate part of the paper’s results by running the smallest simulation study scenario. I’ve written the simulation function for you; you have to write the code to run the regressions and the code to compute the estimators.
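As a quick orientation, a minimal usage sketch for gen_data (the seed is arbitrary): roughly half the outcomes come back missing, and because missingness depends on the same \(Z\)’s that drive \(y\), the naive complete-case mean is biased away from the true mean of 210.

set.seed(123)                  # arbitrary seed, for reproducibility only
df <- gen_data(n = 200)
mean(df$m)                     # proportion missing, roughly 0.5
mean(df$y, na.rm = TRUE)       # complete-case mean; biased relative to 210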
1.1 Question 1a
Write functions that take the data frame returned by gen_data and return a linear model for \(E[Y_i \mid X_i]\) and a logistic regression for \(P(M_i = 0 \mid X_i)\).
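One possible shape for these functions (a sketch assuming the model formulas use all four transformed covariates; the function names are illustrative):

fit_outcome <- function(df) {
  # linear model for E[Y | X], fit on the observed cases only
  lm(y ~ x1 + x2 + x3 + x4, data = df, subset = m == 0)
}

fit_missing <- function(df) {
  # logistic regression for P(M = 0 | X); note the target is "observed"
  glm(I(m == 0) ~ x1 + x2 + x3 + x4, family = binomial, data = df)
}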
1.2 Question 1b
Write functions that compute \(\bar{y}^{\text{IPW}}, \bar{y}^{\text{AIPW}}, \bar{y}^{\text{reg}}\). These functions should take a model fit and a dataset as arguments and return the estimator.
1.3 Question 1c
Generate 1000 simulated datasets and use the functions you wrote above to compute the three estimators for each dataset.
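A minimal sketch of the simulation loop, reusing the fitting sketches above and hypothetical estimator functions ybar_ipw(), ybar_reg(), ybar_aipw() from Question 1b:

set.seed(2007)  # arbitrary seed
n_sims <- 1000
results <- t(replicate(n_sims, {
  df    <- gen_data(n = 200)
  fit_y <- fit_outcome(df)
  fit_m <- fit_missing(df)
  c(ipw  = ybar_ipw(fit_m, df),
    reg  = ybar_reg(fit_y, df),
    aipw = ybar_aipw(fit_y, fit_m, df))
}))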
1.4 Question 1d
Compute the bias, root mean squared error, and percent bias (the bias expressed as a percentage of the standard deviation of the estimator) for each estimator.
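Given a results matrix with one column per estimator (as in the sketch above), the summaries might be computed along these lines:

summarize_est <- function(est, truth = 210) {
  bias <- mean(est) - truth
  c(bias     = bias,
    rmse     = sqrt(mean((est - truth)^2)),
    pct_bias = 100 * bias / sd(est))  # bias as a % of the estimator's SD
}
t(apply(results, 2, summarize_est))   # one row per estimator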
1.5 Question 1e
Pick your favorite machine learning method (random forests, support vector machines, gradient boosting machines, xgboost, etc.) and use it to compute nonparametric versions of \(\mu(x_i, \beta)\) and \(e(x_i, \gamma)\). Recompute the estimators using your machine learning fits and compute the same summary statistics as in Question 1d.
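For example, with random forests (a sketch using the randomForest package; any of the listed methods would follow the same pattern of swapping in nonparametric fits for the lm and glm):

library(randomForest)

fit_outcome_rf <- function(df) {
  # regression forest for E[Y | X], fit on the observed cases only
  randomForest(y ~ x1 + x2 + x3 + x4, data = df[df$m == 0, ])
}

fit_missing_rf <- function(df) {
  df$obs <- factor(df$m == 0)  # classification target: observed vs. missing
  randomForest(obs ~ x1 + x2 + x3 + x4, data = df)
}

# predicted P(m = 0 | x), in place of type = "response" from glm
e_hat <- predict(fit_missing_rf(df), newdata = df, type = "prob")[, "TRUE"]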
1.6 Question 1f
Write several sentences summarizing your results.
2 Question 2
Load the Surrogate package in R and load the dataset Schizo_PANSS:
library(Surrogate)
data("Schizo_PANSS")The dataset combines five clinical trials aimed at determining if risperidone decreases the Positive and Negative Syndrome Score (PANSS) over time compared to a control treatment for patients with schizophrenia. These are longitudinal trials where patients are assessed at weeks 1, 2, 4, 6 and 8 after being assigned to a treatment arm.
Each row in the dataset is a different trial participant, and Week1, Week2, Week4, Week6, and Week8 record the change in PANSS from baseline. The variable Treat indicates whether the patient was in the control arm (-1) or the risperidone arm (1).
Subset the data to include only Week1, Week4 and Week8:
hw_data <- Schizo_PANSS[, c("Id", "Treat", "Week1", "Week4", "Week8")]
2.1 Question 2a
Summarize the missingness patterns in the hw_data dataset. How many missingness patterns are there? What are they? What proportion of patients are associated with each missingness pattern?
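One way to tabulate the patterns in base R (a sketch; in the pattern string, 1 = missing and 0 = observed, ordered Week1, Week4, Week8):

pat <- apply(is.na(hw_data[, c("Week1", "Week4", "Week8")]), 1,
             function(r) paste(as.integer(r), collapse = ""))
table(pat)                        # counts per pattern, e.g. "000" = fully observed
round(prop.table(table(pat)), 3)  # proportion of patients per pattern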
2.2 Question 2b
How would you assess whether there is evidence that treatment affects missingness? Is there evidence that treatment affects the missingness pattern?
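One reasonable starting point (a sketch, reusing the pat variable from Question 2a) is to cross-tabulate pattern by treatment arm and apply a chi-squared test of association:

tab <- table(Treat = hw_data$Treat, Pattern = pat)
tab
chisq.test(tab)  # global test; consider fisher.test(tab) if cells are sparse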
2.3 Question 2c
How many people dropped out of the study after Week 1 or after Week 4, versus having intermittent missingness? For patients who dropped out, is there evidence that PANSS at the prior measurement predicted dropout? What about for the patients with intermittent missingness?
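A sketch of how the patterns might be classified, and of one way to regress dropout after Week 4 on the prior PANSS measurement (the variable constructions and the exact pattern classification are illustrative):

wk <- is.na(hw_data[, c("Week1", "Week4", "Week8")])
drop_after_wk1 <- !wk[, 1] &  wk[, 2] &  wk[, 3]   # observed at Week 1 only
drop_after_wk4 <- !wk[, 1] & !wk[, 2] &  wk[, 3]   # observed through Week 4
# intermittent: some missing visit followed by an observed one
intermittent   <- (wk[, 1] & !(wk[, 2] & wk[, 3])) | (wk[, 2] & !wk[, 3])
c(after_wk1 = sum(drop_after_wk1), after_wk4 = sum(drop_after_wk4),
  intermittent = sum(intermittent))

# among those observed at Week 4: does Week 4 PANSS predict dropping out?
d4 <- data.frame(drop8 = drop_after_wk4, Week4 = hw_data$Week4)[!wk[, 2], ]
glm(drop8 ~ Week4, family = binomial, data = d4)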
2.4 Question 2d
Is it reasonable to assume that missingness is MCAR for this dataset? Why or why not? What about MAR?