`WeightIt`

contains several functions for estimating and
assessing balancing weights for observational studies. These weights can
be used to estimate the causal parameters of marginal structural models.
I will not go into the basics of causal inference methods here. For good
introductory articles, see Austin (2011), Austin and
Stuart (2015), Robins,
Hernán, and Brumback (2000), or Thoemmes and Ong (2016).

Typically, the analysis of an observation study might proceed as follows: identify the covariates for which balance is required; assess the quality of the data available, including missingness and measurement error; estimate weights that balance the covariates adequately; and estimate a treatment effect and corresponding standard error or confidence interval. This guide will go through all these steps for two observational studies: estimating the causal effect of a point treatment on an outcome, and estimating the causal parameters of a marginal structural model with multiple treatment periods. This is not meant to be a definitive guide, but rather an introduction to the relevant issues.

First we will use the Lalonde dataset to estimate the effect of a
point treatment. We’ll use the version of the data set that resides
within the `cobalt`

package, which we will use later on as
well. Here, we are interested in the average treatment effect on the
treated (ATT).

`library("cobalt")`

`## cobalt (Version 4.5.1, Build Date: 2023-04-27)`

```
data("lalonde", package = "cobalt")
head(lalonde)
```

treat | age | educ | race | married | nodegree | re74 | re75 | re78 |
---|---|---|---|---|---|---|---|---|

1 | 37 | 11 | black | 1 | 1 | 0 | 0 | 9930.0 |

1 | 22 | 9 | hispan | 0 | 1 | 0 | 0 | 3595.9 |

1 | 30 | 12 | black | 0 | 0 | 0 | 0 | 24909.5 |

1 | 27 | 11 | black | 0 | 1 | 0 | 0 | 7506.1 |

1 | 33 | 8 | black | 0 | 1 | 0 | 0 | 289.8 |

1 | 22 | 9 | black | 0 | 1 | 0 | 0 | 4056.5 |

We have our outcome (`re78`

), our treatment
(`treat`

), and the covariates for which balance is desired
(`age`

, `educ`

, `race`

,
`married`

, `nodegree`

, `re74`

, and
`re75`

). Using `cobalt`

, we can examine the
initial imbalance on the covariates:

```
bal.tab(treat ~ age + educ + race + married + nodegree + re74 + re75,
data = lalonde, estimand = "ATT", thresholds = c(m = .05))
```

```
## Balance Measures
## Type Diff.Un M.Threshold.Un
## age Contin. -0.309 Not Balanced, >0.05
## educ Contin. 0.055 Not Balanced, >0.05
## race_black Binary 0.640 Not Balanced, >0.05
## race_hispan Binary -0.083 Not Balanced, >0.05
## race_white Binary -0.558 Not Balanced, >0.05
## married Binary -0.324 Not Balanced, >0.05
## nodegree Binary 0.111 Not Balanced, >0.05
## re74 Contin. -0.721 Not Balanced, >0.05
## re75 Contin. -0.290 Not Balanced, >0.05
##
## Balance tally for mean differences
## count
## Balanced, <0.05 0
## Not Balanced, >0.05 9
##
## Variable with the greatest mean difference
## Variable Diff.Un M.Threshold.Un
## re74 -0.721 Not Balanced, >0.05
##
## Sample sizes
## Control Treated
## All 429 185
```

Based on this output, we can see that all variables are imbalanced in
the sense that the standardized mean differences (for continuous
variables) and differences in proportion (for binary variables) are
greater than .05 for all variables. In particular, `re74`

and
`re75`

are quite imbalanced, which is troubling given that
they are likely strong predictors of the outcome. We will estimate
weights using `weightit()`

to try to attain balance on these
covariates.

First, we’ll start simple, and use inverse probability weights from
propensity scores generated through logistic regression. We need to
supply `weightit()`

with the formula for the model, the data
set, the estimand (ATT), and the method of estimation
(`"glm"`

) for generalized linear model propensity score
weights).

```
library("WeightIt")
<- weightit(treat ~ age + educ + race + married + nodegree + re74 + re75,
W.out data = lalonde, estimand = "ATT", method = "glm")
#print the output W.out
```

```
## A weightit object
## - method: "glm" (propensity score weighting with GLM)
## - number of obs.: 614
## - sampling weights: none
## - treatment: 2-category
## - estimand: ATT (focal: 1)
## - covariates: age, educ, race, married, nodegree, re74, re75
```

Printing the output of `weightit()`

displays a summary of
how the weights were estimated. Let’s examine the quality of the weights
using `summary()`

. Weights with low variability are desirable
because they improve the precision of the estimator. This variability is
presented in several ways: by the ratio of the largest weight to the
smallest in each group, the coefficient of variation (standard deviation
divided by the mean) of the weights in each group, and the effective
sample size computed from the weights. We want a small ratio, a smaller
coefficient of variation, and a large effective sample size (ESS). What
constitutes these values is mostly relative, though, and must be
balanced with other constraints, including covariate balance. These
metrics are best used when comparing weighting methods, but the ESS can
give a sense of how much information remains in the weighted sample on a
familiar scale.

`summary(W.out)`

```
## Summary of weights
##
## - Weight ranges:
##
## Min Max
## treated 1.0000 || 1.000
## control 0.0092 |---------------------------| 3.743
##
## - Units with the 5 most extreme weights by group:
##
## 5 4 3 2 1
## treated 1 1 1 1 1
## 597 573 381 411 303
## control 3.0301 3.0592 3.2397 3.5231 3.7432
##
## - Weight statistics:
##
## Coef of Var MAD Entropy # Zeros
## treated 0.000 0.000 -0.000 0
## control 1.818 1.289 1.098 0
##
## - Effective Sample Sizes:
##
## Control Treated
## Unweighted 429. 185
## Weighted 99.82 185
```

These weights have quite high variability, and yield an ESS of close to 100 in the control group. Let’s see if these weights managed to yield balance on our covariates.

`bal.tab(W.out, stats = c("m", "v"), thresholds = c(m = .05))`

```
## Balance Measures
## Type Diff.Adj M.Threshold V.Ratio.Adj
## prop.score Distance -0.021 Balanced, <0.05 1.032
## age Contin. 0.119 Not Balanced, >0.05 0.458
## educ Contin. -0.028 Balanced, <0.05 0.664
## race_black Binary -0.002 Balanced, <0.05 .
## race_hispan Binary 0.000 Balanced, <0.05 .
## race_white Binary 0.002 Balanced, <0.05 .
## married Binary 0.019 Balanced, <0.05 .
## nodegree Binary 0.018 Balanced, <0.05 .
## re74 Contin. -0.002 Balanced, <0.05 1.321
## re75 Contin. 0.011 Balanced, <0.05 1.394
##
## Balance tally for mean differences
## count
## Balanced, <0.05 9
## Not Balanced, >0.05 1
##
## Variable with the greatest mean difference
## Variable Diff.Adj M.Threshold
## age 0.119 Not Balanced, >0.05
##
## Effective sample sizes
## Control Treated
## Unadjusted 429. 185
## Adjusted 99.82 185
```

For nearly all the covariates, these weights yielded very good
balance. Only `age`

remained imbalanced, with a standardized
mean difference greater than .05 and a variance ratio greater than 2.
Let’s see if we can do better. We’ll choose a different method: entropy
balancing (Hainmueller 2012), which guarantees
perfect balance on specified moments of the covariates while minimizing
the entropy (a measure of dispersion) of the weights.

```
<- weightit(treat ~ age + educ + race + married + nodegree + re74 + re75,
W.out data = lalonde, estimand = "ATT", method = "ebal")
summary(W.out)
```

```
## Summary of weights
##
## - Weight ranges:
##
## Min Max
## treated 1.0000 || 1.000
## control 0.0188 |---------------------------| 9.419
##
## - Units with the 5 most extreme weights by group:
##
## 5 4 3 2 1
## treated 1 1 1 1 1
## 608 381 597 303 411
## control 7.1268 7.5013 7.9979 9.0355 9.4195
##
## - Weight statistics:
##
## Coef of Var MAD Entropy # Zeros
## treated 0.000 0.000 0.000 0
## control 1.834 1.287 1.101 0
##
## - Effective Sample Sizes:
##
## Control Treated
## Unweighted 429. 185
## Weighted 98.46 185
```

The variability of the weights has not changed much, but let’s see if there are any gains in terms of balance:

`bal.tab(W.out, stats = c("m", "v"), thresholds = c(m = .05))`

```
## Balance Measures
## Type Diff.Adj M.Threshold V.Ratio.Adj
## age Contin. 0 Balanced, <0.05 0.410
## educ Contin. 0 Balanced, <0.05 0.664
## race_black Binary 0 Balanced, <0.05 .
## race_hispan Binary -0 Balanced, <0.05 .
## race_white Binary -0 Balanced, <0.05 .
## married Binary -0 Balanced, <0.05 .
## nodegree Binary -0 Balanced, <0.05 .
## re74 Contin. -0 Balanced, <0.05 1.326
## re75 Contin. -0 Balanced, <0.05 1.335
##
## Balance tally for mean differences
## count
## Balanced, <0.05 9
## Not Balanced, >0.05 0
##
## Variable with the greatest mean difference
## Variable Diff.Adj M.Threshold
## married -0 Balanced, <0.05
##
## Effective sample sizes
## Control Treated
## Unadjusted 429. 185
## Adjusted 98.46 185
```

Indeed, we have achieved perfect balance on the means of the
covariates. However, the variance ratio of `age`

is still
quite high. We could continue to try to adjust for this imbalance, but
if there is reason to believe it is unlikely to affect the outcome, it
may be best to leave it as is. (You can try adding `I(age^2)`

to the formula and see what changes this causes.)

Now that we have our weights stored in `W.out`

, let’s
extract them and estimate our treatment effect.

```
# Attach weights to dataset
$weights <- W.out$weights
lalonde
# Fit outcome model
<- lm(re78 ~ treat * (age + educ + race + married +
fit + re74 + re75),
nodegree data = lalonde, weights = weights)
# G-computation for the treatment effect
library("marginaleffects")
avg_comparisons(fit, variables = "treat",
vcov = "HC3",
newdata = subset(lalonde, treat == 1),
wts = "weights")
```

term | contrast | estimate | std.error | statistic | p.value | conf.low | conf.high |
---|---|---|---|---|---|---|---|

treat | 1 - 0 | 1273 | 822.8 | 1.547 | 0.1218 | -339.4 | 2886 |

Our confidence interval for `treat`

contains 0, so there
isn’t evidence that `treat`

has an effect on
`re78`

. Although some authors recommend using “robust”
sandwich standard errors to adjust for the weights (Robins, Hernán, and Brumback 2000; Hainmueller 2012) as we did above,
others believe these can misleading and recommend bootstrapping instead
(Reifeis and Hudgens 2020; Chan, Yam, and Zhang 2016). Both methods
are described in detail at
`vignette("estimating-effects")`

.

`WeightIt`

can estimate weights for longitudinal treatment
marginal structural models as well. This time, we’ll use the sample data
set `msmdata`

to estimate our weights. Data must be in “wide”
format, with one row per unit.

```
data("msmdata")
head(msmdata)
```

X1_0 | X2_0 | A_1 | X1_1 | X2_1 | A_2 | X1_2 | X2_2 | A_3 | Y_B |
---|---|---|---|---|---|---|---|---|---|

2 | 0 | 1 | 5 | 1 | 0 | 4 | 1 | 0 | 0 |

4 | 0 | 1 | 9 | 0 | 1 | 10 | 0 | 1 | 1 |

4 | 1 | 0 | 5 | 0 | 1 | 4 | 0 | 0 | 1 |

4 | 1 | 0 | 4 | 0 | 0 | 6 | 1 | 0 | 1 |

6 | 1 | 1 | 5 | 0 | 1 | 6 | 0 | 0 | 1 |

5 | 1 | 0 | 4 | 0 | 1 | 4 | 0 | 1 | 0 |

We have a binary outcome variable (`Y_B`

), pre-treatment
time-varying variables (`X1_0`

and `X2_0`

,
measured before the first treatment, `X1_1`

and
`X2_1`

measured between the first and second treatments, and
`X1_2`

and `X2_2`

measured between the second and
third treatments), and three time-varying binary treatment variables
(`A_1`

, `A_2`

, and `A_3`

). We are
interested in the joint, unique, causal effects of each treatment period
on the outcome. At each treatment time point, we need to achieve balance
on all variables measured prior to that treatment, including previous
treatments.

Using `cobalt`

, we can examine the initial imbalance at
each time point and overall:

```
library("cobalt") #if not already attached
bal.tab(list(A_1 ~ X1_0 + X2_0,
~ X1_1 + X2_1 +
A_2 + X1_0 + X2_0,
A_1 ~ X1_2 + X2_2 +
A_3 + X1_1 + X2_1 +
A_2 + X1_0 + X2_0),
A_1 data = msmdata, stats = c("m", "ks"),
which.time = .all)
```

```
## Balance by Time Point
##
## - - - Time: 1 - - -
## Balance Measures
## Type Diff.Un KS.Un
## X1_0 Contin. 0.690 0.276
## X2_0 Binary -0.325 0.325
##
## Sample sizes
## Control Treated
## All 3306 4194
##
## - - - Time: 2 - - -
## Balance Measures
## Type Diff.Un KS.Un
## X1_1 Contin. 0.874 0.340
## X2_1 Binary -0.299 0.299
## A_1 Binary 0.127 0.127
## X1_0 Contin. 0.528 0.201
## X2_0 Binary -0.060 0.060
##
## Sample sizes
## Control Treated
## All 3701 3799
##
## - - - Time: 3 - - -
## Balance Measures
## Type Diff.Un KS.Un
## X1_2 Contin. 0.475 0.212
## X2_2 Binary -0.594 0.594
## A_2 Binary 0.162 0.162
## X1_1 Contin. 0.573 0.237
## X2_1 Binary -0.040 0.040
## A_1 Binary 0.100 0.100
## X1_0 Contin. 0.361 0.148
## X2_0 Binary -0.040 0.040
##
## Sample sizes
## Control Treated
## All 4886 2614
## - - - - - - - - - - -
```

`bal.tab()`

indicates significant imbalance on most
covariates at most time points, so we need to do some work to eliminate
that imbalance in our weighted data set. We’ll use the
`weightitMSM()`

function to specify our weight models. The
syntax is similar both to that of `weightit()`

for point
treatments and to that of `bal.tab()`

for longitudinal
treatments. We’ll use `method = "glm"`

and
`stabilize = TRUE`

for stabilized propensity score weights
estimated using logistic regression.

```
<- weightitMSM(list(A_1 ~ X1_0 + X2_0,
Wmsm.out ~ X1_1 + X2_1 +
A_2 + X1_0 + X2_0,
A_1 ~ X1_2 + X2_2 +
A_3 + X1_1 + X2_1 +
A_2 + X1_0 + X2_0),
A_1 data = msmdata, method = "glm",
stabilize = TRUE)
Wmsm.out
```

```
## A weightitMSM object
## - method: "glm" (propensity score weighting with GLM)
## - number of obs.: 7500
## - sampling weights: none
## - number of time points: 3 (A_1, A_2, A_3)
## - treatment:
## + time 1: 2-category
## + time 2: 2-category
## + time 3: 2-category
## - covariates:
## + baseline: X1_0, X2_0
## + after time 1: X1_1, X2_1, A_1, X1_0, X2_0
## + after time 2: X1_2, X2_2, A_2, X1_1, X2_1, A_1, X1_0, X2_0
## - stabilized; stabilization factors:
## + baseline: (none)
## + after time 1: A_1
## + after time 2: A_1, A_2, A_1:A_2
```

No matter which method is selected, `weightitMSM()`

estimates separate weights for each time period and then takes the
product of the weights for each individual to arrive at the final
estimated weights. Printing the output of `weightitMSM()`

provides some details about the function call and the output. We can
take a look at the quality of the weights with `summary()`

,
just as we could for point treatments.

`summary(Wmsm.out)`

```
## Summary of weights
##
## Time 1
## Summary of weights
##
## - Weight ranges:
##
## Min Max
## treated 0.1527 |---------------------------| 57.08
## control 0.1089 |--------| 20.46
##
## - Units with the 5 most extreme weights by group:
##
## 4390 3440 3774 3593 5685
## treated 22.1008 24.1278 25.6999 27.786 57.0794
## 6659 6284 1875 6163 2533
## control 12.8943 13.09 14.5234 14.705 20.465
##
## - Weight statistics:
##
## Coef of Var MAD Entropy # Zeros
## treated 1.779 0.775 0.573 0
## control 1.331 0.752 0.486 0
##
## - Mean of Weights = 0.99
##
## - Effective Sample Sizes:
##
## Control Treated
## Unweighted 3306 4194
## Weighted 1193 1007
##
## Time 2
## Summary of weights
##
## - Weight ranges:
##
## Min Max
## treated 0.1089 |---------------------------| 57.08
## control 0.1501 |--------| 20.49
##
## - Units with the 5 most extreme weights by group:
##
## 4390 3440 3774 3593 5685
## treated 22.1008 24.1278 25.6999 27.786 57.0794
## 1875 6163 6862 1286 6158
## control 14.5234 14.705 14.8079 16.2311 20.4862
##
## - Weight statistics:
##
## Coef of Var MAD Entropy # Zeros
## treated 1.797 0.779 0.580 0
## control 1.359 0.750 0.488 0
##
## - Mean of Weights = 0.99
##
## - Effective Sample Sizes:
##
## Control Treated
## Unweighted 3701 3799.
## Weighted 1300 898.2
##
## Time 3
## Summary of weights
##
## - Weight ranges:
##
## Min Max
## treated 0.1089 |---------------------------| 57.08
## control 0.2085 |-----------| 25.70
##
## - Units with the 5 most extreme weights by group:
##
## 3576 4390 3440 3593 5685
## treated 20.5828 22.1008 24.1278 27.786 57.0794
## 6163 6862 168 6158 3774
## control 14.705 14.8079 16.9698 20.4862 25.6999
##
## - Weight statistics:
##
## Coef of Var MAD Entropy # Zeros
## treated 2.008 0.931 0.753 0
## control 1.269 0.672 0.407 0
##
## - Mean of Weights = 0.99
##
## - Effective Sample Sizes:
##
## Control Treated
## Unweighted 4886 2614.
## Weighted 1871 519.8
```

Displayed are summaries of how the weights perform at each time point with respect to variability. Next, we’ll examine how well they perform with respect to covariate balance.

`bal.tab(Wmsm.out, stats = c("m", "ks"), which.time = .none)`

```
## Balance summary across all time points
## Times Type Max.Diff.Adj Max.KS.Adj
## prop.score 1, 2, 3 Distance 0.095 0.074
## X1_0 1, 2, 3 Contin. 0.033 0.018
## X2_0 1, 2, 3 Binary 0.018 0.018
## X1_1 2, 3 Contin. 0.087 0.039
## X2_1 2, 3 Binary 0.031 0.031
## A_1 2, 3 Binary 0.130 0.130
## X1_2 3 Contin. 0.104 0.054
## X2_2 3 Binary 0.007 0.007
## A_2 3 Binary 0.154 0.154
##
## Effective sample sizes
## - Time 1
## Control Treated
## Unadjusted 3306 4194
## Adjusted 1193 1007
## - Time 2
## Control Treated
## Unadjusted 3701 3799.
## Adjusted 1300 898.2
## - Time 3
## Control Treated
## Unadjusted 4886 2614.
## Adjusted 1871 519.8
```

By setting `which.time = .none`

in `bal.tab()`

,
we can focus on the overall balance assessment, which displays the
greatest imbalance for each covariate across time points. We can see
that our estimated weights balance all covariates all time points with
respect to means and KS statistics. Now we can estimate our treatment
effects.

First, we fit a marginal structural model for the outcome using
`glm()`

with the weights included:

```
# Attach weights to dataset
$weights <- Wmsm.out$weights
msmdata
# Fit outcome model
<- glm(Y_B ~ A_1 * A_2 * A_3 * (X1_0 + X2_0),
fit data = msmdata, weights = weights,
family = quasibinomial)
```

Then, we compute the average expected potential outcomes under each
treatment regime using
`marginaleffects::avg_predictions()`

:

```
library("marginaleffects")
<- avg_predictions(fit,
(p vcov = "HC3",
newdata = datagridcf(A_1 = 0:1, A_2 = 0:1, A_3 = 0:1),
by = c("A_1", "A_2", "A_3"),
wts = "weights",
type = "response"))
```

A_1 | A_2 | A_3 | estimate | std.error | statistic | p.value | conf.low | conf.high |
---|---|---|---|---|---|---|---|---|

0 | 0 | 0 | 0.6847 | 0.0169 | 40.57 | 0 | 0.6516 | 0.7178 |

0 | 0 | 1 | 0.5177 | 0.0398 | 13.00 | 0 | 0.4396 | 0.5957 |

0 | 1 | 0 | 0.4884 | 0.0217 | 22.51 | 0 | 0.4459 | 0.5309 |

0 | 1 | 1 | 0.4340 | 0.0309 | 14.06 | 0 | 0.3735 | 0.4945 |

1 | 0 | 0 | 0.6005 | 0.0218 | 27.61 | 0 | 0.5579 | 0.6431 |

1 | 0 | 1 | 0.5407 | 0.0334 | 16.21 | 0 | 0.4753 | 0.6061 |

1 | 1 | 0 | 0.3754 | 0.0170 | 22.02 | 0 | 0.3419 | 0.4088 |

1 | 1 | 1 | 0.4189 | 0.0285 | 14.70 | 0 | 0.3630 | 0.4747 |

We can compare the expected potential outcomes under each regime
using `marginaleffects::hypotheses()`

. To get all pairwise
comparisons, supply the `avg_predictions()`

output to
`hypotheses(., "pairwise")`

. To compare individual regimes,
we can use `hypotheses()`

, identifying the rows of the
`avg_predictions()`

output. For example, to compare the
regimes with no treatment for all three time points vs. the regime with
treatment for all three time points, we would run

`hypotheses(p, "b8 - b1 = 0")`

term | estimate | std.error | statistic | p.value | conf.low | conf.high |
---|---|---|---|---|---|---|

b8-b1=0 | -0.2658 | 0.0182 | -14.63 | 0 | -0.3014 | -0.2302 |

These results indicate that receive treatment at all time points reduces the risk of the outcome relative to not receiving treatment at all.

Austin, Peter C. 2011. “An Introduction to Propensity Score
Methods for Reducing the Effects of Confounding in Observational
Studies.” *Multivariate Behavioral Research* 46 (3):
399–424. https://doi.org/10.1080/00273171.2011.568786.

Austin, Peter C., and Elizabeth A. Stuart. 2015. “Moving Towards
Best Practice When Using Inverse Probability of Treatment Weighting
(IPTW) Using the Propensity Score to Estimate Causal
Treatment Effects in Observational Studies.” *Statistics in
Medicine* 34 (28): 3661–79. https://doi.org/10.1002/sim.6607.

Chan, Kwun Chuen Gary, Sheung Chi Phillip Yam, and Zheng Zhang. 2016.
“Globally Efficient Non-Parametric Inference of Average Treatment
Effects by Empirical Balancing Calibration Weighting.”
*Journal of the Royal Statistical Society: Series B (Statistical
Methodology)* 78 (3): 673–700. https://doi.org/10.1111/rssb.12129.

Hainmueller, J. 2012. “Entropy Balancing for Causal Effects:
A Multivariate Reweighting Method to Produce Balanced
Samples in Observational Studies.” *Political Analysis* 20
(1): 25–46. https://doi.org/10.1093/pan/mpr025.

Reifeis, Sarah A., and Michael G. Hudgens. 2020. “On Variance of
the Treatment Effect in the Treated Using Inverse Probability
Weighting.” *arXiv:2011.11874 [Stat]*, November. https://arxiv.org/abs/2011.11874.

Robins, James M., Miguel Ángel Hernán, and Babette Brumback. 2000.
“Marginal Structural Models and Causal Inference in
Epidemiology.” *Epidemiology* 11 (5): 550–60. https://doi.org/10.1097/00001648-200009000-00011.

Thoemmes, Felix J., and Anthony D. Ong. 2016. “A Primer on Inverse
Probability of Treatment Weighting and Marginal Structural
Models.” *Emerging Adulthood* 4 (1): 40–59. https://doi.org/10.1177/2167696815621645.