Causal Inference and Policy Evaluation Without a Control Group

Augusto Cerqua, Marco Letta, Fiammetta Menchetti


Application: the impact of the COVID-19 crisis on education in Italy

This notebook replicates the key results of the empirical application reported in Section 4 of the paper “Causal inference and policy evaluation without a control group” (Cerqua et al., 2023), using the data available in the repository’s data folder (https://github.com/marclet/MLCM-Replication-Notebook/tree/main/data).

To enhance computational speed, the main dataset (“dataset_replication_full_FINAL.csv”) includes only the restricted set of relevant variables selected by the preliminary random forest (as detailed in Cerqua et al., 2023). The same folder also contains the file “metadata_replication.xlsx”, which documents the variables included in both datasets.

The notebook serves as a guide for researchers interested in replicating our analysis or adapting it to new datasets for their own applications. While the explanations in this notebook are concise, we recommend referring to the full paper for comprehensive details on the application.

Running the analyses in this notebook with the original number of bootstrap replications (standard errors in the paper were computed with 1,000 replications) is time-consuming. The code below therefore sets the option “nboot” to 100 throughout; as a result, the standard errors differ slightly from those reported in the paper. To obtain the same standard errors as in the paper, set “nboot = 1000”.
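For convenience, you could define the number of replications once and reuse it in every call below; this is an optional tweak, not part of the original script:

# Optional: set the number of bootstrap replications once and pass it to every MLCM() call
nboot <- 100  # change to 1000 to reproduce the paper's standard errors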


Data preparation

Download the data from https://github.com/marclet/MLCM-Replication-Notebook/tree/main/data into your working directory.

# Clear the workspace
rm(list = ls())
# Load required libraries
library(CAST)
library(caret)
library(gbm)
library(elasticnet)
library(pls)
library(stats)
library(utils)
library(randomForest)
library(rpart)
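
If any of these packages are missing, a one-time installation along the following lines should suffice (package names taken from the library() calls above):

# One-time setup: install any missing dependencies from CRAN
pkgs <- c("CAST", "caret", "gbm", "elasticnet", "pls", "randomForest", "rpart")
install.packages(setdiff(pkgs, rownames(installed.packages())))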


We start by loading and preparing the data:

# Set the path where the replication data and custom functions are stored
wd <- "C:/Users/39380/Documents/GitHub/MLCM-Replication-Notebook/docs"
source(paste0(wd, "/Functions/package.R"))
setwd(wd)

# Download dataset from https://github.com/marclet/MLCM-Replication-Notebook/tree/main/data into your working directory and load the raw data 
# Full dataset
dataset_full <- read.csv(paste0(wd, "/Data/dataset_replication_full_FINAL.csv"), na.strings=c("",".","NA"))
# Covariates for the CATE analysis
x.cate <- read.csv(paste0(wd, "/Data/CATE_dataset.csv"))
# Pre-intervention dataset
dataset_pre_2020 <- subset(dataset_full, year<2020) 
# Intervention date
int_date <- 2020

# Comment: this is a 579 x 4 panel dataset (i.e., 579 units observed over
#          4 time periods). The dataset is organized in long format,
#          i.e., it has 579 x 4 = 2316 rows, one column for the ID variable ('id'),
#          one column for the time variable ('year') and one column for the
#          dependent variable ('std_math_score').
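
As a quick sanity check on this structure (using the column names noted above), you can verify the panel dimensions directly:

# Sanity check of the panel structure (not part of the original script)
length(unique(dataset_full$id))  # expected: 579 units
table(dataset_full$year)         # expected: 4 years with 579 rows each
nrow(dataset_full)               # expected: 579 * 4 = 2316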


Panel cross-validation

This step involves a horse-race competition among the four selected machine learning (ML) algorithms: LASSO, Partial Least Squares (PLS), stochastic gradient boosting, and random forest. We fine-tune their hyperparameters using pre-treatment data only. This chunk of code produces the results needed to replicate Panel A of Table I (which evaluates performance in terms of Mean Squared Error, MSE) and Figure 3 (which illustrates the distribution of forecasting errors) in our paper. Note that obtaining the MSE results requires a manual computation based on the Root Mean Squared Error (RMSE) estimates reported below, since the ‘caret’ package, which is embedded in MachineControl, does not report the MSE directly.
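Since ‘caret’ reports the RMSE, the MSE values in Panel A of Table I are simply the squares of the RMSE estimates printed below; for example:

# MSE is the square of the RMSE reported by 'caret'; e.g., for the best
# random forest selected below (RMSE = 0.6246794):
rmse_best <- 0.6246794
rmse_best^2  # MSE, approximately 0.39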

# Set seed
set.seed(1)

num_cov <- 20 # we keep the 20 most predictive variables selected by the preliminary random forest

### 2.2. Setting the tuning parameters of the Machine Learning algorithms 
# SGB
gbmGrid <-  expand.grid(interaction.depth = c(1, 2), 
                        n.trees=c(1000, 2000), 
                        shrinkage = c(0.001, 0.002, 0.005),
                        n.minobsinnode = c(5, 10))
bo <- list(method = "gbm",
           tuneGrid = gbmGrid)
# RF
rf <- list(method = "rf",
           ntree = 1000)
# LASSO
lassoGrid <- expand.grid(fraction = seq(0.1, 0.9, by = 0.1))
lasso <- list(method = "lasso",
              tuneGrid = lassoGrid,
              preProc = c("center", "scale"))
# PLS
pls <- list(method = "pls")

### 2.3. Applying Panel Cross Validation to select the best importance threshold and the
###      best-performing algorithm based on the RMSE criterion. Note that we pass the whole
###      dataset because the 'PanelCrossValidation' function subsets the data internally via
###      the argument 'int_date'. See lines 202 and beyond of 'package.R'.

PCV <- lapply(num_cov, FUN = function(x){
  
  # Keep the x most predictive covariates (the last x columns of the dataset)
  total_columns <- ncol(dataset_full)
  xvar <- colnames(dataset_full)[(total_columns - x + 1):total_columns]
  data <- as.PanelMLCM(y = dataset_full[, "std_math_score"], 
                       timevar = dataset_full[, "year"], 
                       id = dataset_full[, "id"],
                       x = dataset_full[, xvar], y.lag = 0)
  # Tuning grids whose size depends on the number of columns of the panel
  rfGrid <- expand.grid(mtry = round(c(ncol(data)/2, ncol(data)/3, ncol(data)/4)))
  plsGrid <- expand.grid(ncomp = seq(1, (ncol(data)-2), by = 1))
  rf$tuneGrid <- rfGrid
  pls$tuneGrid <- plsGrid
  PCV <- PanelCrossValidation(data = data, int_date = int_date, ML_methods = list(bo, rf, lasso, pls))
  return(PCV)
})
PCV
## [[1]]
## [[1]]$best
## Random Forest 
## 
## 2316 samples
##   20 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (2 reps) 
## Summary of sample sizes: 579, 1158 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE       Rsquared   MAE      
##    6    0.6246794  0.3505821  0.4494645
##    8    0.6265634  0.3447118  0.4495417
##   12    0.6284865  0.3420967  0.4506613
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 6.
## 
## [[1]]$best.metric
## [1] 0.6246794
## 
## [[1]]$all_methods
## [[1]]$all_methods$gbm
## Stochastic Gradient Boosting 
## 
## 2316 samples
##   20 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (2 reps) 
## Summary of sample sizes: 579, 1158 
## Resampling results across tuning parameters:
## 
##   shrinkage  interaction.depth  n.minobsinnode  n.trees  RMSE       Rsquared 
##   0.001      1                   5              1000     0.7433648  0.2842232
##   0.001      1                   5              2000     0.7094008  0.3056849
##   0.001      1                  10              1000     0.7423302  0.2827714
##   0.001      1                  10              2000     0.7078340  0.3068460
##   0.001      2                   5              1000     0.7192627  0.3046232
##   0.001      2                   5              2000     0.6761488  0.3299479
##   0.001      2                  10              1000     0.7169737  0.3084312
##   0.001      2                  10              2000     0.6751299  0.3325055
##   0.002      1                   5              1000     0.7117750  0.3041972
##   0.002      1                   5              2000     0.6739687  0.3327266
##   0.002      1                  10              1000     0.7089281  0.3048002
##   0.002      1                  10              2000     0.6718938  0.3323428
##   0.002      2                   5              1000     0.6760787  0.3323514
##   0.002      2                   5              2000     0.6406422  0.3574199
##   0.002      2                  10              1000     0.6763571  0.3312228
##   0.002      2                  10              2000     0.6390098  0.3583212
##   0.005      1                   5              1000     0.6677132  0.3365278
##   0.005      1                   5              2000     0.6491554  0.3494013
##   0.005      1                  10              1000     0.6580158  0.3417034
##   0.005      1                  10              2000     0.6387706  0.3544284
##   0.005      2                   5              1000     0.6340917  0.3588228
##   0.005      2                   5              2000     0.6258329  0.3602375
##   0.005      2                  10              1000     0.6310945  0.3593874
##   0.005      2                  10              2000     0.6251532  0.3596731
##   MAE      
##   0.5588770
##   0.5276571
##   0.5576323
##   0.5263673
##   0.5369687
##   0.4962970
##   0.5349393
##   0.4951087
##   0.5299082
##   0.4923507
##   0.5271612
##   0.4900280
##   0.4963562
##   0.4635841
##   0.4965611
##   0.4623442
##   0.4870398
##   0.4706248
##   0.4772684
##   0.4624474
##   0.4581731
##   0.4523273
##   0.4561507
##   0.4526742
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were n.trees = 2000, interaction.depth =
##  2, shrinkage = 0.005 and n.minobsinnode = 10.
## 
## [[1]]$all_methods$rf
## Random Forest 
## 
## 2316 samples
##   20 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (2 reps) 
## Summary of sample sizes: 579, 1158 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE       Rsquared   MAE      
##    6    0.6246794  0.3505821  0.4494645
##    8    0.6265634  0.3447118  0.4495417
##   12    0.6284865  0.3420967  0.4506613
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 6.
## 
## [[1]]$all_methods$lasso
## The lasso 
## 
## 2316 samples
##   20 predictor
## 
## Pre-processing: centered (20), scaled (20) 
## Resampling: Bootstrapped (2 reps) 
## Summary of sample sizes: 579, 1158 
## Resampling results across tuning parameters:
## 
##   fraction  RMSE       Rsquared   MAE      
##   0.1       0.7367230  0.2773082  0.5538621
##   0.2       0.6909796  0.3135364  0.5092384
##   0.3       0.6456675  0.3455181  0.4647378
##   0.4       0.6461808  0.3435742  0.4802461
##   0.5       0.6519846  0.3477757  0.4898135
##   0.6       0.6682814  0.3502246  0.5098234
##   0.7       0.6729723  0.3513252  0.5140342
##   0.8       0.6692229  0.3531519  0.5071238
##   0.9       0.6680430  0.3560108  0.5030275
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was fraction = 0.3.
## 
## [[1]]$all_methods$pls
## Partial Least Squares 
## 
## 2316 samples
##   20 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (2 reps) 
## Summary of sample sizes: 579, 1158 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE       Rsquared   MAE      
##    1     0.7112737  0.2924137  0.5291251
##    2     0.7062755  0.3161344  0.5241280
##    3     0.6884140  0.3366985  0.5137346
##    4     0.6667587  0.3443483  0.5034227
##    5     0.6743780  0.3314012  0.4996907
##    6     0.6660256  0.3222271  0.4778473
##    7     0.6655573  0.3191797  0.4728610
##    8     0.6533465  0.3369176  0.4688101
##    9     0.6358664  0.3488689  0.4664175
##   10     0.6434264  0.3496949  0.4674317
##   11     0.6585116  0.3492879  0.4908873
##   12     0.6971911  0.3382168  0.5356902
##   13     0.6987688  0.3501165  0.5342187
##   14     0.6962606  0.3622101  0.5312014
##   15     0.6951668  0.3589637  0.5277679
##   16     0.6973423  0.3549228  0.5335104
##   17     0.7149147  0.3444206  0.5560214
##   18     0.6909637  0.3454737  0.5280727
##   19     0.6971244  0.3500397  0.5366003
##   20     0.6853445  0.3518117  0.5229671
##   21     0.6853445  0.3518117  0.5229671
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 9.
metrics <- sapply(PCV, FUN = function(x)(x$best.metric))
ind <- which(metrics == min(metrics))
best <- PCV[[ind]]$best
best
## Random Forest 
## 
## 2316 samples
##   20 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (2 reps) 
## Summary of sample sizes: 579, 1158 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE       Rsquared   MAE      
##    6    0.6246794  0.3505821  0.4494645
##    8    0.6265634  0.3447118  0.4495417
##   12    0.6284865  0.3420967  0.4506613
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 6.
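
To compare all four algorithms on the MSE scale of Panel A, Table I, you can square each method's best cross-validated RMSE. A minimal sketch, assuming each element of PCV[[ind]]$all_methods is a ‘caret’ train object (as the printed output above suggests):

# Best cross-validated RMSE per algorithm, squared to obtain the MSE
best_rmse <- sapply(PCV[[ind]]$all_methods, function(m) min(m$results$RMSE))
round(cbind(RMSE = best_rmse, MSE = best_rmse^2), 4)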


Estimation of causal effects

Now that we know the best-performing model, we can proceed with estimating individual, average, and conditional average treatment effects. We then export these results to produce Figure 4 (the map of individual causal effects) and Figure 5 (the data-driven CATEs) as reported in the paper.

### 3.1. Definition of the final dataset, based on the selected variable importance threshold
# Set seed
set.seed(1)

total_columns <- ncol(dataset_full)
xvar <- colnames(dataset_full)[(total_columns - num_cov + 1):total_columns]
data <- as.PanelMLCM(y = dataset_full[, "std_math_score"], 
                     timevar = dataset_full[, "year"], 
                     id = dataset_full[, "id"],
                     x = dataset_full[, xvar], y.lag = 0) 

### 3.2. Causal effect estimation
causal_effects <- MLCM(data = data, int_date = int_date, inf_type = "block", PCV = best,
                       nboot = 100, CATE = TRUE, x.cate = x.cate)

# ATE 2020
ate_eff <- c(causal_effects$ate, causal_effects$conf.ate)
names(ate_eff) <- c("ATE", "Lower_bound", "Upper_bound")
ate_eff
##         ATE Lower_bound Upper_bound 
##  -0.7607986  -0.7947666  -0.6824086
# Individual effects
ind_eff <- data.frame(causal_effects$ind.effects)
names(ind_eff) <- c("ID", "IND_EFF")
# CATEs
cate_eff <- causal_effects$cate.inf
cate_eff
## $`2020`
##                        Node_9       Node_8       Node_5      Node_4      Node_6
## cate             -0.556417759 -0.756163272 -0.860365286 -1.22225606 -0.65858816
## var.cate          0.001602889  0.003484612  0.008754586  0.01885738  0.01085663
## cate.lower.2.5%  -0.624629899 -0.860572949 -1.042186062 -1.50574472 -0.82633711
## cate.upper.97.5% -0.483742368 -0.656333238 -0.669399354 -0.96212361 -0.47471424
causal_effects$cate
## [[1]]
## n= 579 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
## 1) root 579 390.52900 -0.7607986  
##   2) unempl_rate>=0.1019 270 276.85430 -0.8911293  
##     4) share_unigraduates_3034< 25.61417 191 220.86200 -0.9873113  
##       8) gini_index>=0.4248386 67  98.90982 -1.2222560 *
##       9) gini_index< 0.4248386 124 116.25550 -0.8603653 *
##     5) share_unigraduates_3034>=25.61417 79  49.95339 -0.6585882 *
##   3) unempl_rate< 0.1019 309 105.08110 -0.6469173  
##     6) share_unigraduates_3034< 22.81643 140  55.63804 -0.7561633 *
##     7) share_unigraduates_3034>=22.81643 169  46.38806 -0.5564178 *
# now export the results
write.csv(ind_eff, file = paste0(wd, "/Output/individual_effects.csv"))
write.csv(ate_eff, file = paste0(wd, "/Output/ate.csv"))
write.csv(cate_eff, file = paste0(wd, "/Output/cate.csv"))
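
Before turning to the figures, a quick visual inspection can be helpful: a histogram of the individual effects previews the inputs to Figure 4, and the regression tree underlying Figure 5 can be drawn directly. A sketch, assuming causal_effects$cate[[1]] is an rpart object (as its printed output above suggests) and that the ‘rpart.plot’ package is installed:

# Distribution of individual effects (inputs to the map in Figure 4)
hist(ind_eff$IND_EFF, breaks = 30, main = "Individual effects, 2020",
     xlab = "Estimated effect")
abline(v = ate_eff["ATE"], lty = 2)

# CATE regression tree underlying Figure 5
library(rpart.plot)
rpart.plot(causal_effects$cate[[1]], digits = 3)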


Placebo test

Finally, we repeat the analysis on the pre-pandemic years (2018 and 2019) to compute placebo treatment effects. We then export these effects to generate Panel B of Table III and Figure 2 in the main text, and Figure 1 in the Online Appendix.

# Set seed
set.seed(1)

# 2019 #
### Definition of the placebo dataset for 2019, using the same selected covariates (xvar)
dataset_full_2019 <- subset(dataset_full, year<2020) 
int_date_2019 <- 2019
data_placebo_2019 <- as.PanelMLCM(y = dataset_full_2019[, "std_math_score"], 
                                  timevar = dataset_full_2019[, "year"], 
                                  id = dataset_full_2019[, "id"],
                                  x = dataset_full_2019[, xvar], y.lag = 0) 

effects_placebo_2019 <- MLCM(data = data_placebo_2019, int_date = int_date_2019, inf_type = "block", PCV = best,
                             nboot = 100, CATE = FALSE)

# ATE (Placebo 2019)
ate_eff_placebo_2019 <- c(effects_placebo_2019$ate, effects_placebo_2019$conf.ate)
names(ate_eff_placebo_2019) <- c("ATE", "Lower_bound", "Upper_bound")
ate_eff_placebo_2019
##         ATE Lower_bound Upper_bound 
##  0.02538013 -0.02566334  0.07836224
# Individual effects (Placebo 2019)
ind_eff_placebo_2019 <- data.frame(effects_placebo_2019$ind.effects)
names(ind_eff_placebo_2019) <- c("ID", "IND_EFF")

write.csv(ind_eff_placebo_2019, file = paste0(wd, "/Output/individual_effects_placebo_2019.csv"))
write.csv(ate_eff_placebo_2019, file = paste0(wd, "/Output/ate_placebo_2019.csv"))


# 2018 #
### Definition of the placebo dataset for 2018, using the same selected covariates (xvar)
dataset_full_2018 <- subset(dataset_full, year<2019) 
int_date_2018 <- 2018
data_placebo_2018 <- as.PanelMLCM(y = dataset_full_2018[, "std_math_score"], 
                                  timevar = dataset_full_2018[, "year"], 
                                  id = dataset_full_2018[, "id"],
                                  x = dataset_full_2018[, xvar], y.lag = 0) 

effects_placebo_2018 <- MLCM(data = data_placebo_2018, int_date = int_date_2018, inf_type = "block", PCV = best,
                             nboot = 100, CATE = FALSE)

# ATE (Placebo 2018)
ate_eff_placebo_2018 <- c(effects_placebo_2018$ate, effects_placebo_2018$conf.ate)
names(ate_eff_placebo_2018) <- c("ATE", "Lower_bound", "Upper_bound")
ate_eff_placebo_2018
##         ATE Lower_bound Upper_bound 
##  0.05538092 -0.10438444  0.29053423
# Individual effects (Placebo 2018)
ind_eff_placebo_2018 <- data.frame(effects_placebo_2018$ind.effects)
names(ind_eff_placebo_2018) <- c("ID", "IND_EFF")

# export the results
write.csv(ind_eff_placebo_2018, file = paste0(wd, "/Output/individual_effects_placebo_2018.csv"))
write.csv(ate_eff_placebo_2018, file = paste0(wd, "/Output/ate_placebo_2018.csv"))
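
As a final check, you can contrast the placebo ATEs with the 2020 estimate: both placebo confidence intervals should cover zero, while the 2020 interval should not. A minimal sketch using the named vectors built above:

# Placebo check: the 2018 and 2019 intervals should contain zero; the 2020 interval should not
ates <- rbind("2018 (placebo)" = ate_eff_placebo_2018,
              "2019 (placebo)" = ate_eff_placebo_2019,
              "2020 (COVID-19)" = ate_eff)
round(ates, 3)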


Reference

Cerqua, A., Letta, M., & Menchetti, F. (2023). The Machine Learning Control Method for Counterfactual Forecasting. arXiv preprint arXiv:2312.05858. https://arxiv.org/abs/2312.05858