Unveiling Double/Debiased Machine Learning (DML): A Practical Guide
Understanding the true effect of a variable (like a new medication or policy) on an outcome (such as health improvement or economic growth) can be challenging. Confounding variables—factors that affect both the treatment and the outcome—often complicate this task. Double/Debiased Machine Learning (DML) provides a powerful method to uncover these causal relationships, even in complex, high-dimensional settings. Let’s dive into the intuition and practical steps behind DML.
At its heart, DML aims to isolate the causal effect of a treatment by effectively controlling for confounding variables using machine learning techniques. Here’s a practical, step-by-step explanation of how DML works:
Step 1: Formulate the Problem
Suppose we want to understand the effect of a new educational program (the treatment) on students’ test scores (the outcome). We have additional data on various student characteristics (confounding variables), such as socioeconomic status, prior academic performance, and attendance.
Step 2: Predicting the Outcome and Treatment
We start by using machine learning models to predict both the outcome (test scores) and the treatment (participation in the educational program) based on the confounding variables. This helps in capturing the part of the outcome and treatment that can be explained by these confounders.
- Why Predict the Outcome? By predicting test scores using student characteristics, we understand how these characteristics influence the scores.
- Why Predict the Treatment? Similarly, predicting program participation based on student characteristics helps us see what factors influence the likelihood of joining the program.
Step 3: Calculating Residuals
Next, we calculate the residuals for both the outcome and the treatment. Residuals are the differences between the actual values and the predicted values. Why Calculate Residuals? The residuals represent the part of the test scores and program participation that cannot be explained by the student characteristics. Essentially, they strip away the influence of confounders, isolating the unique effect of the program.
Step 4: Regression on Residuals
We then perform a regression of the outcome residuals on the treatment residuals. This step provides an estimate of the treatment effect that is free from confounding bias. Why Regression on Residuals? This regression isolates the causal effect of the educational program on test scores, as the confounding effects have been removed from the residuals.
Putting It All Together: A Practical Example
Imagine we have a dataset of students, including their test scores (Y), whether they participated in the educational program (D), and various characteristics (X) such as socioeconomic status and prior grades.
- Predicting Outcomes and Treatments:
- Use machine learning models to predict test scores (Y) based on student characteristics (X).
- Similarly, predict program participation (D) using the same characteristics (X).
- Calculating Residuals:
- For each student, subtract the predicted test score from the actual test score to get the outcome residual.
- Subtract the predicted probability of program participation from the actual participation to get the treatment residual.
- Regression on Residuals: Perform a simple regression analysis where the outcome residuals are regressed on the treatment residuals.
This final regression gives us the causal effect of the educational program on test scores, free from the bias introduced by confounding variables.
DML leverages the power of machine learning to control for high-dimensional confounders effectively. By predicting both the treatment and the outcome and then focusing on the unexplained parts (residuals), it isolates the causal relationship. Cross-fitting—splitting the data into multiple parts to train and predict—ensures that the models do not overfit and that the estimates are reliable.
Step-by-Step Guide
Step 1: Import Required Libraries
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
import statsmodels.api as sm
import statsmodels.formula.api as smf
Step 3: Define DML Function
def dml(X, D, y, modely, modeld, *, nfolds, classifier=False):
'''
DML for the Partially Linear Model setting with cross-fitting
Input
-----
X: the controls
D: the treatment
y: the outcome
modely: the ML model for predicting the outcome y
modeld: the ML model for predicting the treatment D
nfolds: the number of folds in cross-fitting
classifier: bool, whether the modeld is a classifier or a regressor
time: array of time indices, eg [0,1,...,T-1,0,1,...,T-1,...,0,1,...,T-1]
clu: array of cluster indices, eg [1073, 1073, 1073, ..., 5055, 5055, 5055, 5055]
cluster: bool, whether to use clustered standard errors
Output
------
point: the point estimate of the treatment effect of D on y
stderr: the standard error of the treatment effect
yhat: the cross-fitted predictions for the outcome y
Dhat: the cross-fitted predictions for the treatment D
resy: the outcome residuals
resD: the treatment residuals
epsilon: the final residual-on-residual OLS regression residual
'''
cv = KFold(n_splits=nfolds, shuffle=True, random_state=123) # shuffled k-folds
yhat = cross_val_predict(modely, X, y, cv=cv, n_jobs=-1) # out-of-fold predictions for y
# out-of-fold predictions for D
# use predict or predict_proba dependent on classifier or regressor for D
if classifier:
Dhat = cross_val_predict(modeld, X, D, cv=cv, method='predict_proba', n_jobs=-1)[:, 1]
else:
Dhat = cross_val_predict(modeld, X, D, cv=cv, n_jobs=-1)
# calculate outcome and treatment residuals
resy = y - yhat
resD = D - Dhat
dml_data = pd.concat([pd.Series(resy, name = 'resy'), pd.Series(resD, name = 'resD')], axis=1)
ols_mod = smf.ols(formula = 'resy ~ 1 + resD', data = dml_data).fit()
point = ols_mod.params[1]
stderr = ols_mod.bse[1]
epsilon = ols_mod.resid
return point, stderr, yhat, Dhat, resy, resD, epsilon
Step 2: Define the Summary Function
def summary(point, stderr, yhat, Dhat, resy, resD, epsilon, X, D, y, *, name):
'''
Convenience summary function that takes the results of the DML function
and summarizes several estimation quantities and performance metrics.
'''
return pd.DataFrame({'estimate': point, # point estimate
'stderr': stderr, # standard error
'lower': point - 1.96*stderr, # lower end of 95% confidence interval
'upper': point + 1.96*stderr, # upper end of 95% confidence interval
'rmse y': np.sqrt(np.mean(resy**2)), # RMSE of model that predicts outcome y
'rmse D': np.sqrt(np.mean(resD**2)) # RMSE of model that predicts treatment D
}, index=[name])
Step 3: Apply DML to the Data
# OLS No Controls
Y = df['outcome']
D = df['treatment']
Z = df.drop(columns=['outcome', 'treatment'])
modely = make_pipeline(StandardScaler(), LinearRegression())
modeld = make_pipeline(StandardScaler(), LinearRegression())
# Run DML model with nfolds folds of cross-fitting
result_OLS = dml(Z, D, Y, modely, modeld, nfolds=5, classifier=False)
table = summary(*result_OLS, Z, D, y, name = 'OLS No Controls')
table
# DML with Random Forests. RFs don't require scaling but we do it for consistency
modely = make_pipeline(StandardScaler(), RandomForestRegressor(n_estimators=100, min_samples_leaf=5, random_state=123))
modeld = make_pipeline(StandardScaler(), RandomForestRegressor(n_estimators=100, min_samples_leaf=5, random_state=123))
# Run DML model with nfolds folds of cross-fitting (computationally intensive)
result_RF = dml(Z, D, Y, modely, modeld, nfolds=5, classifier=False)
table = summary(*result_RF, Z,D, y, name = 'RF')
table
# Define ElasticNetCV models with n_splits folds of cross-validation
modely = make_pipeline(StandardScaler(), ElasticNetCV(l1_ratio = 0.5, cv=cv))
modeld = make_pipeline(StandardScaler(), ElasticNetCV(l1_ratio = 0.5, cv=cv))
# Run DML model with nfolds folds of cross-fitting
result_ENetCV = dml(Z, D, Y, modely, modeld, nfolds=5, classifier=False)
table = summary(*result_ENetCV, Z,D, y, name = 'ENetCV')
table
Step 4: Averaging the results of each model
# Least Squares Model Average
yhat_all = pd.concat([
pd.Series(result_OLS[2]),
pd.Series(result_ENetCV[2]),
pd.Series(result_RF[2]),
], axis=1)
Dhat_all = pd.concat([
pd.Series(result_OLS[3]),
pd.Series(result_ENetCV[3]),
pd.Series(result_RF[3])
], axis=1)
ma_y = sm.OLS(usedata['outcome'], yhat_all).fit()
ma_d = sm.OLS(usedata['logfssl'], Dhat_all).fit()
weights_y = ma_y.params
weights_d = ma_d.params
lm_k = sm.OLS(ma_y.resid, ma_d.resid).fit(cov_type='HC3')
lsma = pd.Series({"estimate":lm_k.params[0], "stderr":lm_k.bse[0]}, name="Least Squares Model Average").to_frame().T
lsma
Conclusion
In the realm of modern data analysis, understanding the true causal relationships between variables is paramount, especially when dealing with high-dimensional data. Double/Debiased Machine Learning (DML) offers a robust solution to this challenge by blending the predictive power of machine learning with rigorous statistical inference. By carefully partitioning data, leveraging advanced machine learning models, and employing techniques like cross-fitting and residual calculation, DML effectively mitigates biases that could otherwise compromise the integrity of causal estimates.
Through our journey, we explored how DML functions practically, breaking down each step to highlight its role in the overall process. We demonstrated that by predicting both the outcome and the treatment using separate models, and then analyzing the residuals, we can isolate and accurately estimate the causal effect of a treatment on an outcome. This method ensures that the confounding variables’ influence is minimized, providing more reliable and unbiased results.
The power of DML lies in its adaptability and precision. It allows researchers to confidently infer causal relationships in complex, high-dimensional settings, making it an indispensable tool in both academic research and practical applications. As we continue to embrace the era of big data, methods like DML will be crucial in unlocking deeper insights and driving informed decision-making across various fields.