Chapter 1 Summary: Applied Causal Inference Powered by ML and AI
Regression and the Best Linear Prediction Problem
Linear regression is a method for predicting a dependent variable (Y) using one or more independent variables (X). Here’s a simplified breakdown:
- Predicting Y: We aim to predict Y using a linear combination of the X variables. This means finding a line (or hyperplane) that best fits the data points.
- Regression Coefficients (β): These are the weights assigned to each X variable in the linear equation. Our goal is to find the best values for β that minimize the difference (error) between the predicted values and the actual values of Y.
- Mean Squared Error (MSE): The best linear prediction rule minimizes the MSE, which is the average of the squared differences between predicted and actual values. This helps ensure that our predictions are as close as possible to the real values.
- Normal Equations: These are the first-order conditions of the MSE-minimization problem. Solving them gives us the best values for the regression coefficients (the population version is written out after this list).
- Residuals (ϵ): The residuals are the differences between the actual values and the predicted values. They represent the part of Y that our model cannot explain.
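For reference, the population version of this problem can be written compactly as follows. This is a standard formulation; it assumes E[XX′] is invertible, which is not a condition stated explicitly in the summary above.

```latex
% Best linear prediction: choose b to minimize the population MSE
\beta = \arg\min_{b}\; \mathbb{E}\big[(Y - b'X)^2\big]

% Normal equations (first-order conditions) and their solution
\mathbb{E}\big[X\,(Y - X'\beta)\big] = 0
\quad\Longrightarrow\quad
\beta = \big(\mathbb{E}[XX']\big)^{-1}\mathbb{E}[XY]

% Resulting decomposition: residuals are uncorrelated with the regressors
Y = X'\beta + \epsilon, \qquad \mathbb{E}[X\epsilon] = 0
```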
Practical Implications:
- Interpretability: Each coefficient shows the predicted change in Y associated with a one-unit change in the corresponding X variable, holding the other variables fixed.
- Optimization: By minimizing the MSE, we ensure that our model makes the most accurate predictions possible.
Best Linear Approximation Property
This property indicates that the best linear prediction (β′X) is also the best linear approximation to the conditional expectation of Y given X. In other words, among all linear functions of X, our model is the one closest (in mean squared error) to the true regression function E[Y | X].
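Written as an optimization problem, the property says that the same β solves both of the following (a standard way to state the result, assuming the relevant expectations exist):

```latex
% beta is both the best linear predictor of Y
% and the best linear approximation to the conditional expectation E[Y | X]
\beta
  = \arg\min_{b}\; \mathbb{E}\big[(Y - b'X)^2\big]
  = \arg\min_{b}\; \mathbb{E}\big[\big(\mathbb{E}[Y \mid X] - b'X\big)^2\big]
```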
From Best Linear Predictor to Best Predictor
- Feature Engineering: To improve predictions, we can create new features from the original ones, for example polynomials (e.g., X², X³), interactions (products of variables), or other transformations (e.g., logarithms).
- Enhanced Predictive Power: By including these transformed features, our model can capture more complex patterns in the data. Even though the new model is still linear in terms of the transformed features, it can better approximate the true relationship between Y and X.
- Example: If W is a raw predictor, we can create new predictors like W² or interactions like W×Z. This makes the model more flexible and can yield better predictions (see the sketch after this list).
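A minimal sketch of this idea in Python. The simulated data, the predictors W and Z, and the use of scikit-learn's PolynomialFeatures are illustrative assumptions, not taken from the chapter:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Two hypothetical raw predictors W and Z, and an outcome with a nonlinear signal.
n = 500
W = rng.normal(size=n)
Z = rng.normal(size=n)
Y = 1.0 + 2.0 * W - 0.5 * W**2 + 0.8 * W * Z + rng.normal(scale=0.5, size=n)

X_raw = np.column_stack([W, Z])

# Expand the raw predictors into polynomials and interactions: W, Z, W^2, W*Z, Z^2.
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X_raw)

# The model is still linear in the constructed features,
# but it can now capture curvature and interaction effects.
fit_raw = LinearRegression().fit(X_raw, Y)
fit_poly = LinearRegression().fit(X_poly, Y)

print("R^2 with raw features:       ", round(fit_raw.score(X_raw, Y), 3))
print("R^2 with engineered features:", round(fit_poly.score(X_poly, Y), 3))
```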
Implementation Considerations:
- Avoid Overfitting: Adding too many features can lead to overfitting, where the model performs well on training data but poorly on new data. Regularization techniques (like Lasso or Ridge regression) help prevent this; a brief sketch follows this list.
- Computational Efficiency: More features mean more computations. Efficient algorithms and careful feature selection are necessary to manage this.
- Model Evaluation: It’s crucial to evaluate the model on a separate test dataset to ensure it generalizes well to new data.
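A rough sketch of these points together: fit plain OLS and a cross-validated Lasso on an expanded feature set, then compare them on a held-out test set. The simulated data and the specific scikit-learn tools (LassoCV, train_test_split) are assumptions made for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, LassoCV
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(1)

# Simulated data: many engineered features, only a few of which matter.
n = 300
X_raw = rng.normal(size=(n, 10))
Y = X_raw[:, 0] - 0.5 * X_raw[:, 1] + rng.normal(scale=1.0, size=n)

X = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X_raw)

# Hold out a test set so performance is judged on data the model never saw.
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=0)

ols = LinearRegression().fit(X_train, Y_train)
lasso = LassoCV(cv=5).fit(X_train, Y_train)  # picks its penalty level by cross-validation

print("OLS   train R^2:", round(ols.score(X_train, Y_train), 3),
      " test R^2:", round(ols.score(X_test, Y_test), 3))
print("Lasso train R^2:", round(lasso.score(X_train, Y_train), 3),
      " test R^2:", round(lasso.score(X_test, Y_test), 3))
```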
Best Linear Prediction in Finite Samples
Real-world Data vs. Population Data:
- In practice, researchers don’t have access to all the data in the world (population data). Instead, they work with a sample of data points.
- We assume this sample is randomly selected from the population, meaning each data point is an independent and identically distributed (i.i.d.) draw from the population.
Constructing the Best Linear Prediction Rule:
- Just like in the population case, we aim to predict Y using X by finding the best-fitting line or hyperplane.
- Instead of using theoretical averages (expectations), we use sample averages to find the best fit.
- The sample best-fit rule is computed by Ordinary Least Squares (OLS), which minimizes the sample average of the squared differences between the actual values of Y and the predicted values (the sample MSE).
Sample Regression Coefficients:
- The coefficients (β̂) we get from OLS are called sample regression coefficients.
- These coefficients can be found by solving the Sample Normal Equations, which are derived by minimizing the sample Mean Squared Error (MSE).
Residuals:
- Residuals (ϵ̂) are the differences between the actual values and the predicted values. They represent the part of Y that our model cannot explain.
- The decomposition Yi = Xi′β̂ + ϵ̂i shows that each Yi splits into an explained part (Xi′β̂) and an unexplained part (ϵ̂i); the sketch below illustrates this.
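A small numerical sketch (numpy only, simulated data) of the OLS fit, the decomposition Yi = Xi′β̂ + ϵ̂i, and the sample normal equations, which require the residuals to be orthogonal to every regressor in the sample:

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated sample: n observations, an intercept column plus p regressors.
n, p = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
Y = X @ beta_true + rng.normal(size=n)

# OLS: solve the sample normal equations (X'X) beta_hat = X'Y.
beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)

# Decompose Y into the fitted (explained) part and the residual (unexplained) part.
Y_hat = X @ beta_hat
resid = Y - Y_hat

print("beta_hat:", np.round(beta_hat, 3))
print("max |X' resid| (should be ~0):", float(np.abs(X.T @ resid).max()))
print("decomposition holds:", np.allclose(Y, Y_hat + resid))
```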
Properties of Sample Linear Regression
Estimating Population Parameters:
- The key question is whether our sample-based predictions (β̂′X) are good approximations of the population best linear predictions (β′X).
- If our sample size (n) is large and the number of predictors (p) is small relative to n, our sample estimates will be close to the true population values.
Approximation of Population BLP by OLS:
- Under certain conditions, the sample linear regression will be close to the population linear regression when n is large and p is much smaller than n.
- This means that as the sample size grows while the number of predictors stays manageable, the estimated model will predict new data well (the simulation sketch below illustrates this).
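One quick way to see this claim is a simulation: hold p fixed, let n grow, and watch the gap between the OLS estimate and the true coefficient vector shrink. This is an illustrative sketch with simulated data, not an argument taken from the book:

```python
import numpy as np

rng = np.random.default_rng(3)
p = 5
beta = rng.normal(size=p)

# With p fixed, the OLS estimation error shrinks as n grows.
for n in [50, 500, 5000, 50000]:
    X = rng.normal(size=(n, p))
    Y = X @ beta + rng.normal(size=n)
    beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
    print(f"n = {n:6d}   max |beta_hat - beta| = {np.abs(beta_hat - beta).max():.4f}")
```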
Analysis of Variance
Decomposing Variation in Y:
- We can break down the total variation in Y into two parts: explained variation and unexplained variation.
- Explained variation shows how well our model predicts Y, while unexplained variation represents the residuals.
Population and Sample R²:
- R² measures the proportion of the total variation in Y that is explained by the model.
- R² ranges from 0 to 1, with higher values indicating better predictive performance.
- We compute R² in both the population and the sample to assess how well the model performs; the decomposition below makes this explicit.
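In symbols, assuming the regression includes an intercept (so the residual has mean zero and is uncorrelated with the fitted values):

```latex
% Population: total variation splits into explained and unexplained parts
\operatorname{Var}(Y) = \operatorname{Var}(X'\beta) + \operatorname{Var}(\epsilon),
\qquad
R^2 = \frac{\operatorname{Var}(X'\beta)}{\operatorname{Var}(Y)}
    = 1 - \frac{\operatorname{Var}(\epsilon)}{\operatorname{Var}(Y)}

% Sample analogue, built from fitted values and residuals
R^2_{\text{sample}}
  = 1 - \frac{\sum_{i=1}^{n} \hat\epsilon_i^{\,2}}{\sum_{i=1}^{n} (Y_i - \bar Y)^2}
```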
Overfitting: When p/n is Not Small
Overfitting:
- Overfitting occurs when our model fits the sample data too well, capturing noise rather than the true underlying pattern.
- When the number of predictors (p) is close to or greater than the number of observations (n), our model may overfit, leading to poor performance on new data.
Example of Overfitting:
- If p = n (and the regressors are linearly independent), the sample R² equals 1: the model fits the sample perfectly, but this says nothing about its performance on new data.
- If p is smaller than n but p/n is not small, the sample R² will still be artificially high, a symptom of overfitting; the simulation sketch below demonstrates this.
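The sketch below makes this concrete with simulated pure-noise data, where the honest R² is zero: as p approaches n, the in-sample R² climbs toward 1 while performance on fresh data collapses. The setup is an illustrative assumption, not an example from the chapter:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(4)

# Pure noise: no regressor truly predicts Y, so the honest R^2 is 0.
n = 100
for p in [5, 50, 90, 100]:
    X = rng.normal(size=(n, p))
    Y = rng.normal(size=n)
    fit = LinearRegression().fit(X, Y)

    # Fresh data from the same distribution, never seen by the fitted model.
    X_new = rng.normal(size=(n, p))
    Y_new = rng.normal(size=n)

    # Out-of-sample R^2 can be negative: worse than just predicting the mean.
    print(f"p = {p:3d}   in-sample R^2 = {fit.score(X, Y):.2f}"
          f"   new-data R^2 = {fit.score(X_new, Y_new):.2f}")
```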
Adjusted Measures:
- To correct for overfitting, we use adjusted R² and adjusted MSE, which apply a degrees-of-freedom correction based on the number of estimated coefficients relative to the sample size (one common form is shown below).
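One common convention for these corrections is the following; exact formulas differ slightly across texts, and here p counts all estimated coefficients, including the intercept:

```latex
% Adjusted (degrees-of-freedom corrected) MSE and R^2
\widehat{\mathrm{MSE}}_{\text{adj}}
  = \frac{n}{n - p}\cdot\frac{1}{n}\sum_{i=1}^{n}\hat\epsilon_i^{\,2}
  = \frac{1}{n - p}\sum_{i=1}^{n}\hat\epsilon_i^{\,2},
\qquad
R^2_{\text{adj}} = 1 - \frac{n - 1}{n - p}\,\big(1 - R^2_{\text{sample}}\big)
```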
Measuring Predictive Ability by Sample Splitting
Data Splitting:
- To evaluate our model’s predictive performance, we split the data into training and testing sets.
- The training set is used to build the model, while the testing set is used to evaluate its performance on new, unseen data.
Cross-Validation:
- Cross-validation makes fuller use of the data: we repeatedly split it, train the model on different subsets, evaluate it on the held-out parts, and average the results.
- This approach provides a more robust measure of the model’s predictive ability.
Stratified Splitting:
- Stratified splitting ensures that the training and testing sets are similar in composition, which is especially important for moderate-sized samples (the sketch below illustrates these evaluation strategies).
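A compact sketch of the two evaluation strategies above using scikit-learn utilities; the simulated data, the 80/20 split, and the choice of 5 folds are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split

rng = np.random.default_rng(5)

# Simulated data with a genuine linear signal plus noise.
n, p = 400, 10
X = rng.normal(size=(n, p))
Y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n)

# 1) Single train/test split: fit on the training part, score on held-out data.
X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, test_size=0.2, random_state=0)
fit = LinearRegression().fit(X_tr, Y_tr)
print("held-out test R^2:", round(fit.score(X_te, Y_te), 3))

# 2) 5-fold cross-validation: every observation is held out exactly once,
#    and the fold-level R^2 scores are averaged for a more stable estimate.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X, Y, cv=cv)
print("cross-validated R^2:", round(scores.mean(), 3))
```

For classification outcomes (or a continuous outcome binned into groups), passing stratify= to train_test_split, or using StratifiedKFold in place of KFold, keeps the composition of the splits balanced in the spirit of the stratified-splitting point above.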
By understanding these concepts, you can build linear regression models that not only fit the sample data well but also perform reliably on new data, avoiding the pitfalls of overfitting and ensuring robust predictions.