Navigating Multicollinearity in Random Forests

Feb 05, 2024
“Do highly-correlated variables in random forests distort accuracy and feature selection?”
In short, the answer is “No, they won’t.”

Introduction

Random Forest (RF), a powerful ensemble model, is often employed for classification and regression tasks. Before delving into how RF handles multicollinearity, let's briefly review what multicollinearity is and why it troubles regression models.

Understanding Multicollinearity

Multicollinearity arises when two or more explanatory variables in a regression model are highly correlated. This poses a significant challenge for regression models: the model cannot cleanly attribute the shared signal to any single variable, so the coefficient estimates become unstable and the results are easily skewed.
In essence, multicollinearity inflates the standard errors of the coefficient estimates, diminishing trust in the model's conclusions. For instance, the coefficients of correlated variables may appear too small (or even flip sign), because they share and split the impact of essentially the same information.
To mitigate multicollinearity, traditional regression models often require preprocessing steps such as removing highly correlated variables.
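As a quick illustration, here is a minimal sketch, using simulated data and the variance inflation factor (VIF) from statsmodels, neither of which appears in the original post, of how a near-duplicate variable shows up as severe multicollinearity:

```python
# Minimal sketch (assumed setup): simulate two highly correlated predictors
# and quantify multicollinearity with the variance inflation factor (VIF).
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools import add_constant

rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)   # x2 is almost a copy of x1
x3 = rng.normal(size=n)                   # an independent control variable
X = add_constant(pd.DataFrame({"x1": x1, "x2": x2, "x3": x3}))

# VIF >> 10 for x1 and x2 flags severe multicollinearity; x3 stays near 1.
for i, col in enumerate(X.columns):
    print(col, round(variance_inflation_factor(X.values, i), 2))
```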

Why Random Forest Escapes the Pitfalls of Multicollinearity

Random Forest sidesteps the challenges posed by highly collinear variables through a combination of subsampling, pruning, and aggregation within its model structure.

Subsampling

During subsampling, RF builds each tree on a bootstrap sample of the training rows, a technique known as bagging. On top of that, the number of features (variables) considered at each split is controlled by the hyperparameter 'max_features.'
By setting 'max_features' to a value less than the total number of features, RF grows diverse decision trees: some trees may rely on only one of the highly correlated variables, while others incorporate most of them. Adjusting 'max_depth' or the number of trees ('n_estimators' in scikit-learn) further shapes this diversity.
Through this subsampling strategy and the ensemble's collective voting (or averaging), RF neutralizes the potentially harmful impact of highly correlated variables, as sketched below.
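A hedged sketch of this idea, on toy data where two features are near-duplicates (the data and parameter values are illustrative assumptions, not from the original post): accuracy holds up, and the redundant pair simply shares the importance between themselves.

```python
# Sketch: fit a RandomForestClassifier on data with a near-duplicate feature
# and check that accuracy is unaffected while importance is split between
# the two correlated columns.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 1000
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)   # near-duplicate of x1
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])
y = (x1 + x3 > 0).astype(int)              # toy binary target

rf = RandomForestClassifier(
    n_estimators=300,    # number of trees ('n_estimators' in scikit-learn)
    max_features=1,      # each split considers a single randomly chosen feature
    random_state=0,
)
print("CV accuracy:", cross_val_score(rf, X, y, cv=5).mean().round(3))
rf.fit(X, y)
print("importances [x1, x2, x3]:", rf.feature_importances_.round(3))
```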

Pruning

Pruning, a form of regularization for decision tree models, simplifies and speeds up a tree by removing unimportant branches. In the same spirit, recursive feature elimination can be applied on top of RF to prune away redundant, highly correlated variables.
Two feature selection methods, RFE (Recursive Feature Elimination) and RFECV (Recursive Feature Elimination with Cross-Validation), address multicollinearity concerns. RFECV, in particular, lets the model identify the best combination of features based on a performance metric such as accuracy.
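Here is a minimal sketch of RFECV wrapped around a random-forest estimator (the simulated data is an assumption for illustration): cross-validation picks how many features to keep, so one of two redundant columns tends to be dropped once it stops improving the score.

```python
# Sketch: RFECV with a RandomForestClassifier drops a redundant, highly
# correlated feature when it no longer helps cross-validated accuracy.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV

rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)   # redundant copy of x1
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])
y = (x1 + x3 > 0).astype(int)

selector = RFECV(
    estimator=RandomForestClassifier(n_estimators=200, random_state=0),
    step=1,               # drop one feature per elimination round
    cv=5,
    scoring="accuracy",
)
selector.fit(X, y)
print("kept features (mask):", selector.support_)
print("feature ranking:", selector.ranking_)
```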

Oblique Splits

RF can also handle highly correlated variables by aggregating redundant features with sparse oblique splits. When sparse oblique splits are enabled ('split_axis = "SPARSE_OBLIQUE"' in TensorFlow Decision Forests), a single split combines a small number of features into one linear projection, akin to a local dimension reduction such as PCA.
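A hedged sketch, assuming TensorFlow Decision Forests (tensorflow_decision_forests) is installed and that you have a hypothetical 'train.csv' with a binary 'label' column (both assumptions, not from the original post):

```python
# Sketch: enable sparse oblique splits in a TF-DF random forest so a single
# split can combine a few correlated columns into one linear projection.
import pandas as pd
import tensorflow_decision_forests as tfdf

df = pd.read_csv("train.csv")  # hypothetical dataset with a binary 'label' column
train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(df, label="label")

model = tfdf.keras.RandomForestModel(
    split_axis="SPARSE_OBLIQUE",   # use sparse oblique splits instead of axis-aligned ones
)
model.fit(train_ds)
model.summary()
```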

Conclusion

Thanks to its structural properties, RF proves resilient to multicollinearity. However, it is still important to exercise caution when interpreting feature importance, since correlated variables split the importance among themselves.

Derived Questions

An intriguing question arises: Does the XGBoost model also address multicollinearity? A topic for further exploration.

References

  1. Stack Exchange - Highly Correlated Variables in Random Forest
  2. Statistics by Jim - Multicollinearity in Regression Analysis
  3. Analytics Vidhya - Understand Random Forest Algorithms With Examples
  4. Scikit-Learn - Recursive Feature Elimination
  5. TensorFlow Decision Forests - Random Forest Model
 

hy0park