Navigating Multicollinearity in Random Forests
Feb 05, 2024
“Do highly-correlated variables in random forests distort accuracy and feature selection?”
In short, the answer is “No, they won’t.”
Introduction
Random Forest (RF) is a powerful ensemble model that is widely used for classification and regression tasks. Before delving into how RF handles multicollinearity, let's briefly review what multicollinearity is and why it troubles ordinary regression models.
Understanding Multicollinearity
Multicollinearity arises when two or more explanatory variables in a regression model are highly correlated. This poses a significant challenge for regression models: the predictors no longer carry independent information, so the estimated coefficients become unstable and hard to interpret.
In essence, multicollinearity inflates the standard errors of the coefficient estimates, diminishing trust in the model's conclusions. For instance, the coefficients of correlated variables may appear too small, or even flip sign, because those variables share and effectively split the impact of the same underlying information.
To mitigate multicollinearity, traditional regression models often require preprocessing steps such as removing highly-correlated variables.
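A quick way to see the problem in practice is to compute variance inflation factors (VIFs), a standard multicollinearity diagnostic. The sketch below uses synthetic data and statsmodels; the variable names and the rule-of-thumb threshold are illustrative assumptions, not part of the original article.

```python
# Minimal sketch: two nearly identical predictors produce very large VIFs,
# a standard signal that multicollinearity is present.
import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = x1 + rng.normal(scale=0.05, size=500)   # nearly a copy of x1
x3 = rng.normal(size=500)                    # independent predictor

# Design matrix with an intercept column at index 0.
X = np.column_stack([np.ones(500), x1, x2, x3])

for idx, name in zip([1, 2, 3], ["x1", "x2", "x3"]):
    print(name, round(variance_inflation_factor(X, idx), 1))
# Expect huge VIFs for x1 and x2 (far above the usual rule of thumb of ~10),
# and a VIF close to 1 for the independent x3.
```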
Why Random Forest Escapes the Pitfalls of Multicollinearity
Random Forest sidesteps the challenges posed by highly collinear variables through a combination of subsampling, pruning, and aggregation within its model structure.
Subsampling
During subsampling, RF builds each tree on a bootstrap sample of the training rows, a technique known as bagging. On top of that, each split within a tree considers only a random subset of the features, and the size of that subset is controlled by the hyperparameter 'max_features'.
By setting 'max_features' to a value smaller than the total number of features, RF grows diverse decision trees: some trees may rely on only one of the highly-correlated variables, while others incorporate most of them. Tuning 'max_depth' or the number of trees ('n_estimators' in the scikit-learn library) further diversifies and stabilizes the ensemble.
Through this subsampling strategy, RF blunts the potentially harmful impact of highly-correlated variables: whatever noise any single tree picks up is averaged out by the ensemble's collective voting.
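As a rough illustration (not the article's original code), here is a minimal scikit-learn sketch on synthetic data that deliberately contains redundant, correlated features; the hyperparameter values are arbitrary choices for demonstration.

```python
# Minimal sketch: limiting the features considered at each split
# ('max_features') de-correlates individual trees, so no single
# redundant variable dominates every tree in the ensemble.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# n_redundant features are linear combinations of the informative ones,
# i.e. highly correlated with them.
X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=5, n_redundant=10,
                           random_state=42)

rf = RandomForestClassifier(n_estimators=300,      # number of trees
                            max_features="sqrt",   # features tried per split
                            random_state=42)

print("CV accuracy:", cross_val_score(rf, X, y, cv=5).mean())
```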
Pruning
Pruning, a form of regularization for decision trees, simplifies and speeds up a model by removing uninformative branches. For RF, a comparable cleanup can be achieved with recursive feature elimination, which strips out redundant, highly-correlated variables.
Two feature-selection utilities in scikit-learn, RFE (Recursive Feature Elimination) and RFECV (Recursive Feature Elimination with Cross-Validation), address multicollinearity concerns. RFECV, in particular, lets the model identify the best-performing combination of features based on a metric such as accuracy.
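Below is a minimal RFECV sketch, again on synthetic correlated data; the estimator settings and scoring choice are assumptions for illustration, not recommendations.

```python
# Minimal sketch: RFECV wraps the forest and drops features (including
# redundant, correlated ones) that do not improve cross-validated accuracy.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV

X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=5, n_redundant=10,
                           random_state=42)

selector = RFECV(
    estimator=RandomForestClassifier(n_estimators=200, random_state=42),
    step=1,                # remove one feature per iteration
    cv=5,
    scoring="accuracy",
)
selector.fit(X, y)

print("Features kept:", selector.n_features_)
print("Selection mask:", selector.support_)
```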
Sparse Oblique Splits
RF can also address highly-correlated variables by aggregating redundant features through sparse oblique splits. Loosely analogous to dimensionality reduction with PCA, this method combines a small number of features into a single split when sparse oblique splits are enabled (for example, via the 'split_axis = "SPARSE_OBLIQUE"' hyperparameter in libraries such as TensorFlow Decision Forests).
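For illustration, enabling sparse oblique splits might look like the hedged sketch below, assuming the TensorFlow Decision Forests API (where the 'split_axis' hyperparameter lives); the dataset path and label column are hypothetical placeholders.

```python
# Hedged sketch, assuming the TensorFlow Decision Forests API:
# enable sparse oblique splits so each split can use a sparse linear
# combination of features instead of a single feature.
import pandas as pd
import tensorflow_decision_forests as tfdf

df = pd.read_csv("train.csv")  # hypothetical dataset with a 'label' column
train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(df, label="label")

model = tfdf.keras.RandomForestModel(
    split_axis="SPARSE_OBLIQUE",  # splits on sparse combinations of features
)
model.fit(train_ds)
```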
Conclusion
Thanks to these structural properties, RF proves resilient to multicollinearity. However, it's still crucial to exercise caution when interpreting feature importances: correlated features split their importance between them and can each appear less influential than they really are.
Derived Questions
An intriguing question arises: Does the XGBoost model also address multicollinearity? A topic for further exploration.