Use and misuse of random forest variable importance metrics in medicine: demonstrations through incident stroke prediction


Random forest machine learning is a popular predictive tool in medical research. However, when attempting to determine why a random forest model is predictive, applied researchers continue to rely on ‘out of bag’ (OOB) variable importance metrics (VIMPs). Within the statistics community, these metrics are known to have considerable limitations, including a bias towards highly correlated features.

What was the approach to solving the problem?

First, we evaluate current VIMP practices through an in-depth literature review and explain the limitations of OOB VIMPs. We then propose a novel analytic framework for identifying features contributing to random forest models, based on interpretable ‘knockoff VIMPs’ that were recently developed as an alternative to OOB VIMPs. To demonstrate our framework, we use a random forest model to predict 5-year incident stroke and compare results based on OOB VIMPs versus knockoff VIMPs.

What NSRR data were used?

Sleep Heart Health Study participants without history of stroke at the first visit (N=4,512).

What were the results?

Our literature review confirmed substantial limitations in the use of OOB VIMPs within applied medical research. In our demonstration, OOB VIMPs and knockoff VIMPs suggested widely different sets of features indicating risk for incident stroke.

What were the conclusions and implications of this work?

Despite their popularity in medical research, the default OOB VIMPs may produce misleading results. To guide researchers towards more meaningful results, it is essential to bring modern, interpretable and unbiased VIMP methods such as knockoff VIMPs into widespread practice.

Are there any tools available?

R functions for computing knockoff VIMPs are posted on GitHub.

An R package (VIMPS) is under development and will be available soon.


Wallace ML, Mentch L, Wheeler BJ, Tapia AL, Richards M, Zhou S, Yi L, Redline S, Buysse DJ. Use and misuse of random forest variable importance metrics in medicine: demonstrations through incident stroke prediction. BMC Medical Research Methodology. 2023 Jun 19;23(1):144. doi: 10.1186/s12874-023-01965-x.

Candes E, Fan Y, Janson L, Lv J. Panning for gold: ‘model-X’ knockoffs for high dimensional controlled variable selection. J R Stat Soc Ser B (Statistical Methodology). 2018;80(3):551–77.

Paper Summary

‘Out of bag’ (OOB) variable importance metrics (VIMPs) are currently the default approach to determining which features are predictive within a random forest machine learning model. They are computationally efficient because they require only a single random forest to be developed. However, this efficiency comes with a serious drawback: groups of correlated features tend to have inflated OOB VIMPs. This limitation is troublesome because complex high-dimensional data nearly always contain correlated features, and these are exactly the situations in which random forests are most useful. Our in-depth literature review confirmed the misuse of OOB VIMPs, underscoring the need to bring modern VIMP approaches into mainstream medical research.

We propose an alternative framework that incorporates modern ‘knockoff’ VIMP methods. Unlike OOB VIMPs, knockoff VIMPs directly compare the performance between two separate random forests: one including all features versus one where features of interest have been replaced with ‘knockoff’ features that have no true relationship to the outcome. Through this direct comparison, knockoff VIMPs quantify added predictive value (e.g., sensitivity, specificity) of the features of interest.
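The direct comparison described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes that binary predictions from the two forests (one trained on the original features, one trained with the features of interest replaced by knockoffs) are already available for the same participants. The function names and the toy labels and predictions are hypothetical, chosen only to show the arithmetic.

```python
from typing import Dict, Sequence, Tuple


def sensitivity_specificity(
    y_true: Sequence[int], y_pred: Sequence[int]
) -> Tuple[float, float]:
    """Return (sensitivity, specificity) for binary labels coded 0/1.

    Assumes y_true contains at least one positive and one negative case.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    return tp / (tp + fn), tn / (tn + fp)


def knockoff_vimp(
    y_true: Sequence[int],
    pred_full: Sequence[int],
    pred_knockoff: Sequence[int],
) -> Dict[str, float]:
    """Hypothetical helper: performance of the full model minus performance
    of the model whose features of interest were replaced by knockoffs."""
    sens_full, spec_full = sensitivity_specificity(y_true, pred_full)
    sens_ko, spec_ko = sensitivity_specificity(y_true, pred_knockoff)
    return {
        "delta_sensitivity": sens_full - sens_ko,
        "delta_specificity": spec_full - spec_ko,
    }


# Toy data (invented for illustration): 4 cases with stroke, 6 without.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
pred_full = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]      # forest with original features
pred_knockoff = [1, 1, 0, 0, 0, 0, 0, 0, 1, 1]  # forest with knockoff features

result = knockoff_vimp(y_true, pred_full, pred_knockoff)
print(result)
```

In this toy example, positive differences mean the original features add sensitivity and specificity beyond what their knockoff replacements provide. Generating valid knockoff features and fitting the forests are separate steps that this sketch does not cover.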

To demonstrate the potentially different conclusions that could be drawn using OOB VIMPs versus knockoff VIMPs, we used the NSRR Sleep Heart Health Study data to examine the value of overnight polysomnography (PSG) features for predicting 5-year incident stroke, relative to other established clinical and self-report measures. Using OOB VIMPs, we identified two highly correlated lung function features (forced vital capacity and forced expiratory volume). This finding aligns with the critique that OOB VIMPs are biased towards groups of correlated features. Conversely, using our organized knockoff VIMP strategy, we identified groups of features with the largest contributions to sensitivity and specificity of incident stroke prediction. These features included measured medical risk factors, age, diastolic blood pressure, self-reported medical risk factors, PSG features, and pack-years of smoking. Thus, the findings from knockoff VIMPs confirmed several features known to be important for stroke prediction, along with several novel yet plausible features (e.g., PSG variables).


Guest Blogger: Dr. Meredith L. Wallace

Paper Authors: Meredith L. Wallace (1), Lucas Mentch (2), Bradley J. Wheeler (3), Amanda L. Tapia (1), Marc Richards (2), Siyu Zhou (2), Lixia Yi (2), Susan Redline (4) & Daniel J. Buysse (1)

1 Department of Psychiatry, University of Pittsburgh, 3811 O’Hara Street, Pittsburgh, PA, USA
2 Department of Statistics, University of Pittsburgh, Pittsburgh, PA, USA
3 School of Computing and Information, University of Pittsburgh, Pittsburgh, PA, USA
4 Brigham and Women’s Hospital, Harvard Medical School, Boston, MA, USA

By szhivotovsky on January 18, 2024 in Guest Blogger