Feature Importance in Gradient Boosting Trees with Cross-Validation Feature Selection

Published 2021 in Entropy

ABSTRACT

Gradient Boosting Machines (GBM) are among the go-to algorithms for tabular data and produce state-of-the-art results in many prediction tasks. Despite its popularity, the GBM framework suffers from a fundamental flaw in its base learners. Specifically, most implementations use decision trees that are typically biased towards categorical variables with large cardinalities. The effect of this bias has been studied extensively over the years, mostly in terms of predictive performance. In this work, we extend the scope and study the effect of biased base learners on GBM feature importance (FI) measures. We demonstrate that although these implementations achieve highly competitive predictive performance, they still, surprisingly, suffer from bias in FI. By utilizing cross-validated (CV) unbiased base learners, we fix this flaw at a relatively low computational cost. We demonstrate the suggested framework in a variety of synthetic and real-world setups, showing a significant improvement in all GBM FI measures while maintaining roughly the same level of prediction accuracy.
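
The cardinality bias described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: it assumes squared-error loss, a multiway split that predicts one mean per category, and 5-fold cross-validation, and the names `in_sample_gain`, `cv_gain`, `signal`, and `noise_id` are hypothetical. It compares the gain of a weakly informative binary feature with that of an uninformative 400-level identifier, first scored on the training data itself and then scored on held-out folds.

```python
import numpy as np

def in_sample_gain(x, y):
    """SSE reduction from per-category means, scored on the data used to fit them."""
    base = np.sum((y - y.mean()) ** 2)
    sse = 0.0
    for c in np.unique(x):
        yc = y[x == c]
        sse += np.sum((yc - yc.mean()) ** 2)
    return base - sse

def cv_gain(x, y, n_folds=5, seed=0):
    """SSE reduction from per-category means, fitted on training folds and scored on held-out folds."""
    rng = np.random.default_rng(seed)
    folds = rng.permutation(len(y)) % n_folds
    gain = 0.0
    for k in range(n_folds):
        train, test = folds != k, folds == k
        global_mean = y[train].mean()
        # Categories unseen in the training folds fall back to the global mean.
        pred = np.full(test.sum(), global_mean)
        x_test = x[test]
        for c in np.unique(x_test):
            in_train = train & (x == c)
            if in_train.any():
                pred[x_test == c] = y[in_train].mean()
        gain += np.sum((y[test] - global_mean) ** 2) - np.sum((y[test] - pred) ** 2)
    return gain

# Synthetic data (assumption): one weak binary signal, one pure-noise identifier.
rng = np.random.default_rng(42)
n = 2000
signal = rng.integers(0, 2, size=n)      # informative, 2 levels
noise_id = rng.permutation(n) % 400      # uninformative, 400 levels
y = 0.5 * signal + rng.normal(size=n)

features = {"signal": signal, "noise_id": noise_id}
naive = {name: in_sample_gain(x, y) for name, x in features.items()}
cross_validated = {name: cv_gain(x, y) for name, x in features.items()}

naive_choice = max(naive, key=naive.get)
cv_choice = max(cross_validated, key=cross_validated.get)
print("in-sample gains:", naive, "->", naive_choice)
print("cross-validated gains:", cross_validated, "->", cv_choice)
```

Under these assumptions the in-sample criterion favors the high-cardinality noise feature, because fitting 400 category means absorbs noise, while the held-out criterion penalizes that overfitting and favors the informative feature. Any importance measure that accumulates split gains inherits whichever choice is made at this step.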

PUBLICATION RECORD

Publication year
2021
Venue
Entropy
Publication date
2021-09-12
Fields of study
Mathematics, Computer Science, Medicine
Identifiers
DOI 10.3390/e24050687 arXiv 2109.05468 PMID 35626570 PMCID 9140774
Source metadata
Semantic Scholar, PubMed

REFERENCES

MapReduce
2020cited by this paper
Interpretable Machine Learning
2019influential reference
Consistent Individualized Feature Attribution for Tree Ensembles
2018cited by this paper
Explainable machine-learning predictions for the prevention of hypoxaemia during surgery
2018influential reference
On the Universality of the Logistic Loss Function
2018cited by this paper
Towards A Rigorous Science of Interpretable Machine Learning
2017cited by this paper
A Unified Approach to Interpreting Model Predictions
2017cited by this paper
LightGBM: A Highly Efficient Gradient Boosting Decision Tree
2017cited by this paper
CatBoost: unbiased boosting with categorical features
2017cited by this paper
Distribution-Free Predictive Inference for Regression
2016cited by this paper
XGBoost: A Scalable Tree Boosting System
2016cited by this paper
“Why Should I Trust You?”: Explaining the Predictions of Any Classifier
2016cited by this paper
Cross-Validated Variable Selection in Tree-Based Methods Improves Predictive Performance
2015influential reference
Comments on Fifty Years of Classification and Regression Trees.
2014cited by this paper
Fifty Years of Classification and Regression Trees
2014cited by this paper
Correlation and variable importance in random forests
2013cited by this paper
Classification and regression trees
2012influential reference
Parallel boosted regression trees for web search ranking
2011cited by this paper
The behaviour of random forest permutation-based variable importance measures under predictor correlation
2010cited by this paper
From RankNet to LambdaRank to LambdaMART: An Overview
2010cited by this paper
Feature Selection with the Boruta Package
2010cited by this paper
Learning from Imbalanced Data
2009cited by this paper
Feature selection for ranking using boosted trees
2009cited by this paper
Bias in random forest variable importance measures: Illustrations, sources and a solution
2007influential reference
Predicting clicks: estimating the click-through rate for new ads
2007cited by this paper
Unbiased Recursive Partitioning: A Conditional Inference Framework
2006influential reference
Induction of decision trees
2004influential reference
Regression Trees with Unbiased Variable Selection and Interaction Detection
2002cited by this paper
Stochastic gradient boosting
2002cited by this paper
Random Forests
2001cited by this paper
Greedy function approximation: A gradient boosting machine.
2001influential reference
Using a Permutation Test for Attribute Selection in Decision Trees
1998cited by this paper
Classification and regression
1997influential reference
Split Selection Methods for Classification Trees
1997cited by this paper
Bagging Predictors
1996cited by this paper
Selecting multiway splits in decision trees
1996cited by this paper
Ranking Categorical Features Using Generalization Properties
year unknowncited by this paper
Classification Trees With Bivariate Linear Discriminant Node Models
year unknowncited by this paper

CITED BY

Towards Reliable Feature Importance in Hashimoto's Thyroiditis Prediction: Reconstructing Machine Learning Frameworks.
2026cites this paper
Machine-Learning Applications in Predicting Students’ Non-Cognitive Skills: Evidence from PISA 2022
2026cites this paper
Ensuring reliable feature importance in food chemistry AI.
2026cites this paper
Integrating Machine Learning and Image-Based Damage Quantification to Predict Self-Healing Performance of Asphalt Mixtures
2026cites this paper
Revisiting AI Interpretability in Precision Oncology: Why Predictive Accuracy Does Not Ensure Stable Feature Importance.
2026cites this paper
Beyond prediction: Assessing stability in feature selection methods for materials science applications
2026cites this paper
Development of a novel prognostic risk model for pancreatic adenocarcinoma exploiting multi-omics data
2026cites this paper
Towards reliable feature interpretation in machine learning-based longevity prediction.
2026cites this paper
CAFE-GB: Scalable and Stable Feature Selection for Malware Detection via Chunk-wise Aggregated Gradient Boosting
2026cites this paper
When high accuracy misleads: Stability limits of supervised feature importance in QSAR biodegradation.
2026cites this paper
Assessing Relative Population Exposure and Non-Carcinogenic Health Risks of PM2.5 in Tehran Through Machine Learning, Satellite Observations, and Uncertainty Analysis
2026cites this paper
Synthetic data-driven explainable machine learning for groundwater salinity prediction in the Al-Qatif coastal aquifer of Saudi Arabia
2026cites this paper
Beyond prediction accuracy: Assessing SHAP-driven interpretability for refractory high entropy alloys hardness
2026cites this paper
A Novel Hybrid Approach for Identification of Discriminative Features in Phishing Emails
2026cites this paper
Vegetation phenological shift induced by rocky desertification governance: spatiotemporal characteristics and driving mechanisms
2025influential citation
Ensemble recursive feature elimination-based ensemble classification for medical diagnosis
2025cites this paper
Beyond XGBoost-SHAP: Strengthening THM Mechanistic Inference with Consistency and Dose-Response Validation
2025cites this paper
Development and Clinical Validation of Lightweight, Multimodal Machine Learning Models for Smartphone-Based Cataract Detection and Classification
2025cites this paper
Integrating remote sensing and meteorological data for AI-based land surface temperature prediction with feature selection approaches
2025cites this paper
Preoperative prediction of axillary lymph node metastasis in breast invasive ductal carcinoma patients using a deep learning model based on dynamic contrast-enhanced magnetic resonance imaging: a multicenter study
2025cites this paper
Automated Feature Engineering Based on Explainable Artificial Intelligence for Time Series Forecasting
2025cites this paper
Analyzing Groundwater Dynamics in Chhattisgarh Using Multi-Feature Correlation and Interactive Mapping with Folium-A Case Study of the Kharun Basin
2025cites this paper
Beyond accuracy: Stabilizing feature importance in GWRF/RF for soil heavy metal mapping.
2025cites this paper
A Hybrid GBM-LSTM and Feature Engineering Residual Learning Approach for Medium-Term Load Forecasting Framework in Resource-Constrained Power Grids
2025cites this paper
Limitations of SHAP-based interpretability in sepsis progression models and paths to more robust feature validation
2025cites this paper
Research on the Prediction Model of Zinc Coating Thickness of Continuous Hot‐Dip Galvanized Strip Steel Based on Gaussian Process Regression
2025cites this paper
The pivotal and transformative role of artificial intelligence in advanced multidimensional modeling and optimization of complex cefixime separation processes using 3-hydroxyphenol-formaldehyde nanostructures: A multi-layered analytical approach
2025cites this paper
Machine learning predictions of drug release from isocyanate-derived aerogels.
2025cites this paper
Feasibility of use limited data to establish a relationship between chemical composition and the enzymatic glucose yield using machine learning
2025cites this paper
Machine Learning–Based Predictive Model for Dispute Occurrence and Resolution Strategies in Pipeline Projects
2025cites this paper
Letter to the Editor regarding "Prediction of PFAS bioaccumulation in different plant tissues with machine learning models based on molecular fingerprints" by Song et al. (2024), Sci. Total Environ. 950 175091.
2025cites this paper
Reevaluating Feature Selection in Machine Learning-Based Radiomics for Hepatocellular Carcinoma: Bridging the Gap Between Predictive Accuracy and Biological Relevance.
2025cites this paper
Application of Mask R-CNN for automatic recognition of teeth and caries in cone-beam computerized tomography
2025cites this paper
Accurate and efficient prediction on the formation energy and potential profiles of sodium vanadium oxyfluorophosphate by machine learning
2025cites this paper
Exploring innovative assessment and driving mechanisms for achieving land degradation neutrality in rocky desertification areas: A case study of Yunnan-Guangxi-Guizhou, China
2025cites this paper
AI-powered deep ultraviolet laser diode design for resource-efficient optimization
2025cites this paper
Artificial intelligence-based optimization and modeling of cadmium reduction via ultraviolet-assisted malathion/sulfite reaction mechanisms
2025cites this paper
Machine learning models for predicting multimorbidity trajectories in middle-aged and elderly adults
2025cites this paper
Complementing interpretable machine learning with synergistic analytical strategies for thyroid cancer recurrence prediction.
2025cites this paper
Optimizing Random Forest Models for Stock Market Prediction with Hyperparameter Analysis
2025cites this paper
Machine learning-driven optimization for surface roughness prediction of vertical orientation measurements on 3D printed components
2025cites this paper
Digital biomarkers for interstitial glucose prediction in healthy individuals using wearables and machine learning
2025cites this paper
Beyond model-specific biases: An explainable multifaceted approach for robust PM10 source apportionment.
2025cites this paper
Clinical Machine Learning Pitfalls: Reliability of Feature Importance in Prediction of Continuous Renal Replacement Therapy in Acute Type A Aortic Dissection Assessment.
2025cites this paper
Uncertainty in machine learning feature importance for climate science: a comparative analysis of SHAP, PDP, and gain-based methods
2025cites this paper
Predicting nepetalactone accumulation in Nepeta persica using machine learning algorithms and geospatial analysis
2025cites this paper
Revisiting AI Model Interpretability in Lung Cancer Screening: Challenges in Balancing Predictive Performance and Reliability.
2025cites this paper
Pitfalls of XAI interpretation in environmental modeling: A warning on model bias in air quality data analysis
2025cites this paper
Beyond predictive accuracy: Statistical validation of feature importance in biomedical machine learning
2025cites this paper
Transformers and capsule networks vs classical ML on clinical data for alzheimer classification
2025cites this paper
Hybrid Meta-Heuristic Feature Selection Model for Network Traffic-Based Intrusion Detection in AIoT
2025cites this paper
The Shapley Value Contribution to Explainable Artificial Intelligence: A Comprehensive Survey
2025cites this paper
Improving Web Security through Machine Learning: A Feature-Based Methodology for Detecting Phishing URLs
2025cites this paper
Limitations of SHAP-based interpretations in environmental and membrane filtration applications.
2025cites this paper
Reassessing SHAP-Based Interpretations in QSAR: Model-Centric Limits and Unsupervised Alternatives for Fluorocarbon Inhalation Toxicity.
2025cites this paper
A Study towards Supervised Learning Techniques for a Well-Predictive Modelling Utilising Electronic Health Record Data
2025cites this paper
Entropy-guided sparse multiple kernel coordinate descent classifier algorithm for interpretable prediction
2025cites this paper
Fusion-enhanced multi-label feature selection with sparse supplementation
2025cites this paper
Skillful bias correction of offshore near-surface wind field forecasting based on a multi-task machine learning model
2025cites this paper
A comprehensive review, CFD and ML analysis of flow around tandem circular cylinders at sub-critical Reynolds numbers
2025cites this paper
Leveraging Azure Automated Machine Learning and CatBoost Gradient Boosting Algorithm for Service Quality Prediction in Hospitality
2025cites this paper
Determinants Shaping Human Preference for Thermal Springs in Palaeolithic Europe and Asia Minor
2025cites this paper
Developing clinical prognostic models to predict graft survival after renal transplantation: comparison of statistical and machine learning models
2025cites this paper
Multi-objective Optimization Framework for Nitrogen-containing Compound Generation in Nitrogen-enriched Pyrolysis: Integrating Fransfer Learning and Experimental Validation
2025cites this paper
Exploiting Data Distribution: A Multi-Ranking Approach
2025cites this paper
Classification of start-ups’ digital marketing adoption experiences: an investigation of characteristics and interactions
2025cites this paper
Robust diagnosis of abnormal events using fuzzy-based feature extraction in nuclear power plants
2025cites this paper
Comments on "Dialogue between algorithms and soil: Machine learning unravels the mystery of phthalates pollution in soil" by Pan et al. (2025).
2025cites this paper
Analysis of Gradient Boosting, XGBoost, and CatBoost on Mobile Phone Classification
2024cites this paper
Hybrid Causal Feature Selection for Cancer Biomarker Identification From RNA-Seq Data
2024cites this paper
Generative AI Voting: Fair Collective Choice is Resilient to LLM Biases and Inconsistencies
2024cites this paper
Effectiveness of Variable Selection Methods for Machine Learning and Classical Statistical Models
2024cites this paper
Development of a Forest Fire Diagnostic Model Based on Machine Learning Techniques
2024cites this paper
Rolling Bearing Fault Diagnosis Based on Multi-scale Entropy Feature and Ensemble Learning
2024cites this paper
Integration Sentinel-1 SAR data and machine learning for land subsidence in-depth analysis in the North Coast of Central Java, Indonesia
2024cites this paper
Securing the Internet of Health Things: Embedded Federated Learning-Driven Long Short-Term Memory for Cyberattack Detection
2024cites this paper
Advanced Machine Learning Strategies for Landslide Detection
2024cites this paper
Features that influence bike sharing demand
2024cites this paper
Deep learning for classifying neuronal morphologies: combining topological data analysis and graph neural networks
2024cites this paper
A Machine Learning Algorithms for Detecting Phishing Websites: A Comparative Study
2024cites this paper
Exploratory Analysis of Machine Learning Methods for Total Organic Carbon Prediction Using Well-Log Data of Kolmani Field
2024cites this paper
Stacked generalization ensemble learning strategy for multivariate prediction of delamination and maximum thrust force in composite drilling
2024cites this paper
Combined Method Comprising Low Burden Physiological Measurements with Dry Electrodes and Machine Learning for Classification of Visually Induced Motion Sickness in Remote-Controlled Excavator
2024cites this paper
Enhancing Smart Grid Security: An Data-Driven Anomaly Detection Framework
2024cites this paper
Predicting the probability of failure in medicinesʼ public procurement
2024cites this paper
Detecting Invasive Alien Plant Species Using Remote Sensing, Machine Learning and Deep Learning
2024cites this paper
Spatiotemporal Estimation of Black Carbon Concentration in Tehran Using Aerosol Optical Depth Remote Sensing Data and Meteorological Parameters: Health Risk Assessment and Relationship with Green Spaces
2024cites this paper
Reassessing Feature Importance Biases in Machine Learning Models for Infection Analysis.
2024cites this paper
Stacked Ensemble with Machine Learning Regressors on Optimal Features (SMOF) of hyperspectral sensor PRISMA for inland water turbidity prediction
2024cites this paper
A comparative study of CNN-capsule-net, CNN-transformer encoder, and Traditional machine learning algorithms to classify epileptic seizure
2024cites this paper
Identifying and understanding cognitive profiles in multiple sclerosis: a role for visuospatial memory functioning
2024cites this paper
Beyond the Claims: Emerging AI Models and Predictive Analytics in Property & Casualty Insurance Risk Assessment
2024cites this paper
Machine learning-driven simplification of the hypomania checklist-32 for adolescent: a feature selection approach
2024cites this paper
Cognitive Frailty Classification Models for Older Adults in a Point-of-Care System
2024cites this paper
Selekcja zmiennych metodami statystycznymi i uczenia maszynowego. Porównanie podejść na przykładzie danych finansowych
2024cites this paper
Fake News Detection Model Basing on Machine Learning Algorithms
2024cites this paper
A machine learning model for predicting acute exacerbation of in-home chronic obstructive pulmonary disease patients
2024cites this paper
Utilizing Machine Learning Techniques to Analyze Primary Cilia in Postmortem Tissue Samples
2024cites this paper
Factors affecting students’ performance on national assessments of mathematics in Italy: a random forest approach
2024cites this paper
Predicting survival in small cell lung cancer patients undergoing various treatments: a machine learning approach
2023cites this paper