XGBoost: A Scalable Tree Boosting System
Published 2016 in Knowledge Discovery and Data Mining
ABSTRACT
Tree boosting is a highly effective and widely used machine learning method. In this paper, we describe a scalable end-to-end tree boosting system called XGBoost, which is used widely by data scientists to achieve state-of-the-art results on many machine learning challenges. We propose a novel sparsity-aware algorithm for sparse data and weighted quantile sketch for approximate tree learning. More importantly, we provide insights on cache access patterns, data compression and sharding to build a scalable tree boosting system. By combining these insights, XGBoost scales beyond billions of examples using far fewer resources than existing systems.
PUBLICATION RECORD
- Publication year
2016
- Venue
Knowledge Discovery and Data Mining
- Publication date
2016-03-09
- Fields of study
Mathematics, Computer Science
- Source metadata
Semantic Scholar
CONCEPTS
- approximate tree learning
A method for constructing decision trees using approximate split-finding rather than exact enumeration of all candidate splits.
- cache access patterns
Memory access strategies analyzed in this paper to improve data locality and computational efficiency during tree boosting.
- data compression
Techniques used in XGBoost to reduce the storage footprint of training data for more efficient processing.
- scalability
The ability of XGBoost to handle datasets exceeding billions of examples with reduced computational resources compared to prior systems.
- sharding
Partitioning of data across multiple disks or machines used in XGBoost to enable out-of-core and distributed computation.
- sparse data
Input data with many missing or zero-valued features, which the proposed sparsity-aware algorithm is designed to handle.
- sparsity-aware algorithm
A novel algorithm proposed in this paper to efficiently handle sparse input features during tree construction.
- tree boosting
An ensemble machine learning technique that builds additive models of decision trees sequentially to minimize a loss function.
Aliases: gradient boosted trees
- weighted quantile sketch
An approximate algorithm proposed in this paper to find candidate split points for tree learning using weighted data summaries.
- xgboost
A scalable end-to-end tree boosting system introduced in this paper, designed for large-scale machine learning tasks.
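To make the concepts above concrete, here is a minimal, self-contained sketch of two of them together: tree boosting as an additive model fit to residuals, and the sparsity-aware idea of learning a default direction for missing values at each split. This is an illustrative toy, not the paper's actual implementation — XGBoost uses second-order gradients, regularization, and quantile-based approximate split finding, none of which appear here. All function names are hypothetical.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression stump on a single feature.
    Rows with a missing value (None) are routed to whichever side
    reduces squared error more -- the 'default direction' learned
    by sparsity-aware split finding."""
    present = [(x, r) for x, r in zip(xs, residuals) if x is not None]
    missing = [r for x, r in zip(xs, residuals) if x is None]
    best = None
    for threshold in sorted({x for x, _ in present}):
        left = [r for x, r in present if x <= threshold]
        right = [r for x, r in present if x > threshold]
        for default in ("left", "right"):
            l = left + missing if default == "left" else left
            r = right + missing if default == "right" else right
            if not l or not r:
                continue
            lm, rm = sum(l) / len(l), sum(r) / len(r)
            sse = (sum((v - lm) ** 2 for v in l)
                   + sum((v - rm) ** 2 for v in r))
            if best is None or sse < best[0]:
                best = (sse, threshold, default, lm, rm)
    return best  # (sse, threshold, default, left_value, right_value)

def predict_stump(stump, x):
    _, threshold, default, lv, rv = stump
    if x is None:  # missing value: follow the learned default direction
        return lv if default == "left" else rv
    return lv if x <= threshold else rv

def boost(xs, ys, rounds=10, lr=0.3):
    """Additive model: each round fits a stump to the residuals
    (the negative gradient of squared-error loss) and adds a
    shrunken copy of it to the ensemble."""
    pred = [0.0] * len(ys)
    stumps = []
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, pred)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        pred = [p + lr * predict_stump(stump, x)
                for p, x in zip(pred, xs)]
    return stumps, pred
```

For example, on a single feature with one missing entry, `boost([1, 2, 3, None, 5, 6], [1.0, 1.0, 1.0, 5.0, 5.0, 5.0])` learns a split at 3 whose default direction sends the missing row right, and the additive predictions converge toward the targets as rounds accumulate.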