XGBoost: A Scalable Tree Boosting System
Published 2016 in Knowledge Discovery and Data Mining
ABSTRACT
Tree boosting is a highly effective and widely used machine learning method. In this paper, we describe a scalable end-to-end tree boosting system called XGBoost, which is used widely by data scientists to achieve state-of-the-art results on many machine learning challenges. We propose a novel sparsity-aware algorithm for sparse data and weighted quantile sketch for approximate tree learning. More importantly, we provide insights on cache access patterns, data compression and sharding to build a scalable tree boosting system. By combining these insights, XGBoost scales beyond billions of examples using far fewer resources than existing systems.
PUBLICATION RECORD
- Publication year
2016
- Venue
Knowledge Discovery and Data Mining
- Publication date
2016-03-09
- Fields of study
Mathematics, Computer Science
- Source metadata
Semantic Scholar
CONCEPTS
- approximate tree learning
A method for constructing decision trees using approximate split-finding rather than exact enumeration of all candidate splits.
- cache access patterns
Memory access strategies analyzed in this paper to improve data locality and computational efficiency during tree boosting.
- data compression
Techniques used in XGBoost to reduce the storage footprint of training data for more efficient processing.
- scalability
The ability of XGBoost to handle datasets exceeding billions of examples with reduced computational resources compared to prior systems.
- sharding
Partitioning of data across multiple disks or machines used in XGBoost to enable out-of-core and distributed computation.
- sparse data
Input data with many missing or zero-valued features, which the proposed sparsity-aware algorithm is designed to handle.
- sparsity-aware algorithm
A novel algorithm proposed in this paper to efficiently handle sparse input features during tree construction.
- tree boosting
An ensemble machine learning technique that builds additive models of decision trees sequentially to minimize a loss function.
Aliases: gradient boosted trees
- weighted quantile sketch
An approximate algorithm proposed in this paper to find candidate split points for tree learning using weighted data summaries.
- xgboost
A scalable end-to-end tree boosting system introduced in this paper, designed for large-scale machine learning tasks.
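To make the concepts above concrete, here is a minimal, self-contained sketch of two of them together: tree boosting as an additive model fit to residuals, and the sparsity-aware idea of learning a default direction for missing values at each split. This is an illustrative toy, not the paper's actual implementation — XGBoost uses second-order gradients, regularization, and quantile-based approximate split finding, none of which appear here. All function names are hypothetical.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression stump on a single feature.
    Rows with a missing value (None) are routed to whichever side
    reduces squared error more -- the 'default direction' learned
    by sparsity-aware split finding."""
    present = [(x, r) for x, r in zip(xs, residuals) if x is not None]
    missing = [r for x, r in zip(xs, residuals) if x is None]
    best = None
    for threshold in sorted({x for x, _ in present}):
        left = [r for x, r in present if x <= threshold]
        right = [r for x, r in present if x > threshold]
        for default in ("left", "right"):
            l = left + missing if default == "left" else left
            r = right + missing if default == "right" else right
            if not l or not r:
                continue
            lm, rm = sum(l) / len(l), sum(r) / len(r)
            sse = (sum((v - lm) ** 2 for v in l)
                   + sum((v - rm) ** 2 for v in r))
            if best is None or sse < best[0]:
                best = (sse, threshold, default, lm, rm)
    return best  # (sse, threshold, default, left_value, right_value)

def predict_stump(stump, x):
    _, threshold, default, lv, rv = stump
    if x is None:  # missing value: follow the learned default direction
        return lv if default == "left" else rv
    return lv if x <= threshold else rv

def boost(xs, ys, rounds=10, lr=0.3):
    """Additive model: each round fits a stump to the residuals
    (the negative gradient of squared-error loss) and adds a
    shrunken copy of it to the ensemble."""
    pred = [0.0] * len(ys)
    stumps = []
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, pred)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        pred = [p + lr * predict_stump(stump, x)
                for p, x in zip(pred, xs)]
    return stumps, pred
```

For example, on a single feature with one missing entry, `boost([1, 2, 3, None, 5, 6], [1.0, 1.0, 1.0, 5.0, 5.0, 5.0])` learns a split at 3 whose default direction sends the missing row right, and the additive predictions converge toward the targets as rounds accumulate.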