XGBoost (eXtreme Gradient Boosting) is a distributed, open source machine learning library designed to solve complex data science problems quickly and efficiently. The library, created by Tianqi Chen, can be installed on a machine and accessed from multiple interfaces, including a command-line interface (CLI), C++, Python, and Java. Many developers have contributed to the open source project since its creation, and today XGBoost is among the most widely used tools in the distributed machine learning community.
A scalable tree boosting system, XGBoost employs techniques such as sharding and data compression to scale to datasets with billions of examples. Boosting is a common ensemble method that builds a strong, more accurate learner from a sequence of weak learners: each successive tree is trained to correct the mistakes of the trees before it, so the combined model becomes progressively more accurate and efficient.
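The sequential-correction idea can be sketched in a few lines of plain Python. This is an illustrative toy, not XGBoost's implementation (which adds regularization, sparsity handling, parallel tree construction, and much more); it assumes decision stumps (single-split trees) as the weak learners and squared error as the loss.

```python
# Toy gradient boosting: each round fits a stump to the residuals
# (the errors) of the ensemble built so far.

def fit_stump(xs, residuals):
    """Fit the one-threshold split that best predicts the residuals."""
    best = None
    for t in xs:  # candidate thresholds
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        if not left or not right:
            continue
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        err = (sum((r - lmean) ** 2 for r in left)
               + sum((r - rmean) ** 2 for r in right))
        if best is None or err < best[0]:
            best = (err, t, lmean, rmean)
    _, t, lmean, rmean = best
    return lambda x: lmean if x <= t else rmean

def boost(xs, ys, n_rounds=50, learning_rate=0.3):
    """Sequentially fit stumps to the current ensemble's residuals."""
    stumps = []
    predict = lambda x: sum(learning_rate * s(x) for s in stumps)
    for _ in range(n_rounds):
        residuals = [y - predict(x) for x, y in zip(xs, ys)]  # current errors
        stumps.append(fit_stump(xs, residuals))               # next weak learner
    return predict

# Piecewise-constant toy target: a few stumps can represent it exactly,
# so the sequential corrections drive the training error toward zero.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.0, 1.0, 3.0, 3.0, 7.0, 7.0]
model = boost(xs, ys)
mse = sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)
print(f"training MSE after 50 rounds: {mse:.6f}")
```

The learning rate shrinks each tree's contribution, trading per-round progress for a smoother, more robust fit; XGBoost exposes the same knob (along with the number of boosting rounds) as a core hyperparameter.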
To optimize memory use and compute time, XGBoost includes several important algorithm features: continued training, which lets developers further boost an already fitted model on new data; a block-based data structure that enables parallelized tree construction; and sparsity-aware split finding, which handles missing values automatically. System-level features include out-of-core computation for datasets that do not fit in memory, distributed computing that trains very large models across a cluster of machines, and cache-aware optimization of algorithms and data structures. The library supports gradient boosting, stochastic gradient boosting, and regularized gradient boosting.
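The sparsity-aware idea is worth unpacking: rather than imputing missing values, the learner tries routing the missing rows to each side of every candidate split and keeps the "default direction" that minimizes the loss. The toy below sketches that search for a single split with squared error, in plain Python; it is a simplified illustration of the concept, not XGBoost's actual split-finding code.

```python
# Sparsity-aware split finding (toy): for each threshold, try sending
# rows with a missing feature value (None) left and right, and keep
# whichever default direction gives the lower squared error.

def sse(values):
    """Sum of squared errors around the mean."""
    if not values:
        return 0.0
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)

def best_split(xs, ys):
    """Return (threshold, default_left, score); xs may contain None."""
    present = sorted({x for x in xs if x is not None})
    missing_ys = [y for x, y in zip(xs, ys) if x is None]
    best = None
    for t in present[:-1]:                    # candidate thresholds
        left = [y for x, y in zip(xs, ys) if x is not None and x <= t]
        right = [y for x, y in zip(xs, ys) if x is not None and x > t]
        for default_left in (True, False):    # try both default directions
            l = left + missing_ys if default_left else left
            r = right if default_left else right + missing_ys
            score = sse(l) + sse(r)
            if best is None or score < best[2]:
                best = (t, default_left, score)
    return best

# Rows with missing x happen to behave like the low-x group, so the
# learned default direction for missing values is "left".
xs = [1.0, 2.0, None, 8.0, 9.0, None]
ys = [1.0, 1.0, 1.0, 5.0, 5.0, 1.0]
t, default_left, score = best_split(xs, ys)
print(f"split at x <= {t}, send missing left? {default_left}, SSE {score:.2f}")
```

Learning the default direction from the data, instead of imputing, is what lets XGBoost consume sparse inputs without any preprocessing.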
While there are several other implementations of gradient boosting, XGBoost is widely preferred for its model performance and execution speed. Szilard Pafka, in a 2015 blog post, wrote about his analysis of various gradient boosting implementations and described XGBoost as “fast, memory efficient, and of high accuracy.” In Pafka's experiment, which trained random forests of 500 trees with default hyperparameter values, XGBoost almost always produced accurate results more quickly than the benchmarked implementations in Python, Spark, R, and H2O.
In terms of model performance, XGBoost is a go-to algorithm for regression and classification problems on tabular (structured) datasets. Chen, who created the library while researching variants of tree boosting as a student, introduced XGBoost to the wider developer community by entering it in the Higgs Boson Machine Learning Challenge on Kaggle. It has since been used by many top-three finishers in Kaggle data science competitions.
To date, XGBoost has been used by Kaggle first-place finishers in competitions such as IEEE-CIS Fraud Detection, the Santander Value Prediction Challenge, Liberty Mutual Property Inspection, and Crowdflower Search Results Relevance. Qingchen Wang, who finished first in the Liberty Mutual Property Inspection competition, said in his winner's interview that XGBoost was the only algorithm he used for predictive modeling.
XGBoost is open source software released under the permissive Apache 2.0 license, making it free for developers to use. It can be installed on three operating systems (Linux, Windows, and macOS) and used from various platforms, including R and Python. The shared library built on each operating system is: Linux (libxgboost.so), Windows (xgboost.dll), and macOS (libxgboost.dylib). Building the shared library requires CMake 3.14 or higher and a C++ compiler with C++11 support (g++ 5.0 or higher).
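On Linux or macOS, building the shared library typically follows the project's documented CMake workflow, along these lines (exact steps and version requirements may differ across XGBoost releases):

```sh
# Clone with submodules (XGBoost vendors dependencies such as dmlc-core).
git clone --recursive https://github.com/dmlc/xgboost
cd xgboost

# Configure and compile the shared library (libxgboost.so on Linux,
# libxgboost.dylib on macOS); requires CMake 3.14+ and a C++11 compiler.
mkdir build && cd build
cmake ..
make -j4
```

Language bindings such as the Python package can then be installed against the freshly built library, or fetched prebuilt from the usual package managers (e.g., `pip install xgboost`).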