XGBoost: Scalable GPU Accelerated Learning

Published 29 Jun 2018 in cs.LG and stat.ML | (1806.11248v1)

Abstract: We describe the multi-GPU gradient boosting algorithm implemented in the XGBoost library (https://github.com/dmlc/xgboost). Our algorithm allows fast, scalable training on multi-GPU systems with all of the features of the XGBoost library. We employ data compression techniques to minimise the usage of scarce GPU memory while still allowing highly efficient implementation. Using our algorithm we show that it is possible to process 115 million training instances in under three minutes on a publicly available cloud computing instance. The algorithm is implemented using end-to-end GPU parallelism, with prediction, gradient calculation, feature quantisation, decision tree construction and evaluation phases all computed on device.

Summary

The paper presents a multi-GPU gradient boosting algorithm that significantly accelerates training using efficient data compression and quantile-based feature quantization.
It applies end-to-end GPU parallelism to key operations such as gradient calculation and decision tree construction, and on the largest benchmark dataset it runs close to three times faster than its nearest competitor.
The approach demonstrates scalability by processing 115M instances rapidly, paving the way for future extensions to distributed and terabyte-scale data challenges.

Introduction

The paper "XGBoost: Scalable GPU Accelerated Learning" describes the multi-GPU gradient boosting algorithm implemented in the XGBoost library. The algorithm enables fast, large-scale training on multi-GPU systems and uses data compression to conserve scarce GPU memory. The authors pair the algorithm description with an empirical evaluation, showing that 115 million training instances can be processed in under three minutes on a publicly available cloud computing instance.

Methods

The core contribution is end-to-end GPU parallelism across the main stages of gradient boosting: prediction, gradient calculation, feature quantization, decision tree construction, and evaluation. Tree construction uses a quantile-based decision tree algorithm, parallelized on the GPU to improve both speed and memory utilization.
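As an illustration of the gradient calculation stage, the sketch below computes first- and second-order gradients for binary logistic loss, one common objective in gradient boosting. It is a CPU-side, pure-Python sketch and not the library's GPU kernel; the function name `logistic_gradients` is hypothetical. The point is that each instance is processed independently, which is why this stage maps naturally onto one GPU thread per instance.

```python
import math


def logistic_gradients(margins, labels):
    """Return a (gradient, hessian) pair per instance for binary logistic loss.

    margins are raw (pre-sigmoid) model outputs; labels are 0 or 1.
    Hypothetical helper for illustration only.
    """
    pairs = []
    for margin, y in zip(margins, labels):
        p = 1.0 / (1.0 + math.exp(-margin))  # predicted probability
        grad = p - y                         # first derivative of the loss
        hess = p * (1.0 - p)                 # second derivative of the loss
        pairs.append((grad, hess))
    return pairs


print(logistic_gradients([0.0, 0.0], [1, 0]))  # [(-0.5, 0.25), (0.5, 0.25)]
```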

Feature Quantile Generation

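The algorithm replaces raw feature values with a quantile representation: each feature is discretized into bins whose boundaries are quantiles of its distribution. This reduces the computationally intensive task of decision tree construction largely to summing gradients per bin, and the quantization step itself runs entirely on the GPU.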
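A minimal sketch of feature quantization is shown below. It assumes exact, unweighted quantiles taken from a full sort of one feature column, which is a simplification; the paper's on-device quantile generation is not reproduced here, and the helper names `quantile_cuts` and `quantize` are hypothetical.

```python
from bisect import bisect_right


def quantile_cuts(values, n_bins):
    """Pick up to n_bins - 1 cut points at evenly spaced ranks of the sorted values."""
    ordered = sorted(values)
    cuts = []
    for k in range(1, n_bins):
        cut = ordered[(k * len(ordered)) // n_bins]
        if not cuts or cut > cuts[-1]:  # skip duplicate cut points
            cuts.append(cut)
    return cuts


def quantize(values, cuts):
    """Map each value to the index of the bin it falls into."""
    return [bisect_right(cuts, v) for v in values]


feature = [5.0, 1.0, 3.0, 2.0, 4.0, 6.0, 8.0, 7.0]
cuts = quantile_cuts(feature, n_bins=4)
print(cuts)                     # [3.0, 5.0, 7.0]
print(quantize(feature, cuts))  # [2, 0, 1, 0, 1, 2, 3, 3]
```

After quantization, tree construction only needs the small integer bin indices, not the original floating-point values.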

Data Compression

A data compression step in the pipeline packs the quantized matrix values tightly, significantly reducing memory consumption. Packing and unpacking use bitwise operations, so values can be decompressed efficiently at runtime.
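The idea of bit packing can be sketched as follows, assuming each quantized value is stored in a fixed number of bits chosen from the number of distinct bin indices. A Python integer stands in for the packed device buffer, and the helper names are hypothetical; the actual memory layout used by the library may differ.

```python
def bits_needed(n_symbols):
    """Bits required to represent integers in the range [0, n_symbols)."""
    return max(1, (n_symbols - 1).bit_length())


def pack(bins, n_symbols):
    """Pack bin indices into one integer, `width` bits per value."""
    width = bits_needed(n_symbols)
    packed = 0
    for i, b in enumerate(bins):
        packed |= b << (i * width)  # place value i at bit offset i * width
    return packed, width


def unpack_at(packed, width, i):
    """Recover the i-th value with a shift and a mask."""
    return (packed >> (i * width)) & ((1 << width) - 1)


bins = [2, 0, 1, 0, 1, 2, 3, 3]
packed, width = pack(bins, n_symbols=4)
print(width)                                                  # 2
print([unpack_at(packed, width, i) for i in range(len(bins))])  # [2, 0, 1, 0, 1, 2, 3, 3]
```

With 4 bins, each value needs 2 bits, so the eight values above fit in 16 bits, compared with 256 bits if each were stored as a 32-bit integer.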

Decision Tree Construction

Algorithm 1 outlines the multi-GPU decision tree construction process. Each GPU handles its own subset of the data and computes a partial gradient histogram; the partial histograms are then merged across devices. Candidate splits are evaluated on the merged histogram using a parallel prefix sum, and the split with the highest gain is selected.
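The histogram-and-scan procedure can be sketched on the CPU for a single feature as follows. Each "device" is simulated by a separate call on its own data shard, the merge stands in for a cross-device reduction, and the running sum in `best_split` is a sequential version of the prefix sum that a GPU would compute in parallel. The gain expression is the standard second-order XGBoost split gain with L2 regularization `reg_lambda`, omitting the per-leaf complexity penalty; all function names are hypothetical.

```python
def build_histogram(bins, grads, hessians, n_bins):
    """Sum gradients and hessians per feature bin for one data shard."""
    hist = [[0.0, 0.0] for _ in range(n_bins)]
    for b, g, h in zip(bins, grads, hessians):
        hist[b][0] += g
        hist[b][1] += h
    return hist


def merge_histograms(partials):
    """Element-wise sum of per-device histograms (stands in for an all-reduce)."""
    n_bins = len(partials[0])
    return [
        [sum(p[b][0] for p in partials), sum(p[b][1] for p in partials)]
        for b in range(n_bins)
    ]


def best_split(hist, reg_lambda=1.0):
    """Scan bins left to right and return (bin_index, gain) of the best split.

    Instances with bin <= bin_index go left. Returns (None, 0.0) if no
    split has positive gain.
    """
    total_g = sum(g for g, _ in hist)
    total_h = sum(h for _, h in hist)

    def score(g, h):
        return g * g / (h + reg_lambda)

    best_bin, best_gain = None, 0.0
    left_g = left_h = 0.0
    for b in range(len(hist) - 1):
        left_g += hist[b][0]  # running (prefix) sums of the histogram
        left_h += hist[b][1]
        gain = 0.5 * (
            score(left_g, left_h)
            + score(total_g - left_g, total_h - left_h)
            - score(total_g, total_h)
        )
        if gain > best_gain:
            best_bin, best_gain = b, gain
    return best_bin, best_gain


# Two simulated devices, each holding three instances of one quantized feature.
shard_a = build_histogram([0, 1, 2], [-1.0, -1.0, 1.0], [1.0, 1.0, 1.0], n_bins=4)
shard_b = build_histogram([3, 0, 3], [1.0, -1.0, 1.0], [1.0, 1.0, 1.0], n_bins=4)
merged = merge_histograms([shard_a, shard_b])
print(best_split(merged))  # (1, 2.25)
```

Because the merged histogram has only as many entries as there are bins, the data exchanged between devices is independent of the number of training instances.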

Figure 1: Gradient Boosting Pipeline.

Evaluation

The evaluation compares the algorithm with two competitors, LightGBM and CatBoost, on multiple datasets using an AWS instance with 8 Tesla V100 GPUs. On the largest dataset (115M rows), the implementation runs close to three times faster than its nearest competitor. Scalability is demonstrated on the airline dataset, where runtime decreases as more GPUs are added.

Implications and Future Work

Practically, these improvements make training on massive datasets considerably faster, which makes XGBoost a strong candidate for large-scale regression, classification, and ranking tasks. More broadly, integrating GPU acceleration into an established machine learning library such as XGBoost provides a path for further work on computationally efficient model training.

The authors plan to extend this work to distributed systems so that it can accommodate datasets that exceed on-device memory, potentially up to the terabyte scale. They also plan to apply similar GPU-accelerated methods to large-scale linear modeling.

Conclusion

The work presented in this paper substantially extends the XGBoost library through GPU acceleration and establishes a foundation for handling larger and more complex datasets efficiently. With the planned extensions, the framework is positioned to remain a widely used tool for both current and emerging data challenges.