CatBoost: Unbiased Gradient Boosting
- CatBoost is a gradient boosting library that uses ordered boosting and oblivious trees to deliver unbiased, accurate predictions on heterogeneous data.
- It employs innovative ordered target statistics to encode high-cardinality categorical features, effectively eliminating target leakage.
- Optimized for both CPU and GPU, CatBoost offers state-of-the-art efficiency and interpretability across domains like astrophysics, finance, and cybersecurity.
CatBoost (Categorical Boosting) is a gradient-boosted decision tree (GBDT) machine learning library designed to deliver unbiased, high-accuracy predictions on heterogeneous datasets containing both numerical and high-cardinality categorical variables. CatBoost’s core algorithmic innovations, notably ordered boosting and permutation-driven target statistics, systematically address target leakage and gradient bias, setting it apart from classical GBDT approaches such as XGBoost and LightGBM. CatBoost supports both CPU and GPU acceleration and has demonstrated state-of-the-art performance and efficiency across domains including astrophysics, insurance, finance, and cybersecurity (Dorogush et al., 2018, Prokhorenkova et al., 2017, Li et al., 2022, So, 2023, Fajar et al., 2024, Wang et al., 2021).
1. Algorithmic Foundations: Ordered Boosting and Oblivious Trees
The CatBoost learning strategy centers on the minimization of an empirical loss functional

$$\mathcal{L}(F) = \frac{1}{n}\sum_{i=1}^{n} \ell\big(y_i, F(x_i)\big),$$

where $F(x) = \sum_{t=1}^{T} \alpha\, h^{t}(x)$ is a GBDT ensemble composed of a sequence of base learners (trees) $h^{t}$ fit to pointwise gradients and, in second-order mode, Hessians of the loss (Prokhorenkova et al., 2017, Dorogush et al., 2018).
Standard GBDT algorithms suffer from “prediction shift”: bias introduced when the gradient (and categorical-encoding) calculations for each example $x_k$ rely on models that have already seen $x_k$ (Prokhorenkova et al., 2017). CatBoost eliminates this shift through ordered boosting:
- Draws $s$ random permutations $\sigma_1, \dots, \sigma_s$ of the data;
- For each data point $x_k$ at position $\sigma(k)$ in permutation $\sigma$, its pre-fit gradient and categorical encoding are computed using only the preceding examples $\{x_j : \sigma(j) < \sigma(k)\}$;
- Maintains $O(\log n)$ supporting models per permutation to ensure unbiased, “out-of-fold” gradient and encoding computations while retaining $O(n)$ per-iteration cost.
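The leakage-free gradient idea can be illustrated with a minimal pure-Python sketch. Here a running mean stands in for the per-permutation supporting model (an illustrative simplification, not CatBoost's actual learner): each example's residual, the negative gradient of squared loss, is computed from a model fit only on examples preceding it in the permutation.

```python
import random

def ordered_residuals(y, perm=None, seed=0):
    """Ordered boosting sketch: each example's residual (negative gradient of
    squared loss) comes from a 'model' fit only on examples that precede it in
    a permutation. A running mean stands in for the supporting model."""
    rng = random.Random(seed)
    if perm is None:
        perm = list(range(len(y)))
        rng.shuffle(perm)
    residuals = [0.0] * len(y)
    running_sum, count = 0.0, 0
    for k in perm:
        pred = running_sum / count if count else 0.0  # sees preceding examples only
        residuals[k] = y[k] - pred                    # gradient never sees y[k]
        running_sum += y[k]
        count += 1
    return residuals
```

In a classical GBDT, `pred` would come from a model trained on all of `y`, including `y[k]`, which is exactly the source of the prediction shift described above.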
CatBoost’s base learners are oblivious trees, i.e., symmetric binary trees of fixed depth $d$, wherein every node at depth $j$ splits on the same feature and threshold across all paths. This yields extremely efficient evaluation: the path to each leaf corresponds to a unique $d$-bit integer, and all leaf values are stored in contiguous arrays (Dorogush et al., 2018, Mironov et al., 2022).
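A short sketch of this evaluation scheme: because every level applies the same (feature, threshold) test, the leaf index is assembled bit by bit, with no per-node branching on tree structure.

```python
def oblivious_leaf_index(x, splits):
    """Evaluate an oblivious tree of depth d = len(splits): level j applies the
    same (feature, threshold) test everywhere, so the leaf index is a d-bit
    integer whose j-th bit records the outcome of split j."""
    idx = 0
    for j, (feat, thr) in enumerate(splits):
        idx |= (x[feat] > thr) << j   # branchless: the comparison is 0 or 1
    return idx

def oblivious_predict(x, splits, leaf_values):
    """leaf_values is a flat, contiguous array of length 2**d."""
    return leaf_values[oblivious_leaf_index(x, splits)]
```

This bit-assembly is what the SIMD kernels discussed later vectorize across many samples at once.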
2. Categorical Feature Encoding: Ordered Target Statistics
Handling categorical variables without information leakage is a central challenge. CatBoost encodes categorical features using ordered target statistics (“CTR” features) computed as follows, for feature $i$, category value $x_k^i$, and permutation $\sigma$ (Prokhorenkova et al., 2017, Dorogush et al., 2018):

$$\hat{x}_k^i = \frac{\sum_{j:\,\sigma(j) < \sigma(k)} \mathbb{1}\!\left[x_j^i = x_k^i\right] y_j + a\,P}{\sum_{j:\,\sigma(j) < \sigma(k)} \mathbb{1}\!\left[x_j^i = x_k^i\right] + a}$$

Here, $a > 0$ is a prior weight and $P$ the prior (e.g., the global target mean). Multiple permutations (usually $1$ or $2$) stabilize the encoding. This approach enables robust use of very high-cardinality categoricals and dynamic greedy feature combinations. The encoding is strictly “out-of-fold” with respect to model fitting and all downstream splits, eliminating target leakage (Prokhorenkova et al., 2017, Dorogush et al., 2018, Wang et al., 2021).
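The formula above can be sketched in a few lines of pure Python (an illustrative single-permutation version; the global target mean serves as the prior $P$):

```python
def ordered_target_statistics(categories, targets, perm, a=1.0):
    """Ordered target statistic (CTR) sketch: the encoding for example k uses
    only the examples that precede k in the permutation, smoothed toward the
    prior P (here the global target mean) with weight a."""
    n = len(categories)
    P = sum(targets) / n
    sums, counts = {}, {}        # per-category running statistics
    encoded = [0.0] * n
    for k in perm:
        c = categories[k]
        s, m = sums.get(c, 0.0), counts.get(c, 0)
        encoded[k] = (s + a * P) / (m + a)   # excludes y[k]: no target leakage
        sums[c] = s + targets[k]
        counts[c] = m + 1
    return encoded
```

Note that the first occurrence of a category along the permutation receives the pure prior, which is why averaging over several permutations helps stabilize the encoding.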
3. Boosting Optimization and Tree Construction
At each iteration, CatBoost fits an oblivious tree to the negative gradients (and Hessians, in Newton mode). After the tree structure is built, optimal leaf values are set using a regularized Newton step:

$$w_j = -\frac{\sum_{i \in \mathrm{leaf}\, j} g_i}{\sum_{i \in \mathrm{leaf}\, j} h_i + \lambda}$$

where $g_i$, $h_i$ are the loss gradient and Hessian for each sample $i$ in leaf $j$, and $\lambda$ is the L2 leaf regularization parameter (Dorogush et al., 2018, Prokhorenkova et al., 2017, Wang et al., 2021). This procedure generalizes to arbitrary convex loss functions, including regression, classification, and distributional objectives (see CatBoostLSS below).
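A direct transcription of this Newton step (a minimal sketch; CatBoost's internal implementation differs in data layout):

```python
def newton_leaf_values(leaf_assignment, grads, hess, n_leaves, l2_leaf_reg=3.0):
    """Regularized Newton step for leaf values:
    w_j = -sum(g_i) / (sum(h_i) + lambda) over samples i in leaf j."""
    g_sum = [0.0] * n_leaves
    h_sum = [0.0] * n_leaves
    for leaf, g, h in zip(leaf_assignment, grads, hess):
        g_sum[leaf] += g
        h_sum[leaf] += h
    return [-g_sum[j] / (h_sum[j] + l2_leaf_reg) for j in range(n_leaves)]
```

For squared loss, $h_i = 1$, so this reduces to a shrunken mean of the residuals in each leaf.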
Key hyperparameters include: the number of trees (iterations), tree depth $d$, learning rate ($\alpha$), L2 leaf regularization ($\lambda$), feature bin count, random strength, and bagging temperature (Dorogush et al., 2018, Prokhorenkova et al., 2017). CatBoost additionally employs Bayesian-bootstrap gradient weights and random permutations to enhance regularization.
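For orientation, these knobs correspond to parameter names in CatBoost's Python API; the values below are illustrative placeholders, not tuned recommendations.

```python
# Illustrative CatBoost parameter dictionary (Python API names).
params = {
    "iterations": 1000,          # number of trees in the ensemble
    "depth": 6,                  # oblivious-tree depth d
    "learning_rate": 0.03,       # step size (alpha)
    "l2_leaf_reg": 3.0,          # Newton-step regularizer (lambda)
    "border_count": 254,         # feature bin count (quantization borders)
    "random_strength": 1.0,      # randomization added to split scores
    "bagging_temperature": 1.0,  # Bayesian bootstrap weight intensity
}
```

Such a dictionary can be passed to a `CatBoostRegressor`/`CatBoostClassifier` constructor via keyword expansion.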
4. System Engineering and Computational Performance
CatBoost implements aggressive optimizations for both CPU and GPU backends (Dorogush et al., 2018, Mironov et al., 2022):
- CPU Inference: All features are pre-binarized; leaf index calculation is performed via bitwise operations; oblivious tree structure enables vectorized, branchless code. AVX2 and AVX-512 kernels with FP16 leaf value support accelerate prediction by 20–70% with negligible numerical error (Mironov et al., 2022).
- GPU Training: Histogram-based split search; features packed into compact representations; composite CTR encoding mapped via hashing and processed using sort/scan primitives; multi-GPU training achieved via feature-parallelism (Dorogush et al., 2018).
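The histogram-based split search mentioned above reduces to per-bin gradient accumulation followed by a linear scan over thresholds. A scalar sketch (using the common squared-gradient gain criterion as an illustrative scoring rule, not CatBoost's exact one):

```python
def best_histogram_split(binned_feature, grads, n_bins):
    """Histogram split search sketch: accumulate gradient sums per feature bin,
    then score each threshold by gain ~ G_L^2/n_L + G_R^2/n_R - G^2/n."""
    g_sum = [0.0] * n_bins
    cnt = [0] * n_bins
    for b, g in zip(binned_feature, grads):
        g_sum[b] += g
        cnt[b] += 1
    G, N = sum(g_sum), len(grads)
    best_gain, best_bin = 0.0, None
    GL, NL = 0.0, 0
    for b in range(n_bins - 1):          # candidate threshold after bin b
        GL += g_sum[b]
        NL += cnt[b]
        NR = N - NL
        if NL == 0 or NR == 0:
            continue
        GR = G - GL
        gain = GL * GL / NL + GR * GR / NR - G * G / N
        if gain > best_gain:
            best_gain, best_bin = gain, b
    return best_bin, best_gain
```

Working on bins rather than raw values is what makes the GPU histogram kernels and the pre-binarized CPU inference path possible.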
Quantitatively, CatBoost’s GPU training is 2.6–15× faster than its own CPU implementation, and CatBoost outperforms XGBoost and LightGBM in both accuracy and runtime, especially at high tree depths or large ensemble sizes (Dorogush et al., 2018, Mironov et al., 2022).
5. Extensions: Probabilistic Forecasting and Zero-Inflated Modeling
CatBoostLSS is an extension for distributional and probabilistic forecasting that directly predicts the parameters of a user-specified distribution (e.g., Normal, Poisson, Generalized Beta, zero-inflated Poisson) (März, 2020). For a $K$-parameter distribution $f(y \mid \theta_1, \dots, \theta_K)$, CatBoostLSS alternates Newton boosting over each parameter, using the full negative log-likelihood as the loss:

$$\mathcal{L} = -\sum_{i=1}^{n} \log f\big(y_i \mid \theta_1(x_i), \dots, \theta_K(x_i)\big)$$
Gradients and Hessians for each parameter are propagated and fit with regular boosting rounds and backfitting (März, 2020, So, 2023).
Zero-inflated Poisson CatBoost models have also been constructed for count data with heavy excess zeros, enabling simultaneous modeling of mean and inflation probability within or across tree ensembles (So, 2023).
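As a concrete instance of such a two-parameter objective, the zero-inflated Poisson negative log-likelihood over per-sample parameters $(\mu(x_i), \pi(x_i))$ can be written as follows (an illustrative sketch of the loss being minimized, not a library implementation):

```python
import math

def zip_nll(y, mu, pi):
    """Zero-inflated Poisson NLL: with probability pi the count is a structural
    zero; otherwise y ~ Poisson(mu). mu and pi are per-sample parameters."""
    nll = 0.0
    for yi, m, p in zip(y, mu, pi):
        if yi == 0:
            lik = p + (1 - p) * math.exp(-m)
        else:
            lik = (1 - p) * math.exp(-m) * m**yi / math.factorial(yi)
        nll -= math.log(lik)
    return nll
```

A boosting model for this objective alternates Newton rounds on the trees predicting $\mu$ and those predicting $\pi$, exactly the backfitting scheme described above.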
6. Interpretability and Feature Analysis
CatBoost computes feature importances by aggregating the loss increase when a split uses a given feature, supporting robust variable selection and diagnosis (Prokhorenkova et al., 2017, Dorogush et al., 2018, Fajar et al., 2024). It also natively supports SHAP (SHapley Additive exPlanations) values, which provide consistent local and global interpretation:
- SHAP summary plots and interaction scatter plots expose non-linear and synergistic effects among features, as applied in insurance telematics and phishing detection (So, 2023, Fajar et al., 2024, Li et al., 2022).
- CatBoost’s permutation-based feature importance avoids over-reliance on single variables or overfitting seen in other GBDT models (Fajar et al., 2024).
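The permutation-based importance idea is model-agnostic and can be sketched independently of any library (the `predict` and `metric` callables here are user-supplied stand-ins, not CatBoost API):

```python
import random

def permutation_importance(predict, X, y, feature, metric, seed=0):
    """Permutation importance sketch: shuffle one feature column and measure
    how much the loss metric degrades relative to the unshuffled baseline."""
    rng = random.Random(seed)
    base = metric(y, [predict(x) for x in X])
    col = [x[feature] for x in X]
    rng.shuffle(col)
    X_perm = [list(x) for x in X]
    for x, v in zip(X_perm, col):
        x[feature] = v
    shuffled = metric(y, [predict(x) for x in X_perm])
    return shuffled - base   # increase in loss when the feature is destroyed
```

A feature the model ignores scores exactly zero, which is what guards against over-attributing importance to a single variable.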
7. Practical Applications and Benchmarking
CatBoost has been deployed in diverse high-data domains:
- Astrophysics: Regression for photometric redshift estimation in large surveys (DESI Legacy, Euclid), yielding σ_NMAD as low as 0.0156, with error fractions under 1%, and outperforming MLP and Random Forest in both accuracy and AUC (Li et al., 2022, Collaboration et al., 17 Apr 2025).
- Insurance: Superior pseudo-R² and deviance for auto claim frequency via zero-inflated Poisson boosting, with advanced SHAP-based interpretation (So, 2023).
- Finance: Enhanced loan risk modeling using target-guided synthetic feature generation, with AUC reaching 98.80% (Wang et al., 2021).
- Cybersecurity: Robust phishing URL detection, maintaining near-perfect accuracy under aggressive feature selection and outperforming XGBoost/EBM in situations with complex feature interactions (Fajar et al., 2024).
Empirical evaluations across domains confirm that CatBoost outperforms or matches alternative tree-based approaches in most settings, even at default hyperparameters. Notable limitations include degraded prediction quality on out-of-distribution data (e.g., high-redshift galaxies with training-set gaps) and the need for careful permutation scaling on massive datasets (Li et al., 2022, Collaboration et al., 17 Apr 2025).
References:
- CatBoost: gradient boosting with categorical features support (Dorogush et al., 2018)
- CatBoost: unbiased boosting with categorical features (Prokhorenkova et al., 2017)
- Optimization of Oblivious Decision Tree Ensembles Evaluation for CPU (Mironov et al., 2022)
- CatBoostLSS: An extension of CatBoost to probabilistic forecasting (März, 2020)
- Enhanced Gradient Boosting for Zero-Inflated Insurance Claims… (So, 2023)
- Euclid preparation. Estimating galaxy physical properties using CatBoost… (Collaboration et al., 17 Apr 2025)
- Photometric redshift estimation… DESI Legacy Imaging (Li et al., 2022)
- Enhancing Phishing Detection… CatBoost, XGBoost, and EBM Models (Fajar et al., 2024)
- Classification of Fermi-LAT unidentified gamma-ray sources using CatBoost… (Coronado-Blázquez, 2022)
- CatBoost model with synthetic features in application to loan risk… (Wang et al., 2021)