
CatBoost: Unbiased Gradient Boosting

Updated 16 March 2026
  • CatBoost is a gradient boosting library that uses ordered boosting and oblivious trees to deliver unbiased, accurate predictions on heterogeneous data.
  • It employs innovative ordered target statistics to encode high-cardinality categorical features, effectively eliminating target leakage.
  • Optimized for both CPU and GPU, CatBoost offers state-of-the-art efficiency and interpretability across domains like astrophysics, finance, and cybersecurity.

CatBoost (Categorical Boosting) is a gradient-boosted decision tree (GBDT) machine learning library designed to deliver unbiased, high-accuracy predictions on heterogeneous datasets containing both numerical and high-cardinality categorical variables. CatBoost’s core algorithmic innovations, notably ordered boosting and permutation-driven target statistics, systematically address target leakage and gradient bias, setting it apart from classical GBDT approaches such as XGBoost and LightGBM. CatBoost supports both CPU and GPU acceleration and has demonstrated state-of-the-art performance and efficiency across domains including astrophysics, insurance, finance, and cybersecurity (Dorogush et al., 2018, Prokhorenkova et al., 2017, Li et al., 2022, So, 2023, Fajar et al., 2024, Wang et al., 2021).

1. Algorithmic Foundations: Ordered Boosting and Oblivious Trees

The CatBoost learning strategy centers on the minimization of an empirical loss functional

J(F) = \sum_{i=1}^n L(Y_i, F(X_i)),

where F is a GBDT ensemble composed of a sequence of base learners (trees) fit to pointwise gradients and, in second-order mode, Hessians of the loss (Prokhorenkova et al., 2017, Dorogush et al., 2018).

Standard GBDT algorithms suffer from “prediction shift”: bias introduced when the gradient (and categorical encoding) calculations for each example x_k rely on models that have already seen x_k (Prokhorenkova et al., 2017). CatBoost eliminates this shift through ordered boosting (exposed in the library as shown after this list):

  • Draw random permutations σ of the data;
  • For each data point at position k in σ, compute its pre-fit gradient and categorical encoding using only the preceding k−1 examples;
  • Maintain logarithmic-sized supporting models per permutation to ensure unbiased, “out-of-fold” gradient and encoding computations while retaining O(sn) per-iteration cost.
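
The mode is selected through the boosting_type constructor argument; a minimal runnable sketch with toy placeholder data:

```python
from catboost import CatBoostClassifier

# Toy placeholder data: column 0 is categorical, column 1 numeric.
X = [["a", 1.0], ["b", 2.0], ["a", 3.0], ["c", 0.5], ["b", 1.5], ["c", 2.5]]
y = [0, 1, 0, 1, 1, 0]

# boosting_type="Ordered" selects ordered boosting (unbiased gradients);
# "Plain" is the classical GBDT scheme, faster but subject to prediction shift.
model = CatBoostClassifier(boosting_type="Ordered", iterations=50, verbose=False)
model.fit(X, y, cat_features=[0])
print(model.predict([["a", 1.2]]))
```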

CatBoost’s base learners are oblivious trees, i.e., symmetric binary trees of fixed depth d, wherein every node at depth l splits on the same feature and threshold across all paths. This yields extremely efficient evaluation: the path to each leaf corresponds to a unique d-bit integer and all leaf values are stored in contiguous arrays (Dorogush et al., 2018, Mironov et al., 2022).
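
Because every level of an oblivious tree applies one shared (feature, threshold) test, the leaf index is simply the concatenation of the d binary split outcomes. An illustrative sketch of this evaluation (not CatBoost's internal code):

```python
import numpy as np

def oblivious_leaf_index(x, splits):
    """Leaf index of sample x in an oblivious tree.

    splits: one (feature_index, threshold) pair per depth level; bit l of
    the index records the outcome of the level-l test.
    """
    idx = 0
    for level, (feature, threshold) in enumerate(splits):
        idx |= int(x[feature] > threshold) << level
    return idx

# Depth-3 tree: 3 shared splits, 2**3 = 8 leaf values in one contiguous array.
splits = [(0, 0.5), (2, 1.0), (1, -0.2)]
leaf_values = np.arange(8, dtype=np.float32) * 0.1
x = np.array([0.7, 0.0, 2.0])
print(leaf_values[oblivious_leaf_index(x, splits)])  # leaf 7 -> 0.7
```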

2. Categorical Feature Encoding: Ordered Target Statistics

Handling categorical variables without information leakage is a central challenge. CatBoost encodes categorical features using ordered target statistics (“CTR” features), computed as follows for feature k, category value x_{i,k}, and permutation σ (Prokhorenkova et al., 2017, Dorogush et al., 2018):

x'_{\sigma_p, k} = \frac{ \sum_{j=1}^{p-1} [x_{\sigma_j, k} = x_{\sigma_p, k}] \cdot Y_{\sigma_j} + a \cdot P }{ \sum_{j=1}^{p-1} [x_{\sigma_j, k} = x_{\sigma_p, k}] + a }.

Here, a is a prior weight and P the prior (e.g., the global mean). Multiple permutations (usually s = 1 or 2) stabilize the encoding. This approach enables robust use of very high-cardinality categoricals and dynamic greedy feature combinations. The encoding is strictly “out-of-fold” with respect to model fitting and all downstream splits, eliminating target leakage (Prokhorenkova et al., 2017, Dorogush et al., 2018, Wang et al., 2021).
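
A minimal numpy sketch of this ordered target statistic for a single column and a single permutation (a is the prior weight; the prior P is taken to be the global target mean):

```python
import numpy as np

def ordered_target_statistic(cats, y, a=1.0, seed=0):
    """Ordered target statistic for one categorical column, one permutation.

    Each encoding uses only the targets of examples that precede it in the
    permutation, so no example ever sees its own label.
    """
    rng = np.random.default_rng(seed)
    n = len(cats)
    prior = float(np.mean(y))                  # P: global target mean
    encoded = np.empty(n, dtype=float)
    counts, sums = {}, {}
    for pos in rng.permutation(n):             # walk the permutation in order
        c = cats[pos]
        cnt, s = counts.get(c, 0), sums.get(c, 0.0)
        encoded[pos] = (s + a * prior) / (cnt + a)
        counts[c], sums[c] = cnt + 1, s + y[pos]
    return encoded

cats = ["red", "blue", "red", "red", "blue"]
y = np.array([1, 0, 1, 0, 1])
print(ordered_target_statistic(cats, y))
```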

3. Boosting Optimization and Tree Construction

At each iteration, CatBoost fits an oblivious tree to negative gradients (and Hessians in Newton mode). After the tree structure is built, optimal leaf values γ_ℓ are set using a regularized Newton step:

\gamma_\ell = - \frac{ \sum_{i \in \ell} g_i }{ \sum_{i \in \ell} h_i + \lambda },

where g_i and h_i are the loss gradient and Hessian for each sample in leaf ℓ, and λ is the L2 leaf regularization parameter (Dorogush et al., 2018, Prokhorenkova et al., 2017, Wang et al., 2021). This procedure generalizes to arbitrary convex loss functions, including regression, classification, and distributional objectives (see CatBoostLSS below).
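
For logloss, for example, g_i = p_i − y_i and h_i = p_i(1 − p_i), so the leaf value can be computed directly; a minimal sketch:

```python
import numpy as np

def newton_leaf_value(y, p, lam=3.0):
    """Regularized Newton step for one leaf under logistic loss.

    y: 0/1 labels of the samples routed to this leaf; p: current predicted
    probabilities; lam: L2 leaf regularization (l2_leaf_reg, default 3.0).
    """
    g = p - y              # first derivative of logloss w.r.t. the raw score
    h = p * (1.0 - p)      # second derivative
    return -g.sum() / (h.sum() + lam)

y = np.array([1, 1, 0, 1])
p = np.array([0.6, 0.4, 0.3, 0.8])
print(newton_leaf_value(y, p))  # positive: shifts leaf scores toward class 1
```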

Key hyperparameters include the number of trees (iterations), tree depth, learning rate (η), L2 leaf regularization (l2_leaf_reg, λ), feature bin count, random strength, and bagging temperature (Dorogush et al., 2018, Prokhorenkova et al., 2017). CatBoost also employs Bayesian-bootstrap gradient weights and random permutations to enhance regularization.
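
These hyperparameters map directly onto constructor arguments in the Python API; a representative (not tuned) configuration:

```python
from catboost import CatBoostRegressor

model = CatBoostRegressor(
    iterations=1000,          # number of trees
    depth=6,                  # oblivious-tree depth d
    learning_rate=0.03,       # eta
    l2_leaf_reg=3.0,          # lambda in the Newton leaf step
    border_count=254,         # feature bin count for histogram split search
    random_strength=1.0,      # noise added to split scores (regularization)
    bagging_temperature=1.0,  # intensity of the Bayesian bootstrap
    verbose=False,
)
```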

4. System Engineering and Computational Performance

CatBoost implements aggressive optimizations for both CPU and GPU backends (Dorogush et al., 2018, Mironov et al., 2022):

  • CPU Inference: All features are pre-binarized; leaf index calculation is performed via bitwise operations; oblivious tree structure enables vectorized, branchless code. AVX2 and AVX-512 kernels with FP16 leaf value support accelerate prediction by 20–70% with negligible numerical error (Mironov et al., 2022).
  • GPU Training: Histogram-based split search; features packed into compact representations; composite CTR encoding mapped via hashing and processed using sort/scan primitives; multi-GPU training achieved via feature-parallelism (Dorogush et al., 2018).
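
Switching between backends requires only the task_type argument; a minimal sketch (assumes a CUDA-capable GPU visible as device 0, with X_train, y_train, and cat_idx as placeholders):

```python
from catboost import CatBoostClassifier

# Histogram-based split search on GPU; "devices" selects which GPUs to use.
model = CatBoostClassifier(
    task_type="GPU",
    devices="0",        # e.g. "0" for a single GPU
    iterations=2000,
    depth=8,
    verbose=False,
)
# model.fit(X_train, y_train, cat_features=cat_idx)
```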

Quantitatively, CatBoost's GPU backend is 2.6–15× faster than its own CPU scoring, and CatBoost outperforms XGBoost and LightGBM in both accuracy and runtime, especially at high tree depths or large ensemble sizes (Dorogush et al., 2018, Mironov et al., 2022).

5. Extensions: Probabilistic Forecasting and Zero-Inflated Modeling

CatBoostLSS is an extension for distributional and probabilistic forecasting that directly predicts parameters of a user-specified distribution (e.g., Normal, Poisson, Generalized Beta, Zero-inflated Poisson) (März, 2020). For K-parameter distributions, CatBoostLSS alternates Newton boosting over each parameter, using the full negative log-likelihood as the loss:

\ell(y; \theta_1, \ldots, \theta_K) = -\ln p(y; \theta_1, \ldots, \theta_K).

Gradients and Hessians for each parameter are propagated and fit with regular boosting rounds and backfitting (März, 2020, So, 2023).
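
As an illustration of this scheme (a derivation sketch, not the CatBoostLSS package API), the per-parameter gradients and Hessians for a Normal(μ, σ) likelihood have closed forms under the log-σ parameterization:

```python
import numpy as np

def normal_nll_grad_hess(y, mu, log_sigma):
    """Gradients/Hessians of -ln N(y; mu, sigma) w.r.t. mu and log(sigma).

    One boosted ensemble per parameter; each round, the ensembles are
    alternately fit to their (g, h) pairs (Newton boosting with backfitting).
    """
    sigma2 = np.exp(2.0 * log_sigma)
    r = y - mu
    g_mu = -r / sigma2                # d l / d mu
    h_mu = 1.0 / sigma2               # d^2 l / d mu^2
    g_ls = 1.0 - r**2 / sigma2        # d l / d log(sigma)
    h_ls = 2.0 * r**2 / sigma2        # d^2 l / d log(sigma)^2
    return (g_mu, h_mu), (g_ls, h_ls)

y = np.array([1.0, 2.0, 0.5])
(g_mu, h_mu), (g_ls, h_ls) = normal_nll_grad_hess(y, np.zeros(3), np.zeros(3))
print(g_mu, g_ls)
```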

Zero-inflated Poisson CatBoost models have also been constructed for count data with a heavy excess of zeros, enabling simultaneous modeling of the mean and the inflation probability within or across tree ensembles (So, 2023).
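
The zero-inflated Poisson likelihood behind these models mixes a point mass at zero with a Poisson component; a minimal numpy sketch of its negative log-likelihood (π is the inflation probability, λ the Poisson mean):

```python
import numpy as np
from scipy.special import gammaln

def zip_nll(y, pi, lam):
    """Negative log-likelihood of a zero-inflated Poisson.

    P(Y=0) = pi + (1 - pi) * exp(-lam)
    P(Y=k) = (1 - pi) * lam**k * exp(-lam) / k!   for k >= 1
    """
    y = np.asarray(y, dtype=float)
    log_p_zero = np.log(pi + (1.0 - pi) * np.exp(-lam))
    log_p_pos = np.log1p(-pi) + y * np.log(lam) - lam - gammaln(y + 1.0)
    return -np.where(y == 0, log_p_zero, log_p_pos).sum()

y = np.array([0, 0, 0, 2, 1, 0, 3])
print(zip_nll(y, pi=0.4, lam=1.2))
```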

6. Interpretability and Feature Analysis

CatBoost computes feature importances by aggregating the loss increase when a split uses a given feature, supporting robust variable selection and diagnostics (Prokhorenkova et al., 2017, Dorogush et al., 2018, Fajar et al., 2024). SHAP (Shapley Additive Explanations) values are supported natively, providing consistent local and global interpretation (see the snippet after the list below):

  • SHAP summary plots and interaction scatter plots expose non-linear and synergistic effects among features, as applied in insurance telematics and phishing detection (So, 2023, Fajar et al., 2024, Li et al., 2022).
  • CatBoost’s permutation-based feature importance avoids the over-reliance on single variables and the overfitting observed in other GBDT models (Fajar et al., 2024).
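
SHAP values are exposed through the library's feature-importance API; a minimal self-contained sketch with toy placeholder data:

```python
from catboost import CatBoostClassifier, Pool

# Toy placeholder data: column 0 categorical, column 1 numeric.
X = [["a", 1.0], ["b", 2.0], ["a", 3.0], ["c", 0.5], ["b", 1.5], ["c", 2.5]]
y = [0, 1, 0, 1, 1, 0]
pool = Pool(X, y, cat_features=[0])

model = CatBoostClassifier(iterations=50, verbose=False)
model.fit(pool)

# Split-based importances (PredictionValuesChange), one score per feature.
print(model.get_feature_importance())

# SHAP values: shape (n_samples, n_features + 1); the last column is the bias.
shap_values = model.get_feature_importance(pool, type="ShapValues")
print(shap_values.shape)
```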

7. Practical Applications and Benchmarking

CatBoost has been deployed in diverse high-data domains:

  • Astrophysics: Regression for photometric redshift estimation in large surveys (DESI Legacy, Euclid), yielding σ_NMAD as low as 0.0156, with error fractions under 1%, and outperforming MLP and Random Forest in both accuracy and AUC (Li et al., 2022, Collaboration et al., 17 Apr 2025).
  • Insurance: Superior pseudo-R² and deviance for auto claim frequency via zero-inflated Poisson boosting, with advanced SHAP-based interpretation (So, 2023).
  • Finance: Enhanced loan risk modeling using target-guided synthetic feature generation, with AUC reaching 98.80% (Wang et al., 2021).
  • Cybersecurity: Robust phishing URL detection, maintaining near-perfect accuracy under aggressive feature selection and outperforming XGBoost/EBM in situations with complex feature interactions (Fajar et al., 2024).

Empirical evaluations across domains confirm that CatBoost outperforms or matches alternative tree-based approaches in most settings, even at default hyperparameters. Notable limitations include degraded prediction quality on out-of-distribution data (e.g., high-redshift galaxies where the training set has gaps) and the need for careful permutation scaling on massive datasets (Li et al., 2022, Collaboration et al., 17 Apr 2025).


