CatBoost: Gradient Boosting for Categorical Data

Updated 4 December 2025
  • CatBoost is an advanced gradient boosting model that efficiently handles categorical features using ordered boosting and bias-corrected target encoding to prevent prediction shift.
  • It employs ordered target statistics and automatic feature combinations to capture high-order interactions, improving generalization and computational efficiency.
  • Empirical results show CatBoost outperforms competitors like XGBoost and LightGBM on diverse datasets while offering robust interpretability and fast CPU/GPU implementations.

Categorical Boosting (CatBoost) refers to a class of gradient boosting models and associated methods that provide principled handling of categorical variables, with the CatBoost library (developed by Prokhorenkova et al.) being the canonical reference implementation. CatBoost is characterized by its theoretically motivated algorithms for unbiased boosting and high-performance permutation-driven target encoding, specifically designed to address the prediction shift and target leakage problems endemic to gradient boosting with categorical features (Dorogush et al., 2018, Prokhorenkova et al., 2017, So, 2023). These innovations lead to empirically superior generalization and computational efficiency on tasks involving high-cardinality and interacting categorical predictors, positioning CatBoost as a primary tool for both academic and applied machine learning focused on structured data.

1. Background: Challenges in Gradient Boosting with Categorical Variables

Gradient boosting machines (GBMs) construct an ensemble $F(x)$ by sequentially adding "weak" base models—typically shallow decision trees—such that each new component $h^t$ minimizes a loss $\mathcal{L}$ over the residuals from the current ensemble. At each iteration, pseudo-responses derived from negative gradients or Hessians of the loss function guide tree growing (Dorogush et al., 2018).
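
The following minimal sketch illustrates this fitting loop for squared-error loss, using scikit-learn's DecisionTreeRegressor as the weak learner; it is a didactic toy, not CatBoost's internal algorithm:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbm_fit(X, y, n_trees=100, learning_rate=0.1, depth=3):
    """Toy gradient boosting for squared-error loss.

    Each tree is fit to the negative gradient of the loss at the
    current ensemble prediction (for squared error, the residuals).
    """
    base_pred = y.mean()                  # initial constant prediction
    F = np.full(len(y), base_pred)
    trees = []
    for _ in range(n_trees):
        residuals = y - F                 # negative gradient of 0.5*(y - F)^2
        tree = DecisionTreeRegressor(max_depth=depth).fit(X, residuals)
        F += learning_rate * tree.predict(X)
        trees.append(tree)
    return base_pred, trees

def gbm_predict(base_pred, trees, X, learning_rate=0.1):
    pred = np.full(X.shape[0], base_pred)
    for tree in trees:
        pred += learning_rate * tree.predict(X)
    return pred
```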

Handling categorical features in GBMs is non-trivial. Traditional schemes such as one-hot encoding introduce high memory and computation costs for high-cardinality variables while ignoring any ordinal or structural information. Simple target-statistics approaches (replacing each category by its empirical mean label) present severe overfitting risk due to target leakage, especially with rare categories or small datasets. Even leave-one-out encodings or greedy statistics do not fully prevent information leakage or bias (Prokhorenkova et al., 2017).
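
A small numeric illustration of the leakage problem with greedy target statistics (a pandas-based toy; the column names are made up): a category that occurs only once is encoded with exactly its own label, so the encoded feature reveals the target.

```python
import pandas as pd

# Toy data: the category "D" occurs exactly once.
df = pd.DataFrame({
    "cat": ["A", "A", "B", "B", "D"],
    "y":   [ 1,   0,   1,   1,   0 ],
})

# Greedy target statistic: each row is encoded with its category's mean
# label computed over the *entire* training set, including the row itself.
greedy_ts = df.groupby("cat")["y"].transform("mean")
print(greedy_ts.tolist())   # the "D" row is encoded as 0.0 -- its own label

# For the singleton category the encoding reproduces the target exactly,
# so a tree can memorize the label through the encoded feature and will
# not generalize to new "D" rows at test time.
```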

2. Core Algorithmic Innovations in CatBoost

CatBoost incorporates the following algorithmic elements, providing unbiased learning and superior empirical accuracy (Prokhorenkova et al., 2017):

  • Ordered boosting: For each iteration, a random permutation $\sigma$ of the training examples is drawn. For each sample $k$, the pseudo-residual and target-statistic encoding use only the prefix of $\sigma$ preceding $k$, ensuring no label-dependent leakage into the encoding or gradient estimation. The resulting supporting models $M_i$ are constructed as a hierarchical sketch to avoid $O(n^2)$ complexity, reducing practical cost to $O(n\log n)$ per tree (Prokhorenkova et al., 2017).
  • Ordered target statistics (TS)/CTR encoding: For each categorical variable and permutation, values are encoded as

$\widehat{x}_k^i = \frac{\sum_{j\in D_k}\mathbf{1}\{x_j^i=x_k^i\}\, y_j + a\, p}{\sum_{j\in D_k}\mathbf{1}\{x_j^i=x_k^i\} + a}$

where $D_k$ is the $\sigma$-prefix, $p$ is a global prior, and $a$ is a smoothing constant. For test points, the full dataset is used. Bayesian bootstrap may be employed to regularize the statistics via Dirichlet-weighted sampling (Prokhorenkova et al., 2017); a minimal sketch of the ordered encoding appears after this list.

  • Feature combinations: High-order interactions are automatically captured by greedily constructing combinations of categoricals at each split, encoding these with ordered TS using the same prefix-sum strategy (Prokhorenkova et al., 2017, Dorogush et al., 2018).
  • Oblivious (symmetric) trees: Each level in the tree applies the same split across all nodes, yielding perfectly balanced structures. This facilitates highly efficient inference (bit-packing of binary splits) and guards against overfitting (Dorogush et al., 2018).
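
The ordered target statistic defined above can be sketched directly: each example is encoded using only the examples that precede it in a random permutation, plus the smoothed prior. This is a single-permutation toy illustration, not the library's incremental multi-permutation implementation:

```python
import numpy as np

def ordered_target_statistic(cats, y, prior, a=1.0, seed=0):
    """Encode a single categorical column with ordered target statistics.

    Each example k sees only the examples that precede it in a random
    permutation sigma, so its own label never leaks into its encoding.
    """
    rng = np.random.default_rng(seed)
    sigma = rng.permutation(len(y))
    counts, sums = {}, {}
    encoded = np.empty(len(y), dtype=float)
    for idx in sigma:                       # walk the permutation prefix by prefix
        c = cats[idx]
        n, s = counts.get(c, 0), sums.get(c, 0.0)
        encoded[idx] = (s + a * prior) / (n + a)
        counts[c], sums[c] = n + 1, s + y[idx]   # update *after* encoding
    return encoded

cats = np.array(["A", "A", "B", "B", "D"])
y = np.array([1, 0, 1, 1, 0], dtype=float)
print(ordered_target_statistic(cats, y, prior=y.mean()))
```

In the library, several permutations are typically used (see the parameterization notes in Section 6), which reduces the variance of these prefix estimates.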

3. Loss Functions, Gradient Computation, and Unbiasedness

CatBoost supports arbitrary twice-differentiable loss functions $\mathcal{L}(y,a)$, including classical regression and classification criteria and, as demonstrated in recent actuarial applications, custom losses such as zero-inflated Poisson (ZIP) for distributional modeling (So, 2023). At each stage, the gradient and (optionally) the Hessian are computed for each training example, denoted

$g_i = \frac{\partial \mathcal{L}(y_i,a)}{\partial a}\Big|_{a=F_{t-1}(x_i)},\qquad h_i = \frac{\partial^2 \mathcal{L}(y_i,a)}{\partial a^2}\Big|_{a=F_{t-1}(x_i)}$

The unique aspect in CatBoost is the use of ordered (unbiased) residuals for both tree-growing and final leaf value updates. Theoretical analysis demonstrates that the ordered construction eliminates the $O(1/n)$ prediction-shift bias that arises when the same data is reused in both the model and gradient computation, a bias present in all prior boosting toolkits (Prokhorenkova et al., 2017).

For advanced losses such as ZIP, CatBoost calculates closed-form gradients and Hessians—see [(So, 2023), Equations (18)-(19)] for detailed expressions—enabling direct fit for complex insurance-claim count models.
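
As an illustration of how such a custom loss plugs in, the Python package accepts a user-defined objective object exposing a calc_ders_range method that returns per-example first and second derivatives. The sketch below uses a plain Poisson log-likelihood for brevity; the ZIP derivatives of So (2023), Equations (18)-(19), would replace the two marked lines. It is a sketch of the plumbing, not the implementation used in that paper.

```python
import numpy as np
from catboost import CatBoostRegressor

class PoissonObjective:
    """Illustrative custom objective (plain Poisson log-likelihood).

    Following the convention of the library's documented Logloss example,
    calc_ders_range returns derivatives of the quantity being maximized,
    evaluated at the current raw score a = log(mu). Example weights are
    ignored here for brevity.
    """
    def calc_ders_range(self, approxes, targets, weights):
        result = []
        for a, y in zip(approxes, targets):
            mu = np.exp(a)
            der1 = y - mu      # d/da  [y*a - exp(a)]   <- replace for ZIP
            der2 = -mu         # d^2/da^2               <- replace for ZIP
            result.append((der1, der2))
        return result

# Hypothetical usage on a claim-count dataset (X, y, cat_cols assumed):
# model = CatBoostRegressor(loss_function=PoissonObjective(),
#                           eval_metric="Poisson", iterations=500)
# model.fit(X, y, cat_features=cat_cols)
```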

4. High-Performance Implementations and Computational Characteristics

CatBoost implements both CPU and GPU backends with extensive algorithmic optimizations (Dorogush et al., 2018):

  • Histogram-based split search: Features (including continuous, binned, and encoded categorical) are pre-binned, enabling rapid histogram-based gain computation for each candidate split.
  • Efficient data layouts: Binned feature groups are organized into 32-bit words, optimizing for memory bandwidth and GPU shared memory utilization.
  • Oblivious tree inference: Bit-wise evaluation of splits enables highly vectorized, branch-free inference. Each tree's path corresponds to a binary code of depth $D$, yielding a look-up table for final prediction; a minimal sketch of this indexing scheme appears after this list. CatBoost achieves 25×–60× faster scoring than XGBoost or LightGBM for comparable ensembles (Dorogush et al., 2018).
  • Logarithmic-complexity ordered boosting: Supporting models are maintained at powers-of-two prefixes to avoid the $O(n^2)$ overhead inherent to naïve ordered boosting, with practical complexity per tree of $O(n\log n)$ (Prokhorenkova et al., 2017).
  • GPU performance: CatBoost achieves up to 15× speedup on real-world data (e.g., Criteo dataset on Tesla V100), with training times for large datasets improved by an order of magnitude relative to CPU baselines (Dorogush et al., 2018).
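
The leaf-index computation behind this bit-packed inference can be sketched in a few lines; the code below is an illustrative toy (NumPy, single tree, hypothetical splits and leaf values), not CatBoost's vectorized implementation:

```python
import numpy as np

def oblivious_tree_predict(X, splits, leaf_values):
    """Evaluate one oblivious (symmetric) tree of depth D = len(splits).

    Every level applies the same (feature, threshold) split to all nodes,
    so a sample's leaf is just the integer formed by its D split bits --
    no per-node branching is needed.
    """
    index = np.zeros(X.shape[0], dtype=np.int64)
    for level, (feature, threshold) in enumerate(splits):
        bit = (X[:, feature] > threshold).astype(np.int64)
        index |= bit << level          # pack this level's bit into the leaf index
    return leaf_values[index]

# Toy tree of depth 2: 2^2 = 4 leaves.
splits = [(0, 0.5), (1, 1.5)]
leaf_values = np.array([0.1, 0.4, -0.2, 0.7])
X = np.array([[0.2, 2.0], [0.9, 1.0]])
print(oblivious_tree_predict(X, splits, leaf_values))  # [-0.2  0.4]
```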

5. Empirical Performance and Comparative Analysis

CatBoost has been evaluated on a diverse array of benchmarks, including UCI Adult, Amazon Reviews, Criteo, Epsilon, and proprietary actuarial/telematics datasets. CatBoost consistently delivers statistically significant improvements in log-loss and AUC over leading alternatives (LightGBM, XGBoost) (Dorogush et al., 2018, Prokhorenkova et al., 2017, So, 2023):

  • On Amazon Reviews: CatBoost achieves a log-loss of 0.1377, versus 0.1636 for LightGBM (18.8% worse) and 0.1633 for XGBoost (18.6% worse) (Dorogush et al., 2018).
  • For zero-inflated Poisson boosting on French MTPL data: CatBoost ZIPB2 achieves $R^2 = 0.520$ and deviance $0.970$, outperforming LightGBM ($R^2 = 0.453$, deviance $1.105$) and XGBoost ($R^2 = 0.455$, deviance $1.100$) (So, 2023).
  • Ablation studies indicate that permutation-driven ordered TS yields a 0.5–1.5% improvement in log-loss, ordered boosting adds a further 0.2–0.7% gain, and high-order feature combinations reduce log-loss by up to 11.3% on select datasets (Dorogush et al., 2018, Prokhorenkova et al., 2017).
  • CatBoost demonstrates robustness on high-cardinality categorical and heterogeneous data, delivering superior out-of-sample calibration and predictive accuracy in sparse-data regimes (So, 2023).

6. Interpretability and Practical Implementation

CatBoost offers interpretability tools essential for scientific and actuarial applications (So, 2023):

  • Feature importance calculated as average gain across splits, normalized to sum to 100%.
  • Feature interaction strength measured via difference in average prediction with/without joint splits on feature pairs.
  • SHAP value computation compatible with standard GBDT SHAP toolkits, enabling per-instance explanation and marginal effect visualization (a retrieval sketch follows this list).
  • Greedy feature combination construction ensures discovery of high-order interactions, particularly critical for telematics and insurance pricing applications.
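
A brief, hedged sketch of how these diagnostics are commonly retrieved from a fitted model through the Python package's get_feature_importance interface; the fitted model and data objects below are assumed to exist:

```python
from catboost import Pool

# `model` is assumed to be a fitted CatBoostClassifier/Regressor and
# X, y, cat_cols the training frame, labels, and categorical columns.
pool = Pool(X, y, cat_features=cat_cols)

# Split-gain feature importances, normalized to sum to 100.
importances = model.get_feature_importance(type="FeatureImportance")

# Pairwise interaction strengths (feature index, feature index, score).
interactions = model.get_feature_importance(type="Interaction")

# Per-instance SHAP values: an (n_samples, n_features + 1) array whose
# last column is the expected (baseline) prediction.
shap_values = model.get_feature_importance(pool, type="ShapValues")
```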

Parameterization is straightforward: only categorical columns need to be specified, with hyperparameters such as learning rate, tree depth (default 6–8), L2 regularization, one-hot thresholds (≤2–5 unique values default to one-hot encoding), number of permutations (1–3, increasing for stability), and bagging subsample or temperature (Dorogush et al., 2018). Early stopping via validation monitoring is standard practice.
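
A hedged end-to-end sketch of this parameterization with the Python package; the dataset objects, column list, and specific hyperparameter values below are illustrative assumptions rather than recommendations from the cited papers:

```python
from catboost import CatBoostClassifier, Pool

# X_train, y_train, X_valid, y_valid and cat_cols (names or indices of the
# categorical columns) are assumed to be prepared elsewhere.
train_pool = Pool(X_train, y_train, cat_features=cat_cols)
valid_pool = Pool(X_valid, y_valid, cat_features=cat_cols)

model = CatBoostClassifier(
    iterations=2000,
    learning_rate=0.05,
    depth=6,                  # oblivious-tree depth, typically 6-8
    l2_leaf_reg=3.0,
    one_hot_max_size=4,       # categories with <=4 distinct values are one-hot encoded
    bagging_temperature=1.0,  # Bayesian bootstrap intensity
    loss_function="Logloss",
    eval_metric="AUC",
    random_seed=42,
)

# Early stopping against the validation pool, as noted above.
model.fit(train_pool, eval_set=valid_pool,
          early_stopping_rounds=100, verbose=200)
```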

7. Limitations, Practical Considerations, and Comparative Perspective

While CatBoost’s design makes it a leading gradient boosting framework for categorical data, it implicitly imposes only a linear ("ordinal") structure on categoricals via its mean-target encoding, similar to other statistical boosting libraries. In domains where categorical variables possess domain-specific or graph structure (e.g., spatial adjacency, taxonomic hierarchies), methods such as StructureBoost may provide further improvements by explicitly encoding this structure and offering graph-aware split sampling (Lucena, 2020). CatBoost’s approach does not naturally interpolate to unseen categorical values in the way that graph-based structured boosting can.

A plausible implication is that, for tasks with highly structured or relational categorical domains, direct incorporation of external structure (e.g., adjacency graphs) as in StructureBoost may yield improved calibration and robustness, while CatBoost remains advantageous for general tabular data and scenarios requiring efficient, bias-corrected encoding for large-scale heterogeneous inputs.

References

  • CatBoost: gradient boosting with categorical features support (Dorogush et al., 2018)
  • CatBoost: unbiased boosting with categorical features (Prokhorenkova et al., 2017)
  • Enhanced Gradient Boosting for Zero-Inflated Insurance Claims and Comparative Analysis of CatBoost, XGBoost, and LightGBM (So, 2023)
  • StructureBoost: Efficient Gradient Boosting for Structured Categorical Variables (Lucena, 2020)