
Cyclic Gradient Boosting

Updated 30 January 2026
  • Cyclic gradient boosting is a coordinate-descent approach that sequentially cycles through parameter updates to minimize the negative log-likelihood loss.
  • It employs per-parameter hyperparameters such as tree depth and learning rate, enabling tailored regularization and model complexity control.
  • While offering flexible customization for actuarial claim frequency and severity modeling, cyc-GBM incurs higher computational cost compared to standard boosting methods.

Cyclic gradient boosting, abbreviated as cyc-GBM, is a coordinate-descent style algorithm for probabilistic decision tree ensembles that generalizes classical gradient boosting to multi-parameter distributions by sequentially cycling through updates to each parameter. The method fits models in which the conditional distribution $Y \sim F(y; p_1(\mathbf{x}), \ldots, p_K(\mathbf{x}))$ depends on $K$ parameters, and is particularly applicable in actuarial contexts for claim frequency and severity prediction. Cyc-GBM explicitly interleaves the boosting of each parameter within each iteration, allowing differentiated model complexity and regularization across parameters.

1. Mathematical Formulation

The objective of cyc-GBM is to minimize the negative log-likelihood loss over the training set $\mathcal{D}$:

$$\mathcal{L}_{\mathrm{total}} = \sum_{i=1}^{n} L(y_i, \mathbf{p}_i),$$

where $\mathbf{p}_i = [p_{1,i}, \ldots, p_{K,i}]^\top$ are the current parameter estimates for observation $i$, and each unconstrained tree output $f_k(\mathbf{x}) \in \mathbb{R}$ is mapped to $p_k$ via a link function $\phi_k$, i.e., $p_k = \phi_k^{-1}(f_k)$. For each boosting iteration and parameter $k$, cyc-GBM constructs the vector of predictions:

$$\hat{\mathbf{s}}_{k,i}^{(m^*)} = \big[f_1^{(m)}, \ldots, f_{k-1}^{(m)}, f_k^{(m-1)}, f_{k+1}^{(m-1)}, \ldots, f_K^{(m-1)}\big](\mathbf{x}_i),$$

with parameters before $k$ already updated in the current iteration, and parameter $k$ and those after it still at their previous values.

The negative gradient (pseudo-residual) for $p_k$ is given by:

$$g_{i,k}^{(m)} = -\left.\frac{\partial L(y_i, \mathbf{p})}{\partial p_k}\right|_{\mathbf{p} = \phi^{-1}\big(\hat{\mathbf{s}}_{k,i}^{(m^*)}\big)}.$$

Regression trees of depth $d_k$ are fit to $\{(\mathbf{x}_i, g_{i,k}^{(m)}) : i = 1, \ldots, n\}$, and for each leaf $j$ a one-dimensional line search solves:

$$b_{j,k}^{(m)} = \arg\min_b \sum_{i:\, \mathbf{x}_i \in R_{j,k}^m} L\Big(y_i, \phi^{-1}\big(\hat{\mathbf{s}}_{k,i}^{(m^*)}\big) + \mathbf{u}_k \, b\Big),$$

where $\mathbf{u}_k$ is the unit vector in the $k$th coordinate. The tree prediction is updated via:

$$f_k^{(m)}(\mathbf{x}) = f_k^{(m-1)}(\mathbf{x}) + \lambda_k \sum_{j=1}^{J_{m,k}} b_{j,k}^{(m)} \, 1_{\mathbf{x} \in R_{j,k}^m}.$$

The process cycles through $k = 1, \ldots, K$ for each boosting iteration $m$ and stops updating parameter $k$ once all $M_k$ steps are used (Chevalier et al., 2024).
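The per-iteration cycle can be sketched in Python for a two-parameter Gaussian model, with the identity link on the mean and the log link on the standard deviation. This is a minimal illustrative sketch under those assumed choices, not the authors' reference implementation; `neg_log_lik` and `cyc_gbm_step` are hypothetical names, and the gradients are taken on the unconstrained score scale.

```python
# Illustrative sketch of one cyc-GBM boosting iteration (assumed setup):
# Y ~ N(mu, sigma^2), with f_1 = mu (identity link) and f_2 = log(sigma).
import numpy as np
from scipy.optimize import minimize_scalar
from sklearn.tree import DecisionTreeRegressor

def neg_log_lik(y, f1, f2):
    """Gaussian negative log-likelihood (up to a constant)."""
    mu, sigma = f1, np.exp(f2)
    return 0.5 * ((y - mu) / sigma) ** 2 + np.log(sigma)

def cyc_gbm_step(X, y, f, depths=(2, 2), lrs=(0.1, 0.1)):
    """One pass over k = 1..K: pseudo-residuals, tree fit, per-leaf line search."""
    for k in range(2):
        mu, sigma = f[0], np.exp(f[1])
        # Negative gradient of the loss w.r.t. the k-th score, at current predictions.
        if k == 0:
            g = (y - mu) / sigma ** 2
        else:
            g = ((y - mu) / sigma) ** 2 - 1.0
        tree = DecisionTreeRegressor(max_depth=depths[k]).fit(X, g)
        leaves = tree.apply(X)
        step = np.zeros_like(y, dtype=float)
        for leaf in np.unique(leaves):
            idx = leaves == leaf
            # One-dimensional line search over the exact loss in this leaf.
            obj = lambda b: neg_log_lik(
                y[idx],
                f[0][idx] + (b if k == 0 else 0.0),
                f[1][idx] + (b if k == 1 else 0.0),
            ).sum()
            step[idx] = minimize_scalar(obj).x
        f[k] = f[k] + lrs[k] * step  # damped update with learning rate lambda_k
    return f
```

Each parameter's pseudo-residuals are recomputed from the predictions left by the preceding parameter's update within the same iteration, which is what makes the scheme cyclic rather than parallel.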

2. End-to-End Implementation Procedure

Training cyc-GBM begins by initializing each parameter's model $f_k^{(0)}(\mathbf{x})$ as the global maximum likelihood estimate under homogeneity. For each boosting iteration $m$ up to $M_{\max} = \max_k M_k$ and for each $k$, the update sequence is:

  1. If $m > M_k$, skip the update and retain the previous values for parameter $k$.
  2. Form current predictions $\hat{\mathbf{s}}_{k,i}$ for all $i$.
  3. Compute pseudo-responses $g_{i,k}$ using the gradient of the negative log-likelihood.
  4. Fit a regression tree to these pseudo-responses to create leaf partitions $R_{j,k}^m$.
  5. For each leaf $j$, solve the line search for the best increment $b_{j,k}$.
  6. Update the model for parameter $k$ using the learning rate $\lambda_k$ and the computed leaf increments.

At prediction time for a new $\mathbf{x}$, output $(\hat{p}_1, \ldots, \hat{p}_K) = \big(\phi_1^{-1}(f_1^{(M)}(\mathbf{x})), \ldots, \phi_K^{-1}(f_K^{(M)}(\mathbf{x}))\big)$ (Chevalier et al., 2024).
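The homogeneous initialization and the prediction-time link inversion can be sketched as follows, again assuming a two-parameter Gaussian model with identity and log links; the helper names are hypothetical.

```python
# Sketch of initialization and prediction mapping for an assumed
# two-parameter Gaussian model: f_1 = mu, f_2 = log(sigma).
import numpy as np

def init_homogeneous_mle(y):
    """f_k^(0): global MLE under homogeneity, expressed on the link (score) scale."""
    mu0 = y.mean()                 # identity link: f_1 = mu
    sigma0 = y.std()               # log link: f_2 = log(sigma)
    return np.array([mu0, np.log(sigma0)])

def predict_params(f1, f2):
    """Map unconstrained scores back through the inverse link functions."""
    return f1, np.exp(f2)          # (mu_hat, sigma_hat)
```

Starting from the homogeneous MLE means the ensemble only has to learn the $\mathbf{x}$-dependent deviations from a globally optimal constant model.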

3. Distinctions from Standard Gradient Boosting Variants

Cyc-GBM diverges from standard GBM (Friedman 2001) and XGBoost-style methods in several respects:

  • Tree Growing for Multiple Parameters: Classical GBM fits a single tree sequence to one loss, typically the mean; cyc-GBM fits $K$ tree sequences for $K$ parameters, cycling through them within each iteration.
  • Leaf Optimization: XGBoost and LightGBM employ a second-order (Newton) loss approximation and closed-form leaf weights, while cyc-GBM uses a direct line-search over the original loss and only utilizes the gradient.
  • Parameter Update Schemes: XGBoostLSS implements a sequential update strategy cycling through parameters with repeated passes. Cyc-GBM executes a single $k = 1, \ldots, K$ pass each boosting cycle, allowing individual control of tree depth, learning rates, and iterations per parameter.
  • Model Complexity Control: Cyc-GBM's architecture allows parameters to have distinct tree complexities and regularizations depending on domain knowledge or modeling requirements.
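The difference in leaf optimization can be made concrete in a toy single-leaf setting with squared-error loss, where the second-order closed form and the direct line search happen to coincide; for general losses they do not. This is an illustrative sketch, not library code.

```python
# Toy single-leaf comparison (assumed setting): squared-error loss.
import numpy as np
from scipy.optimize import minimize_scalar

y = np.array([1.2, 0.8, 1.5])    # targets falling in one leaf
pred = np.zeros_like(y)          # current predictions in the leaf

# XGBoost-style closed form: w = -sum(g) / (sum(h) + lambda), here lambda = 0.
g = -(y - pred)                  # first-order gradients of 0.5 * (y - pred)^2
h = np.ones_like(y)              # second-order (Hessian) terms
w_newton = -g.sum() / h.sum()

# cyc-GBM-style direct line search over the original loss, gradient-only.
loss = lambda b: 0.5 * ((y - (pred + b)) ** 2).sum()
w_search = minimize_scalar(loss).x
```

For squared error both rules return the leaf mean of the residuals; for asymmetric or heavy-tailed likelihoods the Newton weight is only a quadratic approximation, while the line search optimizes the exact loss at extra computational cost.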

A plausible implication is that cyc-GBM is more flexible for arbitrary differentiable loss functions and multi-parameter distributions but at increased computational expense and without the efficiency of second-order methods (Chevalier et al., 2024).

4. Per-Parameter Hyperparameters and Tuning Strategy

Cyc-GBM introduces the following per-parameter hyperparameters:

  • $M_k$ (Number of Trees): Determines the number of boosting iterations for parameter $k$; may be set lower for less variable parameters, or to zero to keep the parameter constant.
  • $d_k$ (Tree Depth): Controls the complexity of the $\mathbf{x}$-dependence for each parameter; for global (constant) parameters, set $d_k = 0$.
  • $\lambda_k$ (Learning Rate): Regulates regularization and convergence speed per parameter; typical values are $\lambda_k \leq 0.1$, tuned individually or shared.
  • $\phi_k$ (Link Function): Common choices include the log link for positive parameters and the identity for unconstrained parameters.

The recommended tuning procedure uses a small fixed $\lambda_k$ (e.g., 0.01), grid-searches over values of $M_k$ and $d_k$ (e.g., $M_k \in \{100, 200, 500, \ldots\}$, $d_k \in \{1, 3, 5\}$), and relies on out-of-sample deviance or CRPS for evaluation. An initial coarse grid search is advised before local refinement because training is slow (Chevalier et al., 2024).
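A sketch of that coarse grid search for a $K = 2$ model; `fit` and `deviance` are hypothetical stand-ins for a cyc-GBM training routine and an out-of-sample deviance metric, not an existing API.

```python
# Coarse per-parameter grid search (sketch); `fit` and `deviance` are
# hypothetical callables supplied by the user.
from itertools import product

def grid_search(fit, deviance, X_tr, y_tr, X_val, y_val):
    """Search (M_1, d_1) x (M_2, d_2) with a small fixed learning rate."""
    grid_M = [100, 200, 500]
    grid_d = [1, 3, 5]
    best = (float("inf"), None)
    for M1, d1, M2, d2 in product(grid_M, grid_d, grid_M, grid_d):
        model = fit(X_tr, y_tr, M=(M1, M2), depth=(d1, d2), lr=0.01)
        score = deviance(model, X_val, y_val)  # out-of-sample deviance or CRPS
        if score < best[0]:
            best = (score, (M1, d1, M2, d2))
    return best
```

Because every configuration retrains the full cyclic ensemble, a coarse grid followed by local refinement around the best cell is much cheaper than a fine grid from the start.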

5. Comparative Empirical Performance and Guidelines

Benchmarks by Delong et al. on five insurance datasets show cyc-GBM to be the slowest among multi-parameter methods, with training times 2×–5× those of XGBoostLSS or NGBoost on identical hardware. Optimizing the exact loss in every leaf via line search incurs significant computational overhead.

Predictive performance analyses reveal valid probabilistic forecasts, but cyc-GBM did not consistently outperform competing algorithms in metrics such as McFadden's $R^2$, CRPS, or coverage. Often, cyc-GBM ranked last among probabilistic boosting methods, attributed to the absence of second-order information and increased variance in line-search updates (Chevalier et al., 2024).

A plausible implication is that cyc-GBM should be reserved for modeling tasks requiring per-parameter customization, exotic loss functions, or interpretability with selective complexity control. Otherwise, alternatives such as XGBoostLSS or NGBoost generally offer superior efficiency and competitive accuracy.

6. Contextual Applications and Domain Relevance

Cyc-GBM is well-suited to actuarial and insurance modeling, especially for frequency and severity prediction involving high-cardinality categorical variables. Its flexibility for handling arbitrary differentiable losses and custom multi-parameter distributions is valuable in domains where standard boosting methods are less tractable. Exposure-to-risk can be seamlessly integrated into boosting frequency models with cyc-GBM’s architecture.
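For instance, with a log link, exposure-to-risk can enter a boosted Poisson frequency model as a fixed offset on the score, so that the trees model the claim rate per unit exposure. A minimal sketch under that assumed setup, with a hypothetical helper name:

```python
# Sketch of exposure handling in a boosted Poisson frequency model
# (assumed setup): log link, exposure e_i entering as a fixed offset.
import numpy as np

def poisson_pseudo_residuals(y, f, exposure):
    """Negative gradient of the Poisson NLL w.r.t. the tree score f."""
    offset = np.log(exposure)
    lam = np.exp(f + offset)   # expected claim count = rate * exposure
    return y - lam             # -dL/df for L = lam - y * (f + offset) + const
```

The offset is never boosted: it shifts the score once per observation, so policies observed for longer contribute proportionally larger expected counts without any change to the tree-fitting machinery.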

The ability to maintain interpretability—by constraining certain parameters to remain simple (e.g., setting $d_k = 0$)—while flexibly modeling others reflects the method's domain adaptability. However, practitioners must weigh the computational cost and predictive yield, reserving cyc-GBM for specific research needs where standard boosting approaches may lack necessary expressiveness or interpretability (Chevalier et al., 2024).

References (1)
