
Cyclic Gradient Boosting

Updated 30 January 2026
  • Cyclic gradient boosting is a coordinate-descent approach that sequentially cycles through parameter updates to minimize the negative log-likelihood loss.
  • It employs per-parameter hyperparameters such as tree depth and learning rate, enabling tailored regularization and model complexity control.
  • While offering flexible customization for actuarial claim frequency and severity modeling, cyc-GBM incurs higher computational cost compared to standard boosting methods.

Cyclic gradient boosting, abbreviated as cyc-GBM, is a coordinate-descent style algorithm for probabilistic decision tree ensembles that generalizes classical gradient boosting to multi-parameter distributions by sequentially cycling through updates to each parameter. The method fits models in which the conditional distribution $Y \sim F(y; p_1(\mathbf{x}), \ldots, p_K(\mathbf{x}))$ depends on $K$ parameters, and is particularly applicable in actuarial contexts for claim frequency and severity prediction. Cyc-GBM explicitly interleaves the boosting of each parameter within each iteration, allowing differentiated model complexity and regularization across parameters.

1. Mathematical Formulation

The objective of cyc-GBM is to minimize the negative log-likelihood loss over the training set $\mathcal{D}$:

$$\mathcal{L}_{\mathrm{total}} = \sum_{i=1}^{n} L(y_i, \mathbf{p}_i),$$

where $\mathbf{p}_i = [p_{1,i}, \ldots, p_{K,i}]^\top$ are the current parameter estimates for observation $i$, and each unconstrained tree output $f_k(\mathbf{x}) \in \mathbb{R}$ is mapped to $p_k$ via a link function $\phi_k$, i.e., $p_k = \phi_k^{-1}(f_k)$. For each boosting iteration and parameter $k$, cyc-GBM constructs the vector of predictions:

$$\hat{\mathbf{s}}_{k,i}^{(m^*)} = \big[f_1^{(m)}, \ldots, f_{k-1}^{(m)}, f_k^{(m-1)}, f_{k+1}^{(m-1)}, \ldots, f_K^{(m-1)}\big](\mathbf{x}_i),$$

with parameters before $k$ already updated in the current iteration, and parameter $k$ and those after it still at their previous values.

The negative gradient (pseudo-residual) for $p_k$ is given by:

$$g_{i,k}^{(m)} = -\left.\frac{\partial L(y_i, \mathbf{p})}{\partial p_k}\right|_{\mathbf{p} = \phi^{-1}\big(\hat{\mathbf{s}}_{k,i}^{(m^*)}\big)}.$$

Regression trees of depth $d_k$ are fit to $\{(\mathbf{x}_i, g_{i,k}^{(m)}) : i = 1, \ldots, n\}$, and for each leaf $j$ a one-dimensional line search solves:

$$b_{j,k}^{(m)} = \arg\min_b \sum_{i:\, \mathbf{x}_i \in R_{j,k}^m} L\Big(y_i, \phi^{-1}\big(\hat{\mathbf{s}}_{k,i}^{(m^*)}\big) + \mathbf{u}_k \, b\Big),$$

where $\mathbf{u}_k$ is the unit vector in the $k$th coordinate. The tree prediction is updated via:

$$f_k^{(m)}(\mathbf{x}) = f_k^{(m-1)}(\mathbf{x}) + \lambda_k \sum_{j=1}^{J_{m,k}} b_{j,k}^{(m)} \, 1_{\mathbf{x} \in R_{j,k}^m}.$$

The process cycles through $k = 1, \ldots, K$ for each boosting iteration $m$ and stops updating parameter $k$ once all $M_k$ steps are used (Chevalier et al., 2024).
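The per-iteration cycle can be sketched in Python for a two-parameter Gaussian model, with the identity link on the mean and the log link on the standard deviation. This is a minimal illustrative sketch under those assumed choices, not the authors' reference implementation; `neg_log_lik` and `cyc_gbm_step` are hypothetical names, and the gradients are taken on the unconstrained score scale.

```python
# Illustrative sketch of one cyc-GBM boosting iteration (assumed setup):
# Y ~ N(mu, sigma^2), with f_1 = mu (identity link) and f_2 = log(sigma).
import numpy as np
from scipy.optimize import minimize_scalar
from sklearn.tree import DecisionTreeRegressor

def neg_log_lik(y, f1, f2):
    """Gaussian negative log-likelihood (up to a constant)."""
    mu, sigma = f1, np.exp(f2)
    return 0.5 * ((y - mu) / sigma) ** 2 + np.log(sigma)

def cyc_gbm_step(X, y, f, depths=(2, 2), lrs=(0.1, 0.1)):
    """One pass over k = 1..K: pseudo-residuals, tree fit, per-leaf line search."""
    for k in range(2):
        mu, sigma = f[0], np.exp(f[1])
        # Negative gradient of the loss w.r.t. the k-th score, at current predictions.
        if k == 0:
            g = (y - mu) / sigma ** 2
        else:
            g = ((y - mu) / sigma) ** 2 - 1.0
        tree = DecisionTreeRegressor(max_depth=depths[k]).fit(X, g)
        leaves = tree.apply(X)
        step = np.zeros_like(y, dtype=float)
        for leaf in np.unique(leaves):
            idx = leaves == leaf
            # One-dimensional line search over the exact loss in this leaf.
            obj = lambda b: neg_log_lik(
                y[idx],
                f[0][idx] + (b if k == 0 else 0.0),
                f[1][idx] + (b if k == 1 else 0.0),
            ).sum()
            step[idx] = minimize_scalar(obj).x
        f[k] = f[k] + lrs[k] * step  # damped update with learning rate lambda_k
    return f
```

Each parameter's pseudo-residuals are recomputed from the predictions left by the preceding parameter's update within the same iteration, which is what makes the scheme cyclic rather than parallel.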

2. End-to-End Implementation Procedure

Training cyc-GBM begins by initializing each parameter's model $f_k^{(0)}(\mathbf{x})$ as the global maximum likelihood estimate under homogeneity. For each boosting iteration $m$ up to $M_{\max} = \max_k M_k$ and for each $k$, the update sequence is:

  1. If $m > M_k$, skip the update and retain the previous values for parameter $k$.
  2. Form current predictions $\hat{\mathbf{s}}_{k,i}$ for all $i$.
  3. Compute pseudo-responses $g_{i,k}$ using the gradient of the negative log-likelihood.
  4. Fit a regression tree to these pseudo-responses to create leaf partitions $R_{j,k}^m$.
  5. For each leaf $j$, solve the line search for the best increment $b_{j,k}$.
  6. Update the model for parameter $k$ using the learning rate $\lambda_k$ and the computed leaf increments.

At prediction time for a new $\mathbf{x}$, output $(\hat{p}_1, \ldots, \hat{p}_K) = \big(\phi_1^{-1}(f_1^{(M)}(\mathbf{x})), \ldots, \phi_K^{-1}(f_K^{(M)}(\mathbf{x}))\big)$ (Chevalier et al., 2024).
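The homogeneous initialization and the prediction-time link inversion can be sketched as follows, again assuming a two-parameter Gaussian model with identity and log links; the helper names are hypothetical.

```python
# Sketch of initialization and prediction mapping for an assumed
# two-parameter Gaussian model: f_1 = mu, f_2 = log(sigma).
import numpy as np

def init_homogeneous_mle(y):
    """f_k^(0): global MLE under homogeneity, expressed on the link (score) scale."""
    mu0 = y.mean()                 # identity link: f_1 = mu
    sigma0 = y.std()               # log link: f_2 = log(sigma)
    return np.array([mu0, np.log(sigma0)])

def predict_params(f1, f2):
    """Map unconstrained scores back through the inverse link functions."""
    return f1, np.exp(f2)          # (mu_hat, sigma_hat)
```

Starting from the homogeneous MLE means the ensemble only has to learn the $\mathbf{x}$-dependent deviations from a globally optimal constant model.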

3. Distinctions from Standard Gradient Boosting Variants

Cyc-GBM diverges from standard GBM (Friedman 2001) and XGBoost-style methods in several respects:

  • Tree Growing for Multiple Parameters: Classical GBM fits a single tree sequence to one loss, typically the mean; cyc-GBM fits $K$ tree sequences for $K$ parameters, cycling through them within each iteration.
  • Leaf Optimization: XGBoost and LightGBM employ a second-order (Newton) loss approximation and closed-form leaf weights, while cyc-GBM uses a direct line-search over the original loss and only utilizes the gradient.
  • Parameter Update Schemes: XGBoostLSS implements a sequential update strategy cycling through parameters with repeated passes. Cyc-GBM executes a single $k = 1, \ldots, K$ pass each boosting cycle, allowing individual control of tree depth, learning rates, and iterations per parameter.
  • Model Complexity Control: Cyc-GBM's architecture allows parameters to have distinct tree complexities and regularizations depending on domain knowledge or modeling requirements.
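The difference in leaf optimization can be made concrete in a toy single-leaf setting with squared-error loss, where the second-order closed form and the direct line search happen to coincide; for general losses they do not. This is an illustrative sketch, not library code.

```python
# Toy single-leaf comparison (assumed setting): squared-error loss.
import numpy as np
from scipy.optimize import minimize_scalar

y = np.array([1.2, 0.8, 1.5])    # targets falling in one leaf
pred = np.zeros_like(y)          # current predictions in the leaf

# XGBoost-style closed form: w = -sum(g) / (sum(h) + lambda), here lambda = 0.
g = -(y - pred)                  # first-order gradients of 0.5 * (y - pred)^2
h = np.ones_like(y)              # second-order (Hessian) terms
w_newton = -g.sum() / h.sum()

# cyc-GBM-style direct line search over the original loss, gradient-only.
loss = lambda b: 0.5 * ((y - (pred + b)) ** 2).sum()
w_search = minimize_scalar(loss).x
```

For squared error both rules return the leaf mean of the residuals; for asymmetric or heavy-tailed likelihoods the Newton weight is only a quadratic approximation, while the line search optimizes the exact loss at extra computational cost.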

A plausible implication is that cyc-GBM is more flexible for arbitrary differentiable loss functions and multi-parameter distributions but at increased computational expense and without the efficiency of second-order methods (Chevalier et al., 2024).

4. Per-Parameter Hyperparameters and Tuning Strategy

Cyc-GBM introduces the following per-parameter hyperparameters:

  • $M_k$ (Number of Trees): Determines the number of boosting iterations for parameter $k$; may be set lower for less variable parameters, or to zero to keep the parameter constant.
  • $d_k$ (Tree Depth): Controls the complexity of the $\mathbf{x}$-dependence for each parameter; for global (constant) parameters, set $d_k = 0$.
  • $\lambda_k$ (Learning Rate): Regulates regularization and convergence speed per parameter; typical values are $\lambda_k \leq 0.1$, tuned individually or shared.
  • $\phi_k$ (Link Function): Common choices include the log link for positive parameters and the identity for unconstrained parameters.

The recommended tuning procedure uses a small fixed $\lambda_k$ (e.g., 0.01), grid-searches over values of $M_k$ and $d_k$ (e.g., $M_k \in \{100, 200, 500, \ldots\}$, $d_k \in \{1, 3, 5\}$), and relies on out-of-sample deviance or CRPS for evaluation. An initial coarse grid search is advised before local refinement because training is slow (Chevalier et al., 2024).
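A sketch of that coarse grid search for a $K = 2$ model; `fit` and `deviance` are hypothetical stand-ins for a cyc-GBM training routine and an out-of-sample deviance metric, not an existing API.

```python
# Coarse per-parameter grid search (sketch); `fit` and `deviance` are
# hypothetical callables supplied by the user.
from itertools import product

def grid_search(fit, deviance, X_tr, y_tr, X_val, y_val):
    """Search (M_1, d_1) x (M_2, d_2) with a small fixed learning rate."""
    grid_M = [100, 200, 500]
    grid_d = [1, 3, 5]
    best = (float("inf"), None)
    for M1, d1, M2, d2 in product(grid_M, grid_d, grid_M, grid_d):
        model = fit(X_tr, y_tr, M=(M1, M2), depth=(d1, d2), lr=0.01)
        score = deviance(model, X_val, y_val)  # out-of-sample deviance or CRPS
        if score < best[0]:
            best = (score, (M1, d1, M2, d2))
    return best
```

Because every configuration retrains the full cyclic ensemble, a coarse grid followed by local refinement around the best cell is much cheaper than a fine grid from the start.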

5. Comparative Empirical Performance and Guidelines

Benchmarks by Delong et al. on five insurance datasets show cyc-GBM to be the slowest among multi-parameter methods, with training times 2×–5× those of XGBoostLSS or NGBoost on identical hardware. Optimizing the exact loss in every leaf via line search incurs significant computational overhead.

Predictive performance analyses reveal valid probabilistic forecasts, but cyc-GBM did not consistently outperform competing algorithms in metrics such as McFadden's $R^2$, CRPS, or coverage. Often, cyc-GBM ranked last among probabilistic boosting methods, attributed to the absence of second-order information and increased variance in line-search updates (Chevalier et al., 2024).

A plausible implication is that cyc-GBM should be reserved for modeling tasks requiring per-parameter customization, exotic loss functions, or interpretability with selective complexity control. Otherwise, alternatives such as XGBoostLSS or NGBoost generally offer superior efficiency and competitive accuracy.

6. Contextual Applications and Domain Relevance

Cyc-GBM is well-suited to actuarial and insurance modeling, especially for frequency and severity prediction involving high-cardinality categorical variables. Its flexibility for handling arbitrary differentiable losses and custom multi-parameter distributions is valuable in domains where standard boosting methods are less tractable. Exposure-to-risk can be seamlessly integrated into boosting frequency models with cyc-GBM’s architecture.
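For instance, with a log link, exposure-to-risk can enter a boosted Poisson frequency model as a fixed offset on the score, so that the trees model the claim rate per unit exposure. A minimal sketch under that assumed setup, with a hypothetical helper name:

```python
# Sketch of exposure handling in a boosted Poisson frequency model
# (assumed setup): log link, exposure e_i entering as a fixed offset.
import numpy as np

def poisson_pseudo_residuals(y, f, exposure):
    """Negative gradient of the Poisson NLL w.r.t. the tree score f."""
    offset = np.log(exposure)
    lam = np.exp(f + offset)   # expected claim count = rate * exposure
    return y - lam             # -dL/df for L = lam - y * (f + offset) + const
```

The offset is never boosted: it shifts the score once per observation, so policies observed for longer contribute proportionally larger expected counts without any change to the tree-fitting machinery.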

The ability to maintain interpretability—by constraining certain parameters to remain simple (e.g., setting $d_k = 0$)—while flexibly modeling others reflects the method's domain adaptability. However, practitioners must weigh the computational cost and predictive yield, reserving cyc-GBM for specific research needs where standard boosting approaches may lack necessary expressiveness or interpretability (Chevalier et al., 2024).

References (1)
