Cross-Modal XGBoost Framework

Updated 11 January 2026

The paper introduces an alternating-modality boosting paradigm where each modality is updated separately using residual fitting and KL divergence to prevent modality competition.
It details an objective formulation combining agreement terms with dynamic regularizers and memory consolidation to stabilize cross-modal learning.
Empirical results demonstrate a 5–10 percentage point improvement on benchmark datasets by effectively balancing uni-modal feature extraction with cross-modal fusion.

A Cross-Modal XGBoost Framework is an approach to multi-modal learning that integrates boosting principles with alternating-modality updates, drawing direct analogies to XGBoost while addressing modality competition and distributional imbalance within multi-modal fusion architectures. Central to this framework, as instantiated in ReconBoost, is the replacement of joint optimization with a per-modality, residual-based alternation and the explicit use of regularizers to reconcile information across modalities without allowing any single modality to dominate. This strategy is particularly formulated to exploit uni-modal features fully while controlling cross-modal interactions using dynamic, KL-divergence-based targets.

1. Paradigm of Alternating-Modality Boosting

ReconBoost, the reference realization of a Cross-Modal XGBoost Framework, adopts a modality-alternating boosting paradigm. Given $M$ modalities, at each boosting round $t$ , a single modality $m$ is selected for update, with other modalities' models frozen. The modality-specific learner $\phi_m(\theta_m)$ is trained to fit the residual error relative to the sum of the other frozen modality learners $\Phi_{M\setminus m}^{(t)}$ : $\Phi_{M\setminus m}^{(t)}(x_i) = \sum_{k\neq m} \phi_k(\theta_k^{(t)};x_i^k)$ The training objective at each round includes a dynamically updated reconcilement regularizer, implemented as a Kullback-Leibler (KL) divergence, which ensures the updated modality learner aligns in a controlled manner with the output of other modalities, thereby preventing any one modality from overpowering the system.

This alternating mechanism—contrasted with standard simultaneous joint optimization—prevents the dominant modality from monopolizing the shared gradient, a phenomenon known as modality competition. Instead, the dynamic regularization term enforces the new learner to fit the negative gradient (residual) of the loss with respect to the current non-updated modalities' outputs, mirroring Friedman's gradient boosting paradigm, but each modality only retains its latest ensemble member.

2. Objective Formulation

The combined objective at iteration $t$ for the selected modality $m$ comprises three components: (1) an agreement term aligning with the label, (2) a KL-based reconcilement regularizer, and (3) an optional memory consolidation regularizer to prevent catastrophic forgetting: $\widetilde{\mathcal{L}}^{(t)}(\theta_m) = \frac{1}{N}\sum_{i=1}^N \ell(\phi_m(\theta_m;x_i^m),y_i) - \lambda\frac{1}{N}\sum_{i=1}^N \operatorname{KL}\left[\sigma(\Phi_{M\setminus m}^{(t)}(x_i))\,\|\,\sigma(\phi_m(\theta_m;x_i^m))\right] + \alpha\,\mathcal{L}_{\mathrm{MCR}}$ where $\ell$ is the base loss (e.g., cross-entropy), $\sigma$ the softmax, $\lambda$ the reconcilement weight, and $\alpha$ the memory consolidation strength. The memory consolidation regularizer is given by: $\mathcal{L}_{\mathrm{MCR}} = \frac{1}{N}\sum_{i=1}^N \left\|\nabla_{\phi_m}\ell(\phi_m^{(t)}; x_i^m, y_i) - \nabla_{\phi_{m-1}}\ell(\phi_{m-1}^{(t-1)}; x_i^{m-1}, y_i)\right\|^2$ This structure ensures modality-specific fitting while enforcing both cross-modal reconciliation and temporal stability in the representation.

3. Algorithmic Structure and Enhancements

The framework operates in two stages for each boosting round:

a. Alternating Residual Boosting:

Only the current modality $m$ undergoes parameter updates, with agreement, reconcilement, and memory consolidation losses computed. The update is performed via gradient descent on the combined dynamic loss.

b. Global Rectification Scheme (GRS):

After the residual update, a global rectification stage fine-tunes all modalities jointly on the multi-modal loss: $\ell_{\text{total}} = \frac{1}{N} \sum_{i=1}^N \ell\left(\sum_{k=1}^M \phi_k(\theta_k;x_i^k), y_i\right)$ for several steps. This mitigates greedy local optima arising from isolated modality updates.

Enhancements, such as Memory Consolidation Regularization (MCR), are crucial for preventing drifting and ensuring continuity in representation learning across rounds. Hyperparameter $\alpha$ adjusts the consolidation strength.

4. Analogy to XGBoost and Gradient Boosting

The cross-modal framework maps classical XGBoost principles to the multi-modal domain as follows:

Aspect	XGBoost / GBM	Cross-Modal XGBoost Framework (ReconBoost)
Residual Fitting	Each new weak learner fits $- \nabla \ell$	Each new modality fits $- \nabla_{\Phi_{M\setminus m}}\ell$ , injected via KL regularizer
Ensemble Handling	All weak learners retained (additive ensemble)	Only most recent model per modality retained
Shrinkage	Step-size, learning rate	$\gamma$ , $\lambda$ act as shrinkage hyperparameters
Tree Objective	Second-order Taylor expansion on $\ell$	Second-order approx can be used for modality-objectives
Full Model Update	Periodic Newton-like updates	GRS as joint fine-tuning after each boosting cycle

A defining difference is the retention only of the latest learner per modality, minimizing over-ensembling risk with strong parametric learners (such as DNNs), a concern absent in classical boosting with weak learners.

5. Implementation within XGBoost-like Systems

The framework is directly implementable in existing XGBoost frameworks by treating each modality as a feature group and alternating the updated modality at each round. The combined objective for modality $m$ at round $t$ is: $L^t_m(f_m) = \sum_i \ell(\sigma(\Phi_{M\setminus m}^{(t)} + f_m(x_i^m)), y_i) - \lambda \cdot \operatorname{KL}[ \sigma(\Phi_{M\setminus m}^{(t)}) \| \sigma(\Phi_{M\setminus m}^{(t)} + f_m)] + \alpha\, \mathrm{MCR}$ Gradient and Hessian computation follow standard procedures, with gradients and Hessians fed into the standard tree-building or learner update API per modality. Hyperparameter suggestions are $\lambda \approx 0.25{\text{--}}0.5$ for reconcilement, $\alpha \approx 0.01{\text{--}}0.1$ for memory consolidation, and a learning rate of $0.1$ or smaller for new trees. Logistic or softmax losses are recommended for the main loss function.

6. Empirical Observations and Ablation Findings

Comprehensive benchmark results confirm the superiority of the cross-modal boosting framework over previous modality-balancing methods (e.g., OGM-GE, PMR, UMT), yielding a $5$–$10$ percentage point improvement on datasets such as AVE, CREMA-D, MOSEI, and MOSI. The removal of the KL reconcilement term ( $\lambda=0$ ) leads to a $3$–$5$ point reduction in performance, emphasizing its necessity. Ablating either the memory consolidation or the global rectification scheme decreases performance by approximately $1$ percentage point, confirming their criticality for retaining information and avoiding poor local minima. The framework shows optimal stability and convergence for $\lambda\in[0.25,0.5]$ and $\alpha\in[0.01,0.1]$ ; modality-imbalanced datasets, characterized by high modality dominance coefficient (DMC), benefit most from this approach.

Convergence profiles further indicate that alternating-modality updates reliably circumvent the learning plateaus associated with classic joint-optimization paradigms due to modality competition.

7. Context and Significance

The Cross-Modal XGBoost Framework, through ReconBoost, establishes a principled approach to reconciling the twin goals of uni-modal feature exploitation and cross-modal interaction in multi-modal deep learning. By operationalizing boosting concepts within this domain and explicitly addressing the modality competition phenomenon, it presents a scalable methodology compatible with popular optimization toolkits and justifies its empirical effectiveness with rigorous ablation and sensitivity analyses (Hua et al., 2024). A plausible implication is a broader applicability to multi-modal problems with severe feature and data heterogeneity, where standard joint training is prone to underutilize weaker modalities.

Markdown Report Issue Upgrade to Chat

References (1)

ReconBoost: Boosting Can Achieve Modality Reconcilement (2024)

Topic to Video (Beta)

Whiteboard

Follow Topic

Get notified by email when new papers are published related to Cross-Modal XGBoost Framework.