Cross-Modal XGBoost Framework

Updated 11 January 2026
  • The paper introduces an alternating-modality boosting paradigm where each modality is updated separately using residual fitting and KL divergence to prevent modality competition.
  • It details an objective formulation combining agreement terms with dynamic regularizers and memory consolidation to stabilize cross-modal learning.
  • Empirical results demonstrate a 5–10 percentage point improvement on benchmark datasets by effectively balancing uni-modal feature extraction with cross-modal fusion.

A Cross-Modal XGBoost Framework is an approach to multi-modal learning that integrates boosting principles with alternating-modality updates, drawing direct analogies to XGBoost while addressing modality competition and distributional imbalance within multi-modal fusion architectures. Central to the framework, as instantiated in ReconBoost, is the replacement of joint optimization with per-modality, residual-based alternation and the explicit use of regularizers that reconcile information across modalities without allowing any single modality to dominate. The strategy is designed to exploit uni-modal features fully while controlling cross-modal interactions through dynamic, KL-divergence-based targets.

1. Paradigm of Alternating-Modality Boosting

ReconBoost, the reference realization of a Cross-Modal XGBoost Framework, adopts a modality-alternating boosting paradigm. Given $M$ modalities, at each boosting round $t$ a single modality $m$ is selected for update, with the other modalities' models frozen. The modality-specific learner $\phi_m(\theta_m)$ is trained to fit the residual error relative to the sum of the other frozen modality learners $\Phi_{M\setminus m}^{(t)}$:

$$\Phi_{M\setminus m}^{(t)}(x_i) = \sum_{k\neq m} \phi_k\big(\theta_k^{(t)}; x_i^k\big)$$

The training objective at each round includes a dynamically updated reconcilement regularizer, implemented as a Kullback-Leibler (KL) divergence, which ensures the updated modality learner aligns in a controlled manner with the output of the other modalities, thereby preventing any one modality from overpowering the system.

This alternating mechanism, in contrast with standard simultaneous joint optimization, prevents the dominant modality from monopolizing the shared gradient, a phenomenon known as modality competition. Instead, the dynamic regularization term drives the new learner to fit the negative gradient (residual) of the loss with respect to the outputs of the currently frozen modalities, mirroring Friedman's gradient boosting paradigm, except that each modality retains only its latest ensemble member.
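To make the schedule concrete, the sketch below walks through one alternating round with a toy softmax setup; the `LinearModalityLearner` class, its `fit_step` method, and the round-robin driver are hypothetical illustrations of the idea, not the authors' implementation.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

class LinearModalityLearner:
    """Toy stand-in for phi_m: a linear scorer over one modality's features."""
    def __init__(self, dim, n_classes, lr=0.1):
        self.W = np.zeros((dim, n_classes))
        self.lr = lr

    def predict(self, x):
        return x @ self.W                      # this modality's logits

    def fit_step(self, x, grad_logits):
        # One gradient-descent step on this modality's parameters only.
        self.W -= self.lr * x.T @ grad_logits / len(x)

def alternating_round(learners, xs, y_onehot, m, lam=0.3, steps=10):
    """Update modality m against the frozen sum of the other learners."""
    frozen = sum(learners[k].predict(xs[k])
                 for k in range(len(learners)) if k != m)
    for _ in range(steps):
        p_own = softmax(learners[m].predict(xs[m]))
        g_agree = p_own - y_onehot             # agreement (cross-entropy) gradient
        g_recon = p_own - softmax(frozen)      # gradient of KL[softmax(frozen) || softmax(own)]
        # Objective is loss - lam * KL, so the KL gradient enters with a minus sign,
        # steering phi_m toward the residual the frozen modalities leave behind.
        learners[m].fit_step(xs[m], g_agree - lam * g_recon)

# Round-robin schedule over modalities (MCR and GRS omitted for brevity):
# for t in range(T):
#     alternating_round(learners, xs, y_onehot, m=t % len(learners))
```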

2. Objective Formulation

The combined objective at iteration $t$ for the selected modality $m$ comprises three components: (1) an agreement term aligning with the label, (2) a KL-based reconcilement regularizer, and (3) an optional memory consolidation regularizer to prevent catastrophic forgetting:

$$\widetilde{\mathcal{L}}^{(t)}(\theta_m) = \frac{1}{N}\sum_{i=1}^N \ell\big(\phi_m(\theta_m; x_i^m), y_i\big) - \lambda\,\frac{1}{N}\sum_{i=1}^N \operatorname{KL}\!\left[\sigma\big(\Phi_{M\setminus m}^{(t)}(x_i)\big)\,\big\|\,\sigma\big(\phi_m(\theta_m; x_i^m)\big)\right] + \alpha\,\mathcal{L}_{\mathrm{MCR}}$$

where $\ell$ is the base loss (e.g., cross-entropy), $\sigma$ the softmax, $\lambda$ the reconcilement weight, and $\alpha$ the memory consolidation strength. The memory consolidation regularizer is given by:

$$\mathcal{L}_{\mathrm{MCR}} = \frac{1}{N}\sum_{i=1}^N \left\| \nabla_{\phi_m}\ell\big(\phi_m^{(t)}; x_i^m, y_i\big) - \nabla_{\phi_{m-1}}\ell\big(\phi_{m-1}^{(t-1)}; x_i^{m-1}, y_i\big) \right\|^2$$

This structure ensures modality-specific fitting while enforcing both cross-modal reconciliation and temporal stability in the representation.
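Read as code, the objective can be evaluated term by term as in the NumPy sketch below; the helper names are assumptions, and the MCR term is approximated here from precomputed per-sample gradient arrays rather than the full parameter gradients of successive modality learners.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(logits, y_onehot, eps=1e-12):
    return -np.mean(np.sum(y_onehot * np.log(softmax(logits) + eps), axis=1))

def kl_divergence(p, q, eps=1e-12):
    # Mean KL[p || q] over samples, for row-wise probability vectors p, q.
    return np.mean(np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=1))

def round_objective(own_logits, frozen_logits, y_onehot,
                    grad_now=None, grad_prev=None, lam=0.3, alpha=0.05):
    """Scalar value of the round-t objective for the selected modality m.

    own_logits    -- phi_m(theta_m; x^m), the learner currently being updated
    frozen_logits -- Phi_{M\\m}(x), the summed outputs of the frozen modalities
    grad_now/prev -- per-sample gradients used by the optional MCR term
    """
    agreement = cross_entropy(own_logits, y_onehot)
    reconcile = kl_divergence(softmax(frozen_logits), softmax(own_logits))
    mcr = 0.0
    if grad_now is not None and grad_prev is not None:
        mcr = np.mean(np.sum((grad_now - grad_prev) ** 2, axis=1))
    return agreement - lam * reconcile + alpha * mcr
```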

3. Algorithmic Structure and Enhancements

The framework operates in two stages for each boosting round:

a. Alternating Residual Boosting:

Only the current modality $m$ undergoes parameter updates, with agreement, reconcilement, and memory consolidation losses computed. The update is performed via gradient descent on the combined dynamic loss.

b. Global Rectification Scheme (GRS):

After the residual update, a global rectification stage fine-tunes all modalities jointly on the multi-modal loss

$$\ell_{\text{total}} = \frac{1}{N} \sum_{i=1}^N \ell\left(\sum_{k=1}^M \phi_k(\theta_k; x_i^k),\, y_i\right)$$

for several steps. This mitigates greedy local optima arising from isolated modality updates.
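A minimal sketch of the GRS stage is shown below, assuming learner objects like those in the Section 1 sketch (each exposing hypothetical `predict` and `fit_step` methods); it illustrates joint fine-tuning on the fused loss and is not the reference code.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def global_rectification(learners, xs, y_onehot, steps=5):
    """GRS: a few joint fine-tuning steps on the fused multi-modal loss."""
    for _ in range(steps):
        fused = sum(l.predict(x) for l, x in zip(learners, xs))  # sum_k phi_k(x^k)
        g = softmax(fused) - y_onehot                            # d loss / d fused logits
        # The fusion is additive, so each modality receives the same logit-space
        # gradient and maps it back onto its own parameters.
        for l, x in zip(learners, xs):
            l.fit_step(x, g)
```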

Enhancements such as Memory Consolidation Regularization (MCR) are crucial for preventing representational drift and ensuring continuity of representation learning across rounds. The hyperparameter $\alpha$ adjusts the consolidation strength.

4. Analogy to XGBoost and Gradient Boosting

The cross-modal framework maps classical XGBoost principles to the multi-modal domain as follows:

| Aspect | XGBoost / GBM | Cross-Modal XGBoost Framework (ReconBoost) |
|---|---|---|
| Residual fitting | Each new weak learner fits $-\nabla \ell$ | Each new modality learner fits $-\nabla_{\Phi_{M\setminus m}}\ell$, injected via the KL regularizer |
| Ensemble handling | All weak learners retained (additive ensemble) | Only the most recent learner per modality retained |
| Shrinkage | Step size / learning rate | $\gamma$, $\lambda$ act as shrinkage hyperparameters |
| Tree objective | Second-order Taylor expansion of $\ell$ | Second-order approximation can be used for the modality objectives |
| Full-model update | Periodic Newton-like updates | GRS as joint fine-tuning after each boosting cycle |

A defining difference is the retention only of the latest learner per modality, minimizing over-ensembling risk with strong parametric learners (such as DNNs), a concern absent in classical boosting with weak learners.

5. Implementation within XGBoost-like Systems

The framework is directly implementable in existing XGBoost frameworks by treating each modality as a feature group and alternating the updated modality at each round. The combined objective for modality $m$ at round $t$ is:

$$L^{(t)}_m(f_m) = \sum_i \ell\big(\sigma(\Phi_{M\setminus m}^{(t)} + f_m(x_i^m)),\, y_i\big) - \lambda \operatorname{KL}\!\left[\sigma\big(\Phi_{M\setminus m}^{(t)}\big)\,\big\|\,\sigma\big(\Phi_{M\setminus m}^{(t)} + f_m\big)\right] + \alpha\,\mathrm{MCR}$$

Gradient and Hessian computation follow standard procedures, with gradients and Hessians fed into the standard tree-building or learner-update API per modality. Suggested hyperparameters are $\lambda \approx 0.25$–$0.5$ for reconcilement, $\alpha \approx 0.01$–$0.1$ for memory consolidation, and a learning rate of $0.1$ or smaller for new trees. Logistic or softmax losses are recommended for the main loss function.
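As a sketch of how such a per-modality round could be wired into the XGBoost Python API, the snippet below handles the binary (sigmoid) case with a custom objective and omits the MCR term; `make_reconcile_objective` and the variables in the commented usage (`X_m`, `frozen_margin`, `y`) are illustrative assumptions, and the sketch relies on the base margin being folded into the raw predictions passed to custom objectives.

```python
import numpy as np
import xgboost as xgb

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def make_reconcile_objective(frozen_margin, lam=0.3):
    """Custom binary objective: agreement gradient minus lam * reconcilement gradient.

    frozen_margin holds the raw scores of the frozen modalities, Phi_{M\\m}(x);
    the same array is set as base_margin, so predt already includes it.
    """
    q = sigmoid(frozen_margin)                 # frozen-ensemble probabilities

    def objective(predt, dtrain):
        y = dtrain.get_label()
        p = sigmoid(predt)                     # sigma(Phi_{M\m} + f_m)
        grad = (p - y) - lam * (p - q)         # logloss gradient minus lam * d KL / d score
        hess = (1.0 - lam) * p * (1.0 - p)     # stays positive for lam < 1
        return grad, hess

    return objective

# Hypothetical round for modality m (X_m, frozen_margin, y prepared elsewhere):
# dtrain = xgb.DMatrix(X_m, label=y)
# dtrain.set_base_margin(frozen_margin)
# booster_m = xgb.train({"eta": 0.1, "max_depth": 4}, dtrain,
#                       num_boost_round=50,
#                       obj=make_reconcile_objective(frozen_margin, lam=0.3))
```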

6. Empirical Observations and Ablation Findings

Comprehensive benchmark results confirm the superiority of the cross-modal boosting framework over previous modality-balancing methods (e.g., OGM-GE, PMR, UMT), yielding a $5$–$10$ percentage point improvement on datasets such as AVE, CREMA-D, MOSEI, and MOSI. The removal of the KL reconcilement term ($\lambda = 0$) leads to a $3$–$5$ point reduction in performance, emphasizing its necessity. Ablating either the memory consolidation or the global rectification scheme decreases performance by approximately $1$ percentage point, confirming their criticality for retaining information and avoiding poor local minima. The framework shows optimal stability and convergence for $\lambda \in [0.25, 0.5]$ and $\alpha \in [0.01, 0.1]$; modality-imbalanced datasets, characterized by a high modality dominance coefficient (DMC), benefit most from this approach.

Convergence profiles further indicate that alternating-modality updates reliably circumvent the learning plateaus associated with classic joint-optimization paradigms due to modality competition.

7. Context and Significance

The Cross-Modal XGBoost Framework, through ReconBoost, establishes a principled approach to reconciling the twin goals of uni-modal feature exploitation and cross-modal interaction in multi-modal deep learning. By operationalizing boosting concepts within this domain and explicitly addressing the modality competition phenomenon, it presents a scalable methodology compatible with popular optimization toolkits and justifies its empirical effectiveness with rigorous ablation and sensitivity analyses (Hua et al., 2024). A plausible implication is a broader applicability to multi-modal problems with severe feature and data heterogeneity, where standard joint training is prone to underutilize weaker modalities.

