Harmonized Tabular-Image Fusion via Gradient-Aligned Alternating Learning

Published 2 Apr 2026 in cs.CV and cs.AI | (2604.01579v1)

Abstract: Multimodal tabular-image fusion is an emerging task that has received increasing attention in various domains. However, existing methods may be hindered by gradient conflicts between modalities, misleading the optimization of the unimodal learner. In this paper, we propose a novel Gradient-Aligned Alternating Learning (GAAL) paradigm to address this issue by aligning modality gradients. Specifically, GAAL adopts an alternating unimodal learning and shared classifier to decouple the multimodal gradient and facilitate interaction. Furthermore, we design uncertainty-based cross-modal gradient surgery to selectively align cross-modal gradients, thereby steering the shared parameters to benefit all modalities. As a result, GAAL can provide effective unimodal assistance and help boost the overall fusion performance. Empirical experiments on widely used datasets reveal the superiority of our method through comparison with various state-of-the-art (SoTA) tabular-image fusion baselines and test-time tabular missing baselines. The source code is available at https://github.com/njustkmg/ICME26-GAAL.

Authors (2)

Summary

The paper introduces GAAL, a method that alternates modality-specific optimization to mitigate gradient conflicts and prevent modality collapse.
It incorporates uncertainty-based gradient surgery via quadratic programming to align conflicting gradients in the shared classifier.
Experimental results demonstrate GAAL's superior performance in both unimodal and multimodal tasks across multiple datasets.

Introduction

Multimodal fusion of tabular and image data is increasingly critical in real-world applications where structured data and visual cues jointly inform decision-making processes, notably in areas such as healthcare, recommender systems, and marketing analytics. However, the joint optimization of tabular and image modalities suffers from gradient conflicts that emerge due to their inherently disparate structures and learning dynamics. Prior methods in multimodal learning have predominantly relied on unified joint objectives, resulting in negative interactions between modalities, or focused exclusively on encoder-specific gradient manipulation, neglecting adverse interactions at the shared classifier. The paper "Harmonized Tabular-Image Fusion via Gradient-Aligned Alternating Learning" (2604.01579) introduces GAAL, a principled learning paradigm that explicitly addresses cross-modal gradient conflicts and enhances synergistic tabular-image fusion.

Analysis of Gradient Conflicts in Multimodal Fusion

Joint learning paradigms for tabular-image fusion induce gradient conflicts between modalities, as evidenced by the pervasive negative cosine similarity between the image-specific and multimodal gradients. This misalignment misleads optimization at the modality level and reduces the representational efficacy of each unimodal branch, leading to pronounced modality collapse. Empirically, established approaches, including OGM, MMPareto, MLA, and other multimodal strategies, exhibit substantial unimodal performance degradation on challenging fusion tasks such as the DVM dataset.

Figure 1: Tabular-image fusion tasks exhibit severe gradient conflicts with negative cosine similarity in joint training, and existing methods fail to fully leverage unimodal capacities.
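
The conflict diagnostic described in this section can be illustrated with a cosine-similarity check between two flattened gradient vectors. The following is a minimal sketch in plain Python; the function names `cosine_similarity` and `is_conflicting` and the toy gradient values are illustrative assumptions, not taken from the paper's code.

```python
import math

def cosine_similarity(g_a, g_b):
    """Cosine of the angle between two flattened gradient vectors."""
    dot = sum(a * b for a, b in zip(g_a, g_b))
    norm_a = math.sqrt(sum(a * a for a in g_a))
    norm_b = math.sqrt(sum(b * b for b in g_b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here: a zero gradient conflicts with nothing
    return dot / (norm_a * norm_b)

def is_conflicting(g_a, g_b):
    """Two gradients conflict when they point into opposing half-spaces."""
    return cosine_similarity(g_a, g_b) < 0.0

# Toy values (hypothetical): an image-specific gradient and a multimodal gradient.
g_image = [1.0, 0.0]
g_multimodal = [-1.0, 1.0]
print(cosine_similarity(g_image, g_multimodal))  # negative, so the pair conflicts
print(is_conflicting(g_image, g_multimodal))
```

In a real model the vectors would be the concatenated parameter gradients of the relevant module; tracking this value over training steps is one way to obtain a conflict curve like the one the figure describes.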

Gradient-Aligned Alternating Learning (GAAL) Framework

The proposed GAAL algorithm alternates modality-specific optimization with a shared classifier, decoupling the conflicting gradient flows inherent in joint updates. At each iteration, only one modality’s learner is updated, ensuring per-step focus and information preservation within each stream. To address persistent conflicts at the shared classifier, where alternated gradients can be antagonistic, GAAL incorporates an uncertainty-based cross-modal gradient surgery mechanism.
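
The alternating schedule can be sketched on a toy problem. The example below assumes one scalar encoder weight per modality and a single scalar shared classifier weight, trained with binary cross-entropy. All names (`alternating_step`, `modality_loss`, the `params` layout) and the data are hypothetical, and the gradient surgery step is deliberately omitted so that only the alternation is shown.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce(p, y):
    """Binary cross-entropy for one prediction p and label y in {0, 1}."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def modality_loss(params, data, modality):
    """Mean loss when predicting from a single modality via the shared weight."""
    w, v = params[modality], params["shared"]
    return sum(bce(sigmoid(v * w * x[modality]), y) for x, y in data) / len(data)

def alternating_step(params, data, modality, lr=0.1):
    """Update only the given modality's encoder weight and the shared weight."""
    w, v = params[modality], params["shared"]
    grad_w = grad_v = 0.0
    for x, y in data:
        err = sigmoid(v * w * x[modality]) - y  # dLoss/dlogit for sigmoid + BCE
        grad_w += err * v * x[modality] / len(data)
        grad_v += err * w * x[modality] / len(data)
    new_params = dict(params)  # the other modality's weight is copied unchanged
    new_params[modality] = w - lr * grad_w
    new_params["shared"] = v - lr * grad_v
    return new_params

# Toy data (hypothetical): each sample has one image feature and one tabular feature.
data = [({"image": 1.0, "tabular": 2.0}, 1), ({"image": -1.0, "tabular": -2.0}, 0)]
params = {"image": 0.5, "tabular": 0.5, "shared": 0.5}
initial = {m: modality_loss(params, data, m) for m in ("image", "tabular")}
for step in range(20):
    modality = ("image", "tabular")[step % 2]  # one modality per iteration
    params = alternating_step(params, data, modality)
final = {m: modality_loss(params, data, m) for m in ("image", "tabular")}
print(initial, final)
```

Because the shared weight is touched at every step while each encoder is touched only on its own turn, the shared classifier is where consecutive updates can pull in opposing directions.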

At each alternating iteration, high-entropy (uncertain) samples from the previous modality are identified, and their average gradient direction serves as a projection anchor. Current-modality gradients that conflict with those of the previous modality, as indicated by negative cosine similarity, are projected toward the cross-modal anchor by solving a quadratic program that enforces geometric alignment subject to an $\epsilon$-margin constraint. This selective gradient surgery mitigates destructive interference and steers the update toward cross-modal synergy without enforcing orthogonality, thus preserving latent overlap between modalities.

Figure 2: GAAL architecture alternates updates between modalities and uses uncertainty-based gradient guidance, projecting conflicting gradients in the shared classifier to enhance cross-modal interaction.
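
The summary does not state the exact quadratic program, so the sketch below rests on an assumed form: find the vector closest to the current gradient $g$ (in Euclidean norm) whose inner product with the anchor $a$ is at least $\epsilon$. With a single linear constraint this has the closed-form solution $g + \max(0, (\epsilon - \langle g, a \rangle)/\|a\|^2)\, a$. The function name `project_if_conflicting` and the toy vectors are hypothetical; the paper's actual constraint may differ (for example, it may bound cosine similarity rather than the raw inner product).

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def project_if_conflicting(grad, anchor, eps=0.0):
    """Assumed surgery rule: leave grad alone unless it conflicts with anchor.

    On conflict, return the minimum-norm correction of grad such that
    dot(result, anchor) >= eps (single-constraint QP, solved in closed form).
    """
    inner = dot(grad, anchor)
    anchor_sq = dot(anchor, anchor)
    # For non-zero vectors, negative cosine similarity <=> negative inner product.
    if inner >= 0.0 or anchor_sq == 0.0:
        return list(grad)  # no conflict, or no usable anchor
    step = max(0.0, (eps - inner) / anchor_sq)  # multiplier of the active constraint
    return [g + step * a for g, a in zip(grad, anchor)]

# Toy vectors (hypothetical): current-modality gradient and previous-modality anchor.
grad_current = [-1.0, 1.0]
anchor_previous = [1.0, 0.0]
aligned = project_if_conflicting(grad_current, anchor_previous, eps=0.1)
print(aligned)  # approximately [0.1, 1.0]: only the component along the anchor changes
```

Under this assumed form, the component of the gradient orthogonal to the anchor is untouched, which matches the stated goal of minimal perturbation; a larger `eps` pushes the update further toward the anchor direction.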

Uncertainty-Guided Gradient Selection

Key to the efficacy of GAAL’s cross-modal gradient surgery is the focus on uncertainty-guided (hard) samples. Rather than diluting gradient guidance equally across all examples, entropy-based selection accentuates the learning signal on samples for which the current modality is most unsure. This strategy concentrates optimization on decision boundaries and tail cases, preventing overfitting to easy patterns and leveraging high-information instances to refine the shared classifier. The constrained quadratic programming ensures minimal perturbation to the original gradient while enforcing alignment with the modality most responsive to these hard cases.
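
Entropy-based selection of hard samples and construction of the anchor can be sketched as follows. This assumes per-sample class probabilities and per-sample gradients are already available as plain lists; `uncertain_anchor` and `num_selected` are hypothetical names, with `num_selected` playing the role of the per-modality sample count that the summary denotes $\lambda^I$ and $\lambda^T$.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def uncertain_anchor(probs_per_sample, grads_per_sample, num_selected):
    """Average the gradients of the num_selected highest-entropy samples."""
    if num_selected <= 0:
        raise ValueError("num_selected must be positive")
    order = sorted(
        range(len(probs_per_sample)),
        key=lambda i: entropy(probs_per_sample[i]),
        reverse=True,  # most uncertain first
    )
    chosen = order[:num_selected]
    dim = len(grads_per_sample[0])
    anchor = [
        sum(grads_per_sample[i][d] for i in chosen) / len(chosen)
        for d in range(dim)
    ]
    return chosen, anchor

# Toy batch (hypothetical): predictions from the previous modality and the
# matching per-sample gradients of the shared classifier.
probs = [[0.98, 0.02], [0.55, 0.45], [0.50, 0.50], [0.90, 0.10]]
grads = [[1.0, 0.0], [0.0, 2.0], [2.0, 0.0], [5.0, 5.0]]
chosen, anchor = uncertain_anchor(probs, grads, num_selected=2)
print(chosen, anchor)  # samples 2 and 1 are the most uncertain
```

The confident samples (indices 0 and 3) contribute nothing to the anchor, so a large gradient from an easy example, such as `[5.0, 5.0]` here, cannot dominate the guidance direction.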

Experimental Validation

Empirical evaluation is conducted on three large-scale datasets: DVM, SUNAttribute, and CelebA. GAAL is compared against strong baselines, including multimodal fusion, unimodal learning, and state-of-the-art gradient-conflict mitigation strategies. The results show that GAAL achieves superior multimodal and unimodal performance across all benchmarks and, in several cases, even outperforms the best-performing unimodal learners, underscoring its capacity to avoid modality collapse and promote robust cross-modal interaction.

Figure 3: GAAL exhibits sensitivity to the number of high-uncertainty samples ($\lambda^I$, $\lambda^T$), affirming the importance of hard-sample selection for effective cross-modal gradient guidance.

Figure 4: The convergence profiles of GAAL on DVM and SUNAttribute exhibit stable, monotonically decreasing loss, consistent with mitigated gradient conflicts and robust training.

Ablation studies further distinguish the individual contributions of alternating learning, cross-modal gradient surgery, and uncertainty-based sample selection. Ablating any component leads to a significant performance drop, confirming their necessity for optimal tabular-image fusion.

Figure 5: Fusion performance is sensitive to the alignment constraint margin $\epsilon$, with optimal alignment providing maximum synergy and deviation leading to suboptimal independence or excessive entanglement.

Theoretical and Practical Implications

GAAL exemplifies a structured approach to harmonizing multimodal fusion in which gradient-space alignment supersedes global joint optimization. The explicit alternating and cross-modal guidance decouples destructive interactions and enables balanced utilization of tabular and image information. The method’s superior results on missing-modality inference tasks further indicate that the learned shared classifier generalizes well even when tabular data is unavailable at test time, a crucial property for practical deployment.

GAAL’s quadratic programming-based gradient projections are amenable to extension in more general settings, such as regression, detection, and broader multi-modal fusion with more than two modalities. The focus on uncertainty-driven guidance can influence active learning paradigms and robustness-critical applications.

Conclusion

The GAAL paradigm systematically addresses the core bottleneck in tabular-image fusion: gradient misalignment across modalities, especially in the shared classifier. Through iterative, uncertainty-guided alternating optimization and convex projection of conflicting gradients, GAAL achieves state-of-the-art performance, balances unimodal and multimodal strengths, and provides a robust template for future multimodal fusion research. Extensions to other modalities, continuous targets, and more complex architectures are natural subsequent directions, with the foundational framework solidly substantiated by comprehensive empirical analysis.
