- The paper introduces GAAL, a method that alternates modality-specific optimization to mitigate gradient conflicts and prevent modality collapse.
- It incorporates uncertainty-based gradient surgery via quadratic programming to align conflicting gradients in the shared classifier.
- Experimental results demonstrate GAAL's superior performance in both unimodal and multimodal tasks across multiple datasets.
Harmonized Tabular-Image Fusion via Gradient-Aligned Alternating Learning
Introduction
Multimodal fusion of tabular and image data is increasingly critical in real-world applications where structured data and visual cues jointly inform decision-making processes, notably in areas such as healthcare, recommender systems, and marketing analytics. However, the joint optimization of tabular and image modalities suffers from gradient conflicts that emerge due to their inherently disparate structures and learning dynamics. Prior methods in multimodal learning have predominantly relied on unified joint objectives, resulting in negative interactions between modalities, or focused exclusively on encoder-specific gradient manipulation, neglecting adverse interactions at the shared classifier. The paper "Harmonized Tabular-Image Fusion via Gradient-Aligned Alternating Learning" (2604.01579) introduces GAAL, a principled learning paradigm that explicitly addresses cross-modal gradient conflicts and enhances synergistic tabular-image fusion.
Analysis of Gradient Conflicts in Multimodal Fusion
Joint learning paradigms for tabular-image fusion inherently induce gradient conflicts between modalities, as evidenced by pervasive negative cosine similarity between the image-specific and multimodal gradients. This misalignment not only misleads optimization at the modality level but also reduces the representational efficacy of each unimodal branch, leading to a pronounced modality collapse. As demonstrated empirically, canonical approaches, including OGM, MMPareto, MLA, and other multimodal strategies, exhibit substantial unimodal performance degradation when deployed on challenging fusion tasks, such as the DVM dataset.
Figure 1: Tabular-image fusion tasks exhibit severe gradient conflicts with negative cosine similarity in joint training, and existing methods fail to fully leverage unimodal capacities.
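The conflict diagnostic described above can be illustrated with a small sketch. It assumes each gradient has been flattened into a plain list of floats; the function names and toy values are hypothetical and are not taken from the paper.

```python
import math

def cosine_similarity(g_a, g_b):
    """Cosine of the angle between two flattened gradient vectors."""
    dot = sum(a * b for a, b in zip(g_a, g_b))
    norm_a = math.sqrt(sum(a * a for a in g_a))
    norm_b = math.sqrt(sum(b * b for b in g_b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here: a zero gradient conflicts with nothing
    return dot / (norm_a * norm_b)

def in_conflict(g_a, g_b):
    """Two gradients conflict when their cosine similarity is negative."""
    return cosine_similarity(g_a, g_b) < 0.0

# Toy values for illustration only (not measurements from the paper).
g_image = [1.0, -2.0, 0.5]
g_multimodal = [-1.0, 1.0, 0.0]
print(in_conflict(g_image, g_multimodal))
```

A negative value means that a step along one gradient locally increases the loss associated with the other, which is the situation the summary refers to as a gradient conflict.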
Gradient-Aligned Alternating Learning (GAAL) Framework
The proposed GAAL algorithm alternates modality-specific optimization with a shared classifier, decoupling the conflicting gradient flows inherent in joint updates. At each iteration, only one modality's learner is updated, ensuring per-step focus and information preservation within each stream. To address persistent conflicts at the shared classifier, where alternated gradients can be antagonistic, GAAL incorporates an uncertainty-based cross-modal gradient surgery mechanism.
At each alternating iteration, high-entropy (uncertain) samples from the previous modality are identified, and their average gradient direction is used as a projection anchor via quadratic programming. Current modality gradients that conflict with the previous ones, as measured by negative cosine similarity, are projected onto the cross-modal anchor, enforcing geometric alignment subject to an ϵ-margin constraint. This selective gradient surgery robustly mitigates destructive interference, focusing the update direction to enhance synergy without enforced orthogonality, thus preserving latent modal overlaps.
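The alternating update pattern can be sketched as follows. This is a minimal illustration under stated assumptions: a strict round-robin order, plain gradient descent, and a hypothetical `grad_fn` callback that returns gradients for the active encoder and the shared classifier. The summary does not specify the paper's actual schedule or optimizer.

```python
def alternating_schedule(num_steps, modalities=("image", "tabular")):
    """Yield the modality whose learner is updated at each step (round-robin, assumed)."""
    for step in range(num_steps):
        yield modalities[step % len(modalities)]

def train(num_steps, params, grad_fn, lr=0.1):
    """Update one modality encoder plus the shared classifier per step.

    params:  {"image": [...], "tabular": [...], "shared": [...]}
    grad_fn: hypothetical callback, grad_fn(modality, params) ->
             (encoder_grad, shared_grad) for the active modality only.
    """
    for modality in alternating_schedule(num_steps):
        enc_grad, shared_grad = grad_fn(modality, params)
        # Only the active modality's encoder is touched in this step.
        params[modality] = [p - lr * g for p, g in zip(params[modality], enc_grad)]
        # The shared classifier receives a gradient from every step.
        params["shared"] = [p - lr * g for p, g in zip(params["shared"], shared_grad)]
    return params
```

Because the shared classifier is updated by both modalities in turn, consecutive steps can still pull it in opposing directions, which is why the shared-classifier gradients need additional handling.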
Figure 2: GAAL architecture alternates updates between modalities and uses uncertainty-based gradient guidance, projecting conflicting gradients in the shared classifier to enhance cross-modal interaction.
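To make the projection step concrete, here is a minimal sketch. It assumes the alignment constraint takes the form ⟨g', a⟩ ≥ ϵ, where a is the anchor (the mean gradient of the uncertain samples) and g' is the corrected gradient. With a single linear constraint of this kind, the quadratic program "minimize ‖g' − g‖² subject to ⟨g', a⟩ ≥ ϵ" has a closed-form solution, which is used below in place of a general solver. The exact constraint in the paper may differ, and all names are hypothetical.

```python
def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def mean_gradient(grads):
    """Average a list of per-sample gradient vectors into one anchor direction."""
    n = len(grads)
    return [sum(column) / n for column in zip(*grads)]

def align_gradient(g_current, anchor, eps=0.0):
    """Return the closest vector to g_current that satisfies <g', anchor> >= eps.

    The correction is applied only when the two gradients conflict, i.e. when
    their inner product (and hence cosine similarity) is negative.
    """
    inner = dot(g_current, anchor)
    anchor_sq = dot(anchor, anchor)
    if inner >= 0.0 or anchor_sq == 0.0:
        return list(g_current)  # no conflict: leave the gradient untouched
    step = (eps - inner) / anchor_sq  # Lagrange multiplier of the single constraint
    return [g + step * a for g, a in zip(g_current, anchor)]

# Toy values for illustration only.
anchor = mean_gradient([[1.0, 0.0], [1.0, 2.0]])      # -> [1.0, 1.0]
aligned = align_gradient([-2.0, 1.0], anchor, eps=0.5)
print(aligned, dot(aligned, anchor))
```

The correction moves the gradient only along the anchor direction, so the component orthogonal to the anchor is preserved. The result lands exactly on the constraint boundary instead of being forced to be orthogonal to the anchor.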
Uncertainty-Guided Gradient Selection
Key to the efficacy of GAAL's cross-modal gradient surgery is the focus on uncertainty-guided (hard) samples. Rather than diluting gradient guidance equally across all examples, entropy-based selection accentuates the learning signal on samples for which the current modality is most unsure. This strategy concentrates optimization on decision boundaries and tail cases, preventing overfitting to easy patterns and leveraging high-information instances to refine the shared classifier. The constrained quadratic programming ensures minimal perturbation to the original gradient while enforcing alignment with the modality most responsive to these hard cases.
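Entropy-based selection of hard samples might look like the sketch below. It assumes the classifier outputs a probability vector per sample and that a fixed number `k` of the most uncertain samples is kept; the helper names and toy probabilities are hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def select_uncertain(prob_rows, k):
    """Return the indices of the k samples with the highest predictive entropy."""
    ranked = sorted(range(len(prob_rows)),
                    key=lambda i: entropy(prob_rows[i]),
                    reverse=True)
    return ranked[:k]

# Toy predictions for illustration only: sample 1 is nearly uniform (most uncertain).
probs = [
    [0.98, 0.01, 0.01],
    [0.34, 0.33, 0.33],
    [0.60, 0.30, 0.10],
]
print(select_uncertain(probs, k=2))
```

The gradients of the selected samples would then be averaged to form the anchor used in the projection step, so `k` controls how many hard samples contribute to the guidance signal.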
Experimental Validation
Empirical evaluation is conducted over three large-scale datasets: DVM, SUNAttribute, and CelebA. Strong baselines, including multimodal fusion, unimodal learning, and state-of-the-art gradient-conflict mitigation strategies, are comprehensively compared. The results demonstrate that GAAL achieves superior multimodal and unimodal performance across all benchmarks, and in several cases, even outperforms the best-per-execution unimodal learners, underscoring its capacity to avoid modality collapse and promote robust cross-modal interaction.

Figure 3: GAAL exhibits sensitivity to the number of high-uncertainty samples (λ_I, λ_T), affirming the importance of hard-sample selection for effective cross-modal gradient guidance.
Figure 4: The convergence profiles of GAAL on DVM and SUNAttribute exhibit stable, monotonically decreasing loss, confirming the resolution of gradient conflicts and training robustness.
Ablation studies further distinguish the individual contributions of alternating learning, cross-modal gradient surgery, and uncertainty-based sample selection. Ablating any component leads to a significant performance drop, confirming their necessity for optimal tabular-image fusion.
Figure 5: Fusion performance is sensitive to the alignment constraint margin ϵ, with optimal alignment providing maximum synergy and deviation leading to suboptimal independence or excessive entanglement.
Theoretical and Practical Implications
GAAL exemplifies a structured approach to harmonizing multimodal fusion where gradient-space alignment supersedes global joint optimization. The explicit alternating and cross-modal guidance decouples destructive interactions and enables balanced utilization of tabular and image information. The method's superior results on missing-modality inference tasks further indicate that the learned shared classifier generalizes efficiently even when tabular data is unavailable at test time, a crucial property for practical deployment.
GAAL's quadratic programming-based gradient projections are amenable to extension in more general settings, such as regression, detection, and broader multi-modal fusion with more than two modalities. The focus on uncertainty-driven guidance can influence active learning paradigms and robustness-critical applications.
Conclusion
The GAAL paradigm systematically addresses the core bottleneck in tabular-image fusion: gradient misalignment across modalities, especially in the shared classifier. Through iterative, uncertainty-guided alternating optimization and convex projection of conflicting gradients, GAAL achieves state-of-the-art performance, balances unimodal and multimodal strengths, and provides a robust template for future multimodal fusion research. Extensions to other modalities, continuous targets, and more complex architectures are natural subsequent directions, with the foundational framework solidly substantiated by comprehensive empirical analysis.