Graph-Based Multimodal Fusion
- Graph-based multimodal fusion is a framework that represents heterogeneous modalities as graph nodes linked by intra- and inter-modal edges for comprehensive data integration.
- It employs graph neural networks, attention mechanisms, and hierarchical pooling to adaptively fuse high-dimensional, semantically diverse signals.
- Empirical results reveal that these methods enhance accuracy and robustness in applications such as sentiment analysis, medical prognosis, and fake news detection.
Graph-based multimodal fusion is a family of computational frameworks that explicitly model relationships both within and across heterogeneous modalities (text, image, audio, video, structured data) by representing them as structured graphs and employing graph neural networks (GNNs), graph attention, and related operators for joint reasoning. These methods address the challenges of fusing high-dimensional, semantically diverse, and potentially unaligned multimodal signals by encoding them as nodes and edges in a graph, facilitating both local and global dependency modeling. The field encompasses representation learning, early/late/adaptive-stage fusion, hierarchical information aggregation, uncertainty modeling, and algorithmic considerations such as efficiency, interpretability, and robustness to missing or noisy modalities.
1. Core Principles of Graph-Based Multimodal Fusion
Graph-based multimodal fusion systems characterize multimodal data as attributed graphs, with nodes corresponding to modality-specific entities (such as words, objects, regions, utterances) and edges encoding intra-modal, inter-modal, or temporal/semantic relationships. This enables explicit modeling of both homogeneous (within-modality) and heterogeneous (cross-modality) interactions.
Fundamental principles include:
- Flexible graph construction: Modalities can be represented as separate graphs (e.g., intra-modal similarity graphs), block-diagonal amalgamations (“dual graphs” for intra- and inter-modal signals (Karthikeya et al., 26 Jan 2026)), or nodes and edges of a heterogeneous supergraph capturing sequential, semantic, or spatial dependencies (Yang et al., 2020, Shan et al., 24 Aug 2025, Dhawan et al., 2022, Hu et al., 2023).
- Task- and data-adaptive edge definition: Edges are weighted by learned attention, feature similarity (cosine/Gaussian kernels (Karthikeya et al., 26 Jan 2026)), statistical dependency scores (mutual information (Shan et al., 24 Aug 2025)), knowledge-graph relations, or structural priors from scene graphs (Li et al., 16 Sep 2025).
- Joint representation learning via GNNs: Node states are updated by message-passing along graph edges, propagating and fusing multimodal context. Standard blocks include GCNs, GATs, relational-GCNs, and multi-head attention (Tang et al., 2021, Li et al., 2022, Dhawan et al., 2022, Shan et al., 24 Aug 2025).
- Explicit intra- vs. inter-modal propagation: Many fusion networks distinguish propagation along homogeneous (sequential or spatial) edges versus heterogeneous (cross-modal) edges, potentially using asynchronous, gated, or staged updates to control fusion order (Hu et al., 2023, Tang et al., 2021).
- Multi-scale and hierarchical fusion: Structures such as hierarchical fusion graphs and pooling networks recursively aggregate unimodal, bimodal, and trimodal signals at suitable scales (Tang et al., 2021, Mai et al., 2019, Li et al., 16 Sep 2025).
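The principles above can be made concrete with a minimal sketch: a small attributed multimodal graph with chain-structured intra-modal edges and thresholded cosine-similarity inter-modal edges, followed by one mean-aggregation message-passing step. The feature matrices and similarity threshold are hypothetical; real systems learn typed edges and attention weights rather than using fixed heuristics.

```python
import numpy as np

def build_multimodal_graph(text_feats, image_feats, sim_threshold=0.5):
    """Stack two modalities' features into one node matrix and build a
    joint adjacency: intra-modal chain edges over each sequence, plus
    inter-modal edges for cross-modality pairs whose cosine similarity
    exceeds a threshold. (Illustrative construction only.)"""
    X = np.vstack([text_feats, image_feats])          # (n_t + n_i, d)
    n_t, n = len(text_feats), len(X)
    A = np.zeros((n, n))
    # intra-modal: chain edges within each modality's node range
    for start, end in [(0, n_t), (n_t, n)]:
        for i in range(start, end - 1):
            A[i, i + 1] = A[i + 1, i] = 1.0
    # inter-modal: cosine-similarity edges across modalities
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    S = Xn @ Xn.T
    for i in range(n_t):
        for j in range(n_t, n):
            if S[i, j] > sim_threshold:
                A[i, j] = A[j, i] = S[i, j]
    return X, A

def propagate(X, A):
    """One mean-aggregation message-passing step (self-loops added)."""
    A_hat = A + np.eye(len(A))
    D_inv = 1.0 / A_hat.sum(axis=1, keepdims=True)
    return D_inv * (A_hat @ X)
```

On such a graph, repeated calls to `propagate` fuse multimodal context into every node, which is the basic mechanism the GNN operators in Section 2 elaborate.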
2. Methodological Approaches
2.1 Graph Construction and Edge Modeling
Graph construction strategies fall into several classes:
- Fully connected or typed graphs: Every node (e.g., token, region, object) is connected to every other node via typed edges (modality, temporal direction) (Yang et al., 2020, Tang et al., 2021).
- Sparse similarity/semantic graphs: Edges are retained if feature similarity or mutual information exceeds a threshold, supporting robust long-range associations (Karthikeya et al., 26 Jan 2026, Shan et al., 24 Aug 2025).
- Scene and knowledge graphs: Nodes correspond to entities, attributes, and relations derived from vision (object detection), language (dependency parsing), or curated knowledge (Li et al., 16 Sep 2025, Sani et al., 2024).
- Rank-fusion or retrieval graphs: Vertices encode items/samples or search results, with edges and weights determined by score aggregation across rankers (Dourado et al., 2019).
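The rank-fusion idea in the last item can be sketched as follows, assuming a simple reciprocal-rank aggregation over each ranker's top-k list; the cited work defines its own, richer vertex and edge schemes, so this is only a schematic.

```python
from collections import defaultdict

def rank_fusion_graph(rankings, k=3):
    """Build a fusion graph from several rankers' result lists: items
    that co-occur in any ranker's top-k are linked, with the edge weight
    accumulating reciprocal-rank contributions from each ranker.
    (Hypothetical weighting; illustrative only.)"""
    edges = defaultdict(float)
    for ranking in rankings:
        top = ranking[:k]
        for i, a in enumerate(top):
            for j, b in enumerate(top):
                if a < b:  # undirected edge, stored in canonical order
                    edges[(a, b)] += 1.0 / (i + 1) + 1.0 / (j + 1)
    return dict(edges)
```

Edge weights then feed a graph-level similarity or ranking stage, so agreement across rankers strengthens the corresponding links.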
2.2 Graph-based Fusion Operators
Fusion is achieved by applying GNNs—GCN, GAT, hierarchical aggregation—over the constructed multimodal graphs. Notable operator types include:
- Relational GNNs: Utilizing edge or relation types to differentiate modality and semantic roles during message passing (Tang et al., 2021, Li et al., 2022).
- Graph attention mechanisms: Attend over intra- and/or inter-modal neighbors, learning adaptive, instance-specific attention weights (Dhawan et al., 2022, Yang et al., 2020, Li et al., 2022).
- Multistage and hierarchical architectures: Progress through stages (unimodal → bimodal → trimodal) or local/global fusion layers (e.g., GraphMMP’s local GNN + Mamba global fusion (Shan et al., 24 Aug 2025); HFGCN’s utterance vs. conversation-level graphs (Tang et al., 2021)).
- Spectral and graph-signal filtering: Enhance representations via Chebyshev polynomial filters or spectrum-aware convolutions to exploit underlying graph topology (Karthikeya et al., 26 Jan 2026).
- Hop-diffused attention and graph expansion: Incorporate multi-hop relationships (hop-diffused attention (Ning et al., 19 Oct 2025), graph powers (Ding et al., 2024)) to propagate information beyond immediate neighbors without over-smoothing.
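The attention-based operators above can be sketched as a single-head, GAT-style layer. `W`, `a_src`, and `a_dst` stand in for learned parameters, and the additive LeakyReLU scoring follows the standard GAT recipe rather than any one cited system.

```python
import numpy as np

def graph_attention_fuse(X, A, W, a_src, a_dst):
    """One single-head, GAT-style update over a multimodal graph:
    transform nodes by W, score each edge with additive attention
    (LeakyReLU), softmax over each node's neighborhood (self-loop
    included), and aggregate neighbor features."""
    H = X @ W                                  # (n, d')
    s, t = H @ a_src, H @ a_dst                # per-node score halves
    z = s[:, None] + t[None, :]                # additive pair scores
    z = np.where(z > 0, z, 0.2 * z)            # LeakyReLU
    mask = (A + np.eye(len(A))) > 0            # neighborhood + self
    z = np.where(mask, z, -np.inf)             # mask non-edges
    e = np.exp(z - z.max(axis=1, keepdims=True))
    alpha = e / e.sum(axis=1, keepdims=True)   # row-stochastic attention
    return alpha @ H
```

Relational variants apply distinct `W` matrices per edge type (intra- vs. inter-modal), and multi-head versions concatenate several such updates.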
2.3 Fusion Order, Adaptation, and Pooling
- Adaptive/Ordered Fusion: MMSR proposes node-wise gates that interpolate between early- and late-fusion, learning per-node fusion order by asynchronously updating representations based on the attended strength of homogeneous (sequential) and heterogeneous (cross-modal) neighbors (Hu et al., 2023).
- Pooling and hierarchical readout: Mechanisms such as mean/max graph pooling, link similarity pooling, or hierarchical attention aggregate graph vertices into task-specific vectors, enabling scalable and interpretable graph-level representations (Mai et al., 2020, Tang et al., 2021, Mai et al., 2019).
- Gating and global fusion blocks: Dynamic modality weights are learned to suppress noisy or corrupted modality information, as in the uncertainty-based gating of DUP-MCRNet (Xiong et al., 28 Aug 2025) and the late-stage attention fusion of COHESION (Xu et al., 6 Apr 2025).
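A schematic combining the pooling and gating ideas above: each modality's node embeddings are mean-pooled to a graph-level vector, and scalar gates (computed here from a hypothetical weight vector `gate_w`) weight the modalities before summation, so a noisy modality can be down-weighted.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_fusion_readout(modality_graphs, gate_w):
    """Mean-pool each modality's node embeddings into a graph-level
    vector, compute a scalar gate per modality from its pooled vector,
    normalize the gates, and return the gate-weighted sum.
    gate_w stands in for a learned parameter."""
    pooled = [H.mean(axis=0) for H in modality_graphs]
    gates = np.array([sigmoid(p @ gate_w) for p in pooled])
    gates = gates / gates.sum()          # normalized modality weights
    fused = sum(g * p for g, p in zip(gates, pooled))
    return fused, gates
```

Real systems condition the gates on richer signals (e.g., per-pixel uncertainty in DUP-MCRNet), but the gate-then-combine structure is the same.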
3. Notable Architectures and Applications
3.1 Sentiment, Emotion, and Sequence Analysis
- MTAG: Constructs fully typed modal-temporal graphs from unaligned multimodal language sequences and applies a multi-head modal-temporal attention fusion with dynamic edge pruning (Yang et al., 2020).
- HFGCN: Implements a two-stage hierarchical graph for enriched conversation-level emotion recognition (Tang et al., 2021).
- GraphMFT and Multimodal Graph: Fuse multimodal conversational data at the utterance or sequence level via pairwise/bimodal graphs and hierarchical pooling (Li et al., 2022, Mai et al., 2020).
- AGSP-DSA: Applies dual-graph (intra-, inter-modal) signal processing with spectral filtering and dynamic semantic alignment, achieving state-of-the-art on missing-modality sentiment/event datasets (Karthikeya et al., 26 Jan 2026).
3.2 Medical Prognosis and Recommendation
- GraphMMP: Leverages per-patient multimodal graphs with MI-based edge weights and augments GNN aggregation with a Mamba-based global fusion block, significantly boosting clinical risk prediction (Shan et al., 24 Aug 2025).
- COHESION: Unifies early- and late-stage fusion in a composite graph convolutional network, employing both heterogeneous (user-item-modal) and homogeneous (user-user, item-item) topologies. An adaptive BPR loss balances modality contributions to recommendation quality (Xu et al., 6 Apr 2025).
- CrossGMMI-DUKGLR: Integrates multi-head cross-attention for image/text fusion on knowledge graphs, applies GATs for higher-order propagation, and regularizes with a cross-graph mutual information objective (Fang, 3 Sep 2025).
3.3 Fake News, Retrieval, Multimodal Reasoning
- GAME-ON: Forms a single fully connected multimodal graph for each tweet/post (visual, textual nodes), applies GAT layers, and achieves high accuracy and efficiency in fake news detection (Dhawan et al., 2022).
- Multimodal Retrieval: Early frameworks fuse visual and textual similarity graphs via cross-media diffusion and random walks with query-dependent semantic filtering, and recommend practical fusion recipes for large collections (Csurka et al., 2014).
- LEGO Fusion: Introduces a learnable graph fusion operator based on multilinear expansions of the adjacency relationships across modalities, enabling principled, interpretable fusion for anomaly detection (Ding et al., 2024).
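The cross-media diffusion mentioned in the retrieval item can be sketched as a personalized random walk over an averaged similarity graph; the mixing weight, averaging scheme, and iteration count here are illustrative, not the recipe from the cited work.

```python
import numpy as np

def diffuse_scores(A_visual, A_textual, q, alpha=0.85, iters=50):
    """Personalized random walk over the average of visual and textual
    similarity graphs: scores diffuse outward from the initial query
    similarities q, mixing cross-media neighborhood structure into the
    final ranking."""
    A = 0.5 * (A_visual + A_textual)       # naive graph fusion
    P = A / A.sum(axis=1, keepdims=True)   # row-stochastic transitions
    q_hat = q / q.sum()                    # normalized restart vector
    r = q_hat.copy()
    for _ in range(iters):
        r = alpha * (P.T @ r) + (1 - alpha) * q_hat
    return r
```

Items strongly connected to the query's neighbors in either modality accumulate score, which is the effect that lifts graph diffusion above plain score concatenation.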
3.4 Visual, Sensor, and Scene Graph Fusion
- MSGFusion: Aligns and processes visual and textual scene graphs via GNNs and graph attention, then fuses high-level semantic information with low-level visual cues through per-pixel graph-driven fusion for robust infrared-visible image fusion (Li et al., 16 Sep 2025).
- SAGA-KF: Develops a sensor-agnostic, graph-aware Kalman filter for multimodal sensor fusion in autonomous driving, employing a process model that propagates uncertainty and state updates along the interaction graph (Sani et al., 2024).
3.5 Neural Machine Translation and Conversational Tasks
- Graph-based NMT encoder: Constructs unified multimodal graphs encoding word-object semantic correspondences and spatial relations, propagates context by stacking intra- and inter-modal GNN layers, and feeds fused text representations into standard Transformer decoders (Yin et al., 2020).
4. Empirical Impact and Comparative Performance
Empirical evaluation across a range of benchmarks consistently demonstrates the superiority of graph-based multimodal fusion over conventional feature concatenation or late-fusion baselines:
- Medical prognosis: GraphMMP improves accuracy and AUC by 2–7% over strong baselines; ablations show that removing either the MI-based edge construction or the Mamba fusion block alone costs 2–7% (Shan et al., 24 Aug 2025).
- Recommendation: COHESION and MMSR attain substantial gains (COHESION +9.1% NDCG@10 (Xu et al., 6 Apr 2025), MMSR +17.2% MRR@5) compared to classical collaborative and non-modal GNNs, in both full- and missing-modality conditions (Hu et al., 2023).
- Emotion/sentiment analysis: HFGCN and AGSP-DSA yield F1 increases of 1–3% over MMGCN, and AGSP-DSA demonstrates superior robustness to missing modalities, dropping only 1–2% in ablation (Tang et al., 2021, Karthikeya et al., 26 Jan 2026).
- Ranking/retrieval: Rank-fusion graphs and graph-level similarity diffusion robustly outperform early/late concatenation by 2–8 points in mAP and accuracy across both multimodal and single-modality setups (Dourado et al., 2019, Csurka et al., 2014).
- Fake news/classification: GAME-ON, with a single GAT layer, achieves 11% higher F1 with 90% fewer parameters than prior state-of-the-art (Dhawan et al., 2022).
- Multi-hop fusion and global structure: Graph4MM’s hop-diffused attention and MM-QFormer yield up to 6.93% average improvement over strong VLM/graph/LLM baselines in zero-shot and generative settings (Ning et al., 19 Oct 2025).
5. Advanced Features and Architectural Innovations
- Dynamic fusion strategies: Models like MMSR, COHESION, and DUP-MCRNet employ trainable gating or dynamic weighting to determine when, where, and how to fuse modalities, adapting to input content and suppressing noise (Hu et al., 2023, Xu et al., 6 Apr 2025, Xiong et al., 28 Aug 2025).
- Hierarchical and multi-scale representations: Integration across semantic levels via hierarchical pooling, multi-level attention, or graph expansion captures global structure without overfitting or over-smoothing (Tang et al., 2021, Ding et al., 2024, Li et al., 16 Sep 2025).
- Graph-driven uncertainty propagation: DUP-MCRNet propagates per-pixel uncertainty through a spatial graph, modulating graph convolution and fusion, achieving robust edge and detail preservation (Xiong et al., 28 Aug 2025).
- Global fusion via state-space models: GraphMMP’s integration of Mamba, a selective state-space mechanism, offers long-range, efficient fusion superior to classical GNN stacks (Shan et al., 24 Aug 2025).
- Structural guidance for large models: Graph4MM demonstrates theoretically and empirically that using graphs as attention masks and hop-diffusion guides, rather than as a standalone input to foundation models, leads to improved generalization and information flow (Ning et al., 19 Oct 2025).
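The "graph as guide" idea in the last item can be sketched by deriving a hop-limited boolean attention mask from the adjacency matrix; systems like Graph4MM add per-hop biases on top of such masks, so this is only a schematic of the masking step.

```python
import numpy as np

def hop_attention_mask(A, max_hops=2):
    """Build a boolean mask allowing token i to attend to token j only
    if j is reachable within max_hops graph steps (self included).
    The mask can then restrict a transformer's attention pattern."""
    n = len(A)
    step = A > 0
    reach = np.eye(n, dtype=bool)          # 0 hops: self only
    cur = np.eye(n, dtype=bool)
    for _ in range(max_hops):
        cur = (cur.astype(int) @ step.astype(int)) > 0
        reach |= cur                       # accumulate k-hop reachability
    return reach
```

Used as an attention mask, this injects graph locality into a foundation model's attention without feeding the graph in as a separate input modality.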
6. Limitations, Open Challenges, and Future Directions
While graph-based multimodal fusion frameworks have shown increased accuracy, efficiency, and flexibility across a wide array of tasks, several challenges persist:
- Graph construction at scale: Crafting meaningful, efficient graphs from large-scale, noisy, or weakly-aligned modalities requires advances in automated structure discovery, bootstrapping edge definitions, and leveraging weak supervision (Karthikeya et al., 26 Jan 2026, Ning et al., 19 Oct 2025).
- Computational efficiency and over-smoothing: Deep and wide GNNs risk feature homogenization and scaling bottlenecks. Methods such as hop-diffusion, residual + dense-concat, and hierarchical/staged aggregation have partially mitigated this (Ning et al., 19 Oct 2025, Li et al., 2022).
- Missing data and robustness: Many graph-based systems naturally handle missing modalities (MMSR, AGSP-DSA), but further research on principled uncertainty quantification and recovery remains necessary (Hu et al., 2023, Karthikeya et al., 26 Jan 2026, Xiong et al., 28 Aug 2025).
- Interpretability: The explicit encoding of interactions as graphs offers a promising avenue for post hoc analysis of learned associations, but methods for explainability and graph attention inspection at scale remain underdeveloped (Yang et al., 2020, Mai et al., 2019).
- Heterogeneous node and edge types: Extending beyond homogeneous graphs to handle arbitrary entity-relation schemas (as in knowledge graphs or multi-task settings) is a frontier for both theory and practice (Fang, 3 Sep 2025, Ning et al., 19 Oct 2025).
- Integration with foundation models: There is active debate and evidence that treating the graph as a guide (masking, hop-priors) outperforms treating it as “just another modality” for large LLM or VLM architectures, due to the mutual information and generalization limits of local GNNs (Ning et al., 19 Oct 2025).
This suggests that the field is progressing toward models that can integrate massive, noisy, and complex multimodal data by leveraging graph-theoretic structure both as an intermediate reasoning substrate and as a means of guiding large-scale attention models. Future work is likely to focus on scalable graph construction, adaptive fusion under uncertainty, more expressive GNN architectures for heterogeneous and dynamic graphs, and hybrid strategies that integrate structural guidance with large pre-trained models while maintaining interpretability and computational tractability.