Graph Hierarchical Fusion for Emotion Recognition
- The paper introduces graph-based hierarchical fusion methods that construct multi-level graphs to capture both intra- and inter-modal dependencies in emotion recognition tasks.
- Advanced attention and differential graph mechanisms dynamically filter noise and enhance feature extraction, leading to improved classification performance.
- Adaptive modality balancing, noise suppression, and contrastive graph learning contribute to state-of-the-art results on benchmarks such as IEMOCAP and MELD.
Graph-based hierarchical fusion in multimodal emotion recognition refers to a class of computational frameworks that integrate multi-modal signals (text, audio, vision, or physiological) using graph-structured neural models at multiple hierarchical levels. These structures encode intra- and inter-modal dependencies, model speaker and context relations, and enable effective information exchange across heterogeneous modalities and temporal or conversational hierarchies. By leveraging advances in graph neural networks (GNNs), attention, and structured adaptivity, these models have established state-of-the-art accuracy and robustness in emotion recognition tasks under real-world conversational and signal conditions.
1. Model Architectures and Hierarchical Graph Construction
Recent graph-based hierarchical fusion methods instantiate multi-level graph or hypergraph structures to capture both short- and long-range modality interactions and conversational context. Core approaches can be categorized as follows:
- Modality-specific Subgraphs: For each modality (e.g., text, audio, vision), independent graphs encode per-utterance or temporal dependencies, often distinguishing intra-speaker (self-continuity) and inter-speaker (cross-influence) relations. For example, AMB-DSGDN builds directed inter- and intra-speaker graphs per modality, using adjacency matrices that encode windowed temporal and speaker interactions (Wang et al., 7 Mar 2026).
- Heterogeneous Bipartite Graphs: Some methods construct cross-modal bipartite or multi-partite graphs, linking nodes from different modalities to explicitly model inter-modal semantic or temporal alignment. Sync-TVA builds three bipartite graphs (V–A, T–V, A–T) with learned edge weights complemented by modality-specific dynamic enhancement (Deng et al., 29 Jul 2025). Speech emotion recognition frameworks employ similar cross-modal graph constructions, fully connecting text and audio nodes and integrating prosodic and spectral cue nodes (Ferreira et al., 2 Jun 2025).
- Pairwise and Multiway Graphs: Other designs, such as GraphMFT, build pairwise heterogeneous graphs over each modality pair (audio–text, audio–vision, text–vision), learning intra-modal and cross-modal attentional edge weights to separately capture contextual and complementary information (Li et al., 2022). Two-stage constructions also appear, e.g., the hierarchical intra-utterance and conversation-level graphs in HFGCN (Tang et al., 2021).
- Hypergraph and Variational Structures: To model high-order relationships and dynamic context, frameworks like HAUCL employ variational hypergraph autoencoders that dynamically reconstruct hyperedges capturing both within-utterance (cross-modal) and across-utterance (same-modality) fusion (Yi et al., 2024).
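The modality-specific subgraph idea can be made concrete with a small sketch. It builds directed intra-speaker and inter-speaker adjacency matrices over a windowed conversation. The function name `build_speaker_graphs` and the window parameters `past` and `future` are hypothetical, and the exact edge rules of AMB-DSGDN are not given here, so this is an illustration of the general pattern rather than that paper's construction.

```python
from typing import List, Tuple


def build_speaker_graphs(
    speakers: List[str], past: int = 2, future: int = 1
) -> Tuple[List[List[int]], List[List[int]]]:
    """Return (intra, inter) adjacency matrices for one modality.

    adj[i][j] == 1 means utterance i receives a message from utterance j.
    Only utterances within `past` steps before and `future` steps after i
    are connected (an assumed windowing scheme).
    """
    n = len(speakers)
    intra = [[0] * n for _ in range(n)]
    inter = [[0] * n for _ in range(n)]
    for i in range(n):
        lo = max(0, i - past)
        hi = min(n - 1, i + future)
        for j in range(lo, hi + 1):
            if j == i:
                continue  # self-loops are left out of this sketch
            if speakers[i] == speakers[j]:
                intra[i][j] = 1  # same speaker: self-continuity
            else:
                inter[i][j] = 1  # different speaker: cross-influence
    return intra, inter
```

One such pair of matrices would be built per modality, so that each modality graph shares the conversational structure but carries its own node features.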
2. Graph Attention, Differential Mechanisms, and Adaptivity
Sophisticated GNN variants and attention mechanisms are foundational to effective hierarchical fusion:
- Graph Attention Networks (GATs): Most contemporary methods use multi-head attention-based GNNs to learn edge weights adaptively at each layer, as in GraphMFT, GA2MIF, and Sync-TVA (Li et al., 2022, Li et al., 2022, Deng et al., 29 Jul 2025). These layers operate over modality graphs or cross-modal graphs, with residual and skip connections to mitigate over-smoothing.
- Differential Attention Mechanisms: AMB-DSGDN introduces a differential graph attention layer (DiffRGCN), which processes parallel positive and negative attention branches, subtracting and amplifying the difference between them. This mechanism filters out modality-shared noise and preserves context-relevant, modality-specific signals. The resulting attention maps are explicitly contrastive and relation-aware, and coefficients are depth-adapted to avoid overfitting or over-suppression (Wang et al., 7 Mar 2026).
- Multi-Stage/Multi-Level Propagation: Hierarchical stacking of GCN, GAT, or Transformer layers across multiple graphs or graph levels enables richer abstraction. For example, GA2MIF first runs intra-modal multi-head directed GATs, then fuses modality streams via pairwise multi-head cross-modal attention (MPCAT) (Li et al., 2022).
- Dynamic Graph and Hypergraph Construction: HAUCL’s VHGAE allows learned pruning and reinforcement of hyperedges, reducing redundant propagation and over-smoothing associated with fully connected or static designs (Yi et al., 2024).
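The differential attention idea described above can be sketched as two graph-masked attention maps whose difference weights the messages. The source only states that positive and negative branches are subtracted; the scaled dot-product scoring, the separate projection matrices, and the coefficient `lam` are assumptions made for illustration, not the DiffRGCN layer itself.

```python
import numpy as np


def masked_softmax(scores: np.ndarray, adj: np.ndarray) -> np.ndarray:
    """Row-wise softmax restricted to graph neighbors (adj > 0)."""
    masked = np.where(adj > 0, scores, -np.inf)
    row_max = np.max(masked, axis=1, keepdims=True)
    row_max = np.where(np.isfinite(row_max), row_max, 0.0)  # isolated nodes
    exp = np.exp(masked - row_max)  # non-neighbors become exactly 0
    denom = exp.sum(axis=1, keepdims=True)
    return np.divide(exp, denom, out=np.zeros_like(exp), where=denom > 0)


def differential_graph_attention(x, adj, w_q_pos, w_k_pos, w_q_neg, w_k_neg,
                                 w_v, lam=0.5):
    """Aggregate neighbor values with (A_pos - lam * A_neg).

    x: (n, d_in) node features; adj: (n, n) adjacency;
    w_*: (d_in, d_out) projections; lam: hypothetical balancing coefficient.
    """
    d = w_q_pos.shape[1]
    s_pos = (x @ w_q_pos) @ (x @ w_k_pos).T / np.sqrt(d)
    s_neg = (x @ w_q_neg) @ (x @ w_k_neg).T / np.sqrt(d)
    a_pos = masked_softmax(s_pos, adj)
    a_neg = masked_softmax(s_neg, adj)
    a_diff = a_pos - lam * a_neg  # attention shared by both branches cancels
    return a_diff @ (x @ w_v), a_diff
```

Because both branches are normalized over the same neighborhood, attention mass that the two maps assign to the same neighbors is reduced, while edges favored only by the positive branch keep most of their weight. A depth-dependent `lam` would be one way to realize the depth-adapted coefficients mentioned above.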
3. Modality Balancing, Regularization, and Noise Handling
Multimodal fusion in emotion recognition is prone to modal imbalance, in which dominant modalities overwhelm weaker ones, suppressing signal diversity and degrading performance. Recent methods address this in several ways:
- Adaptive Modality Balancing: AMB-DSGDN continually tracks each modality's weighted F1, computes relative modality ratios from these scores, and applies modality-level dropout with an adaptive probability qₘ to suppress the influence of dominant signals. Stronger streams are stochastically masked and the retained features are rescaled to preserve their expected magnitude, leading to better utilization of complementary information (Wang et al., 7 Mar 2026).
- Noise and Redundancy Suppression: Differential attention and hierarchical hyperedge pruning (in AMB-DSGDN and HAUCL) suppress contextually shared noise and mitigate the risk of redundant message passing or over-smoothing in deep GNNs (Wang et al., 7 Mar 2026, Yi et al., 2024).
- Contrastive and Auxiliary Losses: Several hierarchical graph approaches integrate auxiliary supervision, such as cross-entropy for unimodal branches or graph-contrastive alignment (Joyful, HAUCL), which improves class separability and stabilizes stochastic representations under adversarial or noisy edge conditions (Li et al., 2023, Yi et al., 2024).
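A minimal sketch of adaptive modality balancing follows. It maps per-modality weighted F1 scores to dropout probabilities and then drops whole modality vectors with inverted-dropout rescaling. The ratio-to-probability rule, the parameters `alpha` and `q_max`, and the F1 values in the usage note are all assumptions for illustration; the source does not specify the exact formula used by AMB-DSGDN.

```python
import numpy as np


def modality_dropout_probs(f1_by_modality, alpha=1.0, q_max=0.5):
    """Map per-modality F1 to a dropout probability q_m.

    Assumed rule: q_m = clip(alpha * (f1_m / mean_f1 - 1), 0, q_max), so only
    modalities performing above the average are masked.
    """
    mean_f1 = sum(f1_by_modality.values()) / len(f1_by_modality)
    if mean_f1 == 0:
        return {name: 0.0 for name in f1_by_modality}
    probs = {}
    for name, f1 in f1_by_modality.items():
        ratio = f1 / mean_f1
        probs[name] = float(min(q_max, max(0.0, alpha * (ratio - 1.0))))
    return probs


def apply_modality_dropout(features, q, rng):
    """Drop entire feature rows with probability q, rescaling the rest.

    features: (batch, dim) array for a single modality.
    Dividing by (1 - q) keeps the expected feature magnitude unchanged.
    """
    if q <= 0.0:
        return features
    keep = rng.random((features.shape[0], 1)) >= q
    return features * keep / (1.0 - q)
```

With made-up scores such as `{"text": 0.66, "audio": 0.50, "vision": 0.40}`, only the text stream receives a nonzero dropout probability, so the weaker audio and vision streams are always passed through.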
4. Fusion and Decoding
Each method employs specific strategies for aggregating features across modalities and hierarchical levels to inform final emotion classification:
- Per-Modality Decoding + Late Fusion: AMB-DSGDN applies per-modality MLP output heads, then sum-fuses logits before softmax for final decoding, supported by corresponding auxiliary losses (Wang et al., 7 Mar 2026).
- Node Pooling and Recombination: HFGCN uses per-utterance node pooling and concatenation of modality-specific representations, pairing them with early-fusion outputs across hierarchical graph levels (Tang et al., 2021).
- Concatenation and Projected Fusion: Other pipelines concatenate representations from each modality or graph, then project via small MLPs or direct transformation into classification layers (GraphMFT, Joyful, HAUCL) (Li et al., 2022, Li et al., 2023, Yi et al., 2024).
- Weighted and Regularized Loss: Most frameworks use weighted cross-entropy or multitask variants, often with explicit class weighting to address dataset imbalance (Sync-TVA, HFGCN) (Deng et al., 29 Jul 2025, Tang et al., 2021). Additional L2 regularization, dropout, and data augmentation further enhance generalization.
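The late-fusion and weighted-loss strategies above can be combined in a short sketch: per-modality logits are summed before the final decision, and a class-weighted cross-entropy is applied to both the fused and the unimodal outputs. The function names and the `aux_weight` factor are hypothetical, and normalizing the loss by the total sample weight is one common convention rather than something the cited papers are known to use.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable row-wise log-softmax."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))


def weighted_cross_entropy(logits, labels, class_weights):
    """Class-weighted cross-entropy, normalized by the total sample weight."""
    logp = log_softmax(logits)
    picked = logp[np.arange(len(labels)), labels]
    w = class_weights[labels]
    return float(-(w * picked).sum() / w.sum())


def late_fusion_loss(logits_by_modality, labels, class_weights, aux_weight=0.5):
    """Sum-fuse per-modality logits and add auxiliary unimodal losses.

    logits_by_modality: dict of (batch, n_classes) arrays.
    Returns (total_loss, predicted_labels_from_fused_logits).
    """
    fused = sum(logits_by_modality.values())
    loss = weighted_cross_entropy(fused, labels, class_weights)
    for logits in logits_by_modality.values():
        loss += aux_weight * weighted_cross_entropy(logits, labels, class_weights)
    return loss, fused.argmax(axis=1)
```

The auxiliary terms keep each unimodal head trained on its own, which is the role the per-branch supervision plays in the description above.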
5. Empirical Results, Ablations, and Performance Analysis
Hierarchical graph-based fusion is reported to outperform monolithic or naive fusion baselines on standard benchmarks (IEMOCAP, MELD, DEAP, MAHNOB-HCI). The table below lists weighted F1 (WF1) on IEMOCAP and MELD as reported in the cited papers:
| Method/Paper | IEMOCAP WF1 | MELD WF1 | Improvements/Notes |
|---|---|---|---|
| AMB-DSGDN (Wang et al., 7 Mar 2026) | 75.64% | 66.18% | Differential attention + adaptive dropout critical |
| HAUCL (Yi et al., 2024) | 70.27% | 66.72% | Hypergraph autoencoder + contrastive learning |
| Joyful (Li et al., 2023) | 71.03% | 61.77% | Joint fusion, graph contrastive learning |
| GA2MIF (Li et al., 2022) | 70.00% | 58.94% | Hierarchical stagewise GNN/attention |
| HFGCN (Tang et al., 2021) | 67.24% | 59.71% | Two-stage modality-level and conversation-level graphs, multitask VA decoding |
| GraphMFT (Li et al., 2022) | 68.07% | 58.37% | Heterogeneous graphs, multihead skip GAT |
| Sync-TVA (Deng et al., 29 Jul 2025) | (not specified) | (not specified) | Robust to class imbalance (noted gains in minority classes) |
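
For reference, the WF1 columns in the table above can be read as the support-weighted average of per-class F1 scores, assuming WF1 carries its usual meaning in this literature. The sketch below computes that quantity from label lists; the function name `weighted_f1` is illustrative.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted average of per-class F1 scores."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for cls, n_cls in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = n_cls - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += (n_cls / total) * f1  # weight each class by its frequency
    return score
```

Because frequent classes dominate this average, gains on minority emotions can be large in per-class terms while moving WF1 only slightly.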
Ablation studies consistently demonstrate:
- Removal of hierarchical graph structure or attention mechanisms drops F1/accuracy by 3–8 percentage points (Wang et al., 7 Mar 2026, Yi et al., 2024, Li et al., 2023, Li et al., 2022).
- Bypassing adaptivity (fixed graphs, static dropout, or no contrastive loss) induces further performance reductions (Wang et al., 7 Mar 2026, Yi et al., 2024).
- Overly dense graphs or overly deep GNNs cause over-smoothing and weaken the separation of emotion classes; graph augmentation and residual connections are used to counter this (Li et al., 2023, Yi et al., 2024, Li et al., 2022).
6. Methodological Innovations and Theoretical Insights
Graph-based hierarchical fusion approaches have introduced several methodological advances:
- Differential Graph Attention: Amplifies modality-specific context and suppresses shared or redundant cues utilizing explicit positive/negative branch subtraction and depth-aware balancing (Wang et al., 7 Mar 2026).
- Dynamic Hypergraph Formation: Variational autoencoding over hyperedges introduces task-dependent adaptivity to context propagation, directly addressing the over-smoothing and redundancy intrinsic to static fully-connected graphs (Yi et al., 2024).
- Contrastive Graph Learning: Joint optimization for class-separability and robustness through inter/intra-view contrastive loss sharpens representational differences, especially under adversarial perturbations or limited context (Li et al., 2023, Yi et al., 2024).
- Hierarchical and Multi-Stage Graph Propagation: Distinct intra-modal and cross-modal fusion stages (as in GA2MIF, Sync-TVA, GraphMFT) allow modeling of local context and global semantic complementarity without the confusion and noise of flat heterogeneous graphs (Li et al., 2022, Deng et al., 29 Jul 2025, Li et al., 2022).
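The contrastive graph learning objective can be illustrated with a generic InfoNCE-style loss between two views of the same nodes (for example, embeddings from two augmented graphs). This is a stand-in chosen for illustration; the exact inter-view and intra-view losses used by Joyful and HAUCL are not specified here, and the `temperature` value is an assumption.

```python
import numpy as np


def info_nce(view_a, view_b, temperature=0.5):
    """InfoNCE loss where row i of view_a and row i of view_b are positives.

    view_a, view_b: (n, d) embeddings of the same n nodes under two views.
    All other rows of view_b act as negatives for row i of view_a.
    """
    a = view_a / np.linalg.norm(view_a, axis=1, keepdims=True)
    b = view_b / np.linalg.norm(view_b, axis=1, keepdims=True)
    sim = a @ b.T / temperature  # cosine similarities, temperature-scaled
    sim = sim - sim.max(axis=1, keepdims=True)  # numerical stability
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))
```

The loss is low when each node's two views are closer to each other than to other nodes, which is the property that sharpens class separability in the embedding space.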
A plausible implication is that deeper hierarchically structured and adaptively regularized graph fusion architectures provide a scalable pathway to robust emotion recognition in increasingly complex, spontaneous, or real-world dialog settings.
7. Datasets, Interpretability, and Future Directions
Graph-based hierarchical fusion models are validated predominantly on benchmark datasets such as IEMOCAP, MELD, DEAP, MAHNOB-HCI, and MOSEI. Results on these corpora suggest that modeling multi-level, multi-modal relational structure is important for real-world emotion understanding.
Interpretability in these models is enhanced by explicit projection into affective subspaces (e.g., valence–arousal), t-SNE visualization of graph embeddings, and modular graph construction that reveals which context and modality interactions drive prediction (Tang et al., 2021, Li et al., 2023).
Future avenues include further dynamic graph learning (beyond hypergraph autoencoding), integration of domain-specific cues (e.g., prosody, speaker intent), federated adaptation to new domains (domain-invariant graph contrastive learning), and scaling to real-time, streaming contexts where hierarchical graph formation must be adaptive and efficient.
References:
- (Wang et al., 7 Mar 2026, Ferreira et al., 2 Jun 2025, Deng et al., 29 Jul 2025, Yi et al., 2024, Li et al., 2023, Li et al., 2022, Li et al., 2022, Tang et al., 2021, Jia et al., 2021)