TiCAL: Typicality-Based Multimodal Emotion Recognition
- The paper introduces TiCAL using pseudo unimodal labeling and hyperbolic embedding to enhance MER performance across benchmarks.
- It implements dynamic sample-level typicality and consistency estimation to adaptively mitigate inter-modal emotional conflicts.
- Ablation studies confirm that key components, including the HypCPCC loss, significantly contribute to the system's robustness and interpretability.
Typicality-based Consistent-aware Multimodal Emotion Recognition (TiCAL) is a discriminative learning framework for Multimodal Emotion Recognition (MER) that systematically addresses inter-modal emotional conflicts by quantifying sample-level typicality and consistency, grounded in hyperbolic representations of emotion hierarchies. Unlike conventional MER approaches that supervise with unified emotion labels, TiCAL introduces pseudo unimodal labeling, consistency-aware weighting, and hyperbolic feature embedding, yielding improved robustness, interpretability, and recognition accuracy across benchmarks.
1. Problem Definition and Motivation
MER integrates information from heterogeneous modalities—textual (t), visual (v), and acoustic (a)—to infer a unified emotion label (a categorical emotion class, or a continuous sentiment score such as the [-3, 3] scale used by MOSI/MOSEI) (Yin et al., 19 Nov 2025). Prevailing methodologies commonly assume agreement among modalities within a sample, thereby supervising training with a single label. However, real-world data frequently contain inter-modal emotional conflicts, such as disagreeing prosodic and textual cues (e.g., an angry voice paired with happy text). TiCAL explicitly models such conflicts by assessing unimodal typicality and inter-modal consistency, facilitating learning that self-adapts to sample reliability and ambiguity.
2. Pseudo Unimodal Label Generation
TiCAL maintains a High-confidence Anchor Samples List (HASL) per modality, populated during the initial warmup epochs by samples whose unimodal predictions are correct with confidence above a threshold. After warmup, each new sample is assigned a pseudo unimodal label matching that of its closest anchor feature in hyperbolic space, where closeness is measured by the geodesic distance on the Poincaré ball. These pseudo labels provide fine-grained unimodal supervision, supporting subsequent typicality and consistency estimation.
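The nearest-anchor assignment above can be sketched in a few lines. The distance function is the standard curvature −1 Poincaré-ball geodesic distance; the `(embedding, label)` anchor-list layout is illustrative, not the paper's exact data structure:

```python
import numpy as np

def poincare_distance(u, v, eps=1e-9):
    """Geodesic distance between points u, v inside the unit Poincare ball."""
    sq = np.sum((u - v) ** 2)
    denom = (1.0 - np.sum(u * u)) * (1.0 - np.sum(v * v)) + eps
    return np.arccosh(1.0 + 2.0 * sq / denom)

def assign_pseudo_label(feature, anchors):
    """Assign the label of the nearest high-confidence anchor (HASL entry).

    anchors: list of (embedding, label) pairs, all inside the unit ball.
    """
    dists = [poincare_distance(feature, emb) for emb, _ in anchors]
    return anchors[int(np.argmin(dists))][1]
```

In practice this lookup runs once per modality per sample after the warmup phase, producing one pseudo label per modality.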
3. Hyperbolic Embedding and Regularization
TiCAL embeds unimodal features in an n-dimensional Poincaré ball, capturing hierarchical emotion relations. Key operations include Möbius addition, the exponential map, and hyperbolic distance (see mathematical definitions in (Yin et al., 19 Nov 2025)). Feature regularization aligns embeddings to a tree-structured emotion hierarchy via the Hyperbolic Cophenetic Correlation Coefficient (HypCPCC) loss, which encourages the pairwise hyperbolic distances between class embeddings to track the tree-path distances between the corresponding emotion classes. Ablation shows that removing this hyperbolic regularization costs 2.07% Acc-2 on MOSI, highlighting its empirical necessity.
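A minimal sketch of these ingredients, under two stated assumptions: the exponential map is taken at the origin of a curvature −1 ball, and the CPCC-style objective is modeled as one minus the Pearson correlation between hyperbolic pairwise distances and tree-path distances (the paper may use a different exact form):

```python
import numpy as np

def expmap0(v):
    """Exponential map at the origin of the curvature -1 Poincare ball:
    maps a tangent vector into the open unit ball."""
    n = np.linalg.norm(v)
    return np.tanh(n) * v / n if n > 0 else v

def hypcpcc_loss(hyp_dists, tree_dists):
    """CPCC-style loss (sketch): 1 - Pearson correlation between hyperbolic
    pairwise distances and tree-path distances over emotion-class pairs."""
    h = np.asarray(hyp_dists, dtype=float)
    t = np.asarray(tree_dists, dtype=float)
    h = h - h.mean()
    t = t - t.mean()
    corr = (h @ t) / (np.linalg.norm(h) * np.linalg.norm(t) + 1e-9)
    return 1.0 - corr
```

When the two distance profiles are perfectly proportional the loss is near zero, so minimizing it pulls the embedding geometry toward the emotion-tree geometry.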
4. Dynamic Sample-level Typicality and Consistency Estimation
For each modality, TiCAL computes the typicality of a pseudo label as its distance-normalized proximity to the HASL anchors: the closer a sample lies to its matched anchor relative to all anchors, the higher its typicality. High typicality signifies anchor-consistent, "typical" samples.
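One way to realize "distance-normalized proximity" is a softmax over negative anchor distances; this normalization is an assumption for illustration, not the paper's exact formula:

```python
import numpy as np

def typicality(d_matched, d_all):
    """Distance-normalized proximity (sketch): closeness to the matched anchor
    relative to all HASL anchors, in (0, 1]. d_matched must appear in d_all."""
    d_all = np.asarray(d_all, dtype=float)
    return float(np.exp(-d_matched) / (np.exp(-d_all).sum() + 1e-9))
```

A sample sitting right on its matched anchor while far from all others scores near 1; a sample roughly equidistant from many anchors scores low.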
Aggregate inter-modal consistency quantifies agreement and reliability by combining the unimodal typicality scores with pairwise pseudo-label agreement across modalities, balanced by mixing hyperparameters. This consistency score weights fusion by both typicality and label agreement; empirical ablation indicates a −1.51% Acc-2 degradation upon its removal.
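A hedged sketch of such a score: mean unimodal typicality is mixed with the fraction of agreeing modality pairs via a single hypothetical hyperparameter `alpha` (the paper's combination rule and hyperparameters are not specified here):

```python
def consistency(typicalities, pseudo_labels, alpha=0.5):
    """Inter-modal consistency (sketch): mixes mean unimodal typicality with
    pairwise pseudo-label agreement. alpha is an assumed mixing weight."""
    n = len(pseudo_labels)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    agree = sum(pseudo_labels[i] == pseudo_labels[j] for i, j in pairs) / len(pairs)
    typ = sum(typicalities) / len(typicalities)
    return alpha * typ + (1 - alpha) * agree
```

Fully typical, fully agreeing samples get weight 1; conflicting modalities pull the score down even when each modality is individually typical.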
5. Stage-wise Perception and Loss Formulation
TiCAL models perception via three human-inspired stages:
- Early Perception (EP): Fast, coarse prediction.
- Correlative Integration (CI): Integrated multimodal inference.
- Advanced Cognition (AC): Unimodal predictions refined by typicality.
Each stage produces its own prediction, supervised via class-weighted cross-entropy.
Key loss functions:
- Unbiased Unimodal AC Loss: unimodal cross-entropy re-weighted by sample-level typicality.
- Dynamic Task Loss: multimodal task loss re-weighted by inter-modal consistency.
- Overall Training Loss: a weighted sum of the stage-wise losses and the HypCPCC regularizer.
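The shared pattern in these losses is a cross-entropy carrying both a per-class weight and a per-sample weight (typicality or consistency). A self-contained sketch, with the weighting scheme assumed rather than copied from the paper:

```python
import numpy as np

def weighted_ce(logits, labels, sample_weights, class_weights):
    """Class-weighted cross-entropy with per-sample re-weighting (sketch of
    the typicality/consistency weighting described above)."""
    logits = np.asarray(logits, dtype=float)
    # Numerically stable log-softmax.
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    nll = -log_probs[np.arange(len(labels)), labels]
    # Combine per-sample (typicality/consistency) and per-class weights.
    w = np.asarray(sample_weights) * np.asarray(class_weights)[labels]
    return float((w * nll).sum() / (w.sum() + 1e-9))
```

Low-consistency samples then contribute less gradient, which is the mechanism by which conflicting samples stop dominating training.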
The training pipeline begins with high-confidence HASL initialization during the early warmup epochs, then switches to dynamic pseudo labeling and consistency-weighted losses for the remaining epochs.
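The warmup-phase HASL population step can be sketched as a simple filter; the confidence threshold value and sample tuple layout are hypothetical, since the paper's threshold is not given here:

```python
def hasl_update(hasl, samples, conf_threshold=0.9):
    """Warmup-phase HASL population (sketch): keep samples whose unimodal
    prediction matches the ground-truth label with high confidence.

    samples: iterable of (embedding, pred_label, true_label, confidence).
    conf_threshold: assumed cutoff; the paper's value is not specified here.
    """
    for emb, pred, true, conf in samples:
        if pred == true and conf >= conf_threshold:
            hasl.append((emb, true))
    return hasl
```

After warmup the list is used read-only by the nearest-anchor pseudo-labeling step.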
6. Empirical Results and Benchmarking
TiCAL demonstrates performance gains on multiple benchmark datasets:
| Dataset | Metric | TiCAL | Baseline | Absolute Gain |
|---|---|---|---|---|
| MOSI | Acc-2 | 88.10% | DMD | +2.1% |
| MOSI | F1 | 88.09% | DMD | +2.0% |
| MOSI | Acc-7 | 46.79% | DMD | +1.2% |
| MOSEI | Acc-2 | 87.03% | DMD | |
| MOSEI | F1 | 87.05% | DMD | |
| MOSEI | Acc-7 | 55.23% | DMD | |
| MER2023 | F1 | 91.56% | Emotion-LLaMA | +1.2% |
Ablation on MOSI shows the Acc-2 drop from removing each component:
- Typicality weighting: −0.92%
- Consistency weighting: −1.51%
- Hyperbolic (vs. Euclidean) HASL: −0.86%
- HypCPCC regularizer: −2.07%
This suggests all design facets contribute cumulatively to robust recognition and conflict mitigation.
7. Interpretability, Implications, and Concluding Remarks
TiCAL's explicit modeling of sample-level typicality and inter-modal consistency yields dynamic adaptation to unreliable data, mitigating inter-modal conflicts pervasive in real-world MER. Hyperbolic embedding operationalizes emotion hierarchies, enhancing fine-grained discrimination. The staged fusion strategy aligns with cognitive models of human perception, producing interpretable outputs at each inference stage and offering transparent sample reliability metrics.
By integrating these mechanisms, TiCAL attains ~2.6% average gains (MOSI/MOSEI), sets SOTA F1 on MER2023, and provides diagnostic insight into both consistency and typicality (Yin et al., 19 Nov 2025). A plausible implication is that further generalization to additional modalities or more complex hierarchical emotion taxonomies could extend TiCAL’s advantage in diverse affective computing scenarios.