
TiCAL: Typicality-Based Multimodal Emotion Recognition

Updated 22 November 2025
  • The paper introduces TiCAL using pseudo unimodal labeling and hyperbolic embedding to enhance MER performance across benchmarks.
  • It implements dynamic sample-level typicality and consistency estimation to adaptively mitigate inter-modal emotional conflicts.
  • Ablation studies confirm that key components, including the HypCPCC loss, significantly contribute to the system's robustness and interpretability.

Typicality-based Consistent-aware Multimodal Emotion Recognition (TiCAL) is a discriminative learning framework for Multimodal Emotion Recognition (MER) that systematically addresses inter-modal emotional conflicts by quantifying sample-level typicality and consistency, grounded in hyperbolic representations of emotion hierarchies. Unlike conventional MER approaches that supervise with unified emotion labels, TiCAL introduces pseudo unimodal labeling, consistency-aware weighting, and hyperbolic feature embedding, yielding improved robustness, interpretability, and recognition accuracy across benchmarks.

1. Problem Definition and Motivation

MER integrates information from heterogeneous modalities—textual ($x_l$), visual ($x_v$), and acoustic ($x_a$)—to infer a unified emotion label $y$ ($y \in \{1, \dots, C\}$ for categorical emotion, or a sentiment score in $[-3, +3]$) (Yin et al., 19 Nov 2025). Prevailing methodologies commonly assume agreement among modalities within a sample, thereby supervising training with a single label. However, real-world data frequently contain inter-modal emotional conflicts, such as disagreeing prosodic and textual cues (e.g., angry voice, happy text). TiCAL explicitly models such conflicts by assessing unimodal typicality and inter-modal consistency, facilitating learning that self-adapts to sample reliability and ambiguity.

2. Pseudo Unimodal Label Generation

TiCAL maintains a High-confidence Anchor Samples List (HASL) $S_m=\{(\mathbf{f}_{m_i},y_i)\}_{i=1}^{n}$ per modality $m \in \{l,v,a\}$, populated during the initial $\lambda$ epochs by samples predicted correctly with confidence $u_{\hat y_i}>\theta$. After epoch $\lambda$, new samples are assigned a pseudo unimodal label $y_m^*$ according to the closest anchor feature in hyperbolic space:

$$d_m = \min_{\mathbf{f}_{m_j} \in S_m} d_{\mathbb{B}}(\mathbf{f}_m, \mathbf{f}_{m_j}), \quad y_m^* = y_{\arg\min d_m}$$

where $d_{\mathbb{B}}$ denotes hyperbolic distance on the Poincaré ball. These pseudo labels provide fine-grained unimodal supervision, supporting subsequent typicality and consistency estimation.
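The nearest-anchor rule is compact enough to sketch directly. Below is a minimal PyTorch version, assuming unimodal features already lie on the Poincaré ball; tensor names and shapes are illustrative, not taken from the paper's code:

```python
import torch

def poincare_distance(x, y, eps=1e-6):
    """Pairwise geodesic distance on the Poincare ball B^d = {x : ||x|| < 1}.

    x: (n, d), y: (m, d) -> (n, m) matrix of d_B(x_i, y_j).
    """
    x2 = (x * x).sum(-1, keepdim=True)            # (n, 1) squared norms
    y2 = (y * y).sum(-1, keepdim=True).T          # (1, m)
    sq = (x2 - 2 * x @ y.T + y2).clamp_min(0)     # ||x_i - y_j||^2
    denom = ((1 - x2) * (1 - y2)).clamp_min(eps)  # conformal factors
    return torch.acosh(1 + 2 * sq / denom + eps)

def assign_pseudo_labels(f_m, anchors, anchor_labels):
    """Nearest-anchor pseudo label y_m^* and distance d_m per sample.

    f_m: (n, d) features; anchors: (k, d) HASL features for modality m;
    anchor_labels: (k,) emotion labels of those anchors.
    """
    d = poincare_distance(f_m, anchors)  # (n, k) distances to anchors
    d_m, idx = d.min(dim=1)              # closest anchor per sample
    return anchor_labels[idx], d_m       # (y_m^*, d_m)
```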

3. Hyperbolic Embedding and Regularization

TiCAL embeds unimodal features $\mathbf{f}_m$ in the $d$-dimensional Poincaré ball $\mathbb{B}^d = \{x: \lVert x\rVert < 1\}$, capturing hierarchical emotion relations. Key operations include Möbius addition, the exponential map, and hyperbolic distance (see mathematical definitions in (Yin et al., 19 Nov 2025)). Feature regularization aligns embeddings to a tree-structured emotion hierarchy via the Hyperbolic Cophenetic Correlation Coefficient (HypCPCC) loss:

$$\mathcal{L}_{\mathrm{HypCPCC}} = -\mathrm{corr}\left(d_{\mathbb{T}}(y_i, y_j),\, d_{\mathbb{B}}(\mathbf{f}_i, \mathbf{f}_j)\right)$$

with $d_{\mathbb{T}}(y_i, y_j)$ measuring tree-path distance between emotion classes. In the MOSI ablation, removing this hyperbolic regularization costs 2.07% Acc-2, highlighting its empirical necessity.
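One plausible reading of the objective is a negative Pearson correlation computed over unique sample pairs in a batch. The sketch below assumes that reading, reuses `poincare_distance` from the previous sketch, and takes `tree_dist` to be a precomputed class-by-class tree-path distance matrix:

```python
import torch

def hypcpcc_loss(feats, labels, tree_dist, eps=1e-8):
    """L_HypCPCC = -corr(d_T(y_i, y_j), d_B(f_i, f_j)) over unique pairs.

    feats:     (n, d) embeddings on the Poincare ball.
    labels:    (n,)   integer emotion labels.
    tree_dist: (C, C) tree-path distances between emotion classes.
    """
    d_b = poincare_distance(feats, feats)        # (n, n) hyperbolic
    d_t = tree_dist[labels][:, labels].float()   # (n, n) tree-path
    iu = torch.triu_indices(len(feats), len(feats), offset=1)
    a, b = d_b[iu[0], iu[1]], d_t[iu[0], iu[1]]  # unique (i < j) pairs
    a, b = a - a.mean(), b - b.mean()            # center for Pearson corr
    return -(a * b).sum() / (a.norm() * b.norm() + eps)
```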

4. Dynamic Sample-level Typicality and Consistency Estimation

For each modality, TiCAL computes the typicality $\tau_m$ of pseudo label $y_m^*$ as distance-normalized proximity to the HASL anchors:

$$\tau_m = \frac{\max D_m - d_m}{\max D_m - \min D_m}, \quad D_m = \{d_m \text{ over the batch}\}$$

High $\tau_m$ signifies anchor-consistent, "typical" samples.
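This is a batch-wise min-max normalization; a one-function sketch, assuming `d_m` comes from the pseudo-labeling step above:

```python
def typicality(d_m, eps=1e-8):
    """tau_m = (max D_m - d_m) / (max D_m - min D_m), batch-wise.

    d_m: (n,) distances to the nearest HASL anchor for one modality.
    Returns tau_m in [0, 1]; 1 marks the most anchor-consistent sample.
    """
    d_max, d_min = d_m.max(), d_m.min()
    return (d_max - d_m) / (d_max - d_min + eps)  # eps guards a flat batch
```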

Aggregate inter-modal consistency $\kappa$ quantifies agreement and reliability as:

$$\kappa = \sqrt{(\tau_l\,\tau_v\,\tau_a)^t\,\exp(-k\,d_{\mathrm{label}})}$$

with

$$d_{\mathrm{label}} = \left(\frac{|y_l^*-\mu_{\mathrm{label}}| + |y_v^*-\mu_{\mathrm{label}}| + |y_a^*-\mu_{\mathrm{label}}|}{3}\right)^{\rho}, \quad \mu_{\mathrm{label}} = \frac{y_l^* + y_v^* + y_a^*}{3}$$

where $t, k, \rho$ are hyperparameters. $\kappa \in [0,1]$ weights fusion by both typicality and label agreement, with ablation indicating a 1.51% Acc-2 degradation upon removal.

5. Stage-wise Perception and Loss Formulation

TiCAL models perception via three human-inspired stages:

  • Early Perception (EP): Fast, coarse prediction.
  • Correlative Integration (CI): Integrated multimodal inference.
  • Advanced Cognition (AC): Unimodal predictions refined by typicality.

Each stage produces its own prediction, $\hat y_{\mathrm{EP}}$, $\hat y_{\mathrm{CI}}$, and $\hat y_{\mathrm{AC}}$, supervised via class-weighted cross-entropy.

Key loss functions:

  • Unbiased Unimodal AC Loss (τ\tau-weighting):

$$\mathcal{L}_{\mathrm{AC}} = \sum_{m\in\{l,v,a\}} \varphi(\tau_m)\,\mathcal{L}_{\mathrm{CE}}(\hat y_{\mathrm{AC}}^m, y), \quad \varphi(\tau)=\exp(1-\tau)$$

  • Dynamic Task Loss (κ\kappa-weighting):

$$\mathcal{L}_{\mathrm{task}} = \kappa\,\mathcal{L}_{\mathrm{EP}} + \mathcal{L}_{\mathrm{CI}} + (1-\kappa)\,\mathcal{L}_{\mathrm{AC}}$$

  • Overall Training Loss:

$$\mathcal{L}_{\mathrm{all}} = \mathcal{L}_{\mathrm{task}} + \lambda_{\mathrm{Hyp}}\,\mathcal{L}_{\mathrm{HypCPCC}}$$

Since $\mathcal{L}_{\mathrm{HypCPCC}}$ is defined as a negative correlation, minimizing $\mathcal{L}_{\mathrm{all}}$ maximizes the cophenetic correlation between tree-path and hyperbolic distances.

The training pipeline proceeds with high-confidence HASL initialization during early epochs ($e \leq \lambda$), followed by dynamic pseudo labeling and consistency-weighted loss for $e > \lambda$.
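Putting the pieces together, a hedged sketch of the per-batch loss after epoch $\lambda$; names and the $\lambda_{\mathrm{Hyp}}$ default are illustrative, and the `+` sign follows $\mathcal{L}_{\mathrm{HypCPCC}}$ being a negative correlation:

```python
import torch
import torch.nn.functional as F

def total_loss(logits_ep, logits_ci, logits_ac, y, tau, kappa,
               class_weights, l_hypcpcc, lam_hyp=0.1):
    """L_all = L_task + lam_hyp * L_HypCPCC with stage-wise weighting.

    logits_ep, logits_ci: (n, C) fused-stage logits.
    logits_ac: dict {'l','v','a'} -> (n, C) unimodal AC logits.
    tau: dict of (n,) typicalities; kappa: (n,) consistency weights.
    """
    ce = lambda z: F.cross_entropy(z, y, weight=class_weights,
                                   reduction='none')   # per-sample CE
    l_ep, l_ci = ce(logits_ep), ce(logits_ci)
    # Unbiased unimodal AC loss with per-sample weight phi(tau) = exp(1 - tau)
    l_ac = sum(torch.exp(1 - tau[m]) * ce(logits_ac[m]) for m in 'lva')
    # Dynamic task loss: kappa gates the EP vs. AC contributions
    l_task = (kappa * l_ep + l_ci + (1 - kappa) * l_ac).mean()
    return l_task + lam_hyp * l_hypcpcc
```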

6. Empirical Results and Benchmarking

TiCAL demonstrates performance gains on multiple benchmark datasets:

Dataset   Metric   TiCAL    Baseline        Absolute Gain
MOSI      Acc-2    88.10%   DMD             +2.1%
          F1       88.09%   DMD             +2.0%
          Acc-7    46.79%   DMD             +1.2%
MOSEI     Acc-2    87.03%   DMD             —
          F1       87.05%   DMD             —
          Acc-7    55.23%   DMD             —
MER2023   F1       91.56%   Emotion-LLaMA   +1.2%

Ablation on MOSI shows removal of:

  • Typicality ($\tau$): −0.92% Acc-2
  • Consistency ($\kappa$): −1.51% Acc-2
  • Hyperbolic (vs. Euclidean) HASL: −0.86% Acc-2
  • HypCPCC regularizer: −2.07% Acc-2

This suggests all design facets contribute cumulatively to robust recognition and conflict mitigation.

7. Interpretability, Implications, and Concluding Remarks

TiCAL's explicit modeling of sample-level typicality and inter-modal consistency yields dynamic adaptation to unreliable data, mitigating inter-modal conflicts pervasive in real-world MER. Hyperbolic embedding operationalizes emotion hierarchies, enhancing fine-grained discrimination. The staged fusion strategy aligns with cognitive models of human perception, producing interpretable outputs at each inference stage and offering transparent sample reliability metrics.

By integrating these mechanisms, TiCAL attains ~2.6% average gains (MOSI/MOSEI), sets SOTA F1 on MER2023, and provides diagnostic insight into both consistency and typicality (Yin et al., 19 Nov 2025). A plausible implication is that further generalization to additional modalities or more complex hierarchical emotion taxonomies could extend TiCAL’s advantage in diverse affective computing scenarios.
