
MicroEmo: Time-Sensitive Multimodal Emotion Recognition

Updated 12 March 2026
  • MicroEmo is a multimodal framework that leverages visual, acoustic, and linguistic signals to detect fleeting micro-expressions and temporal dialogue cues.
  • It employs a global-local attention visual encoder and an utterance-aware Q-Former to fuse fine-grained facial cues with conversational context.
  • Experiments on the explainable emotion recognition task report the highest average score among the compared systems, and ablation studies show that removing either architectural component substantially lowers performance.

MicroEmo refers to a class of time-sensitive multimodal emotion recognition frameworks that emphasize the detection and temporal modeling of micro-expression dynamics, the contextual dependencies of utterance segments, and explainability in open-vocabulary emotion recognition. These systems integrate visual, acoustic, and linguistic signals, explicitly modeling fleeting facial cues and exploiting the structure of spoken dialogue. The reference system, "MicroEmo: Time-Sensitive Multimodal Emotion Recognition with Micro-Expression Dynamics in Video Dialogues" (Zhang, 2024), reports gains over prior approaches in both performance and interpretability.

1. Motivation and Background

Multimodal LLMs (MLLMs) have demonstrated robust capabilities in emotion recognition tasks by leveraging heterogeneous inputs, including visual frames, audio tracks, and textual transcripts. However, prevailing approaches underutilize brief, subtle facial micro-expressions—cues that often persist for less than 0.5 seconds—and ignore the segmental structure of human conversation, specifically the alignment of emotion-relevant information to utterance-level temporal segments. MicroEmo addresses these fundamental limitations by introducing explicit mechanisms for capturing both fine-grained facial movements and the rich temporal dependencies anchored to conversational turns (Zhang, 2024).

2. System Architecture

MicroEmo introduces two key architectural innovations that advance the granularity and contextual awareness of emotion recognition:

2.1 Global-Local Attention Visual Encoder

  • Global Frame-Level Timestamp Features: Videos are sampled at $N$ uniformly distributed timepoints, $I_1, \ldots, I_N$. Each frame $I_t$ is passed through a pre-trained ViT (EVA-CLIP/G-14) yielding patch embeddings $F_t \in \mathbb{R}^{P \times D}$. Each frame’s embedding is prefixed with a textually encoded timestamp, enhancing temporal sensitivity.
  • Local Facial Micro-Expression Features: For each $I_t$, MediaPipe face-mesh generates 468 keypoints. Disjoint facial regions $R^k$ (e.g., brow, nasolabial) are defined, and binary region masks $\text{Mask}^k \in \{0,1\}^{P \times P}$ assign patch-to-region membership. A layer-wise region-aware self-attention module computes $L_t^{(\ell)}$ via masked softmax, yielding regional features pooled into local query slots $Q^{loc} \in \mathbb{R}^{M' \times D}$.
  • Global-Local Feature Fusion: Global ($Q^g$) and local ($Q^{loc}$) queries are concatenated and serve as inputs to a cross-attention module, integrating both whole-frame and micro-expression information into fused visual tokens $T_t$.
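
The region-aware masked softmax in the second bullet can be illustrated with a minimal sketch. It assumes that the mask simply blocks attention between patches belonging to different facial regions; the function names (`masked_softmax`, `region_attention`, `region_mask`), the toy shapes, and the use of raw embeddings as queries, keys, and values are illustrative assumptions, not the paper's implementation.

```python
import math

def masked_softmax(scores, mask_row):
    """Softmax over `scores`, restricted to positions where mask_row is 1."""
    allowed = [s for s, m in zip(scores, mask_row) if m]
    if not allowed:
        return [0.0] * len(scores)
    peak = max(allowed)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) if m else 0.0 for s, m in zip(scores, mask_row)]
    total = sum(exps)
    return [e / total for e in exps]

def region_attention(patches, mask):
    """Single-head self-attention where patch i attends to patch j only if mask[i][j] == 1.

    patches: list of P embeddings (each a list of D floats), used directly as
             queries, keys and values (an assumption made to keep the sketch short).
    mask:    P x P binary matrix encoding shared region membership.
    """
    dim = len(patches[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for i, query in enumerate(patches):
        scores = [scale * sum(a * b for a, b in zip(query, key)) for key in patches]
        weights = masked_softmax(scores, mask[i])
        out.append([sum(w * v[d] for w, v in zip(weights, patches)) for d in range(dim)])
    return out

def region_mask(region_ids):
    """Build the P x P mask from a per-patch region id (hypothetical labelling)."""
    return [[1 if a == b else 0 for b in region_ids] for a in region_ids]

# Toy example: 4 patches, two regions (0 = "brow", 1 = "mouth").
patches = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [4.0, 4.0]]
mask = region_mask([0, 0, 1, 1])
attended = region_attention(patches, mask)
```

In this toy setup each output row is a weighted average of patches from the same region only, so information from the "mouth" patches never leaks into the "brow" features.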

2.2 Utterance-Aware Video Q-Former

  • Utterance Segmentation and Alignment: The speech track is transcribed and temporally aligned, yielding utterances $U = \{ (u_1, \tau_1^s, \tau_1^e), \ldots, (u_K, \tau_K^s, \tau_K^e) \}$. Each utterance interval $[\tau_i^s, \tau_i^e]$ maps to a subset of frame indices $S_i$.
  • Segment-Level Token Generation: For every utterance segment $S_i$, corresponding fused visual tokens are aggregated, and learnable segment queries $Q_{seg}^i$ perform cross-attention, producing segment-level outputs $V_{seg}^i$.
  • Global and Multi-Scale Fusion: In parallel, a global Q-Former aggregates all frame-level tokens. All segment and global outputs are concatenated and further processed via self-attention to model inter-utterance dependencies, resulting in $V_{fused}$, the final multimodal token sequence for downstream language modeling.
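
The mapping from utterance intervals to frame subsets $S_i$ can be sketched as follows. This is a minimal illustration that assumes frames are timestamped at the midpoints of equal-width bins and that a frame belongs to an utterance when its timestamp falls inside the closed interval; the helper names `frame_timestamps` and `frames_per_utterance` and the example transcript are hypothetical.

```python
def frame_timestamps(duration, num_frames):
    """Midpoints of num_frames equal-width bins over [0, duration] (assumed sampling scheme)."""
    step = duration / num_frames
    return [(t + 0.5) * step for t in range(num_frames)]

def frames_per_utterance(utterances, timestamps):
    """Map each (text, start, end) utterance to the frame indices whose timestamp lies in [start, end]."""
    return [
        [idx for idx, ts in enumerate(timestamps) if start <= ts <= end]
        for _text, start, end in utterances
    ]

# Toy example: an 8-second clip sampled at 8 frames, with two aligned utterances.
timestamps = frame_timestamps(duration=8.0, num_frames=8)  # 0.5, 1.5, ..., 7.5
utterances = [("Oh great.", 0.0, 2.0), ("That is just what I needed.", 2.0, 6.0)]
segments = frames_per_utterance(utterances, timestamps)
```

Frames that fall outside every utterance (here the last two) are still available to the global Q-Former, which aggregates all frame-level tokens.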

3. Explainable Multimodal Emotion Recognition (EMER) Task

3.1 Task Setup

The EMER task is open-vocabulary and explainable: given video, audio, a transcript, and an instruction prompt, the system must generate a free-form emotion label (from an unconstrained lexicon) and a natural-language rationale, referencing the evidence from all modalities.

3.2 Multimodal-Language Integration

The compiled tokens from video ($V_{video}$), audio ($V_{audio}$), and the inquiry ($T_{Lq}$) are concatenated and prefixed with an instruction, prompting the LLM to produce:

    Emotion: [LABEL]
    Rationale: [EXPLANATION]

The training objective is cross-entropy minimization over the target sequence, driving the model to generate informative, evidence-based explanations without an auxiliary explanation-specific loss.
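
A minimal sketch of such a sequence-level cross-entropy is given below. The source states only that the loss is taken over the target sequence; masking out prompt positions so that only the generated label and rationale contribute is an assumption here (it is common practice in instruction tuning), and the function name `sequence_cross_entropy` and the toy vocabulary are hypothetical.

```python
import math

def sequence_cross_entropy(logits, token_ids, target_mask):
    """Mean negative log-likelihood over positions where target_mask is 1.

    logits:      per-position lists of vocabulary scores.
    token_ids:   the reference token id at each position.
    target_mask: 1 for response tokens (label and rationale), 0 for prompt tokens
                 (assumed masking scheme, not confirmed by the source).
    """
    total, count = 0.0, 0
    for scores, token, keep in zip(logits, token_ids, target_mask):
        if not keep:
            continue
        peak = max(scores)  # log-sum-exp with max subtraction for stability
        log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
        total += log_norm - scores[token]
        count += 1
    return total / count if count else 0.0

# Toy example: vocabulary of 4 tokens, one prompt position and two target positions.
logits = [[5.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
token_ids = [0, 2, 3]
target_mask = [0, 1, 1]
loss = sequence_cross_entropy(logits, token_ids, target_mask)
```

With uniform scores at both target positions the loss equals the log of the vocabulary size, which is the expected value for an uninformed model.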

3.3 Parameter-Efficient Adaptation

Parameter-efficient fine-tuning (LoRA) is applied to Q-Formers and LLM adapters, concentrating model capacity on achieving optimal multimodal–language alignment.
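
For readers unfamiliar with LoRA, the core idea is that a frozen weight matrix is augmented with a trainable low-rank update. The sketch below shows that general mechanism in plain Python; the rank, scaling factor, matrix names (`down`, `up`), and toy values are illustrative assumptions, since the source does not specify MicroEmo's LoRA configuration.

```python
def matmul(a, b):
    """Plain-Python matrix product of two nested-list matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_forward(x, frozen_weight, down, up, alpha):
    """Compute x @ W + (alpha / r) * (x @ down) @ up.

    frozen_weight: d_in x d_out pretrained matrix, never updated.
    down:          d_in x r trainable matrix.
    up:            r x d_out trainable matrix, zero at initialisation.
    """
    rank = len(up)
    base = matmul(x, frozen_weight)
    delta = matmul(matmul(x, down), up)
    scale = alpha / rank
    return [[b + scale * d for b, d in zip(b_row, d_row)] for b_row, d_row in zip(base, delta)]

# Toy example with d_in = d_out = 2 and rank r = 1.
x = [[1.0, 2.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
down = [[1.0], [1.0]]

at_init = lora_forward(x, identity, down, up=[[0.0, 0.0]], alpha=2.0)
after_update = lora_forward(x, identity, down, up=[[0.5, -0.5]], alpha=2.0)
```

Because `up` starts at zero, the adapted layer initially reproduces the frozen layer exactly, and only the small `down` and `up` matrices need gradients during fine-tuning.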

4. Comparative Analysis and Experimental Results

4.1 Quantitative Performance

On the EMER-Fine test split from the AffectGPT dataset, MicroEmo (using all modalities) reports the highest average score among the compared systems:

| Model | Accuracy_S | Recall_S | Avg |
|---|---|---|---|
| SALMONN + mPLUG-Owl | | | 65.15 |
| AffectGPT (zero-shot) | | | 61.75 |
| MicroEmo (ours) | 63.82 | 68.59 | 66.21 |

Ablation studies show that removing either architectural element lowers the average score by roughly 10 to 12 points, and removing both lowers it by 20 points:

| Model Variation | Avg |
|---|---|
| No utterance-aware Q-Former | 56.01 |
| No global-local encoder | 54.57 |
| Both ablated | 46.21 |

4.2 Qualitative Insights

MicroEmo distinguishes subtle emotion shifts that elude global-only models:

  • Subtle facial cues (e.g., micro-frown) are correctly mapped to nuanced emotions such as "frustration" rather than coarse categories like "anger."
  • Utterance-segmented tokens enable the model to map sarcastic prosody in context (e.g., “Oh great”) to “sarcasm,” where global tokens alone misattribute the emotion as "happy."
  • Rationales are grounded in observable behaviors (e.g., “micro-frown around 3.2 s”), supporting interpretability.

5. Comparison to Expansion Quantization Network (EQN) and Micro-Emotion Annotation

The Expansion Quantization Network (EQN) (Zhou et al., 2024) operates in the textual domain, targeting automatic micro-emotion annotation in discrete datasets such as GoEmotions. EQN maps binary annotations to continuous “energy intensity” scores via a two-stage process (CoEQN initialization and EQN regression), optimizing mean squared error between predicted and regressed intensities. This regression jointly captures inter-label dependencies, allowing rare micro-emotions to be detected at sub-threshold intensities and reducing label imbalance.

Whereas MicroEmo models micro-expression dynamics in video by temporal and spatial attention over facial cues, EQN encodes micro-emotion intensities with soft scoring in high-cardinality, label-rich textual data. Both frameworks address the challenge of underrepresented micro-emotions, but via complementary modalities and methodologies.

6. Significance and Applications

MicroEmo establishes a new baseline for explainable, high-resolution multimodal emotion recognition systems. By explicitly modeling micro-expressions with region masks and temporal windows anchored to utterance boundaries, the model surpasses uniform-frame and global-only baselines in detail and interpretability. This architecture is especially pertinent to tasks in affective computing, conversational AI, and high-fidelity emotion analytics in naturalistic video dialogue.

A plausible implication is that future systems adopting MicroEmo’s architectural template—region-specific visual attention, utterance-timestamped segmentation, and free-form explanation objectives—can generalize to other fine-grained, time-varying state detection domains (e.g., health monitoring, behavioral analysis).

7. Limitations and Future Directions

Despite demonstrable gains, MicroEmo’s reliance on accurate facial landmark detection (e.g., MediaPipe face-mesh) and high-quality speech-to-text alignment makes it fragile in noisy or occluded settings. Scalability to longer videos or multi-party interactions remains an open question. Further research on self-supervised pretraining for micro-expressions and end-to-end alignment of audio-visual-linguistic cues could address these limitations and support broader deployment of fine-grained, explainable affective models (Zhang, 2024).
