MicroEmo: Time-Sensitive Multimodal Emotion Recognition

Updated 12 March 2026

MicroEmo is a multimodal framework that leverages visual, acoustic, and linguistic signals to detect fleeting micro-expressions and temporal dialogue cues.
It employs a global-local attention visual encoder and an utterance-aware Q-Former to fuse fine-grained facial cues with conversational context.
Experimental results on the EMER task show improved accuracy and interpretability, and ablation studies show that removing either architectural component lowers the average score.

MicroEmo is a time-sensitive multimodal emotion recognition framework with three emphases: detecting and temporally modeling micro-expression dynamics, capturing the contextual dependencies of utterance segments, and producing explanations in open-vocabulary emotion recognition tasks. It integrates visual, acoustic, and linguistic signals, explicitly models fleeting facial cues, and exploits the structure of spoken dialogue to improve on prior approaches in both performance and interpretability. The architecture and empirical results are described in "MicroEmo: Time-Sensitive Multimodal Emotion Recognition with Micro-Expression Dynamics in Video Dialogues" (Zhang, 2024).

1. Motivation and Background

Multimodal large language models (MLLMs) have demonstrated robust capabilities in emotion recognition by leveraging heterogeneous inputs, including visual frames, audio tracks, and textual transcripts. However, prevailing approaches underuse brief, subtle facial micro-expressions (cues that often persist for less than 0.5 seconds) and ignore the segmental structure of human conversation, in particular the alignment of emotion-relevant information to utterance-level temporal segments. MicroEmo addresses both limitations with explicit mechanisms for capturing fine-grained facial movements and the temporal dependencies anchored to conversational turns (Zhang, 2024).

2. System Architecture

MicroEmo introduces two key architectural innovations that advance the granularity and contextual awareness of emotion recognition:

2.1 Global-Local Attention Visual Encoder

Global Frame-Level Timestamp Features: Videos are sampled at $N$ uniformly distributed timepoints, $I_1, \ldots, I_N$. Each frame $I_t$ is passed through a pre-trained ViT (EVA-CLIP/G-14), yielding patch embeddings $F_t \in \mathbb{R}^{P \times D}$. Each frame's embedding is prefixed with a textually encoded timestamp, enhancing temporal sensitivity.
Local Facial Micro-Expression Features: For each $I_t$, MediaPipe face-mesh generates 468 keypoints. Disjoint facial regions $R^k$ (e.g., brow, nasolabial) are defined, and binary region masks $\text{Mask}^k \in \{0,1\}^{P \times P}$ assign patch-to-region membership. A layer-wise region-aware self-attention module computes $L_t^{(\ell)}$ via masked softmax, yielding regional features pooled into local query slots $Q^{loc} \in \mathbb{R}^{M' \times D}$.
Global-Local Feature Fusion: Global ($Q^g$) and local ($Q^{loc}$) queries are concatenated and serve as inputs to a cross-attention module, integrating both whole-frame and micro-expression information into fused visual tokens $T_t$.
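The region-aware masked softmax described above can be illustrated with a small standard-library sketch. It is not the authors' implementation: the single attention head, the identity query/key/value projections, the helper names (`masked_softmax`, `region_attention`, `build_region_mask`), and the reading of the $P \times P$ mask as "patches $i$ and $j$ belong to the same region" are all assumptions made for illustration.

```python
import math

def masked_softmax(scores, mask):
    """Softmax over `scores`, restricted to positions where mask is 1."""
    allowed = [s for s, m in zip(scores, mask) if m]
    if not allowed:
        return [0.0] * len(scores)
    peak = max(allowed)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) if m else 0.0 for s, m in zip(scores, mask)]
    total = sum(exps)
    return [e / total for e in exps]

def region_attention(patches, region_mask):
    """Single-head self-attention in which patch i attends only to
    patches j with region_mask[i][j] == 1.

    Identity projections stand in for learned Q/K/V (assumption).
    """
    dim = len(patches[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for i, query in enumerate(patches):
        scores = [scale * sum(a * b for a, b in zip(query, key)) for key in patches]
        weights = masked_softmax(scores, region_mask[i])
        out.append([sum(w * v[d] for w, v in zip(weights, patches)) for d in range(dim)])
    return out

def build_region_mask(region_ids):
    """P x P binary mask: 1 when two patches share a region id (assumed reading)."""
    return [[1 if a == b else 0 for b in region_ids] for a in region_ids]

# Toy input: 4 patches of dimension 2, split into two hypothetical regions.
patches = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0], [0.1, 0.9]]
region_ids = [0, 0, 1, 1]  # e.g. 0 = brow, 1 = nasolabial
mask = build_region_mask(region_ids)
local_features = region_attention(patches, mask)
```

Because of the mask, each output row is a convex combination of patches from its own region only, which is what keeps a brow feature from being diluted by unrelated parts of the frame.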

2.2 Utterance-Aware Video Q-Former

Utterance Segmentation and Alignment: The speech track is transcribed and temporally aligned, yielding utterances $U = \{ (u_1, \tau_1^s, \tau_1^e), \ldots, (u_K, \tau_K^s, \tau_K^e) \}$. Each utterance interval $[\tau_i^s, \tau_i^e]$ maps to a subset of frame indices $S_i$.
Segment-Level Token Generation: For every utterance segment $S_i$, corresponding fused visual tokens are aggregated, and learnable segment queries $Q_{seg}^i$ perform cross-attention, producing segment-level outputs $V_{seg}^i$.
Global and Multi-Scale Fusion: In parallel, a global Q-Former aggregates all frame-level tokens. All segment and global outputs are concatenated and further processed via self-attention to model inter-utterance dependencies, resulting in $V_{fused}$, the final multimodal token sequence for downstream language modeling.
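The mapping from utterance intervals $[\tau_i^s, \tau_i^e]$ to frame index subsets $S_i$ can be sketched as follows. The function names, the choice of sampling frames at interval midpoints, and the inclusive boundary test are assumptions; the source only states that frames are uniformly sampled and that each interval maps to a subset of frame indices.

```python
def sample_timestamps(duration, n_frames):
    """N uniformly spaced timepoints over [0, duration].

    Midpoint sampling is an assumption made for this sketch.
    """
    step = duration / n_frames
    return [(i + 0.5) * step for i in range(n_frames)]

def frames_for_utterances(timestamps, utterances):
    """For each (text, start, end) utterance, return the indices of the
    sampled frames whose timestamp lies inside [start, end]."""
    return [
        [idx for idx, t in enumerate(timestamps) if start <= t <= end]
        for _text, start, end in utterances
    ]

# Toy example: an 8-second clip sampled at 8 timepoints (0.5, 1.5, ..., 7.5).
timestamps = sample_timestamps(duration=8.0, n_frames=8)
utterances = [("Oh great.", 0.0, 2.0), ("It broke again.", 2.0, 5.2)]
segments = frames_for_utterances(timestamps, utterances)
print(segments)  # [[0, 1], [2, 3, 4]]
```

Frames 5 to 7 fall outside both utterances here; in the described architecture they would still be seen by the global Q-Former, which aggregates all frame-level tokens.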

3. Explainable Multimodal Emotion Recognition (EMER) Task

3.1 Task Setup

The EMER task is open-vocabulary and explainable: given video, audio, a transcript, and an instruction prompt, the system must generate a free-form emotion label (from an unconstrained lexicon) and a natural-language rationale, referencing the evidence from all modalities.

3.2 Multimodal-Language Integration

The compiled tokens from video ($V_{video}$), audio ($V_{audio}$), and the inquiry ($T_{Lq}$) are concatenated and prefixed with an instruction, prompting the LLM to produce:

Emotion: [LABEL]
Rationale: [EXPLANATION]

The training objective is cross-entropy minimization over the target sequence, driving the model to generate informative, evidence-based explanations without an auxiliary explanation-specific loss.
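Written out, a cross-entropy objective of this kind takes the standard autoregressive form below. The symbols $y_1, \ldots, y_T$ (tokens of the target label-plus-rationale sequence) and $\theta$ (the trainable parameters) are notation introduced here for illustration, not taken from the source.

$$
\mathcal{L}_{CE} = -\sum_{t=1}^{T} \log p_\theta\left(y_t \mid y_{<t},\, V_{video},\, V_{audio},\, T_{Lq}\right)
$$

Since the label and the rationale are part of the same target sequence, a single loss of this form supervises both.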

3.3 Parameter-Efficient Adaptation

Parameter-efficient fine-tuning (LoRA) is applied to Q-Formers and LLM adapters, concentrating model capacity on achieving optimal multimodal–language alignment.
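As a reminder of what LoRA changes, the sketch below applies the usual low-rank update $W + \frac{\alpha}{r} BA$ to a frozen weight matrix. It illustrates the general technique only; the matrix sizes, the value of `alpha`, and the helper names are illustrative assumptions, and the source does not specify which ranks or layers MicroEmo adapts.

```python
def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_effective_weight(w_frozen, lora_b, lora_a, alpha):
    """Return W + (alpha / r) * B @ A, where r is the adapter rank.

    w_frozen: d_out x d_in (not trained); lora_b: d_out x r; lora_a: r x d_in.
    """
    rank = len(lora_a)
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)  # d_out x d_in, but of rank at most r
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(w_frozen, delta)
    ]

w = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
a = [[0.5, -0.5, 1.0]]          # rank-1 adapter, r = 1
b_init = [[0.0], [0.0], [0.0]]  # B starts at zero, so the adapter is a no-op
b_tuned = [[1.0], [0.0], [0.0]]

unchanged = lora_effective_weight(w, b_init, a, alpha=2)
adapted = lora_effective_weight(w, b_tuned, a, alpha=2)
```

Only `lora_a` and `lora_b` would be trained, which is $r(d_{in} + d_{out})$ values instead of $d_{in} \cdot d_{out}$; the saving becomes substantial at realistic layer sizes.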

4. Comparative Analysis and Experimental Results

4.1 Quantitative Performance

On the EMER-Fine test split from the AffectGPT dataset, MicroEmo reports the highest average score among the compared systems (using all modalities):

Model	Accuracy_S	Recall_S	Avg
SALMONN + mPLUG-Owl	–	–	65.15
AffectGPT (zero-shot)	–	–	61.75
MicroEmo (ours)	63.82	68.59	66.21
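The source does not define Accuracy_S and Recall_S. One plausible reading, assumed here, is that they are set-level overlap scores between the predicted and reference emotion label sets, with Avg as their arithmetic mean. The reported MicroEmo row is consistent with that last part: (63.82 + 68.59) / 2 = 66.205, which rounds to 66.21. The sketch below uses a hypothetical `set_level_scores` helper and made-up labels.

```python
def set_level_scores(predicted, reference):
    """Overlap scores between two label sets (assumed definitions).

    accuracy_s: share of predicted labels found in the reference set.
    recall_s:   share of reference labels recovered by the prediction.
    """
    pred, ref = set(predicted), set(reference)
    overlap = len(pred & ref)
    accuracy_s = overlap / len(pred) if pred else 0.0
    recall_s = overlap / len(ref) if ref else 0.0
    return accuracy_s, recall_s, (accuracy_s + recall_s) / 2

# Made-up labels: both predictions are correct, but two reference labels are missed.
acc, rec, avg = set_level_scores(
    ["frustration", "sadness"],
    ["frustration", "sadness", "anxiety", "fatigue"],
)

# Consistency check against the reported MicroEmo row.
reported_mean = (63.82 + 68.59) / 2
```

Open-vocabulary evaluation would also need to treat synonymous labels as matches before comparing sets; that step is omitted here.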

Ablation studies show that each architectural component contributes to the average score:

Model Variation	Avg
No utterance-aware Q-Former	56.01
No global-local encoder	54.57
Both ablated	46.21

4.2 Qualitative Insights

MicroEmo distinguishes subtle emotion shifts that global-only models miss:

Subtle facial cues (e.g., micro-frown) are correctly mapped to nuanced emotions such as “frustration” rather than coarse categories like “anger.”
Utterance-segmented tokens enable the model to map sarcastic prosody in context (e.g., “Oh great”) to “sarcasm,” where global tokens alone misattribute the emotion as “happy.”
Rationales are grounded in observable behaviors (e.g., “micro-frown around 3.2 s”), supporting interpretability.

5. Comparison to Expansion Quantization Network (EQN) and Micro-Emotion Annotation

The Expansion Quantization Network (EQN) (Zhou et al., 2024) operates in the textual domain, targeting automatic micro-emotion annotation in discrete datasets such as GoEmotions. EQN maps binary annotations to continuous “energy intensity” scores via a two-stage process (CoEQN initialization and EQN regression), optimizing mean squared error between predicted and regressed intensities. This regression jointly captures inter-label dependencies, allowing rare micro-emotions to be detected at sub-threshold intensities and reducing label imbalance.
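For a single sample, a mean-squared-error objective of the kind described can be written as below. The symbols are introduced here for illustration and are assumptions: $C$ is the number of emotion labels, $e_c$ the target intensity for label $c$, and $\hat{e}_c$ the predicted intensity. Averaging over labels is one common convention and may differ from the normalization EQN actually uses.

$$
\mathcal{L}_{MSE} = \frac{1}{C} \sum_{c=1}^{C} \left(\hat{e}_c - e_c\right)^2
$$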

Whereas MicroEmo models micro-expression dynamics in video by temporal and spatial attention over facial cues, EQN encodes micro-emotion intensities with soft scoring in high-cardinality, label-rich textual data. Both frameworks address the challenge of underrepresented micro-emotions, but via complementary modalities and methodologies.

6. Significance and Applications

MicroEmo establishes a new baseline for explainable, high-resolution multimodal emotion recognition systems. By explicitly modeling micro-expressions with region masks and temporal windows anchored to utterance boundaries, the model surpasses uniform-frame and global-only baselines in detail and interpretability. This architecture is especially pertinent to tasks in affective computing, conversational AI, and high-fidelity emotion analytics in naturalistic video dialogue.

A plausible implication is that future systems adopting MicroEmo’s architectural template—region-specific visual attention, utterance-timestamped segmentation, and free-form explanation objectives—can generalize to other fine-grained, time-varying state detection domains (e.g., health monitoring, behavioral analysis).

7. Limitations and Future Directions

Despite demonstrable gains, MicroEmo’s reliance on accurate facial landmark detection (e.g., MediaPipe face-mesh) and high-quality speech-to-text alignment introduces fragility in noisy or occluded settings. Scalability to longer videos or multi-party interactions remains an open question. Further research around self-supervised pretraining for micro-expressions and end-to-end alignment of audio-visual-linguistic cues could address these limitations and support broader deployment of fine-grained, explainable affective models (Zhang, 2024).

References

Zhang (2024). MicroEmo: Time-Sensitive Multimodal Emotion Recognition with Micro-Expression Dynamics in Video Dialogues.

Zhou et al. (2024). Expansion Quantization Network: An Efficient Micro-emotion Annotation and Detection Framework.
