
Uni-MER: Motion-Driven MER Dataset

Updated 17 November 2025
  • Uni-MER is a motion-driven dataset that aligns fine-grained facial motion with symbolic action units (AUs) for precise micro-expression recognition.
  • It uses a dual constraint method combining optical flow and AU consistency to effectively isolate subtle, transient facial expressions.
  • The dataset supports DEFT-LLM instruction tuning by providing grounded textual rationales and robust spatio-temporal feature integration.

Uni-MER is a motion-driven instruction dataset specifically constructed for micro-expression recognition (MER) that enforces fine-grained alignment between facial motion evidence and symbolic action unit (AU)/emotion labels. It provides the foundational data for expert disentanglement in the DEFT-LLM (Disentangled Expert Feature Tuning - LLM) architecture, which enables interpretable, cross-modal modeling of spatio-temporal facial dynamics for precise and transparent MER. Uni-MER’s dual constraint approach, leveraging both optical flow and AU label consistency, is unique in bridging the semantic gap inherent in existing MER datasets, thereby supplying physically grounded, motion-aligned textual supervision for instruction-finetuned multimodal LLMs (Zhang et al., 14 Nov 2025).

1. Rationale and Objectives

The Uni-MER dataset was created to address two core challenges in micro-expression recognition:

  1. Entanglement of static appearance and dynamic facial cues, which impedes models from isolating subtle, transient micro-expressions.
  2. Semantic misalignment between dataset textual labels and the underlying neuromuscular (AU) activity, which reduces the correspondence between ground truth annotations and physical motion.

Uni-MER aims to bridge this gap by directly encoding spatio-temporally grounded motion evidence for AUs, thereby enabling instruction-tuned models (notably DEFT-LLM) to associate local motion patterns with natural language descriptions and symbolic emotion labels (Zhang et al., 14 Nov 2025).

2. Data Sources and Preprocessing Pipeline

Uni-MER aggregates 8,041 video samples from eight leading MER datasets (CASME, CASME II, CAS(ME)², CAS(ME)³, SAMM, MMEW, DFME, 4DME). The core pipeline is as follows:

  • ROI Extraction: 468-point dense facial landmarks identify 17 anatomical regions of interest, mapped to individual FACS (Facial Action Coding System) AUs.
  • Optical Flow Computation:
    • For each video, raw optical flow $F_{\text{raw}}(p)$ is estimated between onset and offset frames for each facial point $p$.
    • Global head motion is measured using the centroid shift of the nose-tip landmarks, $F_{\text{nose\_tip}}$.
    • Motion compensation subtracts global head movement:

    $$F_{\text{comp}}(p) = \begin{cases} F_{\text{raw}}(p) - F_{\text{nose\_tip}} & \|F_{\text{raw}}(p)\| > \epsilon \\ F_{\text{raw}}(p) & \text{otherwise} \end{cases}$$

  • Per-ROI Motion Quantization: For each ROI $i$, motion evidence $E_i = (\theta_i, m_i)$ is defined by (see the code sketch after this list):

    • $\theta_i$: dominant flow direction, quantized into eight directional bins.
    • $m_i$: averaged flow magnitude, discretized into three categories (weak/medium/strong).

$$E_i = \left( \text{atan2}\!\left(\overline{F_y}, \overline{F_x}\right),\ \frac{1}{|P_{\text{top}}|} \sum_{\mathbf{p} \in P_{\text{top}}} \|F_{\text{comp}}(\mathbf{p})\| \right)$$

  • Dual Verification: Each sample includes a rationale $\mathcal{R}$ comprising:

    1. Forward: Each ground-truth AU label in $\mathcal{A}$ must be substantiated by consistent local motion evidence in its ROI.
    2. Backward: Any strong motion not matching an AU label is flagged as noise.
  • Final Instance Structure: Each instance is a triplet

$$(\mathcal{C},\ \mathbf{E},\ \mathcal{R})$$

where $\mathcal{C}$ = (AU set $\mathcal{A}$, emotion $e$), $\mathbf{E}$ collects the per-ROI motion features, and $\mathcal{R}$ is a rule-based textual explanation.
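
A minimal Python sketch of the motion-compensation and per-ROI quantization steps above is given below. It assumes optical flow between onset and offset frames has already been estimated; the names `compensate_flow`, `quantize_roi_motion`, `EPSILON`, and `TOP_K`, as well as the magnitude cut-offs, are illustrative assumptions rather than the released Uni-MER implementation.

```python
import numpy as np

EPSILON = 0.5  # assumed threshold separating genuine motion from noise (not from the paper)
TOP_K = 10     # assumed number of strongest points kept per ROI

def compensate_flow(flow_raw: np.ndarray, nose_tip_flow: np.ndarray) -> np.ndarray:
    """F_comp: subtract global head motion (nose-tip centroid shift) from points
    whose raw flow magnitude exceeds epsilon; leave the remaining points unchanged."""
    magnitudes = np.linalg.norm(flow_raw, axis=-1, keepdims=True)
    return np.where(magnitudes > EPSILON, flow_raw - nose_tip_flow, flow_raw)

def quantize_roi_motion(flow_comp: np.ndarray):
    """E_i = (theta_i, m_i): dominant direction in one of eight bins and an
    average magnitude discretized into weak/medium/strong."""
    mags = np.linalg.norm(flow_comp, axis=-1)
    top = flow_comp[np.argsort(mags)[-TOP_K:]]                    # P_top: strongest points in the ROI

    theta = np.arctan2(top[:, 1].mean(), top[:, 0].mean())        # atan2 of mean flow components
    direction_bin = int((theta + np.pi) / (2 * np.pi) * 8) % 8    # eight directional bins

    m = np.linalg.norm(top, axis=-1).mean()
    level = "weak" if m < 1.0 else ("medium" if m < 3.0 else "strong")  # placeholder cut-offs
    return direction_bin, level

# Example for one ROI with 50 landmark points (2-D flow vectors).
roi_flow = np.random.randn(50, 2)
nose_tip = np.array([0.2, -0.1])
evidence = quantize_roi_motion(compensate_flow(roi_flow, nose_tip))
```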

3. Dataset Statistics and Annotation Schema

  • Total Samples: 8,041
  • Emotion Classes: Eight (happy, disgust, contempt, surprise, fear, anger, sad, other), with a mild negative-emotion skew.
  • AU Coverage: Twelve core FACS AUs (e.g., AU4 brow lowerer appears in 33.3% of samples, AU9 in 2.4%), with a pronounced long-tail frequency distribution.
  • Region-of-interest Granularity: 17 anatomical segments, mapped to AU locations via FACS coding.
  • Rationale Descriptions: Each data point contains a deterministic, rule-based textual rationale, enabling models to explicitly justify label assignments.

| Attribute | Value/Description | Distribution |
| --- | --- | --- |
| # Samples | 8,041 | --- |
| # Emotions | 8 | Negative bias |
| # Core AUs | 12 | Long-tail |
| # ROIs | 17 | Anatomically mapped |

4. Alignment Verification and Motion-Text Correspondence

The dual-constraint rationale in Uni-MER enforces a bi-directional match between symbolic AUs and physical motion:

  • Forward Alignment: Validates that each AU in the provided label set $\mathcal{A}$ has a corresponding, semantically consistent direction and intensity in the local per-ROI optical flow.
  • Backward Alignment: Discovers all strong local motions that are not explained by any ground-truth AU label and marks them as evidence of annotation noise or error.

This protocol ensures that only instances with physically plausible AU-motion alignments and consistent natural language rationales are retained. It further supports interpretable model outputs, as textual rationales in Uni-MER are explicitly derived from quantized local motion features.
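
The bi-directional constraint can be made concrete with a small sketch. The mapping tables (`AU_TO_ROI`, `EXPECTED_DIRECTIONS`) and the helper `dual_verify` below are hypothetical stand-ins, assuming per-ROI evidence in the quantized form of Section 2; they are not the actual Uni-MER rule set.

```python
# Illustrative dual-verification check. AU_TO_ROI and EXPECTED_DIRECTIONS are
# toy placeholders (each AU mapped to one ROI and a set of plausible flow bins),
# not the published Uni-MER tables.
AU_TO_ROI = {"AU4": "brow_center", "AU12": "lip_corner_left"}
EXPECTED_DIRECTIONS = {"AU4": {2, 3}, "AU12": {0, 7}}

def dual_verify(au_labels, roi_evidence):
    """Forward: every labelled AU needs consistent local motion in its ROI.
    Backward: strong motion unexplained by any labelled AU is flagged as noise."""
    issues = []
    # Forward alignment: each AU must be substantiated by its ROI's evidence.
    for au in au_labels:
        direction, level = roi_evidence.get(AU_TO_ROI[au], (None, "weak"))
        if level == "weak" or direction not in EXPECTED_DIRECTIONS[au]:
            issues.append(f"forward: {au} lacks consistent motion evidence")
    # Backward alignment: strong motion in an ROI with no matching AU label.
    labelled_rois = {AU_TO_ROI[au] for au in au_labels}
    for roi, (direction, level) in roi_evidence.items():
        if level == "strong" and roi not in labelled_rois:
            issues.append(f"backward: unexplained strong motion in {roi}")
    return issues  # an empty list means the sample passes both constraints

print(dual_verify(["AU4"], {"brow_center": (2, "medium"), "cheek_left": (5, "strong")}))
```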

5. Role in DEFT-LLM Instruction Tuning and Cross-modal Fusion

Uni-MER constitutes the instruction-following, motion-aligned supervision for DEFT-LLM:

  • Structured Input: Each sample’s features, including motion evidence, are injected as prefix expert tokens into the multimodal LLM pipeline.
  • Instruction-tuned Learning: Triplets $(\mathcal{C},\, \mathbf{E},\, \mathcal{R})$ form the input for supervised finetuning, yielding models that output both interpretable emotion/AU labels and step-by-step natural language rationales grounded in verifiable physical motion (a hypothetical record layout is sketched after this list).
  • Fusion with Visual and Text Branches: The Uni-MER design supports tri-expert DEFT-LLM architectures where structure, temporal, and motion-semantics experts (all pretrained or frozen) contribute independent, disentangled signals for robust and interpretable MER prediction.
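
The record layout below is a hypothetical illustration of how one triplet $(\mathcal{C}, \mathbf{E}, \mathcal{R})$ might be serialized for instruction tuning; the field names and the instruction wording are assumptions, not the published schema.

```python
# Hypothetical shape of a single Uni-MER instruction-tuning record; field names
# and values are illustrative, not the published schema.
sample = {
    "condition": {                      # C: symbolic labels
        "action_units": ["AU4", "AU7"],
        "emotion": "disgust",
    },
    "motion_evidence": {                # E: per-ROI quantized motion features
        "brow_center": {"direction_bin": 2, "magnitude": "medium"},
        "eyelid_left": {"direction_bin": 6, "magnitude": "weak"},
    },
    "rationale": (                      # R: rule-based textual explanation
        "Brow-center flow points downward with medium strength, consistent with "
        "AU4 (brow lowerer); no unexplained strong motion remains."
    ),
    "instruction": (
        "Identify the action units and the emotion, and justify them from the "
        "listed motion evidence."
    ),
}
```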

6. Impact and Downstream Results

Integration of Uni-MER as the backbone dataset for DEFT-LLM has led to significant advancements:

  • State-of-the-art MER and AU Recognition: DEFT-LLM trained on Uni-MER achieves a cross-dataset UF1 of 45.20% in AU detection (compared with 40.5% for SSSNET and 37.2% for ResNet18), and outperforms general multimodal LLMs (Qwen-VL, Gemini) by more than 0.15 UF1 for MER (Zhang et al., 14 Nov 2025).
  • Interpretability: The explicit motion-language alignment enables transparent reasoning chains, where predicted emotions and AUs are always linked to reported local motion evidence and deterministic rationales.
  • Ablation Studies: Removing Uni-MER’s expert-based instruction or the discriminative calibration module causes notable performance degradation, confirming the unique contribution of this motion-driven, instruction-aligned dataset.

7. Limitations and Prospective Improvements

  • Long-tail AU/Emotion Coverage: Rare AUs remain underrepresented despite multi-corpus aggregation.
  • ROI Granularity: While 17 anatomical ROIs allow for detailed motion localization, very fine or non-standard facial movements could be missed.
  • Instructional Generalization: The degree of transfer to non-facial or out-of-distribution subtle expressions is untested.
  • Potential Extensions: Future directions include densifying rare AU samples, augmenting with synthetic micro-expressions, and extending the dual-verification protocol to broader affective phenomena.

Uni-MER thus represents a critical advance in motion-grounded, instruction-aligned dataset design for micro-expression recognition and enables interpretable, high-fidelity multimodal reasoning in LLM-based architectures (Zhang et al., 14 Nov 2025).
