EMERT: Eye-Behavior-Aided MER Transformer
- The paper introduces EMERT, a multimodal emotion recognition model that integrates facial expressions and eye behavior using adversarial feature decoupling.
- It employs a multitask Transformer architecture to fuse modality-specific and generic features, achieving superior performance on ER and FER benchmarks.
- The study highlights the critical role of eye movement signals in closing the emotion gap and outlines future directions for real-world applicability.
The Eye-behavior-aided MER Transformer (EMERT) is a multimodal emotion recognition architecture designed to bridge the gap between facial expression recognition (FER) and true emotion recognition (ER) through the integration and adversarial fusion of facial and eye behavior modalities. Developed alongside the Eye-behavior-aided Multimodal Emotion Recognition (EMER) dataset, EMERT combines modality-adversarial feature decoupling with a multitask Transformer, modeling eye movement and fixation patterns as critical affective signals that complement facial expressions, and achieves superior performance on a range of ER and FER benchmarks (Liu et al., 18 Dec 2025).
1. EMER Dataset and Multimodal Annotation
The EMER dataset provides the empirical foundation for EMERT by systematically pairing facial expression videos, eye movement sequences, and eye fixation maps under conditions designed to elicit spontaneous and genuine emotions.
- Stimulus-Induced Paradigm: 28 video clips, each carefully chosen by emotion experts from an initial pool of 115, spanning Ekman’s six basic emotions plus neutral, induce target affective states in 121 participants under controlled laboratory conditions.
- Multimodal Data: Simultaneous collection of 30 fps facial videos (390,900 frames), high-frequency (120 Hz) Tobii Pro Fusion eye movement sequences (1.91M samples), and frame-level eye fixation heatmaps (7.5 GB).
- Annotation Protocol:
- ER Labels: Participant self-report via the Self-Assessment Manikin (SAM), producing 3-class (positive/negative/neutral), 7-class (emotion categories), and continuous valence/arousal labels in [–1,1].
- FER Labels: Active-Learning Annotation (ALA) combining weak supervision from a pretrained EmotiEffNet, expert correction of disagreements, and EM-derived reliability weighting with consensus voting, yielding high-fidelity facial expression labels (Cronbach’s α > 0.97 for discrete categories); expression intensity labels in {0, 1, 2, 3} are also included.
- Significance: This dual-labeling strategy enables explicit measurement of the divergences (the “emotion gap”) between FER and ER, as well as their differential relationships to eye behaviors [(Liu et al., 18 Dec 2025), Fig. 7]. A minimal sketch of a sample record follows this list.
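For concreteness, the following is a minimal, hypothetical sketch of how one EMER sample and its dual ER/FER labels could be represented; the field names and array shapes are illustrative assumptions, not the dataset's actual schema.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class EMERSample:
    """Hypothetical container for one EMER recording (field names are illustrative)."""
    face_frames: np.ndarray    # (T_v, H, W, 3) facial video frames sampled at 30 fps
    eye_movements: np.ndarray  # (T_e, D) 120 Hz gaze samples (e.g., x, y, pupil diameter)
    eye_fixations: np.ndarray  # (T_g, H, W) frame-level fixation heatmaps
    # ER labels from participant self-report (SAM)
    er_class3: int             # 0 = negative, 1 = neutral, 2 = positive
    er_class7: int             # Ekman's six basic emotions + neutral
    er_valence: float          # continuous, in [-1, 1]
    er_arousal: float          # continuous, in [-1, 1]
    # FER labels from active-learning annotation (ALA)
    fer_class7: int            # apparent facial expression category
    fer_intensity: int         # expression intensity in {0, 1, 2, 3}
```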
2. EMERT Model Architecture
The EMERT pipeline is characterized by dedicated feature extraction from each modality, adversarial feature decoupling to isolate shared vs. modality-specific components, and a multitask Transformer-based fusion mechanism.
- Multimodal Feature Extraction (MFE):
- Video frames: a ResNet-50 backbone extracts per-frame facial features $F_f$.
- Eye movements: a 2-layer LSTM encodes the 120 Hz gaze sequences into $F_e$.
- Eye fixations: a second 2-layer LSTM encodes the fixation heatmap sequence into $F_g$.
- Modality-Adversarial Feature Decoupling (MAFD):
- Shared “emotion-generic” features: a common encoder maps each modality feature $F_m$ to a shared representation $S_m$ for $m \in \{f, e, g\}$.
- Modality-specific features: a dedicated MLP per modality produces unique features, e.g., $U_f = \mathrm{MLP}_f(F_f)$.
- Modality discriminator $D$: trained through a gradient reversal layer with an adversarial loss so that the shared features become modality-invariant.
- Emotion-Sensitive Multi-Task Transformer (EMT):
- Query: the concatenated emotion-generic features $[S_f; S_e; S_g]$.
- Keys/Values: the modality-unique features $[U_f; U_e; U_g]$.
- A standard Transformer encoder with 4 layers (8 attention heads, FFN hidden size 2048) yields the fused affective representation $Z$.
- Task Heads: two dedicated 2-layer MLP heads, one for ER and one for FER. A minimal architectural sketch follows this list.
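As a concrete illustration of the pipeline just described, here is a minimal PyTorch-style sketch, not the authors' implementation: the feature dimension (`d_model = 256`), the gaze/fixation input sizes, the projection layers, and the use of a `TransformerDecoder` stack as a stand-in for the generic-query/unique-key cross-attention are all assumptions.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class EMERTSketch(nn.Module):
    """Minimal sketch of the EMERT pipeline; dimensions are assumed, not the paper's."""
    def __init__(self, d_model: int = 256, gaze_dim: int = 4, fix_dim: int = 64,
                 n_er_classes: int = 7, n_fer_classes: int = 7):
        super().__init__()
        # --- Multimodal Feature Extraction (MFE) ---
        backbone = resnet50(weights=None)
        backbone.fc = nn.Identity()                      # per-frame 2048-d facial features
        self.face_net = backbone
        self.face_proj = nn.Linear(2048, d_model)
        self.eye_lstm = nn.LSTM(gaze_dim, d_model, num_layers=2, batch_first=True)
        self.fix_lstm = nn.LSTM(fix_dim, d_model, num_layers=2, batch_first=True)
        # --- Modality-Adversarial Feature Decoupling (MAFD) ---
        self.shared_enc = nn.Linear(d_model, d_model)    # emotion-generic features S_m
        self.unique_enc = nn.ModuleDict({                # modality-specific features U_m
            m: nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(),
                             nn.Linear(d_model, d_model))
            for m in ("face", "eye", "fix")})
        self.discriminator = nn.Linear(d_model, 3)       # trained through a GRL (see Sec. 3)
        # --- Emotion-Sensitive Multi-Task Transformer (EMT) ---
        # cross-attention: queries = generic features, keys/values = unique features
        # (default dropout of 0.1 matches the training protocol described below)
        self.fusion = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=8, dim_feedforward=2048,
                                       batch_first=True),
            num_layers=4)
        # --- Task heads: 2-layer MLPs for ER and FER ---
        self.er_head = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(),
                                     nn.Linear(d_model, n_er_classes))
        self.fer_head = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(),
                                      nn.Linear(d_model, n_fer_classes))

    def forward(self, frames, gaze, fixmaps):
        # frames: (B, T_v, 3, 224, 224); gaze: (B, T_e, gaze_dim); fixmaps: (B, T_g, fix_dim)
        B, Tv = frames.shape[:2]
        f_face = self.face_proj(self.face_net(frames.flatten(0, 1))).view(B, Tv, -1).mean(1)
        f_eye = self.eye_lstm(gaze)[0][:, -1]            # last LSTM output
        f_fix = self.fix_lstm(fixmaps)[0][:, -1]
        feats = {"face": f_face, "eye": f_eye, "fix": f_fix}
        generic = torch.stack([self.shared_enc(v) for v in feats.values()], dim=1)   # (B, 3, d)
        unique = torch.stack([self.unique_enc[m](v) for m, v in feats.items()], dim=1)
        fused = self.fusion(tgt=generic, memory=unique).mean(1)                      # (B, d)
        return self.er_head(fused), self.fer_head(fused), generic
```

A forward pass returns ER and FER logits together with the shared features that feed the modality discriminator during adversarial training (see Section 3).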
3. Mathematical Formulation
EMERT’s pipeline is formalized as follows (notation follows the architecture above):
- Input Embeddings & Positional Encodings: each modality feature is projected to a common dimension and augmented with positional encodings before entering the EMT.
- Adversarial Loss: the modality discriminator $D$ predicts the source modality of each shared feature $S_m$ through a gradient reversal layer (GRL), using a modality-classification cross-entropy of the standard form $\mathcal{L}_{\mathrm{adv}} = -\sum_{m \in \{f, e, g\}} \log D\big(\mathrm{GRL}(S_m)\big)_m$, whose minimization drives the shared features toward modality invariance (see the sketch after this list).
- Classification and Regression Losses: cross-entropy terms $\mathcal{L}_{\mathrm{ER}}$ and $\mathcal{L}_{\mathrm{FER}}$ for the 3-/7-class tasks, plus regression terms for valence/arousal and expression intensity.
- Total Objective: $\mathcal{L} = \mathcal{L}_{\mathrm{ER}} + \mathcal{L}_{\mathrm{FER}} + \lambda_{\mathrm{adv}}\,\mathcal{L}_{\mathrm{adv}}$, with weighting coefficients chosen to balance the task and adversarial terms.
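Under the same assumptions as the architectural sketch above, the gradient reversal layer and the loss combination could look as follows; using the GRL coefficient as the adversarial weight and summing the task terms unweighted are illustrative choices, not the paper's reported settings.

```python
import torch
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies the gradient by -lambda in the backward pass."""
    @staticmethod
    def forward(ctx, x, lam: float):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

def grl(x, lam: float = 1.0):
    return GradReverse.apply(x, lam)

def emert_loss(er_logits, fer_logits, generic, discriminator,
               er_target, fer_target, lambda_adv: float = 1.0):
    """Illustrative combination of task and adversarial terms (weights are assumptions)."""
    # generic: (B, 3, d) shared features for the face / eye-movement / fixation modalities
    B = generic.size(0)
    modality_labels = torch.arange(3, device=generic.device).repeat(B)  # 0=face, 1=eye, 2=fix
    d_logits = discriminator(grl(generic, lambda_adv).flatten(0, 1))    # (B*3, 3)
    loss_adv = F.cross_entropy(d_logits, modality_labels)               # adversarial via GRL
    loss_er = F.cross_entropy(er_logits, er_target)
    loss_fer = F.cross_entropy(fer_logits, fer_target)
    return loss_er + loss_fer + loss_adv
```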
4. Training Protocols and Experimental Setup
EMERT is optimized with AdamW, using an initial learning rate decayed under a cosine schedule, a batch size of 16, and roughly 100 training epochs. Each training sample comprises 8 video frames, 32 eye-movement timestamps, and 32 fixation maps; random temporal cropping is the only augmentation, and dropout of 0.1 is applied within the Transformer.
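A minimal sketch of this optimization setup is shown below; the initial learning rate of 1e-4 is an illustrative placeholder (the paper's value is not reproduced above), and the commented loop reuses the hypothetical model and loss from the earlier sketches.

```python
import torch

def build_optimizer(model, epochs: int = 100, lr: float = 1e-4):
    """AdamW with cosine learning-rate decay, as in the training protocol above.
    The initial lr of 1e-4 is an illustrative placeholder, not the reported value."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
    return optimizer, scheduler

# Training loop outline (batch size 16; 8 frames, 32 eye timestamps, 32 fixation maps per sample):
# for epoch in range(100):
#     for frames, gaze, fixmaps, er_y, fer_y in loader:
#         er_logits, fer_logits, generic = model(frames, gaze, fixmaps)
#         loss = emert_loss(er_logits, fer_logits, generic, model.discriminator, er_y, fer_y)
#         loss.backward(); optimizer.step(); optimizer.zero_grad()
#     scheduler.step()
```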
Seven distinct evaluation protocols are implemented:
- 3-/7-class ER and FER classification, reported as weighted and unweighted average recall (WAR, UAR) and F1 (see the metric sketch after this list).
- Valence/arousal regression (MAE, MSE, RMSE) for both ER and FER.
- Expression intensity regression for FER.
- Modality ablations (F: facial; E: eye movement; G: gaze/fixation; all combinations).
- Robustness to additive Gaussian noise.
- Annotation consistency (Cronbach’s α).
- Cross-dataset generalization onto SIMS [(Liu et al., 18 Dec 2025); Tables II–V, VII–X].
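For reference, WAR (weighted average recall, i.e., overall accuracy) and UAR (unweighted average recall, i.e., mean per-class recall) can be computed as in the sketch below; these are the standard definitions rather than code from the paper.

```python
import numpy as np

def war_uar(y_true: np.ndarray, y_pred: np.ndarray, n_classes: int):
    """WAR = overall accuracy; UAR = macro-averaged (mean per-class) recall."""
    war = float((y_true == y_pred).mean())
    recalls = []
    for c in range(n_classes):
        mask = y_true == c
        if mask.any():
            recalls.append(float((y_pred[mask] == c).mean()))
    uar = float(np.mean(recalls))
    return war, uar

# Example: y_true = [0, 0, 1, 2], y_pred = [0, 1, 1, 2]
# -> WAR = 0.75, UAR = mean(0.5, 1.0, 1.0) ≈ 0.83
```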
5. Performance and Ablation Analyses
EMERT consistently outperforms state-of-the-art multimodal baselines (Self_MM, MulT, LMF, TMT, NORM-TR) across all core tasks.
| Task / Metric | EMERT (Best) | Best Baseline | Improvement |
|---|---|---|---|
| 7-class ER (WAR/UAR/F1) | 33.92 / 28.17 / 30.38 | up to 32.63 / 28.17 / 29.50 | 0.2–2.0 points |
| 7-class FER (WAR/UAR/F1) | 51.18 / 33.04 / 43.33 | 50.80 / 32.63 / 43.32 | up to 0.38 points |
| FER Intensity (MAE/MSE) | 0.660 / 0.673 | 0.666 / 0.685 | –0.006 / –0.012 |
| Cross-dataset (SIMS, Acc-5) | up to +23.0% over MulT | – | +23.0 |
- Modality Complementarity: The addition of eye movement features increases WAR by 1.3–3.2 points over face-only baselines.
- Module Ablation: Removing either MAFD or EMT reduces WAR by 1.8–2.3%; the full EMERT yields an aggregate gain of +2.7%.
- Multi-tasking Synergy: Simultaneous ER/FER objectives provide mutual regularization, improving both task-specific scores (ER head boosts FER WAR +0.37%; FER head improves ER UAR +5.9%).
- Annotation Consistency: Cronbach’s α > 0.97 for categorical labels.
- Noise Robustness: EMERT demonstrates higher resilience to Gaussian noise than benchmarks.
6. Theoretical Motivation and Insights
The design of EMERT is motivated by fundamental psychological findings and empirical observations:
- Facial expressions are frequently used as social displays, not necessarily correlating with genuine emotional state, contributing to the “emotion gap” between FER and ER.
- Eye behavior—specifically, gaze patterns, pupil dilation, and fixation distributions—correlate more consistently (Pearson r ≈ 0.4–0.6) with ER, versus their correlation with FER (r ≈ 0.2–0.4), and are less amenable to voluntary control or social masking [(Liu et al., 18 Dec 2025), Fig. 8].
- The adversarial decoupling strategy in MAFD explicitly suppresses modality-specific “camouflage” features, resulting in more robust and authentic affective representations.
- The EMT’s approach of querying with emotion-generic features and keying with modality-unique components achieves more effective crossmodal integration than naïve feature concatenation or unimodal queries.
7. Limitations and Future Directions
EMERT’s reliance on high-precision, laboratory-grade eye tracking (Tobii Pro Fusion) limits its current applicability in unconstrained, real-world settings. Prospective research will focus on:
- Affordable webcam-based gaze estimation.
- Extension to additional modalities (audio, text).
- Integration with large-scale multimodal foundation models (e.g., CLIP variants).
- Evaluation and adaptation for ecologically valid, in-the-wild contexts [(Liu et al., 18 Dec 2025), Sec. VI].
Continued exploration and public release of EMER and EMERT provide a resource for advancing robust multimodal emotion recognition and for clarifying the distinctive contributions of eye behavior in affective computing (Liu et al., 18 Dec 2025).