EMER: Eye-Behavior Multimodal Emotion Dataset

Updated 25 December 2025
  • The EMER dataset is a multimodal resource integrating facial expressions with eye behavior to capture genuine emotional cues beyond conventional FER.
  • It combines high-resolution facial video with precise eye tracking, including gaze, saccades, and pupil responses, enabling rigorous multi-view annotation.
  • Paired with the EMERT transformer baseline, the dataset supports robust multi-task learning and benchmarking for comprehensive emotion recognition.

Eye-behavior-aided Multimodal Emotion Recognition (EMER) datasets provide an integrative resource for advancing emotion recognition (ER) by modeling both facial expressions and eye behaviors. These datasets address the inherent limitations of facial expression recognition (FER)—which often fails to capture genuine emotional states due to its susceptibility to social masking—by incorporating dynamic ocular signals (e.g., gaze, saccades, pupil response) as objective correlates of emotion. EMER datasets enable rigorous multi-view annotation strategies, support multimodal machine learning, and promote benchmarking of robust model architectures that explicitly bridge the gap between observed expressions and actual internal affect.

1. Dataset Motivation and Paradigm

The prevailing reliance on FER in affective computing overlooks the divergence between socially performed expressions and authentic emotion (Liu et al., 2024). Eye behaviors, including micro-movements, fixations, and pupil dynamics, provide spontaneous emotional cues less subject to voluntary control and thus yield a more accurate reflection of internal affective states (Liu et al., 18 Dec 2025). EMER datasets systematically integrate these ocular modalities with facial video, enabling analyses that rigorously disentangle perceived emotion (as read from faces) from actual internal emotion (as indexed by eye/oculomotor signals).

To collect naturalistic and genuine emotion, EMER datasets employ spontaneous emotion-induction protocols. For instance, in (Liu et al., 18 Dec 2025), participants are exposed to carefully curated film clips (covering Ekman's six basic emotions and neutral), with their facial expressions, eye movements, and fixation heatmaps continuously recorded. This paradigm ensures spontaneous, rather than acted, behavioral data.

2. Data Collection Protocols and Modalities

EMER datasets are characterized by precise multimodal acquisition:

  • Facial Video: High-resolution (e.g., 1080p at 30 fps) recording synchronized with stimulus onset, enabling frame-wise modeling. Pre-processing includes illumination normalization and landmark-based face alignment via methods such as MTCNN.
  • Eye Tracking: Devices such as the Tobii Pro Fusion (up to 250 Hz) record timestamped gaze coordinates, saccade/fixation events, and pupil diameter. Correction protocols (e.g., interpolation across blinks of 75–425 ms duration, sweep correction, pupil differencing) yield artifact-minimized time series. Eye-fixation maps are generated per frame, resulting in two-dimensional spatial heatmaps of fixation density (a minimal heatmap-generation sketch follows this list).
  • Demographics and Context: Participants are drawn from diverse backgrounds (e.g., N=121, ages 18–40, varied majors). Experimental settings are controlled (quiet lab), and hardware timestamping ensures modality synchronization.
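The per-frame fixation heatmaps described above can, in principle, be reproduced by accumulating gaze samples into a 2D grid and applying spatial smoothing. The following is a minimal sketch, assuming gaze coordinates are already mapped to stimulus pixel space; the grid size, Gaussian width, and function name are illustrative choices rather than the dataset's released pipeline.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def fixation_heatmap(gaze_xy, frame_size=(1080, 1920), sigma=40.0):
    """Accumulate gaze samples into a smoothed fixation-density map.

    gaze_xy:    array of shape (T, 2) with columns (x, y) in pixels;
                NaN rows (e.g., blinks) are skipped.
    frame_size: (height, width) of the stimulus frame.
    Returns a 2D array of shape frame_size, normalized to sum to 1.
    """
    h, w = frame_size
    heat = np.zeros((h, w), dtype=np.float32)
    valid = ~np.isnan(gaze_xy).any(axis=1)
    xs = np.clip(gaze_xy[valid, 0].astype(int), 0, w - 1)
    ys = np.clip(gaze_xy[valid, 1].astype(int), 0, h - 1)
    np.add.at(heat, (ys, xs), 1.0)             # per-pixel gaze histogram
    heat = gaussian_filter(heat, sigma=sigma)  # spatial smoothing
    total = heat.sum()
    return heat / total if total > 0 else heat
```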

In related variants such as (Seikavandi et al., 19 Sep 2025), additional dimensions such as environmental sensors and personality trait vectors (BFI-44 Big Five scores) are included, with eye-gaze sequences interpolated to fixed temporal grids for modeling consistency.
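Resampling irregular eye-tracker streams onto a fixed temporal grid can be done with simple linear interpolation. A minimal sketch, assuming timestamped gaze/pupil samples; the 32-step target mirrors the sequence length used later in this article, and the function name is hypothetical.

```python
import numpy as np

def resample_to_grid(timestamps, values, n_steps=32):
    """Linearly interpolate an irregularly sampled signal onto n_steps evenly spaced points.

    timestamps: shape (T,), monotonically increasing times (seconds).
    values:     shape (T, D), e.g., columns (gaze_x, gaze_y, pupil_diameter).
    Returns an array of shape (n_steps, D).
    """
    grid = np.linspace(timestamps[0], timestamps[-1], n_steps)
    return np.stack(
        [np.interp(grid, timestamps, values[:, d]) for d in range(values.shape[1])],
        axis=1,
    )
```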

3. Annotation Schemes and Label Taxonomies

EMER datasets provide rich, multi-view annotation with rigorous reliability assessment:

  • ER Labels: Participants self-report experienced affect using standardized tools (e.g., the Self-Assessment Manikin), yielding valence and arousal on continuous scales, with categorization into discrete classes (positive/neutral/negative, and the six basic emotions plus neutral: happiness, sadness, fear, surprise, disgust, anger, neutral).
  • FER Labels: To illustrate and analyze the FER–ER gap, separate annotations of expressed emotion are produced. An active learning-based annotation (ALA) protocol first applies automated FER (e.g., EmotiEffNet), followed by expert cross-validation and expectation–maximization (EM)-based reliability weighting. The final label for sample $j$, $f_j$, is obtained as

$$f_j = \arg\max_c \sum_{i:\, t_j^{(i)} = c} \alpha_i,$$

where the $t_j^{(i)}$ are annotator/model entries and the $\alpha_i$ are reliability weights learned via EM; a minimal aggregation sketch follows this list.

  • Inter-Rater Reliability: Assessed by Cronbach's $\alpha$ (e.g., 0.978 for discrete labels with ALA versus 0.784 for experts only), demonstrating high consensus under the ALA protocol.
  • Additional Labels: In listener-perspective datasets (Seikavandi et al., 19 Sep 2025), both perceived and felt valence/arousal are collected post-stimulus, with labels binned for classification and stored for regression.
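The reliability-weighted vote in the ALA formula above reduces to an argmax over weighted counts. A minimal sketch, assuming the per-annotator weights $\alpha_i$ have already been estimated via EM; the function name and the toy reliabilities are hypothetical.

```python
def aggregate_label(annotations, alpha, classes):
    """Reliability-weighted vote: f_j = argmax_c sum_{i: t_j^(i) = c} alpha_i.

    annotations: labels t_j^(i) for one sample, one entry per annotator/model i.
    alpha:       reliability weights alpha_i, in the same order.
    classes:     all admissible class labels.
    """
    scores = {c: 0.0 for c in classes}
    for label, weight in zip(annotations, alpha):
        scores[label] += weight
    return max(scores, key=scores.get)

# Toy example with three annotators and hypothetical EM-learned reliabilities.
classes = ["happiness", "sadness", "neutral"]
print(aggregate_label(["neutral", "happiness", "neutral"], [0.9, 0.7, 0.4], classes))
# -> "neutral" (weighted score 1.3 vs. 0.7 for "happiness")
```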

4. Dataset Structure, Statistics, and Splits

EMER datasets are organized for direct use in machine learning research:

  • Each sample aggregates: facial video (8 frames per sample for modeling), an eye-movement sequence (32 time steps), and a fixation-map sequence (32 spatial heatmaps). Label sets include discrete (3-class, 7-class) and continuous (valence/arousal ∈ [–1,1]) values for both ER and FER; a minimal loading sketch follows the table below.
  • The (Liu et al., 18 Dec 2025) dataset comprises 1,303 high-quality samples post-cleaning, derived from an initial pool (1.91M eye samples, 390,900 video frames), with class distributions for both ER and FER highlighting domain-relevant imbalances (e.g., “neutral” dominance in FER).
  • Five-fold cross-validation ensures that on each split approximately 1,043 samples are used for training and 260 for testing.
  • Seven benchmark protocols—ranging from coarse/fine-grained classification to regression of continuous affect and expression intensity—enable multi-objective evaluation.
| Modality | Temporal Structure | Key Features |
|---|---|---|
| Facial video | 8 frames | Aligned, normalized face crops |
| Eye-movement sequence | 32 time steps | Gaze point, saccade, pupil diameter |
| Fixation heatmaps | 32 frames | 2D spatial fixation densities |
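The per-sample structure summarized in the table maps naturally onto a dataset wrapper for training. A minimal sketch in PyTorch, assuming tensors have already been extracted to the shapes listed above; the class name, field names, and storage layout are assumptions, not the released loader.

```python
from torch.utils.data import Dataset

class EMERSamples(Dataset):
    """Wraps pre-extracted EMER samples: 8 face frames, a 32-step eye-movement
    sequence, 32 fixation heatmaps, and ER/FER labels (discrete and continuous)."""

    def __init__(self, faces, eye_seqs, fix_maps, er_labels, fer_labels, va):
        # faces:      (N, 8, 3, H, W) aligned, normalized face crops
        # eye_seqs:   (N, 32, D)      gaze point, saccade, pupil diameter per step
        # fix_maps:   (N, 32, h, w)   2D fixation-density heatmaps
        # er_labels / fer_labels: (N,) discrete class indices
        # va:         (N, 2)          valence/arousal in [-1, 1]
        self.faces, self.eye_seqs, self.fix_maps = faces, eye_seqs, fix_maps
        self.er_labels, self.fer_labels, self.va = er_labels, fer_labels, va

    def __len__(self):
        return self.faces.shape[0]

    def __getitem__(self, idx):
        return {
            "face": self.faces[idx],
            "eye": self.eye_seqs[idx],
            "fixmap": self.fix_maps[idx],
            "er": self.er_labels[idx],
            "fer": self.fer_labels[idx],
            "va": self.va[idx],
        }
```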

5. Machine Learning Benchmarks: The EMERT Model

To fully leverage EMER data, the Eye-behavior-aided MER Transformer (EMERT) provides a multitask architecture optimized for both ER and FER (Liu et al., 18 Dec 2025):

The adversarial objective trains a modality discriminator $D_m$ on the fused common feature $F_C^{(j)}$ over a batch of size $N_b$, where $\mathrm{modality}_j$ denotes the source modality of the $j$-th feature:

$$\mathcal{L}_{\mathrm{adv}} = -\frac{1}{N_b}\sum_{j=1}^{N_b}\sum_{m} \mathbf{1}[\mathrm{modality}_j = m]\,\log D_m\bigl(F_C^{(j)}\bigr).$$

  • Emotion-sensitive Multi-task Transformer (EMT): Fuses features via Transformer attention, with distinct output heads for ER and FER. Multi-task loss combines adversarial objective and cross-entropy/Huber losses for ER and FER heads, yielding

$$\mathcal{L} = \alpha\,\mathcal{L}_{\mathrm{adv}} + \beta\,(\mathcal{L}_e + \mathcal{L}_f)$$

with optimal $\alpha = 0.3$, $\beta = 0.1$; a minimal sketch of this combined objective follows this list.

  • Training Regimen: Implemented in PyTorch, using the Adam optimizer (initial learning rate $1\times 10^{-4}$ with cosine decay), batch size 16, and 30–50 epochs with early stopping.
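The combined objective above can be sketched directly from the stated loss terms. A minimal sketch, assuming separate discriminator, ER, and FER heads already produce logits; tensor names and shapes are assumptions, and the gradient-reversal mechanism typical of adversarial training is omitted. Cross-entropy for the discrete heads and a Huber loss for valence/arousal follow the description in the text.

```python
import torch.nn.functional as F

def emert_objective(disc_logits, modality_ids,
                    er_logits, er_target, fer_logits, fer_target,
                    va_pred, va_target, alpha=0.3, beta=0.1):
    """Sketch of L = alpha * L_adv + beta * (L_e + L_f).

    disc_logits:  (N_b, M) modality-discriminator outputs on common features F_C
    modality_ids: (N_b,)   true source modality per feature (for L_adv)
    er_logits / fer_logits: classification-head logits; va_pred: (N_b, 2) regression head
    """
    # L_adv: negative log-likelihood of the true modality under the discriminator,
    # matching the indicator-sum formula; a full adversarial setup would also
    # reverse this gradient for the feature extractor.
    l_adv = F.cross_entropy(disc_logits, modality_ids)

    # ER head: discrete cross-entropy plus Huber loss on continuous valence/arousal.
    l_e = F.cross_entropy(er_logits, er_target) + F.huber_loss(va_pred, va_target)

    # FER head: discrete cross-entropy.
    l_f = F.cross_entropy(fer_logits, fer_target)

    return alpha * l_adv + beta * (l_e + l_f)
```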

6. Evaluation Protocols and Empirical Results

Seven benchmark protocols test categorical and continuous ER/FER, and expression intensity. Principal results (Liu et al., 18 Dec 2025):

  • ER Classification (3- and 7-class): EMERT achieves WAR 59.28/33.92, UAR 52.62/28.17, and F1 55.71/30.38 for 3-/7-class ER, outperforming prior state-of-the-art methods on WAR and UAR (see the table below; a metric-computation sketch follows it).
  • FER Classification (3- and 7-class): WAR 68.10/51.18, UAR 56.91/33.04, F1 63.73/43.33.
  • Valence/Arousal Regression: ER MAE/MSE/RMSE of 0.365/0.217/0.456 (arousal) and 0.433/0.279/0.519 (valence); FER results are consistently better than baselines.
  • Ablation and Robustness: Adding ocular features to facial-only models yields consistent performance gains (e.g., a 2.68-point WAR improvement in 7-class ER). EMERT shows less than 2.3% degradation under strong Gaussian noise, outperforming other multimodal methods.
| Task | Metric | EMERT | Best SOTA |
|---|---|---|---|
| ER (3-class) | WAR | 59.28 | 59.13 |
| ER (3-class) | UAR | 52.62 | 49.36 |
| FER (3-class) | WAR | 68.10 | 67.24 |
| FER (3-class) | F1 | 63.73 | 63.69 |
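The reported metrics can be computed with standard tooling: WAR (weighted average recall) is equivalent to overall accuracy, UAR (unweighted average recall) is the mean per-class recall, and F1 is shown here as a macro average (the paper's exact averaging choice may differ). A minimal sketch using scikit-learn; the toy labels are illustrative.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, recall_score

def war_uar_f1(y_true, y_pred):
    """WAR = overall accuracy; UAR = macro-averaged recall; F1 = macro-averaged F1."""
    war = accuracy_score(y_true, y_pred)
    uar = recall_score(y_true, y_pred, average="macro")
    f1 = f1_score(y_true, y_pred, average="macro")
    return war, uar, f1

# Toy 3-class example (0 = negative, 1 = neutral, 2 = positive).
y_true = np.array([0, 1, 2, 2, 1, 0])
y_pred = np.array([0, 1, 2, 1, 1, 0])
print(war_uar_f1(y_true, y_pred))  # (0.833..., 0.833..., 0.822...)
```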

7. Interpretative Analysis and Impact

Ocular modalities provide both complementary and synergistic information to facial expressions. Feature ablation reveals that pupil diameter and gaze points are the most informative for ER, supporting the hypothesis that involuntary physiological signals are strong correlates of affective state (Liu et al., 18 Dec 2025). Attention-map visualization distinguishes the ER head’s focus on fine facial details (eye/mouth corners) from the FER head’s emphasis on global facial regions.

Multi-task modeling produces synergistic benefits: optimizing for both ER and FER in a unified framework results in improved ER performance (single-task UAR=46.68 vs. multi-task UAR=52.62). The EMER dataset, coupled with EMERT, establishes a new state-of-the-art benchmark for multimodal emotion recognition and provides a publicly available resource for further methodological advances in the field. The availability of fine-grained, multimodal, and reliably annotated data is expected to catalyze research in robust, ecologically valid, and personalized affective computing (Liu et al., 18 Dec 2025, Seikavandi et al., 19 Sep 2025).
