
Continuous Micro-Expression Intensity Estimation

Updated 7 December 2025
  • Continuous micro-expression intensity estimation is the quantification of brief, spontaneous facial movements as real-valued, per-frame intensity trajectories, measured from facial EMG signals or predicted by deep neural networks.
  • It overcomes sparse annotations by leveraging weakly supervised pseudo-labels and interpolated triangular intensity curves to model onset, apex, and offset phases.
  • State-of-the-art models combine spatial encoders, temporal GRU networks, and specialized loss functions, achieving high correlation metrics and advancing affective computing research.

Continuous micro-expression intensity estimation concerns the quantification, over time, of the strength or activation of spontaneous, brief facial movements—micro-expressions—that typically reveal genuine, concealed affective states. Unlike traditional approaches that classify micro-expressions categorically or at single time-points, continuous intensity estimation seeks to assign a real-valued trajectory to the evolving facial motion, characterizing the rise, apex, and fall of the micro-expression on a per-frame basis. The field addresses both fundamental questions of emotional dynamics and practical needs in affective computing, clinical research, and human-computer interaction. Challenges arise principally from the brief duration and low signal-to-noise ratio of micro-expressions, and from the coarse annotation granularity of available datasets.

1. Objective Measurement and Characterization of Micro-Expression Intensity

Modern studies have moved away from purely subjective annotation toward objective, physiological signal-based quantification. Facial electromyography (EMG) provides the principal means to directly quantify the amplitude and temporal profile of micro-expressions. In "Could Micro-Expressions be Quantified? Electromyography Gives Affirmative Evidence" (Li et al., 16 Aug 2024), continuous frame-level intensity curves are derived from high-speed, multi-electrode facial EMG. The normalized intensity at each time $t$ for participant $i$ and channel $k^*$ (the most active facial muscle group) is given as the percentage of maximum voluntary contraction (MVC%):

\mathrm{MVC\%}_{i}^{k^*}(t) = \frac{E_{\mathrm{env},i}^{k^*}(t)}{\mathrm{MVC}_i^{k^*}} \times 100\%

where $E_{\mathrm{env},i}^{k^*}(t)$ is the EMG envelope and $\mathrm{MVC}_i^{k^*}$ is the individual's maximum exerted contraction for that muscle. Micro-expressions, thus measured, exhibit peak MVC% values in the range 7%–9.2% and durations of 307–327 ms. These EMG-derived trajectories serve both as objective characterizations for behavioral science and as precise ground truth for training continuous computer vision models.
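A minimal sketch of this normalization, assuming the envelope has already been extracted and a separate MVC calibration trial is available (the array names and peak-based MVC estimate are illustrative choices, not the paper's exact procedure):

```python
import numpy as np

def mvc_percent(emg_envelope: np.ndarray, mvc_trial_envelope: np.ndarray) -> np.ndarray:
    """Normalize an EMG envelope to %MVC for one participant and channel.

    emg_envelope: rectified, low-pass-filtered EMG from the micro-expression clip.
    mvc_trial_envelope: envelope recorded during the maximum voluntary contraction trial.
    """
    mvc = mvc_trial_envelope.max()        # individual's maximum exerted contraction
    return emg_envelope / mvc * 100.0     # frame-level intensity curve in %MVC
```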

2. Annotation Paradigms and Challenges

A central methodological challenge in continuous micro-expression intensity estimation is the lack of dense, frame-level ground truth in most face-video corpora. Traditional datasets such as SAMM and CASME II provide only three sparse landmarks per event: onset, apex, and offset frames. To circumvent this, "Weakly Supervised Continuous Micro-Expression Intensity Estimation Using Temporal Deep Neural Network" (Almushrafy, 30 Nov 2025) and "Facial Micro-Expression Spotting and Recognition using Time Contrasted Feature with Visual Memory" (Nag et al., 2019) use weak or pseudo-labels to model the expected trajectory of intensity. A standard approach interpolates linearly between annotated landmarks:

  • Intensity smoothly rises from $0$ at onset to $1$ at apex, then falls back to $0$ at offset. The most effective variant is the triangular pseudo-trajectory:

y(t) = \begin{cases} \dfrac{T(t)}{a}, & T(t) \leq a \\[4pt] \dfrac{1 - T(t)}{1 - a}, & T(t) > a \end{cases} \qquad y(t) \in [0,1]

with $T(t)$ the normalized time and $a$ the position of the apex. This method enables frame-level regression in the absence of true physical measurements.
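A sketch of the triangular pseudo-label construction, assuming the clip has been resampled so that onset and offset map to normalized times 0 and 1 (the frame count and apex index below are illustrative):

```python
import numpy as np

def triangular_pseudo_labels(num_frames: int, apex_idx: int) -> np.ndarray:
    """Piecewise-linear intensity: 0 at onset, 1 at apex, 0 at offset.

    The apex must lie strictly between the first and last frame.
    """
    t = np.linspace(0.0, 1.0, num_frames)    # normalized time T(t)
    a = apex_idx / (num_frames - 1)           # apex position a
    return np.where(t <= a, t / a, (1.0 - t) / (1.0 - a))

labels = triangular_pseudo_labels(num_frames=32, apex_idx=12)
# labels[0] == 0.0, labels[12] == 1.0 (up to float rounding), labels[-1] == 0.0
```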

3. Computational Models and Network Architectures

State-of-the-art continuous micro-expression intensity estimation leverages multi-stage, spatio-temporal neural networks designed to encode both spatial patterns and temporal dynamics associated with subtle facial actions. In (Almushrafy, 30 Nov 2025), the pipeline comprises:

  • Spatial Encoder: ResNet-18, pretrained, extracting 512-dimensional vectors per frame.
  • Temporal Encoder: A single-layer bidirectional GRU ($H=256$ per direction), aggregating dependencies across $T$ resampled frames.
  • Regression Head: A linear layer producing the frame-wise intensity estimate $\hat{y}(t)$.
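A compact PyTorch sketch of this three-stage pipeline; the layer sizes follow the description above, while the pretrained-weight identifier and input resolution are illustrative assumptions:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class IntensityEstimator(nn.Module):
    def __init__(self, hidden: int = 256):
        super().__init__()
        backbone = resnet18(weights="IMAGENET1K_V1")          # pretrained spatial encoder
        self.encoder = nn.Sequential(*list(backbone.children())[:-1])  # 512-d per frame
        self.gru = nn.GRU(512, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, 1)                  # frame-wise regression head

    def forward(self, clips: torch.Tensor) -> torch.Tensor:
        b, t = clips.shape[:2]                                # clips: (B, T, 3, 224, 224)
        feats = self.encoder(clips.flatten(0, 1)).flatten(1)  # (B*T, 512)
        temporal, _ = self.gru(feats.view(b, t, -1))          # (B, T, 2*hidden)
        return self.head(temporal).squeeze(-1)                # intensity curve (B, T)

y_hat = IntensityEstimator()(torch.randn(2, 32, 3, 224, 224))  # -> shape (2, 32)
```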

In (Nag et al., 2019), a visual memory GRU is augmented with time-contrasted spatial and motion features:

  • Spatial Network: DeepLab-largeFOV CNN (pretrained), operating on pairs of consecutive frames.
  • Temporal Network: Pretrained MPNet, extracting frame-to-frame optical flow maps.
  • Context-Contrast Module: Three contrast feature maps based on local/global/context statistics amplify micro-movement differentials.
  • Visual Memory (GRU): Aggregates fused spatial-temporal features, predicting both categorical label and scalar intensity via regression and classification heads.
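The joint output stage can be sketched as follows; this is an illustrative reconstruction of the regression-plus-classification idea, with assumed feature dimensions and class count rather than the authors' released code:

```python
import torch
import torch.nn as nn

class MultiTaskHead(nn.Module):
    """Visual-memory GRU with joint categorical and scalar-intensity outputs."""

    def __init__(self, feat_dim: int = 512, num_classes: int = 5):
        super().__init__()
        self.memory = nn.GRU(feat_dim, feat_dim, batch_first=True)  # visual memory
        self.classify = nn.Linear(feat_dim, num_classes)            # expression category
        self.regress = nn.Linear(feat_dim, 1)                       # scalar intensity

    def forward(self, fused: torch.Tensor):
        # fused: (B, T, feat_dim) spatial-temporal features after context contrast
        mem, _ = self.memory(fused)
        last = mem[:, -1]                                           # clip-level summary
        return self.classify(last), self.regress(last).squeeze(-1)
```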

The critical finding is that temporal modeling (e.g., Bi-GRU/GRU) is essential for capturing the rapid rise-apex-fall behavior intrinsic to micro-expressions (Almushrafy, 30 Nov 2025).

4. Loss Functions, Training Objectives, and Evaluation

Model optimization for continuous intensity estimation typically combines multiple losses:

  • Regression Loss: Mean squared error (MSE) between predicted and (pseudo) target intensity curves.
  • Smoothness Regularization: Penalizes large inter-frame intensity jumps,

L_{\mathrm{smooth}} = \frac{1}{T-1}\sum_{t=1}^{T-1} \left( \hat{y}(t+1) - \hat{y}(t) \right)^2

  • Apex Ranking Loss: Enforces that the predicted apex is the unique maximum,

C_{\mathrm{rank}} = \max\left(0,\; 1 - \left(\hat{y}(t_{\mathrm{ap}}) - \max_{t \neq t_{\mathrm{ap}}} \hat{y}(t)\right)\right)

Overall loss is a weighted sum:

L = \alpha_{\mathrm{reg}} L_{\mathrm{reg}} + \alpha_{\mathrm{smooth}} L_{\mathrm{smooth}} + \alpha_{\mathrm{rank}} C_{\mathrm{rank}}
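A sketch of the combined objective in PyTorch, following the three terms above; the weights are hyperparameters, and the default values here are placeholders rather than published settings:

```python
import torch

def combined_loss(y_hat: torch.Tensor, y: torch.Tensor, apex_idx: int,
                  a_reg: float = 1.0, a_smooth: float = 0.1,
                  a_rank: float = 0.1) -> torch.Tensor:
    """y_hat, y: (T,) predicted and (pseudo) target intensity curves."""
    l_reg = torch.mean((y_hat - y) ** 2)                    # MSE regression term
    l_smooth = torch.mean((y_hat[1:] - y_hat[:-1]) ** 2)    # inter-frame smoothness
    others = torch.cat([y_hat[:apex_idx], y_hat[apex_idx + 1:]])
    margin = y_hat[apex_idx] - others.max()                 # apex vs. runner-up frame
    c_rank = torch.clamp(1.0 - margin, min=0.0)             # hinge apex-ranking term
    return a_reg * l_reg + a_smooth * l_smooth + a_rank * c_rank
```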

Multi-task learning, in which categorical and regression objectives are co-optimized, is common (Nag et al., 2019). Evaluation relies on rank-based correlation coefficients (Spearman $\rho$, Kendall $\tau$) or, for EMG ground truth, Pearson correlation and RMSE. For example, the model of (Almushrafy, 30 Nov 2025) achieves $\rho = 0.9789$ and $\tau = 0.9222$ on SAMM using the full temporal network, markedly outperforming frame-wise baselines.
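The reported metrics can be computed directly with scipy; the arrays below are illustrative, not results from either paper:

```python
import numpy as np
from scipy.stats import kendalltau, pearsonr, spearmanr

y_true = np.array([0.0, 0.3, 0.7, 1.0, 0.6, 0.2, 0.0])     # pseudo or EMG-derived target
y_pred = np.array([0.1, 0.25, 0.65, 0.95, 0.55, 0.3, 0.05])

rho, _ = spearmanr(y_true, y_pred)    # rank correlation (pseudo-label evaluation)
tau, _ = kendalltau(y_true, y_pred)
r, _ = pearsonr(y_true, y_pred)       # linear correlation (EMG ground truth)
rmse = float(np.sqrt(np.mean((y_true - y_pred) ** 2)))
```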

5. Dataset Construction, Preprocessing, and Alignment

Successful continuous estimation requires both temporal alignment and rigorous preprocessing:

  • Alignment: Uniformly resampling between onset and offset places all clips on a normalized temporal grid.
  • Face Region Preprocessing: Face detection, landmark localization, alignment, cropping, resizing (e.g., $224 \times 224$ px), and normalization (to ImageNet statistics, or temporally to EMG time-points) are standard.
  • Synchronization (Physiological Ground Truth): For EMG-based datasets such as CASME-MG (Li et al., 16 Aug 2024), alignment of high-frequency EMG signals to video frames is performed using hardware triggers or digital tube indicators. EMG is band-pass filtered, rectified, enveloped, and downsampled to frame rates via windowed averaging.
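A sketch of the EMG-to-frame-rate chain described in the last item above, using scipy; the filter orders, cutoffs, and sampling rates are illustrative assumptions rather than the CASME-MG settings:

```python
import numpy as np
from scipy.signal import butter, filtfilt

def emg_to_frame_envelope(emg: np.ndarray, fs: int = 1000, fps: int = 30) -> np.ndarray:
    """Band-pass filter, rectify, envelope, then downsample EMG to the video frame rate."""
    b, a = butter(4, [20, 450], btype="bandpass", fs=fs)   # remove drift and noise
    rectified = np.abs(filtfilt(b, a, emg))                # full-wave rectification
    b, a = butter(4, 6, btype="lowpass", fs=fs)            # linear-envelope extraction
    envelope = filtfilt(b, a, rectified)
    window = fs // fps                                     # EMG samples per video frame
    n = len(envelope) // window
    return envelope[: n * window].reshape(n, window).mean(axis=1)  # windowed averaging
```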

A table summarizing dataset characteristics:

| Dataset | # Clips | Frame Rate | Annotation | Ground Truth |
|---|---|---|---|---|
| SAMM | 159 | 200 fps | Onset/apex/offset | Pseudo (Triangular) |
| CASME II | 247 | 200 fps | Onset/apex/offset, emotion | Pseudo (Triangular) |
| CASME-MG | 380 (233 ME) | 30 fps video + 1 kHz EMG | Onset/apex/offset, FACS, emotion, EMG | Continuous (MVC%) |

6. Key Results, Ablations, and Impact of Model Choices

Ablation studies in (Almushrafy, 30 Nov 2025) reveal:

  • Temporal modeling is indispensable: Adding a Bi-GRU boosts the Spearman correlation from $0.8130$ (ResNet-only) to $0.9789$ (with Bi-GRU) on SAMM.
  • Triangular Prior is optimal: Alternative pseudo-labels (e.g., Gaussian) are outperformed by the simple triangular trajectory.
  • Auxiliary losses: The smoothness and apex-ranking losses provide minor or dataset-dependent improvements; the main effect is delivered by the temporal modeling and the structured pseudo-labels.
  • Direct physiological labels: Use of continuous EMG-derived intensity (MVC%) (Li et al., 16 Aug 2024) as training targets provides fully objective ground-truth, and the CASME-MG database offers a unique resource for future research.

A plausible implication is that future vision-based intensity estimation pipelines may shift toward physiological labels as primary targets, where available, enhancing not only ecological validity but also cross-domain transferability.

7. Outlook and Future Directions

Continuous micro-expression intensity estimation is moving rapidly from subjective, sparse pseudo-labeling toward high-frequency, physiological ground truth, supported by synchronized multimodal datasets such as CASME-MG (Li et al., 16 Aug 2024). Advances in weak supervision (e.g., triangular priors (Almushrafy, 30 Nov 2025)) and context-contrasting networks (Nag et al., 2019) are improving frame-level agreement and real-time applicability. Key open directions include generalization across subjects, robustness to occlusion and variation, and the integration of multi-channel EMG for learning multi-AU trajectories. The field increasingly emphasizes rigor in annotation, benchmarking, and the physiological interpretability of estimated intensity curves.
