
Point-Supervised Facial Expression Spotting

Updated 28 November 2025
  • The paper introduces a novel point-supervised method that uses a single timestamp per expression, significantly reducing annotation overhead while maintaining precise temporal localization.
  • It employs Gaussian-based Instance-Adaptive Intensity Modeling (GIM) to generate smooth, instance-specific intensity profiles that improve detection accuracy for both macro- and micro-expressions.
  • The dual-branch network integrates intensity regression with class-aware apex classification and contrastive learning, enabling robust expression spotting through techniques like Soft-NMS and intensity-aware contrastive losses.

Point-Supervised Facial Expression Spotting (P-FES) addresses the problem of automatically identifying temporal intervals corresponding to facial expression instances—specifically macro-expressions (MaE) and micro-expressions (ME)—in extended, untrimmed facial video streams. Departing from traditional fully-supervised paradigms that demand dense frame-level or temporal boundary annotations, P-FES leverages sparse point-level supervision, requiring only a single timestamp inside each expression instance during training. This reduces annotation overhead while still yielding temporally precise and semantically valid expression proposals in real-world facial video analysis tasks (Deng et al., 21 Nov 2025).

1. Problem Definition and Motivation

Given an untrimmed video $V = \{v_1, \ldots, v_T\}$ of $T$ frames and a set of $N$ point annotations $Y = \{(p_i, y_i)\}_{i=1}^{N}$—where $p_i$ denotes a single timestamp within the $i$-th expression and $y_i \in \{0,1\}^C$ is a multi-hot vector encoding $C$ classes (commonly $C=2$, for MaE and ME)—the task is to predict a set of temporal proposals $\{(s_j, e_j, c_j)\}$ with onset $s_j$, offset $e_j$, and class $c_j \in \{\mathrm{MaE}, \mathrm{ME}\}$. Point supervision vastly reduces annotation costs and circumvents the ambiguities inherent in determining interval boundaries. The primary challenge is to extrapolate from sparse signals to accurate temporal localization across varying expression durations and intensities (Deng et al., 21 Nov 2025).
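
To make the supervision format concrete, the following is a minimal sketch of the data structures involved; the field names are hypothetical and chosen for illustration, not taken from the paper or its code.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class PointAnnotation:
    """A single labeled timestamp inside one expression instance."""
    timestamp: int      # p_i: a frame index somewhere inside the i-th expression
    label: List[int]    # y_i: multi-hot over C classes, e.g. [1, 0] = MaE, [0, 1] = ME

@dataclass
class Proposal:
    """A predicted temporal interval with its class and confidence."""
    onset: int          # s_j
    offset: int         # e_j
    cls: str            # c_j in {"MaE", "ME"}
    score: float = 0.0  # confidence, later used for Soft-NMS

# Example: two annotations for a long untrimmed video.
annotations = [
    PointAnnotation(timestamp=1523, label=[1, 0]),  # a macro-expression
    PointAnnotation(timestamp=4871, label=[0, 1]),  # a micro-expression
]
```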

2. Gaussian-Based Instance-Adaptive Intensity Modeling

To address the shortcomings of hard assignment pseudo-labeling schemes and the complex temporal signatures of facial expressions, P-FES introduces Gaussian-based Instance-Adaptive Intensity Modeling (GIM). This module models the expression intensity as a Gaussian-shaped function peaking at the apex and smoothly decaying outward, tailoring intensity supervision on a per-instance basis.

Given snippet-level features $\mathbf{x}_j \in \mathbb{R}^D$ and temporal intensity scores $a_j$, the GIM process operates per expression annotation as follows:

  • Pseudo-Apex Estimation: Within a search region $I_i = [p_i - k_c/4,\ p_i + k_c/4]$ (where $k_c$ is the estimated duration for class $c$), the pseudo-apex frame $v_i^{\mathrm{apex}} = \operatorname{argmax}_{j \in I_i} a_j$ provides the feature mean $\mu_i = \mathbf{x}_{v_i^{\mathrm{apex}}}$.
  • Duration and Spread: The active duration $L_i$ is obtained by counting frames with $a_j > \theta$ in $J_i = [v_i^{\mathrm{apex}} - k_c/2,\ v_i^{\mathrm{apex}} + k_c/2]$; this interval is then symmetrically expanded by a factor $\delta$ to define $K_i$, the region for Gaussian modeling. The standard deviation $\sigma_i$ is computed as the root-mean-square distance in feature space from $\mu_i$ over $K_i$.
  • Instance-Adapted Soft Labels: For each $j \in K_i$, the soft intensity pseudo-label is defined as

$$\hat{a}_j = \exp\left( -\frac{\| \mathbf{x}_j - \mu_i \|^2}{2 \sigma_i^2} \right)$$

yielding a smooth, instance-specific intensity profile peaking at the apex and providing a continuous signal for supervision (Deng et al., 21 Nov 2025).
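
The full GIM procedure can be sketched in a few lines of NumPy. This is an illustrative reimplementation under assumptions: the threshold $\theta$, expansion factor $\delta$, and class-wise duration $k_c$ are hyperparameters whose values below are placeholders, not the paper's settings.

```python
import numpy as np

def gim_soft_labels(features, scores, p, k_c, theta=0.5, delta=1.25):
    """Sketch of Gaussian-based Instance-Adaptive Intensity Modeling for one annotation.

    features : (T, D) snippet-level features x_j
    scores   : (T,)   current intensity predictions a_j
    p        : int    annotated timestamp p_i
    k_c      : int    estimated duration for the annotated class c
    Returns the modeled region K_i and the soft labels a_hat over it.
    """
    T = len(scores)

    # 1) Pseudo-apex: highest-intensity frame in the search region around p_i.
    lo, hi = max(0, p - k_c // 4), min(T, p + k_c // 4 + 1)
    apex = lo + int(np.argmax(scores[lo:hi]))
    mu = features[apex]                               # feature mean mu_i

    # 2) Active duration inside J_i, then symmetric expansion by delta to get K_i.
    j_lo, j_hi = max(0, apex - k_c // 2), min(T, apex + k_c // 2 + 1)
    L = int(np.sum(scores[j_lo:j_hi] > theta))
    half = max(1, int(round(delta * L / 2)))
    K = np.arange(max(0, apex - half), min(T, apex + half + 1))

    # Spread sigma_i: RMS feature-space distance to mu_i over K_i.
    dists = np.linalg.norm(features[K] - mu, axis=1)
    sigma = np.sqrt(np.mean(dists ** 2)) + 1e-8

    # 3) Instance-adapted soft labels: Gaussian in feature distance, peaking at the apex.
    a_hat = np.exp(-(dists ** 2) / (2 * sigma ** 2))
    return K, a_hat

# Example with random stand-ins for backbone features and intensity scores.
feats, scores = np.random.randn(1000, 128), np.random.rand(1000)
K, a_hat = gim_soft_labels(feats, scores, p=400, k_c=60)
```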

3. Network Architecture and Supervision Branches

The overall P-FES framework employs a two-branch network operating on overlapping temporal snippets, which are processed via optical flow and a backbone such as SpotFormer. The shared feature tensor $\mathbf{F} \in \mathbb{R}^{T \times D}$ feeds two functionally distinct branches (a minimal code sketch of both heads follows the list):

  • Class-Agnostic Expression Intensity Branch: An MLP regressor maps features to intensity scores $a \in \mathbb{R}^T$. Supervision signals include the GIM-derived mean-squared-error loss $L_\mathrm{GIM}$, an $\ell_1$ normalization term $L_\mathrm{norm}$, a reward for high-confidence frames $L_\mathrm{reward}$, and a temporal smoothness regularizer $L_\mathrm{smooth}$.
  • Class-Aware Apex Classification Branch: An MLP classifier predicts per-frame class scores $S \in \mathbb{R}^{T \times C}$, supervised via a focal loss on pseudo-apex and pseudo-neutral frames to distinguish MaE from ME at the expression apex.
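
A minimal PyTorch-style sketch of the two heads on top of shared backbone features is shown below; layer sizes, activations, and names are illustrative assumptions rather than the released architecture.

```python
import torch
import torch.nn as nn

class DualBranchHead(nn.Module):
    """Two lightweight heads over shared snippet features F of shape (T, D)."""

    def __init__(self, feat_dim: int, num_classes: int = 2, hidden: int = 256):
        super().__init__()
        # Class-agnostic intensity regressor: one score a_t in [0, 1] per snippet.
        self.intensity = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid(),
        )
        # Class-aware apex classifier: per-snippet scores over {MaE, ME}.
        self.classifier = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, feats: torch.Tensor):
        a = self.intensity(feats).squeeze(-1)   # (T,) intensity scores
        s = self.classifier(feats)              # (T, C) class logits
        return a, s

# Example: T = 512 snippets with D = 256-dimensional backbone features.
a, s = DualBranchHead(feat_dim=256)(torch.randn(512, 256))
```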

An intensity-aware contrastive (IAC) loss is integrated to encourage feature discriminability and neutral-expression separation. This contrastive term leverages weights based on intensity similarities, contrasting expression and neutral frames according to their soft pseudo-labels (Deng et al., 21 Nov 2025).
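
The exact IAC formulation is not reproduced here, so the following is only one plausible instantiation: an InfoNCE-style term in which high-intensity expression frames are pulled together and pushed away from pseudo-neutral frames, with each anchor weighted by its soft intensity. The temperature and the 0.5 cutoff are assumptions.

```python
import torch
import torch.nn.functional as F

def intensity_aware_contrastive(feats, soft_intensity, neutral_mask, tau=0.1):
    """Sketch of an intensity-aware contrastive (IAC) loss.

    feats          : (T, D) snippet features
    soft_intensity : (T,)   GIM soft pseudo-labels in [0, 1]
    neutral_mask   : (T,)   bool, True for pseudo-neutral frames
    """
    z = F.normalize(feats, dim=-1)
    expr_mask = soft_intensity > 0.5                # treat high-intensity frames as anchors
    if expr_mask.sum() < 2 or neutral_mask.sum() == 0:
        return feats.new_zeros(())

    expr, neu = z[expr_mask], z[neutral_mask]
    w = soft_intensity[expr_mask]                   # weight each anchor by its soft intensity

    pos = torch.exp(expr @ expr.T / tau)            # anchor-to-expression similarities
    pos = pos - torch.diag(torch.diag(pos))         # drop self-similarity
    neg = torch.exp(expr @ neu.T / tau)             # anchor-to-neutral similarities

    ratio = pos.sum(dim=1) / (pos.sum(dim=1) + neg.sum(dim=1) + 1e-8)
    return -(w * torch.log(ratio + 1e-8)).sum() / (w.sum() + 1e-8)

# Example with random stand-ins; frames with very low soft intensity act as pseudo-neutral.
feats, soft = torch.randn(200, 128), torch.rand(200)
loss = intensity_aware_contrastive(feats, soft, soft < 0.1)
```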

The total training loss aggregates these terms:

$$L = L_\mathrm{GIM} + L_\mathrm{apex} + L_\mathrm{reward} + \lambda_1 L_\mathrm{smooth} + \lambda_2 L_\mathrm{norm} + \lambda_3 L_\mathrm{IAC}$$

4. Inference Procedure and Proposal Generation

For inference, the dual-branch model processes the test video to yield temporal intensity scores $\{a_t\}$ and class logits $\{S_{t,c}\}$ (a rough code sketch of the full procedure follows the steps below):

  1. Consecutive frames with $a_t$ above adaptively selected thresholds are grouped as candidate proposals.
  2. Each proposal $(s, e)$ undergoes:
    • Apex detection: $t^* = \operatorname{argmax}_{t \in [s,e]} a_t$.
    • Class determination: $c = \operatorname{argmax}_{c} S_{t^*,c}$, subject to $S_{t^*,c} > 0.5$.
    • Outer-Inner Contrastive (OIC) scoring: calculates proposal saliency by contrasting average intensity within vs. outside the segment.
  3. Overlaps are suppressed via class-wise Soft-NMS to finalize the set of detections (Deng et al., 21 Nov 2025).
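
A rough end-to-end sketch of this inference pipeline is given below. The fixed threshold, the outer-region margin used for OIC scoring, and the omission of Soft-NMS are simplifying assumptions (the paper selects thresholds adaptively and applies class-wise Soft-NMS).

```python
import numpy as np

def spot_expressions(a, S, thresh=0.5, margin=8):
    """Sketch: turn intensity scores a (T,) and class scores S (T, C) into proposals."""
    T, classes, proposals = len(a), ["MaE", "ME"], []
    above, t = a > thresh, 0

    while t < T:
        if above[t]:
            s = t
            while t + 1 < T and above[t + 1]:       # 1) group consecutive high-intensity frames
                t += 1
            e = t

            apex = s + int(np.argmax(a[s:e + 1]))   # 2a) apex detection
            c = int(np.argmax(S[apex]))             # 2b) class determination at the apex
            if S[apex, c] > 0.5:
                # 2c) Outer-Inner Contrastive score: inner mean intensity minus the
                # mean intensity in a small outer region around the segment.
                inner = a[s:e + 1].mean()
                outer_idx = np.r_[max(0, s - margin):s, e + 1:min(T, e + 1 + margin)]
                outer = a[outer_idx].mean() if len(outer_idx) else 0.0
                proposals.append((s, e, classes[c], float(inner - outer)))
        t += 1

    # 3) Class-wise Soft-NMS would rescore and suppress overlapping proposals here.
    return proposals

# Example with random stand-ins for the two branch outputs.
print(spot_expressions(np.random.rand(300), np.random.rand(300, 2)))
```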

5. Experimental Protocols and Evaluation

Evaluation adheres to standardized datasets and metrics for facial expression spotting:

  • Datasets: SAMM-LV (147 videos, 343 MaE + 159 ME, subsampled to 30 fps), CAS(ME)$^2$ (98 videos, 300 MaE + 57 ME), and CAS(ME)$^3$ (over 2231 MaE + 285 ME, with ME clips filtered for <15 frames).
  • Validation: Leave-one-subject-out cross-validation, counting a detection as a true positive when its IoU with a ground-truth interval is $\geq 0.5$ (see the evaluation sketch after the results table).
  • Performance: Main results are summarized below (Tables 1 and 2; F1 scores):

Method | SAMM-LV (All) | CAS(ME)$^2$ (All) | CAS(ME)$^3$ (All)
Fully-supervised | 0.4401 | 0.4841 | 0.2559
LAC | 0.3223 | 0.3598 | 0.2146
TSP-Net | 0.2703 | 0.3358 | 0.1484
Deng et al. 2025 | 0.3587 | 0.4000 | 0.2273
Ours (P-FES+GIM+IAC) | 0.3705 | 0.4023 | 0.2335
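
The IoU-based matching criterion can be approximated with a short, generic sketch; this greedy matcher is a common simplification and is not the official evaluation script.

```python
def iou(p, g):
    """Temporal IoU between a proposal p = (onset, offset) and a ground truth g."""
    inter = max(0, min(p[1], g[1]) - max(p[0], g[0]) + 1)
    union = (p[1] - p[0] + 1) + (g[1] - g[0] + 1) - inter
    return inter / union if union > 0 else 0.0

def f1_score(proposals, ground_truths, thr=0.5):
    """Count a proposal as a true positive if it matches an unused GT with IoU >= thr."""
    matched, tp = set(), 0
    for p in proposals:
        for i, g in enumerate(ground_truths):
            if i not in matched and iou(p, g) >= thr:
                matched.add(i)
                tp += 1
                break
    fp, fn = len(proposals) - tp, len(ground_truths) - tp
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

# Example: one correct detection (IoU ~= 0.80) and one missed ground-truth instance.
print(f1_score([(100, 160)], [(95, 170), (800, 820)]))  # -> 0.666...
```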

Ablation studies and qualitative visualizations confirm that instance-adaptive soft labeling (GIM) and two-branch optimization outperform hard labeling and fused structures, particularly in ME spotting. The approach delivers smooth, neutral-suppressed intensity curves and tight feature clustering around Gaussian-shaped expressions (Deng et al., 21 Nov 2025).

6. Comparative Context and Recent Advances

P-FES with GIM expands the methodological repertoire for point-supervised facial expression spotting, contrasting with anchor-based systems (e.g., LGSNet) and interval regression approaches. PESFormer (Yu et al., 24 Oct 2024) provides an alternative by leveraging Direct Timestamp Encoding (DTE)—binary supervision per snippet—within a vision transformer backbone, forgoing anchor parameterization and interval regression. PESFormer demonstrates significant gains in recall, precision, and F1 by operationalizing dense snippet-wise classification with direct point (timestamp) supervision and zero-padding for interval preservation.
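
As a rough illustration only, DTE can be pictured as converting each annotated timestamp into a positive label for the snippet that contains it, with zero-padding to a fixed number of snippets; the snippet length and padded length below are hypothetical, and the actual PESFormer scheme may differ in detail.

```python
import numpy as np

def direct_timestamp_encoding(timestamps, num_frames, snippet_len=8, padded_snippets=512):
    """Sketch: dense binary per-snippet labels from point annotations, zero-padded."""
    num_snippets = int(np.ceil(num_frames / snippet_len))
    assert num_snippets <= padded_snippets, "video longer than the padded length"

    labels = np.zeros(padded_snippets, dtype=np.int64)
    for t in timestamps:
        labels[t // snippet_len] = 1   # positive supervision at the annotated snippet
    return labels                      # everything else, including the padding, stays 0

labels = direct_timestamp_encoding([1523, 2871], num_frames=3000)
```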

This broader shift toward point-supervised learning in temporal localization tasks is motivated by the sparse nature of micro-expressions and the impracticality of annotating precise interval boundaries, making P-FES and DTE-driven approaches influential for automatic facial expression analysis in unconstrained environments (Deng et al., 21 Nov 2025, Yu et al., 24 Oct 2024).

7. Code Availability and Reproducibility

The codebase and pretrained weights for the P-FES method with GIM and IAC loss are publicly available at https://github.com/KinopioIsAllIn/GIM, supporting reproducibility and algorithmic benchmarking in facial expression spotting research (Deng et al., 21 Nov 2025).
