Point-Supervised Facial Expression Spotting
- The paper introduces a novel point-supervised method that uses a single timestamp per expression, significantly reducing annotation overhead while maintaining precise temporal localization.
- It employs Gaussian-based Instance-Adaptive Intensity Modeling (GIM) to generate smooth, instance-specific intensity profiles that improve detection accuracy for both macro- and micro-expressions.
- The dual-branch network integrates intensity regression with class-aware apex classification and contrastive learning, enabling robust expression spotting through techniques like Soft-NMS and intensity-aware contrastive losses.
Point-Supervised Facial Expression Spotting (P-FES) addresses the problem of automatically identifying temporal intervals corresponding to facial expression instances—specifically macro-expressions (MaE) and micro-expressions (ME)—in extended, untrimmed facial video streams. Departing from traditional fully-supervised paradigms that demand dense frame-level or temporal boundary annotations, P-FES leverages sparse point-level supervision, requiring only a single timestamp inside each expression instance during training. This reduces annotation overhead while still yielding temporally precise and semantically valid expression proposals in real-world facial video analysis tasks (Deng et al., 21 Nov 2025).
1. Problem Definition and Motivation
Given an untrimmed video of $T$ frames and a set of point annotations $\{(t_i, \mathbf{y}_i)\}_{i=1}^{N}$, where $t_i$ denotes a single timestamp within the $i$-th expression instance and $\mathbf{y}_i \in \{0,1\}^{C}$ is a multi-hot vector encoding $C$ classes (commonly $C = 2$: MaE and ME), the task is to predict a set of temporal proposals $\{(s_m, e_m, c_m)\}$ with onset $s_m$, offset $e_m$, and class $c_m$. Point supervision vastly reduces annotation costs and circumvents the ambiguities inherent in interval boundary determination. The primary challenge is to extrapolate from sparse signals to accurate temporal localization across varying expression durations and intensities (Deng et al., 21 Nov 2025).
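A minimal sketch of the input and output structures implied by this formulation; the field names and the two-class multi-hot encoding are illustrative assumptions rather than the paper's actual data format:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class PointAnnotation:
    """One point label: a single annotated frame inside an expression instance."""
    timestamp: int       # annotated frame t_i within the i-th expression
    label: List[int]     # multi-hot class vector y_i, e.g. [1, 0] = MaE, [0, 1] = ME

@dataclass
class Proposal:
    """One predicted expression interval produced at inference time."""
    onset: int           # predicted start frame s_m
    offset: int          # predicted end frame e_m
    cls: str             # predicted class c_m ("MaE" or "ME")
    score: float         # confidence used for ranking and Soft-NMS

# Training supervision: one PointAnnotation per expression instance in the video.
train_points = [PointAnnotation(timestamp=412, label=[1, 0]),
                PointAnnotation(timestamp=977, label=[0, 1])]
```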
2. Gaussian-Based Instance-Adaptive Intensity Modeling
To address the shortcomings of hard assignment pseudo-labeling schemes and the complex temporal signatures of facial expressions, P-FES introduces Gaussian-based Instance-Adaptive Intensity Modeling (GIM). This module models the expression intensity as a Gaussian-shaped function peaking at the apex and smoothly decaying outward, tailoring intensity supervision on a per-instance basis.
Given snippet-level features $\mathbf{x}_t$ and temporal intensity scores $q_t$, the GIM process operates per expression annotation $(t_i, \mathbf{y}_i)$ as follows:
- Pseudo-Apex Estimation: Within a search region centered on $t_i$ (whose length $d_c$ is the estimated duration for class $c$), the pseudo-apex frame $t_i^{a}$ is selected and provides a feature mean $\boldsymbol{\mu}_i$.
- Duration and Spread: The active duration $d_i$ is obtained by counting frames with $q_t$ above a threshold in the search region, which is then symmetrically expanded by a factor $\gamma$ to define $R_i$, the region for Gaussian modeling. The standard deviation $\sigma_i$ is computed as the root-mean-square distance in feature space to $\boldsymbol{\mu}_i$ over $R_i$.
- Instance-Adapted Soft Labels: For each $t \in R_i$, soft intensity pseudo-labels are defined as
$$\tilde{q}_t = \exp\!\left(-\frac{\lVert \mathbf{x}_t - \boldsymbol{\mu}_i \rVert^2}{2\sigma_i^2}\right),$$
yielding a smooth, instance-specific intensity profile peaking at the apex and providing a continuous signal for supervision (Deng et al., 21 Nov 2025).
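The following NumPy sketch walks through the GIM steps above; the search-window construction, the intensity threshold, and the expansion factor are illustrative placeholders rather than the paper's exact hyperparameters:

```python
import numpy as np

def gim_soft_labels(features, intensity, t_point, est_duration, expand=1.5, thresh=0.5):
    """Sketch of Gaussian-based Instance-Adaptive Intensity Modeling (GIM).

    features     : (T, D) snippet-level features x_t
    intensity    : (T,)   predicted intensity scores q_t
    t_point      : annotated timestamp of this expression instance
    est_duration : estimated duration for the instance's class (MaE vs. ME)
    expand, thresh are assumed hyperparameters for illustration.
    """
    T = len(intensity)
    # 1) Pseudo-apex: highest-intensity frame in a search window around the point label.
    lo = max(0, t_point - est_duration // 2)
    hi = min(T, t_point + est_duration // 2 + 1)
    t_apex = lo + int(np.argmax(intensity[lo:hi]))
    mu = features[t_apex]                                  # pseudo-apex feature mean

    # 2) Active duration: frames above threshold in the window, symmetrically
    #    expanded to define the Gaussian modeling region R_i.
    active = int((intensity[lo:hi] > thresh).sum())
    half = max(1, int(active * expand) // 2)
    r_lo, r_hi = max(0, t_apex - half), min(T, t_apex + half + 1)

    # 3) Spread: RMS feature-space distance to the apex feature over R_i.
    dists = np.linalg.norm(features[r_lo:r_hi] - mu, axis=1)
    sigma = np.sqrt((dists ** 2).mean()) + 1e-6

    # 4) Instance-adapted soft labels: Gaussian in feature distance, peaking at the apex.
    labels = np.zeros(T)
    labels[r_lo:r_hi] = np.exp(-(dists ** 2) / (2 * sigma ** 2))
    return labels, t_apex
```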
3. Network Architecture and Supervision Branches
The overall P-FES framework utilizes a two-branch network grounded on overlapping temporal snippets processed via optical flow and a backbone such as SpotFormer. The shared feature tensor $\mathbf{X} \in \mathbb{R}^{T \times D}$ feeds two functionally distinct branches:
- Class-Agnostic Expression Intensity Branch: An MLP regressor maps features to intensity scores $q_t \in [0, 1]$. Supervision signals include the GIM-derived mean squared error loss $\mathcal{L}_{\mathrm{MSE}}$, a normalization term $\mathcal{L}_{\mathrm{norm}}$, a reward for high-confidence frames $\mathcal{L}_{\mathrm{reward}}$, and a temporal smoothness regularizer $\mathcal{L}_{\mathrm{smooth}}$.
- Class-Aware Apex Classification Branch: An MLP classifier predicts per-frame class scores $p_t$, supervised via focal loss on pseudo-apex and pseudo-neutral frames, distinguishing between MaE and ME at the expression apex.
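A minimal PyTorch sketch of the two heads on top of the shared features; the feature and hidden dimensions are assumptions, and the SpotFormer backbone and optical-flow preprocessing are omitted:

```python
import torch.nn as nn

class DualBranchHeads(nn.Module):
    """Sketch of the class-agnostic intensity head and the class-aware apex classifier."""
    def __init__(self, feat_dim=256, hidden=128, num_classes=2):
        super().__init__()
        # Class-agnostic intensity regressor: one scalar q_t per snippet.
        self.intensity = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid())
        # Class-aware apex classifier: per-snippet logits over {MaE, ME}.
        self.classifier = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_classes))

    def forward(self, x):                       # x: (B, T, feat_dim) shared features
        q = self.intensity(x).squeeze(-1)       # (B, T)    intensity scores
        p = self.classifier(x)                  # (B, T, C) class logits
        return q, p
```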
An intensity-aware contrastive (IAC) loss is integrated to encourage feature discriminability and neutral-expression separation. This contrastive term leverages weights based on intensity similarities, contrasting expression and neutral frames according to their soft pseudo-labels (Deng et al., 21 Nov 2025).
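An illustrative formulation of such an intensity-aware contrastive term (not the paper's exact loss): expression frames are pulled toward the pseudo-apex feature with weights given by their GIM soft labels, while pseudo-neutral frames are pushed away:

```python
import torch
import torch.nn.functional as F

def iac_loss(features, soft_labels, neutral_mask, temperature=0.1):
    """Sketch of an intensity-aware contrastive (IAC) term for one video.

    features     : (T, D) snippet features
    soft_labels  : (T,)   GIM soft intensity pseudo-labels in [0, 1]
    neutral_mask : (T,)   boolean mask of pseudo-neutral frames
    Assumes at least one expression frame and one neutral frame are present.
    """
    feats = F.normalize(features, dim=-1)
    apex = feats[soft_labels.argmax()]            # anchor: pseudo-apex feature
    sims = feats @ apex / temperature             # scaled cosine similarity to anchor

    expr_mask = soft_labels > 0
    # Intensity-weighted positive term: high-intensity frames should resemble the apex.
    pos = (soft_labels[expr_mask] * sims[expr_mask]).sum() / soft_labels[expr_mask].sum()
    # Negative term: neutral frames should be dissimilar to the apex.
    neg = torch.logsumexp(sims[neutral_mask], dim=0)
    return neg - pos
```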
The total training loss aggregates these terms:
$$\mathcal{L} = \mathcal{L}_{\mathrm{MSE}} + \mathcal{L}_{\mathrm{norm}} + \mathcal{L}_{\mathrm{reward}} + \mathcal{L}_{\mathrm{smooth}} + \mathcal{L}_{\mathrm{focal}} + \mathcal{L}_{\mathrm{IAC}},$$
where each term may carry a balancing weight.
4. Inference Procedure and Proposal Generation
For inference, the dual-branch model processes the test video to yield temporal intensity scores $q_t$ and class logits $p_t$:
- Consecutive frames with $q_t$ above adaptively selected thresholds are grouped as candidate proposals $(s, e)$.
- Each proposal undergoes:
- Apex detection: $t^{a} = \arg\max_{t \in [s, e]} q_t$.
- Class determination: $c = \arg\max_{k}\, p_{t^{a}, k}$, subject to a minimum class-confidence threshold.
- Outer-Inner Contrastive (OIC) scoring: calculates proposal saliency by contrasting average intensity within vs. outside the segment.
- Overlaps are suppressed via class-wise Soft-NMS to finalize the set of detections (Deng et al., 21 Nov 2025).
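A simplified sketch of this proposal-generation pipeline; the threshold sweep, the OIC outer-margin width, and the Gaussian Soft-NMS decay are assumptions standing in for the exact published procedure:

```python
import numpy as np

def generate_proposals(intensity, class_probs, thresholds, sigma_nms=0.5):
    """Threshold grouping, apex/class assignment, OIC scoring, and Soft-NMS."""
    T = len(intensity)
    proposals = []
    for th in thresholds:
        above = intensity > th
        t = 0
        while t < T:
            if not above[t]:
                t += 1
                continue
            s = t
            while t < T and above[t]:
                t += 1
            e = t - 1
            apex = s + int(np.argmax(intensity[s:e + 1]))     # apex detection
            cls = int(np.argmax(class_probs[apex]))           # class at the apex
            # Outer-Inner Contrastive score: inner mean intensity minus the
            # mean intensity of a surrounding outer margin.
            margin = max(1, (e - s + 1) // 4)
            outer = np.concatenate([intensity[max(0, s - margin):s],
                                    intensity[e + 1:min(T, e + 1 + margin)]])
            oic = intensity[s:e + 1].mean() - (outer.mean() if len(outer) else 0.0)
            proposals.append([s, e, cls, oic])
    return soft_nms(proposals, sigma_nms)

def soft_nms(proposals, sigma=0.5):
    """Class-wise Gaussian Soft-NMS: decay the scores of overlapping proposals."""
    out, props = [], sorted(proposals, key=lambda p: p[3], reverse=True)
    while props:
        best = props.pop(0)
        out.append(best)
        for p in props:
            if p[2] != best[2]:
                continue                                      # suppress same class only
            inter = max(0, min(best[1], p[1]) - max(best[0], p[0]) + 1)
            union = (best[1] - best[0] + 1) + (p[1] - p[0] + 1) - inter
            p[3] *= np.exp(-(inter / union) ** 2 / sigma)     # Gaussian score decay
        props.sort(key=lambda p: p[3], reverse=True)
    return out
```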
5. Experimental Protocols and Evaluation
Evaluation adheres to standardized datasets and metrics for facial expression spotting:
- Datasets: SAMM-LV (147 videos, 343 MaE + 159 ME, subsampled to 30 fps), CAS(ME)² (98 videos, 300 MaE + 57 ME), and CAS(ME)³ (over 2231 MaE + 285 ME, with ME clips filtered to <15 frames).
- Validation: Leave-one-subject-out cross-validation, with a proposal counted as a true positive when it overlaps a ground-truth interval at IoU $\geq 0.5$ (see the evaluation sketch after the results table).
- Performance: Main results are summarized as follows (Tables 1 and 2; F1-scores):
| Method | SAMM-LV (All) | CAS(ME)² (All) | CAS(ME)³ (All) |
|---|---|---|---|
| Fully-supervised | 0.4401 | 0.4841 | 0.2559 |
| LAC | 0.3223 | 0.3598 | 0.2146 |
| TSP-Net | 0.2703 | 0.3358 | 0.1484 |
| Deng et al. 2025 | 0.3587 | 0.4000 | 0.2273 |
| Ours (P-FES+GIM+IAC) | 0.3705 | 0.4023 | 0.2335 |
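For reference, the matching criterion and F1-score mentioned above can be computed as in the following sketch (one-to-one greedy matching at IoU $\geq 0.5$; per-video aggregation across the leave-one-subject-out splits is omitted):

```python
def spotting_f1(predictions, ground_truth, iou_thresh=0.5):
    """F1-score for expression spotting: a prediction is a true positive if it
    overlaps an unmatched ground-truth interval with IoU >= iou_thresh.
    Intervals are (onset, offset) frame pairs, inclusive on both ends."""
    matched, tp = set(), 0
    for (ps, pe) in predictions:
        for gi, (gs, ge) in enumerate(ground_truth):
            if gi in matched:
                continue
            inter = max(0, min(pe, ge) - max(ps, gs) + 1)
            union = (pe - ps + 1) + (ge - gs + 1) - inter
            if inter / union >= iou_thresh:
                matched.add(gi)
                tp += 1
                break
    precision = tp / len(predictions) if predictions else 0.0
    recall = tp / len(ground_truth) if ground_truth else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0
```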
Ablation studies and qualitative visualizations confirm that instance-adaptive soft labeling (GIM) and the two-branch optimization outperform hard labeling and fused single-branch designs, particularly for ME spotting. The approach yields smooth, neutral-suppressed intensity curves and tight feature clustering around Gaussian-shaped expression profiles (Deng et al., 21 Nov 2025).
6. Comparative Context and Recent Advances
P-FES with GIM expands the methodological repertoire for point-supervised facial expression spotting, contrasting with anchor-based systems (e.g., LGSNet) and interval regression approaches. PESFormer (Yu et al., 24 Oct 2024) provides an alternative by leveraging Direct Timestamp Encoding (DTE), i.e., binary supervision per snippet, within a vision transformer backbone, forgoing anchor parameterization and interval regression. PESFormer demonstrates significant gains in recall, precision, and F1-score by operationalizing dense snippet-wise classification with direct point (timestamp) supervision and zero-padding for interval preservation.
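An illustrative sketch of the DTE idea: each temporal snippet containing an annotated timestamp receives a positive label for that class, and all remaining snippets are background. Snippet partitioning and label shape are assumptions; see PESFormer (Yu et al., 24 Oct 2024) for the actual encoding:

```python
import numpy as np

def direct_timestamp_encoding(num_snippets, snippet_len, points, num_classes=2):
    """Convert point annotations into dense binary per-snippet labels.

    points : list of (timestamp, class_id) point annotations for one video
    """
    labels = np.zeros((num_snippets, num_classes), dtype=np.float32)
    for t, c in points:
        idx = min(t // snippet_len, num_snippets - 1)   # snippet containing the point
        labels[idx, c] = 1.0
    return labels
```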
This broader shift toward point-supervised learning in temporal localization tasks is motivated by the sparse nature of micro-expressions and the annotational impracticalities of precise boundary definition, making P-FES and DTE-driven approaches influential for automatic facial expression analysis in unconstrained environments (Deng et al., 21 Nov 2025, Yu et al., 24 Oct 2024).
7. Code Availability and Reproducibility
The codebase and pretrained weights for the P-FES method with GIM and IAC loss are publicly available at https://github.com/KinopioIsAllIn/GIM, supporting reproducibility and algorithmic benchmarking in facial expression spotting research (Deng et al., 21 Nov 2025).