SpotFormer: Multi-Scale Spatio-Temporal Transformer for Facial Expression Spotting (2407.20799v1)
Abstract: Facial expression spotting, the task of identifying periods in which facial expressions occur in a video, is a significant yet challenging problem in facial expression analysis. Irrelevant facial movements and the difficulty of detecting the subtle motions of micro-expressions remain unresolved issues that hinder accurate expression spotting. In this paper, we propose an efficient framework for facial expression spotting. First, we propose a Sliding Window-based Multi-Resolution Optical flow (SW-MRO) feature, which computes multi-resolution optical flow of the input image sequence within compact sliding windows. The window length is tailored to capture complete micro-expressions and to distinguish between general macro- and micro-expressions. SW-MRO effectively reveals subtle motions while avoiding the problems caused by severe head movements. Second, we propose SpotFormer, a multi-scale spatio-temporal Transformer that simultaneously encodes the spatio-temporal relationships of the SW-MRO features for accurate frame-level probability estimation. In SpotFormer, our proposed Facial Local Graph Pooling (FLGP) and convolutional layers are applied for multi-scale spatio-temporal feature extraction. We validate SpotFormer's architecture by comparing it with several model variants. Third, we introduce supervised contrastive learning into SpotFormer to enhance the discriminability between different types of expressions. Extensive experiments on SAMM-LV and CAS(ME)2 show that our method outperforms state-of-the-art models, particularly in micro-expression spotting.
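To make the SW-MRO idea concrete, below is a minimal Python sketch using OpenCV's Farnebäck dense optical flow. The abstract does not specify the exact window anchoring, flow algorithm, or hyperparameters, so the reference-frame choice, window length, and scale set here are illustrative assumptions rather than the paper's configuration; the sketch also assumes grayscale uint8 frames of identical size.

```python
import cv2
import numpy as np

def sw_mro(frames, window_len=9, scales=(1.0, 0.5, 0.25)):
    """Sketch of a Sliding Window-based Multi-Resolution Optical flow feature.

    For each frame, dense optical flow is computed from the first frame of a
    compact sliding window to the current frame, at several image resolutions,
    and the per-scale flow maps are stacked channel-wise.

    frames: list of grayscale uint8 images of identical shape (H, W).
    window_len and scales are illustrative placeholders, not the paper's values.
    """
    h, w = frames[0].shape
    features = []
    for i, cur in enumerate(frames):
        ref = frames[max(0, i - window_len + 1)]  # window start as reference frame
        flows = []
        for s in scales:
            # Downsample the reference and current frames to scale s.
            ref_s = cv2.resize(ref, None, fx=s, fy=s)
            cur_s = cv2.resize(cur, None, fx=s, fy=s)
            # Farneback parameters (positional): pyr_scale, levels, winsize,
            # iterations, poly_n, poly_sigma, flags.
            flow = cv2.calcOpticalFlowFarneback(
                ref_s, cur_s, None, 0.5, 3, 15, 3, 5, 1.2, 0)
            # Upsample back to (W, H) and rescale displacements to
            # full-resolution pixel units.
            flow = cv2.resize(flow, (w, h)) / s
            flows.append(flow)
        # (H, W, 2 * len(scales)): u/v flow channels per scale.
        features.append(np.concatenate(flows, axis=-1))
    return np.stack(features)  # (T, H, W, 2 * len(scales))
```

In the paper, frame-level features of this kind are then fed to SpotFormer for per-frame probability estimation; this sketch covers only the feature-extraction side.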
Authors: Yicheng Deng, Hideaki Hayashi, Hajime Nagahara