AU-vMAE: Video Autoencoder for FAU Detection
- AU-vMAE is a video-masked autoencoder framework for detecting Facial Action Units in unconstrained videos using high-ratio tube masking.
- The framework employs a two-stage architecture that combines large-scale self-supervised pre-training with multi-level fine-tuning to enhance classification accuracy.
- Incorporating FSM-style spatial and temporal priors, AU-vMAE achieves state-of-the-art results on benchmarks such as BP4D and DISFA.
AU-vMAE is a video-masked autoencoder framework explicitly designed for the detection of Facial Action Units (FAUs) in unconstrained video data, addressing core challenges of annotation scarcity and high inter-personal variability. It introduces a knowledge-guided approach that leverages large-scale self-supervised video pre-training, multi-level supervision, and finite-state-machine (FSM) style temporal and spatial priors, eschewing traditional Graph Neural Network models. The method is validated with state-of-the-art results on major FAU benchmarks, notably BP4D and DISFA (Jin et al., 2024).
1. Architectural Overview
AU-vMAE centers on a two-stage architecture: (1) large-scale pre-training via a video-masked autoencoder (videoMAE), and (2) task-specific fine-tuning for FAU recognition. The pre-training stage employs a Vision Transformer (ViT) encoder operating on partially observed video sequences, where "tube" masking occludes contiguous spatio-temporal regions at high ratios (up to 90%). The lightweight Transformer decoder reconstructs masked frames from the encoder's learned latent representation.
Input video clips are divided into a temporal sequence of spatial patch tubes, with temporal down-sampling applied as required by the target fine-tuning mode. The decoder reprojects the encoder's output into the original spatio-temporal grid, emphasizing both spatial detail and temporal continuity restoration.
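The tube masking described above can be illustrated with a small sketch. It assumes the common videoMAE convention in which one random spatial mask is shared by every frame, so that each masked patch position becomes a tube through time. The function name `tube_mask`, the 14×14 patch grid, and the clip length are illustrative assumptions, not values taken from the paper.

```python
import random

def tube_mask(num_frames, grid_h, grid_w, mask_ratio=0.9, seed=None):
    """Return mask[t][p] (True = masked) with the same spatial mask in every frame.

    Assumption: tubes are formed by repeating one spatial mask over time.
    """
    rng = random.Random(seed)
    num_patches = grid_h * grid_w
    num_masked = int(round(mask_ratio * num_patches))
    masked = set(rng.sample(range(num_patches), num_masked))
    frame_mask = [p in masked for p in range(num_patches)]
    # Copy the spatial mask for each frame so every masked patch spans all time steps.
    return [list(frame_mask) for _ in range(num_frames)]

mask = tube_mask(num_frames=8, grid_h=14, grid_w=14, mask_ratio=0.9, seed=0)
visible = [p for p, is_masked in enumerate(mask[0]) if not is_masked]
print(len(visible), "visible patches per frame out of", 14 * 14)
```

With a 90% ratio on a 196-patch grid, only 20 patch positions per frame reach the encoder, which is what makes the reconstruction task hard enough to require temporal context.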
2. Masking and Reconstruction Strategy
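AU-vMAE uses a tube-based masking regime that randomly removes entire spatio-temporal tubes. This high-ratio strategy forces the encoder to capture both short- and long-range spatio-temporal structure. The L₂ reconstruction loss is computed over masked tokens only:
$$\mathcal{L}_{\text{rec}} = \frac{1}{|\Omega_v|} \sum_{i \in \Omega_v} \left\| \hat{x}_i - x_i \right\|_2^2$$
where $\Omega_v$ is the set of masked patch indices in video $v$, and $x_i$ and $\hat{x}_i$ denote the original and reconstructed patch $i$.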
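A masked-only L2 reconstruction loss can be sketched as follows. This is a minimal illustration under the assumption that patches are flat lists of floats and that the loss is the squared error averaged over masked patches; the function name `masked_mse` and the toy data are hypothetical.

```python
def masked_mse(original, reconstructed, masked_indices):
    """Squared L2 error averaged over masked patches only.

    original / reconstructed: lists of patches, each patch a list of floats.
    Visible (unmasked) patches do not contribute to the loss.
    """
    if not masked_indices:
        raise ValueError("at least one masked patch is required")
    total = 0.0
    for i in masked_indices:
        total += sum((a - b) ** 2 for a, b in zip(original[i], reconstructed[i]))
    return total / len(masked_indices)

patches = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
recon = [[9.0, 9.0], [1.0, 3.0], [2.0, 2.0]]
# Patch 0 is visible, so its large error is ignored.
loss = masked_mse(patches, recon, masked_indices=[1, 2])
print(loss)  # 2.0
```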
Temporal down-sampling depends on the fine-tuning granularity that follows: none for video-level and patch-level input, and down-sampling for frame-level input.
3. Multi-Label and Multi-Level FAU Classification
Following videoMAE pre-training, AU-vMAE leverages a lightweight linear classification head per frame for simultaneous multi-label prediction over action units. Three granularities of input are utilized during fine-tuning:
- Video-level: Full-length input (no masking, no down-sampling).
- Frame-level: Sequence down-sampled by $4$ in time (no masking).
- Patch-level: Full temporal resolution, with random tube-masking.
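The three input granularities listed above can be summarized in a short sketch. The function name `prepare_input` and the representation of a clip as a list of frames are assumptions for illustration; only the down-sampling factor of 4 and the masking choices come from the list.

```python
def prepare_input(frames, level, stride=4):
    """Return (frames_to_use, apply_tube_mask) for one fine-tuning granularity."""
    if level == "video":
        return list(frames), False            # full length, no masking
    if level == "frame":
        return list(frames[::stride]), False  # temporal down-sampling by `stride`
    if level == "patch":
        return list(frames), True             # full temporal resolution, tube masking
    raise ValueError(f"unknown level: {level!r}")

clip = list(range(16))  # stand-in for 16 video frames
for level in ("video", "frame", "patch"):
    used, masked = prepare_input(clip, level)
    print(level, len(used), masked)
```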
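To address the class imbalance inherent in FAU datasets, class-balance weighting is integrated into the binary cross-entropy classification loss:
$$\mathcal{L}_{\text{cls}} = -\frac{1}{N} \sum_{i=1}^{N} w_i \left[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \right]$$
where $N$ is the number of AUs, $y_i$ and $p_i$ are the ground-truth label and predicted probability for AU $i$, and $w_i$ incorporates the inverse frequency of AU $i$.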
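A class-balanced binary cross-entropy of this kind can be sketched as below. The weight normalization (inverse occurrence rates rescaled so that the weights sum to the number of AUs) is an assumption borrowed from common practice in AU detection, not a detail confirmed by the source, and both function names are hypothetical.

```python
import math

def inverse_frequency_weights(labels):
    """labels: list of frames, each a list of 0/1 per AU.

    Assumption: weights are inverse occurrence rates, rescaled to sum to the AU count.
    """
    num_aus = len(labels[0])
    rates = [sum(frame[i] for frame in labels) / len(labels) for i in range(num_aus)]
    inv = [1.0 / max(rate, 1e-6) for rate in rates]
    scale = num_aus / sum(inv)
    return [w * scale for w in inv]

def weighted_bce(probs, targets, weights, eps=1e-7):
    """Weighted binary cross-entropy averaged over AUs for a single frame."""
    total = 0.0
    for p, y, w in zip(probs, targets, weights):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= w * (y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(probs)

train_labels = [[1, 0], [1, 0], [1, 1], [1, 0]]  # AU 0 is frequent, AU 1 is rare
weights = inverse_frequency_weights(train_labels)
print(weights)
print(weighted_bce([0.9, 0.2], [1, 1], weights))
```

The rare AU receives the larger weight, so a missed detection on it costs more than the same error on a frequent AU.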
4. Incorporation of Spatio-Temporal Prior Knowledge
A distinguishing feature of AU-vMAE is the use of knowledge-guided FSM priors over AU pairs, providing inductive bias for both intra-frame (spatial) and inter-frame (temporal) consistency, but without using GNNs:
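- Intra-frame co-occurrence ($P^{\text{intra}}$):
Conditional probability that a pair of AUs co-occur in a frame, estimated as
$$P^{\text{intra}}_{ij} = P(y_j = 1 \mid y_i = 1) = \frac{\sum_{n} y^{(n)}_i \, y^{(n)}_j}{\sum_{n} y^{(n)}_i}$$
where $y^{(n)}_i \in \{0, 1\}$ is the label of AU $i$ in frame $n$. The predicted counterpart, $\hat{P}^{\text{intra}}$, is updated per batch from binarized classifier outputs, with straight-through gradient estimation. The corresponding loss, $\mathcal{L}_{\text{intra}}$, encourages predicted co-occurrence statistics to match the empirical priors.
- Inter-frame transition ($P^{\text{inter}}$):
The model encodes the empirical transition probabilities for each AU pair across consecutive frames as a tensor covering all combinations of AU pair transitions. For predicted probabilities $p_t$ at time $t$, the corresponding loss, $\mathcal{L}_{\text{inter}}$, matches predicted transition statistics to this prior, with the transition index constructed via a bitwise encoding scheme.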
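The two empirical priors can be estimated from per-frame labels roughly as follows. This is a sketch under stated assumptions: the 4-bit packing of an AU pair's states at consecutive frames into a code from 0 to 15 is one plausible reading of "bitwise encoding", and storing transition frequencies per code is likewise an assumption. All function names are hypothetical.

```python
def cooccurrence_prior(labels):
    """P[i][j] = P(AU j active | AU i active), from per-frame 0/1 labels."""
    n = len(labels[0])
    count_i = [sum(frame[i] for frame in labels) for i in range(n)]
    prior = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            both = sum(frame[i] * frame[j] for frame in labels)
            prior[i][j] = both / count_i[i] if count_i[i] else 0.0
    return prior

def transition_code(prev_i, prev_j, cur_i, cur_j):
    """Assumed encoding: pack the pair's states at t-1 and t into 4 bits (0..15)."""
    return (prev_i << 3) | (prev_j << 2) | (cur_i << 1) | cur_j

def transition_prior(labels):
    """T[i][j][c] = empirical frequency of transition code c for AU pair (i, j)."""
    n = len(labels[0])
    steps = len(labels) - 1
    if steps < 1:
        raise ValueError("need at least two consecutive frames")
    prior = [[[0.0] * 16 for _ in range(n)] for _ in range(n)]
    for prev, cur in zip(labels, labels[1:]):
        for i in range(n):
            for j in range(n):
                code = transition_code(prev[i], prev[j], cur[i], cur[j])
                prior[i][j][code] += 1.0 / steps
    return prior

seq = [[1, 1], [1, 0], [0, 0], [1, 1]]  # 4 consecutive frames, 2 AUs
co = cooccurrence_prior(seq)
tr = transition_prior(seq)
print(co[0][1], co[1][0])
print([c for c, freq in enumerate(tr[0][1]) if freq > 0])  # [3, 8, 14]
```

During fine-tuning, the same statistics would be computed from binarized predictions and compared against these tables; that comparison requires a differentiable framework and is not shown here.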
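The total fine-tuning objective is a weighted sum:
$$\mathcal{L} = \lambda_{\text{cls}} \mathcal{L}_{\text{cls}} + \lambda_{\text{intra}} \mathcal{L}_{\text{intra}} + \lambda_{\text{inter}} \mathcal{L}_{\text{inter}}$$
where the coefficients $\lambda_{\text{cls}}$, $\lambda_{\text{intra}}$, and $\lambda_{\text{inter}}$ weight the main classification loss, the intra-frame loss, and the inter-frame loss, respectively.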
5. Experimental Setup and Empirical Results
AU-vMAE is pretrained on a large-scale corpus of face-only videos drawn from VoxCeleb2, CelebV-HQ, FaceForensics, VFHQ, and MEAD, using Adam and a high mask ratio. Fine-tuning and evaluation use the BP4D (12 AUs) and DISFA (8 AUs) benchmarks, with person-exclusive cross-validation and standard data augmentations. Detection performance is measured primarily by average and per-AU F1 score.
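For reference, per-AU and average F1 can be computed from binarized predictions as in the sketch below. It assumes frame-level 0/1 predictions and labels, and returns 0.0 for an AU with no positives and no positive predictions; the function name and toy data are hypothetical.

```python
def f1_per_au(preds, targets):
    """Return one F1 score per AU from frame-level 0/1 predictions and labels."""
    scores = []
    for i in range(len(targets[0])):
        tp = sum(1 for p, t in zip(preds, targets) if p[i] == 1 and t[i] == 1)
        fp = sum(1 for p, t in zip(preds, targets) if p[i] == 1 and t[i] == 0)
        fn = sum(1 for p, t in zip(preds, targets) if p[i] == 0 and t[i] == 1)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return scores

targets = [[1, 0], [1, 1], [0, 1], [0, 0]]
preds = [[1, 0], [0, 1], [0, 1], [1, 0]]
scores = f1_per_au(preds, targets)
average_f1 = sum(scores) / len(scores)
print(scores, average_f1)  # [0.5, 1.0] 0.75
```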
Summary of main results:
| Dataset | AU-vMAE avg F1 | Best prior avg F1 | Prior Method | Improvement |
|---|---|---|---|---|
| BP4D (12 AUs) | 67.6% | 65.5% | ME-GraphAU | +2.1 pp |
| DISFA (8 AUs) | 69.6% | 65.8% | CaF-Net | +3.8 pp |
Per-AU improvements are most prominent for AU 1, AU 15, and AU 17 on BP4D, and for AU 2 and AU 26 on DISFA.
6. Ablation Studies and Component Impact
Ablation studies systematically validate each component:
- Multi-level input: F1 declines from video-level to frame-level input, and declines further with patch-level input.
- Knowledge priors: The intra-frame prior and the inter-frame prior each improve F1, as does their combination.
- Data augmentation: Improves F1 on both BP4D and DISFA.
This empirical decomposition confirms the benefit of knowledge priors and the multi-level approach.
7. Context, Limitations, and Outlook
AU-vMAE represents a migration from graph-based priors to FSM-encoded co-occurrence and transition constraints in FAU detection, leveraging large-scale self-supervised video representation learning. This design is suited for domains with heavy class imbalance and spatio-temporal dependencies. Absence of explicit GNN modules simplifies optimization and inference, while FSM priors offer interpretable constraints.
A plausible implication is that the FSM-based approach may generalize to other multi-label, temporally-ordered classification tasks beyond FAU analysis, though its reliance on large-scale face-centric video pre-training and empirically estimated statistics constrains applicability to settings where such data are available.
For full implementation and further methodological details, see AU-vMAE: Knowledge-Guide Action Units Detection via Video Masked Autoencoder (Jin et al., 2024).