
AU-vMAE: Video Autoencoder for FAU Detection

Updated 18 May 2026
  • AU-vMAE is a video-masked autoencoder framework for detecting Facial Action Units in unconstrained videos using high-ratio tube masking.
  • The framework employs a two-stage architecture that combines large-scale self-supervised pre-training with multi-level fine-tuning to enhance classification accuracy.
  • Incorporating finite-state-machine (FSM) style spatial and temporal priors, AU-vMAE achieves state-of-the-art results on benchmarks such as BP4D and DISFA.

AU-vMAE is a video-masked autoencoder framework explicitly designed for the detection of Facial Action Units (FAUs) in unconstrained video data, addressing core challenges of annotation scarcity and high inter-personal variability. It introduces a knowledge-guided approach that leverages large-scale self-supervised video pre-training, multi-level supervision, and finite-state-machine (FSM) style temporal and spatial priors, eschewing traditional Graph Neural Network models. The method is validated with state-of-the-art results on major FAU benchmarks, notably BP4D and DISFA (Jin et al., 2024).

1. Architectural Overview

AU-vMAE centers on a two-stage architecture: (1) large-scale pre-training via a video-masked autoencoder (videoMAE), and (2) task-specific fine-tuning for FAU recognition. The pre-training stage employs a Vision Transformer (ViT) encoder operating on partially observed video sequences, where "tube" masking occludes contiguous spatio-temporal regions at high ratios (up to 90%). The lightweight Transformer decoder reconstructs masked frames from the encoder's learned latent representation.

Input video clips are divided into a temporal sequence of spatial patch tubes, with temporal down-sampling applied as required by the target fine-tuning mode. The decoder reprojects the encoder's output into the original spatio-temporal grid, emphasizing both spatial detail and temporal continuity restoration.

2. Masking and Reconstruction Strategy

AU-vMAE uses a tube-based masking regime that randomly removes entire spatio-temporal tubes at a high ratio, which forces the encoder to model both short- and long-range spatio-temporal structure. The L₂ reconstruction loss is computed over masked tokens only:

$$\mathcal{L}_{\rm recon} = \frac{1}{N}\sum_{i=1}^{N}\;\frac{1}{|\omega_i|}\sum_{t\in \omega_i} \bigl\|I_i(t)-\hat I_i(t)\bigr\|_2^2$$

where $\omega_i$ is the set of masked patch indices in video $i$.
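As a rough illustration of the masking and loss described above, the following pure-Python sketch draws a tube mask and evaluates the masked-only reconstruction error. The function names, the nested-list video layout (`video[frame][patch]` holding a list of pixel values), and the rounding of the masked-patch count are assumptions made for this example, not details taken from the paper.

```python
import random


def tube_mask(num_patches, mask_ratio, seed=None):
    """Choose spatial patch positions to hide.

    The same positions are hidden in every frame, so each chosen
    position forms a spatio-temporal "tube".
    """
    rng = random.Random(seed)
    num_masked = int(round(num_patches * mask_ratio))
    return set(rng.sample(range(num_patches), num_masked))


def masked_recon_loss(videos, recons, masks):
    """Mean over videos of the mean squared L2 error on masked tokens only.

    videos, recons: list of videos, each indexed as video[frame][patch]
                    and holding a list of pixel values per patch.
    masks:          one set of masked patch positions per video.
    """
    total = 0.0
    for video, recon, masked in zip(videos, recons, masks):
        if not masked:
            raise ValueError("each video needs at least one masked patch")
        err, count = 0.0, 0
        for frame, frame_hat in zip(video, recon):
            for p in masked:
                # Squared L2 distance between the true and reconstructed patch.
                err += sum((a - b) ** 2 for a, b in zip(frame[p], frame_hat[p]))
                count += 1
        total += err / count
    return total / len(videos)
```

Because unmasked patches never enter the sum, reconstruction quality on visible tokens does not affect the loss, which matches the masked-only formulation above.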

Temporal down-sampling is tailored to the subsequent FAU detection setting: no down-sampling for video- and patch-level tasks, and $4\times$ down-sampling for frame-level evaluation.

3. Multi-Label and Multi-Level FAU Classification

Following videoMAE pre-training, AU-vMAE uses a lightweight linear classification head per frame for simultaneous multi-label prediction over $N$ action units. Three granularities of input are used during fine-tuning:

  1. Video-level: Full-length input (no masking, no down-sampling).
  2. Frame-level: Sequence down-sampled by $4$ in time (no masking).
  3. Patch-level: Full temporal resolution, with 50% random tube-masking.
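The three input granularities listed above can be sketched as a single selection routine. This is only an illustrative reading of the list: the function name `fine_tuning_input`, the string labels for each level, and the choice to return frame indices plus a set of masked patch positions are all hypothetical.

```python
import random


def fine_tuning_input(num_frames, num_patches, level, seed=0):
    """Return (kept frame indices, masked patch positions) for one input level."""
    frames = list(range(num_frames))
    masked = set()
    if level == "video":
        # Full-length input: no masking, no down-sampling.
        pass
    elif level == "frame":
        # Keep every fourth frame: 4x temporal down-sampling, no masking.
        frames = frames[::4]
    elif level == "patch":
        # Full temporal resolution with 50% of patch positions tube-masked.
        rng = random.Random(seed)
        masked = set(rng.sample(range(num_patches), num_patches // 2))
    else:
        raise ValueError(f"unknown level: {level!r}")
    return frames, masked
```

For a 16-frame clip with 196 patches per frame, the frame-level mode keeps frames 0, 4, 8, and 12, while the patch-level mode keeps all 16 frames and hides 98 patch positions.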

To address class imbalance inherent in FAU datasets, class-balance weighting is integrated into the binary cross-entropy classification loss:

$$\mathcal{L}_{\rm cls} = -\sum_{i=1}^N w_i\bigl[y_i\log p_i + (1-y_i)\log(1-p_i)\bigr]$$

where $w_i = \frac{N(1/r_i)}{\sum_j (1/r_j)}$ weights AU $i$ by the inverse of its occurrence frequency $r_i$.
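A small worked sketch of the weighting and loss above, in plain Python. The helper names and the probability clamp `eps` (added only to avoid `log(0)`) are assumptions of this example.

```python
import math


def class_balance_weights(rates):
    """w_i = N * (1 / r_i) / sum_j (1 / r_j), with r_i the occurrence rate of AU i."""
    inverse = [1.0 / r for r in rates]
    n = len(rates)
    norm = sum(inverse)
    return [n * v / norm for v in inverse]


def weighted_bce(probs, labels, weights, eps=1e-7):
    """Class-balanced binary cross-entropy summed over the AUs of one frame."""
    loss = 0.0
    for p, y, w in zip(probs, labels, weights):
        p = min(max(p, eps), 1.0 - eps)  # clamp so the logarithms stay finite
        loss -= w * (y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return loss
```

With occurrence rates of 0.5, 0.25, and 0.25, the weights come out as 0.6, 1.2, and 1.2: rarer AUs receive larger weights, and the weights always sum to $N$.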

4. Incorporation of Spatio-Temporal Prior Knowledge

A distinguishing feature of AU-vMAE is the use of knowledge-guided FSM priors over AU pairs, providing inductive bias for both intra-frame (spatial) and inter-frame (temporal) consistency, but without using GNNs:

  • Intra-frame co-occurrence:

The conditional probability that a pair of AUs co-occur in a frame, estimated as

[equation missing in source]

The predicted counterpart is updated per batch from binarized classifier outputs, with straight-through gradient estimation. The corresponding loss encourages the predicted co-occurrence statistics to match the empirical priors:

[equation missing in source]

  • Inter-frame transition:

The model encodes the empirical transition probabilities for each AU pair across consecutive frames as a tensor covering all combinations of AU-pair transitions. For the predicted probabilities at each time step:

[equation missing in source]

where the transition term is constructed via a bitwise encoding scheme for transitions.
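To make the two kinds of prior more concrete, here is a minimal sketch of how such statistics could be estimated from binary frame labels. Everything in it is an assumption for illustration: the conditional-frequency estimator, the 2-bit encoding of an AU pair's joint state, the row normalization of transition counts, and all function names. It is not presented as the paper's exact formulation.

```python
def cooccurrence_prior(labels):
    """Estimate P[i][j] = P(AU_j active | AU_i active) from frame labels.

    labels: list of frames, each a list of 0/1 integers (one per AU).
    """
    n = len(labels[0])
    active = [0] * n
    joint = [[0] * n for _ in range(n)]
    for frame in labels:
        for i in range(n):
            if frame[i]:
                active[i] += 1
                for j in range(n):
                    if frame[j]:
                        joint[i][j] += 1
    return [
        [joint[i][j] / active[i] if active[i] else 0.0 for j in range(n)]
        for i in range(n)
    ]


def pair_state(a_i, a_j):
    """Bitwise-encode the joint on/off state of an AU pair as an integer in 0..3."""
    return (a_i << 1) | a_j


def transition_prior(labels, i, j):
    """4x4 row-normalized transition matrix between pair states of AUs i and j."""
    counts = [[0] * 4 for _ in range(4)]
    for prev, cur in zip(labels, labels[1:]):
        counts[pair_state(prev[i], prev[j])][pair_state(cur[i], cur[j])] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix
```

Under this encoding, each AU pair has four joint states and therefore sixteen possible transitions between consecutive frames, which is one way to read "all combinations of AU pair transitions".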

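The total fine-tuning objective is a weighted sum:

$$\mathcal{L} = \lambda_{\rm cls}\,\mathcal{L}_{\rm cls} + \lambda_{\rm intra}\,\mathcal{L}_{\rm intra} + \lambda_{\rm inter}\,\mathcal{L}_{\rm inter}$$

where $\mathcal{L}_{\rm intra}$ and $\mathcal{L}_{\rm inter}$ denote the intra-frame and inter-frame prior losses, and $\lambda_{\rm cls}$, $\lambda_{\rm intra}$, and $\lambda_{\rm inter}$ are the coefficients for the main, intra-frame, and inter-frame losses, respectively.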

5. Experimental Setup and Empirical Results

AU-vMAE is pretrained on a million-scale corpus of face-only videos drawn from VoxCeleb2, CelebV-HQ, FaceForensics, VFHQ, and MEAD, using Adam with a high mask ratio. Fine-tuning and evaluation use the BP4D (12 AUs) and DISFA (8 AUs) benchmarks, with person-exclusive cross-validation and standard data augmentations. Classification performance is measured primarily by overall and per-AU F1 score.
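For reference, per-AU and average F1 can be computed from binarized predictions as sketched below. The function names and the convention of scoring an AU with no positives and no predictions as 0.0 are assumptions of this example, not details from the paper's evaluation protocol.

```python
def f1_per_au(y_true, y_pred):
    """Return one F1 score per AU.

    y_true, y_pred: lists of frames, each a list of 0/1 labels (one per AU).
    """
    num_aus = len(y_true[0])
    scores = []
    for k in range(num_aus):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[k] == 1 and p[k] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[k] == 0 and p[k] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[k] == 1 and p[k] == 0)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined here as 0.0 when the AU never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return scores


def average_f1(y_true, y_pred):
    """Unweighted mean of the per-AU F1 scores."""
    scores = f1_per_au(y_true, y_pred)
    return sum(scores) / len(scores)
```

The unweighted mean treats every AU equally regardless of how often it occurs, which is why per-AU scores are usually reported alongside the average.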

Summary of main results:

| Dataset | AU-vMAE avg F1 | Best prior avg F1 | Prior method | Improvement |
|---|---|---|---|---|
| BP4D (12 AUs) | 67.6% | 65.5% | ME-AU [ME-graph] | +2.1 percentage points |
| DISFA (8 AUs) | 69.6% | 65.8% | CaF-Net | +3.8 percentage points |

Per-AU improvements are most prominent for AU 1, AU 15, and AU 17 on BP4D, and for AU 2 and AU 26 on DISFA.

6. Ablation Studies and Component Impact

Ablation studies systematically validate each component:

  • Multi-level input: F1 is highest with video-level input, lower with frame-level input, and lowest with patch-level input.
  • Knowledge priors: The intra-frame prior, the inter-frame prior, and their combination each add F1 over the baseline.
  • Data augmentation: Contributes additional F1 on both BP4D and DISFA.

This empirical decomposition confirms the benefit of knowledge priors and the multi-level approach.

7. Context, Limitations, and Outlook

AU-vMAE represents a migration from graph-based priors to FSM-encoded co-occurrence and transition constraints in FAU detection, leveraging large-scale self-supervised video representation learning. This design is suited for domains with heavy class imbalance and spatio-temporal dependencies. Absence of explicit GNN modules simplifies optimization and inference, while FSM priors offer interpretable constraints.

A plausible implication is that the FSM-based approach may generalize to other multi-label, temporally-ordered classification tasks beyond FAU analysis, though its reliance on large-scale face-centric video pre-training and empirically estimated statistics constrains applicability to settings where such data are available.

For full implementation and further methodological details, see AU-vMAE: Knowledge-Guide Action Units Detection via Video Masked Autoencoder (Jin et al., 2024).
