AU-vMAE: Video Autoencoder for FAU Detection

Updated 18 May 2026

AU-vMAE is a video-masked autoencoder framework for detecting Facial Action Units in unconstrained videos using high-ratio tube masking.
The framework employs a two-stage architecture that combines large-scale self-supervised pre-training with multi-level fine-tuning to enhance classification accuracy.
Incorporating finite-state-machine (FSM) style spatial and temporal priors, AU-vMAE reports state-of-the-art results on the BP4D and DISFA benchmarks.

AU-vMAE is a video-masked autoencoder framework designed for detecting Facial Action Units (FAUs) in unconstrained video, where annotations are scarce and inter-personal variability is high. It takes a knowledge-guided approach that combines large-scale self-supervised video pre-training, multi-level supervision, and finite-state-machine (FSM) style temporal and spatial priors, in place of the Graph Neural Network (GNN) models used in earlier work. It reports state-of-the-art results on major FAU benchmarks, notably BP4D and DISFA (Jin et al., 2024).

1. Architectural Overview

AU-vMAE centers on a two-stage architecture: (1) large-scale pre-training via a video masked autoencoder (VideoMAE), and (2) task-specific fine-tuning for FAU recognition. The pre-training stage employs a Vision Transformer (ViT) encoder operating on partially observed video sequences, where "tube" masking occludes contiguous spatio-temporal regions at high ratios (up to 90%). A lightweight Transformer decoder reconstructs the masked regions from the encoder's latent representation.

Input video clips are divided into a temporal sequence of spatial patch tubes, with temporal down-sampling applied as required by the target fine-tuning mode. The decoder reprojects the encoder's output into the original spatio-temporal grid, emphasizing both spatial detail and temporal continuity restoration.
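As a rough illustration of tube masking, the sketch below hides the same randomly chosen spatial patches in every frame, so each hidden patch extends through time as a tube. The function name `tube_mask`, the 14 x 14 patch grid, and the 8-frame clip are assumptions made for the example and are not taken from the source.

```python
import random

def tube_mask(num_frames, grid_h, grid_w, mask_ratio=0.9, seed=None):
    """Return mask[t][p], True where patch p of frame t is hidden.

    Hypothetical helper: one spatial pattern is sampled and repeated
    over all frames, which is what makes the masked regions "tubes".
    """
    rng = random.Random(seed)
    num_patches = grid_h * grid_w
    num_masked = int(round(mask_ratio * num_patches))
    masked = set(rng.sample(range(num_patches), num_masked))
    frame_mask = [p in masked for p in range(num_patches)]
    # Copy the same spatial pattern for every time step.
    return [list(frame_mask) for _ in range(num_frames)]

mask = tube_mask(num_frames=8, grid_h=14, grid_w=14, mask_ratio=0.9, seed=0)
hidden_per_frame = sum(mask[0])
visible_per_frame = len(mask[0]) - hidden_per_frame
```

With a 90% ratio on 196 patches, only 20 patches per frame remain visible to the encoder, which is why the decoder can stay lightweight.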

2. Masking and Reconstruction Strategy

AU-vMAE uses a tube-based masking regime that randomly removes entire spatio-temporal tubes. Masking at a high ratio forces the encoder to learn both short- and long-range spatio-temporal structure. The L₂ reconstruction loss is computed over masked tokens only:

$\mathcal{L}_{\rm recon} = \frac{1}{N}\sum_{i=1}^{N}\;\frac{1}{|\omega_i|}\sum_{t\in \omega_i} \bigl\|I_i(t)-\hat I_i(t)\bigr\|_2^2$

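where $N$ is the number of videos in the batch, $\omega_i$ is the set of masked patch indices in video $i$, and $I_i(t)$ and $\hat I_i(t)$ are the original and reconstructed patches at index $t$.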
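The reconstruction loss can be checked by hand on a toy input. The sketch below is a plain-Python reading of the formula, assuming each patch is a flat list of pixel values; the function name `masked_recon_loss` and the toy numbers are illustrative only.

```python
def masked_recon_loss(originals, reconstructions, masked_indices):
    """Mean over videos of the per-video mean squared L2 error on masked patches.

    originals[i][t] and reconstructions[i][t] are flat lists of pixel
    values for patch t of video i; masked_indices[i] plays the role of omega_i.
    """
    total = 0.0
    for video, recon, omega in zip(originals, reconstructions, masked_indices):
        per_video = 0.0
        for t in omega:
            # Squared L2 distance between the true and reconstructed patch.
            per_video += sum((a - b) ** 2 for a, b in zip(video[t], recon[t]))
        total += per_video / len(omega)
    return total / len(originals)

# One video with three 2-pixel patches; patches 0 and 2 are masked.
originals = [[[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]]
reconstructions = [[[1.0, 0.0], [9.0, 9.0], [4.0, 6.0]]]
masked_indices = [[0, 2]]
loss = masked_recon_loss(originals, reconstructions, masked_indices)
```

Patch 1 is badly reconstructed but visible, so it does not contribute; the loss is (4 + 1) / 2 = 2.5.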

Temporal down-sampling is matched to the subsequent FAU detection setting: no down-sampling for video-level and patch-level inputs, and $4 \times$ down-sampling for frame-level evaluation.

3. Multi-Label and Multi-Level FAU Classification

Following VideoMAE pre-training, AU-vMAE attaches a lightweight linear classification head per frame for simultaneous multi-label prediction over $N$ action units (in this section $N$ counts AUs). Three granularities of input are used during fine-tuning:

Video-level: Full-length input (no masking, no down-sampling).
Frame-level: Sequence down-sampled by $4$ in time (no masking).
Patch-level: Full temporal resolution, with $50\%$ random tube-masking.
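The three input modes above can be sketched as a single preparation function. This is a minimal illustration, assuming frames arrive as a Python list and that a clip has 196 spatial patch positions; `make_inputs` and its parameters are hypothetical names, not part of the source.

```python
import random

def make_inputs(frames, level, num_patches=196, seed=None):
    """Return (frames_used, hidden_patch_positions) for one fine-tuning mode."""
    if level == "video":
        # Full-length clip, nothing hidden.
        return list(frames), set()
    if level == "frame":
        # Keep every 4th frame, nothing hidden.
        return list(frames[::4]), set()
    if level == "patch":
        # Full temporal resolution; hide half of the spatial positions,
        # the same ones in every frame (tube masking).
        rng = random.Random(seed)
        hidden = set(rng.sample(range(num_patches), num_patches // 2))
        return list(frames), hidden
    raise ValueError("level must be 'video', 'frame', or 'patch'")

frames = list(range(16))  # stand-in for 16 decoded frames
video_frames, video_hidden = make_inputs(frames, "video")
frame_frames, frame_hidden = make_inputs(frames, "frame")
patch_frames, patch_hidden = make_inputs(frames, "patch", seed=0)
```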

To address class imbalance inherent in FAU datasets, class-balance weighting is integrated into the binary cross-entropy classification loss:

$\mathcal{L}_{\rm cls} = -\sum_{i=1}^N w_i\bigl[y_i\log p_i + (1-y_i)\log(1-p_i)\bigr]$

where $w_i = \frac{N(1/r_i)}{\sum_j (1/r_j)}$ is built from the inverse of the occurrence rate $r_i$ of AU $i$, so that rarer AUs receive larger weights and the weights sum to $N$.
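A short worked example makes the weighting concrete. The occurrence rates below are invented for illustration, and the helper names `class_balance_weights` and `weighted_bce` are assumptions; the arithmetic follows the two formulas above.

```python
import math

def class_balance_weights(rates):
    """w_i = N * (1 / r_i) / sum_j (1 / r_j) for occurrence rates r_i."""
    inverse = [1.0 / r for r in rates]
    n = len(rates)
    return [n * v / sum(inverse) for v in inverse]

def weighted_bce(probs, labels, weights, eps=1e-7):
    """Class-weighted binary cross-entropy summed over AUs for one frame."""
    loss = 0.0
    for p, y, w in zip(probs, labels, weights):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        loss -= w * (y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return loss

rates = [0.5, 0.25, 0.05]  # illustrative AU occurrence rates
weights = class_balance_weights(rates)
loss = weighted_bce([0.5, 0.5, 0.5], [1, 0, 1], weights)
```

Here the inverse rates are 2, 4, and 20, giving weights 6/26, 12/26, and 60/26. The rarest AU dominates the loss, and the weights still sum to 3.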

4. Incorporation of Spatio-Temporal Prior Knowledge

A distinguishing feature of AU-vMAE is the use of knowledge-guided FSM priors over AU pairs, providing inductive bias for both intra-frame (spatial) and inter-frame (temporal) consistency, but without using GNNs:

Intra-frame co-occurrence:

The prior is the conditional probability that a pair of AUs co-occur in a frame, estimated empirically as

[…]

The predicted counterpart is updated per batch from binarized classifier outputs, with straight-through gradient estimation. The corresponding loss encourages predicted co-occurrence statistics to match the empirical prior:

[…]
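One common way to estimate such a co-occurrence prior is to count, over labelled frames, how often AU j is active when AU i is active. The sketch below does that and then compares the same statistic computed from binarized predictions. It is an assumption-laden illustration: the estimator, the mean-absolute-difference gap, and the names `cooccurrence_prior` and `prior_gap` are not taken from the source, and the straight-through gradient step is not modelled in plain Python.

```python
def cooccurrence_prior(labels):
    """P[i][j] = fraction of frames with AU i active in which AU j is also active."""
    n = len(labels[0])
    count_i = [0] * n
    count_ij = [[0] * n for _ in range(n)]
    for frame in labels:
        for i in range(n):
            if frame[i]:
                count_i[i] += 1
                for j in range(n):
                    if frame[j]:
                        count_ij[i][j] += 1
    return [[count_ij[i][j] / count_i[i] if count_i[i] else 0.0
             for j in range(n)] for i in range(n)]

def prior_gap(pred_probs, prior, threshold=0.5):
    """Mean absolute gap between predicted and empirical co-occurrence (assumed form)."""
    binarized = [[1 if p >= threshold else 0 for p in frame] for frame in pred_probs]
    predicted = cooccurrence_prior(binarized)
    n = len(prior)
    return sum(abs(predicted[i][j] - prior[i][j])
               for i in range(n) for j in range(n)) / (n * n)

labels = [[1, 1, 0], [1, 0, 0], [0, 1, 1], [1, 1, 0]]  # toy ground truth, 3 AUs
prior = cooccurrence_prior(labels)
gap = prior_gap([[float(v) for v in frame] for frame in labels], prior)
```

In the toy data AU 0 is active in three frames and AU 1 joins it in two of them, so `prior[0][1]` is 2/3. Predictions that reproduce the labels exactly give a gap of zero.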

Inter-frame transition:

The model encodes the empirical transition probabilities for each AU pair across consecutive frames as a tensor covering all combinations of AU-pair transitions. For the predicted probabilities at each time step:

[…]

where one term is constructed via a bitwise encoding scheme for transitions.
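To make the idea of a bitwise transition encoding tangible, the sketch below packs the joint on/off state of an AU pair into 2 bits, and a transition between consecutive frames into 4 bits (16 possible codes). Transition probabilities are then counted per source state. The specific bit layout and the names `pair_state`, `transition_code`, and `transition_prior` are hypothetical; the source does not spell out its encoding.

```python
def pair_state(frame, i, j):
    """2-bit joint state of AUs i and j in one frame: 0b00 .. 0b11."""
    return (frame[i] << 1) | frame[j]

def transition_code(prev, cur, i, j):
    """4-bit code: high 2 bits = state at t, low 2 bits = state at t+1."""
    return (pair_state(prev, i, j) << 2) | pair_state(cur, i, j)

def transition_prior(labels, i, j):
    """Empirical P(next state | current state) for one AU pair, indexed by code."""
    counts = [0] * 16
    for prev, cur in zip(labels, labels[1:]):
        counts[transition_code(prev, cur, i, j)] += 1
    prior = [0.0] * 16
    for src in range(4):
        row_total = sum(counts[src * 4:(src + 1) * 4])
        for dst in range(4):
            if row_total:
                code = (src << 2) | dst
                prior[code] = counts[code] / row_total
    return prior

# Toy label sequence for two AUs over five frames.
labels = [[0, 0], [1, 0], [1, 1], [1, 1], [0, 1]]
prior = transition_prior(labels, 0, 1)
```

The pair visits states 0, 2, 3, 3, 1 in order. State 3 is left twice, once staying at 3 (code 15) and once moving to 1 (code 13), so both transitions get probability 0.5.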

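The total fine-tuning objective is a weighted sum of the main classification loss and the two prior losses:

$\mathcal{L} = \lambda_{\rm cls}\,\mathcal{L}_{\rm cls} + \lambda_{\rm intra}\,\mathcal{L}_{\rm intra} + \lambda_{\rm inter}\,\mathcal{L}_{\rm inter}$

where $\mathcal{L}_{\rm intra}$ and $\mathcal{L}_{\rm inter}$ denote the intra-frame and inter-frame prior losses, and $\lambda_{\rm cls}$, $\lambda_{\rm intra}$, and $\lambda_{\rm inter}$ are the coefficients for the main, intra-frame, and inter-frame losses, respectively, with default values […].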

5. Experimental Setup and Empirical Results

AU-vMAE is pretrained on approximately […] million face-only videos from VoxCeleb2, CelebV-HQ, FaceForensics, VFHQ, and MEAD, using Adam with a high mask ratio ([…]) for […] epochs. Fine-tuning and evaluation use the BP4D (12 AUs, […]K frames, […] videos) and DISFA (8 AUs, […]K frames, […] videos) benchmarks, with person-exclusive cross-validation and standard data augmentations. Detection performance is measured primarily by average and per-AU F1 score.
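For reference, per-AU and average F1 can be computed from binary labels and predictions as in the sketch below. The toy data and the helper names `f1_score` and `per_au_f1` are illustrative; the source does not give its evaluation code.

```python
def f1_score(y_true, y_pred):
    """F1 = 2*TP / (2*TP + FP + FN) for one AU; 0.0 when undefined."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def per_au_f1(labels, preds):
    """Return (list of per-AU F1 scores, their unweighted average)."""
    n = len(labels[0])
    scores = [f1_score([frame[k] for frame in labels],
                       [frame[k] for frame in preds]) for k in range(n)]
    return scores, sum(scores) / n

# Four frames, two AUs.
labels = [[1, 0], [1, 1], [0, 1], [0, 0]]
preds = [[1, 0], [0, 1], [0, 1], [1, 0]]
scores, average = per_au_f1(labels, preds)
```

The first AU has one true positive, one false positive, and one false negative (F1 = 0.5); the second is predicted perfectly (F1 = 1.0), so the average is 0.75.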

Summary of main results:

Dataset	AU-vMAE avg F1	Best prior avg F1	Prior Method	Improvement
BP4D (12 AUs)	67.6%	65.5%	ME-AU[ME-graph]	+2.1 percentage pts
DISFA (8 AUs)	69.6%	65.8%	CaF-Net	+3.8 percentage pts

Per-AU improvements are prominent, for example on AU 1, AU 15, and AU 17 for BP4D and on AU 2 and AU 26 for DISFA.

6. Ablation Studies and Component Impact

Ablation studies systematically validate each component:

Multi-level input: F1 declines from video-level to frame-level and further with patch-level.
Knowledge priors: Gains are reported for the intra-frame prior alone, the inter-frame prior alone, and both combined.
Data augmentation: Contributes to F1 on both BP4D and DISFA.

This empirical decomposition confirms the benefit of knowledge priors and the multi-level approach.

7. Context, Limitations, and Outlook

AU-vMAE represents a migration from graph-based priors to FSM-encoded co-occurrence and transition constraints in FAU detection, leveraging large-scale self-supervised video representation learning. This design is suited for domains with heavy class imbalance and spatio-temporal dependencies. Absence of explicit GNN modules simplifies optimization and inference, while FSM priors offer interpretable constraints.

A plausible implication is that the FSM-based approach may generalize to other multi-label, temporally-ordered classification tasks beyond FAU analysis, though its reliance on large-scale face-centric video pre-training and empirically estimated statistics constrains applicability to settings where such data are available.

For full implementation and further methodological details, see AU-vMAE: Knowledge-Guide Action Units Detection via Video Masked Autoencoder (Jin et al., 2024).


References (1)

AU-vMAE: Knowledge-Guide Action Units Detection via Video Masked Autoencoder (2024)
