Speech-Conditional Facial Motion Infilling
- Speech-conditional facial motion infilling is defined as reconstructing missing facial motion using coordinated speech cues and visual context to ensure temporal and semantic continuity.
- Methods employ Diffusion Transformers, vector-quantized pipelines, and landmark-guided GANs with masked reconstruction and temporal regularization to achieve realistic synthesis.
- Applications include talking face editing, dubbing, and telepresence systems, delivering high-fidelity, identity-consistent facial motion aligned with corresponding audio.
Speech-conditional facial motion infilling is the task of reconstructing plausible facial motion sequences in regions where the target motion is missing, ambiguous, or intentionally edited, with the requirement that the result is temporally, visually, and semantically aligned both to the surrounding facial motion context and to the corresponding speech. This paradigm lies at the intersection of talking face generation, inpainting, and video editing, unifying these traditionally separate problems into a common conditional sequence infilling problem. Recent advances integrate speech cues, context-aware modeling, and self-supervised infilling frameworks to obtain high-fidelity, speech-synchronized, and identity-consistent facial motion reconstructions, with applications in talking face editing, controllable face synthesis, dubbing, and telepresence systems.
1. Problem Formulation
Speech-conditional facial motion infilling is formally posed as follows. Given a sequence of facial motion representations $X = (x_1, \dots, x_T)$ (where each $x_t$ encodes expression, pose, or motion latents) and a corresponding speech feature sequence $A = (a_1, \dots, a_T)$, the task is to reconstruct the missing or masked intervals of $X$ such that the completed sequence recovers the original motion in the masked regions and transitions seamlessly into the unmasked context. A binary mask $M \in \{0,1\}^T$ (with $M_t = 1$ designating missing entries) defines which portions of $X$ are to be infilled. The reconstruction function $f_\theta$ maps the observed (unmasked) motion and the speech to a plausible full sequence, $\hat{X} = f_\theta\big((1 - M) \odot X,\, A\big)$. This paradigm generalizes both conventional editing, where $M$ covers a localized span, and free-form generation, where $M$ spans a full sequence or a suffix and requires synthesis from scratch (Sung-Bin et al., 16 Dec 2025).
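A minimal sketch of this formulation, assuming PyTorch tensors and a placeholder infilling network (the name `infill_net`, the mask convention, and the tensor shapes are illustrative, not taken from any cited system):

```python
import torch

def infill(motion, speech, mask, infill_net):
    """Speech-conditional infilling of masked motion frames.

    motion: (B, T, D) facial motion features (ground truth or context)
    speech: (B, T, Da) aligned speech features
    mask:   (B, T) binary, 1 = frame to be infilled, 0 = observed context
    infill_net: any model predicting a full (B, T, D) motion sequence
                from the observed context, the speech condition, and the mask
    """
    m = mask.unsqueeze(-1).float()               # (B, T, 1) for broadcasting
    observed = (1.0 - m) * motion                # zero out the frames to infill
    predicted = infill_net(observed, speech, m)  # model sees context + speech + mask
    # Keep observed frames exactly; take the prediction only where masked.
    return (1.0 - m) * motion + m * predicted
```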
2. Model Architectures and Training Objectives
Several architectural families implement speech-conditional motion infilling:
- Diffusion Transformer Infilling: The FacEDiT model employs a 22-layer Diffusion Transformer backbone with cross-modal attention between facial motion latent tokens and speech features (Sung-Bin et al., 16 Dec 2025). Each layer integrates:
- Biased local self-attention over motion tokens;
- Multi-head cross-attention to speech tokens;
- Feed-forward and normalization modules.
- The model is trained with the conditional flow matching (CFM) objective $\mathcal{L}_{\mathrm{CFM}} = \mathbb{E}_{t,\,x_0,\,x_1}\big[\,\lVert v_\theta(x_t, t, c) - (x_1 - x_0) \rVert^2\,\big]$, where $x_t = (1-t)\,x_0 + t\,x_1$, $x_0 \sim \mathcal{N}(0, I)$ is Gaussian noise, $x_1$ is the target motion, and $c$ denotes the speech and context conditioning. The objective regresses a velocity field that transports noise to data along linear interpolation paths, for which the target velocity takes the closed form $x_1 - x_0$ (a training-step sketch follows this list).
- Vector-Quantized Coarse-to-Fine Token Infilling: VQTalker uses an encoder–quantizer–decoder design, quantizing motion features via Group-Residual Finite Scalar Quantization (GRFSQ), which splits channels into groups and applies multi-stage scalar quantization (Liu et al., 13 Dec 2024). Missing token indices in the motion code sequence are infilled with a BERT-style transformer conditioned on speech features and context motion tokens. Training combines a reconstruction loss, an adversarial loss on reconstruction realism, a perceptual feature loss, an audio-visual sync loss, and cross-entropy losses for token prediction at each infilling stage.
- Two-Stage Landmark-Guided GANs: Early methods factor motion infilling as landmark trajectory prediction from audio (with context window RNNs) followed by frame synthesis via a conditional GAN (Jalalifar et al., 2018). A Bi-LSTM predicts mouth landmarks from MFCC audio features; a DC-GAN generator synthesizes frames conditioned on these landmarks. Training employs MSE for landmarks and adversarial loss for images.
- Speech-to-Blendshape Regression: For 3D facial animation, networks map audio spectrograms to blendshape activations and head pose (a CNN front-end followed by an RNN and an output head), trained with MSE against ground-truth facial parameters (Pham et al., 2017). To support infilling, such regressors are extended with temporal masking, bidirectional context, and an infilling-specific reconstruction loss.
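To make the flow-matching bullet above concrete, here is a minimal sketch of one CFM training step on motion latents; `velocity_net`, the masking convention, and the choice to supervise only masked frames are illustrative assumptions, not the FacEDiT implementation:

```python
import torch

def cfm_training_step(velocity_net, motion, speech, mask):
    """One conditional flow matching step for masked motion infilling.

    motion: (B, T, D) clean target motion latents
    speech: (B, T, Da) aligned speech features (conditioning)
    mask:   (B, T, 1) binary, 1 = frames to be infilled
    """
    B = motion.shape[0]
    x1 = motion                                    # data endpoint
    x0 = torch.randn_like(x1)                      # noise endpoint
    t = torch.rand(B, 1, 1, device=motion.device)  # interpolation time in [0, 1]

    x_t = (1.0 - t) * x0 + t * x1                  # linear interpolant
    target_velocity = x1 - x0                      # closed-form target along the path

    context = (1.0 - mask) * x1                    # observed (unmasked) motion context
    pred_velocity = velocity_net(x_t, t, speech, context)

    # Supervise only the frames that are actually infilled.
    loss = (mask * (pred_velocity - target_velocity) ** 2).mean()
    return loss
```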
Key losses across frameworks include:
- Reconstruction (MSE or cross-entropy over tokens)
- Conditional adversarial loss (for realism)
- Temporal smoothness and continuity regularization
- Audio–visual/lip synchronization loss (e.g., via specialized discriminators)
- Perceptual losses (e.g., VGG feature space, identity embedding)
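A minimal sketch of the two most common terms above, masked reconstruction and temporal smoothness, assuming PyTorch tensors of shape (B, T, D); the averaging conventions are illustrative:

```python
import torch

def masked_reconstruction_loss(pred, target, mask):
    """MSE computed only over the infilled (masked) frames.

    pred, target: (B, T, D) motion sequences; mask: (B, T, 1), 1 = infilled.
    """
    per_frame = (pred - target) ** 2
    return (mask * per_frame).sum() / (mask.sum() * pred.shape[-1]).clamp(min=1.0)

def temporal_smoothness_loss(pred):
    """Penalize abrupt changes between adjacent predicted frames."""
    return ((pred[:, 1:] - pred[:, :-1]) ** 2).mean()
```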
3. Masked-Context and Self-Supervised Training Paradigms
Speech-conditional motion infilling methods leverage masked reconstruction as a core pretext:
- Random Masking: Training applies dynamic binary masks to motion tokens or latent features, randomly zeroing out contiguous spans of variable length (see the mask-sampling sketch after this list). The network is optimized to reconstruct ground-truth motion only in masked regions, while unmasked regions are preserved exactly (Sung-Bin et al., 16 Dec 2025, Liu et al., 13 Dec 2024).
- Bidirectional Temporal Conditioning: To exploit available past and future context, architectures utilize bidirectional encoders (e.g., Bi-LSTM, BERT) for infilling. Masked language modeling analogs are used, with predictors trained to inpaint tokens/events using adjacent context frames and synchronized audio.
- Unified Formulation for Editing and Generation: Both localized editing (where $M$ covers a phrase) and full-utterance generation (where $M$ covers a suffix or the whole sequence) are realized by appropriate mask selection, requiring no change to architecture or loss (Sung-Bin et al., 16 Dec 2025). This unifies talking face editing, dubbing, and generation.
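A minimal sketch of the mask sampling referenced above, also showing how editing and generation differ only in the choice of mask; span-length bounds and helper names are illustrative assumptions:

```python
import torch

def random_span_mask(T, min_len=5, max_len=50, device="cpu"):
    """Binary mask of length T with one contiguous masked span (1 = infill).

    Assumes T >= min_len; bounds are arbitrary illustrative defaults.
    """
    span = int(torch.randint(min_len, min(max_len, T) + 1, (1,)).item())
    start = int(torch.randint(0, T - span + 1, (1,)).item())
    mask = torch.zeros(T, device=device)
    mask[start:start + span] = 1.0
    return mask

# The same model covers editing and generation purely via mask choice:
def editing_mask(T, start, end, device="cpu"):
    mask = torch.zeros(T, device=device)
    mask[start:end] = 1.0            # infill only the edited phrase
    return mask

def generation_mask(T, context_frames=0, device="cpu"):
    mask = torch.ones(T, device=device)
    mask[:context_frames] = 0.0      # keep an optional observed prefix
    return mask
```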
4. Specialized Attention Mechanisms and Temporal Constraints
Ensuring temporal and spatial coherence in infilling necessitates customized inductive biases:
- Biased Local Attention: FacEDiT restricts attention windows in self- and cross-attention so that each motion token attends only to a fixed-size local temporal neighborhood of frames, promoting both boundary continuity and sharper speech-driven synchronization (Sung-Bin et al., 16 Dec 2025); an attention-bias sketch follows this list.
- Temporal Smoothness Regularization: Additional loss terms penalize sudden changes between adjacent frames, e.g. $\mathcal{L}_{\mathrm{smooth}} = \sum_{t} \lVert \hat{x}_{t+1} - \hat{x}_t \rVert^2$, applied with appropriate weighting in the total objective.
- Coarse-to-Fine Infilling: VQTalker hierarchically refines motion: coarse stages handle global structure using context and speech, while fine stages incrementally add detail, refining local dynamics and facial micro-expressions (Liu et al., 13 Dec 2024).
- Optional Modular Losses: Although early work such as (Jalalifar et al., 2018) uses only landmark MSE and adversarial image loss, modern systems often incorporate perceptual and identity-preserving losses to promote real-world fidelity and visual semantics.
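A minimal sketch of the biased local attention referenced above, realized here as an additive attention bias that blocks tokens outside a fixed temporal window; the window size and the hard (rather than soft) bias are assumptions, not FacEDiT's exact formulation:

```python
import torch

def local_attention_bias(T, window=10, device="cpu"):
    """Additive attention bias: 0 inside the +/- window band, -inf outside.

    Adding this (T, T) matrix to the attention logits before softmax makes
    each motion token attend only to its local temporal neighborhood.
    """
    idx = torch.arange(T, device=device)
    dist = (idx[None, :] - idx[:, None]).abs()   # |i - j| for all token pairs
    bias = torch.zeros(T, T, device=device)
    bias[dist > window] = float("-inf")
    return bias

# Usage with scaled dot-product attention (q, k, v of shape (B, heads, T, d)):
# attn_logits = q @ k.transpose(-2, -1) / d ** 0.5 + local_attention_bias(T)
# attn = attn_logits.softmax(dim=-1) @ v
```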
5. Representations: Blendshapes, Motion Latents, Tokens, and Landmarks
Different pipelines represent facial motion at various abstraction levels:
- 3D Blendshape Activation: Regression-targeted weights for a fixed 3D blendshape model parameterize expression and pose, allowing continuous and physically meaningful editing (Pham et al., 2017).
- Motion Tokens: VQTalker discretizes facial dynamics into group-tokenized latent codewords, enabling efficient cross-modal modeling (speech-to-tokens) and bitrate-constrained streaming at 11 kbps (Liu et al., 13 Dec 2024); a per-group quantization sketch follows this list.
- Motion Latents from Generative Models: Latent-space representations from autoencoder or diffusion pipelines, e.g., the 75-dimensional LivePortrait vectors used for facial action encoding, support end-to-end modeling of rich, high-dimensional facial semantics (Sung-Bin et al., 16 Dec 2025).
- Landmarks and Visemes: Early/landmark-based approaches regress dense mouth region landmarks from audio, converting these into rendered frames—well-suited for low-resolution or landmark-anchored synthesis but less expressive for full-face/identity details (Jalalifar et al., 2018).
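For the motion-token representation above, the following is a generic sketch of finite scalar quantization applied per channel group; it illustrates the idea behind GRFSQ but omits the multi-stage residual refinement and is not VQTalker's implementation:

```python
import torch

def fsq(z, levels=5):
    """Finite scalar quantization of one channel group (straight-through rounding).

    Each dimension of z is squashed to a bounded range, scaled to `levels`
    discrete steps, and rounded; gradients pass through the rounding.
    """
    half = (levels - 1) / 2.0
    bounded = torch.tanh(z) * half          # values in (-half, half)
    quantized = torch.round(bounded)        # integer codes per dimension
    # Straight-through estimator: forward uses quantized, backward uses bounded.
    return bounded + (quantized - bounded).detach()

def grouped_fsq(z, num_groups=4, levels=5):
    """Split channels into groups and quantize each group independently."""
    groups = z.chunk(num_groups, dim=-1)
    return torch.cat([fsq(g, levels) for g in groups], dim=-1)
```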
6. Datasets, Benchmarking, and Evaluation Metrics
The need for standard benchmarks has motivated the design of controlled editing and generation suites:
- FacEDiTBench (Sung-Bin et al., 16 Dec 2025): 250 samples spanning short (1–3 words), medium (4–6 words), and long (7–10 words) edit spans, annotated for substitution, insertion, and deletion edits and synthesized via automated dubbing with strict timestamped alignment. The benchmark enables systematic comparison across localized edit types and measures multiple aspects of infilling fidelity.
Metrics:
- Lip-sync Error: LSE-D, LSE-C for mouth–audio alignment.
- Identity Similarity: ArcFace embedding similarity (IDSIM) over edited segments.
- Boundary Continuity: Photometric difference (P_cont) and optical flow difference (M_cont) at mask boundaries.
- Realism/Distribution Match: FVD, LPIPS for perceptual realism and video dynamics.
Only unified infilling models yield jointly optimal scores across these axes (e.g., FacEDiT: LSE-D=7.135, P_cont=2.42, M_cont=0.80, IDSIM=0.966, FVD=61.93), outperforming generation-focused baselines (Sung-Bin et al., 16 Dec 2025).
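As an illustration of the boundary-continuity metrics, the sketch below measures photometric and motion discontinuities across edit boundaries; the exact definitions of P_cont and M_cont in FacEDiTBench may differ, and `flow_fn` is a placeholder for any optical-flow estimator:

```python
import numpy as np

def boundary_continuity(frames, mask, flow_fn=None):
    """Photometric / motion discontinuity across mask boundaries (illustrative).

    frames:  (T, H, W, 3) edited video, float in [0, 1]
    mask:    (T,) binary, 1 = edited frame
    flow_fn: optional callable (frame_a, frame_b) -> (H, W, 2) optical flow
    """
    # Boundary pairs: consecutive frames whose mask values differ.
    boundaries = [t for t in range(len(mask) - 1) if mask[t] != mask[t + 1]]
    photo, motion = [], []
    for t in boundaries:
        a, b = frames[t], frames[t + 1]
        photo.append(np.abs(a - b).mean())                        # photometric jump
        if flow_fn is not None:
            motion.append(np.linalg.norm(flow_fn(a, b), axis=-1).mean())
    p_cont = float(np.mean(photo)) if photo else 0.0
    m_cont = float(np.mean(motion)) if motion else 0.0
    return p_cont, m_cont
```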
7. Technical Significance and Applications
Speech-conditional facial motion infilling enables:
- High-fidelity speech-driven talking face editing and generation with seamless transitions and precise speech synchronization.
- Controllable video editing, including phrase-level dubbing, localized correction (insertion, substitution, deletion), and natural video inpainting beyond conventional spatial methods.
- Multilingual and cross-lingual avatar generation, exploiting phoneme–viseme abstraction for generalizability (Liu et al., 13 Dec 2024).
- Integration into telepresence, animation, and accessibility technologies.
Advances in conditional infilling architectures—specifically, diffusion-based Transformers with flow matching, vector-quantized hierarchical token pipelines, and self-supervised masked pretexts—represent state-of-the-art in both talking face editing and full-from-scratch speech-to-motion synthesis. These frameworks unify previously disparate modalities, supporting robust human–machine interaction and creative media applications (Sung-Bin et al., 16 Dec 2025, Liu et al., 13 Dec 2024, Pham et al., 2017, Jalalifar et al., 2018).