
Talking Head Facial Expression Manipulation

Updated 26 January 2026
  • THFEM is a set of techniques for finely controlling facial expressions, mouth shapes, head poses, and eye movements in talking head video synthesis.
  • It leverages disentangled latent representations and orthogonality constraints to maintain natural facial dynamics and robust identity preservation.
  • Advanced systems integrate AU-guided landmark prediction and multi-modal inputs (audio, text) to achieve realistic, real-time facial attribute manipulation.

Talking Head Facial Expression Manipulation (THFEM) encompasses the set of techniques, models, and frameworks designed to enable fine-grained, independently controllable modulation of facial expressions within talking head video synthesis. The core objective is the precise disentanglement and editability of expression, mouth articulation (lip-sync), head pose, eye movement, and other facial attributes, often in real time, with fidelity to the original speaker identity and naturalness of synthesized motion. THFEM systems drive digital avatars, neural video conferencing, data-driven human-computer interfaces, and virtual agents, leveraging disentangled latent representations, landmark modeling, and conditional generative architectures.

1. Principles and Rationale of Disentangled Representation

THFEM aims to decouple facial attributes for unconstrained expression editing, lip synchronization, and pose control. Early systems employed 3D Morphable Models (3DMM) to encode facial shape, expression, and pose coefficients, but failed to capture muscle-level nuances and often entangled expression with speech articulation (Sun et al., 2022). Current paradigms represent facial dynamics in orthogonal latent spaces—mouth shape, pose, eye movement, and emotional expression—each controlled independently via linear combinations of learned basis vectors or keypoint deformations (Tan et al., 19 Aug 2025, Tan et al., 2024, Jang et al., 2023). Orthogonality constraints (e.g.,

\mathcal{L}_{\rm ortho} = \sum_{*}\left\|\left(B^*\right)^{\top}B^* - I\right\|_F^2 + \sum_{*\neq\dagger}\left\|\left(B^*\right)^{\top}B^{\dagger}\right\|_F^2

) suppress “leakage” across subspaces.
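
A minimal PyTorch sketch of this penalty, assuming each basis bank $B^*$ is stored as a (feature_dim × num_bases) matrix; the bank names and dimensions here are illustrative rather than taken from the cited papers:

```python
import torch

def orthogonality_loss(banks: dict[str, torch.Tensor]) -> torch.Tensor:
    """Frobenius-norm orthogonality penalty over attribute basis banks.

    Each bank B* is a (feature_dim, num_bases) matrix.  The first term pushes
    every bank toward orthonormal columns ((B*)^T B* = I); the second term
    pushes distinct banks toward mutual orthogonality ((B*)^T B^dagger = 0).
    """
    loss = torch.zeros(())
    names = list(banks)
    for name in names:
        B = banks[name]
        gram = B.T @ B                                  # (num_bases, num_bases)
        eye = torch.eye(gram.shape[0], device=B.device)
        loss = loss + (gram - eye).pow(2).sum()         # ||B^T B - I||_F^2
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            cross = banks[a].T @ banks[b]
            loss = loss + cross.pow(2).sum()            # ||B_a^T B_b||_F^2
    return loss

# Hypothetical banks for mouth, pose, eye, and expression subspaces.
banks = {k: torch.randn(256, 16, requires_grad=True)
         for k in ("mouth", "pose", "eye", "expression")}
print(orthogonality_loss(banks))
```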

Table: Latent Spaces in Modern THFEM Systems

  • EDTalk++ (Tan et al., 19 Aug 2025): independent mouth, pose, eye, and expression spaces, with audio- and text-driven control
  • EDTalk (Tan et al., 2024): independent mouth, pose, eye, and expression spaces
  • PC-Talk (Wang et al., 18 Mar 2025): static, region- and intensity-controlled expression editing
  • AU-landmark (Chang et al., 24 Sep 2025): AU-driven landmark prediction for per-frame expression control

2. Modeling Facial Expression via Latent Subspaces and Basis Banks

Recent THFEM approaches model facial motion factors as projections onto learnable basis banks. EDTalk++ and EDTalk employ sets of basis vectors $B^*$ per attribute, learning optimal bases via self-supervised staged decoupling of mouth, pose, eye, and expression modules (Tan et al., 19 Aug 2025, Tan et al., 2024). Given an input frame $I$, a canonical encoder $E(I)$ produces a canonical feature $f^r$; a small MLP maps this to a coefficient vector $W^*$, yielding a motion feature $f^{r\to *} = B^* W^* = \sum_i w^*_i b^*_i$. The overall driving signal is built as a sum of these features for generation.
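
The projection step can be sketched in PyTorch as follows; `AttributeProjector`, the dimensions, and the attribute names are hypothetical stand-ins for the papers' modules:

```python
import torch
import torch.nn as nn

class AttributeProjector(nn.Module):
    """Project a canonical feature onto one attribute's learnable basis bank."""

    def __init__(self, feat_dim: int = 256, num_bases: int = 16):
        super().__init__()
        self.bank = nn.Parameter(torch.randn(feat_dim, num_bases))  # B*
        self.to_coeff = nn.Sequential(                              # small MLP -> W*
            nn.Linear(feat_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, num_bases))

    def forward(self, f_r: torch.Tensor) -> torch.Tensor:
        w = self.to_coeff(f_r)         # coefficients W*, shape (batch, num_bases)
        return w @ self.bank.T         # f^{r->*} = sum_i w_i b_i, shape (batch, feat_dim)

# One projector per disentangled attribute; the driving signal is their sum.
projectors = nn.ModuleDict(
    {k: AttributeProjector() for k in ("mouth", "pose", "eye", "expression")})
f_r = torch.randn(4, 256)              # canonical feature from the encoder E(I)
driving = sum(p(f_r) for p in projectors.values())
```

During training, the orthogonality penalty from Section 1 would be applied across the `bank` parameters of all projectors.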

Orthogonality regularization, implemented as Frobenius norm penalties or channel-wise inner products, ensures clean separation of attribute spaces. This enables precise, blended, and interpolated control over emotional affect without disrupting mouth or pose performance.

Audio- and text-driven THFEM is enabled by additional modules (e.g., Audio-to-Motion, Text-to-Expression) that learn to predict expression coefficients directly from HuBERT, wav2vec, or CLIP (for free-form text) embeddings, with modality masking for robustness (Tan et al., 19 Aug 2025, Ma et al., 2023).
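
One simple way to realize modality masking is modality dropout during training, randomly zeroing a modality's embedding so the predictor tolerates missing inputs. The sketch below assumes pooled per-clip embeddings and is not the papers' exact scheme:

```python
import torch

def mask_modalities(audio_emb: torch.Tensor, text_emb: torch.Tensor,
                    p_drop: float = 0.3) -> tuple[torch.Tensor, torch.Tensor]:
    """Randomly zero out one modality per sample during training so the
    expression predictor learns to work from audio alone, text alone, or both."""
    batch = audio_emb.shape[0]
    drop_audio = torch.rand(batch, 1) < p_drop
    drop_text = (torch.rand(batch, 1) < p_drop) & ~drop_audio  # never drop both
    return (audio_emb.masked_fill(drop_audio, 0.0),
            text_emb.masked_fill(drop_text, 0.0))

audio = torch.randn(8, 1024)   # e.g., pooled HuBERT / wav2vec features
text = torch.randn(8, 512)     # e.g., CLIP text embedding
audio_in, text_in = mask_modalities(audio, text)
```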

3. Landmark and AU-Guided Motion Generators

A substantial class of THFEM systems exploits explicit facial landmark prediction conditioned on Action Unit (AU) intensities or other parametric inputs (Chang et al., 24 Sep 2025). The variational motion generator (VMG) employs temporal-dilated convolutional models to map audio and AU input to temporally coherent 2D landmark sequences. The explicit mapping of AU vectors $u_t \in \mathbb{R}^{18}$ to landmark configurations $\hat L_t$ enforces physically plausible muscle activations, enabling fine-grained, per-frame expression control:

\hat L_t = f_\theta(z_t, u_t, a_t).
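
A sketch of such an $f_\theta$ as a stack of dilated temporal convolutions follows; the layer sizes, 68-point landmark layout, and input dimensions are assumptions for illustration, not the VMG's actual architecture:

```python
import torch
import torch.nn as nn

class LandmarkPredictor(nn.Module):
    """Sketch of f_theta: (z_t, u_t, a_t) -> 2D landmarks, per frame.

    Stacked dilated temporal convolutions widen the receptive field so each
    predicted frame sees its neighbours, encouraging temporal coherence.
    """

    def __init__(self, z_dim=64, au_dim=18, audio_dim=512, n_landmarks=68):
        super().__init__()
        in_dim = z_dim + au_dim + audio_dim
        self.net = nn.Sequential(
            nn.Conv1d(in_dim, 256, kernel_size=3, dilation=1, padding=1),
            nn.ReLU(),
            nn.Conv1d(256, 256, kernel_size=3, dilation=2, padding=2),
            nn.ReLU(),
            nn.Conv1d(256, 256, kernel_size=3, dilation=4, padding=4),
            nn.ReLU(),
            nn.Conv1d(256, n_landmarks * 2, kernel_size=1))

    def forward(self, z, u, a):
        # z, u, a: (batch, time, dim) -> concatenate on channels, conv over time.
        x = torch.cat([z, u, a], dim=-1).transpose(1, 2)
        out = self.net(x).transpose(1, 2)            # (batch, time, 2 * n_landmarks)
        return out.reshape(out.shape[0], out.shape[1], -1, 2)

model = LandmarkPredictor()
z = torch.randn(2, 50, 64); u = torch.rand(2, 50, 18); a = torch.randn(2, 50, 512)
landmarks = model(z, u, a)    # (2, 50, 68, 2)
```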

Landmark scaffolds, either sparse or dense, serve as interfaces between motion prediction and pixel-level synthesis, with subsequent diffusion-based synthesizers generating realistic videos conditioned on the predicted motion (Chang et al., 24 Sep 2025, Gao et al., 2023). This separation improves temporal stability, expression accuracy, and visual fidelity, and supports continuous interpolation and compound emotion modeling.

4. Conditional Generative Architectures and Attribute Editing

Modern THFEM frameworks use advanced conditional architectures—StyleGAN2, U-Net, or Transformer-based latent diffusion models—modulated by disentangled attribute codes. For instance, FC-TFG establishes a canonical latent space for identity and a motion-only latent space for facial movements, combined by channel-wise sum and disentangled via orthogonality losses at multiple StyleGAN layers (Jang et al., 2023). Fine attribute control is facilitated by explicit vector arithmetic in the motion codes:

z_{s\to d} = z_{s\to c} + z_{c\to d}

where $z_{s\to c}$ encodes identity and $z_{c\to d}$ encodes pure motion.

Attribute editing in FaceEditTalker leverages dual-layer latent decomposition into semantic codes $z_{\rm sem}$ (global attributes) and stochastic codes $Z_T$ (details), learning linear directions for expressions, hairstyle, and accessories (Feng et al., 28 May 2025). Semantic codes can be shifted by learned attribute vectors to effect smooth, intensity-controlled manipulation without degrading lip sync.
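
Both operations reduce to vector arithmetic in latent space. The sketch below illustrates the idea with hypothetical code and direction tensors; `edit_attribute` and the unit-norm convention are illustrative, not FaceEditTalker's API:

```python
import torch

def compose_motion(z_identity: torch.Tensor, z_motion: torch.Tensor) -> torch.Tensor:
    """z_{s->d} = z_{s->c} + z_{c->d}: identity code plus pure-motion code."""
    return z_identity + z_motion

def edit_attribute(z_sem: torch.Tensor, direction: torch.Tensor,
                   intensity: float) -> torch.Tensor:
    """Shift a semantic code along a learned, unit-norm attribute direction.

    `intensity` gives continuous control; 0 leaves the code unchanged."""
    d = direction / direction.norm()
    return z_sem + intensity * d

z_sem = torch.randn(512)
smile_dir = torch.randn(512)   # learned direction, e.g., from an attribute classifier
slightly = edit_attribute(z_sem, smile_dir, 0.5)
strongly = edit_attribute(z_sem, smile_dir, 2.0)
```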

Audio-driven and audio-visual diffusion frameworks (e.g., ACTalker) introduce parallel selective state-space modeling (“mamba” structure), with mask-drop strategies and gating mechanisms to enable signal-specific control of facial regions for conflict-free, multi-modal editing (Hong et al., 3 Apr 2025).
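
Setting aside the state-space backbone, the gating and mask-drop idea can be illustrated with a simple region-gated fusion; the function below is a schematic stand-in under assumed shapes, not ACTalker's implementation:

```python
import torch

def fuse_control_signals(signals: dict[str, torch.Tensor],
                         region_masks: dict[str, torch.Tensor],
                         gates: dict[str, float],
                         p_mask_drop: float = 0.2,
                         training: bool = True) -> torch.Tensor:
    """Blend per-signal feature maps, each restricted to its facial region.

    Gating scales each signal's contribution; mask-drop randomly disables a
    signal's region during training so no region depends on a single modality.
    """
    fused = torch.zeros_like(next(iter(signals.values())))
    for name, feat in signals.items():
        mask = region_masks[name]
        if training and torch.rand(()) < p_mask_drop:
            mask = torch.zeros_like(mask)        # mask-drop: silence this signal
        fused = fused + gates[name] * mask * feat
    return fused

h, w = 32, 32
signals = {"audio": torch.randn(1, 256, h, w), "expression": torch.randn(1, 256, h, w)}
masks = {"audio": torch.zeros(1, 1, h, w), "expression": torch.zeros(1, 1, h, w)}
masks["audio"][..., h // 2:, :] = 1.0          # audio drives the mouth (lower half)
masks["expression"][..., : h // 2, :] = 1.0    # expression drives brows/eyes (upper half)
fused = fuse_control_signals(signals, masks, {"audio": 1.0, "expression": 0.8})
```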

5. Intensity, Region, and Multi-Modal Control

Continuous or discrete intensity control, region-specific editing, and multi-modal source blending are essential features of advanced THFEM. Intensity is modulated via scalar multiplication in latent codes or AU vectors, with region masks controlling the application of deformations (e.g., only lips, brows, eyes) (Wang et al., 18 Mar 2025). Compound emotions are supported by summing region-weighted deformations:

D_e = \sum_{r} M_r \odot \alpha_r \left[\mathrm{CPred}(\mathrm{emo}_r) - \mathrm{CPred}(\mathrm{neutral})\right]

where $M_r$ masks the region and $\alpha_r$ scales the strength.
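
A direct implementation of this blend is straightforward once a deformation predictor is available; in the sketch below, `cpred` is a hypothetical stand-in for CPred that returns dense 2D deformation fields:

```python
import torch

def compound_deformation(cpred, region_masks, alphas, emotions, neutral):
    """D_e = sum_r M_r * alpha_r * [CPred(emo_r) - CPred(neutral)].

    `cpred` maps an emotion label to a dense deformation field; each region r
    contributes the masked, intensity-scaled offset of its emotion from neutral.
    """
    base = cpred(neutral)
    D = torch.zeros_like(base)
    for r, emo in emotions.items():
        D = D + region_masks[r] * alphas[r] * (cpred(emo) - base)
    return D

# Hypothetical stand-in for the deformation predictor CPred.
fields = {e: torch.randn(1, 2, 64, 64) for e in ("neutral", "happy", "surprised")}
cpred = fields.__getitem__
masks = {"mouth": torch.zeros(1, 1, 64, 64), "brows": torch.zeros(1, 1, 64, 64)}
masks["mouth"][..., 32:, :] = 1.0
masks["brows"][..., :16, :] = 1.0
# "Happy mouth + surprised brows" as a compound emotion.
D_e = compound_deformation(cpred, masks,
                           alphas={"mouth": 1.0, "brows": 0.6},
                           emotions={"mouth": "happy", "brows": "surprised"},
                           neutral="neutral")
```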

Text-driven frameworks like TalkCLIP use CLIP-based encoders to embed descriptive sentences and map them into style codes for both coarse-grained emotions and fine-grained AU movements, supporting interpolation between discrete and continuous intensity descriptors, and generalization to unseen text (Ma et al., 2023).
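
The text-to-style mapping can be sketched with the public OpenAI CLIP package; the MLP head, its dimensions, and how that head would be trained are assumptions beyond what the paper specifies:

```python
import torch
import torch.nn as nn
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git

class TextToStyle(nn.Module):
    """Map a free-form expression description to a style code via CLIP."""

    def __init__(self, style_dim: int = 256, device: str = "cpu"):
        super().__init__()
        self.clip_model, _ = clip.load("ViT-B/32", device=device)
        self.head = nn.Sequential(               # small MLP on frozen CLIP features
            nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, style_dim))
        self.device = device

    @torch.no_grad()
    def embed(self, sentences: list[str]) -> torch.Tensor:
        tokens = clip.tokenize(sentences).to(self.device)
        return self.clip_model.encode_text(tokens).float()

    def forward(self, sentences: list[str]) -> torch.Tensor:
        return self.head(self.embed(sentences))

t2s = TextToStyle()
codes = t2s(["a broad, delighted smile", "a slight, skeptical frown"])
# Interpolating between the two codes yields intermediate expression styles.
mid = 0.5 * codes[0] + 0.5 * codes[1]
```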

6. Evaluation, Benchmarks, and Quantitative Performance

THFEM models are evaluated using standard and specialized metrics (two of the simpler ones are sketched in code after this list):

  • Visual quality: FID, SSIM, PSNR, MS-SSIM, CPBD
  • Lip-sync: LMD (landmark distance), LSE-C (SyncNet confidence), LSE-D (SyncNet error)
  • Expression accuracy: emotion-classification accuracy ($Acc_{emo}$), F-LMD (full-face landmark distance)
  • Temporal stability: FVD (Fréchet Video Distance), Smooth (optical flow consistency)
  • Identity: ArcFace cosine similarity
  • Attribute editability: intensity linearity (LIE), attribute classifier agreement
  • User studies: realism, lip sync, emotion correctness, pose naturalness (Tan et al., 2024, Tan et al., 19 Aug 2025, Chang et al., 24 Sep 2025, Jang et al., 2023, Wang et al., 18 Mar 2025).
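
A minimal sketch of two of the simpler metrics (LMD and identity cosine similarity), assuming landmarks and face-recognition embeddings have already been extracted:

```python
import torch

def landmark_distance(pred: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    """LMD: mean Euclidean distance between predicted and ground-truth
    landmarks, averaged over points and frames.  Shapes: (T, N, 2)."""
    return (pred - gt).norm(dim=-1).mean()

def identity_similarity(e_src: torch.Tensor, e_gen: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between face-recognition embeddings (e.g., ArcFace)
    of the source and the generated frame; higher means identity preserved."""
    return torch.nn.functional.cosine_similarity(e_src, e_gen, dim=-1)

pred = torch.randn(50, 68, 2); gt = pred + 0.05 * torch.randn_like(pred)
print(landmark_distance(pred, gt))               # ~0.06 for this toy noise level
print(identity_similarity(torch.randn(512), torch.randn(512)))
```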

State-of-the-art results: On MEAD and HDTF, EDTalk++ achieves $Acc_{emo} = 68.2\%$, with F-LMD and M-LMD outperforming StyleTalk, EAMM, and other baselines (Tan et al., 19 Aug 2025). AU-guided landmark generation pushes $Acc_{emo}$ to $78.03\%$, with FID = 18.0 and M-LMD = 2.288 px (Chang et al., 24 Sep 2025). User studies confirm perceived realism and manipulation fidelity.

7. Extensions, Limitations, and Future Research

While THFEM systems now support fully disentangled, user-driven control, several limitations persist:

  • Dependency on external AU detectors (OpenFace) for landmark prediction (Chang et al., 24 Sep 2025)
  • Limited personalization: current systems do not model individual expression “style” beyond coarse identity features
  • Resolution bottlenecks: many pipelines cap output at $256^2$ or $512^2$
  • Artifact susceptibility under extreme pose or expression edits
  • Computational demands: adjacent-frame priors and multi-module architectures increase parameter and FLOP counts (Lu et al., 19 Jan 2026)

Future directions include learning self-supervised AU detectors, style tokens for personalization, real-time high-resolution synthesis, integration with text-to-expression from LLMs, and joint optimization of expression/lip sync in unified backbones (Tan et al., 19 Aug 2025, Ma et al., 2023, Cai et al., 2024, Lu et al., 19 Jan 2026).


In conclusion, THFEM is a rapidly advancing area that combines disentangled latent modeling, landmark and AU-guided control, conditional generative architectures, and multi-modal modulation to yield high-fidelity, independently controllable, and robust talking-head video synthesis. The field is witnessing the emergence of increasingly sophisticated systems for fine-grained emotional editing, avatar personalization, and real-world human-computer interaction (Tan et al., 19 Aug 2025, Tan et al., 2024, Chang et al., 24 Sep 2025, Ma et al., 2023, Jang et al., 2023, Feng et al., 28 May 2025, Lu et al., 19 Jan 2026).
