IFED: Instruction-driven Facial Expression Decomposer
- The paper introduces IFED, a neural module that fuses CLIP text embeddings with FLAME parameters to enable precise 3D facial expression generation and transition.
- It employs dual-branch transformer encoders and cross-attention mechanisms, achieving improved expression accuracy (e.g., CK+ Acc₁ up to 91.44%) through effective multimodal fusion.
- Its end-to-end supervised design, incorporating composite losses including vertex reconstruction loss, ensures temporally smooth and semantically faithful facial motion sequences.
The Instruction-driven Facial Expression Decomposer (IFED) is a neural module designed to integrate natural language instructions and low-level facial parameter data to enable precisely conditioned 3D facial expression generation and transition. IFED constitutes the core fusion block in frameworks such as the Facial Expression Transition (FET) module and underpins the recently introduced Instruction to Facial Expression Transition (I2FET) method. This architecture enables controlled transformation of facial expressions in 3D avatars based on free-form text instructions, facilitating the generation of temporally smooth and semantically faithful facial motion sequences (Vo et al., 13 Jan 2026).
1. Functional Overview and Motivation
At the heart of IFED is its capability to perform multimodal decomposition and fusion, processing (a) textual instructions, encoded as CLIP embeddings, and (b) continuous facial parameter vectors based on the FLAME model. IFED receives as input a sequence of facial parameter vectors $X = \{x_t\}_{t=1}^{T}$, where each $x_t = (\psi_t, \theta_t)$ comprises FLAME expression coefficients $\psi_t$ and jaw pose $\theta_t$, alongside a CLIP-encoded text embedding $c$. IFED outputs a sequence of conditional expression features $f^{e}$ and pose features $f^{p}$. These embeddings are designed specifically for downstream conditioning of VAE-style encoders/decoders within the I2FET framework, enhancing the fidelity and controllability of expression synthesis (Vo et al., 13 Jan 2026).
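The input/output interface can be sketched with placeholder dimensions. Note the sizes here are illustrative assumptions: DECA-style FLAME fits commonly use 50 expression coefficients and a 3-D axis-angle jaw pose, and CLIP ViT-B text embeddings are 512-D; the paper's exact dimensions may differ.

```python
import numpy as np

# Hypothetical sizes for illustration only (not taken from the paper):
# 50-D expression + 3-D jaw pose per frame, 512-D CLIP text embedding.
T, D_EXPR, D_JAW, D_CLIP = 16, 50, 3, 512

expr = np.random.randn(T, D_EXPR)   # FLAME expression coefficients per frame
jaw = np.random.randn(T, D_JAW)     # jaw pose per frame
text = np.random.randn(D_CLIP)      # CLIP embedding of the instruction

x = np.concatenate([expr, jaw], axis=-1)   # (T, 53) facial-parameter sequence
c = np.broadcast_to(text, (T, D_CLIP))     # text code replicated along time

print(x.shape, c.shape)
```

The per-frame parameters and the single text code are thus brought to a common sequence length before entering the two branch encoders.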
2. Architecture: Dual-Branch Transformer Encoding and Cross-Attention Fusion
The IFED module is architected around two parallel transformer encoders:
- Facial-Parameter Branch ($E_f$):
- Input: the facial parameter sequence $X = \{x_t\}_{t=1}^{T}$
- Internal operations:

$$h = \mathrm{LN}\big(X + \mathrm{MHSA}(X)\big), \qquad F_f = \mathrm{LN}\big(h + \mathrm{FFN}(h)\big),$$

where $\mathrm{MHSA}$ denotes multi-head self-attention, $\mathrm{FFN}$ is a feed-forward network, and $\mathrm{LN}$ is layer normalization.
- Text Branch ($E_t$):
- Input: the CLIP text embedding $c$, replicated to length $T$
- Processing: the replicated embedding undergoes a linear projection $W_t$, followed by a transformer block, yielding $F_t$.
- Cross-Attention Fusion (CAFT):
- The latent representations $F_f$ and $F_t$ serve as the basis for fused feature computation via dual cross-attention:

$$\tilde F_f = \mathrm{LN}\big(U_f\,\mathrm{MHCA}(P_f F_f,\, F_t)\big), \qquad \tilde F_t = \mathrm{LN}\big(U_t\,\mathrm{MHCA}(P_t F_t,\, F_f)\big),$$

where $P_f$, $P_t$ are projections between modalities, and $U_f$, $U_t$ are back-projections to the original dimensions. $\mathrm{MHCA}$ is standard multi-head cross-attention. The layer-normalized outputs $\tilde F_f$ and $\tilde F_t$ are concatenated to form the fused feature $F = [\tilde F_f;\, \tilde F_t]$.
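A minimal NumPy sketch of the dual-branch encoding and dual cross-attention fusion. Several details are assumptions for brevity rather than the paper's implementation: single-head attention in place of multi-head, post-norm residual blocks, a ReLU feed-forward, toy dimensions, and random weights standing in for learned ones.

```python
import numpy as np

rng = np.random.default_rng(0)

def layer_norm(x, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def attention(q, k, v):
    # scaled dot-product attention, single head for brevity
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(-1, keepdims=True)          # numerical stability
    w = np.exp(scores)
    w /= w.sum(-1, keepdims=True)
    return w @ v

def transformer_block(x, Wq, Wk, Wv, W1, W2):
    # post-norm form matching the text: LN(x + MHSA(x)) then LN(h + FFN(h))
    h = layer_norm(x + attention(x @ Wq, x @ Wk, x @ Wv))
    ffn = np.maximum(h @ W1, 0.0) @ W2               # ReLU feed-forward
    return layer_norm(h + ffn)

T, d = 16, 64                                        # toy sequence length / width
def mk(*shape): return 0.1 * rng.standard_normal(shape)

# Two parallel branch encoders over facial features F and replicated text code C.
F = rng.standard_normal((T, d))
C = rng.standard_normal((1, d)).repeat(T, axis=0)
Ff = transformer_block(F, mk(d, d), mk(d, d), mk(d, d), mk(d, 4*d), mk(4*d, d))
Ft = transformer_block(C, mk(d, d), mk(d, d), mk(d, d), mk(d, 4*d), mk(4*d, d))

# Dual cross-attention: each modality queries the other through a cross-modal
# projection (Pf, Pt), is back-projected (Uf, Ut), layer-normalized, and the
# two outputs are concatenated into the fused feature.
Pf, Pt, Uf, Ut = mk(d, d), mk(d, d), mk(d, d), mk(d, d)
Zf = layer_norm(attention(Ff @ Pf, Ft, Ft) @ Uf)
Zt = layer_norm(attention(Ft @ Pt, Ff, Ff) @ Ut)
fused = np.concatenate([Zf, Zt], axis=-1)            # (T, 2d)
print(fused.shape)
```

The concatenation doubles the feature width, which is why the decomposition heads that follow project back to the original embedding sizes.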
3. Representation Decomposition and End-to-End Supervision
The fused feature $F$ is decomposed via two lightweight linear layers:

$$f^{e} = W_e F, \qquad f^{p} = W_p F,$$

yielding the final conditional embeddings for expression and pose.
Within the I2FET module, these embeddings are used to condition two parallel VAE branches.
Stochastic latent codes are sampled and linearly projected, then re-injected into IFED alongside the original text embedding, realizing a refinement mechanism before decoding.
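The sampling-and-reinjection step can be illustrated with the standard VAE reparameterization trick. Everything below is a toy stand-in: the posterior heads and projection matrices are random matrices, not the paper's learned layers.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy conditional embedding standing in for IFED's expression feature.
f_expr = rng.standard_normal((16, 32))

# A VAE encoder maps the conditioned sequence to a Gaussian posterior;
# the reparameterization trick samples z = mu + sigma * eps so sampling
# stays differentiable with respect to mu and log_var.
W_mu = 0.1 * rng.standard_normal((32, 8))
W_lv = 0.1 * rng.standard_normal((32, 8))
mu, log_var = f_expr.mean(0) @ W_mu, f_expr.mean(0) @ W_lv
eps = rng.standard_normal(mu.shape)
z = mu + np.exp(0.5 * log_var) * eps        # stochastic latent code

# The sampled code is linearly projected and re-injected alongside the
# original conditioning for a second pass (the refinement mechanism).
W_proj = 0.1 * rng.standard_normal((8, 32))
refined_input = f_expr + z @ W_proj         # broadcast over the time axis
print(refined_input.shape)
```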
End-to-end supervision is implemented with the following composite loss:
- Mean-squared reconstruction loss for expression and pose parameters.
- VAE KL divergence.
- FLAME-vertex reconstruction loss, which directly supervises 3D shape fidelity:

$$\mathcal{L}_{\mathrm{vert}} = \frac{1}{T}\sum_{t=1}^{T} \big\| V(\hat\psi_t, \hat\theta_t) - V(\psi_t, \theta_t) \big\|_2^2,$$

where $V(\cdot)$ denotes FLAME vertex decoding at a fixed identity shape, and $(\hat\psi_t, \hat\theta_t)$ and $(\psi_t, \theta_t)$ are the predicted and ground-truth expression and jaw-pose parameters.
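The three terms can be sketched as follows. The FLAME decoder is replaced here by a linear stand-in (the real model is nonlinear), and the loss-weighting coefficients, which the summary does not state, are omitted.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy predictions/targets for a T-frame sequence (hypothetical sizes).
T, d_param, n_vert = 8, 53, 100
pred = rng.standard_normal((T, d_param))
target = rng.standard_normal((T, d_param))
mu, log_var = rng.standard_normal(16), rng.standard_normal(16)

W = 0.1 * rng.standard_normal((d_param, n_vert * 3))

def vertices(params):
    # Stand-in for FLAME vertex decoding: a fixed linear map from
    # expression/jaw parameters to (n_vert, 3) vertex positions.
    return (params @ W).reshape(len(params), n_vert, 3)

# 1) MSE reconstruction on the expression / jaw-pose parameters.
l_rec = np.mean((pred - target) ** 2)
# 2) KL divergence of the VAE posterior N(mu, sigma^2) from N(0, I).
l_kl = 0.5 * np.mean(np.exp(log_var) + mu ** 2 - 1.0 - log_var)
# 3) Vertex loss: decode both parameter sets and compare the meshes,
#    directly supervising 3-D shape fidelity.
l_vert = np.mean((vertices(pred) - vertices(target)) ** 2)

loss = l_rec + l_kl + l_vert   # unweighted sum; real weights are unstated
print(float(l_rec), float(l_kl), float(l_vert))
```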
4. Integration in Expression Transition and Temporal Smoothing
IFED, together with I2FET, generates two anchor frames at the initial and final timesteps. The source FLAME shape $\beta$, expression $\psi$, pose $\theta$, and camera parameters are obtained by DECA. The resulting expression and pose coefficients, either predicted directly at the anchors or interpolated for intermediate frames via

$$\psi_t = (1 - \alpha_t)\,\psi_1 + \alpha_t\,\psi_T, \qquad \alpha_t = \frac{t-1}{T-1} \quad (1 < t < T),$$

yield a temporally smooth trajectory. Final rendering utilizes a FLAME-based mesh and a pre-trained neural renderer (e.g., ROME or CVTHead).
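Assuming the interpolation between the two anchor frames is linear in a per-frame coefficient $\alpha_t = (t-1)/(T-1)$ (a common choice; toy values below), the in-between trajectory can be computed as:

```python
import numpy as np

# Anchor coefficients at the first and last timesteps (toy 3-D values;
# in the method these come from the two generated anchor frames).
T = 5
psi_1 = np.zeros(3)
psi_T = np.array([1.0, 2.0, -1.0])

# alpha_t runs from 0 to 1 across the sequence, so the trajectory moves
# smoothly from the source anchor to the target anchor.
alphas = np.arange(T) / (T - 1)
traj = (1 - alphas)[:, None] * psi_1 + alphas[:, None] * psi_T

print(traj[0], traj[-1])   # endpoints reproduce the two anchors exactly
```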
5. Empirical Results and Quantitative Performance
Evaluation utilizes both the CK+ (26,352 samples) and CelebV-HQ (28,335 samples) datasets, augmented with a range of expression instructions and identities. I2FET is trained for 200 epochs (batch size 128, Adam optimizer), with a 10% test split and repeated trials for mean±std reporting. ResNet-101 (with re-weighted focal loss) is employed for downstream expression classification.
Key results are presented in the table below, contrasting “MotionClip” and the IFED-enabled system (“Ours (I2FET + IFED + vertex loss)”):
| Metric | CK+ MotionClip | CK+ Ours | CelebV-HQ MotionClip | CelebV-HQ Ours |
|---|---|---|---|---|
| Acc₁ | 52% | 91.44% | 40% | 58.24% |
| Acc₂ | 20% | 84.03% | 13.7% | 33.45% |
| G-mean | 40.48% | 80.30% | 34.4% | 46.47% |
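The reported metrics can be reproduced on toy labels, assuming Acc₁ is top-1 accuracy and G-mean is the geometric mean of per-class recalls, a standard summary for imbalanced classes (consistent with the re-weighted focal loss used on the classifier).

```python
import numpy as np

def top1_accuracy(y_true, y_pred):
    # fraction of samples whose predicted class matches the label
    return float(np.mean(y_true == y_pred))

def g_mean(y_true, y_pred):
    # geometric mean of per-class recalls: a single wrong-for-every-sample
    # class drives the score to zero, penalizing neglected minority classes
    classes = np.unique(y_true)
    recalls = [np.mean(y_pred[y_true == c] == c) for c in classes]
    return float(np.prod(recalls) ** (1.0 / len(recalls)))

y_true = np.array([0, 0, 0, 1, 1, 2])
y_pred = np.array([0, 0, 1, 1, 1, 2])
print(round(top1_accuracy(y_true, y_pred), 3),
      round(g_mean(y_true, y_pred), 3))   # -> 0.833 0.874
```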
Removing IFED leads to substantial drops in classification accuracy (CK+ Acc₁ falls to ~76.9%), and full accuracy is recovered once the module is reinstated. t-SNE analysis and confusion matrices indicate improved class separation and representation clustering when IFED is present (Vo et al., 13 Jan 2026).
6. Contextualization and Implications
In summary, IFED serves as a lightweight, cross-attention-based transformer that delivers effective multimodal fusion between semantic text and low-level facial parameters. Injecting these conditional embeddings at both the encoder and decoder stages (the refinement mechanism) greatly enhances the system's ability to interpret arbitrary textual instructions (e.g., "Turn this face from disgust to happiness") and generate smooth, expressive facial trajectories. This suggests wide application potential in 3D avatar animation, digital human interaction, and emotion-driven simulation (Vo et al., 13 Jan 2026). The demonstrated ability to use natural language instructions to expand the repertoire of synthesized facial expressions and their transitions represents a substantial advance in controllable, instruction-driven 3D facial animation.