IFED: Instruction-driven Facial Expression Decomposer
- The paper introduces IFED, a neural module that fuses CLIP text embeddings with FLAME parameters to enable precise 3D facial expression generation and transition.
- It employs dual-branch transformer encoders and cross-attention mechanisms, achieving improved expression accuracy (e.g., CK+ Acc₁ up to 91.44%) through effective multimodal fusion.
- Its end-to-end supervised design, incorporating composite losses including vertex reconstruction loss, ensures temporally smooth and semantically faithful facial motion sequences.
The Instruction-driven Facial Expression Decomposer (IFED) is a neural module designed to integrate natural language instructions and low-level facial parameter data to enable precisely conditioned 3D facial expression generation and transition. IFED constitutes the core fusion block in frameworks such as the Facial Expression Transition (FET) module and underpins the recently introduced Instruction to Facial Expression Transition (I2FET) method. This architecture enables controlled transformation of facial expressions in 3D avatars based on free-form text instructions, facilitating the generation of temporally smooth and semantically faithful facial motion sequences (Vo et al., 13 Jan 2026).
1. Functional Overview and Motivation
At the heart of IFED is its capability to perform multimodal decomposition and fusion, processing (a) textual instructions, encoded as CLIP embeddings, and (b) continuous facial parameter vectors based on the FLAME model. IFED receives as input a sequence of facial parameter vectors $X = \{x_t\}_{t=1}^{T}$, where each $x_t = (\psi_t, \theta_t)$ comprises FLAME expression coefficients $\psi_t$ and jaw pose $\theta_t$, alongside a CLIP-encoded text embedding $c$. IFED outputs a sequence of conditional expression features $f^{e}$ and pose features $f^{p}$. These embeddings are designed specifically for downstream conditioning of VAE-style encoders/decoders within the I2FET framework, enhancing the fidelity and controllability of expression synthesis (Vo et al., 13 Jan 2026).
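The input/output interface can be sketched with placeholder dimensions. Note the sizes here are illustrative assumptions: DECA-style FLAME fits commonly use 50 expression coefficients and a 3-D axis-angle jaw pose, and CLIP ViT-B text embeddings are 512-D; the paper's exact dimensions may differ.

```python
import numpy as np

# Hypothetical sizes for illustration only (not taken from the paper):
# 50-D expression + 3-D jaw pose per frame, 512-D CLIP text embedding.
T, D_EXPR, D_JAW, D_CLIP = 16, 50, 3, 512

expr = np.random.randn(T, D_EXPR)   # FLAME expression coefficients per frame
jaw = np.random.randn(T, D_JAW)     # jaw pose per frame
text = np.random.randn(D_CLIP)      # CLIP embedding of the instruction

x = np.concatenate([expr, jaw], axis=-1)   # (T, 53) facial-parameter sequence
c = np.broadcast_to(text, (T, D_CLIP))     # text code replicated along time

print(x.shape, c.shape)
```

The per-frame parameters and the single text code are thus brought to a common sequence length before entering the two branch encoders.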
2. Architecture: Dual-Branch Transformer Encoding and Cross-Attention Fusion
The IFED module is architected around two parallel transformer encoders:
- Facial-Parameter Branch ($E_f$):
- Input: the facial parameter sequence $X = \{x_t\}_{t=1}^{T}$
- Internal operations:

$$h = \mathrm{LN}\big(X + \mathrm{MHSA}(X)\big), \qquad F_f = \mathrm{LN}\big(h + \mathrm{FFN}(h)\big),$$

where $\mathrm{MHSA}$ denotes multi-head self-attention, $\mathrm{FFN}$ is a feed-forward network, and $\mathrm{LN}$ is layer normalization.
- Text Branch ($E_t$):
- Input: the CLIP text embedding $c$, replicated to length $T$
- Processing: the replicated embedding undergoes a linear projection $W_t$, followed by a transformer block, yielding $F_t$.
- Cross-Attention Fusion (CAFT):
- The latent representations $F_f$ and $F_t$ serve as the basis for fused feature computation via dual cross-attention:

$$\tilde F_f = \mathrm{LN}\big(U_f\,\mathrm{MHCA}(P_f F_f,\, F_t)\big), \qquad \tilde F_t = \mathrm{LN}\big(U_t\,\mathrm{MHCA}(P_t F_t,\, F_f)\big),$$

where $P_f$, $P_t$ are projections between modalities, and $U_f$, $U_t$ are back-projections to the original dimensions. $\mathrm{MHCA}$ is standard multi-head cross-attention. The layer-normalized outputs $\tilde F_f$ and $\tilde F_t$ are concatenated to form the fused feature $F = [\tilde F_f;\, \tilde F_t]$.
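A minimal NumPy sketch of the dual-branch encoding and dual cross-attention fusion. Several details are assumptions for brevity rather than the paper's implementation: single-head attention in place of multi-head, post-norm residual blocks, a ReLU feed-forward, toy dimensions, and random weights standing in for learned ones.

```python
import numpy as np

rng = np.random.default_rng(0)

def layer_norm(x, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def attention(q, k, v):
    # scaled dot-product attention, single head for brevity
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(-1, keepdims=True)          # numerical stability
    w = np.exp(scores)
    w /= w.sum(-1, keepdims=True)
    return w @ v

def transformer_block(x, Wq, Wk, Wv, W1, W2):
    # post-norm form matching the text: LN(x + MHSA(x)) then LN(h + FFN(h))
    h = layer_norm(x + attention(x @ Wq, x @ Wk, x @ Wv))
    ffn = np.maximum(h @ W1, 0.0) @ W2               # ReLU feed-forward
    return layer_norm(h + ffn)

T, d = 16, 64                                        # toy sequence length / width
def mk(*shape): return 0.1 * rng.standard_normal(shape)

# Two parallel branch encoders over facial features F and replicated text code C.
F = rng.standard_normal((T, d))
C = rng.standard_normal((1, d)).repeat(T, axis=0)
Ff = transformer_block(F, mk(d, d), mk(d, d), mk(d, d), mk(d, 4*d), mk(4*d, d))
Ft = transformer_block(C, mk(d, d), mk(d, d), mk(d, d), mk(d, 4*d), mk(4*d, d))

# Dual cross-attention: each modality queries the other through a cross-modal
# projection (Pf, Pt), is back-projected (Uf, Ut), layer-normalized, and the
# two outputs are concatenated into the fused feature.
Pf, Pt, Uf, Ut = mk(d, d), mk(d, d), mk(d, d), mk(d, d)
Zf = layer_norm(attention(Ff @ Pf, Ft, Ft) @ Uf)
Zt = layer_norm(attention(Ft @ Pt, Ff, Ff) @ Ut)
fused = np.concatenate([Zf, Zt], axis=-1)            # (T, 2d)
print(fused.shape)
```

The concatenation doubles the feature width, which is why the decomposition heads that follow project back to the original embedding sizes.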
3. Representation Decomposition and End-to-End Supervision
The fused feature $F$ is decomposed via two lightweight linear layers:

$$f^{e} = W_e F, \qquad f^{p} = W_p F,$$

yielding the final conditional embeddings for expression and pose.
Within the I2FET module, these embeddings are used to condition two parallel VAE branches.
Stochastic latent codes are sampled and linearly projected, then re-injected into IFED alongside the original text embedding, realizing a refinement mechanism before decoding.
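The sampling-and-reinjection step can be illustrated with the standard VAE reparameterization trick. Everything below is a toy stand-in: the posterior heads and projection matrices are random matrices, not the paper's learned layers.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy conditional embedding standing in for IFED's expression feature.
f_expr = rng.standard_normal((16, 32))

# A VAE encoder maps the conditioned sequence to a Gaussian posterior;
# the reparameterization trick samples z = mu + sigma * eps so sampling
# stays differentiable with respect to mu and log_var.
W_mu = 0.1 * rng.standard_normal((32, 8))
W_lv = 0.1 * rng.standard_normal((32, 8))
mu, log_var = f_expr.mean(0) @ W_mu, f_expr.mean(0) @ W_lv
eps = rng.standard_normal(mu.shape)
z = mu + np.exp(0.5 * log_var) * eps        # stochastic latent code

# The sampled code is linearly projected and re-injected alongside the
# original conditioning for a second pass (the refinement mechanism).
W_proj = 0.1 * rng.standard_normal((8, 32))
refined_input = f_expr + z @ W_proj         # broadcast over the time axis
print(refined_input.shape)
```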
End-to-end supervision is implemented with the following composite loss:
- Mean-squared reconstruction loss for expression and pose parameters.
- VAE KL divergence.
- FLAME-vertex reconstruction loss, which directly supervises 3D shape fidelity:

$$\mathcal{L}_{\mathrm{vert}} = \frac{1}{T}\sum_{t=1}^{T} \big\| V(\hat\psi_t, \hat\theta_t) - V(\psi_t, \theta_t) \big\|_2^2,$$

where $V(\cdot)$ denotes FLAME vertex decoding at a fixed identity shape, and $(\hat\psi_t, \hat\theta_t)$ and $(\psi_t, \theta_t)$ are the predicted and ground-truth expression and jaw-pose parameters.
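The three terms can be sketched as follows. The FLAME decoder is replaced here by a linear stand-in (the real model is nonlinear), and the loss-weighting coefficients, which the summary does not state, are omitted.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy predictions/targets for a T-frame sequence (hypothetical sizes).
T, d_param, n_vert = 8, 53, 100
pred = rng.standard_normal((T, d_param))
target = rng.standard_normal((T, d_param))
mu, log_var = rng.standard_normal(16), rng.standard_normal(16)

W = 0.1 * rng.standard_normal((d_param, n_vert * 3))

def vertices(params):
    # Stand-in for FLAME vertex decoding: a fixed linear map from
    # expression/jaw parameters to (n_vert, 3) vertex positions.
    return (params @ W).reshape(len(params), n_vert, 3)

# 1) MSE reconstruction on the expression / jaw-pose parameters.
l_rec = np.mean((pred - target) ** 2)
# 2) KL divergence of the VAE posterior N(mu, sigma^2) from N(0, I).
l_kl = 0.5 * np.mean(np.exp(log_var) + mu ** 2 - 1.0 - log_var)
# 3) Vertex loss: decode both parameter sets and compare the meshes,
#    directly supervising 3-D shape fidelity.
l_vert = np.mean((vertices(pred) - vertices(target)) ** 2)

loss = l_rec + l_kl + l_vert   # unweighted sum; real weights are unstated
print(float(l_rec), float(l_kl), float(l_vert))
```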
4. Integration in Expression Transition and Temporal Smoothing
IFED, together with I2FET, generates two anchor frames at the initial and final timesteps. The source FLAME shape $\beta$, expression $\psi$, pose $\theta$, and camera parameters are obtained by DECA. The resulting expression and pose coefficients, either predicted directly at the anchors or interpolated for intermediate frames via

$$\psi_t = (1 - \alpha_t)\,\psi_1 + \alpha_t\,\psi_T, \qquad \alpha_t = \frac{t-1}{T-1} \quad (1 < t < T),$$

yield a temporally smooth trajectory. Final rendering utilizes a FLAME-based mesh and a pre-trained neural renderer (e.g., ROME or CVTHead).
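Assuming the interpolation between the two anchor frames is linear in a per-frame coefficient $\alpha_t = (t-1)/(T-1)$ (a common choice; toy values below), the in-between trajectory can be computed as:

```python
import numpy as np

# Anchor coefficients at the first and last timesteps (toy 3-D values;
# in the method these come from the two generated anchor frames).
T = 5
psi_1 = np.zeros(3)
psi_T = np.array([1.0, 2.0, -1.0])

# alpha_t runs from 0 to 1 across the sequence, so the trajectory moves
# smoothly from the source anchor to the target anchor.
alphas = np.arange(T) / (T - 1)
traj = (1 - alphas)[:, None] * psi_1 + alphas[:, None] * psi_T

print(traj[0], traj[-1])   # endpoints reproduce the two anchors exactly
```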
5. Empirical Results and Quantitative Performance
Evaluation utilizes both the CK+ (26,352 samples) and CelebV-HQ (28,335 samples) datasets, augmented with a range of expression instructions and identities. I2FET is trained for 200 epochs (batch size 128, Adam optimizer), with a 10% test split and repeated trials for mean±std reporting. ResNet-101 (with re-weighted focal loss) is employed for downstream expression classification.
Key results are presented in the table below, contrasting “MotionClip” and the IFED-enabled system (“Ours (I2FET + IFED + vertex loss)”):
| Metric | CK+ MotionClip | CK+ Ours | CelebV-HQ MotionClip | CelebV-HQ Ours |
|---|---|---|---|---|
| Acc₁ | 52% | 91.44% | 40% | 58.24% |
| Acc₂ | 20% | 84.03% | 13.7% | 33.45% |
| G-mean | 40.48% | 80.30% | 34.4% | 46.47% |
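The reported metrics can be reproduced on toy labels, assuming Acc₁ is top-1 accuracy and G-mean is the geometric mean of per-class recalls, a standard summary for imbalanced classes (consistent with the re-weighted focal loss used on the classifier).

```python
import numpy as np

def top1_accuracy(y_true, y_pred):
    # fraction of samples whose predicted class matches the label
    return float(np.mean(y_true == y_pred))

def g_mean(y_true, y_pred):
    # geometric mean of per-class recalls: a single wrong-for-every-sample
    # class drives the score to zero, penalizing neglected minority classes
    classes = np.unique(y_true)
    recalls = [np.mean(y_pred[y_true == c] == c) for c in classes]
    return float(np.prod(recalls) ** (1.0 / len(recalls)))

y_true = np.array([0, 0, 0, 1, 1, 2])
y_pred = np.array([0, 0, 1, 1, 1, 2])
print(round(top1_accuracy(y_true, y_pred), 3),
      round(g_mean(y_true, y_pred), 3))   # -> 0.833 0.874
```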
Removing IFED leads to substantial drops in classification accuracy (CK+ Acc₁ falls to ~76.9%), and full accuracy is recovered once the module is reinstated. t-SNE analysis and confusion matrices indicate improved class separation and representation clustering when IFED is present (Vo et al., 13 Jan 2026).
6. Contextualization and Implications
In summary, IFED serves as a lightweight, cross-attention-based transformer that delivers effective multimodal fusion between semantic text and low-level facial parameters. Injecting these conditional embeddings at both the encoder and decoder stages (the refinement mechanism) greatly enhances the system's ability to interpret arbitrary textual instructions (e.g., "Turn this face from disgust to happiness") and generate smooth, expressive facial trajectories. This suggests wide application potential in 3D avatar animation, digital human interaction, and emotion-driven simulation (Vo et al., 13 Jan 2026). The demonstrated ability to use natural language instructions to expand the repertoire of synthesized facial expressions and their transitions represents a substantial advance in controllable, instruction-driven 3D facial animation.