
IFED: Instruction-driven Facial Expression Decomposer

Updated 20 January 2026
  • The paper introduces IFED, a neural module that fuses CLIP text embeddings with FLAME parameters to enable precise 3D facial expression generation and transition.
  • It employs dual-branch transformer encoders and cross-attention mechanisms, achieving improved expression accuracy (e.g., CK+ Acc₁ up to 91.44%) through effective multimodal fusion.
  • Its end-to-end supervised design, incorporating composite losses including vertex reconstruction loss, ensures temporally smooth and semantically faithful facial motion sequences.

The Instruction-driven Facial Expression Decomposer (IFED) is a neural module designed to integrate natural language instructions and low-level facial parameter data to enable precisely conditioned 3D facial expression generation and transition. IFED constitutes the core fusion block in frameworks such as the Facial Expression Transition (FET) module and underpins the recently introduced Instruction to Facial Expression Transition (I2FET) method. This architecture enables controlled transformation of facial expressions in 3D avatars based on free-form text instructions, facilitating the generation of temporally smooth and semantically faithful facial motion sequences (Vo et al., 13 Jan 2026).

1. Functional Overview and Motivation

At the heart of IFED is its capability to perform multimodal decomposition and fusion, processing (a) textual instructions, encoded as CLIP embeddings, and (b) continuous facial parameter vectors based on the FLAME model. IFED receives as input a sequence of facial parameter vectors $x^f = \{e, \theta\} \in \mathbb{R}^{m \times 53}$, with $e \in \mathbb{R}^{50}$ representing FLAME expression coefficients and $\theta \in \mathbb{R}^{3}$ denoting jaw pose, alongside a CLIP-encoded text embedding $x^t \in \mathbb{R}^{77 \times 768}$. IFED outputs a sequence of conditional expression features $x^e_{\mathrm{emb}} \in \mathbb{R}^{m \times 50}$ and pose features $x^p_{\mathrm{emb}} \in \mathbb{R}^{m \times 6}$. These embeddings are designed specifically for downstream conditioning of VAE-style encoders/decoders within the I2FET framework, enhancing the fidelity and controllability of expression synthesis (Vo et al., 13 Jan 2026).
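As a sanity check, the input/output dimensions above can be sketched with placeholder arrays (the frame count `m` and the zero-filled values are illustrative assumptions, not values from the paper):

```python
import numpy as np

m = 16  # hypothetical sequence length (number of frames)

# FLAME-style parameters: 50 expression coefficients + 3 jaw-pose values per frame
e = np.zeros((m, 50))      # expression coefficients e
theta = np.zeros((m, 3))   # jaw pose theta
x_f = np.concatenate([e, theta], axis=-1)  # x^f in R^{m x 53}

# CLIP-style text embedding: 77 tokens x 768 dimensions
x_t = np.zeros((77, 768))

assert x_f.shape == (m, 53)
assert x_t.shape == (77, 768)
```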

2. Architecture: Dual-Branch Transformer Encoding and Cross-Attention Fusion

The IFED module is architected around two parallel transformer encoders:

  1. Facial-Parameter Branch ($\mathcal{E}_P$):
    • Input: $x^f \in \mathbb{R}^{m \times 53}$
    • Internal operations:

    $$y^f = x^f + \mathrm{MSA}(\mathrm{LN}(x^f)), \qquad \hat{x}^f = y^f + \mathrm{FFN}(\mathrm{LN}(y^f))$$

    where $\mathrm{MSA}$ denotes multi-head self-attention, $\mathrm{FFN}$ is a feed-forward network, and $\mathrm{LN}$ is layer normalization.

  2. Text Branch ($\mathcal{E}_T$):

    • Input: replicated $x^t$ to $\{x^t_0, x^t_1\} \in \mathbb{R}^{2 \times 77 \times 768}$
    • Processing: $x^t$ undergoes linear projection $P_t(\cdot)$, followed by a transformer block, yielding $\hat{x}^t \in \mathbb{R}^{m \times 768}$.
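The pre-LN residual block shared by both branches can be sketched as follows. This is a minimal single-head NumPy stand-in with random, untrained weights, not the paper's multi-head implementation; dimensions follow the facial-parameter branch:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(z):
    z = z - z.max(-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(-1, keepdims=True)

def self_attention(x, Wq, Wk, Wv):
    # single-head scaled dot-product attention
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    return softmax(scores) @ v

def encoder_block(x, Wq, Wk, Wv, W1, b1, W2, b2):
    # y^f = x^f + MSA(LN(x^f))
    y = x + self_attention(layer_norm(x), Wq, Wk, Wv)
    # x-hat^f = y^f + FFN(LN(y^f)); FFN here is a two-layer ReLU MLP
    h = np.maximum(layer_norm(y) @ W1 + b1, 0)
    return y + h @ W2 + b2

rng = np.random.default_rng(0)
m, d, d_ff = 8, 53, 128
x_f = rng.normal(size=(m, d))
Wq, Wk, Wv = (rng.normal(scale=0.1, size=(d, d)) for _ in range(3))
W1, b1 = rng.normal(scale=0.1, size=(d, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(scale=0.1, size=(d_ff, d)), np.zeros(d)

x_hat = encoder_block(x_f, Wq, Wk, Wv, W1, b1, W2, b2)
assert x_hat.shape == (m, d)  # residual block preserves the sequence shape
```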

Cross-Attention Fusion (CAFT):

  • The latent representations $\hat{x}^f$ and $\hat{x}^t$ serve as the basis for fused feature computation via dual cross-attention:

$$y_c^f = h^{f2t}(\hat{x}^f_0), \quad y_c^t = h^{t2f}(\hat{x}^t_0), \quad x_o^f = g^{t2f}\left(\mathrm{CA}(y_c^f \otimes x^t_1) + y_c^f\right), \quad x_o^t = g^{f2t}\left(\mathrm{CA}(y_c^t \otimes x^f_1) + y_c^t\right)$$

where $h^{f2t}$ and $h^{t2f}$ are projections between modalities, and $g^{t2f}$ and $g^{f2t}$ are back-projections to the original dimensions. $\mathrm{CA}$ is standard multi-head cross-attention. The layer-normalized outputs $x_o^f$ and $x_o^t$ are concatenated to form the fused feature $x^{\mathrm{fused}}$.
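One direction of the dual cross-attention (facial-to-text) can be sketched as below. This is a single-head sketch with random, untrained projections; reading $y_c^f \otimes x^t_1$ as "the projected facial features attend to the text features" is an interpretive assumption:

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(z):
    z = z - z.max(-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(-1, keepdims=True)

def cross_attention(q_seq, kv_seq, d_k=64):
    # queries from one modality, keys/values from the other (random weights)
    Wq = rng.normal(scale=0.1, size=(q_seq.shape[-1], d_k))
    Wk = rng.normal(scale=0.1, size=(kv_seq.shape[-1], d_k))
    Wv = rng.normal(scale=0.1, size=(kv_seq.shape[-1], q_seq.shape[-1]))
    q, k, v = q_seq @ Wq, kv_seq @ Wk, kv_seq @ Wv
    return softmax(q @ k.T / np.sqrt(d_k)) @ v

m, d_f, d_t = 8, 53, 768
x_f_hat = rng.normal(size=(m, d_f))  # facial branch output
x_t_1 = rng.normal(size=(m, d_t))    # text branch copy (assumed already at length m)

# h^{f2t}: project facial features into the text dimension
H_f2t = rng.normal(scale=0.1, size=(d_f, d_t))
y_c_f = x_f_hat @ H_f2t

# x_o^f = g^{t2f}(CA(y_c^f, x_1^t) + y_c^f): attend, add residual, back-project
G_t2f = rng.normal(scale=0.1, size=(d_t, d_f))
x_o_f = (cross_attention(y_c_f, x_t_1) + y_c_f) @ G_t2f
assert x_o_f.shape == (m, d_f)  # back in the facial-parameter dimension
```

The text-to-facial direction ($y_c^t$, $x_o^t$) is symmetric, with the roles of the two modalities swapped.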

3. Representation Decomposition and End-to-End Supervision

The fused feature $x^{\mathrm{fused}}$ is decomposed via two lightweight linear layers:

$$x^e_{\mathrm{emb}} = \mathcal{P}_e(x^{\mathrm{fused}}), \qquad x^p_{\mathrm{emb}} = \mathcal{P}_p(x^{\mathrm{fused}})$$

yielding the final conditional embeddings for expression and pose.

Within the I2FET module, these embeddings are used to condition two parallel VAE branches:

$$\mathcal{E}_e^t(e \otimes x^e_{\mathrm{emb}}) \rightarrow (\mu_e, \sigma_e), \qquad \mathcal{E}_p^t(\theta \otimes x^p_{\mathrm{emb}}) \rightarrow (\mu_p, \sigma_p)$$

Stochastic latent codes are sampled and linearly projected, then re-injected into IFED alongside the original text embedding, realizing a refinement mechanism before decoding.
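The latent sampling follows the standard VAE reparameterization trick; a minimal sketch, with the latent size and projection weights as placeholders:

```python
import numpy as np

rng = np.random.default_rng(2)

def reparameterize(mu, log_var):
    # z = mu + sigma * eps, eps ~ N(0, I): keeps sampling differentiable w.r.t. (mu, sigma)
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

d_z = 32                  # hypothetical latent dimensionality
mu = np.zeros(d_z)
log_var = np.zeros(d_z)   # sigma = exp(0.5 * 0) = 1

z = reparameterize(mu, log_var)
assert z.shape == (d_z,)

# linear projection before re-injection into IFED (placeholder weights,
# mapped here to the 50-dim expression space)
W_proj = rng.normal(scale=0.1, size=(d_z, 50))
z_proj = z @ W_proj
assert z_proj.shape == (50,)
```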

End-to-end supervision is implemented with the following composite loss:

  • Mean-squared reconstruction loss for expression and pose parameters.
  • VAE KL divergence.
  • FLAME-vertex reconstruction loss, which directly supervises 3D shape fidelity:

$$\mathcal{L}_v = \|v - \hat{v}\|_2^2, \qquad v = \mathcal{V}(\phi, e, \theta)$$
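With a toy linear blendshape model standing in for the FLAME vertex function $\mathcal{V}$ (the real model is nonlinear in pose and uses 5023 vertices; the basis matrices and parameter sizes below are placeholders), the vertex loss can be sketched as:

```python
import numpy as np

rng = np.random.default_rng(3)
n_verts = 100  # toy vertex count

# toy linear "blendshape" bases standing in for V(phi, e, theta)
B_shape = rng.normal(scale=0.01, size=(100, n_verts * 3))
B_exp = rng.normal(scale=0.01, size=(50, n_verts * 3))
B_pose = rng.normal(scale=0.01, size=(3, n_verts * 3))

def vertices(phi, e, theta):
    # mesh vertices as a linear function of shape, expression, and pose params
    return (phi @ B_shape + e @ B_exp + theta @ B_pose).reshape(n_verts, 3)

phi = rng.normal(size=100)
e_gt, theta_gt = rng.normal(size=50), rng.normal(size=3)
e_hat, theta_hat = e_gt + 0.1 * rng.normal(size=50), theta_gt

v = vertices(phi, e_gt, theta_gt)
v_hat = vertices(phi, e_hat, theta_hat)

# L_v = ||v - v_hat||_2^2 : directly supervises 3D shape fidelity
L_v = np.sum((v - v_hat) ** 2)
assert L_v >= 0.0
assert np.isclose(np.sum((v - v) ** 2), 0.0)  # loss vanishes for a perfect prediction
```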

4. Integration in Expression Transition and Temporal Smoothing

IFED, together with I2FET, generates two anchor frames at the initial and final timesteps. The source FLAME shape $\phi_s$, expression $e_s$, pose $\theta_s$, and camera $c_s$ are obtained with DECA. The resulting expression and pose coefficients, either predicted as $(\hat{e}_0, \hat{\theta}_0)$ and $(\hat{e}_1, \hat{\theta}_1)$ or interpolated via:

$$e^{(k)} = \delta\, e^{(l)} + (1-\delta)\, e^{(n)}, \qquad \theta^{(k)} = \delta\, \theta^{(l)} + (1-\delta)\, \theta^{(n)}$$

(for $\delta \in [0,1]$), yield a temporally smooth trajectory. Final rendering utilizes a FLAME-based mesh and a pre-trained neural renderer (e.g., ROME or CVTHead).
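The interpolation above can be sketched directly; sweeping $\delta$ from 0 to 1 traces a smooth trajectory between the two anchors:

```python
import numpy as np

def interpolate(e_l, e_n, theta_l, theta_n, delta):
    # e^{(k)} = delta * e^{(l)} + (1 - delta) * e^{(n)}, and likewise for theta
    return delta * e_l + (1 - delta) * e_n, delta * theta_l + (1 - delta) * theta_n

rng = np.random.default_rng(4)
e_l, e_n = rng.normal(size=50), rng.normal(size=50)
th_l, th_n = rng.normal(size=3), rng.normal(size=3)

# delta = 1 recovers the (l) anchor; delta = 0 recovers the (n) anchor
e_k, th_k = interpolate(e_l, e_n, th_l, th_n, 1.0)
assert np.allclose(e_k, e_l) and np.allclose(th_k, th_l)

# sweeping delta over [0, 1] yields a smooth sequence of intermediate frames
trajectory = [interpolate(e_l, e_n, th_l, th_n, d)[0] for d in np.linspace(0.0, 1.0, 10)]
assert len(trajectory) == 10
```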

5. Empirical Results and Quantitative Performance

Evaluation utilizes both the CK+ (26,352 samples) and CelebV-HQ (28,335 samples) datasets, augmented with a range of expression instructions and identities. I2FET is trained for 200 epochs (batch size 128, Adam optimizer, learning rate $8 \times 10^{-4}$), with a 10% test split and repeated trials for mean±std reporting. ResNet-101 (with re-weighted focal loss) is employed for downstream expression classification.

Key results are presented in the table below, contrasting “MotionClip” and the IFED-enabled system (“Ours (I2FET + IFED + vertex loss)”):

| Metric | CK+ MotionClip | CK+ Ours | CelebV-HQ MotionClip | CelebV-HQ Ours |
| --- | --- | --- | --- | --- |
| Acc₁ | 52% | 91.44% | 40% | 58.24% |
| Acc₂ | 20% | 84.03% | 13.7% | 33.45% |
| G-mean | 40.48% | 80.30% | 34.4% | 46.47% |
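Assuming G-mean here denotes the geometric mean of per-class recalls (a common definition for imbalanced classification; an assumption, since the source does not define it), it can be computed from a confusion matrix as:

```python
import numpy as np

def g_mean(confusion):
    # per-class recall: diagonal / row sums (rows = true classes)
    recalls = np.diag(confusion) / confusion.sum(axis=1)
    return float(np.prod(recalls) ** (1.0 / len(recalls)))

# toy 3-class confusion matrix (not from the paper)
cm = np.array([[8, 1, 1],
               [2, 6, 2],
               [1, 1, 8]])

# per-class recalls are 0.8, 0.6, 0.8, so G-mean = (0.8 * 0.6 * 0.8)^(1/3)
assert np.isclose(g_mean(cm), (0.8 * 0.6 * 0.8) ** (1 / 3))
```

Unlike plain accuracy, a single poorly recalled class drags the geometric mean down sharply, which is why it is favored for imbalanced expression datasets.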

Removing IFED leads to substantial drops in classification accuracy (CK+ Acc₁ falls to ~76.9%), and performance is fully restored when IFED is reinstated. t-SNE analysis and confusion matrices indicate improved class separation and representation clustering when IFED is present (Vo et al., 13 Jan 2026).

6. Contextualization and Implications

In summary, IFED serves as a lightweight, cross-attention-based transformer that delivers effective multimodal fusion between semantic text and low-level facial parameters. Injecting these conditional embeddings at both the encoder and decoder stages—enabling “refinement”—greatly enhances the system’s ability to interpret arbitrary textual instructions (e.g., “Turn this face from disgust to happiness”) and generate smooth, expressive facial trajectories. This suggests wide application potential in 3D avatar animation, digital human interaction, and emotion-driven simulation (Vo et al., 13 Jan 2026). The demonstrated ability to use natural language instructions to expand the repertoire of synthesized facial expressions and their transitions represents a substantial advancement in controllable, instruction-driven 3D facial animation.
