FLAME-in-NeRF: Free-view Face Animation
- The paper demonstrates that integrating FLAME's explicit 3D face parameterization with neural radiance fields enables precise control over facial expressions and free-view synthesis.
- It employs expression codes and FiLM modulation within the NeRF architecture to achieve high photorealism while preserving background consistency.
- Spatial priors using FLAME occupancy fields ensure effective disentanglement of facial features from the background, yielding improvements in metrics such as PSNR and SSIM.
FLAME-in-NeRF refers to a family of neural rendering techniques that integrate the explicit, low-dimensional face modeling capacity of the FLAME (Faces Learned with an Articulated Model and Expressions) 3D morphable model (3DMM) with the high-fidelity, free-viewpoint scene synthesis capabilities of neural radiance fields (NeRF). This approach enables explicit control over a subject's facial expressions, identity, and pose within the neural radiance field, making it possible to synthesize photorealistic novel-view portrait videos with disentangled scene and face manipulation. FLAME-in-NeRF leverages FLAME's compact parameterization as an input to the NeRF conditioning mechanism, and applies spatial priors to ensure disentanglement of facial and background regions, yielding a system capable of both high realism and direct parametric facial editing (Athar et al., 2021).
1. The Role of FLAME and 3D Morphable Models in Neural Radiance Fields
FLAME is a widely adopted 3DMM designed to generate fully controllable human face meshes. The model parameterizes facial shape, expression, and pose as follows:
- $\boldsymbol{\beta}$: identity (shape) coefficients
- $\boldsymbol{\psi}$: expression coefficients
- $\boldsymbol{\theta}$: pose parameters (jaw, neck, eyeballs)
The output is a triangular mesh encoding the 3D geometry of the face for a particular identity-expression-pose configuration.
In FLAME-in-NeRF, the FLAME parameterization yields a low-dimensional "expression code" $\boldsymbol{\psi}$ for each video frame or sequence. This code serves as a control input, allowing the radiance field to deform and relight the head in accordance with arbitrary facial expressions. The explicit mesh produced by FLAME also serves as a structural prior for the radiance field (Athar et al., 2021).
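To make the parameterization concrete, here is a minimal NumPy sketch of the linear blendshape step behind a FLAME-style 3DMM. The random bases and the name `flame_like_mesh` are illustrative assumptions; only the dimensions (5023 vertices, 300 shape and 100 expression components) match the released FLAME model, and pose blendshapes plus linear blend skinning are omitted.

```python
import numpy as np

# Illustrative bases; only the dimensions (5023 vertices, 300 shape and
# 100 expression components) match the released FLAME model.
N_VERTS, N_SHAPE, N_EXPR = 5023, 300, 100
rng = np.random.default_rng(0)
template = rng.standard_normal((N_VERTS, 3))              # mean face
shape_basis = rng.standard_normal((N_VERTS, 3, N_SHAPE))  # B_shape
expr_basis = rng.standard_normal((N_VERTS, 3, N_EXPR))    # B_exp

def flame_like_mesh(beta: np.ndarray, psi: np.ndarray) -> np.ndarray:
    """Linear blendshape step of a FLAME-style 3DMM (pose blendshapes
    and linear blend skinning omitted)."""
    return template + shape_basis @ beta + expr_basis @ psi

verts = flame_like_mesh(np.zeros(N_SHAPE), 0.1 * rng.standard_normal(N_EXPR))
print(verts.shape)  # (5023, 3)
```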
2. Neural Radiance Field Conditioning with Expression Codes
The standard NeRF formulation represents the scene as a continuous function parameterized by a multi-layer perceptron (MLP):

$$F_\Theta : (\mathbf{x}, \mathbf{d}) \mapsto (\sigma, \mathbf{c}),$$

where $\mathbf{x} \in \mathbb{R}^3$ is the spatial location, $\mathbf{d}$ the view direction, $\sigma$ the volume density, and $\mathbf{c}$ the RGB color.
In FLAME-in-NeRF, the output of the volumetric MLP is conditioned not only on $\mathbf{x}$ and $\mathbf{d}$, but also on the FLAME-derived expression code $\boldsymbol{\psi}$:

$$F_\Theta : (\mathbf{x}, \mathbf{d}, \boldsymbol{\psi}) \mapsto (\sigma, \mathbf{c}).$$

Expression control is injected using a feature-wise linear modulation (FiLM) mechanism. The code $\boldsymbol{\psi}$ is mapped by a small MLP to a scale-shift pair $(\boldsymbol{\gamma}, \boldsymbol{\beta})$, and within each block of the NeRF architecture the activations are modulated:

$$\mathbf{h}' = \boldsymbol{\gamma} \odot \mathbf{h} + \boldsymbol{\beta},$$

where $\boldsymbol{\gamma}$ and $\boldsymbol{\beta}$ are learnable projections of $\boldsymbol{\psi}$, and $\mathbf{h}$ is the intermediate feature vector.
This direct parametric conditioning, as opposed to the indirect deformation-field approach (e.g., Nerfies), allows the system to synthesize new facial configurations at inference by adjusting $\boldsymbol{\psi}$ alone, giving explicit, disentangled control over facial animation (Athar et al., 2021).
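A minimal PyTorch sketch of one FiLM-modulated MLP block of the kind described above; the layer widths, the 50-dimensional expression code, and the ReLU placement are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class FiLMBlock(nn.Module):
    """One NeRF MLP block whose activations are modulated by a
    (scale, shift) pair predicted from the expression code psi."""
    def __init__(self, feat_dim: int, expr_dim: int):
        super().__init__()
        self.linear = nn.Linear(feat_dim, feat_dim)
        # Small MLP mapping psi to the FiLM parameters (gamma, beta).
        self.film = nn.Sequential(
            nn.Linear(expr_dim, 64), nn.ReLU(),
            nn.Linear(64, 2 * feat_dim),
        )

    def forward(self, h: torch.Tensor, psi: torch.Tensor) -> torch.Tensor:
        gamma, beta = self.film(psi).chunk(2, dim=-1)
        return torch.relu(gamma * self.linear(h) + beta)  # h' = gamma ⊙ h + beta

# Toy usage: 1024 sample points with 256-dim features, 50-dim expression code
# (all widths are assumptions for illustration).
block = FiLMBlock(feat_dim=256, expr_dim=50)
h = torch.randn(1024, 256)
psi = torch.randn(50).expand(1024, 50)  # one code broadcast to every sample
print(block(h, psi).shape)              # torch.Size([1024, 256])
```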
3. Spatial Priors and Disentanglement of Face and Background
A central challenge in conditioning NeRF with facial expression codes is to prevent entanglement of head and background appearance. Without constraints, the network may encode expression-dependent variation in the scene background, violating the desired separation.
FLAME-in-NeRF introduces a spatial prior through a binary or soft occupancy field $O(\mathbf{x}) \in [0, 1]$, generated by ray marching the canonical FLAME mesh (in the reference expression) into the world frame. The occupancy masks the face region. To enforce head/background disentanglement, the network attenuates or nullifies the FiLM modulation for points outside the facial region by multiplying the modulation scale by $O(\mathbf{x})$:
- In the background ($O(\mathbf{x}) = 0$), expression-code dependence is zeroed, ensuring the background remains invariant to facial expression changes.
- In the face region ($O(\mathbf{x}) = 1$), full modulation is applied, allowing the network to model expression-dependent appearance and geometry (Athar et al., 2021).
This spatial gating approach guarantees that varying the FLAME code only influences the synthesized face, not the underlying scene, yielding more robust and semantically meaningful controls for synthesis and animation.
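The gating can be sketched as follows. The particular formulation below, which interpolates toward an identity modulation as $O(\mathbf{x}) \to 0$, is an assumption consistent with the description above, not the paper's precise equations.

```python
import torch

def gated_film(h: torch.Tensor, gamma: torch.Tensor, beta: torch.Tensor,
               occupancy: torch.Tensor) -> torch.Tensor:
    """Spatially gated FiLM: the expression modulation is attenuated by the
    FLAME occupancy O(x). With O=0 the block reduces to the identity
    modulation (scale 1, shift 0), so background features carry no
    expression dependence; with O=1 full modulation applies."""
    o = occupancy.unsqueeze(-1)          # (N, 1), broadcast over feature dim
    gamma_eff = 1.0 + o * (gamma - 1.0)  # O = 0  =>  scale 1
    beta_eff = o * beta                  # O = 0  =>  shift 0
    return gamma_eff * h + beta_eff

# Points outside the face mask keep their features untouched.
h = torch.randn(4, 8)
gamma, beta = torch.randn(4, 8), torch.randn(4, 8)
occ = torch.tensor([1.0, 1.0, 0.0, 0.0])    # first two samples lie on the face
out = gated_film(h, gamma, beta, occ)
assert torch.allclose(out[2:], h[2:])       # background rows are unchanged
```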
4. Training Data, Loss Functions, and Optimization
The FLAME-in-NeRF pipeline requires minimal data acquisition: a 10–30 second handheld portrait video, with moderate head movement, recorded on a standard mobile device. Preprocessing involves:
- Camera pose estimation for every frame using structure-from-motion (e.g., COLMAP).
- Per-frame FLAME fitting to extract the expression code $\boldsymbol{\psi}$ and a canonical head mesh (a toy fitting sketch follows below).
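Per-frame fitting of this kind is typically posed as gradient-based optimization of the code against detected 2-D landmarks. The sketch below substitutes a toy linear landmark model for the real differentiable FLAME-plus-camera pipeline; every name and dimension in it is an illustrative assumption.

```python
import torch

# Hypothetical linear landmark model standing in for the real differentiable
# FLAME-plus-camera pipeline: 68 2-D landmarks as a function of psi.
N_LMK, N_EXPR = 68, 50
torch.manual_seed(0)
lmk_basis = torch.randn(N_LMK * 2, N_EXPR) * 0.01
lmk_mean = torch.rand(N_LMK * 2)

def landmarks(psi: torch.Tensor) -> torch.Tensor:
    return lmk_mean + lmk_basis @ psi

target = landmarks(torch.randn(N_EXPR))   # "detected" landmarks (synthetic)

# Gradient-based fit of the expression code, with a small Tikhonov prior.
psi = torch.zeros(N_EXPR, requires_grad=True)
opt = torch.optim.Adam([psi], lr=0.1)
for _ in range(200):
    opt.zero_grad()
    loss = ((landmarks(psi) - target) ** 2).mean() + 1e-4 * (psi ** 2).sum()
    loss.backward()
    opt.step()
print(f"final landmark error: {loss.item():.6f}")
```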
The loss functions typically include:
- Photometric loss:
  $$\mathcal{L}_{\text{photo}} = \sum_{\mathbf{r} \in \mathcal{R}} \left\lVert \hat{C}(\mathbf{r}) - C(\mathbf{r}) \right\rVert_2^2,$$
  ensuring synthesis matches the input images at the pixel level.
- Optional perceptual loss (e.g., VGG-based $\mathcal{L}_{\text{perc}}$).
- Tikhonov regularization on expression codes: $\mathcal{L}_{\text{reg}} = \lVert \boldsymbol{\psi} \rVert_2^2$.
- Disentanglement loss:
  $$\mathcal{L}_{\text{dis}} = \sum_{\mathbf{x} : O(\mathbf{x}) = 0} \left\lVert \mathbf{c}(\mathbf{x}, \mathbf{d}, \boldsymbol{\psi}_1) - \mathbf{c}(\mathbf{x}, \mathbf{d}, \boldsymbol{\psi}_2) \right\rVert_2^2,$$
  enforcing invariance in the background region under changes in $\boldsymbol{\psi}$.
- FLAME fitting loss $\mathcal{L}_{\text{FLAME}}$ on the per-frame parameter fits.
Optimization typically proceeds with Adam for 150k–300k steps, with the learning rate decayed over training, yielding convergence in about 2–3 days on a Tesla V100. Per iteration, the system samples ∼1024 rays and 64 depth samples per ray (Athar et al., 2021).
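A hedged sketch of how the listed terms might be combined per iteration; the loss weights, tensor shapes, and the 50-dimensional expression code are illustrative assumptions, not the paper's values.

```python
import torch

def training_loss(pred_rgb, gt_rgb, psi, bg_a, bg_b,
                  w_reg=1e-3, w_dis=1e-2):
    """Combines the terms listed above. The weights are illustrative
    assumptions; bg_a / bg_b are background pixels rendered under two
    different expression codes, whose difference the disentanglement
    term penalizes."""
    l_photo = ((pred_rgb - gt_rgb) ** 2).mean()  # photometric L2
    l_reg = (psi ** 2).sum()                     # Tikhonov prior on psi
    l_dis = ((bg_a - bg_b) ** 2).mean()          # background invariance
    return l_photo + w_reg * l_reg + w_dis * l_dis

# Shapes mirroring the sampling described above: ~1024 rays per iteration,
# RGB targets, and an expression code (50 dims assumed for illustration).
psi = torch.randn(50, requires_grad=True)
loss = training_loss(torch.rand(1024, 3), torch.rand(1024, 3), psi,
                     torch.rand(256, 3), torch.rand(256, 3))
loss.backward()  # in a real pipeline this also updates the NeRF MLP
```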
5. Qualitative and Quantitative Outcomes
FLAME-in-NeRF supports explicit modification of facial expressions by re-specifying the expression code $\boldsymbol{\psi}$ at test time, which can be applied to synthesize:
- Expression transfer (e.g., transferring a smile onto a neutral head at a novel viewpoint).
- Arbitrary editing (e.g., combined brow raise and jaw drop in unseen angles).
Quantitative evaluation employs PSNR, SSIM, and LPIPS metrics on held-out test frames, as well as expression accuracy via correlations between rendered and ground-truth action units (AUs). Compared to baseline methods:
- Typical improvements include +2–3 dB PSNR over deformable NeRFs (e.g., Nerfies) and +0.05 SSIM, indicating higher photometric and structural fidelity.
- Qualitative improvements include sharper, more realistic facial animation and background consistency compared to deferred neural rendering methods that lack free-view synthesis.
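For illustration, a small Python sketch of the photometric metrics on a held-out frame using scikit-image. The `channel_axis` argument assumes scikit-image ≥ 0.19; LPIPS would additionally require a learned-perceptual-metric package such as `lpips`. The images here are random stand-ins.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

# Random stand-ins for a held-out ground-truth frame and its rendering;
# float images in [0, 1] with shape (H, W, 3).
gt = np.random.rand(256, 256, 3)
pred = np.clip(gt + np.random.normal(scale=0.02, size=gt.shape), 0.0, 1.0)

psnr = peak_signal_noise_ratio(gt, pred, data_range=1.0)
ssim = structural_similarity(gt, pred, channel_axis=-1, data_range=1.0)
print(f"PSNR: {psnr:.2f} dB  SSIM: {ssim:.4f}")
```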
These outcomes reflect FLAME-in-NeRF's capacity to bridge detailed volumetric synthesis with interpretable, parametric face control (Athar et al., 2021).
6. Limitations and Failure Cases
Known constraints of FLAME-in-NeRF, as reported, include:
- Poor generalization to extreme expressions not represented in the training data, sometimes producing rendering artifacts such as "teeth popping" or volume collapse.
- Temporal artifacts (e.g., ghosting) in the background under rapid expression dynamics, due to the underlying NeRF static-world assumption.
- Imperfect rendering of highly specular regions (e.g., teeth, eyeglasses), manifesting as subtle flicker at novel viewpoints.
- Degraded head synthesis under large pose changes not covered by the training capture.
These limitations stem from a combination of the representational capacity of the NeRF, the expressivity of the FLAME prior, and the constraints imposed by limited, video-only capture (Athar et al., 2021).
7. Relationship to Related Approaches
FLAME-in-NeRF represents a distinct alternative to direct mesh-texture pipelines and implicit-deformation NeRF variants:
| Method | Expression Control | High-Frequency Detail | Background Synthesis |
|---|---|---|---|
| Vanilla NeRF | No | Yes | Yes |
| Nerfies (deformable NeRF) | Indirect (warp) | Yes | Yes |
| Deferred Neural Rendering | Yes (limited) | Yes | No |
| FLAME-in-NeRF | Explicit (3DMM) | Yes | Yes |
Unlike methods relying on direct mesh deformation or global deformation fields, FLAME-in-NeRF achieves explicit, low-dimensional, and semantically meaningful face control without sacrificing background realism. The spatial prior enforced by the FLAME occupancy field distinguishes FLAME-in-NeRF from approaches that suffer from facial-background entanglement or lack free-view background synthesis (Athar et al., 2021).
References
- "FLAME-in-NeRF: Neural control of Radiance Fields for Free View Face Animation" (Athar et al., 2021)