
FLAME-in-NeRF: Mesh-Conditioned Neural Rendering

Updated 4 March 2026
  • FLAME-in-NeRF is a neural rendering framework that fuses low-dimensional facial control from morphable models with volumetric rendering for precise semantic manipulation.
  • It employs mask-guided feature aggregation and mesh-based conditioning to decouple facial expressions from background details, enabling artifact-free free-view synthesis and flare removal.
  • The framework leverages specialized modules and task-specific losses to achieve superior PSNR and SSIM compared to baseline NeRF methods in face animation and multi-view editing.

FLAME-in-NeRF is a neural rendering framework that integrates morphable face models with neural radiance fields, combining volumetric rendering with explicit, low-dimensional facial control. The term “FLAME-in-NeRF” originally referred to neural control of radiance fields for free-view face animation (Athar et al., 2021), but the designation has also been adopted for other frameworks involving explicit FLAME mesh conditioning within NeRF pipelines, as in NeRFlame (Zając et al., 2023). More recently, the approach has been generalized (as GN-FR) to address challenging ill-posed imaging problems such as multi-view lens flare removal (Matta et al., 2024).

The unifying principle is the conditioning of an implicit NeRF scene representation using external, typically mesh-based priors. This enables fine-grained control (“expression editing,” “flare masking”) or high-fidelity recovery of scene statistics that would otherwise be entangled within an unconstrained volume. FLAME-in-NeRF unifies semantic control from 3D morphable models (FLAME) with the photorealistic rendering and flexibility of neural radiance fields (NeRF).

1. Historical Development and Motivations

Traditional 3D morphable models, such as FLAME (“Faces Learned with an Articulated Model and Expressions”), use explicit mesh representations parameterized by low-dimensional codes for facial shape, expression, and pose. These models provide semantic and geometric control but lack photorealistic detail and plausible novel view synthesis. In contrast, NeRF and related volumetric neural representations achieve high-fidelity, view-dependent appearance for static or dynamic scenes, but do not permit explicit semantic control, leaving pose, identity, expression, and scene content entangled.

FLAME-in-NeRF arose to bridge this gap, initially for semantic face animation and subsequently for a broader class of scene manipulation problems where imposing a spatial or semantic prior on the neural volume enables new forms of control or artifact removal (Athar et al., 2021, Zając et al., 2023, Matta et al., 2024).

2. System Architecture and Conditioning Mechanisms

FLAME-in-NeRF frameworks consist of a modified NeRF architecture that incorporates external priors via conditioning variables or mask-based control. The foundational approaches can be categorized as follows:

  • Expression-Controlled NeRF: Each 3D sample point x incorporates both its spatial position and a corresponding expression code e, derived from fitted FLAME parameters for each training frame. Points outside the facial mesh receive a zeroed expression code, preventing leakage of facial control into background or hair. The NeRF MLP thus becomes

F_\theta: (x, d, e \cdot M(x)) \mapsto (\sigma, c),

where M(x) is a binary face mask (Athar et al., 2021).
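The expression-gated input assembly can be sketched as follows; this is a minimal illustration assuming NumPy arrays and a precomputed per-point face mask (the function and parameter names are hypothetical, not from the paper):

```python
import numpy as np

def conditioned_input(x, d, e, inside_face_mask):
    """Build the conditioned NeRF input (x, d, e * M(x)).

    x: (N, 3) sample positions; d: (N, 3) view directions;
    e: (E,) per-frame FLAME expression code;
    inside_face_mask: (N,) boolean M(x), True inside the face region.
    The expression code is zeroed for points outside the face, so the
    MLP cannot route expression information into background or hair.
    """
    m = inside_face_mask.astype(np.float32)[:, None]  # (N, 1)
    e_gated = m * e[None, :]                          # (N, E), zero outside face
    return np.concatenate([x, d, e_gated], axis=-1)   # (N, 3 + 3 + E)
```

The gating happens before the MLP sees the point, so disentanglement is enforced architecturally rather than only through the loss.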

  • Mesh-Based Explicit Density Conditioning: The FLAME mesh is rendered into the volume as a band of nonzero density. All 3D points within a distance \epsilon of the mesh are assigned density

\sigma_{\text{explicit}}(x; M) = \begin{cases} 0 & d(x, M) > \epsilon \\ 1 - d(x, M)/\epsilon & \text{otherwise} \end{cases}

This explicit density field "pins" the volumetric representation to the mesh. An MLP predicts only color (and, in a later phase, a residual density), with mesh parameters trained jointly for alignment (Zając et al., 2023).
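The piecewise-linear density band above is straightforward to implement; a minimal sketch, assuming point-to-mesh distances have already been computed:

```python
import numpy as np

def explicit_density(dist_to_mesh, eps):
    """Piecewise-linear density band around the FLAME mesh.

    dist_to_mesh: (N,) distances d(x, M) from sample points to the mesh.
    Returns 0 beyond eps and 1 - d/eps inside the band, matching the
    explicit density definition above.
    """
    d = np.asarray(dist_to_mesh, dtype=np.float32)
    return np.where(d > eps, 0.0, 1.0 - d / eps)
```

Density is maximal on the mesh surface and falls off linearly to zero at the band edge, which is what ties the rendered volume to the mesh geometry.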

  • Multi-View Mask-Guided Feature Aggregation: For tasks such as flare removal, the system employs learned binary masks to control both the selection of minimally corrupted views and the attention mechanism over rays and 3D points during feature aggregation. This architecture leverages Generalizable NeRF Transformer (GNT) blocks, augmented with modules for mask prediction, view sampling, and point masking (Matta et al., 2024).
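The attention-masking idea in the third mechanism can be illustrated with a toy softmax over per-point logits; the real GNT blocks operate on learned features, so this is only a schematic (function and parameter names are assumptions):

```python
import numpy as np

def masked_attention_weights(scores, valid):
    """Attention over sampled points with invalid (flare or
    out-of-mask) points excluded before the softmax.

    scores: (N,) raw attention logits for points along a ray.
    valid: (N,) boolean mask; False entries get exactly zero weight.
    """
    logits = np.where(valid, scores, -np.inf)
    logits = logits - logits[valid].max()  # numerical stability
    w = np.exp(logits)
    return w / w.sum()
```

Setting logits to negative infinity (rather than zeroing weights afterward) keeps the remaining weights properly normalized.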

3. Training Objectives, Losses, and Mask Consistency

FLAME-in-NeRF frameworks use a combination of photometric reconstruction, mask consistency, and task-specific losses to disentangle semantic controls from background and other nuisance factors:

  • Photometric loss L_{\text{photo}}: pixelwise reconstruction fidelity; L_{\text{photo}} = \|C(r; e) - C_{\text{gt}}(r)\|_2^2.
  • Mask consistency loss L_{\text{mask}}: ensures background and hair do not respond to expressions; L_{\text{mask}} = \lambda_{\text{mask}} \|C(r; e) - C(r; 0)\|^2, applied outside the face region.
  • Unsupervised masking loss L_{\text{unsup}}: supervises only valid, unoccluded regions (flare removal); L_{\text{unsup}} = \|(1 - M) \circ \text{Pred} - (1 - M) \circ \text{Target}\|_2^2.
  • Regularization: smoothness and weight decay; total variation on \sigma plus weight decay on network parameters.

The mask-based losses are central to disentangling the effects of conditioning variables, guaranteeing that semantic edits affect only the intended regions and that artifacts such as lens flare are not learned as scene appearance (Athar et al., 2021, Matta et al., 2024).
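The three mask-aware losses above can be combined as in the following sketch; the function signature and the weighting value are illustrative assumptions, not values from the papers:

```python
import numpy as np

def combined_loss(pred, gt, pred_neutral, face_mask, flare_mask, lam=0.1):
    """Illustrative combination of the losses listed above.

    pred:         (H, W, 3) render C(r; e) under the current expression.
    gt:           (H, W, 3) ground-truth image.
    pred_neutral: (H, W, 3) render C(r; 0) with a zeroed expression code.
    face_mask:    (H, W) 1 inside the face region, 0 outside.
    flare_mask:   (H, W) 1 where the target is flare-corrupted.
    """
    # Photometric loss: pixelwise reconstruction fidelity.
    l_photo = np.mean((pred - gt) ** 2)
    # Mask consistency: outside the face, the render must not react to e.
    outside = (1.0 - face_mask)[..., None]
    l_mask = lam * np.mean((outside * (pred - pred_neutral)) ** 2)
    # Unsupervised masking: supervise flare-free pixels only.
    keep = (1.0 - flare_mask)[..., None]
    l_unsup = np.mean((keep * pred - keep * gt) ** 2)
    return l_photo + l_mask + l_unsup
```

Note that the mask-consistency term requires a second render with the expression code zeroed, which is how "background must not respond to expressions" becomes a differentiable penalty.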

4. Specialized Modules and Implementation Details

Several architectural modules and training routines have been developed in the literature to support the unique requirements of each FLAME-in-NeRF variant.

  • Flare-occupancy Mask Generation (FMG): A PSPNet-based model with ResNet encoder and pyramid pooling decoder predicts binary masks of flare-affected regions. Training is performed on both synthetic data (80 real flare patterns applied to 24k Flickr images) and real, annotated flare scenes (782 images, 17 scenes) (Matta et al., 2024).
  • View Sampler (VS): Selects minimally corrupted views based on precomputed flare occupancy ratios, discarding any with high flare coverage (Matta et al., 2024).
  • Point Sampler (PS): Implements attention masking within transformer blocks, setting weights to zero for 3D points projecting into flare or out-of-mask locations (Matta et al., 2024).
  • FLAME Mesh Fitting and Conditioning: Mesh parameters (β,ψ,ϕ)(\beta, \psi, \phi) are estimated per frame, providing continuous control over identity, expression, and pose. For animation, affine maps bring sample points in deformed meshes back to canonical coordinates for color prediction (Zając et al., 2023).
  • Expression Vector Routing: For each sample point, the per-frame expression vector is multiplied by the spatial mask, zeroing it outside the face (Athar et al., 2021).
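The View Sampler's selection rule reduces to ranking views by precomputed flare occupancy; a minimal sketch, with threshold values that are illustrative rather than from the paper:

```python
def select_views(flare_ratios, max_ratio=0.3, k=8):
    """Keep at most k views whose flare occupancy ratio (fraction of
    flare-covered pixels) is below max_ratio, preferring the least
    corrupted ones.

    flare_ratios: dict mapping view index -> occupancy ratio in [0, 1].
    """
    usable = sorted((r, v) for v, r in flare_ratios.items() if r <= max_ratio)
    return [v for _, v in usable[:k]]
```

Because the ratios are precomputed from the FMG masks, this filtering costs nothing during training.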

Representative pseudocode for training FLAME-in-NeRF (GN-FR) for flare removal is provided in (Matta et al., 2024), including precomputation of masks, sampling of rays and points, masked feature aggregation, and unsupervised masking loss.
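A condensed, schematic stand-in for such a loop is given below; it strings the stages together on a deliberately toy "scene" (a single global color), and every name, threshold, and the mask heuristic are assumptions, not the authors' code:

```python
import numpy as np

def precompute_masks(images):
    # Stand-in for the FMG network: flag near-saturated pixels as flare.
    return [(img.mean(axis=-1) > 0.95).astype(np.float32) for img in images]

def fit_scene(images, n_steps=50, lr=0.1):
    masks = precompute_masks(images)
    # View sampling: discard heavily corrupted views.
    usable = [i for i, m in enumerate(masks) if m.mean() < 0.3]
    color = np.zeros(3)  # toy scene parameter: one global mean color
    for step in range(n_steps):
        i = usable[step % len(usable)]
        keep = (1.0 - masks[i])[..., None]
        # Masked photometric gradient: flare pixels contribute nothing.
        grad = 2.0 * (color - images[i]) * keep
        color -= lr * grad.mean(axis=(0, 1))
    return color
```

The structure (precompute masks, filter views, exclude masked pixels from the gradient) mirrors the pipeline described above, even though the "renderer" here is trivial.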

5. Dataset Construction and Quantitative Evaluation

Dedicated datasets are constructed to support evaluation and training:

  • Face Animation: RGB video of subjects with varied facial expressions and known camera poses. FLAME fitters provide per-frame expression and shape parameters (Athar et al., 2021).
  • Flare Removal: 3D multi-view flare dataset with 80 real flare patterns (halo, streak, color bleeding, scattering), each manually annotated; 24,000 synthetic flare images for FMG training; 17 real scenes with 782 images for evaluation and fine-tuning (Matta et al., 2024).

Quantitative results demonstrate the efficacy of FLAME-in-NeRF variants. For flare removal:

  • GN-FR achieves PSNR = 26.18 dB, SSIM = 0.882, LPIPS = 0.034 on synthetic flare benchmarks, outperforming both baseline NeRF and generic GNT pipelines (Matta et al., 2024).

For face animation and editing:

  • FLAME-in-NeRF achieves PSNR ≈ 26.8 dB, SSIM ≈ 0.92 compared to vanilla NeRF (PSNR ≈ 24 dB, SSIM ≈ 0.88), with sharper facial details and disentangled control (Athar et al., 2021).
  • NeRFlame (Zając et al., 2023) achieves PSNR ≈ 30–33 dB for single-subject novel view synthesis with explicit semantic control.

Ablation studies confirm that mask-based routing, view and point sampling, and explicit mesh conditioning are essential for robust disentanglement and artifact-free synthesis. The use of ground-truth versus learned flare masks yields negligible (<0.1 dB) difference, validating the effectiveness of mask predictors (Matta et al., 2024).
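The PSNR figures quoted above follow the standard definition over intensity-normalized images; a minimal reference implementation:

```python
import numpy as np

def psnr(pred, gt, max_val=1.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(max_val^2 / MSE)."""
    mse = np.mean((np.asarray(pred) - np.asarray(gt)) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)
```

A uniform error of 0.1 on [0, 1] images gives 20 dB, so the ~2-3 dB gaps reported between methods correspond to substantial reductions in mean squared error.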

6. Impact, Limitations, and Extensions

FLAME-in-NeRF has established a foundation for robust integration of semantic, spatial, and prior-driven control within neural volumetric rendering. Immediate applications include high-fidelity face animation with explicit expression and pose editing, compositional flare removal, and other multi-view editing or restoration tasks where mask or mesh-based priors are available.

Limitations include:

  • The absence of explicit interior geometry (e.g., tongue, teeth) in morphable models, leading to artifacts for extreme expressions or open mouths (Zając et al., 2023).
  • The trade-off in mesh-based conditioning, where small support bands improve mesh fitting but reduce context for color prediction, while larger bands weaken spatial specificity.
  • Single-subject training for some methods, requiring per-person model optimization.

Extensions proposed in the literature include joint learning of universal models that generalize across identities, volumetric modeling of interior facial anatomy, incorporation of FLAME priors for robust alignment, and acceleration via multi-resolution grids (Zając et al., 2023). A plausible implication is that further advances in mask prediction and mesh-based conditioning could enable FLAME-in-NeRF workflows for unconstrained, in-the-wild scene domains.

FLAME-in-NeRF is positioned at the intersection of mesh-conditioned neural rendering and generalizable radiance field methods. Whereas traditional NeRF approaches treat all volumetric content as unconditional, FLAME-in-NeRF and related variants directly infuse strong semantic priors—or occlusion and artifact masks—into the neural architecture. This strategy is being extended to general artifact removal, controllable avatar synthesis, and robust unsupervised scene understanding.

Notably, the GN-FR pipeline for flare removal (Matta et al., 2024) represents the first work addressing lens flare removal explicitly within a NeRF-based multi-view framework, establishing new benchmarks for artifact-robust scene synthesis. Earlier works such as NeRFlame (Zając et al., 2023) and FLAME-in-NeRF (Athar et al., 2021) provide a blueprint for integrating morphable-model-based control and achieving disentangled free-view facial animation and editing.
