NeRSemble Benchmark
- The paper introduces a novel hash-ensemble dynamic neural radiance field representation that accurately captures fine-scale head details and temporal consistency.
- NeRSemble Benchmark is a curated dataset and protocol featuring precise multi-camera calibration, high-resolution capture, and robust preprocessing for dynamic human head reconstruction.
- The benchmark provides rigorous evaluation metrics including PSNR, SSIM, LPIPS, and JOD, and offers open-source code and standardized data splits for reproducible research.
The NeRSemble Benchmark is a comprehensive protocol and dataset for high-fidelity multi-view dynamic reconstruction of human heads using radiance fields. It establishes an experimental and evaluation standard for capturing, reconstructing, and rendering fine-grained head dynamics—including facial expressions, emotions, and speech—from time-synchronized, high-resolution, multi-camera video. The NeRSemble methodology integrates a unique hash-ensemble-based dynamic radiance field representation and provides open protocols, metrics, and data splits for quantitative comparison. The benchmark is accompanied by an extensive dataset and open-source codebase for reproducibility and extensibility (Kirschstein et al., 2023).
1. Data Acquisition and Organization
Capture Rig and Calibration
The NeRSemble dataset is generated from a custom hardware capture rig comprising 16 rigidly mounted industrial machine-vision cameras arranged in a 93° horizontal by 32° vertical arc. Camera intrinsics and extrinsics are calibrated using a fine checkerboard pattern and bundle adjustment, achieving sub-millimeter pose accuracy in synthetic reprojection. Illumination is provided by eight LED panels with diffusers to minimize specular highlights on skin.
Image Acquisition Specifications
- Resolution: 3208 × 2200 pixels (≈7.1 MP) per camera.
- Framerate: 73 fps, shutter speed: 3 ms (no perceptible motion blur).
- All cameras are synchronized using Precision Time Protocol (PTP) to sub-microsecond accuracy.
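As a concrete illustration of the calibration check mentioned above, the sketch below projects a 3D point through a pinhole model and measures reprojection error in pixels. The intrinsics, extrinsics, and corner position are illustrative values, not the rig's actual calibration.

```python
import numpy as np

def project(K, R, t, X):
    """Project a 3D world point X into pixel coordinates using a
    pinhole camera with intrinsics K and extrinsics (R, t)."""
    x_cam = R @ X + t            # world -> camera coordinates
    x_img = K @ x_cam            # camera -> homogeneous image coordinates
    return x_img[:2] / x_img[2]  # perspective divide -> pixel (u, v)

# Hypothetical calibration check: reprojection error of one checkerboard corner.
K = np.array([[8000.0,    0.0, 1604.0],
              [   0.0, 8000.0, 1100.0],
              [   0.0,    0.0,    1.0]])  # focal/principal point are illustrative
R, t = np.eye(3), np.array([0.0, 0.0, 1.0])
X = np.array([0.01, -0.02, 0.5])          # corner in world coordinates (meters)
detected = project(K, R, t, X) + 0.3      # simulated detection, offset by 0.3 px
err = np.linalg.norm(project(K, R, t, X) - detected)
```

Bundle adjustment minimizes exactly this kind of reprojection error summed over all corners and cameras.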
Dataset Statistics and Content
| Number of Subjects | Sequences per Subject | Total Sequences | Total Frames | Raw Disk Usage |
|---|---|---|---|---|
| 222 | 25 | 4,734 | 31.7 million | ≈203 TB |
Subjects are demographically diverse. Each subject contributes 25 short takes, which encompass:
- 9 scripted facial expressions.
- 4 emotion enactments.
- 1 fast hair-motion sequence.
- 10 prompted sentences (recorded with audio).
- 1 segment for free facial expression.
Each sequence contains 300–500 frames (all at 73 fps), totaling approximately 3 minutes of footage per individual.
Data Preprocessing
- Background subtraction: per-frame alpha masks via BackgroundMatting v2.
- Depth maps: per-view COLMAP multi-view stereo (depth estimates supported by fewer than 3 views are discarded).
- White balance and gamma: color checker calibration with small affine alignment using optimal transport.
- Downsampled training resolution: 1604 × 1100 px (factor 2), with full temporal resolution preserved.
- Train/test splits for novel view synthesis: 12 of 16 cameras used for training; 4 held-out, evenly spaced for evaluation, including extreme viewpoints. For speed, 15 uniformly spaced frames per sequence are selected for evaluation.
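The split logic described above can be sketched as follows. The helper `benchmark_split` and its index choices are illustrative; the official splits ship with the dataset.

```python
import numpy as np

def benchmark_split(n_cameras=16, n_holdout=4, n_frames=450, n_eval_frames=15):
    """Sketch of the benchmark split: hold out evenly spaced cameras
    (including the extreme viewpoints at both ends of the arc) and
    sample evenly spaced frames from the sequence for evaluation."""
    holdout = np.linspace(0, n_cameras - 1, n_holdout).round().astype(int)
    train = np.array([c for c in range(n_cameras) if c not in holdout])
    eval_frames = np.linspace(0, n_frames - 1, n_eval_frames).round().astype(int)
    return train, holdout, eval_frames

train, holdout, frames = benchmark_split()
```

With 16 cameras and 4 held out, evenly spaced selection naturally includes the first and last cameras, i.e. the extreme viewpoints.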
All 4,734 sequences, baseline code, and benchmark splits are publicly available at https://tobias-kirschstein.github.io/nersemble.
2. NeRSemble Model Architecture
Spatio-temporal Radiance Fields
The NeRSemble approach models the time-varying volumetric radiance field using the formulation:
- $\sigma(\mathbf{x}, t)$: density at 3D position $\mathbf{x}$ and time $t$,
- $\mathbf{c}(\mathbf{x}, \mathbf{d}, t)$: color at $\mathbf{x}$, view direction $\mathbf{d}$, and time $t$.
Volume rendering is carried out via:
$$C(\mathbf{r}, t) = \int_{s_n}^{s_f} T(s)\,\sigma(\mathbf{r}(s), t)\,\mathbf{c}(\mathbf{r}(s), \mathbf{d}, t)\,ds,$$
where $T(s) = \exp\!\left(-\int_{s_n}^{s} \sigma(\mathbf{r}(u), t)\,du\right)$ and $\mathbf{r}(s) = \mathbf{o} + s\,\mathbf{d}$ denotes a camera ray.
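In practice this integral is evaluated by numerical quadrature along each ray. A minimal NumPy sketch of the discrete alpha-compositing step:

```python
import numpy as np

def render_ray(sigmas, colors, deltas):
    """Discretized volume rendering along one ray: alpha-composite N
    samples with densities `sigmas` (N,), RGB `colors` (N, 3), and
    inter-sample distances `deltas` (N,)."""
    alphas = 1.0 - np.exp(-sigmas * deltas)                         # per-sample opacity
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas]))[:-1]  # transmittance T_i
    weights = trans * alphas
    return (weights[:, None] * colors).sum(axis=0)

# A ray hitting one dense red sample: the output should be almost pure red.
sigmas = np.array([0.0, 50.0, 0.0])
colors = np.array([[0, 0, 1], [1, 0, 0], [0, 1, 0]], dtype=float)
deltas = np.full(3, 0.1)
rgb = render_ray(sigmas, colors, deltas)
```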
Deformation Field
A learned canonical 3D field is paired with a deformation MLP $D$ that maps dynamic observations into the canonical space:
$$\mathbf{x}_{\text{can}} = D(\mathbf{x}, \boldsymbol{\omega}_t),$$
where $\boldsymbol{\omega}_t$ is a learned deformation code per frame.
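A minimal PyTorch sketch of such a deformation MLP is shown below. The layer sizes, code dimension, and residual-offset parameterization are illustrative assumptions (the paper's network emits an SE(3) warp rather than a raw offset):

```python
import torch
import torch.nn as nn

class DeformationField(nn.Module):
    """Sketch of D(x, omega_t): maps an observed 3D point plus a
    per-frame latent code to a canonical-space point."""
    def __init__(self, code_dim=32, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 + code_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),
        )

    def forward(self, x, omega_t):
        # Predict an offset and add it, so the field starts near identity.
        return x + self.mlp(torch.cat([x, omega_t], dim=-1))

x = torch.randn(1024, 3)       # sampled ray points at frame t
omega = torch.zeros(1024, 32)  # that frame's deformation code
x_canonical = DeformationField()(x, omega)
```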
Multi-Resolution Hash Ensemble
Inspired by Instant-NGP, the model comprises $M$ separate 3D multi-resolution hash grids $H_1, \dots, H_M$. At each frame, features from these grids are blended:
$$h(\mathbf{x}, t) = \sum_{m=1}^{M} w_m(t)\, H_m(\mathbf{x}),$$
with $\mathbf{w}(t) = (w_1(t), \dots, w_M(t))$ representing simplex-constrained blend weights.
The blended feature $h(\mathbf{x}, t)$ passes to two compact MLP heads:
- Density and latent code: $(\sigma, \mathbf{z}) = \mathrm{MLP}_\sigma(h(\mathbf{x}, t))$
- Color: $\mathbf{c} = \mathrm{MLP}_c(\mathbf{z}, \mathbf{d})$
Network details: a deformation MLP (4–6 layers, 128–256 units) emitting an SE(3) warp; 16 hash-grid levels spanning coarse-to-fine resolutions; a hash table of roughly 2 million entries per grid; batches of 4096 rays (as in Instant-NGP); blend weights and deformation codes are initialized randomly and jointly optimized.
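The simplex-constrained blend can be realized, for instance, as a softmax over per-frame logits. In the sketch below, the ensemble size, feature dimension, and softmax parameterization are illustrative assumptions:

```python
import numpy as np

def blend_hash_features(grid_features, logits):
    """Blend features from M hash grids with simplex-constrained weights.
    `grid_features`: (M, F) features queried at one point from each grid;
    `logits`: (M,) per-frame blend logits, softmaxed onto the simplex."""
    w = np.exp(logits - logits.max())
    w = w / w.sum()              # weights are non-negative and sum to 1
    return w @ grid_features     # (F,) blended feature

M, F = 16, 32                    # ensemble size / feature dim (illustrative)
rng = np.random.default_rng(0)
feats = rng.normal(size=(M, F))  # stand-in for hash-grid lookups
logits = np.zeros(M)             # uniform blend, e.g. at initialization
blended = blend_hash_features(feats, logits)
```

With uniform logits the blend reduces to the mean over the ensemble; during training, the per-frame logits let each time step select its own mixture of grids.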
3. Training Objective
The training loss combines photometric, mask, depth, and regularization terms:
- $\mathcal{L}_{\text{photo}}$: mean squared error on foreground rays.
- $\mathcal{L}_{\text{mask}}$: background/foreground mask loss, supervised by the per-frame alpha matte.
- Depth terms: depth-map supervision and associated regularization.
- $\mathcal{L}_{\text{dist}}$: distortion loss penalizing "floaters," as in MipNeRF360.
Default loss weights follow the released configuration files.
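A toy composition of these terms (omitting the depth terms for brevity, and with illustrative `lambda` values) might look like:

```python
import numpy as np

def total_loss(pred_rgb, gt_rgb, pred_alpha, gt_alpha, weights, z_vals,
               lambda_mask=0.1, lambda_dist=0.001):
    """Sketch of the combined objective: photometric MSE on foreground
    rays, a mask term against the alpha matte, and a MipNeRF360-style
    distortion term penalizing floaters (lambda values are illustrative)."""
    l_photo = np.mean((pred_rgb - gt_rgb) ** 2)
    l_mask = np.mean((pred_alpha - gt_alpha) ** 2)
    # Distortion: ray weights spread over distant depths are penalized.
    d = np.abs(z_vals[:, None] - z_vals[None, :])
    l_dist = np.sum(weights[:, None] * weights[None, :] * d)
    return l_photo + lambda_mask * l_mask + lambda_dist * l_dist

rng = np.random.default_rng(0)
rgb = rng.uniform(size=(256, 3))
alpha = np.ones(256)
w = np.array([0.0, 1.0, 0.0, 0.0])  # all mass on one sample -> no distortion
z = np.linspace(0.0, 1.0, 4)
loss = total_loss(rgb, rgb, alpha, alpha, w, z)
```

A perfect prediction with all ray weight concentrated on a single depth drives every term to zero, which is the behavior the distortion term rewards.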
Optimization runs for 300,000 iterations (roughly one day on a single RTX A6000), with a staged warm-up: the first 40k iterations use only one hash table, the next 40k ramp in the full ensemble, followed by windowed blending. Learning rates for the hash tables and MLPs are decayed by a factor of $0.8$ every 20k iterations, with a separate decay schedule for the deformation network.
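The step-decay schedule described above corresponds to the following (the base learning rate here is illustrative):

```python
def lr_at(step, base_lr, gamma=0.8, every=20_000):
    """Step decay: multiply the learning rate by `gamma` every
    `every` iterations (0.8 / 20k per the schedule in the text)."""
    return base_lr * gamma ** (step // every)

final_lr = lr_at(300_000, 1e-2)  # rate after the full 300k-iteration run
```

In PyTorch this is equivalent to `torch.optim.lr_scheduler.StepLR(optimizer, step_size=20_000, gamma=0.8)`.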
4. Benchmark Protocol, Metrics, and Results
Evaluation Protocol
For each sequence, 12 input views are used for training, and 4 held-out views (including extreme viewpoints) are evaluated on 15 evenly spaced frames per sequence. This design is verified to produce metric parity with full-sequence evaluation, with negligible variation across all metrics.
Quantitative Metrics
Evaluation uses four primary metrics:
- PSNR (Peak Signal-to-Noise Ratio) — higher is better.
- SSIM (Structural Similarity) — higher is better.
- LPIPS (Learned Perceptual Image Patch Similarity) — lower is better.
- JOD (Just-Objectionable-Difference) — higher indicates greater temporal coherence.
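Of these, PSNR is simple enough to compute directly from the mean squared error:

```python
import numpy as np

def psnr(pred, gt, max_val=1.0):
    """PSNR in dB for images with values in [0, max_val]; higher is better."""
    mse = np.mean((pred - gt) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)

gt = np.full((64, 64, 3), 0.5)
pred = gt + 0.01          # uniform 0.01 error everywhere
val = psnr(pred, gt)      # 10 * log10(1 / 1e-4) = 40 dB
```

SSIM, LPIPS, and JOD require reference implementations (e.g., scikit-image, the `lpips` package, and FovVideoVDP, respectively) rather than a closed-form expression.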
Performance on held-out views (10 diverse test sequences):
| Method | PSNR | SSIM | LPIPS | JOD |
|---|---|---|---|---|
| PSR | 12.5 | 0.774 | 0.341 | — |
| Instant-NGP | 28.8 | 0.864 | 0.254 | 6.75 |
| Nerfies | 29.5 | 0.849 | 0.299 | 7.23 |
| HyperNeRF | 29.6 | 0.848 | 0.304 | 7.27 |
| DyNeRF | 30.6 | 0.860 | 0.254 | 7.69 |
| NeRSemble | 31.8 | 0.875 | 0.212 | 7.86 |
NeRSemble outperforms all baselines, especially in fine detail and temporal consistency.
Qualitative Findings
NeRSemble reconstructs fine-scale details such as wrinkles, hair strands, and complex mouth poses more faithfully than competitors. Baselines display distinct weaknesses: Instant-NGP shows severe floaters and flicker; Nerfies and HyperNeRF, while effective for coarse motion, blur delicate deformations; DyNeRF offers improvements but cannot enforce spatio-temporal consistency as robustly as the hash ensemble in NeRSemble.
Failure Modes
- "Hollow face" artifacts occur when the mouth interior is occluded in most frames.
- Specular reflection on eyes can produce erroneous highlights.
- Extremely fast hair motions surpass the deformation field's capacity; the hash tables alone are insufficient, resulting in blurred hair geometry.
5. Implementation, Use, and Limitations
Reproducibility
Researchers can replicate results by:
- Cloning the NeRSemble repository.
- Installing dependencies: PyTorch ≥1.12, Nerfstudio, NerfAcc + CUDA.
- Downloading datasets (images, masks, intrinsics, extrinsics, depth maps).
- Launching training: `python3 train.py --cfg configs/nersemble.yaml --data_dir /path/to/data`
- Rendering novel views after training: `python3 eval.py --checkpoint best.ckpt --views held_out.txt`
Integration with Other Workflows
The NeRSemble dataset and protocol are compatible with alternative pipelines or model architectures. Applications include static or dynamic NeRF, multi-view stereo (MVS), facial mesh/landmark fitting, and GAN-based generative modeling. Train-test splits and metadata are provided in JSON format, and model backbones can be substituted within Nerfstudio.
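As an illustration of how JSON split metadata can drive a pipeline, the snippet below parses a hypothetical split file. The field names and indices here are invented for the sketch; the actual schema is defined by the released benchmark files.

```python
import json

# Hypothetical split-file layout (the real schema ships with the dataset).
split_json = """
{
  "train_cameras": [1, 2, 3, 4, 6, 7, 8, 9, 11, 12, 13, 14],
  "test_cameras": [0, 5, 10, 15],
  "eval_frames": [0, 30, 60]
}
"""
split = json.loads(split_json)
train_cams = split["train_cameras"]
test_cams = split["test_cameras"]
```

A loader built this way can feed any backbone (Nerfstudio methods, MVS, mesh fitting) without changes to the split definition itself.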
Limitations and Future Directions
- NeRSemble does not provide inter-sequence generalization: each subject and sequence is trained independently; there is no universal or "one-for-all" model.
- Mouth interiors and other heavily occluded regions remain challenging; artistic or anatomical priors could be beneficial.
- Hair physics and very rapid deformations are areas for methodological improvement, potentially via optical flow regularization or physics-informed models.
- A plausible implication is that pre-training a dynamic "head prior" over all identities could allow for faster convergence and possibly accurate single-frame fitting.
- Limitations in modeling eye reflections and fleeting occlusions are noted as targets for future work.
All assets, configuration files, and splits are openly accessible, facilitating rapid adoption for comparative research, extension, and benchmarking in dynamic neural head modeling (Kirschstein et al., 2023).