
NeRF View Synthesis

Updated 5 January 2026
  • Neural Radiance Field (NeRF) view synthesis is a rendering technique that models static 3D scenes as continuous, view-dependent radiance fields using multilayer perceptrons.
  • It employs positional encoding and hierarchical sampling to capture high-frequency details and efficiently approximate the volume rendering integral.
  • NeRF achieves state-of-the-art photorealistic novel view generation with competitive PSNR metrics, though it faces challenges in computational latency and scalability.

Neural Radiance Field (NeRF) View Synthesis is a volumetric neural rendering paradigm that enables photorealistic synthesis of novel views of static 3D scenes by learning a continuous, view-dependent radiance representation parameterized by a multilayer perceptron (MLP). NeRF achieves state-of-the-art rendering quality by optimizing this continuous volumetric scene function from a sparse set of input images with known camera poses and projecting the predicted colors and densities into rendered images via differentiable volume rendering (Mildenhall et al., 2020).

1. Theoretical Formulation and Neural Parameterization

NeRF represents a 3D scene as a continuous function

$$F_\Theta: (x, y, z, \theta, \phi) \mapsto (\mathbf{c}, \sigma),$$

where $(x, y, z)$ denotes spatial position, $(\theta, \phi)$ encodes viewing direction, $\sigma$ is the volume density (differential opacity), and $\mathbf{c} \in \mathbb{R}^3$ is the directional emitted radiance (RGB color). This function is realized by a fully-connected MLP mapping a 5D input (position and view direction) to density and color outputs.
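A minimal sketch of this parameterization in PyTorch (an assumed implementation, not the reference code): the trunk sees only the encoded position so that density is view-independent, while color additionally conditions on the encoded view direction. Layer counts are reduced for brevity and the original skip connection is omitted.

```python
import torch
import torch.nn as nn

class RadianceFieldMLP(nn.Module):
    """Simplified F_Theta: (encoded position, encoded direction) -> (sigma, rgb)."""
    def __init__(self, pos_dim=63, dir_dim=27, width=256):
        # pos_dim = 3 raw coords + 3*2*10 encoded; dir_dim = 3 + 3*2*4 encoded.
        super().__init__()
        self.trunk = nn.Sequential(           # position-only trunk
            nn.Linear(pos_dim, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
        )
        self.sigma_head = nn.Linear(width, 1)  # volume density
        self.feature = nn.Linear(width, width)
        self.color_head = nn.Sequential(       # view-dependent RGB
            nn.Linear(width + dir_dim, width // 2), nn.ReLU(),
            nn.Linear(width // 2, 3), nn.Sigmoid(),
        )

    def forward(self, pos_enc, dir_enc):
        h = self.trunk(pos_enc)
        sigma = torch.relu(self.sigma_head(h))  # density constrained to be non-negative
        rgb = self.color_head(torch.cat([self.feature(h), dir_enc], dim=-1))
        return sigma, rgb
```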

To render an image, rays are cast from the camera center through each pixel, and $F_\Theta$ is evaluated at sampled points along each ray. The predicted color for a ray $\mathbf{r}(t) = \mathbf{o} + t\mathbf{d}$ is computed using the volume rendering integral (Kajiya & Von Herzen, 1984):

$$C(\mathbf{r}) = \int_{t_n}^{t_f} T(t)\,\sigma(\mathbf{r}(t))\,\mathbf{c}(\mathbf{r}(t), \mathbf{d})\,dt, \qquad T(t) = \exp\!\left(-\int_{t_n}^{t}\sigma(\mathbf{r}(s))\,ds\right),$$

where $T(t)$ is the accumulated transmittance along the ray. In practice, this integral is approximated by stratified quadrature sampling:

$$\hat{C}(\mathbf{r}) = \sum_{i=1}^{N} T_i\,\bigl(1 - e^{-\sigma_i \delta_i}\bigr)\,\mathbf{c}_i, \qquad T_i = \exp\!\left(-\sum_{j=1}^{i-1}\sigma_j \delta_j\right),$$

where $\delta_i$ is the distance between adjacent samples. Each sample's contribution is governed by its opacity in the context of the densities already encountered along the ray.
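The discrete quadrature translates directly into code. A minimal NumPy sketch for compositing a single ray, assuming per-ray arrays of sampled densities, colors, and inter-sample distances (the function and argument names are hypothetical):

```python
import numpy as np

def composite_ray(sigmas, rgbs, deltas):
    """Approximate C_hat(r) = sum_i T_i * (1 - exp(-sigma_i * delta_i)) * c_i.

    sigmas: (N,)   volume densities at the N samples along one ray
    rgbs:   (N, 3) predicted colors at those samples
    deltas: (N,)   distances between adjacent samples
    Returns the composited RGB color and the per-sample weights w_i.
    """
    alphas = 1.0 - np.exp(-sigmas * deltas)                       # per-sample opacity
    # T_i = exp(-sum_{j<i} sigma_j * delta_j): transmittance before sample i
    trans = np.exp(-np.concatenate([[0.0], np.cumsum(sigmas * deltas)[:-1]]))
    weights = trans * alphas
    color = (weights[:, None] * rgbs).sum(axis=0)
    return color, weights
```

The weights returned here are the same quantities reused later by the hierarchical sampling stage.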

2. Neural Architecture, Positional Encoding, and Hierarchical Sampling

The network architecture consists of an 8-layer, 256-width ReLU MLP. The input coordinates are positionally encoded:

$$\gamma(p) = \left[\sin(2^0\pi p), \cos(2^0\pi p), \ldots, \sin(2^{L-1}\pi p), \cos(2^{L-1}\pi p)\right],$$

with $L=10$ for positions and $L=4$ for viewing directions. This encoding enables the representation of high spatial frequencies, which is critical for recovering sharp edges and view-dependent effects.
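A minimal NumPy sketch of $\gamma(p)$, assuming coordinates are already normalized to a bounded range; whether the raw coordinates are additionally concatenated to the encoding is an implementation choice:

```python
import numpy as np

def positional_encoding(p, num_freqs):
    """gamma(p): map each coordinate to [sin(2^k * pi * p), cos(2^k * pi * p)], k < L.

    p: (..., D) array of normalized coordinates
    Returns an array of shape (..., D * 2 * num_freqs).
    """
    freqs = 2.0 ** np.arange(num_freqs) * np.pi   # pi, 2*pi, 4*pi, ...
    scaled = p[..., None] * freqs                  # (..., D, L)
    enc = np.concatenate([np.sin(scaled), np.cos(scaled)], axis=-1)
    return enc.reshape(*p.shape[:-1], -1)

# Example: L = 10 applied to 3D positions gives a 60-dimensional encoding per point.
xyz = np.random.uniform(-1, 1, size=(4, 3))
print(positional_encoding(xyz, 10).shape)   # (4, 60)
```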

To optimize sampling efficiency, NeRF employs a hierarchical two-stage sampling strategy. First, $N_c$ stratified coarse samples are drawn along each ray and processed through the MLP. The weights $w_i$ computed in this coarse pass define a piecewise-constant probability density function along the ray, from which $N_f$ additional fine samples are drawn via inverse-transform sampling. The fine samples, merged with the coarse ones and sorted by depth, concentrate evaluations near surfaces and enable precise integration of color and geometry.

Hierarchical Ray Sampling Pseudocode:

// Coarse pass
Uniformly stratify [t_n, t_f] into N_c bins.
For i = 1..N_c: draw t_i ~ Uniform(bin_i).
Query F_Θ at {r(t_i)} to get (c_i, σ_i).
Compute weights w_i = T_i (1−exp(−σ_i δ_i)).
Compute C_coarse = Σ_i w_i c_i.

// Fine pass
Build PDF from coarse weights; draw N_f samples {t'_j} via inverse-transform sampling.
Merge all t's, sort by depth; query F_Θ at all t's.
Composite to obtain final color C_fine.
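As a concrete illustration of the fine pass, here is a minimal NumPy sketch of inverse-transform sampling from the coarse weights; bin-edge handling varies between implementations, and the function and argument names are hypothetical:

```python
import numpy as np

def sample_fine(t_coarse, weights, n_fine, rng=None):
    """Draw n_fine depths along a ray from the piecewise-constant PDF
    defined by the coarse-pass weights (inverse-transform sampling).

    t_coarse: (Nc,) sorted coarse sample depths
    weights:  (Nc,) compositing weights w_i from the coarse pass
    """
    if rng is None:
        rng = np.random.default_rng()
    pdf = weights[:-1] / (weights[:-1].sum() + 1e-8)     # one bin per coarse interval
    cdf = np.concatenate([[0.0], np.cumsum(pdf)])        # (Nc,)
    u = rng.uniform(size=n_fine)                         # uniform draws in [0, 1)
    idx = np.searchsorted(cdf, u, side="right") - 1      # bin index for each u
    idx = np.clip(idx, 0, len(t_coarse) - 2)
    # Place each sample inside its bin by linear interpolation of the CDF.
    denom = np.maximum(cdf[idx + 1] - cdf[idx], 1e-8)
    frac = (u - cdf[idx]) / denom
    t_fine = t_coarse[idx] + frac * (t_coarse[idx + 1] - t_coarse[idx])
    return np.sort(np.concatenate([t_coarse, t_fine]))   # merged, depth-sorted
```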

3. Optimization, Loss Function, and Training Protocol

The parameters $\Theta$ are learned by minimizing the sum of squared $\ell_2$ errors between rendered colors and ground-truth pixel colors from the posed training images:

$$\mathcal{L}(\Theta) = \sum_{\mathbf{r} \in R} \left\| \hat{C}_\Theta(\mathbf{r}) - C_{\mathrm{gt}}(\mathbf{r}) \right\|_2^2.$$

Both coarse and fine outputs are included in the loss to ensure effective gradient propagation through the hierarchical sampling pipeline. The entire rendering and compositing path is differentiable, so gradients flow end-to-end via standard backpropagation.
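A minimal PyTorch sketch of this loss over a batch of rays, using hypothetical prediction tensors; the paper writes a sum over rays, while implementations commonly average over the batch:

```python
import torch

def nerf_loss(c_coarse, c_fine, c_gt):
    """Squared L2 photometric error, supervising both coarse and fine renderings."""
    return (((c_coarse - c_gt) ** 2).sum(-1).mean()
            + ((c_fine - c_gt) ** 2).sum(-1).mean())

# Dummy example with a batch of 1024 rays.
c_gt = torch.rand(1024, 3)
c_coarse = torch.rand(1024, 3, requires_grad=True)
c_fine = torch.rand(1024, 3, requires_grad=True)
loss = nerf_loss(c_coarse, c_fine, c_gt)
loss.backward()   # gradients flow back to whatever produced the predictions
```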

Empirically, training on a single NVIDIA V100 GPU for 100K–300K Adam optimizer steps takes 1–2 days per scene. Inference at 800×800 pixel resolution, with 256 network queries per ray, requires approximately 30 seconds per frame.

4. Empirical Performance and Benchmarking

NeRF sets state-of-the-art standards in view synthesis quality across synthetic and real datasets:

| Dataset | Input Images | Resolution | Test Images | NeRF PSNR | LLFF PSNR | SRN PSNR |
|---|---|---|---|---|---|---|
| DeepVoxels (Diffuse) | 479 | 512×512 | 1000 | 40.2 | 34.4 | 33.2 |
| Realistic Synthetic (Blender) | 100 | 800×800 | 200 | 31.0 | 24.9 | 22.3 |
| Real Forward-Facing | 20–62 | 1008×756 | 1/8 held-out split | 26.5 | 24.1 | 22.8 |

PSNR values are in dB; higher is better.

Qualitatively, NeRF recovers high-frequency detail such as fine rigging and specular highlights and displays consistently higher temporal coherence in video sequences compared to voxel or mesh-based renderers (Mildenhall et al., 2020).

5. Generalization, Extensions, and Practical Insights

Ablation studies highlight critical design elements: removing positional encoding or view-dependent color prediction degrades high-frequency fidelity. Hierarchical sampling improves both speed and rendering quality.

NeRF’s model footprint is orders of magnitude smaller than that of voxel-grid methods (about 5 MB per scene, roughly 3000× smaller than LLFF’s voxel grids). However, the approach is bottlenecked by computationally intensive MLP inference at render time.

Open research directions include:

  • Reducing inference latency via specialized data structures or hardware
  • Generalizing beyond static scenes to dynamics and relightable objects
  • Improving interpretability of the learned MLP representation
  • Extending reconstruction to unknown or uncertain camera poses, scene structure, and sparse-view scenarios

6. Impact and Limitations

NeRF’s impact is broad, enabling:

  • High-fidelity novel-view rendering from sparse, posed images.
  • Recovery of view-dependent effects for complex geometry and materials.
  • Compact, continuous scene representations suitable for large and diverse scenes.

Principal limitations relate to:

  • Extensive per-scene training time and memory cost at test time.
  • Slow inference arising from large numbers of MLP evaluations per image.
  • Restriction to static, rigid, non-relightable scenes in the absence of explicit dynamic modeling.

Subsequent developments—such as efficient distillation [R2L, (Wang et al., 2022)], real-time motion integration, generalization to transparent or refractive objects (Yoon et al., 2023), and robust pose-free training [VMRF, (Zhang et al., 2022)]—seek to overcome these bottlenecks and expand NeRF’s applicability.

7. Summary Table: Core NeRF Components

| Component | Formulation / Citation | Role |
|---|---|---|
| 5D Radiance Field | $F_\Theta(x, y, z, \theta, \phi)$ | Scene parameterization |
| Volume Rendering | Eqs. (1)–(2) (Mildenhall et al., 2020) | Physically-based view synthesis |
| Positional Encoding | $\gamma(p)$, $L=10$ for positions | High-frequency detail recovery |
| Hierarchical Sampling | Coarse-to-fine, PDF-driven | Surface localization, efficiency |
| $\ell_2$ Photometric Loss | Squared RGB error, both passes | End-to-end differentiable learning |

The architecture synthesizes photorealistic novel views by integrating continuous scene and appearance modeling, differentiable volume rendering, and hierarchical sample scheduling in a unified framework (Mildenhall et al., 2020).
