HyperNeRF: A Higher-Dimensional Representation for Topologically Varying Neural Radiance Fields (2106.13228v2)

Published 24 Jun 2021 in cs.CV and cs.GR

Abstract: Neural Radiance Fields (NeRF) are able to reconstruct scenes with unprecedented fidelity, and various recent works have extended NeRF to handle dynamic scenes. A common approach to reconstruct such non-rigid scenes is through the use of a learned deformation field mapping from coordinates in each input image into a canonical template coordinate space. However, these deformation-based approaches struggle to model changes in topology, as topological changes require a discontinuity in the deformation field, but these deformation fields are necessarily continuous. We address this limitation by lifting NeRFs into a higher dimensional space, and by representing the 5D radiance field corresponding to each individual input image as a slice through this "hyper-space". Our method is inspired by level set methods, which model the evolution of surfaces as slices through a higher dimensional surface. We evaluate our method on two tasks: (i) interpolating smoothly between "moments", i.e., configurations of the scene, seen in the input images while maintaining visual plausibility, and (ii) novel-view synthesis at fixed moments. We show that our method, which we dub HyperNeRF, outperforms existing methods on both tasks. Compared to Nerfies, HyperNeRF reduces average error rates by 4.1% for interpolation and 8.6% for novel-view synthesis, as measured by LPIPS. Additional videos, results, and visualizations are available at https://hypernerf.github.io.

Citations (211)

Summary

  • The paper introduces a hyper-dimensional representation that uses a deformable slicing surface and spatial deformation field to handle topological changes in dynamic scenes.
  • The paper extends standard NeRFs with extra ambient dimensions, enabling accurate modeling of complex motions like opening mouths or tearing paper.
  • The paper demonstrates improved novel-view synthesis and temporal interpolation, reducing LPIPS error by up to 8.6% compared to baseline methods.

This paper introduces HyperNeRF, a method designed to improve the modeling of dynamic scenes with Neural Radiance Fields (NeRF), particularly when those scenes involve changes in topology (e.g., opening/closing mouths, tearing paper, objects making/breaking contact). Standard deformable NeRF approaches like Nerfies (Nerfies: Deformable Neural Radiance Fields, 2020) struggle with these changes because they rely on continuous deformation fields, which cannot easily represent the discontinuities required for topological variations.

HyperNeRF addresses this limitation by drawing inspiration from level set methods. Instead of representing the scene in a single 3D canonical space, it embeds the scene representation within a higher-dimensional "hyper-space". The 5D radiance field (3D position + 2D viewing direction) for any given moment (input image) is then represented as a specific "slice" through this higher-dimensional space.
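
To make the level-set intuition concrete, here is a purely illustrative sketch (the density function and threshold are invented for illustration, not taken from the paper): slicing one smooth, higher-dimensional function at different values of an extra coordinate yields lower-dimensional shapes whose topology differs, even though nothing in the higher-dimensional function is discontinuous.

import numpy as np

def density_3d(x, y, w):
  # Two Gaussian blobs whose separation grows with the extra coordinate w:
  # at w = 0 they merge into one component; at larger w they split into two.
  return np.exp(-((x - w)**2 + y**2)) + np.exp(-((x + w)**2 + y**2))

xs = np.linspace(-4.0, 4.0, 256)
X, Y = np.meshgrid(xs, xs)
one_region  = density_3d(X, Y, w=0.0) > 0.5  # a single connected blob
two_regions = density_3d(X, Y, w=2.0) > 0.5  # two disjoint blobs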

Core Components and Implementation:

  1. Hyper-Space Template: The standard NeRF MLP, which maps (x, y, z, d) to (c, σ), is extended to take additional "ambient" coordinates w = (w1, ..., wW) as input. The template NeRF function becomes F: (x, w, d, ψi) -> (c, σ), where x is the 3D position, w is the point in the W-dimensional ambient space, d is the viewing direction, and ψi is a per-frame appearance embedding (as in NeRF in the Wild: Neural Radiance Fields for Unconstrained Photo Collections, 2020). In experiments, W = 2 ambient dimensions were found sufficient.
  2. Spatial Deformation Field: HyperNeRF still utilizes a spatial deformation field T: (x, φi) -> x', similar to Nerfies (Nerfies: Deformable Neural Radiance Fields, 2020), to handle standard, topology-preserving motion. This field maps an observation-space coordinate x to a canonical coordinate x', conditioned on a per-frame latent deformation code φi. It's implemented as an MLP.
  3. Deformable Slicing Surface (DS): To determine the ambient coordinates w for querying the hyper-space template, HyperNeRF introduces a deformable slicing surface field H: (x, φi) -> w. This is another MLP, conditioned on the same latent deformation code φi, that maps each observation-space point x to a specific coordinate w in the ambient space. This allows different spatial locations within the same frame to query different parts of the hyper-space, yielding a more efficient representation than assigning a single w per frame (the Axis-Aligned Plane, or AP, approach, which was also tested and found inferior).
  4. Combined Query: For a sample point x along a ray in observation space for frame i, the process is:
    • Compute canonical spatial coordinates: x' = T(x, φi)
    • Compute ambient coordinates: w = H(x, φi)
    • Query the hyper-template: (c, σ) = F(x', w, d, ψi)
    • Volume rendering integrates c and σ along the ray as usual.
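
As a concrete reference for the last step, here is a minimal NumPy sketch of the standard volume-rendering composite; the variable names and the handling of sample spacing are illustrative, not taken from the paper's implementation.

import numpy as np

def composite_along_ray(colors, sigmas, deltas):
  # colors: (N, 3) sampled colors; sigmas: (N,) densities;
  # deltas: (N,) distances between consecutive samples along the ray.
  alphas = 1.0 - np.exp(-sigmas * deltas)                         # per-sample opacity
  trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))  # transmittance T_i
  weights = alphas * trans
  rgb = (weights[:, None] * colors).sum(axis=0)                   # rendered pixel color
  return rgb, weights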

Implementation Details:

  • Network Architectures:
    • Template F: Standard NeRF MLP architecture, extended inputs for w and ψi. (Fig 13)
    • Deformation T: Nerfies deformation MLP architecture. (Fig 14)
    • Slicing Surface H: MLP with 6 layers, 64 width, skip connection at layer 5. Final layer weights initialized near zero (N(0, 10^-5)). (Fig 15)
  • Positional Encoding: Standard sinusoidal positional encoding γ(.) (Fourier Features Let Networks Learn High Frequency Functions in Low Dimensional Domains, 2020) is used for x (as input to the slicing surface field H), for the canonical coordinates x', and for d. Crucially, windowed positional encodings γ_α(.) and γ_β(.) (Nerfies: Deformable Neural Radiance Fields, 2020) are used for the spatial input to the deformation field T (window parameter α) and for the ambient coordinates w fed into the hyper-space template F (window parameter β); the encoding of w omits the identity term. The windowing enables coarse-to-fine optimization (a sketch of the windowed encoding is given after this list).
  • Delayed Ambient Dimensions: To encourage the model to use spatial deformations T for simple motions and reserve the ambient dimensions w primarily for topological changes, the use of w is delayed: the windowing parameter β for the positional encoding of the ambient coordinates is held at 0 for the first 1,000 iterations (which disables the ambient dimensions entirely, since the identity mapping is omitted from the positional encoding of w) and is then increased linearly to its maximum value over the next 10k iterations. The spatial deformation field T uses its own, standard windowing schedule for α.
  • Latent Codes: 8 dimensions are used for both deformation codes φi and appearance codes ψi.
  • Training: Optimized using only an L2 photometric loss between rendered and ground truth pixels. Adam optimizer with learning rate decaying exponentially from 1e-3 to 1e-4. Trained for 250k iterations (evaluation) or 1M iterations (qualitative results) on TPU v4s. Batch size 6144 rays, 128-256 samples per ray.
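
The windowed (coarse-to-fine) positional encoding referenced above can be sketched as follows, assuming the Nerfies-style cosine easing window per frequency band; the number of frequency bands is a placeholder rather than the paper's exact setting, and the standard encoding is recovered by leaving the window fully open. With include_identity=False this is the form used for the ambient coordinates w.

import numpy as np

def windowed_positional_encoding(x, alpha, num_freqs=8, include_identity=True):
  # Nerfies-style coarse-to-fine encoding: frequency band k is faded in as the
  # window parameter alpha sweeps past k (alpha in [0, num_freqs]).
  x = np.atleast_1d(x)
  feats = [x] if include_identity else []
  for k in range(num_freqs):
    window = 0.5 * (1.0 - np.cos(np.pi * np.clip(alpha - k, 0.0, 1.0)))
    feats.append(window * np.sin(2.0**k * np.pi * x))
    feats.append(window * np.cos(2.0**k * np.pi * x))
  return np.concatenate(feats)

def positional_encoding(x, num_freqs=8):
  # Standard sinusoidal encoding: the windowed form with the window fully open.
  return windowed_positional_encoding(x, alpha=num_freqs, num_freqs=num_freqs)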

# Pseudocode for querying HyperNeRF at a single sample point x along a ray of
# frame i, with viewing direction d. The positional-encoding helpers, the MLPs
# (mlp_T, mlp_H, mlp_F), the per-frame embedding tables, and the free
# variables x, d, i, and training_step are assumed to be defined elsewhere.

def spatial_deformation_field(x, deformation_code, alpha):
  # MLP T: (gamma_alpha(x), deformation_code) -> x'
  # Uses the Nerfies windowed positional encoding gamma_alpha for x.
  encoded_x = windowed_positional_encoding(x, alpha)
  inputs = concatenate(encoded_x, deformation_code)
  x_prime = mlp_T(inputs)
  return x_prime

def slicing_surface_field(x, deformation_code):
  # MLP H: (gamma(x), deformation_code) -> w
  # Maps each observation-space point to its ambient coordinates w,
  # using the standard (non-windowed) positional encoding for x.
  encoded_x = positional_encoding(x)
  inputs = concatenate(encoded_x, deformation_code)
  w = mlp_H(inputs)  # outputs ambient coordinates
  return w

def hyper_template_nerf(x_prime, w, d, appearance_code, beta):
  # MLP F: (gamma(x'), gamma_beta(w), gamma(d), appearance_code) -> (c, sigma)
  # Standard positional encoding gamma for x' and d; the ambient coordinates w
  # use the windowed encoding gamma_beta WITHOUT the identity term, so beta = 0
  # disables the ambient dimensions entirely.
  encoded_x_prime = positional_encoding(x_prime)
  encoded_w = windowed_positional_encoding(w, beta, include_identity=False)
  encoded_d = positional_encoding(d)
  inputs = concatenate(encoded_x_prime, encoded_w, encoded_d, appearance_code)
  c, sigma = mlp_F(inputs)
  return c, sigma

# Per-frame latent codes and coarse-to-fine window parameters.
deformation_code = deformation_embeddings[i]
appearance_code = appearance_embeddings[i]
alpha = get_current_alpha(training_step)  # deformation window parameter
beta = get_current_beta(training_step)    # ambient-dimension window parameter

# Combined query for one sample point.
x_prime = spatial_deformation_field(x, deformation_code, alpha)
w = slicing_surface_field(x, deformation_code)
c, sigma = hyper_template_nerf(x_prime, w, d, appearance_code, beta)
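
The pseudocode above assumes schedule helpers get_current_alpha and get_current_beta. A hedged sketch follows: the beta schedule mirrors the delayed-ambient-dimensions description (held at 0 for the first 1k iterations, then ramped linearly over the next 10k), while the maximum window values and the alpha ramp length are placeholders rather than values from the paper.

# Placeholders, not values from the paper.
ALPHA_MAX = 8              # assumed number of frequency bands for x
BETA_MAX = 8               # assumed number of frequency bands for w
ALPHA_RAMP_ITERS = 80_000  # assumed length of the alpha ramp

def get_current_beta(step, delay=1_000, ramp=10_000, beta_max=BETA_MAX):
  # Held at 0 for `delay` steps (ambient dimensions disabled), then increased
  # linearly to beta_max over the next `ramp` steps.
  if step < delay:
    return 0.0
  return min((step - delay) / ramp, 1.0) * beta_max

def get_current_alpha(step, ramp=ALPHA_RAMP_ITERS, alpha_max=ALPHA_MAX):
  # Standard linear coarse-to-fine ramp for the deformation field's encoding.
  return min(step / ramp, 1.0) * alpha_max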

Evaluation:

HyperNeRF was evaluated on novel view synthesis and temporal interpolation tasks using custom datasets featuring significant topological changes (peeling a banana, facial expressions, 3D printer movement). It was compared against NeRF, Nerfies, Neural Volumes (Neural Volumes: Learning Dynamic Renderable Volumes from Images, 2019), and NSFF (Neural Scene Flow Fields for Space-Time View Synthesis of Dynamic Scenes, 2020).

  • Quantitative: HyperNeRF (DS variant) showed improved performance over baselines on metrics like LPIPS, MS-SSIM, and PSNR, particularly for interpolation (reducing LPIPS error by 4.1% vs. Nerfies) and novel-view synthesis (reducing LPIPS error by 8.6% vs. Nerfies). LPIPS was noted as correlating best with perceptual quality.
  • Qualitative: Visual results demonstrated HyperNeRF's ability to produce sharper renderings with fewer geometric artifacts (e.g., distorted chins, blurry transitions) compared to Nerfies when modeling topological changes. The Deformable Slicing (DS) approach yielded significantly better interpolation quality than the Axis-Aligned Plane (AP) approach, avoiding cross-fading artifacts.
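
As an illustration of how the components above can be used for moment interpolation, one simple approach, sketched below, is to blend the per-frame latent codes of two input frames and render with the blended codes. This is consistent with the per-frame latent-code design described earlier but is not necessarily the paper's exact interpolation protocol; ALPHA_MAX and BETA_MAX are the placeholder window maxima defined above.

def render_interpolated_moment(x, d, t, i, j):
  # Linearly blend the latent codes of frames i and j (t in [0, 1]) and query
  # the model with fully opened coarse-to-fine windows.
  phi = (1 - t) * deformation_embeddings[i] + t * deformation_embeddings[j]
  psi = (1 - t) * appearance_embeddings[i] + t * appearance_embeddings[j]
  x_prime = spatial_deformation_field(x, phi, ALPHA_MAX)
  w = slicing_surface_field(x, phi)
  return hyper_template_nerf(x_prime, w, d, psi, BETA_MAX)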

Limitations:

The method inherits limitations from NeRF, including sensitivity to inaccurate camera poses and an inability to reconstruct poorly observed scene parts or very rapid motion that appears as motion blur in the input images.