
Instant Neural Graphics Primitives (iNGP)

Updated 11 November 2025
  • Instant Neural Graphics Primitives (iNGP) are neural field representations that combine multi-resolution hash tables with small MLPs for efficient, real-time 3D reconstruction and view synthesis.
  • The design shifts most computational burden to trainable, spatially adaptive hash encodings, yielding orders-of-magnitude speedups and reduced memory usage compared to conventional methods.
  • A fused GPU implementation with CUDA kernels enables rapid convergence and high-fidelity rendering; extensions of the framework include satellite-imagery 3D reconstruction and simplex-based encodings.

Instant Neural Graphics Primitives (iNGP) refer to a class of neural field representations in which a compact neural network (typically a tiny multilayer perceptron) is combined with a multi-resolution hash-table encoding of spatial coordinates. This design enables the efficient parameterization, rapid training, and real-time inference of implicit graphics primitives—such as signed distance fields (SDF) and neural radiance fields (NeRF)—across tasks including novel view synthesis, 3D reconstruction, and image fitting. The iNGP pipeline achieves orders-of-magnitude speedups over prior neural representations by shifting most of the representational burden to data-adaptive, trainable hash-based lookup tables that precondition the input space hierarchically, thereby allowing extremely compact and shallow networks to suffice for high-fidelity reconstruction.

1. Multi-Resolution Hash Encoding

At the core of iNGP is a spatial encoding that maps an input coordinate $x \in \mathbb{R}^d$ to a concatenated feature vector by querying several hash tables indexed at different spatial resolutions. For each level $\ell = 0, \ldots, L-1$, the input coordinate is quantized onto a $2^\ell$-spaced grid: $u_\ell = \lfloor 2^\ell x \rfloor \in \mathbb{Z}^d$. The high-dimensional grid index $u_\ell$ is reduced via modular hashing: $h_\ell(x) = (u_\ell \bmod T_\ell) \in \{0, \ldots, T_\ell - 1\}$, with $T_\ell$ typically $2^{19}$ for practical memory usage.

Each hash table stores $T_\ell$ trainable feature vectors $\mathbf{t}_\ell[k] \in \mathbb{R}^F$, where $F$ is usually $2$ for radiance field tasks and $4$ for SDFs. For a given $x$, the features at all $L$ levels are retrieved and concatenated:

$$f(x) = \left[\mathbf{t}_0[h_0(x)];\ \mathbf{t}_1[h_1(x)];\ \ldots;\ \mathbf{t}_{L-1}[h_{L-1}(x)]\right] \in \mathbb{R}^{L \cdot F}$$

The total parameter count for all hash tables is $P_{\text{hash}} = \sum_{\ell=0}^{L-1} T_\ell \cdot F$, typically in the low tens of millions even for complex scenes, which is still an order of magnitude smaller than the expanded input dimensionality of high-frequency Fourier or sinusoidal encodings. The trainable hash tables enable the architecture to adaptively allocate capacity to scene regions with high spatial complexity, handling local frequency variation without global basis expansion.
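The lookup-and-concatenate scheme above can be sketched in NumPy. Hyperparameters here are deliberately tiny so the sketch runs instantly, and the per-dimension reduction to a single scalar index is an assumption: the text's elementwise mod does not yield a scalar, so this sketch combines dimensions with prime multipliers and XOR in the style of the reference implementation before applying the mod.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative hyperparameters (the paper's defaults are larger; these are
# chosen small so the sketch runs instantly).
L, F, T, d = 4, 2, 2**10, 3  # levels, features per entry, table size, input dim

# One trainable table of T feature vectors per level.
tables = [rng.standard_normal((T, F)).astype(np.float32) for _ in range(L)]

# Per-dimension prime multipliers used by the reference spatial hash.
PRIMES = np.array([1, 2654435761, 805459861], dtype=np.int64)

def hash_encode(x):
    """Map x in [0, 1]^d to the concatenated L*F feature vector.

    Quantize onto a 2^l-spaced grid, reduce the integer grid index to a
    scalar via a prime-multiply/XOR spatial hash, then take it mod T.
    """
    feats = []
    for l, table in enumerate(tables):
        u = np.floor((2**l) * x).astype(np.int64)        # u_l = floor(2^l x)
        k = int(np.bitwise_xor.reduce(u * PRIMES) % T)   # h_l(x)
        feats.append(table[k])                           # t_l[h_l(x)]
    return np.concatenate(feats)                         # shape (L*F,)

f = hash_encode(np.array([0.3, 0.7, 0.1]))
print(f.shape)  # (8,) = L * F
```

In the real system the table entries are gradient-updated alongside the MLP, so collisions at coarse levels are disambiguated by finer levels rather than resolved explicitly.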

2. Compact Neural Network Architecture

The concatenated hash-encoded feature $f(x)$ serves as input to a compact multilayer perceptron (MLP) of depth four (hidden width $\approx 64$). The design for NeRF-style tasks uses:

  • Input: $L \cdot F$ features (e.g., $16 \times 2 = 32$).
  • Four hidden fully-connected layers of width 64, with ReLU activations.
  • A skip connection concatenates $f(x)$ to the activation of the third hidden layer.
  • Output heads: scalar density $\sigma$ (with softplus activation) and RGB color $\mathbf{c}$ (with sigmoid activation).

The forward pass is:

$$\begin{align*}
z^0 &= f(x) \\
h^1 &= \phi(W^1 z^0 + b^1) \\
h^2 &= \phi(W^2 h^1 + b^2) \\
h^3_{\text{pre}} &= \phi(W^3 h^2 + b^3) \\
h^3 &= [h^3_{\text{pre}};\, z^0] \\
h^4 &= \phi(W^4 h^3 + b^4) \\
\sigma &= \mathrm{softplus}(w^\sigma h^4 + b^\sigma) \\
\mathbf{c} &= \mathrm{sigmoid}(w^c h^4 + b^c)
\end{align*}$$

The total MLP parameter count is $\mathcal{O}(10^4)$, a reduction of several orders of magnitude compared to conventional neural field networks.
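The forward pass can be sketched directly in NumPy. This is a shape-level illustration with random, untrained weights (not the fused CUDA implementation); dimensions follow the defaults quoted in the text.

```python
import numpy as np

rng = np.random.default_rng(1)
in_dim, width = 32, 64   # L*F = 16*2 = 32 input features, width-64 layers

def dense(n_in, n_out):
    """Random weight matrix and zero bias for one fully-connected layer."""
    return rng.standard_normal((n_out, n_in)) * 0.1, np.zeros(n_out)

W1, b1 = dense(in_dim, width)
W2, b2 = dense(width, width)
W3, b3 = dense(width, width)
W4, b4 = dense(width + in_dim, width)   # layer 4 sees the skip concatenation
w_sigma, b_sigma = dense(width, 1)
w_c, b_c = dense(width, 3)

relu = lambda z: np.maximum(z, 0.0)
softplus = lambda z: np.log1p(np.exp(z))
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def forward(f_x):
    """Mirror the equations above for a single encoded point f(x)."""
    z0 = f_x
    h1 = relu(W1 @ z0 + b1)
    h2 = relu(W2 @ h1 + b2)
    h3_pre = relu(W3 @ h2 + b3)
    h3 = np.concatenate([h3_pre, z0])          # skip connection
    h4 = relu(W4 @ h3 + b4)
    sigma = softplus(w_sigma @ h4 + b_sigma)   # non-negative density
    c = sigmoid(w_c @ h4 + b_c)                # RGB in (0, 1)
    return sigma, c

sigma, c = forward(rng.standard_normal(in_dim))
```

Counting the weight matrices confirms the scale: $32{\cdot}64 + 64{\cdot}64 + 64{\cdot}64 + 96{\cdot}64 \approx 1.6 \times 10^4$ parameters, i.e. $\mathcal{O}(10^4)$.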

3. Optimization and Loss Functions

The hash tables $\{\mathbf{t}_\ell\}$ and the neural network weights $\Theta$ are jointly optimized via Adam. Learning rates are tuned independently for the hash ($\alpha_{\text{hash}}$) and network ($\alpha_{\text{net}}$) parameters (e.g., $10^{-2}$ and $10^{-3}$, respectively), with exponential decay.
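The two-rate schedule can be illustrated with plain gradient steps. Adam's moment estimates are omitted for brevity, and the toy loss and decay factor are illustrative assumptions; only the per-group learning rates come from the text.

```python
import numpy as np

# Toy parameter vectors standing in for hash tables and MLP weights.
hash_params = np.ones(4)
net_params = np.ones(4)

alpha_hash, alpha_net = 1e-2, 1e-3   # per-group rates, as quoted in the text
decay = 0.95                          # exponential decay per step (illustrative)

for step in range(10):
    # Gradients of a toy quadratic loss 0.5 * ||p||^2, i.e. grad = p.
    g_hash, g_net = hash_params, net_params
    scale = decay ** step
    hash_params = hash_params - alpha_hash * scale * g_hash
    net_params = net_params - alpha_net * scale * g_net
```

The larger hash-table rate reflects that each table entry receives sparse, localized gradients, so it can tolerate more aggressive updates than the densely-updated MLP weights.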

For NeRF reconstruction, the per-ray loss is:

$$\mathcal{L} = \sum_{p \in \text{pixels}} \left\| C_{\text{pred}}(p; \Theta, \{\mathbf{t}_\ell\}) - C_{\text{gt}}(p) \right\|^2$$

Here, $C_{\text{pred}}$ is the color rendered via volumetric integration, and $C_{\text{gt}}$ is the target pixel color.
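A minimal sketch of the volumetric integration behind $C_{\text{pred}}$ and the squared-error loss, using the standard NeRF-style quadrature. The sample densities, colors, and step sizes below are toy values; in practice samples come from ray marching, typically accelerated by an occupancy grid.

```python
import numpy as np

def render_ray(sigmas, colors, deltas):
    """Composite per-sample densities/colors into one pixel color:
      alpha_i = 1 - exp(-sigma_i * delta_i)
      T_i     = prod_{j<i} (1 - alpha_j)     (accumulated transmittance)
      C       = sum_i T_i * alpha_i * c_i
    """
    alphas = 1.0 - np.exp(-sigmas * deltas)
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))
    weights = trans * alphas
    return (weights[:, None] * colors).sum(axis=0)

def photometric_loss(pred_pixels, gt_pixels):
    """Sum of squared color errors over pixels, matching the loss above."""
    return float(np.sum((pred_pixels - gt_pixels) ** 2))

# Toy ray with 4 samples: only the middle two carry density.
sigmas = np.array([0.0, 5.0, 5.0, 0.0])
deltas = np.full(4, 0.25)
colors = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 1]], dtype=float)
C_pred = render_ray(sigmas, colors, deltas)
loss = photometric_loss(C_pred[None], np.array([[0.0, 1.0, 0.0]]))
```

Because the compositing weights are differentiable in $\sigma$ and $\mathbf{c}$, the loss gradient flows through the MLP into the hash-table entries touched by each sample.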

For SDF fitting:

$$\mathcal{L} = \sum_i \left|\Phi_{\text{pred}}(x_i) - \Phi_{\text{gt}}(x_i)\right|^2 + \lambda \sum_i \left|\, \|\nabla \Phi_{\text{pred}}(x_i)\| - 1 \,\right|^2$$

with the Eikonal constraint enforcing signed-distance regularity. No additional regularization is used; the hash tables function as implicit regularizers by concentrating capacity where needed.
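The SDF objective is straightforward to write down given predicted distances and their spatial gradients (in practice obtained by autodiff or finite differences). The weight $\lambda$ below is an illustrative assumption, not a value from the text.

```python
import numpy as np

def sdf_loss(phi_pred, phi_gt, grad_pred, lam=0.1):
    """Data term plus Eikonal penalty, matching the formula above.

    phi_pred, phi_gt : (N,) predicted / target signed distances
    grad_pred        : (N, 3) spatial gradients of the predicted SDF
    lam              : Eikonal weight (illustrative choice)
    """
    data = np.sum((phi_pred - phi_gt) ** 2)
    grad_norm = np.linalg.norm(grad_pred, axis=1)
    eikonal = np.sum((grad_norm - 1.0) ** 2)   # unit-gradient constraint
    return float(data + lam * eikonal)

# A perfect SDF (exact distances, unit-norm gradients) incurs zero loss.
phi = np.array([0.1, -0.2, 0.3])
grads = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]], dtype=float)
print(sdf_loss(phi, phi, grads))  # 0.0
```

The Eikonal term penalizes any deviation of $\|\nabla \Phi\|$ from 1, which is exactly the property distinguishing a true signed distance field from an arbitrary implicit function with the same zero level set.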

4. GPU Implementation and Real-Time Training

The full encoding and MLP are implemented as fully fused CUDA kernels. All hash lookups, feature concatenation, and MLP layers are composed in a single kernel launch, minimizing global memory traffic. For a batched inference/training pass, hash lookups from all levels for each point are performed in sequence, using coalesced memory access patterns, and exploiting 16-bit storage for both hash features and network weights to maximize arithmetic throughput.

This fused architecture eliminates intermediate memory reads/writes, reducing wasted bandwidth and increasing computational efficiency. On contemporary hardware, such as an NVIDIA A100, iNGP systems can converge to high-fidelity solutions in under $15$ seconds for standard NeRF scenes, with inference rendering of $1920 \times 1080$ images at over $200$ Hz (32 samples per ray).
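The memory effect of half-precision storage is easy to verify directly; the table shape below uses the defaults quoted earlier ($L = 16$, $T = 2^{19}$, $F = 2$). The kernel-fusion aspect itself cannot be demonstrated in Python, so this sketch covers only the storage side.

```python
import numpy as np

# Hash-table features stored at 16-bit precision halve memory footprint and
# traffic relative to float32.
L, T, F = 16, 2**19, 2
table_fp32 = np.zeros((L, T, F), dtype=np.float32)
table_fp16 = table_fp32.astype(np.float16)

print(table_fp32.nbytes // 2**20)  # 64 MiB
print(table_fp16.nbytes // 2**20)  # 32 MiB
```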

5. Empirical Performance and Comparison

Comparison against classic NeRF architectures (e.g., a $60$M-parameter MLP with large positional encodings):

  • Training time reduced from tens of minutes/hours to seconds.
  • Memory bandwidth requirements decrease by a factor of $\sim 20$.
  • Computation measured in FLOPs drops by a factor of $\sim 10$.
  • Reconstruction quality (PSNR, SSIM) improves: typical gains of $+1$ to $+2$ dB PSNR and several percentage points in SSIM.
  • Total hash+MLP storage (e.g., $20$M + $20$K parameters) is significantly below prior methods, while achieving equal or better expressivity.
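The storage figures in the last bullet can be checked with back-of-the-envelope arithmetic from the defaults in Sections 1 and 2 (biases omitted; the exact totals depend on configuration):

```python
# Hash tables: L levels, T entries per level, F features per entry.
hash_params = 16 * 2**19 * 2

# MLP: four hidden width-64 layers (with a 32-feature skip into layer 4)
# plus density and color heads; weight matrices only.
mlp_params = (32 * 64 + 64 * 64 + 64 * 64 + (64 + 32) * 64
              + 64 * 1 + 64 * 3)

print(hash_params)  # 16777216 -> "low tens of millions"
print(mlp_params)   # 16640    -> O(10^4)
```

The asymmetry is the point of the design: nearly all capacity lives in the cheap-to-query tables, while the MLP stays small enough to fuse into a single kernel.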

This encoding supports strong locality and data adaptivity, handling real-world detail (e.g., thin structures, fine gradients) with greater parameter and runtime efficiency.

6. Extensions and Variants

The multi-resolution hash encoding and fast sampling MLP framework has been adapted for diverse settings:

  • Satellite imagery 3D reconstruction: SAT-NGP (Billouard et al., 27 Mar 2024) couples iNGP's hash-encoding and occupancy grid sampling to achieve relightable, transient-free neural reconstructions of satellite-captured urban scenes, converging in $8$–$14$ minutes on a single $12$ GB GPU. The model replaces classical, computationally intense NeRF backbones (e.g., an 8-layer, $512$-unit MLP) with a $2 \times 64$ MLP, integrates robust loss reweighting, and supports explicit lighting vector injection for dynamic relighting.
  • Simplex-based encodings (Wen et al., 2023): These generalize the grid hash to $n$-simplices, reducing neighbor lookup to $n+1$ (from $2^n$ in grids), with skewed coordinate transforms and barycentric interpolation. Empirical results show a $\sim 9.4\%$ speedup in $2$D image fitting, and up to a $41.2\%$ speedup on dense volumetric tasks, with equivalent quality and lower memory waste as dimension increases.

| Method | Hash Structure | Neighbor Count ($n$D) | Typical Speedup | Quality |
|---|---|---|---|---|
| iNGP (grid) | Grid | $2^n$ | Baseline | Reference |
| Simplex-based | Simplices | $n+1$ | $\sim 1.1$–$1.4\times$ | Same PSNR/SSIM |
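The neighbor-count gap in the table grows quickly with dimension, since interpolation in a grid cell touches all $2^n$ corners while an $n$-simplex has only $n+1$ vertices:

```python
# Interpolation-neighbor counts per sample: grid corners vs. simplex vertices.
neighbors = {n: (2**n, n + 1) for n in (2, 3, 4, 5)}
for n, (grid, simplex) in neighbors.items():
    print(n, grid, simplex)  # e.g. 5D: 32 corners vs. 6 vertices
```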

This suggests that as problem dimensionality rises, simplex-based hash encoding offers substantial scalability advantages over grid-based iNGP, both in memory and computational efficiency.

7. Significance and Theoretical Insights

The essential mechanism underlying iNGP's speed and quality is the separation of concerns between spatial encoding and neural modeling. By leveraging trainable, multi-resolution hash tables, the architecture conditions the MLP on features that are spatially and spectrally adapted to the task, relieving the network from having to untangle high-dimensional frequency content globally. Local scene complexity is handled by the hash features, while the MLP decodes these into outputs (density, color, signed distance) with minimal depth and parameterization.

In contrast to fixed high-frequency embeddings, which require wide and deep networks to achieve similar expressivity, iNGP architectures achieve efficient gradient flow, compact storage, and hardware-level parallelism. This combination creates a practical path for real-time neural rendering, rapid fitting of 3D scenes, and on-the-fly adaptation of graphics primitives across application domains.

A plausible implication is that as implementations continue to optimize memory access patterns (e.g., fusing occupancy-grid updates, quantizing hash tables), and as further hierarchical or non-grid structures (simplex, tree, etc.) are explored, the iNGP paradigm will remain central to the emerging field of neural graphics primitives, especially for applications requiring both high fidelity and interactive rates.
