Fourier PlenOctree for Dynamic Radiance Fields

Updated 11 May 2026
  • The paper introduces a dynamic radiance field framework that uses truncated Fourier series to compress temporal data and enable efficient representations.
  • It employs an adaptive union-octree structure to merge per-frame neural data, ensuring spatial-temporal consistency and reduced computational costs.
  • Achieving up to 100 fps on modern GPUs, FPO delivers state-of-the-art image quality and memory efficiency for interactive dynamic scene rendering.

Fourier PlenOctree (FPO) is a representation and rendering framework for efficient neural modeling and real-time rendering of dynamic radiance fields in free-view video sequences. FPO combines generalizable Neural Radiance Fields (NeRF), PlenOctree spatial data structures, volumetric blending, and temporal Fourier analysis to achieve compact storage and significantly accelerated rendering, particularly for dynamic scenes with non-rigid motion (Wang et al., 2022).

1. Mathematical Framework

FPO represents the time-varying density $\sigma(p, t)$ and the color spherical harmonics (SH) coefficients $\mathbf{z}(p, t) \in \mathbb{R}^{(\ell_{\max}+1)^2 \times 3}$ at a spatial location $p = (x, y, z)$ and discrete time $t \in \{1, \ldots, T\}$ using truncated real discrete Fourier series expansions along the temporal axis:

$$\sigma(p, t) = \sum_{i=0}^{n_1 - 1} k_i^\sigma(p)\, \mathrm{IDFT}_i(t), \quad z_{m, \ell}(p, t) = \sum_{i=0}^{n_2 - 1} k^z_{m, \ell, i}(p)\, \mathrm{IDFT}_i(t),$$

with

$$\mathrm{IDFT}_i(t) = \begin{cases} \cos\left(\frac{i\pi}{T} t\right) & i \text{ even}, \\ \sin\left(\frac{(i+1)\pi}{T} t\right) & i \text{ odd}. \end{cases}$$

The per-leaf Fourier coefficients $k_i^\sigma(p)$ and $k^z_{m, \ell, i}(p)$ are stored for each spatial voxel. Typically $n_1, n_2 \ll T$, which amounts to low-pass temporal filtering: storage and computation shrink, while the dominant temporal variation of the dynamic content is still captured.
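As a minimal sketch of the expansion above, the following NumPy code builds the basis functions $\mathrm{IDFT}_i(t)$ and evaluates the truncated series for a set of stored coefficients. The function names `idft_basis` and `reconstruct` are illustrative choices, not names from the paper.

```python
import numpy as np

def idft_basis(n_coeffs: int, t: np.ndarray, T: int) -> np.ndarray:
    """Return B of shape (len(t), n_coeffs) with B[:, i] = IDFT_i(t)."""
    t = np.asarray(t, dtype=np.float64)[:, None]
    i = np.arange(n_coeffs)[None, :]
    even = np.cos(i * np.pi * t / T)          # used where i is even
    odd = np.sin((i + 1) * np.pi * t / T)     # used where i is odd
    return np.where(i % 2 == 0, even, odd)

def reconstruct(coeffs: np.ndarray, t: np.ndarray, T: int) -> np.ndarray:
    """Evaluate sum_i k_i * IDFT_i(t) for coefficients shaped (..., n)."""
    B = idft_basis(coeffs.shape[-1], t, T)    # (len(t), n)
    return coeffs @ B.T                       # (..., len(t))
```

With only the $i = 0$ coefficient non-zero, `reconstruct` returns a constant signal, which corresponds to a static voxel.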

At test time, the canonical NeRF volume rendering equation is modified so that both $\sigma$ and $\mathbf{z}$ are reconstructed by IDFT sums from the stored coefficients. For a camera ray $\mathbf{r}$ with viewing direction $\mathbf{d}$, sampled at points $p_1, \ldots, p_N$ with spacings $\delta_i$,

$$\hat{C}(\mathbf{r}, t) = \sum_{i=1}^{N} T_i \left(1 - \exp\left(-\sigma(p_i, t)\, \delta_i\right)\right) \mathbf{c}(p_i, \mathbf{d}, t),$$

with accumulated transmittance

$$T_i = \exp\left(-\sum_{j=1}^{i-1} \sigma(p_j, t)\, \delta_j\right)$$

and

$$\mathbf{c}(p, \mathbf{d}, t) = S\left(\sum_{\ell=0}^{\ell_{\max}} \sum_{m=-\ell}^{\ell} z_{m, \ell}(p, t)\, Y_{m, \ell}(\mathbf{d})\right),$$

where $Y_{m, \ell}$ denote real spherical harmonics and $S$ is a sigmoid function.
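The compositing step of standard NeRF-style volume rendering can be sketched as below, assuming the per-sample densities and colors along one ray have already been reconstructed for the requested time. The helper name `composite` and the array shapes are assumptions made for illustration.

```python
import numpy as np

def composite(sigma: np.ndarray, rgb: np.ndarray, delta: np.ndarray) -> np.ndarray:
    """Composite N samples along one ray.

    sigma: (N,) densities, rgb: (N, 3) colors, delta: (N,) sample spacings.
    """
    tau = sigma * delta
    alpha = 1.0 - np.exp(-tau)
    # Transmittance before sample i: exclusive cumulative sum of optical depth.
    depth_before = np.concatenate([[0.0], np.cumsum(tau)[:-1]])
    trans = np.exp(-depth_before)
    weights = trans * alpha                   # contribution of each sample
    return weights @ rgb                      # (3,) pixel color
```

The weights never sum to more than one, so an empty ray yields black and a ray that hits an opaque sample takes that sample's color.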

2. Octree-Based Data Structure and Temporal Fusion

FPO leverages an adaptive octree to partition the spatial domain, with each leaf voxel storing the associated sets of Fourier coefficients for temporal density and SH color. In contrast to standard PlenOctree methods, which store only static values per leaf, FPO’s per-leaf vectors encode the Fourier expansions of $\sigma(p, t)$ and $\mathbf{z}(p, t)$.

For dynamic sequences, an independent PlenOctree is first built per frame. To establish a temporally consistent representation, the per-frame octrees are merged (“unioned”): a node in the output structure is subdivided if it is subdivided in any input frame, producing a union-octree topology that supports all observed geometries across time. For each leaf in the union, the time series $\{\sigma(p, t), \mathbf{z}(p, t)\}_{t=1}^{T}$ is extracted from the per-frame trees (inheriting from coarser parents where a frame has no node at that depth), and a real DFT is applied along $t$ to yield the stored Fourier coefficients.
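The union and coefficient extraction can be sketched under simplifying assumptions: each per-frame octree is represented by the set of its subdivided node paths (tuples of child indices from the root) plus a dictionary of scalar leaf values, and the real DFT is replaced by a least-squares fit onto the truncated basis. All names here are hypothetical, and the data layout is not the one used by the authors.

```python
import numpy as np

def basis(n, T):
    """Matrix B with B[t-1, i] = IDFT_i(t) for t = 1..T."""
    t = np.arange(1, T + 1, dtype=np.float64)[:, None]
    i = np.arange(n)[None, :]
    return np.where(i % 2 == 0,
                    np.cos(i * np.pi * t / T),
                    np.sin((i + 1) * np.pi * t / T))

def union_topology(frames_internal):
    """A node is subdivided in the union if it is subdivided in any frame."""
    merged = set()
    for internal in frames_internal:
        merged |= internal
    return merged

def leaves_of(internal):
    """Leaves are children of subdivided nodes that are not subdivided."""
    if not internal:
        return [()]
    return sorted(p + (c,) for p in internal for c in range(8)
                  if p + (c,) not in internal)

def lookup(frame_leaves, path):
    """Value of the nearest ancestor-or-self leaf stored in one frame."""
    for depth in range(len(path), -1, -1):
        if path[:depth] in frame_leaves:
            return frame_leaves[path[:depth]]
    raise KeyError(path)

def leaf_coeffs(frames_leaves, path, n):
    """Fit n Fourier coefficients to the time series of one union leaf."""
    T = len(frames_leaves)
    series = np.array([lookup(f, path) for f in frames_leaves], dtype=np.float64)
    k, *_ = np.linalg.lstsq(basis(n, T), series, rcond=None)
    return k
```

Because the union is at least as fine as every input tree, `lookup` always finds a value by walking up toward the root, which mirrors the inheritance from coarse parents described above.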

3. Algorithmic Pipeline

The FPO construction pipeline comprises three main stages:

  1. Coarse-to-Fine PlenOctree Fusion (per frame):
    • Coarse stage: Starting from initial voxels, silhouettes from 6 views are used to form a visual hull (shape-from-silhouette), and voxels outside the hull are pruned. For the remaining nodes, a generalizable NeRF network (e.g., IBRNet) is queried over sample viewpoints, and the predictions are averaged to estimate view-independent density $\sigma$ and SH coefficients $\mathbf{z}$.
    • Fine stage: Using 100 synthetic views rendered from the current tree, further refinement is performed by fusing network predictions with observed pixel colors via weighted averages based on transmittance, iteratively updating density and SH coefficients.
  2. Union-Octree and DFT Coefficient Extraction:
    • The result of the first stage is one static PlenOctree per frame. The union octree topology is computed and, for each leaf of the union, the time series of density and SH color coefficients are collected and transformed via DFT into the final Fourier coefficients $k^\sigma$ and $k^z$.
  3. Optional Differentiable Fine-Tuning:
    • As the DFT–IDFT chain is differentiable, the explicit per-leaf Fourier coefficients are fine-tuned via Adam to minimize the per-pixel L2 reconstruction loss between rendered images and ground truth, back-propagating through the volumetric rendering equation.

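The process is summarized in the following pseudocode:

    for each frame t in 1..T:
        tree[t] = coarse PlenOctree from visual-hull pruning and generalizable-NeRF queries
        tree[t] = fine fusion of tree[t] with network predictions and observed pixel colors
    U = union of the topologies of tree[1..T]
    for each leaf v in U:
        collect sigma(v, t) and z(v, t) for t = 1..T, inheriting from coarser parents where missing
        k_sigma(v), k_z(v) = first n_1 and n_2 coefficients of the real DFT along t
    optionally fine-tune k_sigma, k_z with Adam on the L2 rendering loss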
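The visual-hull pruning used in the coarse stage can be illustrated with a short shape-from-silhouette sketch. It assumes each view is given as a 3×4 projection matrix and a boolean silhouette mask; the function name `carve` and this camera format are assumptions for illustration, not details taken from the paper.

```python
import numpy as np

def carve(centers, projections, masks):
    """Keep voxels whose centers project inside every silhouette.

    centers: (N, 3) voxel centers, projections: list of (3, 4) matrices,
    masks: list of (H, W) boolean silhouettes, one per view.
    """
    n = len(centers)
    keep = np.ones(n, dtype=bool)
    homo = np.concatenate([centers, np.ones((n, 1))], axis=1)   # (N, 4)
    for P, mask in zip(projections, masks):
        uvw = homo @ P.T                                        # (N, 3)
        w = uvw[:, 2]
        valid = w > 1e-9                      # in front of the camera
        safe_w = np.where(valid, w, 1.0)      # avoid division by zero
        u = np.round(uvw[:, 0] / safe_w).astype(int)
        v = np.round(uvw[:, 1] / safe_w).astype(int)
        h, wd = mask.shape
        inside = valid & (u >= 0) & (u < wd) & (v >= 0) & (v < h)
        hit = np.zeros(n, dtype=bool)
        hit[inside] = mask[v[inside], u[inside]]
        keep &= hit                           # pruned if any view misses
    return keep
```

A voxel survives only if all views agree, so adding views can only shrink the hull.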

4. Network Architecture and Training Details

A small MLP (termed “Fourier NeRF-SH”) can be trained to predict the per-voxel temporal Fourier coefficients $k^\sigma$ and $k^z$ from spatial coordinates $p$. In practice, explicit coefficient storage is preferred over direct use of this implicit network, with the MLP primarily serving in the bootstrap phase for initial estimation during the coarse-to-fine fusion.

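The fine-tuning stage employs a per-pixel L2 reconstruction loss:

$$\mathcal{L} = \sum_{t=1}^{T} \sum_{\mathbf{r} \in \mathcal{R}_t} \left\| \hat{C}(\mathbf{r}, t) - C(\mathbf{r}, t) \right\|_2^2,$$

where $\hat{C}(\mathbf{r}, t)$ is the rendered color of ray $\mathbf{r}$ at frame $t$, $C(\mathbf{r}, t)$ is the corresponding ground-truth pixel color, and $\mathcal{R}_t$ is the set of training rays for frame $t$. The loss is optimized with Adam for 10–20 minutes, with early stopping based on PSNR/LPIPS convergence.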
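To illustrate how explicit coefficients can be updated by Adam under an L2 loss, the sketch below fits the coefficients of a single linear model `B @ k` to a target signal. This is a deliberately reduced stand-in: the actual fine-tuning back-propagates through volume rendering, whereas here the loss is applied directly to the reconstructed signal. The function name `adam_fit`, the learning rate, and the step count are all assumed values.

```python
import numpy as np

def adam_fit(B, target, k0, lr=0.05, steps=500, b1=0.9, b2=0.999, eps=1e-8):
    """Minimize mean((B @ k - target)^2) over k with a hand-written Adam loop."""
    k = np.asarray(k0, dtype=np.float64).copy()
    m = np.zeros_like(k)                      # first-moment estimate
    v = np.zeros_like(k)                      # second-moment estimate
    for step in range(1, steps + 1):
        residual = B @ k - target
        grad = 2.0 * B.T @ residual / len(target)
        m = b1 * m + (1.0 - b1) * grad
        v = b2 * v + (1.0 - b2) * grad ** 2
        m_hat = m / (1.0 - b1 ** step)        # bias correction
        v_hat = v / (1.0 - b2 ** step)
        k -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return k
```

In a full system the gradient would come from automatic differentiation of the rendering equation rather than from the closed-form expression used here.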

5. Real-Time Rendering and System Performance

At test time, FPO enables real-time free-viewpoint rendering of dynamic scenes. For a desired timestamp $t$ and view, rays are cast through the FPO structure. At each sample, density and SH coefficients are reconstructed via IDFT sums from the stored coefficients, and color is evaluated as a 9-term SH dot product (i.e., $\ell_{\max} = 2$) followed by a sigmoid. Alpha compositing with early ray termination keeps the per-ray cost low. On an RTX 3090, rendering reaches about 100 fps.
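A per-ray sketch of this loop is shown below, assuming the density and the 9×3 SH coefficients of each sample have already been reconstructed for the requested timestamp. The SH sign convention, the termination threshold `stop`, and the function names are assumptions; a real-time implementation would run this on the GPU rather than in a Python loop.

```python
import numpy as np

def sh9(d):
    """Real spherical harmonics up to degree 2 (9 terms) for direction d."""
    x, y, z = np.asarray(d, dtype=np.float64) / np.linalg.norm(d)
    return np.array([
        0.28209479177387814,
        -0.4886025119029199 * y,
        0.4886025119029199 * z,
        -0.4886025119029199 * x,
        1.0925484305920792 * x * y,
        -1.0925484305920792 * y * z,
        0.31539156525252005 * (2.0 * z * z - x * x - y * y),
        -1.0925484305920792 * x * z,
        0.5462742152960396 * (x * x - y * y),
    ])

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def render_ray(sigma, z, d, delta, stop=1e-3):
    """sigma: (N,), z: (N, 9, 3), delta: (N,). Returns (color, samples used)."""
    Y = sh9(d)
    color = np.zeros(3)
    trans = 1.0                               # transmittance so far
    used = 0
    for s, zi, dl in zip(sigma, z, delta):
        if trans < stop:                      # early ray termination
            break
        alpha = 1.0 - np.exp(-s * dl)
        color += trans * alpha * sigmoid(Y @ zi)
        trans *= 1.0 - alpha
        used += 1
    return color, used
```

Once the ray hits a nearly opaque sample, the remaining samples are skipped, which is where most of the savings on surface-like content come from.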

Quantitative Performance

Key performance comparisons on dynamic multi-view sequences (60 frames, 60 views):

| Method | FPS | Training time | Memory |
|---|---|---|---|
| NeRF | 0.03 | 2 days | |
| NeuralVolumes | 2.3 | 6 hours | |
| ST-NeRF | 0.04 | 12 hours | |
| iButter | 3.5 | 20 minutes | |
| FPO | 100 | 2 hours | ≈7.3 GB |

At the reported grid configuration, the full memory footprint of the union-octree is approximately 7.3 GB. In rendering speed, FPO is over an order of magnitude faster than the other methods compared (100 fps versus 3.5 fps for iButter, the next fastest) while maintaining high visual fidelity.
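The coefficient storage scales with the number of leaves and the truncation lengths: each leaf holds $n_1$ density coefficients plus $n_2$ coefficients for each of the $(\ell_{\max}+1)^2 \times 3$ SH channels. The sketch below evaluates that count; the function name `fpo_bytes`, the 4-byte value size, and the example numbers are hypothetical, and octree indexing overhead is ignored.

```python
def fpo_bytes(num_leaves: int, n1: int, n2: int, l_max: int,
              bytes_per_value: int = 4) -> int:
    """Bytes needed for per-leaf Fourier coefficients (no tree overhead)."""
    sh_channels = 3 * (l_max + 1) ** 2        # SH terms times RGB
    per_leaf = n1 + n2 * sh_channels          # density + color coefficients
    return num_leaves * per_leaf * bytes_per_value

# Hypothetical configuration, chosen only to show the arithmetic.
example = fpo_bytes(num_leaves=1_000_000, n1=31, n2=5, l_max=2)
print(f"{example / 1e9:.3f} GB")
```

Because the color term is multiplied by the number of SH channels, lowering $n_2$ reduces the footprint far more than lowering $n_1$ by the same amount.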

Quantitative Image Quality (held-out, 5 real/5 synthetic sequences)

| Model | PSNR | SSIM | MAE | LPIPS |
|---|---|---|---|---|
| NeuralBody | 27.3 | 0.94 | 0.0123 | 0.037 |
| NeuralVolumes | 23.6 | 0.92 | 0.0251 | 0.088 |
| ST-NeRF | 30.6 | 0.95 | 0.0092 | 0.032 |
| iButter | 33.8 | 0.96 | 0.0054 | 0.030 |
| FPO | 35.2 | 0.991 | 0.0033 | 0.022 |

Qualitative observations include accurate preservation of sharp details under fast non-rigid motion (e.g., hair, clothing folds), absence of temporal flicker, and efficient temporal compression, since low-frequency components dominate the temporal signal.

6. Context and Significance

Fourier PlenOctree is notable for its integration of volumetric neural representations, spatial acceleration structures, and temporal signal processing. Its design facilitates:

  • Real-time rendering of unseen dynamic scenes in free-view video settings
  • Compact storage of long time sequences via low-frequency Fourier coding in the time domain
  • State-of-the-art image fidelity and temporal consistency

Relative to conventional NeRF, FPO renders roughly three orders of magnitude faster (100 fps versus 0.03 fps in the comparison above) and substantially reduces memory overhead for dynamic scenes, making it applicable to interactive graphics, virtual/augmented reality, and volumetric video systems (Wang et al., 2022).

The FPO architecture extends PlenOctree (Yu et al., 2021) via its temporal handling and builds upon generalizable NeRF techniques such as IBRNet for bootstrap fusion. It fits within the broader research thrust of neural scene representations for dynamic content, complementing methods like NeuralVolumes, NeuralBody, ST-NeRF, and iButter. Its synthesis of spatial adaptation, temporal basis compression, and explicit coefficient fine-tuning embodies a modular approach to real-time neural rendering for dynamic environments.
