Fourier PlenOctree for Dynamic Radiance Fields
- The paper introduces a dynamic radiance field framework that uses truncated Fourier series to compress temporal data and enable efficient representations.
- It employs an adaptive union-octree structure to merge per-frame neural data, ensuring spatial-temporal consistency and reduced computational costs.
- Achieving up to 100 fps on modern GPUs, FPO delivers state-of-the-art image quality and memory efficiency for interactive dynamic scene rendering.
Fourier PlenOctree (FPO) is a representation and rendering framework designed for efficient neural modeling and real-time rendering of dynamic radiance fields in free-view video sequences. FPO combines generalized Neural Radiance Fields (NeRF), PlenOctree spatial data structures, volumetric blending, and temporal Fourier analysis to achieve compact storage and significantly accelerated rendering, particularly for dynamic scenes with non-rigid motion (Wang et al., 2022).
1. Mathematical Framework
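FPO represents the time-varying density $\sigma$ and color spherical harmonics (SH) coefficients $z_{m,l}$ (per color channel) at a spatial location and discrete time $t$ using truncated real discrete Fourier series expansions along the temporal axis:
$$\sigma(t; \mathbf{k}^{\sigma}) = \sum_{i=0}^{n_1-1} k_i^{\sigma}\,\mathrm{IDFT}_i(t), \qquad z_{m,l}(t; \mathbf{k}^{z}) = \sum_{i=0}^{n_2-1} k_{m,l,i}^{z}\,\mathrm{IDFT}_i(t)$$
with
$$\mathrm{IDFT}_i(t) = \begin{cases} \cos\left(\dfrac{i\pi}{T}\,t\right) & \text{if } i \text{ is even} \\ \sin\left(\dfrac{(i+1)\pi}{T}\,t\right) & \text{if } i \text{ is odd} \end{cases}$$
where $T$ is the number of frames. The per-leaf Fourier coefficients $\mathbf{k}^{\sigma}$ and $\mathbf{k}^{z}$ are stored for each spatial voxel. Typically, fewer coefficients than frames are kept ($n_1, n_2 < T$), which amounts to low-pass temporal filtering: storage and computation shrink, yet the retained terms suffice to capture the dominant temporal variation in dynamic content.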
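As a rough illustration of how such a truncated series is evaluated, the Python sketch below assumes a real basis that alternates cosine terms (even index) and sine terms (odd index) over a sequence of `num_frames` frames. The function names and the toy coefficient values are hypothetical and are not taken from the FPO code base.

```python
import math

def idft_basis(i, t, num_frames):
    """Real Fourier basis function: cosine for even i, sine for odd i."""
    if i % 2 == 0:
        return math.cos(i * math.pi * t / num_frames)
    return math.sin((i + 1) * math.pi * t / num_frames)

def reconstruct(coeffs, t, num_frames):
    """Evaluate a truncated Fourier series at frame t from stored coefficients."""
    return sum(k * idft_basis(i, t, num_frames) for i, k in enumerate(coeffs))

# Hypothetical leaf: a constant density plus one slow oscillation.
k_sigma = [2.0, 0.5, 0.0]
density_over_time = [reconstruct(k_sigma, t, 60) for t in range(60)]
```

Only `len(k_sigma)` numbers are stored for the leaf, regardless of how many frames the sequence contains.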
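The canonical NeRF volume rendering equation is modified at test time so that both the density $\sigma$ and the SH coefficients $z_{m,l}$ are reconstructed by IDFT sums from stored coefficients. For a camera ray $\mathbf{r}$ with viewing direction $\mathbf{d}$, sampled at $N$ points with spacings $\delta_j$,
$$\hat{C}(\mathbf{r}, t) = \sum_{j=1}^{N} T_j \left(1 - \exp\left(-\sigma_j(t)\,\delta_j\right)\right) c_j(t)$$
with
$$T_j = \exp\left(-\sum_{j'=1}^{j-1} \sigma_{j'}(t)\,\delta_{j'}\right)$$
and
$$c_j(t) = S\left(\sum_{l=0}^{l_{\max}} \sum_{m=-l}^{l} z_{m,l}^{(j)}(t)\, Y_l^m(\mathbf{d})\right)$$
where $T_j$ is the accumulated transmittance up to sample $j$, $Y_l^m$ denote real spherical harmonics, and $S$ is a sigmoid function.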
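The color term can be illustrated with a small Python sketch that evaluates the real SH basis up to degree 2 (9 functions) and applies a sigmoid to the dot product with the coefficients of one color channel. The constants follow one common sign convention for real spherical harmonics; conventions differ between libraries, so the ordering and signs here are an assumption, as are the function names.

```python
import math

# Real SH constants for degrees 0..2 (9 basis functions).
# Sign and ordering convention is an assumption of this sketch.
C0 = 0.28209479177387814
C1 = 0.4886025119029199
C2 = (1.0925484305920792, -1.0925484305920792, 0.31539156525252005,
      -1.0925484305920792, 0.5462742152960396)

def sh_basis(d):
    """Evaluate the 9 real SH basis functions at unit direction d = (x, y, z)."""
    x, y, z = d
    return [
        C0,
        -C1 * y, C1 * z, -C1 * x,
        C2[0] * x * y, C2[1] * y * z, C2[2] * (2 * z * z - x * x - y * y),
        C2[3] * x * z, C2[4] * (x * x - y * y),
    ]

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def sh_color(z_coeffs, d):
    """One color channel: sigmoid of the SH dot product."""
    return sigmoid(sum(zc * yb for zc, yb in zip(z_coeffs, sh_basis(d))))

# Hypothetical coefficients: only the constant (degree-0) term is non-zero.
channel = sh_color([1.0] + [0.0] * 8, (0.0, 0.0, 1.0))
```

In the dynamic setting, each entry of `z_coeffs` would itself be reconstructed from its temporal Fourier coefficients before this evaluation.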
2. Octree-Based Data Structure and Temporal Fusion
FPO leverages an adaptive octree to partition the spatial domain, with each leaf voxel storing the associated sets of Fourier coefficients for temporal density and SH color. In contrast to standard PlenOctree methods, which store only static values per leaf, FPO’s per-leaf vectors encode the Fourier expansions of the time-varying density and SH coefficients.
For dynamic sequences, an independent PlenOctree is first built per frame. To establish a temporally consistent representation, the octrees from all frames are merged (“unioned”): a node in the output structure is subdivided if it is subdivided in any input frame, producing a union-octree topology that supports all observed geometries across time. For each leaf in the union, the time series of density and SH coefficients is extracted from the per-frame trees (inheriting values from a coarser parent where a frame lacks the finer node), and a real DFT is applied along the time axis to yield the stored Fourier coefficients.
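The union rule and the inheritance from coarse parents can be sketched in a few lines of Python. The tree encoding used here (a subdivided node is a list of 8 children, anything else is a leaf holding a single value) is a toy assumption for illustration, as are all function names; a real PlenOctree stores density and SH vectors in a packed GPU layout.

```python
from functools import reduce

# Toy encoding (assumption): a subdivided node is a list of 8 children;
# anything else is a leaf (None in a topology tree, a value in a frame tree).
def is_leaf(node):
    return not isinstance(node, list)

def union_topology(a, b):
    """Subdivide a node if it is subdivided in either input tree."""
    if is_leaf(a) and is_leaf(b):
        return None
    kids_a = a if not is_leaf(a) else [None] * 8
    kids_b = b if not is_leaf(b) else [None] * 8
    return [union_topology(ca, cb) for ca, cb in zip(kids_a, kids_b)]

def leaf_paths(node, prefix=()):
    """Yield the child-index path of every leaf in a topology tree."""
    if is_leaf(node):
        yield prefix
    else:
        for idx, child in enumerate(node):
            yield from leaf_paths(child, prefix + (idx,))

def lookup(node, path):
    """Walk down a per-frame tree; stop at a coarse leaf and inherit its value."""
    for idx in path:
        if is_leaf(node):
            break
        node = node[idx]
    return node

# Three hypothetical frames with different refinement.
frame0 = [1.0] * 8
frame1 = [[2.0] * 8] + [3.0] * 7
frame2 = [4.0] * 7 + [[5.0] * 8]
frames = [frame0, frame1, frame2]

union = reduce(union_topology, frames)
series = {p: [lookup(f, p) for f in frames] for p in leaf_paths(union)}
```

Each entry of `series` is the per-leaf time series that would then be passed to the DFT.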
3. Algorithmic Pipeline
The FPO construction pipeline comprises three main stages:
- Coarse-to-Fine PlenOctree Fusion (per frame):
  - Coarse stage: Starting from an initial voxel grid, silhouettes from 6 views are used to form a visual hull (shape-from-silhouette), and voxels outside the hull are pruned. For the remaining nodes, a generalizable NeRF network (e.g., IBRNet) is queried over sample viewpoints, and its predictions are averaged to estimate view-independent density and SH coefficients.
  - Fine stage: Using 100 synthetic views rendered from the current tree, further refinement is performed by fusing network predictions with observed pixel colors via weighted averages based on transmittance, iteratively updating density and SH coefficients.
- Union-Octree and DFT Coefficient Extraction:
  - The result of the fusion stage is one static PlenOctree per frame. The union topology of these octrees is computed and, for each unified leaf, the time series of density and color coefficients are collected and transformed via DFT into the final Fourier coefficients for density and SH color.
- Optional Differentiable Fine-Tuning:
  - As the DFT–IDFT chain is differentiable, the explicit per-leaf Fourier coefficients are fine-tuned via Adam to minimize the per-pixel L2 reconstruction loss between rendered images and ground truth, back-propagating through the volumetric rendering equation.
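In summary, the pipeline builds one PlenOctree per frame by coarse-to-fine fusion, merges the per-frame trees into a union octree, applies a DFT to each leaf's time series, and optionally fine-tunes the resulting coefficients.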
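The DFT coefficient extraction step can be sketched as a projection of a leaf's time series onto a real Fourier basis. The sketch assumes a basis that alternates cosine (even index) and sine (odd index) terms, and it assumes fewer retained coefficients than frames so that the basis functions stay mutually orthogonal over the frame indices. `fit_coeffs` and the sample values are hypothetical names and data.

```python
import math

def basis(i, t, num_frames):
    """Real Fourier basis: cosine for even i, sine for odd i."""
    if i % 2 == 0:
        return math.cos(i * math.pi * t / num_frames)
    return math.sin((i + 1) * math.pi * t / num_frames)

def fit_coeffs(series, n):
    """Project a length-T series onto the first n basis functions (requires n < T)."""
    num_frames = len(series)
    coeffs = []
    for i in range(n):
        dot = sum(x * basis(i, t, num_frames) for t, x in enumerate(series))
        norm = sum(basis(i, t, num_frames) ** 2 for t in range(num_frames))
        coeffs.append(dot / norm)
    return coeffs

# Hypothetical leaf: build a series from known coefficients, then recover them.
true_k = [2.0, 0.5, -0.25]
series = [sum(k * basis(i, t, 8) for i, k in enumerate(true_k)) for t in range(8)]
recovered = fit_coeffs(series, 3)
low_pass = fit_coeffs(series, 1)  # keeps only the temporal mean
```

Keeping fewer coefficients than the signal contains, as in `low_pass`, discards the higher-frequency variation, which is the compression trade-off the truncation makes.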
4. Network Architecture and Training Details
A small MLP (termed “Fourier NeRF-SH”) can be trained to predict the per-voxel temporal Fourier coefficients for density and SH color from the spatial coordinates of a point. In practice, FPO stores the coefficients explicitly rather than relying on this implicit network; the MLP serves mainly in the bootstrap phase, providing initial estimates during the coarse-to-fine fusion.
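The fine-tuning stage employs a per-pixel L2 reconstruction loss over the set of training rays $\mathcal{R}$:
$$\mathcal{L} = \sum_{\mathbf{r} \in \mathcal{R}} \left\lVert \hat{C}(\mathbf{r}) - C(\mathbf{r}) \right\rVert_2^2$$
where $\hat{C}(\mathbf{r})$ is the rendered color and $C(\mathbf{r})$ the ground-truth pixel color. The loss is optimized with Adam for 10–20 minutes, with early stopping based on PSNR/LPIPS convergence.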
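To make the idea of optimizing explicit Fourier coefficients concrete, the toy Python sketch below runs a hand-written Adam update on the coefficients of a single leaf. It is a deliberate simplification: the squared error is taken directly between reconstructed and target values, whereas FPO back-propagates an image loss through volumetric rendering. The learning rate, step count, basis convention, and all names are assumptions of this sketch.

```python
import math

def basis(i, t, num_frames):
    """Real Fourier basis: cosine for even i, sine for odd i."""
    if i % 2 == 0:
        return math.cos(i * math.pi * t / num_frames)
    return math.sin((i + 1) * math.pi * t / num_frames)

def loss_and_grad(k, targets):
    """Squared error of the reconstructed series and its gradient w.r.t. k."""
    num_frames = len(targets)
    loss, grad = 0.0, [0.0] * len(k)
    for t, y in enumerate(targets):
        b = [basis(i, t, num_frames) for i in range(len(k))]
        residual = sum(ki * bi for ki, bi in zip(k, b)) - y
        loss += residual * residual
        for i, bi in enumerate(b):
            grad[i] += 2.0 * residual * bi
    return loss, grad

def adam_fit(k, targets, lr=0.05, steps=500, b1=0.9, b2=0.999, eps=1e-8):
    """Standard Adam update applied to the coefficient list k (modified in place)."""
    m, v = [0.0] * len(k), [0.0] * len(k)
    for step in range(1, steps + 1):
        _, g = loss_and_grad(k, targets)
        for i in range(len(k)):
            m[i] = b1 * m[i] + (1 - b1) * g[i]
            v[i] = b2 * v[i] + (1 - b2) * g[i] ** 2
            m_hat = m[i] / (1 - b1 ** step)
            v_hat = v[i] / (1 - b2 ** step)
            k[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return k

# Hypothetical target series for one leaf over 8 frames.
targets = [sum(c * basis(i, t, 8) for i, c in enumerate([2.0, 0.5, -0.25]))
           for t in range(8)]
before, _ = loss_and_grad([0.0, 0.0, 0.0], targets)
k_fit = adam_fit([0.0, 0.0, 0.0], targets)
after, _ = loss_and_grad(k_fit, targets)
```

Comparing `before` and `after` shows how far the optimization reduced the error on this toy problem.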
5. Real-Time Rendering and System Performance
At test time, FPO enables real-time free-viewpoint rendering of dynamic scenes. For a desired timestamp $t$ and view, rays are cast through the FPO structure. At each sample, density and color are reconstructed via IDFT sums from stored coefficients, and color is evaluated as a 9-term SH dot product followed by a sigmoid. Alpha compositing with early ray termination keeps the per-ray cost low. On an RTX 3090, rendering runs at approximately 100 fps.
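The per-ray loop with early termination can be sketched as follows. This is a simplified, single-channel illustration: each sample carries hypothetical temporal Fourier coefficients for density and for one degree-0 color coefficient, the basis alternates cosine and sine terms, negative reconstructed densities are clamped to zero, and the termination threshold is arbitrary. All of these are assumptions of the sketch rather than details confirmed by the source.

```python
import math

def fourier(coeffs, t, num_frames):
    """Truncated real Fourier series: cosine for even index, sine for odd index."""
    total = 0.0
    for i, k in enumerate(coeffs):
        if i % 2 == 0:
            total += k * math.cos(i * math.pi * t / num_frames)
        else:
            total += k * math.sin((i + 1) * math.pi * t / num_frames)
    return total

def render_ray(samples, t, num_frames, stop_threshold=1e-3):
    """Front-to-back alpha compositing with early ray termination.

    samples: list of (k_sigma, k_color, delta) tuples ordered near to far.
    Returns (composited value, number of samples actually evaluated).
    """
    transmittance, value, used = 1.0, 0.0, 0
    for k_sigma, k_color, delta in samples:
        used += 1
        sigma = max(0.0, fourier(k_sigma, t, num_frames))  # clamp is an assumption
        color = 1.0 / (1.0 + math.exp(-fourier(k_color, t, num_frames)))
        alpha = 1.0 - math.exp(-sigma * delta)
        value += transmittance * alpha * color
        transmittance *= 1.0 - alpha
        if transmittance < stop_threshold:
            break  # everything behind this sample is effectively occluded
    return value, used

# Empty space, then a dense surface, then a sample hidden behind it.
samples = [([0.0], [0.0], 1.0), ([50.0], [0.0], 1.0), ([50.0], [0.0], 1.0)]
value, used = render_ray(samples, t=0, num_frames=60)
```

The third sample is never evaluated because the transmittance has already dropped below the threshold, which is where the savings from early termination come from.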
Quantitative Performance
Key performance comparisons on dynamic multi-view sequences (60 frames, 60 views):
| Method | FPS | Training Time | Memory |
|---|---|---|---|
| NeRF | 0.03 | 2 days | |
| NeuralVolumes | 2.3 | 6 hours | |
| ST-NeRF | 0.04 | 12 hours | |
| iButter | 3.5 | 20 minutes | |
| FPO | 100 | 2 hours | ~7.3 GB |
The full memory footprint of the union-octree is approximately 7.3 GB. At 100 fps, FPO is more than an order of magnitude faster than every other method in the table (the fastest baseline, iButter, runs at 3.5 fps) while maintaining high visual fidelity.
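The speedup claims can be checked directly from the FPS column of the table. The short Python snippet below only restates that arithmetic; the dictionary simply copies the reported values.

```python
# FPS values copied from the comparison table above.
fps = {"NeRF": 0.03, "NeuralVolumes": 2.3, "ST-NeRF": 0.04,
       "iButter": 3.5, "FPO": 100.0}

# Speedup of FPO relative to each baseline.
speedup = {name: fps["FPO"] / value for name, value in fps.items() if name != "FPO"}

for name, ratio in speedup.items():
    print(f"{name}: {ratio:.0f}x")
```

The smallest ratio is against iButter and the largest against the original NeRF.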
Quantitative Image Quality (held-out, 5 real/5 synthetic sequences)
| Model | PSNR | SSIM | MAE | LPIPS |
|---|---|---|---|---|
| NeuralBody | 27.3 | 0.94 | 0.0123 | 0.037 |
| NeuralVolumes | 23.6 | 0.92 | 0.0251 | 0.088 |
| ST-NeRF | 30.6 | 0.95 | 0.0092 | 0.032 |
| iButter | 33.8 | 0.96 | 0.0054 | 0.030 |
| FPO | 35.2 | 0.991 | 0.0033 | 0.022 |
Qualitative observations include the accurate preservation of sharp details in fast non-rigid motion (e.g., hair, clothing folds), absence of temporal flicker, and efficient temporal compression due to the dominance of low-frequency content in the Fourier basis.
6. Context and Significance
Fourier PlenOctree is notable for its integration of volumetric neural representations, spatial acceleration structures, and temporal signal processing. Its design facilitates:
- Real-time rendering of unseen dynamic scenes in free-view video settings
- Compact storage of long time sequences via low-frequency Fourier coding in the time domain
- State-of-the-art image fidelity and temporal consistency
Relative to conventional NeRF-based approaches, FPO achieves roughly three orders of magnitude acceleration (100 fps versus 0.03 fps for NeRF in the comparison above) and substantially reduced memory overhead for dynamic scenes, making it applicable to interactive graphics, virtual/augmented reality, and volumetric video systems (Wang et al., 2022).
7. Connections to Related Work
The FPO architecture extends PlenOctree [Yu et al. 2021] via its temporal handling and builds upon generalizable NeRF techniques such as IBRNet for bootstrap fusion. It fits within the broader research thrust of neural scene representations for dynamic content, complementing methods like NeuralVolumes, NeuralBody, ST-NeRF, and iButter. Its synthesis of spatial adaptation, temporal basis compression, and explicit coefficient fine-tuning embodies a modular approach to real-time neural rendering for dynamic environments.