
3D Gaussian Splatting for Real-Time Radiance Field Rendering (2308.04079v1)

Published 8 Aug 2023 in cs.GR and cs.CV

Abstract: Radiance Field methods have recently revolutionized novel-view synthesis of scenes captured with multiple photos or videos. However, achieving high visual quality still requires neural networks that are costly to train and render, while recent faster methods inevitably trade off speed for quality. For unbounded and complete scenes (rather than isolated objects) and 1080p resolution rendering, no current method can achieve real-time display rates. We introduce three key elements that allow us to achieve state-of-the-art visual quality while maintaining competitive training times and importantly allow high-quality real-time (>= 30 fps) novel-view synthesis at 1080p resolution. First, starting from sparse points produced during camera calibration, we represent the scene with 3D Gaussians that preserve desirable properties of continuous volumetric radiance fields for scene optimization while avoiding unnecessary computation in empty space; Second, we perform interleaved optimization/density control of the 3D Gaussians, notably optimizing anisotropic covariance to achieve an accurate representation of the scene; Third, we develop a fast visibility-aware rendering algorithm that supports anisotropic splatting and both accelerates training and allows realtime rendering. We demonstrate state-of-the-art visual quality and real-time rendering on several established datasets.

Authors (4)
  1. Bernhard Kerbl (16 papers)
  2. Georgios Kopanas (8 papers)
  3. Thomas Leimkühler (16 papers)
  4. George Drettakis (18 papers)
Citations (2,202)

Summary

  • The paper introduces anisotropic 3D Gaussians to merge explicit and continuous scene representations, achieving state-of-the-art novel-view synthesis.
  • It employs interleaved optimization with adaptive density control and fast, visibility-aware GPU rasterization to reduce training time and computational cost.
  • The method enables real-time 1080p radiance field rendering with competitive visual quality relative to leading neural rendering techniques.

The paper "3D Gaussian Splatting for Real-Time Radiance Field Rendering" introduces a novel approach to achieve SOTA visual quality in novel-view synthesis while maintaining competitive training times and enabling real-time rendering at 1080p resolution. The method combines the advantages of both explicit and continuous scene representations by using 3D Gaussians optimized through a differentiable rendering pipeline.

The key elements of the approach are:

  • Representation of the scene using 3D Gaussians, which preserve the desirable properties of continuous volumetric radiance fields while avoiding unnecessary computation in empty space. The 3D Gaussians are defined by a position (mean), covariance matrix, opacity $\alpha$, and spherical harmonic (SH) coefficients representing color (a minimal parameter sketch follows this list).
  • Interleaved optimization and density control of the 3D Gaussians, including optimization of anisotropic covariance to achieve an accurate representation of the scene. Adaptive density control involves adding and removing 3D Gaussians during optimization based on positional gradients and opacity thresholds.
  • A fast visibility-aware rendering algorithm that supports anisotropic splatting, which accelerates training and enables real-time rendering. The tile-based rasterizer performs approximate $\alpha$-blending of anisotropic splats, respecting visibility order through fast sorting.
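
To make the parameter set above concrete, here is a minimal sketch of a per-Gaussian parameter container in PyTorch (field names are illustrative, not the paper's code; the official implementation stores these as optimizable tensors):

```python
import torch

class GaussianModel:
    """Hypothetical container for the per-Gaussian parameters described above."""
    def __init__(self, num_points: int, sh_degree: int = 3):
        num_sh = (sh_degree + 1) ** 2                          # SH coefficients per color channel
        self.means = torch.zeros(num_points, 3)                # 3D positions (means)
        self.scales = torch.ones(num_points, 3)                # per-axis extent of the ellipsoid
        self.rotations = torch.zeros(num_points, 4)            # unit quaternions
        self.rotations[:, 0] = 1.0                             # start from the identity rotation
        self.opacities = torch.full((num_points, 1), 0.1)      # opacity alpha per Gaussian
        self.sh_coeffs = torch.zeros(num_points, num_sh, 3)    # view-dependent color as SH
```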

The authors note that meshes and points are commonly used 3D scene representations because they are explicit. Recent NeRF methods, however, build on continuous scene representations, typically optimizing a Multi-Layer Perceptron (MLP) with volumetric ray-marching for novel-view synthesis of captured scenes. While continuous representations aid optimization, the stochastic sampling they require for rendering is costly and can introduce noise. The paper's approach combines the best of both worlds by using a 3D Gaussian representation that reaches SOTA visual quality with competitive training times.

The method starts with cameras calibrated with Structure-from-Motion (SfM) and initializes the set of 3D Gaussians with the sparse point cloud produced as part of the SfM process. The properties of the 3D Gaussians (3D position, opacity $\alpha$, anisotropic covariance, and SH coefficients) are optimized, interleaved with adaptive density control steps. The real-time rendering solution uses fast GPU sorting algorithms and tile-based rasterization.
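
A hedged sketch of this initialization step, assuming the GaussianModel-style fields above (the isotropic initial scale from nearest-neighbour distances and the SH conversion constant reflect common practice; exact details may differ from the released code):

```python
import torch

def init_from_sfm(points: torch.Tensor, colors: torch.Tensor):
    """points: (N, 3) SfM positions; colors: (N, 3) RGB in [0, 1]; assumes N >= 4."""
    n = points.shape[0]
    means = points.clone()
    # Isotropic initial scale from the mean distance to the 3 nearest neighbours
    # (brute-force cdist is fine for a sketch, too slow for large point clouds).
    dists = torch.cdist(points, points)
    knn = dists.topk(4, largest=False).values[:, 1:]          # drop the zero self-distance
    scales = knn.mean(dim=1, keepdim=True).repeat(1, 3)
    rotations = torch.zeros(n, 4); rotations[:, 0] = 1.0      # identity quaternions
    opacities = torch.full((n, 1), 0.1)
    sh_dc = (colors - 0.5) / 0.28209479177387814              # DC term of the SH basis
    return means, scales, rotations, opacities, sh_dc
```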

The main contributions of the paper are:

  • The introduction of anisotropic 3D Gaussians as a high-quality, unstructured representation of radiance fields.
  • An optimization method for 3D Gaussian properties, interleaved with adaptive density control, that creates high-quality representations of captured scenes.
  • A fast, differentiable, visibility-aware rendering approach for the GPU that supports anisotropic splatting and fast backpropagation to achieve high-quality novel-view synthesis.

The authors show that 3D Gaussians can be optimized from multi-view captures to achieve equal or better quality than previous implicit radiance field approaches. They also achieve training speeds and quality similar to the fastest methods and provide the first real-time rendering with high quality for novel-view synthesis.

The authors discuss traditional scene reconstruction and rendering methods, neural rendering and radiance fields, and point-based rendering techniques. They note that NeRFs are a continuous representation that implicitly encodes empty and occupied space; finding the relevant samples requires expensive stochastic sampling, which introduces noise and computational cost. Points, in contrast, are an unstructured, discrete representation flexible enough to allow creation, destruction, and displacement of geometry, offering flexibility comparable to that of NeRF.

The 3D Gaussians are defined by a full 3D covariance matrix $\Sigma$ centered at a point (mean) $\mu$: $G(x) = e^{-\frac{1}{2} x^{T} \Sigma^{-1} x}$. To project the 3D Gaussians to 2D for rendering, the viewing transformation $W$ is used to obtain the covariance matrix $\Sigma'$ in camera coordinates: $\Sigma' = J W \Sigma W^{T} J^{T}$, where $J$ is the Jacobian of the affine approximation of the projective transformation. The covariance matrix $\Sigma$ of a 3D Gaussian is analogous to describing the configuration of an ellipsoid. Given a scaling matrix $S$ and rotation matrix $R$, the corresponding $\Sigma$ is $\Sigma = R S S^{T} R^{T}$.
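
The following sketch spells out these two formulas in PyTorch for a single Gaussian (a simplified, per-Gaussian version for illustration; the real renderer evaluates this batched in CUDA, and the Jacobian $J$ is assumed precomputed for the current camera):

```python
import torch

def quat_to_rotmat(q: torch.Tensor) -> torch.Tensor:
    """Unit quaternion (w, x, y, z) -> 3x3 rotation matrix."""
    w, x, y, z = q.unbind()
    return torch.stack([
        torch.stack([1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)]),
        torch.stack([2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)]),
        torch.stack([2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)]),
    ])

def covariance_3d(scale: torch.Tensor, quat: torch.Tensor) -> torch.Tensor:
    """Sigma = R S S^T R^T, built from a (3,) scale vector and a unit quaternion."""
    R = quat_to_rotmat(quat)
    S = torch.diag(scale)
    return R @ S @ S.T @ R.T

def project_covariance(sigma: torch.Tensor, W: torch.Tensor, J: torch.Tensor) -> torch.Tensor:
    """Sigma' = J W Sigma W^T J^T; W is the viewing transform, J the local affine Jacobian."""
    return J @ W @ sigma @ W.T @ J.T
```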

The optimization process involves successive iterations of rendering and comparing the resulting image to the training views in the captured dataset. The loss function is an $\mathcal{L}_1$ term combined with a D-SSIM term: $\mathcal{L} = (1 - \lambda)\,\mathcal{L}_1 + \lambda\,\mathcal{L}_{\textrm{D-SSIM}}$, where $\lambda = 0.2$ in all tests.
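
A minimal sketch of this loss, assuming an external SSIM implementation such as pytorch_msssim (the paper's released code ships its own SSIM/D-SSIM routine):

```python
import torch
from pytorch_msssim import ssim  # any differentiable SSIM implementation works here

def training_loss(rendered: torch.Tensor, target: torch.Tensor, lam: float = 0.2) -> torch.Tensor:
    """rendered, target: (1, 3, H, W) images in [0, 1]; lam is the paper's lambda = 0.2."""
    l1 = (rendered - target).abs().mean()
    d_ssim = 1.0 - ssim(rendered, target, data_range=1.0)
    return (1.0 - lam) * l1 + lam * d_ssim
```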

Adaptive density control needs to populate empty areas. The method focuses on regions with missing geometric features (under-reconstruction) as well as regions where Gaussians cover large areas of the scene (which often corresponds to over-reconstruction); both exhibit large view-space positional gradients. Small Gaussians in under-reconstructed regions are cloned by creating a copy of the same size and moving it in the direction of the positional gradient. Large Gaussians in regions with high variance are split into smaller Gaussians: each such Gaussian is replaced by two new ones whose scale is divided by a factor of $\phi = 1.6$, with positions initialized by using the original 3D Gaussian as a PDF for sampling.
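
A hedged sketch of the clone/split rule (threshold values and names are illustrative, and the split ignores the Gaussian's rotation when sampling new centers; the actual implementation also prunes low-opacity Gaussians and periodically resets opacities):

```python
import torch

def densify(means, scales, grads, grad_threshold=0.0002, size_threshold=0.01, phi=1.6):
    """grads: accumulated view-space positional gradient magnitude per Gaussian."""
    selected = grads >= grad_threshold
    small = selected & (scales.max(dim=1).values <= size_threshold)
    large = selected & (scales.max(dim=1).values > size_threshold)

    # Clone: duplicate small Gaussians; the optimizer then moves the copy along the gradient.
    cloned_means, cloned_scales = means[small], scales[small]

    # Split: replace each large Gaussian with two new ones sampled from it as a PDF,
    # with scales divided by phi = 1.6 (axis-aligned sampling for simplicity).
    std = scales[large]
    offsets = torch.randn_like(std) * std
    split_means = torch.cat([means[large] + offsets, means[large] - offsets], dim=0)
    split_scales = torch.cat([std, std], dim=0) / phi

    new_means = torch.cat([means, cloned_means, split_means], dim=0)
    new_scales = torch.cat([scales, cloned_scales, split_scales], dim=0)
    return new_means, new_scales
```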

The tile-based rasterizer for Gaussian splats splits the screen into $16 \times 16$ tiles and culls 3D Gaussians against the view frustum and each tile. Gaussians whose 99% confidence interval intersects the view frustum are kept. Each Gaussian is instantiated once per tile it overlaps, and each instance is assigned a key that combines view-space depth and tile ID. The instances are then sorted by these keys using a fast GPU radix sort. After sorting, a per-tile list is produced by identifying the first and last depth-sorted entry that splats to a given tile. During rasterization, saturation of $\alpha$ is the only stopping criterion. During the backward pass, the full sequence of blended points per pixel from the forward pass is recovered.
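
One way to picture the sorting key is sketched below (an illustrative bit layout, not necessarily the exact one used in the CUDA rasterizer): the tile ID occupies the high bits and the view-space depth the low bits, so a single sort groups splats by tile and orders them front to back within each tile. Here, torch.argsort stands in for the GPU radix sort.

```python
import torch

def build_sort_keys(tile_ids: torch.Tensor, depths: torch.Tensor) -> torch.Tensor:
    """tile_ids: (M,) int64 tile index per Gaussian instance; depths: (M,) float32 view-space depth.

    Assumes depths are non-negative, so their IEEE-754 bit patterns sort in the
    same order as the float values.
    """
    depth_bits = depths.contiguous().view(torch.int32).to(torch.int64) & 0xFFFFFFFF
    return (tile_ids.to(torch.int64) << 32) | depth_bits

# keys = build_sort_keys(tile_ids, depths)
# order = torch.argsort(keys)   # per-tile, front-to-back ordering of splat instances
```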

The method was implemented in Python using the PyTorch framework and custom CUDA kernels for rasterization. The source code and data are available at \url{https://repo-sam.inria.fr/fungraph/3d-gaussian-splatting/}. The algorithm was tested on a total of 13 real scenes taken from previously published datasets and the synthetic Blender dataset. The scenes have very different capture styles and cover both bounded indoor scenes and large unbounded outdoor environments.

Training times vary across datasets. The fully converged model achieves quality on par with, and sometimes slightly better than, the SOTA Mip-NeRF360 method.

Ablation studies were performed to isolate the different contributions and algorithmic choices. The aspects of the algorithm tested were: initialization from SfM, densification strategies, anisotropic covariance, allowing an unlimited number of splats to have gradients, and the use of SH. The results show that splitting big Gaussians is important to allow good reconstruction of the background. Cloning the small Gaussians instead of splitting them allows for a better and faster convergence, especially when thin structures appear in the scene. The use of anisotropic covariance enables modeling of fine structures and has a significant impact on visual quality. The use of SH improves the overall PSNR scores since they compensate for the view-dependent effects.

The method has limitations. Regions where the scene is not well observed exhibit artifacts, and the method can create elongated or splotchy Gaussians. Occasional popping artifacts occur when the optimization creates large Gaussians. In addition, the method currently applies no regularization to the optimization, and memory consumption is significantly higher than for NeRF-based solutions.

The authors conclude by stating that they have presented the first approach that truly allows real-time, high-quality radiance field rendering, in a wide variety of scenes and capture styles, while requiring training times competitive with the fastest previous methods.
