Splatter Image: Ultra-Fast Single-View 3D Reconstruction (2312.13150v2)

Published 20 Dec 2023 in cs.CV

Abstract: We introduce the Splatter Image, an ultra-efficient approach for monocular 3D object reconstruction. Splatter Image is based on Gaussian Splatting, which allows fast and high-quality reconstruction of 3D scenes from multiple images. We apply Gaussian Splatting to monocular reconstruction by learning a neural network that, at test time, performs reconstruction in a feed-forward manner, at 38 FPS. Our main innovation is the surprisingly straightforward design of this network, which, using 2D operators, maps the input image to one 3D Gaussian per pixel. The resulting set of Gaussians thus has the form of an image, the Splatter Image. We further extend the method to take several images as input via cross-view attention. Owing to the speed of the renderer (588 FPS), we use a single GPU for training while generating entire images at each iteration to optimize perceptual metrics like LPIPS. On several synthetic, real, multi-category and large-scale benchmark datasets, we achieve better results in terms of PSNR, LPIPS, and other metrics while training and evaluating much faster than prior works. Code, models, demo and more results are available at https://szymanowiczs.github.io/splatter-image.

Citations (108)

Summary

  • The paper introduces Splatter Image, which uses a U-Net to predict 3D Gaussians for ultra-fast, single-view 3D reconstruction.
  • It employs depth prediction, view-dependent color modeling, and efficient Gaussian splatting to achieve state-of-the-art quality at high speeds.
  • The method extends to multi-view scenarios with minimal resources, leveraging image-level perceptual losses for enhanced real-world applicability.

The paper "Splatter Image: Ultra-Fast Single-View 3D Reconstruction" (2312.13150) introduces a novel approach to monocular (single-view) 3D object reconstruction that is highly efficient at both training and inference time. The core idea is to adapt the high-quality, fast rendering capabilities of Gaussian Splatting (2312.13150) for learning-based 3D reconstruction from one or a few images.

The central contribution is the "Splatter Image," a representation where a neural network predicts a 3D Gaussian for each pixel of the input image. This prediction is achieved using a standard 2D image-to-image network, specifically a U-Net architecture. The network outputs a tensor (the Splatter Image) where each pixel location stores the parameters (opacity, 3D position, 3D shape/covariance, and color) of a corresponding 3D Gaussian. This structured output allows the network to leverage efficient 2D convolutions and process the input image in a feed-forward manner.
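A minimal PyTorch sketch of this per-pixel prediction idea is given below. It is not the paper's exact architecture: the input features stand in for a U-Net backbone, and the channel counts, activations, and use of a single RGB color (rather than full spherical-harmonic coefficients) are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SplatterHead(nn.Module):
    """Maps backbone features to per-pixel 3D Gaussian parameters (a 'Splatter Image')."""
    def __init__(self, in_ch):
        super().__init__()
        # 1 opacity + 1 depth + 3 offset + 3 scale + 4 rotation (quaternion) + 3 color
        self.out = nn.Conv2d(in_ch, 15, kernel_size=1)

    def forward(self, feats):                       # feats: (B, C, H, W) from a U-Net backbone
        x = self.out(feats)
        opacity = torch.sigmoid(x[:, 0:1])          # in (0, 1)
        depth   = F.softplus(x[:, 1:2])             # positive depth along the pixel ray
        offset  = x[:, 2:5]                         # free 3D offset from the point on the ray
        scale   = torch.exp(x[:, 5:8])              # positive axis scales of the covariance
        rot     = F.normalize(x[:, 8:12], dim=1)    # unit quaternion for covariance orientation
        color   = torch.sigmoid(x[:, 12:15])        # RGB only; the paper uses spherical harmonics
        return opacity, depth, offset, scale, rot, color
```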

The position of each predicted Gaussian is parameterized relative to the input camera view. For each pixel, corresponding to a ray from the camera center, the network predicts a depth $d$ along this ray and a 3D offset $\Delta = (\Delta_x, \Delta_y, \Delta_z)$. The Gaussian's mean (center) is then computed as $\mu = (u_1 d + \Delta_x,\ u_2 d + \Delta_y,\ d + \Delta_z)^\top$, where $(u_1, u_2, 1)^\top$ is the pixel's homogeneous coordinate in camera space. The network also predicts the Gaussian's opacity $\sigma$ (via a sigmoid), shape $\Sigma$ (via scale and rotation parameters, following the standard Gaussian Splatting formulation), and view-dependent color using spherical harmonics.
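A rough illustration of this parameterization follows, assuming a pinhole camera with known intrinsics; the function and variable names are mine, not the paper's.

```python
import torch

def gaussian_means(depth, offset, K):
    """depth: (B, 1, H, W), offset: (B, 3, H, W), K: (3, 3) camera intrinsics.
    Returns per-pixel Gaussian centers in camera coordinates, shape (B, 3, H, W)."""
    B, _, H, W = depth.shape
    v, u = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([u, v, torch.ones_like(u)], dim=0).float()   # homogeneous pixel coords (3, H, W)
    rays = torch.einsum("ij,jhw->ihw", torch.inverse(K), pix)      # (u1, u2, 1) ray directions
    rays = rays.unsqueeze(0).expand(B, -1, -1, -1)                 # (B, 3, H, W)
    # mu = (u1*d + dx, u2*d + dy, d + dz)
    return rays * depth + offset
```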

A key insight is that even though the network only sees one view of the object, it can learn through training to distribute the predicted Gaussians using the depth and offset parameters to reconstruct the full 360° shape and appearance. Some Gaussians reconstruct the visible parts, while others are placed behind the object to reconstruct occluded regions, effectively encoding prior knowledge learned during training.

The learning formulation is straightforward: train the image-to-image network to predict the Splatter Image from a source view such that rendering the resulting 3D Gaussian mixture from various target viewpoints matches the ground truth target images. The loss function primarily consists of an L2 photometric loss between the rendered and target images. A significant advantage of the proposed method's speed and efficiency (especially the fast Gaussian Splatting renderer) is the ability to render full images at each training iteration, enabling the effective use of image-level perceptual losses like LPIPS [zhang2018perceptual] in addition to per-pixel losses. Generic regularizations are also applied to the Gaussian parameters to maintain stability and prevent pathological shapes.
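A hedged sketch of such a training objective is shown below, using the off-the-shelf lpips package for the perceptual term; the loss weight and the paper's additional regularizers on the Gaussian parameters are not reproduced here.

```python
import torch
import lpips  # pip install lpips

perceptual = lpips.LPIPS(net="vgg")  # image-level perceptual loss (Zhang et al., 2018)

def reconstruction_loss(rendered, target, lpips_weight=0.01):
    """rendered, target: (B, 3, H, W) images in [0, 1]; lpips_weight is illustrative."""
    l2 = torch.mean((rendered - target) ** 2)          # per-pixel photometric term
    # LPIPS expects inputs scaled to [-1, 1]
    perc = perceptual(rendered * 2 - 1, target * 2 - 1).mean()
    return l2 + lpips_weight * perc
```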

The method is extended to handle multiple input views by applying the same network to each view independently. The predicted Gaussian mixtures from different views are then warped into a common coordinate frame using their relative camera poses and combined by taking their union. For view-dependent color (using spherical harmonics), the coefficients are transformed based on the relative rotation between views. To facilitate information exchange between views during the prediction process, the network is conditioned on the relative camera pose (via FiLM layers) and optionally augmented with cross-view attention layers at lower resolutions of the U-Net.
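A simplified sketch of merging per-view predictions into one Gaussian mixture is given below. Only the means are warped here; transforming the covariance orientations and rotating the spherical-harmonic coefficients, as the paper describes, is omitted, and the pose convention is an assumption.

```python
import torch

def merge_views(means_per_view, other_params_per_view, poses):
    """means_per_view: list of (N_i, 3) Gaussian centers, each in its source camera frame.
    poses: list of (4, 4) source-camera-to-reference transforms (assumed convention).
    Returns the union of all Gaussians expressed in the reference frame."""
    merged_means, merged_other = [], []
    for means, other, T in zip(means_per_view, other_params_per_view, poses):
        R, t = T[:3, :3], T[:3, 3]
        merged_means.append(means @ R.T + t)   # warp centers into the shared frame
        merged_other.append(other)             # opacity/scale/etc. kept as-is in this sketch
    return torch.cat(merged_means), torch.cat(merged_other)
```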

The paper demonstrates the effectiveness of Splatter Image through extensive experiments on various datasets: synthetic (ShapeNet-SRN Cars/Chairs, multi-category ShapeNet), real (CO3D Hydrants/Teddybears), and large-scale/multi-category (Objaverse-LVIS training, Google Scanned Objects testing).

Key experimental findings include:

  • Reconstruction Quality: Splatter Image achieves state-of-the-art or competitive results across standard metrics (PSNR, SSIM, LPIPS) on these benchmarks, often outperforming significantly slower or more resource-intensive methods like PixelNeRF, VisionNeRF, and even OpenLRM/LRM [he2023openlrm, hong24lrm]. The use of image-level losses is highlighted as contributing to improved perceptual quality.
  • Efficiency: A major strength is its speed. At inference time, it predicts the Splatter Image at 38 FPS on a single GPU, and rendering new views at 128x128 resolution (588 FPS) is hundreds to thousands of times faster than implicit methods like PixelNeRF and VisionNeRF. For training, it requires substantially fewer resources (e.g., it is trainable on a single A6000 GPU in ~7 days for smaller datasets, or on two A6000 GPUs in 3.5 days for large datasets like Objaverse), whereas baselines may require dozens or hundreds of GPUs.
  • Flexibility: It can handle relative camera poses at inference time, unlike methods requiring absolute canonical poses, making it more practical for real-world applications where precise object orientation might be unknown.
  • Multi-view Capabilities: The multi-view extension further improves reconstruction quality, outperforming single-view approaches and some multi-view baselines, while still being efficient.
  • Ablations: Studies confirm the importance of the Splatter Image structure (outperforming unstructured outputs), the depth prediction, and view-dependent color modeling for achieving high-quality results. The ability to use perceptual losses like LPIPS is shown to be crucial for visual fidelity.

In summary, Splatter Image offers a practical and highly efficient approach to 3D object reconstruction from minimal inputs by innovatively combining a standard 2D image-to-image network structure with the power and speed of Gaussian Splatting. Its ability to train and run on modest hardware while achieving competitive results makes it a valuable contribution for applications requiring fast 3D reconstruction.