Fillerbuster: Multi-View Scene Completion for Casual Captures (2502.05175v1)

Published 7 Feb 2025 in cs.CV and cs.GR

Abstract: We present Fillerbuster, a method that completes unknown regions of a 3D scene by utilizing a novel large-scale multi-view latent diffusion transformer. Casual captures are often sparse and miss surrounding content behind objects or above the scene. Existing methods are not suitable for handling this challenge as they focus on making the known pixels look good with sparse-view priors, or on creating the missing sides of objects from just one or two photos. In reality, we often have hundreds of input frames and want to complete areas that are missing and unobserved from the input frames. Additionally, the images often do not have known camera parameters. Our solution is to train a generative model that can consume a large context of input frames while generating unknown target views and recovering image poses when desired. We show results where we complete partial captures on two existing datasets. We also present an uncalibrated scene completion task where our unified model predicts both poses and creates new content. Our model is the first to predict many images and poses together for scene completion.

Summary

  • The paper introduces Fillerbuster, a novel latent diffusion transformer trained to complete multi-view 3D scenes from sparse, casual captures by generating unknown views conditioned on known images and poses.
  • Fillerbuster unifies content generation and pose recovery, enabling uncalibrated scene completion tasks where the model predicts plausible camera poses while creating missing scene parts.
  • Evaluated on existing datasets, Fillerbuster significantly improves scene completion quality over baselines, demonstrating higher PSNR and SSIM and lower LPIPS.

The paper "Fillerbuster: Multi-View Scene Completion for Casual Captures" introduces a novel method, Fillerbuster, for completing unknown regions in 3D scenes captured casually using a large-scale multi-view latent diffusion transformer.

The key contributions of this paper are:

  • A generative model trained to consume a large context of input frames while generating unknown target views and recovering image poses when desired; it is the first to predict many images and poses together for scene completion.
  • The model leverages a latent diffusion transformer with a flow-matching loss for inpainting in latent space, conditioned on known images and camera poses represented as raymaps (a sketch of the flow-matching objective follows this list).
  • The model is evaluated by completing partial captures on existing datasets and by introducing an uncalibrated scene completion task where the unified model predicts both poses and creates new content.
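
To make the flow-matching objective concrete, here is a minimal sketch of one training step on latent sequences. The names (`model`, `z_clean`, `mask`), the interpolation convention, and the masked-loss scheme are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def flow_matching_step(model, z_clean, mask):
    """One hedged flow-matching training step on latents.

    z_clean: known latents, shape (B, N, C, h, w)
    mask:    1 where latents are known, 0 where they must be generated
    """
    noise = torch.randn_like(z_clean)
    t = torch.rand(z_clean.shape[0], device=z_clean.device)  # one timestep per batch element
    t_ = t.view(-1, 1, 1, 1, 1)

    # Linear interpolation between pure noise (t=0) and data (t=1).
    z_t = (1.0 - t_) * noise + t_ * z_clean
    # Known regions stay clean so the model learns to inpaint only the unknown ones.
    z_t = mask * z_clean + (1.0 - mask) * z_t

    # The network regresses the velocity field pointing from noise toward data.
    v_target = z_clean - noise
    v_pred = model(z_t, t, mask)

    # Only unknown regions contribute to the loss.
    loss = (((v_pred - v_target) ** 2) * (1.0 - mask)).mean()
    return loss
```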

The challenges addressed include:

  • Casual captures are often sparse and miss surrounding content, making existing methods unsuitable as they focus on refining known pixels with sparse-view priors or completing object sides from limited photos.
  • The need to complete regions that are missing or unobserved across hundreds of input frames, often without known camera parameters.

The technical approach involves:

  • A latent diffusion transformer model is proposed that denoises multiple input images and calibrated camera poses with masks indicating known and unknown regions.
  • The model handles a sequence of $N$ elements with images $I_i \in \mathbb{R}^{H \times W \times 3}$ (where $H$ is height and $W$ is width), raymaps $R_i \in \mathbb{R}^{H \times W \times 6}$ encoding a ray origin and direction per pixel, valid image masks $\mathcal{M}^\text{I}_i \in \mathbb{R}^{H \times W}$, and valid ray masks $\mathcal{M}^\text{R}_i \in \mathbb{R}^{H \times W}$. The objective is to predict $p(I, R \mid I \odot \mathcal{M}^\text{I}, R \odot \mathcal{M}^\text{R})$; a raymap-construction sketch follows this list.
    • $I_i$: images
    • $H$: image height
    • $W$: image width
    • $R_i$: raymaps
    • $\mathcal{M}^\text{I}_i$: valid image masks
    • $\mathcal{M}^\text{R}_i$: valid ray masks
  • The architecture uses a DiT trained with the flow-matching objective. Separate VAEs are used for images and for poses encoded as raymaps, with $\mathcal{E}^\text{I}$ denoting the image encoder and $\mathcal{E}^\text{R}$ the raymap encoder.
    • $\mathcal{E}^\text{I}$: image encoder
    • $\mathcal{E}^\text{R}$: raymap encoder
  • The sequence element $s_{i,t}$ is prepared by concatenating the compressed latent image $\tilde{z}^\text{I}_{i,t}$ with the compressed latent raymap $\tilde{z}^\text{R}_{i,t}$, where $\mathcal{D}$ denotes a downscaling operation used in forming the compressed latents; a sequence-preparation sketch follows this list.
    • $s_{i,t}$: sequence element
    • $\tilde{z}^\text{I}_{i,t}$: compressed latent image
    • $\mathcal{D}$: downscaling operation
    • $\tilde{z}^\text{R}_{i,t}$: compressed latent raymap
  • Two forms of positional embeddings are used: 2D layout embeddings and index embeddings, to handle varying sequence lengths.
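
As a concrete illustration of the raymap representation, the sketch below builds an $H \times W \times 6$ raymap (per-pixel ray origin and direction) from a pinhole camera. The intrinsics layout and camera convention are assumptions for illustration, not necessarily the paper's exact parameterization.

```python
import torch

def make_raymap(K, c2w, H, W):
    """Build an (H, W, 6) raymap of per-pixel ray origins and directions.

    K:   (3, 3) pinhole intrinsics
    c2w: (4, 4) camera-to-world transform
    """
    # Pixel-center coordinates.
    ys, xs = torch.meshgrid(
        torch.arange(H, dtype=torch.float32) + 0.5,
        torch.arange(W, dtype=torch.float32) + 0.5,
        indexing="ij",
    )
    # Unproject to camera-space directions (OpenCV-style: +z forward).
    dirs_cam = torch.stack(
        [(xs - K[0, 2]) / K[0, 0], (ys - K[1, 2]) / K[1, 1], torch.ones_like(xs)],
        dim=-1,
    )
    # Rotate into world space and normalize.
    dirs_world = dirs_cam @ c2w[:3, :3].T
    dirs_world = dirs_world / dirs_world.norm(dim=-1, keepdim=True)
    # Every pixel shares the camera center as its ray origin.
    origins = c2w[:3, 3].expand(H, W, 3)
    return torch.cat([origins, dirs_world], dim=-1)  # (H, W, 6)
```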

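A hedged sketch of how the per-frame sequence elements might be assembled: masks are applied before encoding (giving $I \odot \mathcal{M}^\text{I}$ and $R \odot \mathcal{M}^\text{R}$), the two compressed latents are concatenated channel-wise, and an index embedding distinguishes frames so the transformer can handle varying sequence lengths. The encoder callables (`encode_img`, `encode_ray`, `downscale`, `index_embed`) are hypothetical interfaces, not the paper's API.

```python
import torch

def prepare_sequence(images, raymaps, mask_img, mask_ray,
                     encode_img, encode_ray, downscale, index_embed):
    """Assemble per-frame elements s_i from masked image and raymap latents.

    images:  (N, 3, H, W), raymaps: (N, 6, H, W)
    mask_*:  (N, 1, H, W) validity masks (1 = known)
    All callables are hypothetical stand-ins for the paper's VAEs/embeddings.
    """
    elements = []
    for i in range(images.shape[0]):
        # Zero out unknown pixels/rays before encoding.
        z_img = encode_img(images[i] * mask_img[i])
        z_ray = downscale(encode_ray(raymaps[i] * mask_ray[i]))
        # Channel-wise concatenation of the two compressed latents.
        s_i = torch.cat([z_img, z_ray], dim=0)
        # An index embedding marks each frame's position in the sequence.
        elements.append(s_i + index_embed(i))
    return torch.stack(elements)  # (N, C, h, w)
```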
The results include:

  • Demonstration of completing casual captures from the Nerfbusters dataset, showing improved consistency compared to baselines like 3DGS and a CAT3D-sized conditioning model.
  • Introduction of the "uncalibrated scene completion" task, generating fly-throughs from sparse unposed photos, predicting plausible camera poses, and completing missing content.
  • Outperforms NeRFiller on the NeRFiller dataset in quality and consistency, as measured by PSNR, SSIM, and LPIPS: PSNR increases from 25.57 to 29.60, SSIM from 0.89 to 0.92, and LPIPS decreases from 0.182 to 0.096 (see the metrics sketch after this list).
    • PSNR: Peak Signal-to-Noise Ratio
    • SSIM: Structural Similarity Index Measure
    • LPIPS: Learned Perceptual Image Patch Similarity
  • Ablation studies validate the importance of index embeddings and pose prediction for model performance.
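
For reference, the three reported metrics can be computed with standard libraries; this is a generic sketch using `scikit-image` and the `lpips` package, not the paper's evaluation code.

```python
import torch
import lpips  # pip install lpips
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def eval_pair(pred, gt):
    """pred, gt: float numpy arrays in [0, 1], shape (H, W, 3)."""
    psnr = peak_signal_noise_ratio(gt, pred, data_range=1.0)
    ssim = structural_similarity(gt, pred, channel_axis=-1, data_range=1.0)

    # LPIPS expects NCHW tensors scaled to [-1, 1].
    to_t = lambda x: torch.from_numpy(x).permute(2, 0, 1)[None].float() * 2 - 1
    lpips_fn = lpips.LPIPS(net="alex")
    lpips_val = lpips_fn(to_t(pred), to_t(gt)).item()
    return psnr, ssim, lpips_val
```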

The method improves casual-capture completion, uncalibrated scene completion, and multi-view inpainting. The unified model's ability to jointly complete unobserved content and recover poses is valuable for casual scene completion.