VicaSplat: A Single Run is All You Need for 3D Gaussian Splatting and Camera Estimation from Unposed Video Frames

Published 13 Mar 2025 in cs.CV | (2503.10286v1)

Abstract: We present VicaSplat, a novel framework for joint 3D Gaussians reconstruction and camera pose estimation from a sequence of unposed video frames, which is a critical yet underexplored task in real-world 3D applications. The core of our method lies in a novel transformer-based network architecture. In particular, our model starts with an image encoder that maps each image to a list of visual tokens. All visual tokens are concatenated with additional inserted learnable camera tokens. The obtained tokens then fully communicate with each other within a tailored transformer decoder. The camera tokens causally aggregate features from visual tokens of different views, and further modulate them frame-wisely to inject view-dependent features. 3D Gaussian splats and camera pose parameters can then be estimated via different prediction heads. Experiments show that VicaSplat surpasses baseline methods for multi-view inputs, and achieves comparable performance to prior two-view approaches. Remarkably, VicaSplat also demonstrates exceptional cross-dataset generalization capability on the ScanNet benchmark, achieving superior performance without any fine-tuning. Project page: https://lizhiqi49.github.io/VicaSplat.


Authors (5)

Summary

Overview of VicaSplat: Efficient 3D Gaussian Splatting from Unposed Video Frames

The paper "VicaSplat: A Single Run is All You Need for 3D Gaussian Splatting and Camera Estimation from Unposed Video Frames" presents a framework that jointly performs 3D scene reconstruction and camera pose estimation from a sequence of unposed video frames. The authors describe this task as critical yet underexplored. The goal is to remove the dependency on known camera parameters, which most existing scene reconstruction methods require as input.

Model Architecture and Approach

The crux of VicaSplat lies in its transformer-based architecture, specifically tailored for processing unposed video frames to predict 3D Gaussian splats and precise camera poses. The model is divided into an encoder and a decoder. The encoder transforms video frames into a collection of visual tokens, supplemented by learnable camera tokens. These tokens are then jointly processed by a specialized transformer decoder featuring a video-camera attention mechanism, cross-neighbor attention, and frame-wise modulation.

Learnable Camera Tokens: These tokens capture camera-specific features and are vital for regressing camera extrinsic parameters.
Video-Camera Attention: This mechanism facilitates the interaction between visual and camera tokens, crucial for capturing view-dependent features.
Cross-Neighbor Attention: This addition refines the view-consistency across frames, important for aligning visual tokens from different images.
Frame-wise Modulation: Drawing inspiration from conditional layer modulation, this step uses each frame's camera token to inject view-dependent features into that frame's visual tokens, improving the consistency of the predictions.
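
The causal aggregation and frame-wise modulation described above can be illustrated with a minimal sketch. It is not the authors' implementation. The function names, the frame-by-frame token layout, and the normalize-then-scale-and-shift form of the modulation are all assumptions made for illustration. In a real model the scale and shift vectors would be predicted from each frame's camera token by a learned projection; here they are passed in directly.

```python
import math

def causal_camera_mask(num_frames, tokens_per_frame):
    """Hypothetical attention mask: mask[i][j] is True when camera token i
    may attend to visual token j.

    Visual tokens are assumed to be laid out frame by frame, so token j
    belongs to frame j // tokens_per_frame. Camera token i only sees
    frames 0..i, which is one way to read "causal" aggregation.
    """
    total = num_frames * tokens_per_frame
    return [[(j // tokens_per_frame) <= i for j in range(total)]
            for i in range(num_frames)]

def layer_norm(x, eps=1e-6):
    """Normalize one token to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def modulate_frame(visual_tokens, scale, shift):
    """Apply one frame's (scale, shift) pair to every visual token of that frame.

    Assumed form: normalized_token * (1 + scale) + shift, element-wise.
    """
    out = []
    for tok in visual_tokens:
        normed = layer_norm(tok)
        out.append([n * (1.0 + s) + b for n, s, b in zip(normed, scale, shift)])
    return out

# Three frames with two visual tokens each.
mask = causal_camera_mask(num_frames=3, tokens_per_frame=2)

# Two visual tokens of one frame, each with three channels.
tokens = [[1.0, 2.0, 3.0], [0.0, 0.0, 6.0]]
identity = modulate_frame(tokens, scale=[0.0] * 3, shift=[0.0] * 3)
shifted = modulate_frame(tokens, scale=[0.0] * 3, shift=[1.0] * 3)
```

With zero scale and shift the modulation reduces to plain layer normalization, so a frame whose camera token carries no information leaves its visual tokens unchanged apart from the normalization.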

Camera Pose Regression

The paper parameterizes camera poses as dual quaternions, a compact representation that couples rotation and translation in a single algebraic object and thereby simplifies the regression. It also proposes a novel alignment loss, based on dual-quaternion algebra, that aligns the predicted poses with the ground truth.
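
To make the dual-quaternion parameterization concrete, the sketch below encodes a rigid pose using the standard construction: the real part is the unit rotation quaternion q_r, and the dual part is 0.5 * t * q_r, where t is the translation written as a pure quaternion. This is textbook dual-quaternion algebra, not code from the paper. The function names and the (w, x, y, z) component order are assumptions, and the paper's alignment loss is not reproduced here.

```python
import math

def quat_mul(a, b):
    """Hamilton product of two quaternions given as (w, x, y, z)."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return (
        aw * bw - ax * bx - ay * by - az * bz,
        aw * bx + ax * bw + ay * bz - az * by,
        aw * by - ax * bz + ay * bw + az * bx,
        aw * bz + ax * by - ay * bx + az * bw,
    )

def quat_conj(q):
    """Conjugate of a quaternion (inverse for unit quaternions)."""
    w, x, y, z = q
    return (w, -x, -y, -z)

def pose_to_dual_quat(q_rot, t):
    """Encode a rigid pose as (real, dual) with dual = 0.5 * t * q_rot.

    q_rot must be a unit quaternion; t is a 3-vector translation.
    """
    t_quat = (0.0, t[0], t[1], t[2])
    dual = tuple(0.5 * c for c in quat_mul(t_quat, q_rot))
    return q_rot, dual

def dual_quat_to_translation(real, dual):
    """Recover the translation as the vector part of 2 * dual * conj(real)."""
    _, x, y, z = quat_mul(dual, quat_conj(real))
    return (2.0 * x, 2.0 * y, 2.0 * z)

# Example pose: 90-degree rotation about the z-axis, translation (1, 2, 3).
half_angle = math.pi / 4
q_rot = (math.cos(half_angle), 0.0, 0.0, math.sin(half_angle))
real, dual = pose_to_dual_quat(q_rot, (1.0, 2.0, 3.0))
t_back = dual_quat_to_translation(real, dual)
```

The eight numbers in (real, dual) carry the full six-degree-of-freedom pose, subject to two constraints: the real part has unit norm, and the real and dual parts are orthogonal as 4-vectors. A network regressing this representation therefore predicts rotation and translation together rather than through separate heads.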

Training Strategy

VicaSplat employs a progressive training strategy, commencing with base geometric understanding before scaling up to multi-view synthesis and camera pose optimization. This strategy is pivotal for efficient learning from sparse data. Additionally, knowledge distillation from pre-trained point cloud models aids in reducing computational burden while ensuring robust geometry predictions.

Results and Implications

Experimental results indicate that VicaSplat surpasses baseline methods on multi-view inputs and performs comparably to prior two-view approaches. It is particularly strong on the ScanNet benchmark, where it achieves superior performance without any fine-tuning, demonstrating cross-dataset generalization. These results make VicaSplat a promising approach to scene reconstruction and view synthesis from a single forward pass, with potential applications in augmented reality, robotics, and cinematic rendering, where speed and accuracy are critical.

The method integrates 3D Gaussian splatting with pose estimation in a single network, so no separate camera calibration step is needed. It may encourage further work on transformer-based architectures for 3D applications, especially where time and computational resources are limited.

Future research could investigate extending this framework to dynamic scenes or enhancing it with additional sensory input for even more robust environment reconstructions.