- The paper introduces object-level two-view bundle adjustment to jointly estimate camera pose and object motion from minimal, unposed image data.
- It presents a SE(3) field-driven Gaussian training method that models discrete motion transformations for enhanced temporal consistency.
- Experimental results on KITTI and Kubric datasets demonstrate significant improvements in PSNR, SSIM, and LPIPS over state-of-the-art models.
Overview of DynSUP: Dynamic Gaussian Splatting from An Unposed Image Pair
The paper "DynSUP: Dynamic Gaussian Splatting from An Unposed Image Pair" contributes to the field of novel view synthesis under challenging conditions, specifically targeting dynamic scenes where only two images with unknown camera poses are provided. Traditional methods, such as NeRF and 3D-GS, generally presuppose static environments, comprehensive datasets with known camera parameters, and dense views. The paper defies these constraints with its DynSUP framework, thereby extending the application of novel view synthesis to more flexible and realistic scenarios.
Technical Contributions
The authors introduce two primary technical innovations:
- Object-Level Two-View Bundle Adjustment: The method addresses dynamic scenes by segmenting them into piecewise rigid structures, then jointly estimating camera pose and per-object motion by minimizing a loss that combines reprojection error with depth regularization. This extends classical bundle adjustment, traditionally reserved for static scenes, to dynamic environments and makes it applicable even with minimal data (a sketch follows this list).
- SE(3) Field-Driven Gaussian Training Method: By associating individual Gaussians with discrete SE(3) transformations, the method models the motion of dynamic elements at a fine granularity. The transformations, initialized from the object-level segmentation, allow the Gaussian splats to remain temporally consistent while rendering motion flexibly (see the second sketch after this list).
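To make the object-level bundle adjustment concrete, here is a minimal sketch of a two-view formulation under stated assumptions: pinhole intrinsics `K`, a monocular depth prior for the first frame, and matched pixels per rigid segment (the static background plus each moving object). The parameterization, the global depth-scale variable, and the names (`se3_exp`, `residuals`, `make_segment`) are illustrative choices, not the paper's implementation.

```python
# Sketch of object-level two-view bundle adjustment: jointly estimate the
# relative camera pose and one SE(3) motion per dynamic object by minimizing
# reprojection error plus a simple depth-regularization term.
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation as R

def se3_exp(xi):
    """6-vector (axis-angle rotation, translation) -> 4x4 SE(3) matrix."""
    T = np.eye(4)
    T[:3, :3] = R.from_rotvec(xi[:3]).as_matrix()
    T[:3, 3] = xi[3:]
    return T

def backproject(u, d, K):
    """Lift pixels u (N, 2) with depths d (N,) to 3D points in camera coordinates."""
    uv1 = np.hstack([u, np.ones((len(u), 1))])
    return (np.linalg.inv(K) @ uv1.T).T * d[:, None]

def project(X, K):
    """Project camera-frame 3D points (N, 3) to pixels (N, 2)."""
    x = (K @ X.T).T
    return x[:, :2] / x[:, 2:3]

def residuals(params, segments, K, lam_depth=0.1):
    """params   = [global log depth scale, camera xi (6), object_1 xi (6), ...]
    segments = list of (u1, u2, d1_prior, is_static), one per piecewise-rigid segment."""
    log_s = params[0]
    T_cam = se3_exp(params[1:7])                 # frame-1 -> frame-2 camera motion
    res, k = [], 7
    for (u1, u2, d1_prior, is_static) in segments:
        if is_static:
            T_obj = np.eye(4)                    # background moves with the camera only
        else:
            T_obj = se3_exp(params[k:k + 6])     # rigid motion of this dynamic object
            k += 6
        X1 = backproject(u1, np.exp(log_s) * d1_prior, K)
        X1h = np.hstack([X1, np.ones((len(X1), 1))])
        X2 = (T_cam @ T_obj @ X1h.T).T[:, :3]    # move the object, then change viewpoint
        res.append((project(X2, K) - u2).ravel())   # reprojection error in frame 2
    res.append(np.array([lam_depth * log_s]))       # stay close to the monocular depth prior
    return np.concatenate(res)

# Tiny synthetic check: a static background plus one moving object.
rng = np.random.default_rng(0)
K = np.array([[500., 0., 320.], [0., 500., 240.], [0., 0., 1.]])

def make_segment(T_cam, T_obj, n):
    u1 = rng.uniform([80., 80.], [560., 400.], size=(n, 2))
    d1 = rng.uniform(4.0, 8.0, size=n)
    X1h = np.hstack([backproject(u1, d1, K), np.ones((n, 1))])
    return u1, project((T_cam @ T_obj @ X1h.T).T[:, :3], K), d1

T_cam_true = se3_exp(np.array([0.0, 0.02, 0.0, 0.10, 0.0, 0.05]))
T_obj_true = se3_exp(np.array([0.0, 0.0, 0.03, 0.50, 0.0, 0.0]))
segments = [make_segment(T_cam_true, np.eye(4), 80) + (True,),
            make_segment(T_cam_true, T_obj_true, 40) + (False,)]
sol = least_squares(residuals, np.zeros(13), args=(segments, K))
print("camera motion estimate:", sol.x[1:7])
```

The depth term here only pins down the global scale that monocular depth leaves ambiguous; the paper's depth regularization is likely richer, but the joint camera-plus-object structure of the optimization is what the sketch is meant to show.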
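The SE(3)-field idea can likewise be sketched as a warp applied to the Gaussian parameters: each Gaussian inherits the rigid transform of the object (or background) it was segmented into, which moves its mean and rotates its orientation, and therefore its covariance. The function names, the quaternion convention, and the log/exp interpolation for intermediate time steps are assumptions for illustration, not the paper's training code.

```python
# Sketch: apply per-Gaussian SE(3) transforms derived from object segmentation.
import numpy as np
from scipy.linalg import expm, logm
from scipy.spatial.transform import Rotation as R

def warp_gaussians(means, quats, obj_ids, obj_motions):
    """Move 3D Gaussians from frame 1 to frame 2.

    means       (N, 3)    Gaussian centers in frame-1 world coordinates
    quats       (N, 4)    orientations as xyzw quaternions (scipy convention)
    obj_ids     (N,)      index of the rigid object each Gaussian belongs to
    obj_motions (M, 4, 4) one SE(3) matrix per object (identity for static parts)
    """
    T = obj_motions[obj_ids]                              # (N, 4, 4) per-Gaussian SE(3)
    new_means = np.einsum('nij,nj->ni', T[:, :3, :3], means) + T[:, :3, 3]
    # Rotating a Gaussian rotates the R factor of its covariance R S S^T R^T.
    R_g = R.from_quat(quats).as_matrix()                  # (N, 3, 3)
    new_R = np.einsum('nij,njk->nik', T[:, :3, :3], R_g)
    return new_means, R.from_matrix(new_R).as_quat()

def fractional_se3(T, t):
    """T^t via matrix log/exp: a simple constant-velocity way to place
    Gaussians at an intermediate time t in [0, 1] between the two views."""
    return np.real(expm(t * logm(T)))
```

Because every Gaussian carries one discrete transform rather than a per-frame deformation field, both views share the same underlying geometry, which is presumably where the temporal-consistency benefit comes from.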
These innovations result in a method that enhances the synthesis of novel views in dynamic environments, as evidenced by experimental results on both synthetic and real-world datasets. The paper shows significant improvements over existing state-of-the-art models, which are generally limited to static or well-posed configurations.
Key Results
The paper demonstrates several strong numerical results. On datasets such as KITTI and Kubric, the method consistently surpasses alternative approaches like InstantSplat and SC-GS in key metrics (PSNR, SSIM, and LPIPS). For instance, DynSUP achieves a PSNR of 33.86 on the Kubric dataset, thereby outperforming competitive models by a significant margin. These figures suggest a marked improvement in handling scenes with multiple moving objects and unknown temporal orders.
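For reference, PSNR, the metric quoted above, is just a log-scaled mean squared error; a minimal computation, assuming images normalized to [0, 1], is shown below. SSIM and LPIPS require dedicated implementations (e.g. scikit-image and the lpips package) and are omitted here.

```python
import numpy as np

def psnr(rendered, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two images; higher is better."""
    mse = np.mean((np.asarray(rendered, dtype=np.float64) -
                   np.asarray(target, dtype=np.float64)) ** 2)
    return float('inf') if mse == 0 else 10.0 * np.log10(max_val ** 2 / mse)
```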
Implications and Future Work
The approach opens up new possibilities for AI applications requiring realistic scene modeling from limited data, such as augmented reality and robotics. The ability to infer dynamic object motion and accurately reconstruct scenes from minimal views can allow autonomous systems to operate more reliably in uncertain environments.
Future research could extend this work by addressing its limitations. While the approach handles piecewise rigid scenes well, it remains less effective with non-rigid deformations. Enhanced object segmentation techniques could also improve dynamic scene reconstruction where the initial segmentation is coarse or inaccurate, and integration with deep learning models capable of handling non-rigidity might further increase the framework's robustness and versatility.
In conclusion, the DynSUP framework marks a significant step forward in novel view synthesis, offering a robust methodology for dynamic scene reconstruction from sparse, pose-free image pairs. Its contributions lay a foundation for subsequent research and applications in complex, dynamically evolving environments.