Bundle Adjustment for Dynamic Scene Reconstruction
This paper introduces a novel framework, BA-Track, for dynamic scene reconstruction using bundle adjustment (BA) in casual video sequences featuring both static and dynamic elements. Traditional Simultaneous Localization and Mapping (SLAM) systems, heavily reliant on BA, tend to falter in such scenarios due to the assumption of static environments—a premise often violated by the presence of moving objects. Current solutions either exclude dynamic components, leading to incomplete reconstructions, or model them separately, risking inconsistency in motion estimates. BA-Track circumvents these issues by separating camera-induced motion from the motion of dynamic objects through a learning-based 3D point tracker.
The core innovation lies in the motion decoupling strategy, which isolates the static component of observed point motion, enabling BA to operate on dynamic scenes as if all points were static. This separation is achieved via a two-network approach: the first network estimates the total observed motion, while the second isolates the object-induced component; removing the latter leaves the camera-induced motion. This mechanism allows for accurate camera pose estimation and temporally coherent, dense 3D reconstructions within the SLAM framework. The approach benefits from robust learning-based priors integrated via monocular depth models, facilitating the distinction between camera- and object-based motions.
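One plausible reading of the decoupling step is a simple subtraction in 3D: the object-induced motion predicted by the second network is removed from the total observed motion, and the residual camera-induced component is handed to BA. The sketch below illustrates this with hypothetical NumPy arrays of per-point motion vectors; it is a toy illustration of the idea, not the paper's implementation.

```python
import numpy as np

def decouple_motion(total_motion, object_motion):
    """Remove the predicted object-induced motion from the total observed
    3D point motion, leaving the camera-induced (static) component that
    bundle adjustment can treat as if the scene were rigid."""
    return total_motion - object_motion

# Toy example: three tracked points with 3D motion vectors.
total = np.array([[0.5, 0.1, 0.0],    # point on a moving object
                  [0.2, 0.0, 0.3],    # static points: motion is
                  [0.1, 0.1, 0.1]])   # purely camera-induced
obj = np.array([[0.4, 0.0, 0.0],      # predicted object motion
                [0.0, 0.0, 0.0],
                [0.0, 0.0, 0.0]])

static_component = decouple_motion(total, obj)
print(static_component)  # camera-induced motion for every point
```

After this step, even points on moving objects contribute motion that is consistent with a static scene, which is why standard BA residuals remain valid.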
Empirical evaluations demonstrate significant performance enhancements in camera motion estimation and 3D reconstruction accuracy across challenging datasets such as MPI Sintel, AirDOS Shibuya, and Epic Fields. BA-Track consistently yields superior results, notably in Absolute Translation Error (ATE), when compared to existing visual odometry (VO) and SLAM methods, including DROID-SLAM, ParticleSfM, and CasualSAM. This robust performance underscores the effectiveness of the motion decoupling strategy and the integration of sparse dynamic SLAM with global refinement processes.
Furthermore, the paper explores depth refinement techniques to ensure temporal and spatial consistency across video sequences. Utilizing a global refinement module, the framework leverages sparse geometry from BA to refine dense depth maps, leading to improved reconstruction quality. Experiments on real-world datasets, such as the Bonn RGB-D, show marked improvements in depth accuracy, highlighting the efficacy of the refinement process.
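A common way to anchor dense monocular depth to sparse BA geometry is a per-frame least-squares scale-and-shift fit at the pixels where sparse depths exist. The sketch below shows this alignment idea with hypothetical arrays; the paper's global refinement module is learned and more sophisticated, so this is only an assumed, minimal stand-in.

```python
import numpy as np

def align_depth(dense_depth, sparse_depth, mask):
    """Fit a per-frame scale and shift that aligns a dense monocular
    depth map to sparse depths recovered by bundle adjustment, then
    apply the fit to the whole map.

    dense_depth: (H, W) dense depth prediction
    sparse_depth: (H, W) BA depths, valid where mask is True
    mask: (H, W) boolean validity mask for sparse_depth
    """
    d = dense_depth[mask]
    s = sparse_depth[mask]
    # Solve min ||scale * d + shift - s||^2 in closed form via lstsq.
    A = np.stack([d, np.ones_like(d)], axis=1)
    (scale, shift), *_ = np.linalg.lstsq(A, s, rcond=None)
    return scale * dense_depth + shift
```

Because the fit uses only the sparse, BA-consistent points, the refined map inherits their metric consistency across frames while keeping the dense coverage of the monocular prediction.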
Despite the promising results and innovations, the paper identifies certain limitations, such as the reliance on predefined camera intrinsic parameters. Future directions include jointly optimizing these parameters with pose and depth estimates to improve robustness to calibration errors. Additionally, exploring alternative refinement models using dense vector fields or neural representations could further advance the precision and applicability of the framework.
In summary, BA-Track effectively adapts traditional BA to dynamic scene reconstruction, achieving reliable and coherent results by leveraging advanced learning-based approaches. This research presents substantive progress toward robust, real-world applications in augmented reality and robotics, where the need to accommodate dynamic environmental elements is paramount. Future developments may continue building on this work, enhancing refinement models and incorporating intrinsic parameter flexibility to broaden its applicability and efficiency.