
Multi-Session SLAM with Differentiable Wide-Baseline Pose Optimization (2404.15263v1)

Published 23 Apr 2024 in cs.CV

Abstract: We introduce a new system for Multi-Session SLAM, which tracks camera motion across multiple disjoint videos under a single global reference. Our approach couples the prediction of optical flow with solver layers to estimate camera pose. The backbone is trained end-to-end using a novel differentiable solver for wide-baseline two-view pose. The full system can connect disjoint sequences, perform visual odometry, and global optimization. Compared to existing approaches, our design is accurate and robust to catastrophic failures. Code is available at github.com/princeton-vl/MultiSlam_DiffPose

Summary

  • The paper introduces a unified backbone architecture that integrates wide-baseline relative pose estimation with visual odometry for seamless multi-session SLAM.
  • It leverages a novel differentiable solver to minimize symmetric epipolar distance using bi-directional optical flow for precise camera pose estimation.
  • Experimental results on benchmarks like EuRoC-MAV and ETH3D demonstrate the method’s robust performance in handling disjoint video sequences.

Advanced Multi-Session SLAM Using a Differentiable Solver and Optical Flow

Overview of the Proposed System

Researchers have introduced a novel approach to Multi-Session SLAM (Simultaneous Localization and Mapping) that uses a unified architecture to estimate camera motion across multiple disjoint video sequences. This setting matters when video is not recorded continuously but in separate sessions, as happens in collaborative mapping with multiple robots or when capture is interrupted.

The system couples optical flow prediction with a differentiable solver that computes camera poses, and the backbone is trained end-to-end. The architecture is designed to handle disjoint sequences, delivering accurate pose estimation and robustness against tracking failures.
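
To make the end-to-end coupling concrete, the following is a minimal sketch (assuming a PyTorch-style setup) of how gradients can flow through an unrolled solver back to an upstream network. The toy residual, the gradient-descent updates, and all variable names are illustrative assumptions; the paper's solver operates on a two-view pose and bi-directional flow rather than this stand-in parameter.

```python
import torch

def unrolled_solver(residual_fn, theta0, iters=5, step=0.5):
    """A few gradient updates on a squared residual, built from
    differentiable ops (create_graph=True) so the chain rule reaches
    whatever produced the residual's inputs. Illustrative stand-in
    for a pose solver layer."""
    theta = theta0
    for _ in range(iters):
        r = residual_fn(theta)
        (g,) = torch.autograd.grad(r.pow(2).sum(), theta, create_graph=True)
        theta = theta - step * g
    return theta

# Toy front end: "measurements" come from learnable parameters,
# standing in for flow predicted by a network.
weights = torch.randn(4, requires_grad=True)
measurements = torch.tanh(weights)

target = torch.tensor([0.5, -1.0, 0.3, 0.8])
theta0 = torch.zeros(4, requires_grad=True)

# The solver output depends on the measurements through the unrolled updates.
theta = unrolled_solver(lambda t: t + measurements - target, theta0)

loss = (theta - target).pow(2).sum()   # supervise the solver output
loss.backward()
print(weights.grad)                    # non-zero: gradients flowed through the solver
```

In the actual system, the residual would be built from the predicted bi-directional flow and the current pose estimate, and the updates would be pose-space Gauss-Newton steps, but the mechanism by which supervision reaches the flow backbone is the same: every solver update is a differentiable operation.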

Key Contributions and Findings

  1. Unified Backbone Architecture:
    • A single backbone architecture performs both wide-baseline relative pose estimation and visual odometry, giving a straightforward route to Multi-Session SLAM.
  2. Differentiable Solver Layer:
    • Central to the method is a novel differentiable solver layer that minimizes the symmetric epipolar distance computed from bi-directional optical flow. This layer estimates two-view poses that can align views separated by wide baselines (a sketch of this objective follows this list).
  3. Robust and Accurate Performance:
    • The system was tested on challenging real-world datasets such as EuRoC-MAV and ETH3D, where it demonstrated superior accuracy and robustness compared to existing SLAM approaches.
  4. Competitive Results on Public Datasets:
    • On two-view pose estimation benchmarks such as ScanNet and MegaDepth, the method is competitive with transformer-based matching networks.
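
As an illustration of the kind of objective mentioned in item 2, the sketch below computes a standard symmetric epipolar distance with differentiable PyTorch operations. The function name, the epsilon, and the use of sparse correspondences under a generic essential matrix are illustrative assumptions, not the authors' exact formulation, which works from dense bi-directional flow.

```python
import torch

def symmetric_epipolar_distance(pts0, pts1, E, eps=1e-8):
    """Symmetric epipolar distance for correspondences under an essential matrix.

    pts0, pts1: (N, 2) matched points in normalized image coordinates
                (e.g., correspondences read off a predicted flow field).
    E:          (3, 3) essential matrix mapping view-0 points to epipolar
                lines in view 1 (l1 = E @ x0).
    Returns:    (N,) distances; every operation is differentiable, so the
                result can be minimized with respect to the pose inside a
                solver layer or used directly in a training loss.
    """
    ones = torch.ones(pts0.shape[0], 1, dtype=pts0.dtype, device=pts0.device)
    x0 = torch.cat([pts0, ones], dim=1)   # homogeneous coordinates, view 0
    x1 = torch.cat([pts1, ones], dim=1)   # homogeneous coordinates, view 1

    l1 = x0 @ E.T                          # epipolar lines in view 1: E @ x0_i
    l0 = x1 @ E                            # epipolar lines in view 0: E^T @ x1_i
    algebraic = (x1 * l1).sum(dim=1)       # x1_i^T E x0_i

    return algebraic ** 2 * (
        1.0 / (l1[:, 0] ** 2 + l1[:, 1] ** 2 + eps)
        + 1.0 / (l0[:, 0] ** 2 + l0[:, 1] ** 2 + eps)
    )
```

Summing such distances over flow-derived correspondences, with the essential matrix expressed as a differentiable function of rotation and translation, is one way a solver layer of this kind could be minimized with respect to the pose.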

Theoretical Implications and Practical Applications

  • Theoretical Implications:

The method introduces a novel application of differentiable solver layers in the context of SLAM, expanding theoretical understanding of how continuous optimization techniques can be embedded within deep learning frameworks for enhanced geometric estimation.

  • Practical Applications:

The improved robustness and accuracy in disjoint video sequence handling make this method particularly useful in augmented reality (AR) and robotics, where reliable and precise mapping and localization are crucial under challenging conditions.

Future Research Directions

The promising results invite several avenues for future work, including refining the architecture for better computational efficiency and extending it to dynamic environments. Integrating the framework with other sensing modalities (such as LiDAR or depth sensors) could further improve robustness and accuracy.

Conclusion

The research presents a significant step forward in the domain of Multi-Session SLAM by providing a method that not only addresses the challenges posed by disjoint video sequences but also does so with notable improvements in accuracy and robustness over existing techniques. The use of a differentiable solver for pose estimation in conjunction with optical flow prediction offers a robust framework that could influence future developments in SLAM technology.
