- The paper introduces a learned tracking network that refines camera pose estimates using an incremental frame-to-keyframe approach and a multiple hypothesis method.
- It proposes a mapping network that combines a multi-frame cost volume with image-based priors and iteratively refines the depth map for improved accuracy.
- Results demonstrate DeepTAM's robust performance across various datasets, surpassing traditional and learning-based SLAM methods in tracking and mapping.
Overview of DeepTAM: Deep Tracking and Mapping
The paper "DeepTAM: Deep Tracking and Mapping" presents a novel approach in the domain of computer vision, focusing on the advancement of simultaneous tracking and mapping (SLAM) systems using deep learning techniques. Addressing the challenges associated with camera pose estimation and depth mapping, the authors introduce a system that leverages neural networks for keyframe-based dense camera tracking and depth map estimation.
Main Contributions
- Learned Tracking Network: The paper introduces a network architecture for incremental frame-to-keyframe tracking that alleviates dataset bias by estimating small pose increments rather than absolute poses. Coupled with a multiple-hypothesis representation of the camera pose, this yields more accurate pose estimates, which are refined iteratively in a coarse-to-fine manner (a minimal sketch follows this list).
- Learned Mapping Network: For depth map estimation, the mapping network combines a cost volume accumulated over multiple images with image-based priors. The architecture supports iterative refinement and uses a narrow band of depth hypotheses around the current estimate to recover detail and improve accuracy (see the second sketch after this list).
- Generalization Capabilities: The system generalizes effectively across different datasets, outperforming competing methods, including traditional RGB-D SLAM systems and learning-based methods like DeMoN and CNN-SLAM. The approach is particularly robust under noisy camera poses and performs notably well with a limited number of frames.
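To make the tracking idea concrete, the following is a minimal sketch of coarse-to-fine pose refinement with multiple pose hypotheses. The function `pose_increment_net` is a stub standing in for the learned tracking CNN; its name, signature, and the number of hypotheses are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of coarse-to-fine, multi-hypothesis pose tracking in the spirit
# of DeepTAM. All names and signatures here are illustrative stand-ins.

import numpy as np

def se3_exp(xi):
    """Map a 6-vector (rotation omega, translation v) to a 4x4 rigid transform."""
    omega, v = xi[:3], xi[3:]
    theta = np.linalg.norm(omega)
    K = np.array([[0, -omega[2], omega[1]],
                  [omega[2], 0, -omega[0]],
                  [-omega[1], omega[0], 0]])
    if theta < 1e-8:
        R, V = np.eye(3) + K, np.eye(3)
    else:
        R = (np.eye(3) + np.sin(theta) / theta * K
             + (1 - np.cos(theta)) / theta**2 * K @ K)
        V = (np.eye(3) + (1 - np.cos(theta)) / theta**2 * K
             + (theta - np.sin(theta)) / theta**3 * K @ K)
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = V @ v
    return T

def pose_increment_net(keyframe, frame, T_guess, level, num_hypotheses=64):
    """Stub for the learned tracking network: in DeepTAM a CNN sees the keyframe
    (image + depth) rendered at the current pose guess together with the new frame
    and outputs many small pose-increment hypotheses. Here we return zero-mean
    noise so the sketch runs end to end."""
    return 0.01 * np.random.randn(num_hypotheses, 6) / (2 ** level)

def track_frame(keyframe, frame, T_init, levels=3):
    """Coarse-to-fine tracking: refine the keyframe-to-frame pose over resolution
    levels; at each level the hypotheses are averaged to obtain the pose update."""
    T = T_init.copy()
    for level in reversed(range(levels)):      # coarse (level 2) -> fine (level 0)
        hypotheses = pose_increment_net(keyframe, frame, T, level)
        xi_mean = hypotheses.mean(axis=0)      # point estimate of the increment
        xi_spread = hypotheses.std(axis=0)     # rough uncertainty measure
        T = se3_exp(xi_mean) @ T               # left-multiply the refinement
    return T, xi_spread

if __name__ == "__main__":
    T_pose, spread = track_frame(keyframe=None, frame=None, T_init=np.eye(4))
    print(T_pose)
```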
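The mapping component can be illustrated in a similar spirit: accumulate a plane-sweep cost volume over several frames, take an initial depth, then refine within a narrow band of depth hypotheses around that estimate. The `photo_error` stub replaces the real photoconsistency term (which requires camera intrinsics and warping), and the winner-take-all depth selection stands in for the CNN that fuses the cost volume with image priors; all names are illustrative assumptions.

```python
# Minimal sketch of DeepTAM-style mapping: multi-frame cost volume accumulation
# followed by narrow-band depth refinement. Names and constants are illustrative.

import numpy as np

H, W = 48, 64          # tiny image size so the sketch runs quickly
D = 32                 # number of depth labels in the cost volume

def photo_error(keyframe, frame, pose, depth_map):
    """Stub: per-pixel matching cost of `frame` warped into the keyframe at the
    given depths. In the paper this is a photoconsistency error; here it's noise."""
    return np.random.rand(H, W)

def build_cost_volume(keyframe, frames, poses, depth_labels):
    """Sum the matching cost over all frames for every depth label (plane sweep)."""
    cost = np.zeros((len(depth_labels), H, W))
    for d, depth in enumerate(depth_labels):
        depth_map = np.full((H, W), depth)
        for frame, pose in zip(frames, poses):
            cost[d] += photo_error(keyframe, frame, pose, depth_map)
    return cost

def refine_depth(keyframe, frames, poses, d_min=0.5, d_max=5.0, band=0.2, iters=3):
    # First pass: regularly spaced depth labels over the full working range.
    labels = np.linspace(d_min, d_max, D)
    cost = build_cost_volume(keyframe, frames, poses, labels)
    depth = labels[np.argmin(cost, axis=0)]              # winner-take-all depth

    # Narrow-band passes: re-sample depth labels in a band around the current
    # estimate so the volume spends its resolution near the surface. In DeepTAM
    # a CNN combines this band with image priors; here we keep winner-take-all.
    for _ in range(iters):
        offsets = np.linspace(-band, band, D)
        band_cost = np.zeros((D, H, W))
        for d, off in enumerate(offsets):
            band_cost[d] = sum(photo_error(keyframe, f, p, depth + off)
                               for f, p in zip(frames, poses))
        depth = depth + offsets[np.argmin(band_cost, axis=0)]
    return depth

if __name__ == "__main__":
    depth = refine_depth(keyframe=None, frames=[None] * 4, poses=[None] * 4)
    print(depth.shape, depth.min(), depth.max())
```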
Evaluation and Results
The authors rigorously evaluate their approach on several benchmarks, showcasing its ability to compete with or surpass state-of-the-art methods in terms of tracking accuracy and mapping quality:
- Tracking: Tracking performance is validated on the TUM RGB-D benchmark. DeepTAM compares favorably against methods such as RGB-D SLAM, showing improved robustness and reduced translational drift (the drift metric is sketched after this list).
- Mapping: Quantitative assessments on datasets such as SUN3D, SUNCG, and MVS show that the proposed depth estimation method outperforms both classic techniques like DTAM and SGM and learning-based methods such as DeMoN. The combination of multiple frames and iterative refinement contributes significantly to this performance.
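For reference, the TUM RGB-D benchmark quantifies drift via the relative pose error (RPE). The sketch below is an illustrative implementation of the translational RPE under the assumption that poses are 4x4 camera-to-world matrices; it is not the benchmark's own evaluation script.

```python
# Illustrative translational relative pose error (RPE), the drift measure used by
# the TUM RGB-D benchmark; `delta` is the frame spacing of the evaluated interval.

import numpy as np

def translational_rpe(gt_poses, est_poses, delta=1):
    """RMSE of the translational component of the relative pose error."""
    errs = []
    for i in range(len(gt_poses) - delta):
        # Relative motion over the interval, for ground truth and estimate.
        gt_rel = np.linalg.inv(gt_poses[i]) @ gt_poses[i + delta]
        est_rel = np.linalg.inv(est_poses[i]) @ est_poses[i + delta]
        # Discrepancy between the two relative motions; its translation is the drift.
        err = np.linalg.inv(gt_rel) @ est_rel
        errs.append(np.linalg.norm(err[:3, 3]))
    return np.sqrt(np.mean(np.square(errs)))

if __name__ == "__main__":
    traj = [np.eye(4) for _ in range(10)]
    print(translational_rpe(traj, traj))   # identical trajectories -> 0.0
```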
Implications and Future Directions
The research illustrates the potential of deep learning to enhance key SLAM components, achieving robust camera pose estimation and detailed depth mapping with reduced computational burden and greater resilience to poor or noisy inputs. Coupling learned tracking with learned mapping, so that camera motion is tracked and depth maps are updated in real time, opens up intriguing possibilities for real-world applications such as autonomous navigation and augmented reality.
Future work could extend this approach to a full SLAM system by integrating features such as loop closure detection and map optimization techniques. The methods demonstrated may serve as a foundation for developing more advanced SLAM technologies capable of handling complex environments and diverse data inputs with enhanced efficiency and accuracy. Such advancements could play a pivotal role in fields such as robotics, AI-driven surveillance, and interactive media technologies.
Overall, the DeepTAM paper marks a significant step forward in the application of deep learning to visual SLAM systems, providing a framework for further innovation and development within this critical area of computer vision research.