Deep Patch Visual Odometry (2208.04726v2)

Published 8 Aug 2022 in cs.CV

Abstract: We propose Deep Patch Visual Odometry (DPVO), a new deep learning system for monocular Visual Odometry (VO). DPVO uses a novel recurrent network architecture designed for tracking image patches across time. Recent approaches to VO have significantly improved the state-of-the-art accuracy by using deep networks to predict dense flow between video frames. However, using dense flow incurs a large computational cost, making these previous methods impractical for many use cases. Despite this, it has been assumed that dense flow is important as it provides additional redundancy against incorrect matches. DPVO disproves this assumption, showing that it is possible to get the best accuracy and efficiency by exploiting the advantages of sparse patch-based matching over dense flow. DPVO introduces a novel recurrent update operator for patch based correspondence coupled with differentiable bundle adjustment. On Standard benchmarks, DPVO outperforms all prior work, including the learning-based state-of-the-art VO-system (DROID) using a third of the memory while running 3x faster on average. Code is available at https://github.com/princeton-vl/DPVO

Summary

The paper introduces a novel recurrent network with differentiable bundle adjustment that tracks sparse patches to boost efficiency and accuracy.
Its faster variant reaches up to 120 FPS on an RTX 3090, and the system uses less memory than dense flow methods such as DROID-SLAM.
Evaluations on TartanAir and EuRoC datasets demonstrate reduced trajectory error, establishing a new state-of-the-art for monocular visual odometry.

Deep Patch Visual Odometry: An Overview

Deep Patch Visual Odometry (DPVO) is a deep learning system for monocular Visual Odometry (VO) that tracks sparse image patches instead of predicting dense flow between frames, as recent deep VO systems do. The paper challenges the assumption that dense flow is needed for robustness against incorrect matches, and shows that sparse patch matching can deliver both better accuracy and better efficiency.

Key Contributions

DPVO introduces a recurrent network architecture designed specifically to track image patches across time, coupled with differentiable bundle adjustment. This architecture results in significant improvements in both computational efficiency and memory usage compared to prior methods:

Efficiency: DPVO runs at an average of 60 FPS on an RTX 3090 using 4.9 GB of memory, compared to 40 FPS and 8.7 GB for DROID-SLAM. A faster variant of DPVO reaches up to 120 FPS while still outperforming previous works.
Memory Usage: Because it tracks sparse patches rather than dense flow, DPVO needs far less memory. The abstract reports about a third of DROID's memory on average across standard benchmarks.
Accuracy: DPVO outperforms all prior work on standard benchmarks, setting a new state of the art for monocular VO.
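
As a quick worked check, the throughput and memory figures quoted in the Efficiency item can be turned into ratios. The sketch below uses only those quoted numbers; the variable names are illustrative. These ratios describe that one hardware configuration and need not match averages reported over several benchmarks.

```python
# Figures quoted in the Efficiency item (RTX 3090).
dpvo_fps, dpvo_mem_gb = 60.0, 4.9
droid_fps, droid_mem_gb = 40.0, 8.7
dpvo_fast_fps = 120.0

# Relative throughput of the default and fast DPVO variants.
speedup = dpvo_fps / droid_fps
fast_speedup = dpvo_fast_fps / droid_fps

# Fraction of DROID-SLAM's memory used by the default variant.
memory_ratio = dpvo_mem_gb / droid_mem_gb

print(f"default speedup: {speedup:.2f}x")       # default speedup: 1.50x
print(f"fast speedup:    {fast_speedup:.2f}x")  # fast speedup:    3.00x
print(f"memory ratio:    {memory_ratio:.2f}")   # memory ratio:    0.56
```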

Technical Approach

The DPVO system relies on a sparse representation of the scene through patches rather than dense pixel-level flow:

Patch Representation: DPVO represents each patch with deep features and tracks it over time. Because a patch covers a small neighborhood rather than a single pixel, it captures local context, which improves matching accuracy.
Recurrent Update Operator: The recurrent update operator iteratively refines patch locations using a combination of feature extraction, temporal convolution, and softmax aggregation, with graph-based message passing tailored to the patch-based representation. Rather than precomputing correlation volumes, DPVO computes correlation on the fly, which reduces overhead.
Differentiable Bundle Adjustment: The system iteratively solves for camera pose and depth updates by minimizing a reprojection objective measured with a Mahalanobis distance.
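
To make the bundle adjustment objective more concrete, here is a minimal sketch of a single residual term. It is not DPVO's implementation. It assumes a pinhole camera, an inverse-depth parameterization for the patch center, 4x4 world-to-camera poses, and a diagonal weighting, so the Mahalanobis distance reduces to a per-axis weighted squared error. The function names `reproject` and `mahalanobis_cost`, the intrinsics, and the "network output" values are all hypothetical.

```python
import numpy as np

def reproject(pixel, inv_depth, K, T_i, T_j):
    """Map a pixel with inverse depth from frame i into frame j.

    T_i and T_j are assumed to be 4x4 world-to-camera transforms.
    """
    # Back-project the pixel to a 3D point in camera i coordinates.
    ray = np.linalg.inv(K) @ np.array([pixel[0], pixel[1], 1.0])
    point_i = ray / inv_depth
    # Relative transform from camera i coordinates to camera j coordinates.
    T_ij = T_j @ np.linalg.inv(T_i)
    point_j = T_ij[:3, :3] @ point_i + T_ij[:3, 3]
    # Pinhole projection into frame j.
    uvw = K @ point_j
    return uvw[:2] / uvw[2]

def mahalanobis_cost(target, reprojected, weights):
    """Squared Mahalanobis distance with a diagonal inverse covariance."""
    r = target - reprojected
    return float(r @ (weights * r))

# Hypothetical intrinsics and poses for illustration only.
K = np.array([[100.0, 0.0, 320.0],
              [0.0, 100.0, 240.0],
              [0.0, 0.0, 1.0]])
T_i = np.eye(4)
T_j = np.eye(4)
T_j[0, 3] = 0.2  # points appear shifted by +0.2 along x in frame j

center = np.array([320.0, 240.0])
reprojected = reproject(center, inv_depth=0.5, K=K, T_i=T_i, T_j=T_j)

# Stand-in for a network prediction: a target location and per-axis confidence.
target = reprojected + np.array([1.0, 2.0])
weights = np.array([1.0, 0.5])
cost = mahalanobis_cost(target, reprojected, weights)
```

In a full system, a cost of this form would be summed over all patch-frame pairs and minimized with respect to the poses and depths, for example with Gauss-Newton steps. The sketch stops at evaluating one term.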

Evaluation

DPVO is evaluated on multiple datasets, including TartanAir and EuRoC, and consistently achieves lower Absolute Trajectory Error (ATE) than state-of-the-art approaches such as DROID-SLAM. The sparse approach remains effective in difficult scenarios, including sequences with significant motion blur and erratic camera movement.
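
For readers unfamiliar with the metric, ATE is commonly computed as the root-mean-square distance between estimated and ground-truth camera positions after aligning the two trajectories. The sketch below assumes a similarity (scale, rotation, translation) alignment in the style of Umeyama, since monocular estimates are only defined up to scale. The exact evaluation protocol is not described in this summary, and the function names are illustrative.

```python
import numpy as np

def align_umeyama(est, gt):
    """Least-squares similarity transform (s, R, t) mapping est onto gt."""
    mu_e, mu_g = est.mean(axis=0), gt.mean(axis=0)
    E, G = est - mu_e, gt - mu_g
    # Cross-covariance between ground-truth and estimated positions.
    cov = G.T @ E / len(est)
    U, D, Vt = np.linalg.svd(cov)
    # Guard against a reflection in the recovered rotation.
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0
    R = U @ S @ Vt
    var_e = (E ** 2).sum() / len(est)
    s = np.trace(np.diag(D) @ S) / var_e
    t = mu_g - s * R @ mu_e
    return s, R, t

def ate_rmse(est, gt):
    """RMSE of position error after similarity alignment."""
    s, R, t = align_umeyama(est, gt)
    aligned = s * (est @ R.T) + t
    return float(np.sqrt(((aligned - gt) ** 2).sum(axis=1).mean()))

# Synthetic check: est is gt under a known similarity transform,
# so the aligned error should be numerically zero.
rng = np.random.default_rng(0)
gt = rng.normal(size=(50, 3))
angle = 0.3
R0 = np.array([[np.cos(angle), -np.sin(angle), 0.0],
               [np.sin(angle),  np.cos(angle), 0.0],
               [0.0, 0.0, 1.0]])
est = 0.5 * (gt @ R0.T) + np.array([1.0, 2.0, 3.0])

scale, _, _ = align_umeyama(est, gt)
error = ate_rmse(est, gt)
```

Here the estimate was shrunk by a factor of 0.5, so the recovered scale should be close to 2 and the error close to zero.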

Implications and Future Work

The implications of DPVO's approach extend beyond improved VO performance. The architectural innovations could inspire further research into sparse, efficient representations in other areas of AI and computer vision, particularly where resource constraints are critical.

Future work may explore the integration of DPVO into broader SLAM systems, potentially augmenting global optimization strategies. Further investigation could also focus on enhancing robustness against dynamic environments and varying lighting conditions.

In summary, DPVO shifts monocular VO toward sparse patch tracking combined with a learned recurrent update and differentiable bundle adjustment. It sets a new state of the art for monocular VO while lowering runtime and memory costs, which widens the range of applications feasible on resource-constrained platforms.
