- The paper introduces a novel recurrent network with differentiable bundle adjustment that tracks sparse patches to boost efficiency and accuracy.
- It achieves up to 120 FPS on modern GPUs with significantly lower memory usage than conventional dense flow methods.
- Evaluations on TartanAir and EuRoC datasets demonstrate reduced trajectory error, establishing a new state-of-the-art for monocular visual odometry.
Deep Patch Visual Odometry: An Overview
Deep Patch Visual Odometry (DPVO) is a monocular Visual Odometry (VO) system that tracks sparse image patches instead of estimating dense flow. The paper challenges the assumption that dense flow is necessary for robust and accurate VO, showing that sparse patch matching can deliver both better efficiency and better accuracy.
Key Contributions
DPVO introduces a recurrent network architecture designed specifically to track image patches across time, coupled with differentiable bundle adjustment. This architecture results in significant improvements in both computational efficiency and memory usage compared to prior methods:
- Efficiency: DPVO runs at an average of 60 FPS on an RTX-3090, using only 4.9GB of memory, compared to 40 FPS and 8.7GB for DROID-SLAM. A faster variant of DPVO achieves up to 120 FPS while still outperforming previous works.
- Memory Usage: By tracking sparse patches instead of dense flow, DPVO needs substantially less memory than dense flow approaches (4.9GB versus 8.7GB for DROID-SLAM in the comparison above, roughly 56%).
- Accuracy: DPVO outperforms all prior work on standard benchmarks, advancing the state of the art for monocular VO.
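To put the quoted efficiency figures side by side, the short calculation below derives the relative speed and memory footprint from the numbers given for the RTX-3090 comparison. It uses only the values stated above; the variable names are illustrative.

```python
# Figures quoted in the text for the RTX-3090 comparison.
dpvo = {"fps": 60.0, "memory_gb": 4.9}
droid_slam = {"fps": 40.0, "memory_gb": 8.7}

# Relative frame rate: values above 1.0 mean DPVO is faster.
speedup = dpvo["fps"] / droid_slam["fps"]

# Relative memory: values below 1.0 mean DPVO uses less memory.
memory_fraction = dpvo["memory_gb"] / droid_slam["memory_gb"]

print(f"speedup: {speedup:.2f}x")                  # speedup: 1.50x
print(f"memory fraction: {memory_fraction:.2f}")   # memory fraction: 0.56
```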
Technical Approach
The DPVO system relies on a sparse representation of the scene through patches rather than dense pixel-level flow:
- Patch Representation: Utilizes deep feature-based representations for patches to track keypoints over time. This approach captures local context, improving feature matching accuracy.
- Recurrent Network with Bundle Adjustment: The recurrent update operator iteratively refines patch locations using a combination of feature extraction, temporal convolution, and softmax aggregation. This process leverages graph-based message passing mechanisms unique to the patch-based representation.
- Differentiable Bundle Adjustment: Rather than precomputing correlation volumes, DPVO computes correlation features on the fly, which reduces computational overhead. A differentiable bundle adjustment layer then iteratively solves for camera pose and depth updates by minimizing a reprojection error measured with a Mahalanobis distance.
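The optimization criterion described above can be illustrated with a minimal sketch. It assumes a pinhole camera, a patch parameterized by a pixel location and an inverse depth, and a diagonal weight standing in for the inverse covariance. The function names, the parameterization, and the toy values are assumptions made for illustration and are not taken from the DPVO code.

```python
def reproject(pixel, inv_depth, rotation, translation, intrinsics):
    """Map a pixel from frame i into frame j (hypothetical helper).

    rotation (3x3 nested lists) and translation (length 3) take points
    expressed in frame i into frame j; intrinsics is (fx, fy, cx, cy).
    """
    fx, fy, cx, cy = intrinsics
    u, v = pixel
    # Back-project to a 3D point in frame i; depth is 1 / inverse depth.
    z = 1.0 / inv_depth
    point_i = [(u - cx) / fx * z, (v - cy) / fy * z, z]
    # Apply the rigid transform into frame j.
    point_j = [
        sum(rotation[r][c] * point_i[c] for c in range(3)) + translation[r]
        for r in range(3)
    ]
    # Pinhole projection back to pixel coordinates.
    return (fx * point_j[0] / point_j[2] + cx,
            fy * point_j[1] / point_j[2] + cy)


def mahalanobis_sq(residual, weights):
    """Squared Mahalanobis norm for a diagonal inverse covariance."""
    return sum(w * r * r for r, w in zip(residual, weights))


# Toy values: identity rotation and 0.5 units of sideways translation.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
intrinsics = (500.0, 500.0, 320.0, 240.0)
projected = reproject((320.0, 240.0), 0.5, identity, [0.5, 0.0, 0.0], intrinsics)

# Target location, standing in for the position predicted by the network.
target = (446.0, 240.0)
residual = (projected[0] - target[0], projected[1] - target[1])
cost = mahalanobis_sq(residual, weights=(0.5, 0.5))
print(projected, cost)  # (445.0, 240.0) 0.5
```

A bundle adjustment step would sum this cost over all patch-frame pairs and adjust the poses and depths to reduce it, typically with a Gauss-Newton style solver. That solver is omitted here.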
Evaluation
DPVO demonstrates robust performance across multiple datasets, including TartanAir and EuRoC, consistently achieving lower absolute trajectory error (ATE) than state-of-the-art approaches such as DROID-SLAM. The sparse, efficient methodology proves effective in diverse scenarios, including environments with significant motion blur and erratic camera movements.
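For readers unfamiliar with the metric, ATE is commonly reported as the root-mean-square of per-frame position differences between the estimated and ground-truth trajectories. The sketch below assumes the two trajectories are already time-synchronized and aligned. Monocular evaluations typically also align scale before computing the error, and that step is omitted here. The function name and the sample trajectories are illustrative.

```python
import math


def ate_rmse(estimated, ground_truth):
    """RMSE of per-frame translation error between two aligned trajectories."""
    if not estimated or len(estimated) != len(ground_truth):
        raise ValueError("trajectories must be non-empty and of equal length")
    squared_errors = [
        sum((e - g) ** 2 for e, g in zip(est_pos, gt_pos))
        for est_pos, gt_pos in zip(estimated, ground_truth)
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Toy trajectories: only the middle frame deviates, by 0.5 units.
estimated = [(0.0, 0.0, 0.0), (1.0, 0.3, 0.4), (2.0, 0.0, 0.0)]
ground_truth = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
print(f"ATE RMSE: {ate_rmse(estimated, ground_truth):.4f}")
```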
Implications and Future Work
The implications of DPVO's approach extend beyond improved VO performance. The architectural innovations could inspire further research into sparse, efficient representations in other areas of AI and computer vision, particularly where resource constraints are critical.
Future work may explore the integration of DPVO into broader SLAM systems, potentially augmenting global optimization strategies. Further investigation could also focus on enhancing robustness against dynamic environments and varying lighting conditions.
In summary, DPVO shows that sparse patch tracking, combined with a recurrent update operator and differentiable bundle adjustment, can make VO both more efficient and more accurate. The work sets a new benchmark for monocular VO and broadens the range of applications feasible on resource-constrained platforms.