- The paper demonstrates that training with diverse synthetic data from TartanAir significantly enhances the generalization of visual odometry models.
- It introduces an innovative up-to-scale loss function that mitigates scale ambiguity in monocular systems to improve accuracy.
- By incorporating an intrinsics layer, TartanVO adapts to varying camera parameters, achieving competitive results on KITTI, EuRoC, and real-world sensors without finetuning.
An In-depth Review of TartanVO: A Generalizable Learning-based VO
The paper "TartanVO: A Generalizable Learning-based VO" proposes a novel approach for visual odometry (VO) that demonstrates significant generalization capabilities across diverse datasets. This approach aims to surpass traditional geometry-based VO methods, which often falter in real-world applications due to their sensitivity to environmental changes and sensor variations. The primary achievement of the research is the TartanVO model, which is built to generalize effectively across different datasets by leveraging extensive synthetic training data from the TartanAir dataset.
Key Contributions
The authors make several distinct contributions to the field of learning-based VO:
- Data Diversity in Training: The paper emphasizes the impact of data diversity on the generalization capabilities of learning-based VO models. By utilizing TartanAir, a synthetic dataset that covers a wide variety of scenes and motion patterns, the authors demonstrate how training on diverse data can enhance model performance. This approach addresses shortcomings in models limited by the homogeneity of datasets like KITTI and EuRoC, leading to better generalization in unseen environments.
- Up-to-Scale Loss Function: An up-to-scale loss function is introduced to handle the scale ambiguity inherent in monocular vision: absolute scale is unobservable from a monocular image sequence alone. The proposed loss therefore supervises only the direction of translation, not its magnitude, which keeps the training signal well-posed and helps the model generalize across camera setups (a sketch of this loss follows the list).
- Intrinsics Layer for Cross-Camera Generalization: To address the intrinsics ambiguity, where the same camera motion produces different image-space displacements under different intrinsic parameters (focal length, principal point), the authors concatenate an intrinsics layer to the pose network's input: a per-pixel encoding of normalized pixel coordinates. Combined with random resizing and cropping during training, this lets a single model adapt to varying camera intrinsics, supporting cross-camera generalization (see the second sketch after the list).
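To make the up-to-scale idea concrete, here is a minimal sketch of such a loss in the spirit of the paper's formulation: both the predicted and ground-truth translations are normalized to unit length before comparison, so the unrecoverable monocular scale drops out. The function signature, the L2 distances, and the 3-vector (e.g., axis-angle) rotation representation are assumptions for illustration:

```python
import torch

def up_to_scale_loss(t_pred, t_gt, r_pred, r_gt, eps=1e-6):
    """Pose loss that penalizes only the direction of translation.

    t_pred, t_gt: [B, 3] translations; r_pred, r_gt: [B, 3] rotations.
    """
    # Normalize translations to unit vectors (eps guards near-zero motion).
    t_pred_dir = t_pred / torch.clamp(t_pred.norm(dim=1, keepdim=True), min=eps)
    t_gt_dir = t_gt / torch.clamp(t_gt.norm(dim=1, keepdim=True), min=eps)
    trans_loss = (t_pred_dir - t_gt_dir).norm(dim=1).mean()
    # Rotation has no scale ambiguity, so it is supervised directly.
    rot_loss = (r_pred - r_gt).norm(dim=1).mean()
    return trans_loss + rot_loss
```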
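The intrinsics layer itself is simple to construct. Following the idea in the paper, each pixel (u, v) is mapped to its normalized image-plane coordinates ((u - cx)/fx, (v - cy)/fy), giving two extra channels that are concatenated to the pose network's input; the exact tensor layout below is an assumption:

```python
import torch

def make_intrinsics_layer(fx, fy, cx, cy, height, width):
    """Two-channel per-pixel encoding of the camera intrinsics."""
    # Pixel coordinate grids: v indexes rows (y), u indexes columns (x).
    v, u = torch.meshgrid(
        torch.arange(height, dtype=torch.float32),
        torch.arange(width, dtype=torch.float32),
        indexing="ij",
    )
    # Normalized image-plane coordinates encode fx, fy, cx, cy per pixel.
    x = (u - cx) / fx
    y = (v - cy) / fy
    return torch.stack([x, y], dim=0)  # [2, H, W], broadcast over a batch
```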
Notable Experimental Results
The experimental results presented in the paper are compelling. TartanVO achieves competitive performance on the KITTI and EuRoC datasets without any finetuning, despite being trained solely on synthetic data. This success highlights the model's robustness and generalization capabilities, particularly its ability to handle challenging scenarios with aggressive motion patterns and varying environmental conditions.
When applied to real-world data from a customized sensor, TartanVO performs comparably to Intel's RealSense T265, a dedicated tracking camera, underscoring its potential for real-world deployment. Moreover, in the challenging scenarios provided by TartanAir, TartanVO proves more robust than ORB-SLAM, a staple geometry-based method, which frequently loses tracking under such conditions.
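For context on how such trajectory accuracy is typically quantified: VO evaluations commonly report the Absolute Trajectory Error (ATE), the RMSE of position errors after rigidly aligning the estimated trajectory to the ground truth. A minimal sketch, assuming two pre-associated [N, 3] position arrays (real toolchains also handle timestamp association and, for monocular methods, a scaled alignment):

```python
import numpy as np

def absolute_trajectory_error(pred_xyz, gt_xyz):
    """RMSE of position error after rigid (rotation + translation) alignment."""
    # Center both trajectories.
    mu_p, mu_g = pred_xyz.mean(axis=0), gt_xyz.mean(axis=0)
    P, G = pred_xyz - mu_p, gt_xyz - mu_g
    # Optimal rotation from the SVD of the cross-covariance (Kabsch/Umeyama).
    U, _, Vt = np.linalg.svd(G.T @ P)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(U) * np.linalg.det(Vt))])
    R = U @ D @ Vt
    # Apply the alignment, then measure the remaining per-pose error.
    aligned = (R @ P.T).T + mu_g
    return np.sqrt(np.mean(np.sum((aligned - gt_xyz) ** 2, axis=1)))
```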
Implications and Future Directions
The implications of this research reach beyond the immediate improvements in VO. The ability to generalize across datasets without retraining or finetuning opens avenues for deploying visual odometry in varied environments with minimal configuration effort. Moreover, the intrinsics layer is a natural starting point for adapting such models to an even broader range of cameras.
Future directions could extend the generalization capabilities to visual-inertial odometry (VIO), which integrates IMU data, or to stereo setups that provide additional depth information. Further research could also investigate multi-frame processing to refine trajectory estimates, exploiting temporal information the way existing multi-frame models do, while keeping the model robust and avoiding excessive complexity.
Overall, TartanVO is a promising step towards more versatile and resilient VO systems capable of handling the unpredictability of real-world applications, pushing the boundary of what learning-based vision models can achieve without being tethered to specific scenarios or datasets.