LEAP-VO: Long-term Effective Any Point Tracking for Visual Odometry (2401.01887v2)
Abstract: Visual odometry estimates the motion of a moving camera from visual input. Existing methods, which mostly focus on two-view point tracking, often ignore the rich temporal context in the image sequence, thereby overlooking global motion patterns and providing no assessment of full-trajectory reliability. These shortcomings hinder performance in scenarios with occlusion, dynamic objects, and low-texture regions. To address these challenges, we present the Long-term Effective Any Point Tracking (LEAP) module. LEAP combines visual, inter-track, and temporal cues with carefully selected anchors for dynamic track estimation. Moreover, LEAP's temporal probabilistic formulation integrates distribution updates into a learnable iterative refinement module to reason about point-wise uncertainty. Building on these traits, we develop LEAP-VO, a robust visual odometry system adept at handling occlusions and dynamic scenes. Our integration demonstrates a novel practice of employing long-term point tracking as the front-end. Extensive experiments show that the proposed pipeline significantly outperforms existing baselines across multiple visual odometry benchmarks.
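To make the abstract's central idea concrete, below is a minimal, hypothetical sketch of what a learnable iterative refinement module with per-point uncertainty might look like, in the spirit of LEAP's temporal probabilistic formulation. This is not the authors' implementation: the module name `TrackRefiner`, the tensor layout `(B, T, N, ·)`, and the log-variance parameterization are all illustrative assumptions.

```python
# Hypothetical sketch (not the authors' code): iterative refinement of
# point tracks over a temporal window, with a per-point Gaussian
# uncertainty (log-variance) updated alongside the positions.
import torch
import torch.nn as nn

class TrackRefiner(nn.Module):
    """Refines N point tracks over T frames and predicts per-point variance."""
    def __init__(self, feat_dim: int = 128, hidden: int = 256, iters: int = 4):
        super().__init__()
        self.iters = iters
        # Shared MLP producing a position update and a log-variance update.
        self.update = nn.Sequential(
            nn.Linear(feat_dim + 2 + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 + 1),  # outputs (dx, dy, d_log_var)
        )

    def forward(self, feats, tracks, log_var):
        # feats:   (B, T, N, C)  per-frame features sampled at track points
        # tracks:  (B, T, N, 2)  current (x, y) estimates per frame
        # log_var: (B, T, N, 1)  per-point log-variance (uncertainty)
        for _ in range(self.iters):
            x = torch.cat([feats, tracks, log_var], dim=-1)
            delta = self.update(x)
            tracks = tracks + delta[..., :2]    # refine positions
            log_var = log_var + delta[..., 2:]  # refine uncertainty estimate
        return tracks, log_var

# Toy usage: keep only low-uncertainty tracks for the VO back-end.
B, T, N, C = 1, 8, 64, 128
refiner = TrackRefiner(feat_dim=C)
feats = torch.randn(B, T, N, C)
tracks0 = torch.rand(B, T, N, 2) * 100        # initial track guesses (pixels)
log_var0 = torch.zeros(B, T, N, 1)
tracks, log_var = refiner(feats, tracks0, log_var0)
conf = torch.exp(-log_var).mean(dim=1).squeeze(-1)  # (B, N) mean confidence
keep = conf > conf.median()                         # filter unreliable tracks
print("kept", int(keep.sum()), "of", N, "tracks")
```

The final filtering step mirrors the role the abstract assigns to point-wise uncertainty: tracks whose estimated variance stays high across the window would be down-weighted or discarded before camera-pose estimation, which is what makes the system robust to occlusion and dynamic objects.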