
GIM: Learning Generalizable Image Matcher From Internet Videos (2402.11095v1)

Published 16 Feb 2024 in cs.CV

Abstract: Image matching is a fundamental computer vision problem. While learning-based methods achieve state-of-the-art performance on existing benchmarks, they generalize poorly to in-the-wild images. Such methods typically need to train separate models for different scene types and are impractical when the scene type is unknown in advance. One of the underlying problems is the limited scalability of existing data construction pipelines, which limits the diversity of standard image matching datasets. To address this problem, we propose GIM, a self-training framework for learning a single generalizable model based on any image matching architecture using internet videos, an abundant and diverse data source. Given an architecture, GIM first trains it on standard domain-specific datasets and then combines it with complementary matching methods to create dense labels on nearby frames of novel videos. These labels are filtered by robust fitting, and then enhanced by propagating them to distant frames. The final model is trained on propagated data with strong augmentations. We also propose ZEB, the first zero-shot evaluation benchmark for image matching. By mixing data from diverse domains, ZEB can thoroughly assess the cross-domain generalization performance of different methods. Applying GIM consistently improves the zero-shot performance of 3 state-of-the-art image matching architectures; with 50 hours of YouTube videos, the relative zero-shot performance improves by 8.4%-18.1%. GIM also enables generalization to extreme cross-domain data such as Bird's Eye View (BEV) images of projected 3D point clouds (Fig. 1(c)). More importantly, our single zero-shot model consistently outperforms domain-specific baselines when evaluated on downstream tasks inherent to their respective domains. The video presentation is available at https://www.youtube.com/watch?v=FU_MJLD8LeY.
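The label-propagation step described in the abstract (chaining reliable nearby-frame matches to obtain supervision between distant frames) can be sketched as follows. This is a minimal illustration under assumed data structures, not the authors' implementation: correspondences are represented as dictionaries mapping point IDs, and composition stands in for the geometric chaining GIM performs on dense matches; the robust-fitting filter from the paper is omitted here.

```python
# Hypothetical sketch of GIM-style label propagation (not the authors' code).
# Each adjacent-frame match set maps a point ID in frame t to its
# correspondence in frame t+1. Composing these chains yields pseudo-labels
# between distant frames; points that lose their track along the way
# simply drop out, which acts as an implicit filter.

def compose_matches(m_ab, m_bc):
    """Chain a->b and b->c correspondences into a->c labels."""
    return {a: m_bc[b] for a, b in m_ab.items() if b in m_bc}

def propagate(adjacent_matches):
    """Fold a list of adjacent-frame matches into frame-0 -> frame-N labels."""
    labels = adjacent_matches[0]
    for m in adjacent_matches[1:]:
        labels = compose_matches(labels, m)
    return labels
```

In the actual pipeline, each per-pair match set would first be filtered by robust model fitting (e.g. RANSAC-style epipolar-geometry estimation) before composition, so that only geometrically consistent correspondences are propagated.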

