
LIP-Loc: LiDAR Image Pretraining for Cross-Modal Localization (2312.16648v1)

Published 27 Dec 2023 in cs.RO and cs.CV

Abstract: Global visual localization in LiDAR maps is crucial for autonomous driving applications, yet it remains largely unexplored because bridging the cross-modal heterogeneity gap is difficult. Contrastive Language-Image Pre-Training (CLIP), a popular multi-modal learning approach, popularized a symmetric contrastive loss built on a batch construction technique for the domains of text and image. We apply this approach to the domains of 2D images and 3D LiDAR points on the task of cross-modal localization. Our method works as follows: we construct a batch of N (image, LiDAR) pairs and jointly train an image encoder and a LiDAR encoder to learn a multi-modal embedding space in which the correct matches among the N × N possible pairings across the batch can be predicted. Training maximizes the cosine similarity of the N positive pairings and minimizes that of the remaining N² − N negative pairings. Finally, a symmetric cross-entropy loss is optimized over the resulting similarity scores. To the best of our knowledge, this is the first work to apply a batched loss approach to a cross-modal setting of image and LiDAR data, and the first to show zero-shot transfer in a visual localization setting. We conduct extensive analyses on standard autonomous driving datasets, namely KITTI and KITTI-360. Our method outperforms the state-of-the-art recall@1 accuracy on the KITTI-360 dataset by 22.4% using only perspective images, whereas the state-of-the-art approach uses the more informative fisheye images. This performance is achieved without resorting to complex architectures. We also demonstrate the zero-shot capabilities of our model, beating the state of the art by 8% without training on the evaluation dataset. Furthermore, we establish the first benchmark for cross-modal localization on the KITTI dataset.
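
The batched objective described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the fixed temperature of 0.07, and the random vectors standing in for encoder outputs are all hypothetical, and the recall@1 helper uses a simplified index-match criterion rather than whatever retrieval criterion the paper's benchmark applies.

```python
import numpy as np


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_contrastive_loss(image_emb, lidar_emb, temperature=0.07):
    """CLIP-style symmetric cross-entropy over an N x N similarity matrix.

    image_emb, lidar_emb: arrays of shape (N, D); row i of each array is
    assumed to come from the same (image, LiDAR) pair.
    temperature: hypothetical fixed value; the abstract does not state one.
    """
    # L2-normalize so the dot product equals cosine similarity.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    lidar_emb = lidar_emb / np.linalg.norm(lidar_emb, axis=1, keepdims=True)

    # logits[i, j] = scaled cosine similarity of image i and LiDAR scan j.
    logits = image_emb @ lidar_emb.T / temperature
    diag = np.arange(logits.shape[0])  # positives sit on the diagonal

    # Image-to-LiDAR: each row is an N-way classification over scans.
    loss_i2l = -log_softmax(logits, axis=1)[diag, diag].mean()
    # LiDAR-to-image: each column is an N-way classification over images.
    loss_l2i = -log_softmax(logits, axis=0)[diag, diag].mean()
    return 0.5 * (loss_i2l + loss_l2i)


def recall_at_1(query_emb, database_emb):
    """Fraction of queries whose nearest database entry shares their index.

    Simplified stand-in for a retrieval metric: assumes query i matches
    database entry i exactly.
    """
    query_emb = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    database_emb = database_emb / np.linalg.norm(database_emb, axis=1, keepdims=True)
    nearest = (query_emb @ database_emb.T).argmax(axis=1)
    return float((nearest == np.arange(len(query_emb))).mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Random vectors stand in for the outputs of the two encoders.
    images = rng.normal(size=(8, 32))
    scans = images + 0.1 * rng.normal(size=(8, 32))  # noisy "matching" scans
    print("loss:", symmetric_contrastive_loss(images, scans))
    print("recall@1:", recall_at_1(images, scans))
```

Averaging the row-wise and column-wise cross-entropies is what makes the loss symmetric: swapping the two modalities leaves its value unchanged.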

Authors (5)
  1. Sai Shubodh Puligilla (3 papers)
  2. Mohammad Omama (8 papers)
  3. Husain Zaidi (1 paper)
  4. Udit Singh Parihar (6 papers)
  5. Madhava Krishna (24 papers)
Citations (8)