LIP-Loc: LiDAR Image Pretraining for Cross-Modal Localization (2312.16648v1)
Abstract: Global visual localization in LiDAR maps, crucial for autonomous driving applications, remains largely unexplored because of the difficulty of bridging the cross-modal heterogeneity gap. Contrastive Language-Image Pre-Training (CLIP), a popular multi-modal learning approach, popularized a symmetric contrastive loss with batch construction by applying it to the text and image domains. We apply this approach to 2D images and 3D LiDAR points for the task of cross-modal localization. Our method works as follows: a batch of N (image, LiDAR) pairs is constructed, and an image encoder and a LiDAR encoder are jointly trained to learn a multi-modal embedding space by predicting which of the N × N possible pairings across the batch are the correct matches. In this way, the cosine similarity of the N positive pairings is maximized, whereas that of the remaining N² − N negative pairings is minimized. Finally, a symmetric cross-entropy loss is optimized over the resulting similarity scores. To the best of our knowledge, this is the first work to apply a batched loss approach to a cross-modal setting of image and LiDAR data, and also the first to show zero-shot transfer in a visual localization setting. We conduct extensive analyses on standard autonomous driving datasets, namely KITTI and KITTI-360. Our method outperforms the state-of-the-art recall@1 accuracy on the KITTI-360 dataset by 22.4% using only perspective images, whereas the state-of-the-art approach uses the more informative fisheye images. This performance is achieved without resorting to complex architectures. Moreover, we demonstrate the zero-shot capabilities of our model, beating the state of the art by 8% without training on the evaluation dataset. Furthermore, we establish the first benchmark for cross-modal localization on the KITTI dataset.
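The batched objective described in the abstract can be illustrated with a minimal, dependency-free sketch. This is not the authors' implementation: the function names, the toy 2D embeddings, and the temperature value of 0.07 (borrowed from common CLIP-style setups) are all assumptions made for illustration, and the encoders are replaced by hand-written embedding vectors.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def similarity_matrix(image_emb, lidar_emb, temperature=0.07):
    """Return the N x N matrix of temperature-scaled cosine similarities.

    Row i holds the similarity of image i to every LiDAR embedding.
    The temperature is an assumed hyperparameter, not taken from the abstract.
    """
    images = [l2_normalize(v) for v in image_emb]
    scans = [l2_normalize(v) for v in lidar_emb]
    return [
        [sum(a * b for a, b in zip(img, scan)) / temperature for scan in scans]
        for img in images
    ]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(image_emb, lidar_emb, temperature=0.07):
    """Average of the image-to-LiDAR and LiDAR-to-image cross-entropy losses."""
    logits = similarity_matrix(image_emb, lidar_emb, temperature)
    transposed = [list(col) for col in zip(*logits)]
    loss_image_to_lidar = cross_entropy_diagonal(logits)
    loss_lidar_to_image = cross_entropy_diagonal(transposed)
    return 0.5 * (loss_image_to_lidar + loss_lidar_to_image)


if __name__ == "__main__":
    # Toy batch of N = 3 pairs; real embeddings would come from the two encoders.
    image_emb = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
    aligned = [[0.9, 0.1], [0.1, 0.9], [-0.9, 0.1]]
    shuffled = [aligned[1], aligned[2], aligned[0]]
    print("aligned pairs: ", symmetric_contrastive_loss(image_emb, aligned))
    print("shuffled pairs:", symmetric_contrastive_loss(image_emb, shuffled))
```

In this toy batch the loss is small when each image embedding points in roughly the same direction as its paired LiDAR embedding and large when the pairing is shuffled, which is the behavior the N positive versus N² − N negative pairings are meant to produce. In a real training loop the same computation would typically run on encoder outputs inside an automatic-differentiation framework.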
Authors:
- Sai Shubodh Puligilla
- Mohammad Omama
- Husain Zaidi
- Udit Singh Parihar
- Madhava Krishna