Text2Loc: 3D Point Cloud Localization from Natural Language (2311.15977v2)
Abstract: We tackle the problem of 3D point cloud localization based on a few natural language descriptions and introduce a novel neural network, Text2Loc, that fully interprets the semantic relationship between points and text. Text2Loc follows a coarse-to-fine localization pipeline: text-submap global place recognition, followed by fine localization. In global place recognition, the relational dynamics among the textual hints are captured by a hierarchical transformer with max-pooling (HTM), while text-submap contrastive learning maintains a balance between positive and negative pairs. Moreover, we propose a novel matching-free fine localization method to further refine the location predictions; it completely removes the need for complicated text-instance matching and is lighter, faster, and more accurate than previous methods. Extensive experiments show that Text2Loc improves the localization accuracy by up to $2\times$ over the state-of-the-art on the KITTI360Pose dataset. Our project page is publicly available at https://yan-xia.github.io/projects/text2loc/.
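The coarse stage described in the abstract can be illustrated with a minimal sketch. It assumes a CLIP-style symmetric contrastive objective over a batch of matched text and submap embeddings, followed by cosine-similarity retrieval of the top-k submaps. The abstract does not state the exact loss or retrieval procedure, so the function names (`symmetric_contrastive_loss`, `retrieve_top_k`), the temperature value, and the toy embeddings are all hypothetical and are not taken from the Text2Loc code.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def symmetric_contrastive_loss(text_emb, submap_emb, temperature=0.1):
    """Assumed CLIP-style loss: text_emb[i] and submap_emb[i] form the
    positive pair; every other pairing in the batch acts as a negative."""
    t = [l2_normalize(v) for v in text_emb]
    s = [l2_normalize(v) for v in submap_emb]
    n = len(t)
    # Cosine similarities scaled by the (assumed) temperature.
    logits = [
        [sum(a * b for a, b in zip(t[i], s[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Text-to-submap direction: each row should peak on its diagonal entry.
    t2s = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Submap-to-text direction: same criterion applied to the columns.
    cols = [[logits[i][j] for i in range(n)] for j in range(n)]
    s2t = -sum(log_softmax(cols[j])[j] for j in range(n)) / n
    return 0.5 * (t2s + s2t)


def retrieve_top_k(query_emb, submap_embs, k=3):
    """Coarse retrieval: indices of the k submaps most similar to the query."""
    q = l2_normalize(query_emb)
    scores = [
        sum(a * b for a, b in zip(q, l2_normalize(s))) for s in submap_embs
    ]
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return order[:k]


if __name__ == "__main__":
    # Toy 2-D embeddings, invented purely for illustration.
    texts = [[1.0, 0.0], [0.0, 1.0]]
    submaps = [[1.0, 0.0], [0.0, 1.0]]
    print(symmetric_contrastive_loss(texts, submaps))
    print(retrieve_top_k([1.0, 0.1], [[0.0, 1.0], [1.0, 0.0], [-1.0, 0.0]], k=2))
```

In a pipeline of this shape, the retrieved candidate submaps would then be passed to the fine localization stage, which the abstract describes as matching-free; that stage is not sketched here because the abstract gives no detail on how it is built.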
Authors: Yan Xia, Letian Shi, Zifeng Ding, João F. Henriques, Daniel Cremers