
Text2Loc: 3D Point Cloud Localization from Natural Language (2311.15977v2)

Published 27 Nov 2023 in cs.CV

Abstract: We tackle the problem of 3D point cloud localization based on a few natural linguistic descriptions and introduce a novel neural network, Text2Loc, that fully interprets the semantic relationship between points and text. Text2Loc follows a coarse-to-fine localization pipeline: text-submap global place recognition, followed by fine localization. In global place recognition, relational dynamics among each textual hint are captured in a hierarchical transformer with max-pooling (HTM), whereas a balance between positive and negative pairs is maintained using text-submap contrastive learning. Moreover, we propose a novel matching-free fine localization method to further refine the location predictions, which completely removes the need for complicated text-instance matching and is lighter, faster, and more accurate than previous methods. Extensive experiments show that Text2Loc improves the localization accuracy by up to $2\times$ over the state-of-the-art on the KITTI360Pose dataset. Our project page is publicly available at https://yan-xia.github.io/projects/text2loc/.
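The coarse-to-fine pipeline described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes that text and submap embeddings already exist as fixed-length vectors (the HTM encoder and the point cloud encoder are not modeled), that coarse retrieval ranks submaps by cosine similarity, that fine localization adds a regressed 2D offset to the retrieved submap's center, and that the contrastive objective is a symmetric cross-entropy over a batch of matched pairs. The function names `retrieve_top_k`, `refine_position`, and `contrastive_loss`, and the `temperature` parameter, are hypothetical and chosen only for illustration.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def retrieve_top_k(text_emb, submap_embs, k=3):
    """Coarse stage (assumed form): rank submaps by cosine similarity to the text."""
    sims = l2_normalize(submap_embs) @ l2_normalize(text_emb)
    order = np.argsort(-sims)[:k]  # indices of the k most similar submaps
    return order, sims[order]


def refine_position(submap_center, predicted_offset):
    """Fine stage (assumed form): shift the submap center by a regressed offset.

    No text-to-instance matching is involved; the offset is taken as given,
    standing in for the output of some learned regressor.
    """
    center = np.asarray(submap_center, dtype=float)
    offset = np.asarray(predicted_offset, dtype=float)
    return center + offset


def contrastive_loss(text_embs, submap_embs, temperature=0.1):
    """Symmetric cross-entropy over matched text/submap pairs (an assumption)."""
    t = l2_normalize(text_embs)
    s = l2_normalize(submap_embs)
    logits = (t @ s.T) / temperature  # shape (B, B); diagonal holds positives

    def cross_entropy_diag(z):
        z = z - z.max(axis=1, keepdims=True)  # subtract row max for stability
        log_prob = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    # Average the text-to-submap and submap-to-text directions.
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits.T))
```

In this sketch a query would first call `retrieve_top_k` to pick candidate submaps, then call `refine_position` on each candidate's center with an offset produced by a separate regressor. During training, `contrastive_loss` is lower when each text embedding is closest to its own submap embedding than when the pairs are misaligned.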

Authors (5)
  1. Yan Xia (170 papers)
  2. Letian Shi (4 papers)
  3. Zifeng Ding (26 papers)
  4. João F. Henriques (55 papers)
  5. Daniel Cremers (274 papers)
Citations (15)
