Grounded 3D-LLM with Referent Tokens (2405.10370v2)

Published 16 May 2024 in cs.CV

Abstract: Prior studies on 3D scene understanding have primarily developed specialized models for specific tasks or required task-specific fine-tuning. In this study, we propose Grounded 3D-LLM, which explores the potential of 3D large multi-modal models (3D LMMs) to consolidate various 3D vision tasks within a unified generative framework. The model uses scene referent tokens as special noun phrases to reference 3D scenes, enabling it to handle sequences that interleave 3D and textual data. Per-task instruction-following templates are employed to ensure naturalness and diversity in translating 3D vision tasks into language formats. To facilitate the use of referent tokens in subsequent language modeling, we provide a large-scale, automatically curated grounded scene-text dataset with over 1 million phrase-to-region correspondences and introduce Contrastive Language-Scene Pre-training (CLASP) to perform phrase-level scene-text alignment using this data. Our comprehensive evaluation covers open-ended tasks like dense captioning and 3D question answering, alongside closed-ended tasks such as object detection and language grounding. Experiments across multiple 3D benchmarks reveal the leading performance and broad applicability of Grounded 3D-LLM. Code and datasets are available at https://groundedsceneLLM.github.io/grounded_3d-LLM.github.io.

Grounded 3D-LLM: A Unified Framework for 3D Scene Understanding

The paper "Grounded 3D-LLM" introduces an innovative approach to 3D scene understanding by proposing a unified generative framework. This framework leverages grounded phrase-level LLMing to consolidate various 3D vision tasks. By integrating scene referent tokens into LLMs, the model aims to perform tasks such as object detection, visualization grounding, and 3D QA without task-specific fine-tuning. I will provide a detailed overview of the methodology, the dataset generation, the empirical results, and the implications for future AI developments.

Methodology

The Grounded 3D-LLM model is constructed to address the limitations of existing 3D vision models, which are typically specialized for specific tasks. The core innovation lies in using referent tokens, denoted <ref>, to represent scene regions or object features as special noun phrases. To establish effective scene-text alignment, the paper introduces the Contrastive LAnguage-Scene Pre-training (CLASP) framework. This method:

  1. Extracts point-level embeddings through a sparse convolutional network.
  2. Employs a cross-modal interactor to couple text embeddings from BERT with visual representations.
  3. Utilizes learnable queries as proxies to connect textual phrases with raw 3D point clouds.
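
The sketch below illustrates the phrase-level contrastive alignment at the heart of CLASP under simplifying assumptions: `query_feats` stands in for the outputs of the learnable queries after the cross-modal interactor, `phrase_feats` for pooled BERT phrase embeddings, and each grounded phrase is assumed to have a single matched query. It is an illustration, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def clasp_style_contrastive_loss(query_feats, phrase_feats, phrase_to_query, temperature=0.07):
    """Illustrative phrase-level contrastive alignment (not the paper's released code).

    query_feats:     (Q, D) features of the learnable queries from the cross-modal interactor
    phrase_feats:    (P, D) pooled BERT embeddings, one per grounded noun phrase
    phrase_to_query: (P,)   index of the query matched to each phrase
    """
    q = F.normalize(query_feats, dim=-1)
    p = F.normalize(phrase_feats, dim=-1)
    logits = p @ q.t() / temperature                 # (P, Q) phrase-query similarities

    # Phrase -> query: each phrase should pick out its matched query.
    loss_p2q = F.cross_entropy(logits, phrase_to_query)

    # Query -> phrase: each matched query should pick out its phrase
    # (assumes one phrase per matched query).
    logits_q2p = logits.t()[phrase_to_query]         # (P, P)
    targets = torch.arange(p.size(0), device=p.device)
    loss_q2p = F.cross_entropy(logits_q2p, targets)
    return 0.5 * (loss_p2q + loss_q2p)

# Toy usage: 100 queries, 8 grounded phrases, 256-d features.
loss = clasp_style_contrastive_loss(
    torch.randn(100, 256), torch.randn(8, 256), torch.randint(0, 100, (8,))
)
```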

Technical enhancements like these ensure phrase-level alignment between natural language and visual scenes, which facilitates multiple downstream tasks within a unified framework. The language-modeling capability is extended using instruction templates that transform existing datasets into task-specific instructions, eliminating the need for independent detectors or task-specific tuning.
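
To make the interleaving of referent tokens concrete, the snippet below shows how a grounding sample could be cast into an instruction-following format. The template wording, field names, and query indices are hypothetical illustrations, not the paper's released templates.

```python
# Hypothetical instruction-following sample for single-object grounding.
# "<ref>" marks a scene referent token that is grounded to object queries
# (and hence to a mask or box) rather than decoded as plain text.
sample = {
    "instruction": "Locate the object described by: 'the black office chair "
                   "next to the wooden desk'. Reply with a grounded phrase.",
    "response": "It is the black office chair <ref> beside the wooden desk <ref>.",
    "ref_targets": [[12], [3]],  # illustrative query indices for the chair and the desk
}
```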

Dataset Generation

To support the proposed model, the paper presents the Grounded Scene Caption (G-SceneCap) dataset, which provides the fine-grained scene-text correspondences necessary for phrase-level grounding. The G-SceneCap dataset was generated through a pipeline that combines:

  1. Object captions derived from dense object annotations and refined using visual and textual models.
  2. Scene captions condensed using GPT-4, with spatial relationships between objects integrated programmatically.
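
A minimal sketch of how such a condensation step could be assembled is shown below; the nearest-neighbor relation heuristic, distance threshold, and prompt wording are illustrative assumptions rather than the paper's actual pipeline.

```python
import numpy as np

def nearest_neighbor_relations(centers, labels, max_dist=1.5):
    """Derive simple 'near' relations from object centers (illustrative heuristic)."""
    relations = []
    for i, ci in enumerate(centers):
        dists = np.linalg.norm(centers - ci, axis=1)
        dists[i] = np.inf
        j = int(dists.argmin())
        if dists[j] < max_dist:
            relations.append(f"{labels[i]} (obj{i}) is near {labels[j]} (obj{j})")
    return relations

def build_condensation_prompt(object_captions, relations):
    """Assemble a prompt asking an LLM to merge object captions into one scene caption."""
    lines = [f"obj{i}: {cap}" for i, cap in enumerate(object_captions)]
    return (
        "Condense the following object captions into a single grounded scene caption, "
        "keeping the objX identifiers attached to the phrases that describe them.\n"
        "Objects:\n" + "\n".join(lines) + "\nSpatial relations:\n" + "\n".join(relations)
    )

# Toy example with two annotated objects.
centers = np.array([[0.0, 0.0, 0.0], [0.8, 0.2, 0.0]])
labels = ["black office chair", "wooden desk"]
prompt = build_condensation_prompt(
    ["a black office chair with wheels", "a wooden desk with a monitor on it"],
    nearest_neighbor_relations(centers, labels),
)
```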

Apart from G-SceneCap, the model utilizes transformed existing datasets like Grounded ScanRefer and Grounded Multi3DRef for broader generalization. This extensive dataset amalgamation ensures comprehensive pre-training and evaluation coverage across multiple 3D vision tasks.

Empirical Results

Evaluations demonstrate the model's superior performance as follows:

  • Grounding Tasks: The model significantly outperforms previous discriminative and generative models on single-object and multi-object grounding, achieving 47.9% accuracy at 0.25 IoU and 44.1% at 0.5 IoU on the ScanRefer grounding task.
  • 3D QA and Captioning: The model also excels in language-oriented tasks, achieving the highest CIDEr score of 70.6 on Scan2Cap and a strong BLEU-4 score of 13.4 on ScanQA.
  • Detection: Unique among generative models, Grounded 3D-LLM supports 3D object detection, demonstrating its versatility.
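
For context, grounding accuracy at an IoU threshold (Acc@0.25 / Acc@0.5) counts a prediction as correct when its 3D box overlaps the ground-truth box by at least that threshold. A minimal axis-aligned sketch of the metric is given below; it is illustrative, not the benchmark's official evaluation code.

```python
import numpy as np

def box_iou_3d(a, b):
    """IoU of two axis-aligned 3D boxes given as (xmin, ymin, zmin, xmax, ymax, zmax)."""
    lo = np.maximum(a[:3], b[:3])
    hi = np.minimum(a[3:], b[3:])
    inter = np.prod(np.clip(hi - lo, 0, None))
    vol_a = np.prod(a[3:] - a[:3])
    vol_b = np.prod(b[3:] - b[:3])
    return inter / (vol_a + vol_b - inter)

def grounding_accuracy(preds, gts, thresh=0.25):
    """Fraction of predictions whose IoU with the ground truth meets the threshold."""
    hits = [box_iou_3d(np.asarray(p), np.asarray(g)) >= thresh for p, g in zip(preds, gts)]
    return sum(hits) / len(hits)
```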

The comparison with models such as 3D-LLM, Chat-3D, and LL3DA highlights the effectiveness of the phrase-level alignment enabled by CLASP. Ablation studies underscore the critical role of diverse datasets and fine-grained scene captions in elevating the model's performance.

Implications and Future Directions

Grounded 3D-LLM opens the pathway for creating comprehensive 3D multi-modal models that can generalize across numerous tasks without the need for specialized architectures. This unified approach is particularly relevant for applications in VR/AR, robotics, interactive embodied agents, and autonomous navigation, where multifunctional understanding and interaction with 3D environments are crucial.

Future developments may explore:

  1. Scaling the dataset to cover more diverse environments and objects, enhancing the model's robustness and adaptability.
  2. Extending the model to incorporate dynamic environments where objects and entities are in motion.
  3. Integrating more sophisticated reasoning capabilities to handle complex 3D scene interactions and higher-order question answering.

In summary, the Grounded 3D-LLM paper offers a significant advancement in the integration of language and 3D visual data, providing a versatile framework that bridges multiple vision tasks seamlessly. The implications for AI and robotics are profound, marking a step forward in creating truly intelligent multi-modal systems capable of understanding and interacting with complex 3D environments.

Authors (8)
  1. Yilun Chen
  2. Shuai Yang
  3. Haifeng Huang
  4. Tai Wang
  5. Ruiyuan Lyu
  6. Runsen Xu
  7. Dahua Lin
  8. Jiangmiao Pang