
SceneVerse: Scaling 3D Vision-Language Learning for Grounded Scene Understanding (2401.09340v3)

Published 17 Jan 2024 in cs.CV, cs.AI, cs.CL, cs.LG, and cs.RO

Abstract: 3D vision-language grounding, which focuses on aligning language with the 3D physical environment, stands as a cornerstone in the development of embodied agents. In comparison to recent advancements in the 2D domain, grounding language in 3D scenes faces several significant challenges: (i) the inherent complexity of 3D scenes due to the diverse object configurations, their rich attributes, and intricate relationships; (ii) the scarcity of paired 3D vision-language data to support grounded learning; and (iii) the absence of a unified learning framework to distill knowledge from grounded 3D data. In this work, we aim to address these three major challenges in 3D vision-language by examining the potential of systematically upscaling 3D vision-language learning in indoor environments. We introduce the first million-scale 3D vision-language dataset, SceneVerse, encompassing about 68K 3D indoor scenes and comprising 2.5M vision-language pairs derived from both human annotations and our scalable scene-graph-based generation approach. We demonstrate that this scaling allows for a unified pre-training framework, Grounded Pre-training for Scenes (GPS), for 3D vision-language learning. Through extensive experiments, we showcase the effectiveness of GPS by achieving state-of-the-art performance on all existing 3D visual grounding benchmarks. The vast potential of SceneVerse and GPS is unveiled through zero-shot transfer experiments in the challenging 3D vision-language tasks. Project website: https://scene-verse.github.io.

Authors (8)
  1. Baoxiong Jia
  2. Yixin Chen
  3. Huangyue Yu
  4. Yan Wang
  5. Xuesong Niu
  6. Tengyu Liu
  7. Qing Li
  8. Siyuan Huang
Citations (28)

Summary

Overview of SceneVerse and GPS Framework

Embodied AI, which couples 3D spatial understanding with natural language, is central to building robots and systems that can navigate and interact in real-world spaces. Aligning language with 3D physical environments remains difficult, however, owing to the complexity of 3D scenes and the scarcity of paired 3D vision-language data. To address these challenges, the paper introduces SceneVerse, a million-scale dataset for 3D vision-language learning, together with a unified pre-training framework, Grounded Pre-training for Scenes (GPS).

The Problem with 3D Vision-Language Grounding

Grounding language in 3D environments is harder than in 2D for several reasons: objects carry rich attributes, appear in diverse configurations, and stand in intricate spatial relationships, all of which complicate scene understanding. In addition, paired 3D vision-language data for training is scarce, and no unified framework previously existed for distilling knowledge from grounded 3D data.

Introducing SceneVerse and GPS

SceneVerse is the first million-scale dataset for 3D vision-language learning, covering 68,406 indoor scenes paired with 2.5 million scene-language pairs. This scale is achieved by combining human annotations with a scalable pipeline that automatically generates scene descriptions from 3D scene graphs with the help of templates and LLMs, as sketched below.
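
To make the generation idea concrete, here is a minimal Python sketch of a scene-graph-to-description pipeline. The relation vocabulary, the templates, and the LLM rewriting prompt are illustrative assumptions, not the paper's exact implementation; the paper's pipeline is described only at a high level here.

```python
# Sketch of scene-graph-based description generation, in the spirit of
# SceneVerse's automated pipeline. Templates and relations are hypothetical.
from dataclasses import dataclass

@dataclass
class SceneGraphEdge:
    subject: str   # e.g. "chair_3"
    relation: str  # e.g. "next to"
    obj: str       # e.g. "table_1"

# Hypothetical templates keyed by spatial relation.
TEMPLATES = {
    "next to": "The {subj} is next to the {obj}.",
    "on top of": "The {subj} is on top of the {obj}.",
    "facing": "The {subj} is facing the {obj}.",
}

def label(node_id: str) -> str:
    """Strip the instance index, e.g. 'chair_3' -> 'chair'."""
    return node_id.rsplit("_", 1)[0]

def edges_to_sentences(edges: list[SceneGraphEdge]) -> list[str]:
    """Turn scene-graph triples into stilted template sentences."""
    return [
        TEMPLATES[e.relation].format(subj=label(e.subject), obj=label(e.obj))
        for e in edges
        if e.relation in TEMPLATES
    ]

def build_llm_prompt(sentences: list[str]) -> str:
    """Compose a rewriting prompt; an LLM would then turn the template
    sentences into a fluent, natural scene description."""
    facts = "\n".join(f"- {s}" for s in sentences)
    return (
        "Rewrite the following facts about a 3D indoor scene into one "
        f"fluent paragraph:\n{facts}"
    )

if __name__ == "__main__":
    graph = [
        SceneGraphEdge("chair_3", "next to", "table_1"),
        SceneGraphEdge("lamp_2", "on top of", "table_1"),
    ]
    print(build_llm_prompt(edges_to_sentences(graph)))
```

Because the templated sentences are mechanically derived from the scene graph, the descriptions stay grounded in the geometry while the LLM step only improves fluency, which is what makes this kind of generation scalable.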

Exploring GPS Capabilities through Extensive Experiments

Alongside SceneVerse, the authors present GPS, a model trained with multi-level contrastive alignment between 3D scenes and text. Unlike prior approaches, GPS does not rely on auxiliary task-specific modules; it keeps the training recipe simple yet achieves state-of-the-art results on all existing 3D visual grounding benchmarks. Its zero-shot generalization across varied 3D vision-language tasks further attests to the effectiveness of the underlying data and framework. A sketch of the contrastive objective follows.
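
The sketch below shows a symmetric InfoNCE loss over paired scene and text embeddings, the standard form of contrastive alignment. The encoder choices, embedding size, and temperature are illustrative assumptions; GPS additionally applies such alignment at multiple levels rather than only at the scene level shown here.

```python
# Minimal PyTorch sketch of contrastive scene-text alignment in the spirit
# of GPS. Hyperparameters and the single-level setup are assumptions.
import torch
import torch.nn.functional as F

def info_nce(scene_emb: torch.Tensor, text_emb: torch.Tensor,
             temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of paired scene/text embeddings.

    scene_emb, text_emb: (B, D) tensors; matched pairs share a row index.
    """
    scene_emb = F.normalize(scene_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = scene_emb @ text_emb.t() / temperature  # (B, B) similarities
    targets = torch.arange(scene_emb.size(0), device=scene_emb.device)
    # Pull matched pairs together and push apart mismatched ones,
    # in both the scene-to-text and text-to-scene directions.
    loss_s2t = F.cross_entropy(logits, targets)
    loss_t2s = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_s2t + loss_t2s)

if __name__ == "__main__":
    B, D = 8, 256
    scene = torch.randn(B, D)  # stand-in for a 3D scene encoder's output
    text = torch.randn(B, D)   # stand-in for a text encoder's output
    print(info_nce(scene, text).item())
```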

Potential Exposed Through Data Scaling and Model Generalization

A series of experiments shows that GPS improves consistently as training data scales up, indicating a strong relationship between data volume and model performance. Moreover, GPS transfers knowledge learned from SceneVerse to unseen scenarios in a zero-shot manner, which highlights both the model's generality and the dataset's breadth, and positions SceneVerse as a rich training ground for future research in 3D vision-language tasks.
