SceneVerse: Scaling 3D Vision-Language Learning for Grounded Scene Understanding (2401.09340v3)
Abstract: 3D vision-language grounding, which focuses on aligning language with the 3D physical environment, stands as a cornerstone in the development of embodied agents. In comparison to recent advancements in the 2D domain, grounding language in 3D scenes faces several significant challenges: (i) the inherent complexity of 3D scenes due to diverse object configurations, their rich attributes, and intricate relationships; (ii) the scarcity of paired 3D vision-language data to support grounded learning; and (iii) the absence of a unified learning framework to distill knowledge from grounded 3D data. In this work, we aim to address these three major challenges in 3D vision-language by examining the potential of systematically upscaling 3D vision-language learning in indoor environments. We introduce the first million-scale 3D vision-language dataset, SceneVerse, encompassing about 68K 3D indoor scenes and comprising 2.5M vision-language pairs derived from both human annotations and our scalable scene-graph-based generation approach. We demonstrate that this scaling allows for a unified pre-training framework, Grounded Pre-training for Scenes (GPS), for 3D vision-language learning. Through extensive experiments, we showcase the effectiveness of GPS by achieving state-of-the-art performance on all existing 3D visual grounding benchmarks. The vast potential of SceneVerse and GPS is unveiled through zero-shot transfer experiments on challenging 3D vision-language tasks. Project website: https://scene-verse.github.io.
Authors:
- Baoxiong Jia
- Yixin Chen
- Huangyue Yu
- Yan Wang
- Xuesong Niu
- Tengyu Liu
- Qing Li
- Siyuan Huang