MMScan: A Comprehensive Multi-Modal 3D Scene Dataset with Hierarchical Grounded Language Annotations
The paper "MMScan: A Multi-Modal 3D Scene Dataset with Hierarchical Grounded Language Annotations" introduces an extensive multi-modal 3D scene dataset designed to address the limitations of existing datasets, which primarily focus on object properties and inter-object spatial relationships. By incorporating hierarchical grounded language annotations, the MMScan dataset facilitates a more holistic understanding of spatial, attribute, and relational aspects of 3D scenes. This essay provides an expert overview of the dataset construction, the methodology for annotations, the benchmarks established using this dataset, its evaluation, and future research implications.
Construction and Meta-Annotations
MMScan is built upon real-scanned 3D data from the EmbodiedScan dataset. It comprises 1.4 million meta-annotated captions covering 109k objects and 7.7k regions drawn from 5.2k 3D scenes. The meta-annotations are generated with a top-down logic that spans the region level, the object level, and their inter-relationships, ensuring comprehensive coverage of spatial and attribute information.
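To make this hierarchy concrete, the sketch below models scene-, region-, and object-level annotations with explicit links between levels. The class and field names are hypothetical and intended only to mirror the structure described in the paper, not the released data format.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ObjectAnno:
    """Object-level meta-annotation (hypothetical schema)."""
    object_id: str
    category: str
    caption: str        # shape, pose, material, functionality, placement
    region_id: str      # link to the enclosing region


@dataclass
class RegionAnno:
    """Region-level meta-annotation (hypothetical schema)."""
    region_id: str
    caption: str        # architectural elements, overall layout
    object_ids: List[str] = field(default_factory=list)
    # object_id -> textual relation between the object and the region
    object_region_relations: Dict[str, str] = field(default_factory=dict)


@dataclass
class SceneAnno:
    """Scene-level container tying regions and objects together."""
    scene_id: str
    regions: List[RegionAnno] = field(default_factory=list)
    objects: List[ObjectAnno] = field(default_factory=list)
```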
The annotation process combines vision-language models (VLMs) with human correction to produce accurate and natural language descriptions. For object-level annotations, optimal views are selected using image-quality metrics and the projection of visible surface points. VLMs such as GPT-4V, Qwen-VL-Max, and InternVL-Chat are employed to describe each object's shape, pose, material, category, functionality, and placement. Region-level annotations are further enriched with architectural elements and object-region relationships.
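The paper does not publish its view-selection code, but the underlying idea can be sketched as follows: each candidate camera view of an object is scored by an image-quality term combined with the fraction of the object's visible surface points that project inside the frame. The pinhole-projection helper, the weighting, and the assumption of a normalized sharpness score are illustrative choices, not the authors' implementation.

```python
import numpy as np


def project_points(points_world, extrinsic, intrinsic, img_w, img_h):
    """Project Nx3 world points with a pinhole model; return an in-frame boolean mask."""
    n = points_world.shape[0]
    homo = np.hstack([points_world, np.ones((n, 1))])   # N x 4 homogeneous coords
    cam = (extrinsic @ homo.T).T                         # world -> camera frame
    in_front = cam[:, 2] > 1e-6                          # keep points in front of the camera
    pix = (intrinsic @ cam[:, :3].T).T                   # camera -> pixel coords
    u = pix[:, 0] / np.clip(pix[:, 2], 1e-6, None)
    v = pix[:, 1] / np.clip(pix[:, 2], 1e-6, None)
    in_frame = (u >= 0) & (u < img_w) & (v >= 0) & (v < img_h)
    return in_front & in_frame


def score_view(image_sharpness, points_world, extrinsic, intrinsic,
               img_w=640, img_h=480, w_quality=0.5):
    """Combine a normalized image-quality score with visible-point coverage (illustrative)."""
    visible = project_points(points_world, extrinsic, intrinsic, img_w, img_h)
    coverage = visible.mean()
    return w_quality * image_sharpness + (1.0 - w_quality) * coverage
```

Among a set of candidate views, the highest-scoring one would then be passed to the VLM for captioning.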
Post-Processing for Benchmarks
Based on the meticulous meta-annotations, MMScan establishes benchmarks for 3D visual grounding and question-answering. The visual grounding benchmark includes tasks like locating entities in the scene using language prompts, while the question-answering benchmark evaluates models' ability to respond to attribute and spatial queries. The dataset comprises 1.28 million visual grounding samples and 1.76 million question-answering samples, generated through systematic extraction and annotation processes involving human revisions and ChatGPT-based information summarization.
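To illustrate the two benchmark formats, the snippet below sketches what a visual grounding sample and a question-answering sample might look like; the keys and example text are hypothetical and do not reproduce the official data schema.

```python
# Hypothetical sample layouts (field names are illustrative, not the official schema).
grounding_sample = {
    "scene_id": "scene_0001",
    "text": "The wooden chair facing the window in the corner of the bedroom.",
    "target_object_ids": ["obj_042"],          # one or more ground-truth instances
}

qa_sample = {
    "scene_id": "scene_0001",
    "question": "What is the chair next to the desk made of?",
    "answers": ["wood"],
    "related_object_ids": ["obj_042", "obj_017"],
}
```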
Additionally, MMScan provides grounded scene captions generated by integrating object and region annotations into coherent scene-level descriptions with explicit grounding tokens. This aids in efficient training of grounding models and LLMs.
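A minimal sketch of how object captions could be stitched into such a grounded scene caption is shown below; the "<OBJ_k>" token format and the helper function are assumptions made purely for illustration.

```python
def build_grounded_caption(region_caption, object_annos):
    """Interleave object descriptions with explicit grounding tokens (illustrative format)."""
    parts = [region_caption]
    token_map = {}
    for k, anno in enumerate(object_annos):
        token = f"<OBJ_{k}>"
        token_map[token] = anno["object_id"]   # token -> 3D instance it grounds to
        parts.append(f"{token} {anno['caption']}")
    return " ".join(parts), token_map


caption, token_map = build_grounded_caption(
    "This bedroom region contains a bed, a desk, and a chair.",
    [{"object_id": "obj_042", "caption": "A wooden chair facing the window."}],
)
```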
Evaluation and Analysis
The paper evaluates several representative models on the MMScan benchmarks. On the 3D visual grounding benchmark, models such as ScanRefer, BUTD-DETR, ViL3DRef, and EmbodiedScan perform significantly worse than they do on prior benchmarks, reflecting the complexity and diversity of MMScan. EmbodiedScan achieves the highest performance, notably on inter-object spatial relationships, suggesting that incorporating the image modality can enhance grounding capabilities.
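Grounding performance on benchmarks of this kind is commonly reported as accuracy at a 3D IoU threshold (e.g., Acc@0.25). The sketch below computes such a metric for axis-aligned boxes; it is a generic illustration rather than the paper's actual evaluation protocol.

```python
import numpy as np


def aabb_iou(box_a, box_b):
    """IoU of two axis-aligned 3D boxes given as (xmin, ymin, zmin, xmax, ymax, zmax)."""
    lo = np.maximum(box_a[:3], box_b[:3])
    hi = np.minimum(box_a[3:], box_b[3:])
    inter = np.prod(np.clip(hi - lo, 0.0, None))
    vol_a = np.prod(box_a[3:] - box_a[:3])
    vol_b = np.prod(box_b[3:] - box_b[:3])
    return inter / (vol_a + vol_b - inter + 1e-9)


def acc_at_iou(pred_boxes, gt_boxes, thresh=0.25):
    """Fraction of predictions whose IoU with the ground truth meets the threshold."""
    hits = [aabb_iou(np.asarray(p), np.asarray(g)) >= thresh
            for p, g in zip(pred_boxes, gt_boxes)]
    return float(np.mean(hits))
```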
The 3D question-answering benchmark reveals considerable performance improvements from fine-tuning on MMScan data, with models such as LL3DA and LEO showing notable gains in both zero-shot and fine-tuned settings. This underscores the importance of high-quality training data for improving LLMs' performance on complex scene-comprehension tasks.
Implications and Future Directions
MMScan addresses critical gaps in existing multi-modal 3D datasets by providing a comprehensive range of annotated data that enables training and evaluation of more sophisticated 3D scene understanding models. The hierarchical, grounded annotations support both fundamental tasks like visual grounding and question-answering and complex scene comprehension capabilities necessary for advanced 3D-LLMs.
Practically, MMScan will enhance the development of more capable robotic systems and embodied agents that can interact seamlessly with their surroundings by understanding both the spatial configurations and functional attributes of objects within complex scenes. The results from the benchmarks indicate that integrating multi-modal signals and incorporating hierarchical annotations significantly improve models' abilities to perform intricate 3D reasoning tasks.
Future research can explore scaling up MMScan by further increasing scene diversity and automating the annotation process to reduce reliance on human corrections. Additionally, investigating multimodal learning approaches that integrate 3D data with 2D image modalities and advanced LLMs will likely yield more robust models capable of real-world applications.
In summary, MMScan represents a substantial advancement in the creation of multi-modal 3D datasets, offering comprehensive and hierarchical language annotations that facilitate the training and evaluation of sophisticated 3D-LLMs and visual grounding models. The dataset’s robust design and scale hold significant promise for future developments in artificial intelligence and robotics.