
EmbodiedScan: A Holistic Multi-Modal 3D Perception Suite Towards Embodied AI (2312.16170v1)

Published 26 Dec 2023 in cs.CV, cs.AI, and cs.RO

Abstract: In the realm of computer vision and robotics, embodied agents are expected to explore their environment and carry out human instructions. This necessitates the ability to fully understand 3D scenes given their first-person observations and contextualize them into language for interaction. However, traditional research focuses more on scene-level input and output setups from a global view. To address the gap, we introduce EmbodiedScan, a multi-modal, ego-centric 3D perception dataset and benchmark for holistic 3D scene understanding. It encompasses over 5k scans encapsulating 1M ego-centric RGB-D views, 1M language prompts, 160k 3D-oriented boxes spanning over 760 categories, some of which partially align with LVIS, and dense semantic occupancy with 80 common categories. Building upon this database, we introduce a baseline framework named Embodied Perceptron. It is capable of processing an arbitrary number of multi-modal inputs and demonstrates remarkable 3D perception capabilities, both within the two series of benchmarks we set up, i.e., fundamental 3D perception tasks and language-grounded tasks, and in the wild. Codes, datasets, and benchmarks will be available at https://github.com/OpenRobotLab/EmbodiedScan.

Summary

  • The paper introduces a comprehensive multi-modal 3D perception suite that combines over 1M RGB-D views, language prompts, and dense 3D annotations for indoor scenes.
  • It presents the Embodied Perceptron framework, which encodes RGB-D and language inputs with separate encoders and fuses them into robust scene representations.
  • The dataset, with over 760 categories and extensive annotations, surpasses previous benchmarks in diversity and paves the way for embodied AI research on natural-language-grounded perception.

Overview of "EmbodiedScan: A Holistic Multi-Modal 3D Perception Suite Towards Embodied AI"

The paper "EmbodiedScan: A Holistic Multi-Modal 3D Perception Suite Towards Embodied AI" introduces a comprehensive dataset and benchmark for 3D perception focused on embodied AI applications in indoor environments. This work addresses critical challenges by providing extensive multi-modal data and annotations, necessary for developing embodied agents capable of understanding and interacting with their surroundings through natural language.

Dataset and Annotations

EmbodiedScan is a substantial dataset of over 5k real-scanned 3D scenes, encompassing 1M ego-centric RGB-D views, 1M language prompts, 160k 3D-oriented boxes spanning over 760 categories, and dense semantic occupancy annotations covering 80 common categories. This multi-modal dataset is designed to broaden the diversity and detail of annotations, notably surpassing prior datasets such as ScanNet and SUN RGB-D in category coverage and annotation density.

The authors employ a SAM-assisted pipeline to annotate objects with 3D bounding boxes and generate language descriptions. The dataset includes elaborate 3D scene annotations, making it a resource for training models on holistic scene understanding grounded in natural language.
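The released annotation format belongs to the dataset's own tooling, but the minimal sketch below illustrates the kind of structure involved: a 9-DoF oriented 3D box (center, size, rotation) tied to a category label and the language prompts that refer to it. The field names, the Euler-angle convention, and the `SceneAnnotation` wrapper are illustrative assumptions, not the dataset's actual schema.

```python
from dataclasses import dataclass
from typing import List

import numpy as np


@dataclass
class OrientedBox3D:
    """A 9-DoF oriented 3D box: center, size, and rotation as Euler angles."""
    center: np.ndarray  # (3,) x, y, z in metres
    size: np.ndarray    # (3,) extents along the box axes in metres
    euler: np.ndarray   # (3,) yaw, pitch, roll in radians (convention is an assumption)

    def corners(self) -> np.ndarray:
        """Return the 8 box corners in world coordinates."""
        signs = np.array([[sx, sy, sz]
                          for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)])
        local = signs * (self.size / 2.0)  # corners in the box frame
        yaw, pitch, roll = self.euler
        rz = np.array([[np.cos(yaw), -np.sin(yaw), 0],
                       [np.sin(yaw),  np.cos(yaw), 0],
                       [0, 0, 1]])
        ry = np.array([[np.cos(pitch), 0, np.sin(pitch)],
                       [0, 1, 0],
                       [-np.sin(pitch), 0, np.cos(pitch)]])
        rx = np.array([[1, 0, 0],
                       [0, np.cos(roll), -np.sin(roll)],
                       [0, np.sin(roll),  np.cos(roll)]])
        rot = rz @ ry @ rx  # rotate box axes into the world frame (ZYX order assumed)
        return local @ rot.T + self.center


@dataclass
class SceneAnnotation:
    """One annotated object with the language prompts that ground it."""
    category: str        # one of the ~760 object categories
    box: OrientedBox3D
    prompts: List[str]   # natural-language descriptions referring to this object
```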

Proposed Framework: Embodied Perceptron

To evaluate the dataset, the authors present a baseline framework named Embodied Perceptron. The framework processes multiple forms of input, including ego-centric RGB-D sequences and language prompts. Leveraging separate encoders for each modality, Embodied Perceptron applies both dense and sparse fusion to produce robust 3D scene representations for a range of perception tasks.

The framework supports continuous scene perception, multi-view analysis, and language-grounded understanding, showcasing flexibility in handling varying input complexities and configurations.
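As a rough illustration of the "separate encoders per modality, then fuse" pattern described above, the sketch below wires toy image, point-cloud, and text encoders into a single fused feature in PyTorch. It is not the Embodied Perceptron architecture: the module choices, feature dimension, vocabulary size, and concatenation-based fusion are placeholder assumptions.

```python
import torch
import torch.nn as nn


class MultiModalFusion(nn.Module):
    """Toy multi-modal fusion: one encoder per modality, then a joint projection."""

    def __init__(self, feat_dim: int = 256):
        super().__init__()
        # Image branch: a small CNN standing in for a 2D backbone (e.g. a ResNet).
        self.rgb_encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Point branch: a per-point MLP with max pooling standing in for a sparse 3D encoder.
        self.point_encoder = nn.Sequential(
            nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, feat_dim),
        )
        # Text branch: an embedding bag standing in for a pretrained language model.
        self.text_encoder = nn.EmbeddingBag(num_embeddings=30522, embedding_dim=feat_dim)
        # Fusion: concatenate the three global features and project them jointly.
        self.fusion = nn.Sequential(
            nn.Linear(3 * feat_dim, feat_dim), nn.ReLU(), nn.Linear(feat_dim, feat_dim),
        )

    def forward(self, rgb, points, token_ids):
        # rgb: (B, 3, H, W); points: (B, N, 3); token_ids: (B, T) integer tokens.
        f_rgb = self.rgb_encoder(rgb)                          # (B, D)
        f_pts = self.point_encoder(points).max(dim=1).values   # (B, D) global point feature
        f_txt = self.text_encoder(token_ids)                   # (B, D)
        return self.fusion(torch.cat([f_rgb, f_pts, f_txt], dim=-1))


# Example usage with random tensors.
model = MultiModalFusion()
scene_feat = model(torch.rand(2, 3, 64, 64),
                   torch.rand(2, 1024, 3),
                   torch.randint(0, 30522, (2, 12)))
print(scene_feat.shape)  # torch.Size([2, 256])
```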

Benchmarks and Results

The paper establishes two series of benchmarks built on the dataset:

  1. Fundamental 3D Perception Tasks: These benchmarks cover traditional tasks such as 3D detection and semantic occupancy prediction under different input conditions. The reported results show clear gains from combining depth and RGB inputs, with significant improvements over baselines (a sketch of the occupancy metric follows this list).
  2. Language-Grounded Scene Understanding: This benchmark evaluates how 3D perception outputs are grounded in natural language inputs. The results illustrate the potential of integrating vision and language, although challenges remain, particularly with complex prompts.
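To make the occupancy side of the first benchmark concrete, the sketch below computes the standard mean intersection-over-union (mIoU) over semantic classes on a voxel grid. The grid shape, ignore label, and handling of absent classes are illustrative assumptions, not the benchmark's exact evaluation protocol.

```python
import numpy as np


def occupancy_miou(pred: np.ndarray, gt: np.ndarray, num_classes: int,
                   ignore_index: int = 255) -> float:
    """Mean IoU over semantic classes for voxel-wise occupancy prediction.

    pred, gt: integer arrays of shape (X, Y, Z) with one class id per voxel.
    Voxels labelled `ignore_index` in the ground truth are excluded.
    """
    valid = gt != ignore_index
    ious = []
    for c in range(num_classes):
        pred_c = (pred == c) & valid
        gt_c = (gt == c) & valid
        union = np.logical_or(pred_c, gt_c).sum()
        if union == 0:
            continue  # class absent from both prediction and ground truth
        inter = np.logical_and(pred_c, gt_c).sum()
        ious.append(inter / union)
    return float(np.mean(ious)) if ious else 0.0


# Tiny synthetic example: a 4x4x4 grid with 3 semantic classes.
rng = np.random.default_rng(0)
gt = rng.integers(0, 3, size=(4, 4, 4))
pred = gt.copy()
pred[0, 0, 0] = (pred[0, 0, 0] + 1) % 3  # introduce one error
print(round(occupancy_miou(pred, gt, num_classes=3), 3))
```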

Implications and Future Directions

EmbodiedScan sets a new standard for datasets intended for indoor scene understanding, especially in the field of embodied AI. The richness and diversity of the dataset provide a robust foundation for exploring how embodied agents perceive and understand complex environments.

This work has significant implications for developing embodied AI that can interact with human environments through language. Future research could improve 3D perception accuracy, refine the integration of multi-modal data, and extend the dataset to more complex interaction scenarios.

Moreover, the findings suggest promising directions for enhancing scene reconstruction and understanding through advanced perceptual models, paving the way for sophisticated, real-world embodied AI systems.

In summary, "EmbodiedScan" represents a significant contribution to the field of embodied AI, offering both resources and benchmarks that are likely to catalyze further innovations in 3D perception and human-computer interaction.
