
EmbodiedScan: A Holistic Multi-Modal 3D Perception Suite Towards Embodied AI (2312.16170v1)

Published 26 Dec 2023 in cs.CV, cs.AI, and cs.RO

Abstract: In the realm of computer vision and robotics, embodied agents are expected to explore their environment and carry out human instructions. This necessitates the ability to fully understand 3D scenes given their first-person observations and contextualize them into language for interaction. However, traditional research focuses more on scene-level input and output setups from a global view. To address the gap, we introduce EmbodiedScan, a multi-modal, ego-centric 3D perception dataset and benchmark for holistic 3D scene understanding. It encompasses over 5k scans encapsulating 1M ego-centric RGB-D views, 1M language prompts, 160k 3D-oriented boxes spanning over 760 categories, some of which partially align with LVIS, and dense semantic occupancy with 80 common categories. Building upon this database, we introduce a baseline framework named Embodied Perceptron. It is capable of processing an arbitrary number of multi-modal inputs and demonstrates remarkable 3D perception capabilities, both within the two series of benchmarks we set up, i.e., fundamental 3D perception tasks and language-grounded tasks, and in the wild. Codes, datasets, and benchmarks will be available at https://github.com/OpenRobotLab/EmbodiedScan.

Summary

  • The paper introduces a comprehensive multi-modal 3D perception suite that combines over 1M RGB-D views, language prompts, and dense 3D annotations for indoor scenes.
  • It presents the Embodied Perceptron framework, which encodes RGB-D and language inputs with separate encoders and fuses them into robust scene representations.
  • The dataset, with over 760 categories and extensive annotations, surpasses previous benchmarks in diversity and paves the way for embodied AI research on natural-language-grounded perception.

Overview of "EmbodiedScan: A Holistic Multi-Modal 3D Perception Suite Towards Embodied AI"

The paper "EmbodiedScan: A Holistic Multi-Modal 3D Perception Suite Towards Embodied AI" introduces a comprehensive dataset and benchmark for 3D perception focused on embodied AI applications in indoor environments. This work addresses critical challenges by providing extensive multi-modal data and annotations, necessary for developing embodied agents capable of understanding and interacting with their surroundings through natural language.

Dataset and Annotations

EmbodiedScan is a substantial dataset of over 5k real-scanned 3D scenes, encompassing 1M ego-centric RGB-D views, 1M language prompts, 160k 3D-oriented boxes spanning over 760 categories, and dense semantic occupancy annotations covering 80 common categories. This multi-modal dataset is designed to broaden the diversity and detail of annotations, notably surpassing prior datasets such as ScanNet and SUN RGB-D in category coverage and annotation density.

The authors employ a SAM-assisted pipeline to annotate objects with 3D bounding boxes and generate language descriptions. The dataset includes elaborate 3D scene annotations, making it a resource for training models on holistic scene understanding grounded in natural language.
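The released annotation format belongs to the dataset's own tooling, but the minimal sketch below illustrates the kind of structure involved: a 9-DoF oriented 3D box (center, size, rotation) tied to a category label and the language prompts that refer to it. The field names, the Euler-angle convention, and the `SceneAnnotation` wrapper are illustrative assumptions, not the dataset's actual schema.

```python
from dataclasses import dataclass
from typing import List

import numpy as np


@dataclass
class OrientedBox3D:
    """A 9-DoF oriented 3D box: center, size, and rotation as Euler angles."""
    center: np.ndarray  # (3,) x, y, z in metres
    size: np.ndarray    # (3,) extents along the box axes in metres
    euler: np.ndarray   # (3,) yaw, pitch, roll in radians (convention is an assumption)

    def corners(self) -> np.ndarray:
        """Return the 8 box corners in world coordinates."""
        signs = np.array([[sx, sy, sz]
                          for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)])
        local = signs * (self.size / 2.0)  # corners in the box frame
        yaw, pitch, roll = self.euler
        rz = np.array([[np.cos(yaw), -np.sin(yaw), 0],
                       [np.sin(yaw),  np.cos(yaw), 0],
                       [0, 0, 1]])
        ry = np.array([[np.cos(pitch), 0, np.sin(pitch)],
                       [0, 1, 0],
                       [-np.sin(pitch), 0, np.cos(pitch)]])
        rx = np.array([[1, 0, 0],
                       [0, np.cos(roll), -np.sin(roll)],
                       [0, np.sin(roll),  np.cos(roll)]])
        rot = rz @ ry @ rx  # rotate box axes into the world frame (ZYX order assumed)
        return local @ rot.T + self.center


@dataclass
class SceneAnnotation:
    """One annotated object with the language prompts that ground it."""
    category: str        # one of the ~760 object categories
    box: OrientedBox3D
    prompts: List[str]   # natural-language descriptions referring to this object
```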

Proposed Framework: Embodied Perceptron

To evaluate the dataset, the authors present a baseline framework named Embodied Perceptron. The framework processes multiple forms of input, including ego-centric RGB-D sequences and language prompts. Leveraging separate encoders for each modality, Embodied Perceptron applies both dense and sparse fusion to produce robust 3D scene representations for a range of perception tasks.

The framework supports continuous scene perception, multi-view analysis, and language-grounded understanding, showcasing flexibility in handling varying input complexities and configurations.
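As a rough illustration of the "separate encoders per modality, then fuse" pattern described above, the sketch below wires toy image, point-cloud, and text encoders into a single fused feature in PyTorch. It is not the Embodied Perceptron architecture: the module choices, feature dimension, vocabulary size, and concatenation-based fusion are placeholder assumptions.

```python
import torch
import torch.nn as nn


class MultiModalFusion(nn.Module):
    """Toy multi-modal fusion: one encoder per modality, then a joint projection."""

    def __init__(self, feat_dim: int = 256):
        super().__init__()
        # Image branch: a small CNN standing in for a 2D backbone (e.g. a ResNet).
        self.rgb_encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Point branch: a per-point MLP with max pooling standing in for a sparse 3D encoder.
        self.point_encoder = nn.Sequential(
            nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, feat_dim),
        )
        # Text branch: an embedding bag standing in for a pretrained language model.
        self.text_encoder = nn.EmbeddingBag(num_embeddings=30522, embedding_dim=feat_dim)
        # Fusion: concatenate the three global features and project them jointly.
        self.fusion = nn.Sequential(
            nn.Linear(3 * feat_dim, feat_dim), nn.ReLU(), nn.Linear(feat_dim, feat_dim),
        )

    def forward(self, rgb, points, token_ids):
        # rgb: (B, 3, H, W); points: (B, N, 3); token_ids: (B, T) integer tokens.
        f_rgb = self.rgb_encoder(rgb)                          # (B, D)
        f_pts = self.point_encoder(points).max(dim=1).values   # (B, D) global point feature
        f_txt = self.text_encoder(token_ids)                   # (B, D)
        return self.fusion(torch.cat([f_rgb, f_pts, f_txt], dim=-1))


# Example usage with random tensors.
model = MultiModalFusion()
scene_feat = model(torch.rand(2, 3, 64, 64),
                   torch.rand(2, 1024, 3),
                   torch.randint(0, 30522, (2, 12)))
print(scene_feat.shape)  # torch.Size([2, 256])
```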

Benchmarks and Results

The paper establishes two series of benchmarks built on the dataset:

  1. Fundamental 3D Perception Tasks: These benchmarks cover traditional tasks such as 3D detection and semantic occupancy prediction under different input conditions. The reported results show clear gains from combining depth and RGB inputs, with significant improvements over baselines (a sketch of the occupancy metric follows this list).
  2. Language-Grounded Scene Understanding: This benchmark evaluates how 3D perception outputs are grounded in natural language inputs. The results illustrate the potential of integrating vision and language, although challenges remain, particularly with complex prompts.
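To make the occupancy side of the first benchmark concrete, the sketch below computes the standard mean intersection-over-union (mIoU) over semantic classes on a voxel grid. The grid shape, ignore label, and handling of absent classes are illustrative assumptions, not the benchmark's exact evaluation protocol.

```python
import numpy as np


def occupancy_miou(pred: np.ndarray, gt: np.ndarray, num_classes: int,
                   ignore_index: int = 255) -> float:
    """Mean IoU over semantic classes for voxel-wise occupancy prediction.

    pred, gt: integer arrays of shape (X, Y, Z) with one class id per voxel.
    Voxels labelled `ignore_index` in the ground truth are excluded.
    """
    valid = gt != ignore_index
    ious = []
    for c in range(num_classes):
        pred_c = (pred == c) & valid
        gt_c = (gt == c) & valid
        union = np.logical_or(pred_c, gt_c).sum()
        if union == 0:
            continue  # class absent from both prediction and ground truth
        inter = np.logical_and(pred_c, gt_c).sum()
        ious.append(inter / union)
    return float(np.mean(ious)) if ious else 0.0


# Tiny synthetic example: a 4x4x4 grid with 3 semantic classes.
rng = np.random.default_rng(0)
gt = rng.integers(0, 3, size=(4, 4, 4))
pred = gt.copy()
pred[0, 0, 0] = (pred[0, 0, 0] + 1) % 3  # introduce one error
print(round(occupancy_miou(pred, gt, num_classes=3), 3))
```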

Implications and Future Directions

EmbodiedScan sets a new standard for datasets intended for indoor scene understanding, especially in the field of embodied AI. The richness and diversity of the dataset provide a robust foundation for exploring how embodied agents perceive and understand complex environments.

This work has significant implications for developing embodied AI that can interact with human environments through language. Future research could improve 3D perception accuracy, refine the integration of multi-modal data, and extend the dataset to more complex interaction scenarios.

Moreover, the findings suggest promising directions for enhancing scene reconstruction and understanding through advanced perceptual models, paving the way for sophisticated, real-world embodied AI systems.

In summary, "EmbodiedScan" represents a significant contribution to the field of embodied AI, offering both resources and benchmarks that are likely to catalyze further innovations in 3D perception and human-computer interaction.
