ODIN: A Single Model for 2D and 3D Segmentation (2401.02416v3)

Published 4 Jan 2024 in cs.CV, cs.AI, cs.LG, and cs.RO

Abstract: State-of-the-art models on contemporary 3D segmentation benchmarks like ScanNet consume and label dataset-provided 3D point clouds, obtained through post processing of sensed multiview RGB-D images. They are typically trained in-domain, forego large-scale 2D pre-training and outperform alternatives that featurize the posed RGB-D multiview images instead. The gap in performance between methods that consume posed images versus post-processed 3D point clouds has fueled the belief that 2D and 3D perception require distinct model architectures. In this paper, we challenge this view and propose ODIN (Omni-Dimensional INstance segmentation), a model that can segment and label both 2D RGB images and 3D point clouds, using a transformer architecture that alternates between 2D within-view and 3D cross-view information fusion. Our model differentiates 2D and 3D feature operations through the positional encodings of the tokens involved, which capture pixel coordinates for 2D patch tokens and 3D coordinates for 3D feature tokens. ODIN achieves state-of-the-art performance on ScanNet200, Matterport3D and AI2THOR 3D instance segmentation benchmarks, and competitive performance on ScanNet, S3DIS and COCO. It outperforms all previous works by a wide margin when the sensed 3D point cloud is used in place of the point cloud sampled from 3D mesh. When used as the 3D perception engine in an instructable embodied agent architecture, it sets a new state-of-the-art on the TEACh action-from-dialogue benchmark. Our code and checkpoints can be found at the project website (https://odin-seg.github.io).


Summary

  • The paper presents a unified transformer-based model for 2D and 3D instance segmentation that achieves competitive results across multiple benchmarks.
  • The model alternates between 2D within-view and 3D cross-view fusion, distinguishing 2D and 3D tokens only through their positional encodings.
  • It supports practical applications such as serving as the perception engine of an embodied agent and operating directly on raw RGB-D sensor streams rather than pre-processed meshes.

Overview of ODIN

ODIN (Omni-Dimensional INstance segmentation) is a single transformer-based model for both 2D and 3D perception. It performs instance segmentation and labeling by fusing information within each view and across views, and it accepts either posed multiview RGB-D sequences or single RGB images, interleaving 2D and 3D fusion throughout processing.
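To make the lifting step concrete, below is a minimal sketch, assuming a pinhole camera model and metric depth, of how per-pixel 2D features can be unprojected into world-space 3D tokens. All names are illustrative, not ODIN's actual code.

```python
import torch

def unproject_features(feat_2d, depth, intrinsics, cam_to_world):
    """Lift a (C, H, W) 2D feature map to (H*W, 3+C) world-space tokens.

    feat_2d:      (C, H, W) feature map from a 2D backbone
    depth:        (H, W) metric depth for the same view
    intrinsics:   (3, 3) camera intrinsics matrix K
    cam_to_world: (4, 4) camera-to-world extrinsics (last row [0, 0, 0, 1])
    """
    C, H, W = feat_2d.shape
    v, u = torch.meshgrid(
        torch.arange(H, dtype=torch.float32),
        torch.arange(W, dtype=torch.float32),
        indexing="ij",
    )
    # Back-project each pixel (u, v) with its depth into camera coordinates.
    z = depth
    x = (u - intrinsics[0, 2]) * z / intrinsics[0, 0]
    y = (v - intrinsics[1, 2]) * z / intrinsics[1, 1]
    pts_cam = torch.stack([x, y, z, torch.ones_like(z)], dim=-1)   # (H, W, 4)
    # Transform homogeneous camera-space points into world coordinates.
    pts_world = (pts_cam.reshape(-1, 4) @ cam_to_world.T)[:, :3]   # (H*W, 3)
    # Each 3D token carries its world coordinate plus its 2D feature vector.
    tokens = torch.cat([pts_world, feat_2d.reshape(C, -1).T], dim=-1)
    return tokens  # (H*W, 3 + C)
```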

Transcending Model Boundaries

The performance gap between methods that consume posed RGB-D images and those that consume post-processed 3D point clouds has fueled the belief that 2D and 3D perception require distinct model architectures. ODIN challenges this view with a single model that handles both 2D RGB images and 3D point clouds, as demonstrated on the ScanNet200, Matterport3D, AI2THOR, ScanNet, S3DIS, and COCO benchmarks. By alternating between 2D within-view and 3D cross-view fusion, ODIN distinguishes 2D from 3D feature operations solely through the positional encodings of its tokens: pixel coordinates for 2D patch tokens and 3D coordinates for 3D feature tokens.
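As a rough illustration of how one attention stack can serve both modalities, here is a sketch in which 2D and 3D tokens differ only in their positional encodings. The Fourier-feature formulation, layer sizes, and names are assumptions, not ODIN's exact recipe.

```python
import math
import torch
import torch.nn as nn

def fourier_features(coords, num_bands=16):
    """Map (N, D) coordinates to (N, D * 2 * num_bands) sin/cos features."""
    freqs = 2.0 ** torch.arange(num_bands, dtype=coords.dtype)  # (B,)
    angles = coords.unsqueeze(-1) * freqs * math.pi             # (N, D, B)
    return torch.cat([angles.sin(), angles.cos()], dim=-1).flatten(1)

class TokenPosEncoding(nn.Module):
    """Projects 2D pixel coords or 3D world coords into one model width,
    so downstream attention layers are shared across modalities."""
    def __init__(self, dim=256, num_bands=16):
        super().__init__()
        self.proj_2d = nn.Linear(2 * 2 * num_bands, dim)  # (u, v) pixels
        self.proj_3d = nn.Linear(3 * 2 * num_bands, dim)  # (x, y, z) points
        self.num_bands = num_bands

    def forward(self, coords):
        feats = fourier_features(coords, self.num_bands)
        return self.proj_2d(feats) if coords.shape[-1] == 2 else self.proj_3d(feats)
```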

Method and Architecture

ODIN's architecture cycles between 2D fusion within individual image views and attention-based 3D fusion across views, and these alternating stages yield representations that are consistent across viewpoints. Rather than learning a separate 3D encoder, the model repurposes its 2D features as 3D tokens, so the large majority of its parameters are shared between RGB and RGB-D inputs and the strengths of pretrained 2D backbones carry over.
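The alternation itself can be sketched as follows. This is an illustrative PyTorch block under assumed shapes, not ODIN's released implementation: within-view attention treats each view as an independent batch element, while cross-view attention flattens all views into one sequence so tokens attend across the scene.

```python
import torch
import torch.nn as nn

class AlternatingFusionBlock(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.within_view = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_view = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, tokens, pos_2d, pos_3d):
        """tokens: (V, N, D) for V views of N tokens each.

        pos_2d: (V, N, D) pixel-coordinate encodings (2D stage)
        pos_3d: (V, N, D) world-coordinate encodings (3D stage)
        """
        # 2D stage: views sit on the batch axis, so each view's tokens
        # attend only to themselves, keyed by pixel-position encodings.
        q = self.norm1(tokens) + pos_2d
        tokens = tokens + self.within_view(q, q, self.norm1(tokens))[0]

        # 3D stage: flatten all views into one sequence so tokens attend
        # across views, keyed by 3D world-coordinate encodings.
        V, N, D = tokens.shape
        flat = tokens.reshape(1, V * N, D)
        q3 = self.norm2(flat) + pos_3d.reshape(1, V * N, D)
        flat = flat + self.cross_view(q3, q3, self.norm2(flat))[0]
        return flat.reshape(V, N, D)
```

Restricting the 2D stage to per-view attention is, plausibly, what lets pretrained 2D backbone weights be reused largely unchanged.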

Impact and Applications

ODIN's significance lies in both its performance and its practical implications. It sets new state-of-the-art results on several 3D instance segmentation benchmarks while remaining competitive across diverse 2D and 3D datasets. Used as the 3D perception engine of an instructable embodied agent, it establishes a new state of the art on the TEACh action-from-dialogue benchmark. Moreover, because it consumes raw sensed RGB-D data rather than pre-processed meshes, it is better suited to dynamic, responsive AI systems.

A Glimpse into the Future

The research makes a compelling case for unified 2D and 3D perception models. By serving the needs of both modalities within a single framework, ODIN points toward more integrated and capable perception systems. The released code and checkpoints give the community a concrete starting point, and prospects for improved noise resilience and cross-dataset training suggest considerable room to build on this direction.
