ODIN: A Single Model for 2D and 3D Segmentation (2401.02416v3)

Published 4 Jan 2024 in cs.CV, cs.AI, cs.LG, and cs.RO

Abstract: State-of-the-art models on contemporary 3D segmentation benchmarks like ScanNet consume and label dataset-provided 3D point clouds, obtained through post processing of sensed multiview RGB-D images. They are typically trained in-domain, forego large-scale 2D pre-training and outperform alternatives that featurize the posed RGB-D multiview images instead. The gap in performance between methods that consume posed images versus post-processed 3D point clouds has fueled the belief that 2D and 3D perception require distinct model architectures. In this paper, we challenge this view and propose ODIN (Omni-Dimensional INstance segmentation), a model that can segment and label both 2D RGB images and 3D point clouds, using a transformer architecture that alternates between 2D within-view and 3D cross-view information fusion. Our model differentiates 2D and 3D feature operations through the positional encodings of the tokens involved, which capture pixel coordinates for 2D patch tokens and 3D coordinates for 3D feature tokens. ODIN achieves state-of-the-art performance on ScanNet200, Matterport3D and AI2THOR 3D instance segmentation benchmarks, and competitive performance on ScanNet, S3DIS and COCO. It outperforms all previous works by a wide margin when the sensed 3D point cloud is used in place of the point cloud sampled from 3D mesh. When used as the 3D perception engine in an instructable embodied agent architecture, it sets a new state-of-the-art on the TEACh action-from-dialogue benchmark. Our code and checkpoints can be found at the project website (https://odin-seg.github.io).


Summary

  • The paper presents a unified transformer-based model for 2D and 3D instance segmentation that achieves competitive results across multiple benchmarks.
  • The model alternates between 2D within-view and 3D cross-view fusion, distinguishing 2D and 3D tokens only through their positional encodings.
  • It supports practical applications such as serving as the perception engine of an embodied agent and operating directly on raw RGB-D sensor streams rather than pre-processed meshes.

Overview of ODIN

ODIN (Omni-Dimensional INstance segmentation) is a single transformer-based model for both 2D and 3D perception. It performs instance segmentation and labeling by fusing information within each view and across views, and it accepts either posed multiview RGB-D sequences or single RGB images, interleaving 2D and 3D fusion throughout processing.
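To make the lifting step concrete, below is a minimal sketch, assuming a pinhole camera model and metric depth, of how per-pixel 2D features can be unprojected into world-space 3D tokens. All names are illustrative, not ODIN's actual code.

```python
import torch

def unproject_features(feat_2d, depth, intrinsics, cam_to_world):
    """Lift a (C, H, W) 2D feature map to (H*W, 3+C) world-space tokens.

    feat_2d:      (C, H, W) feature map from a 2D backbone
    depth:        (H, W) metric depth for the same view
    intrinsics:   (3, 3) camera intrinsics matrix K
    cam_to_world: (4, 4) camera-to-world extrinsics (last row [0, 0, 0, 1])
    """
    C, H, W = feat_2d.shape
    v, u = torch.meshgrid(
        torch.arange(H, dtype=torch.float32),
        torch.arange(W, dtype=torch.float32),
        indexing="ij",
    )
    # Back-project each pixel (u, v) with its depth into camera coordinates.
    z = depth
    x = (u - intrinsics[0, 2]) * z / intrinsics[0, 0]
    y = (v - intrinsics[1, 2]) * z / intrinsics[1, 1]
    pts_cam = torch.stack([x, y, z, torch.ones_like(z)], dim=-1)   # (H, W, 4)
    # Transform homogeneous camera-space points into world coordinates.
    pts_world = (pts_cam.reshape(-1, 4) @ cam_to_world.T)[:, :3]   # (H*W, 3)
    # Each 3D token carries its world coordinate plus its 2D feature vector.
    tokens = torch.cat([pts_world, feat_2d.reshape(C, -1).T], dim=-1)
    return tokens  # (H*W, 3 + C)
```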

Transcending Model Boundaries

The performance gap between methods that consume posed RGB-D images and those that consume post-processed 3D point clouds has fueled the belief that 2D and 3D perception require distinct model architectures. ODIN challenges this view with a single model that handles both 2D RGB images and 3D point clouds, as demonstrated on the ScanNet200, Matterport3D, AI2THOR, ScanNet, S3DIS, and COCO benchmarks. By alternating between 2D within-view and 3D cross-view fusion, ODIN distinguishes 2D from 3D feature operations solely through the positional encodings of its tokens: pixel coordinates for 2D patch tokens and 3D coordinates for 3D feature tokens.
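As a rough illustration of how one attention stack can serve both modalities, here is a sketch in which 2D and 3D tokens differ only in their positional encodings. The Fourier-feature formulation, layer sizes, and names are assumptions, not ODIN's exact recipe.

```python
import math
import torch
import torch.nn as nn

def fourier_features(coords, num_bands=16):
    """Map (N, D) coordinates to (N, D * 2 * num_bands) sin/cos features."""
    freqs = 2.0 ** torch.arange(num_bands, dtype=coords.dtype)  # (B,)
    angles = coords.unsqueeze(-1) * freqs * math.pi             # (N, D, B)
    return torch.cat([angles.sin(), angles.cos()], dim=-1).flatten(1)

class TokenPosEncoding(nn.Module):
    """Projects 2D pixel coords or 3D world coords into one model width,
    so downstream attention layers are shared across modalities."""
    def __init__(self, dim=256, num_bands=16):
        super().__init__()
        self.proj_2d = nn.Linear(2 * 2 * num_bands, dim)  # (u, v) pixels
        self.proj_3d = nn.Linear(3 * 2 * num_bands, dim)  # (x, y, z) points
        self.num_bands = num_bands

    def forward(self, coords):
        feats = fourier_features(coords, self.num_bands)
        return self.proj_2d(feats) if coords.shape[-1] == 2 else self.proj_3d(feats)
```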

Method and Architecture

ODIN's architecture cycles between 2D fusion within individual image views and attention-based 3D fusion across views, and these alternating stages yield representations that are consistent across viewpoints. Rather than learning a separate 3D encoder, the model repurposes its 2D features as 3D tokens, so the large majority of its parameters are shared between RGB and RGB-D inputs and the strengths of pretrained 2D backbones carry over.
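The alternation itself can be sketched as follows. This is an illustrative PyTorch block under assumed shapes, not ODIN's released implementation: within-view attention treats each view as an independent batch element, while cross-view attention flattens all views into one sequence so tokens attend across the scene.

```python
import torch
import torch.nn as nn

class AlternatingFusionBlock(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.within_view = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_view = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, tokens, pos_2d, pos_3d):
        """tokens: (V, N, D) for V views of N tokens each.

        pos_2d: (V, N, D) pixel-coordinate encodings (2D stage)
        pos_3d: (V, N, D) world-coordinate encodings (3D stage)
        """
        # 2D stage: views sit on the batch axis, so each view's tokens
        # attend only to themselves, keyed by pixel-position encodings.
        q = self.norm1(tokens) + pos_2d
        tokens = tokens + self.within_view(q, q, self.norm1(tokens))[0]

        # 3D stage: flatten all views into one sequence so tokens attend
        # across views, keyed by 3D world-coordinate encodings.
        V, N, D = tokens.shape
        flat = tokens.reshape(1, V * N, D)
        q3 = self.norm2(flat) + pos_3d.reshape(1, V * N, D)
        flat = flat + self.cross_view(q3, q3, self.norm2(flat))[0]
        return flat.reshape(V, N, D)
```

Restricting the 2D stage to per-view attention is, plausibly, what lets pretrained 2D backbone weights be reused largely unchanged.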

Impact and Applications

ODIN's significance lies in both its performance and its practical implications. It sets new state-of-the-art results on several 3D instance segmentation benchmarks while remaining competitive across diverse 2D and 3D datasets. Used as the 3D perception engine of an instructable embodied agent, it establishes a new state of the art on the TEACh action-from-dialogue benchmark. Moreover, because it consumes raw sensed RGB-D data rather than pre-processed meshes, it is better suited to dynamic, responsive AI systems.

A Glimpse into the Future

The research makes a compelling case for unified 2D and 3D perception models. By serving the needs of both modalities within a single framework, ODIN points toward more integrated and capable perception systems. The released code and checkpoints give the community a concrete starting point, and prospects for improved noise resilience and cross-dataset training suggest considerable room to build on this direction.
