OpenMask3D: Advancing 3D Instance Segmentation with Open-Vocabulary Queries
Introduction
Recent advancements in 3D instance segmentation have largely been confined to recognizing and classifying objects within a set of predefined categories, which may not suffice for real-world applications encountering a broader array of objects. Addressing this limitation, we introduce OpenMask3D, a pioneering approach that leverages open-vocabulary queries for 3D instance segmentation. This method represents a significant step forward, enabling the identification of objects in 3D scenes based on descriptive queries that encompass object semantics, affordances, geometry, and material properties.
Methodology
The OpenMask3D approach comprises two key components: a class-agnostic mask proposal head and a novel mask-feature aggregation module. The framework first generates class-agnostic 3D mask proposals from an RGB-D sequence and the corresponding reconstructed 3D geometry of the scene. It then aggregates a feature for each mask via multi-view fusion of CLIP-based image embeddings. This design allows OpenMask3D to compute a mask-feature representation for each object instance, enabling the retrieval of object instances by similarity to any given query in a zero-shot manner. In contrast to prior works that rely on per-point features, OpenMask3D's instance-based feature computation significantly enhances its ability to capture detailed object representations.
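The multi-view feature aggregation step can be sketched in a few lines. The snippet below is a minimal illustration, not the authors' implementation: it assumes that CLIP image embeddings for each view in which a mask is visible have already been extracted (here they are simulated with random unit vectors), and fuses them by averaging and renormalizing.

```python
import numpy as np

def aggregate_mask_features(view_embeddings: np.ndarray) -> np.ndarray:
    """Fuse multi-view CLIP image embeddings into a single mask feature.

    view_embeddings: (num_views, dim) array of L2-normalized embeddings,
    one per view in which the mask is visible.
    Returns one L2-normalized (dim,) feature vector for the instance.
    """
    fused = view_embeddings.mean(axis=0)   # average across views
    return fused / np.linalg.norm(fused)   # renormalize to unit length

# Hypothetical example: 4 views, 512-dim embeddings (simulated)
rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 512))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)  # unit-normalize each view
feature = aggregate_mask_features(emb)
```

Averaging followed by renormalization is one common choice for multi-view fusion; weighted schemes (e.g., by mask visibility per view) are an alternative.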
Performance and Evaluation
Extensive experiments were conducted on the ScanNet200 and Replica datasets to assess the effectiveness of OpenMask3D against other open-vocabulary methods, particularly in scenarios involving a long-tail category distribution. The results demonstrate that OpenMask3D consistently outperforms its counterparts, showcasing superior performance in identifying object instances across a wide spectrum of categories. This performance is especially notable in the context of recognizing and segmenting objects described by open-vocabulary queries, underscoring the model's robustness and versatility.
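The zero-shot retrieval at the heart of these evaluations reduces to ranking per-instance mask features by cosine similarity to a text embedding of the query. Below is a minimal sketch: the function name is hypothetical, and toy unit vectors stand in for real CLIP features.

```python
import numpy as np

def retrieve_instances(mask_features: np.ndarray,
                       query_embedding: np.ndarray,
                       top_k: int = 3):
    """Rank object instances by cosine similarity to a text query.

    mask_features: (num_masks, dim), each row L2-normalized.
    query_embedding: (dim,) L2-normalized text embedding.
    Returns the indices and scores of the top_k most similar instances.
    """
    sims = mask_features @ query_embedding  # dot product = cosine for unit vectors
    ranked = np.argsort(-sims)              # descending similarity
    return ranked[:top_k], sims[ranked[:top_k]]

# Toy example: 4 instances, 4-dim features; the query aligns with instance 2
mask_features = np.eye(4)                   # each row is a unit basis vector
query = np.array([0.0, 0.0, 1.0, 0.0])
top, scores = retrieve_instances(mask_features, query, top_k=2)  # top[0] == 2
```

Because no category-specific classifier is involved, the same ranking procedure works for any free-form query whose text embedding can be computed.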
Implications and Future Directions
The introduction of OpenMask3D to the field of 3D instance segmentation opens new avenues for research and application. Practically, it paves the way for more intuitive human-computer interaction within 3D environments, such as augmented reality and robotics, where understanding and responding to a diverse range of objects based on natural language descriptions is crucial. Theoretically, this work expands the boundaries of 3D scene understanding, highlighting the potential of integrating open-vocabulary capabilities into 3D vision tasks. Looking ahead, future research may explore improving the quality of class-agnostic masks, understanding global scene context, and developing evaluation methodologies tailored for open-vocabulary tasks, thereby deepening the integration of linguistic and visual modalities in 3D scene analysis.
Conclusion
OpenMask3D marks a significant advancement in 3D instance segmentation by incorporating open-vocabulary capabilities. By doing so, it not only addresses a crucial gap in the field but also sets a new precedent for future research at the intersection of language and 3D vision. This method's ability to comprehend and segment objects based on descriptive queries, without the need for category-specific training, signifies a move towards more adaptable, language-aware 3D vision systems.