Overview of "OneFormer: One Transformer to Rule Universal Image Segmentation"
The paper "OneFormer: One Transformer to Rule Universal Image Segmentation" introduces a novel framework aimed at achieving universal image segmentation. The primary contribution of this work is the development of the OneFormer model, designed to unify semantic, instance, and panoptic segmentation tasks within a single architectural framework. Unlike previous approaches requiring separate models and substantial resources for each task, OneFormer employs a task-conditioned joint training strategy to effectively integrate these segmentation tasks into one model.
Key Contributions and Methodology
OneFormer builds on a transformer-based, task-dynamic architecture that adapts to the target segmentation task through a task token input. The key components of the methodology include:
- Task-Conditioned Joint Training Strategy: During joint training, a task (semantic, instance, or panoptic) is sampled uniformly for each image, the corresponding ground truth is derived from the panoptic annotations, and the model is conditioned on the sampled task via a task token (see the training sketch after this list). Because a single training run replaces three task-specific ones, this substantially reduces training time and resource requirements compared with methods such as Mask2Former, which must be trained separately for each task.
- Query Initialization and Task Conditioning: Object queries are initialized as repetitions of the task token, so every query carries task-specific context from the start (the sketch after this list illustrates this initialization). This task-conditioned initialization is crucial for training the model effectively across multiple tasks in a unified manner.
- Query-Text Contrastive Loss: The model applies a contrastive loss between object queries and text queries derived from the ground truth (a textual description of the masks present in the image), guiding inter-task and inter-class distinctions. This component reduces category mispredictions and improves overall segmentation accuracy; a minimal sketch of such a loss also follows the list.
- Single Architecture for Multiple Tasks: A single jointly trained OneFormer model outperforms state-of-the-art models that are trained individually on semantic, instance, and panoptic segmentation.
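To make the task-conditioned training and query initialization described above concrete, here is a minimal PyTorch-style sketch. It is not the authors' implementation: the class name TaskConditionedQueries, the query count, and the embedding dimension are illustrative, and the paper's tokenized "the task is {task}" text passed through an MLP is replaced by a simple embedding lookup.

```python
import random

import torch
import torch.nn as nn


class TaskConditionedQueries(nn.Module):
    """Illustrative module: map a task string to a task token and
    initialize every object query as a copy of that token."""

    def __init__(self, num_queries: int = 150, embed_dim: int = 256):
        super().__init__()
        self.num_queries = num_queries
        self.task_ids = {"panoptic": 0, "instance": 1, "semantic": 2}
        # Stand-in for the paper's text template "the task is {task}",
        # tokenized and projected to a single task-token embedding.
        self.task_embed = nn.Embedding(len(self.task_ids), embed_dim)

    def forward(self, task: str) -> torch.Tensor:
        task_token = self.task_embed(torch.tensor(self.task_ids[task]))  # (embed_dim,)
        # All object queries start as repetitions of the task token, so each
        # query carries the task context before the decoder refines it.
        return task_token.unsqueeze(0).repeat(self.num_queries, 1)  # (num_queries, embed_dim)


# Joint training samples one task per image; the matching ground truth
# is derived from the panoptic annotation for that sampled task.
task = random.choice(["panoptic", "instance", "semantic"])
queries = TaskConditionedQueries()(task)  # shape: (150, 256)
```

In the full model these initialized queries are further refined with image features inside the transformer decoder; only the initialization step is shown here.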
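The query-text contrastive loss can likewise be sketched as a CLIP-style symmetric objective between projected object-query embeddings and text embeddings of the ground-truth entries. The function below is an approximation under that assumption, not the official implementation, and presumes the two sets have already been matched one-to-one and projected to a common dimension.

```python
import torch
import torch.nn.functional as F


def query_text_contrastive_loss(obj_queries: torch.Tensor,
                                text_queries: torch.Tensor,
                                logit_scale: torch.Tensor) -> torch.Tensor:
    """Symmetric InfoNCE-style loss: the i-th object query should match
    the i-th ground-truth text embedding and no other.

    obj_queries:  (B, D) projected object-query embeddings
    text_queries: (B, D) projected ground-truth text embeddings
    logit_scale:  learnable scalar (inverse temperature), CLIP-style
    """
    q = F.normalize(obj_queries, dim=-1)
    t = F.normalize(text_queries, dim=-1)
    logits = logit_scale * q @ t.T                      # (B, B) similarity matrix
    targets = torch.arange(q.size(0), device=q.device)  # diagonal entries are positives
    loss_q2t = F.cross_entropy(logits, targets)         # query -> text direction
    loss_t2q = F.cross_entropy(logits.T, targets)       # text -> query direction
    return 0.5 * (loss_q2t + loss_t2q)
```

Because the text embeddings describe which masks and classes are present for the sampled task, pulling each query toward its matching text entry is what gives the loss its inter-task and inter-class discriminative effect.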
Results and Performance
OneFormer achieves state-of-the-art results across several benchmark datasets, including ADE20K, Cityscapes, and COCO. With a single model, OneFormer surpasses specialized Mask2Former models on metrics such as PQ, AP, and mIoU. Performance improves further with large backbones such as Swin-L, ConvNeXt-L, and DiNAT-L, indicating that the framework adapts well to different backbone architectures.
Numerical Results:
- ADE20K: 51.5% PQ with the DiNAT-L backbone, outperforming earlier models using the same backbone.
- Cityscapes: a new record of 68.5% PQ with ConvNeXt-L.
- COCO: 68.1% mIoU with DiNAT-L.
Implications and Future Directions
The implications of this research are significant for both practical and theoretical advances in image segmentation. Practically, OneFormer can substantially reduce the compute and storage required for segmentation, since one model replaces several task-specific ones, making high-quality segmentation more accessible and efficient. Theoretically, it raises new questions about unifying additional computer vision tasks within a single model framework.
Future developments may explore extending the OneFormer architecture to additional segmentation challenges or even broader vision tasks, leveraging its task-conditioned dynamic capabilities. Further research might also delve into optimizing the task-token input and exploring alternative transformer architectures to enhance performance and efficiency.
The open-source release of OneFormer's code and trained models encourages ongoing research and development in this domain, supporting further work on universal segmentation models; a brief usage sketch follows.
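For readers who want to try the released models, the sketch below assumes the Hugging Face Transformers port of OneFormer and the publicly hosted shi-labs/oneformer_ade20k_swin_tiny checkpoint; checkpoint names and post-processing options may differ from the official repository.

```python
# Hedged usage sketch: run one OneFormer checkpoint for panoptic segmentation
# and reuse the same weights for the other tasks by changing task_inputs.
import requests
import torch
from PIL import Image
from transformers import OneFormerForUniversalSegmentation, OneFormerProcessor

checkpoint = "shi-labs/oneformer_ade20k_swin_tiny"  # assumed available on the Hub
processor = OneFormerProcessor.from_pretrained(checkpoint)
model = OneFormerForUniversalSegmentation.from_pretrained(checkpoint)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# The task token is set through task_inputs: "panoptic", "instance", or "semantic".
inputs = processor(images=image, task_inputs=["panoptic"], return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

panoptic = processor.post_process_panoptic_segmentation(
    outputs, target_sizes=[image.size[::-1]]  # PIL size is (W, H); target is (H, W)
)[0]
print(panoptic["segmentation"].shape, len(panoptic["segments_info"]))
```

Switching task_inputs to ["semantic"] or ["instance"] (with the corresponding post-processing call) reuses the same weights for the other two tasks.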