Unsupervised Universal Image Segmentation (2312.17243v1)

Published 28 Dec 2023 in cs.CV

Abstract: Several unsupervised image segmentation approaches have been proposed which eliminate the need for dense manually-annotated segmentation masks; current models separately handle either semantic segmentation (e.g., STEGO) or class-agnostic instance segmentation (e.g., CutLER), but not both (i.e., panoptic segmentation). We propose an Unsupervised Universal Segmentation model (U2Seg) adept at performing various image segmentation tasks -- instance, semantic and panoptic -- using a novel unified framework. U2Seg generates pseudo semantic labels for these segmentation tasks via leveraging self-supervised models followed by clustering; each cluster represents different semantic and/or instance membership of pixels. We then self-train the model on these pseudo semantic labels, yielding substantial performance gains over specialized methods tailored to each task: a +2.6 AP$^{\text{box}}$ boost vs. CutLER in unsupervised instance segmentation on COCO and a +7.0 PixelAcc increase (vs. STEGO) in unsupervised semantic segmentation on COCOStuff. Moreover, our method sets up a new baseline for unsupervised panoptic segmentation, which has not been previously explored. U2Seg is also a strong pretrained model for few-shot segmentation, surpassing CutLER by +5.0 AP$^{\text{mask}}$ when trained on a low-data regime, e.g., only 1% COCO labels. We hope our simple yet effective method can inspire more research on unsupervised universal image segmentation.

Introduction

Image segmentation has advanced considerably in recent years, particularly through techniques that reduce the dependence on densely labeled datasets. Unsupervised methods have so far handled semantic segmentation (e.g., STEGO) and class-agnostic instance segmentation (e.g., CutLER) with separate models, and none has covered both at once, which is what panoptic segmentation requires. U2Seg consolidates instance, semantic, and panoptic segmentation in one unified model, extending unsupervised learning to all three tasks.

Methodology

The unified model, "U2Seg," handles instance, semantic, and panoptic segmentation without any labeled training data. It builds on self-supervised representation learning and clustering. U2Seg starts from class-agnostic instance masks produced by the MaskCut algorithm, which operates on the self-supervised DINO model. It then clusters semantically similar instance masks and uses each mask's cluster as its pseudo semantic label. Next, it combines these labeled "things" with "stuff" pixels obtained from STEGO, so that every pixel in the image receives a pseudo semantic label. Finally, the model is self-trained on these pseudo labels.
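The labeling steps above can be illustrated with a minimal sketch. It is not the authors' implementation: the feature vectors, the plain k-means routine, and the helper names `kmeans` and `merge_panoptic` are all assumptions made for illustration. In the real pipeline the instance masks would come from MaskCut, the mask features from a self-supervised backbone, and the stuff map from STEGO. The sketch only shows how cluster IDs can serve as pseudo "thing" classes and how they can be overlaid on a stuff map to label every pixel.

```python
import random


def kmeans(features, k, iters=20, seed=0):
    """Plain Lloyd's k-means (hypothetical helper); returns one cluster ID per feature."""
    rng = random.Random(seed)
    centers = [list(f) for f in rng.sample(features, k)]
    assign = [0] * len(features)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        for i, f in enumerate(features):
            assign[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(f, centers[c])),
            )
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [features[i] for i in range(len(features)) if assign[i] == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign


def merge_panoptic(stuff_map, instance_masks, cluster_ids, num_stuff):
    """Overlay instance masks on a stuff map (hypothetical helper).

    Each output pixel is (semantic_id, instance_id). Stuff pixels keep their
    stuff class and get instance_id 0. Thing pixels get num_stuff + cluster_id
    so thing and stuff classes do not collide. Later masks overwrite earlier ones.
    """
    h, w = len(stuff_map), len(stuff_map[0])
    out = [[(stuff_map[y][x], 0) for x in range(w)] for y in range(h)]
    for inst_id, (mask, cid) in enumerate(zip(instance_masks, cluster_ids), start=1):
        for y in range(h):
            for x in range(w):
                if mask[y][x]:
                    out[y][x] = (num_stuff + cid, inst_id)
    return out


# Toy data: four instance-mask embeddings forming two obvious groups.
toy_features = [[0.0, 0.0], [1.0, 0.0], [10.0, 0.0], [11.0, 0.0]]
toy_cluster_ids = kmeans(toy_features, k=2)

# Toy 2x3 image: two stuff classes and one instance mask covering the left column.
toy_stuff = [[0, 0, 1], [0, 1, 1]]
toy_mask = [[1, 0, 0], [1, 0, 0]]
toy_panoptic = merge_panoptic(toy_stuff, [toy_mask], [0], num_stuff=2)
```

A segmentation model would then be self-trained on maps like `toy_panoptic`. That training step is outside the scope of this sketch.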

Benchmarks and Performance

Evaluated across tasks and datasets, U2Seg outperforms the specialized models built for each task. In unsupervised instance segmentation on COCO, it improves on CutLER by +2.6 AP$^{\text{box}}$. In unsupervised semantic segmentation on COCOStuff, it improves on STEGO by +7.0 PixelAcc. U2Seg also sets a new baseline for unsupervised panoptic segmentation, which had not been explored before, and it serves as a strong pretrained model for few-shot segmentation: when fine-tuned in a low-data regime (e.g., only 1% of COCO labels), it surpasses CutLER by +5.0 AP$^{\text{mask}}$.
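Because an unsupervised model's cluster IDs carry no class names, pixel accuracy for unsupervised semantic segmentation is commonly computed after matching predicted clusters to ground-truth classes, typically with Hungarian matching. The sketch below is an assumption about that general protocol, not a reproduction of the paper's evaluation code. The function name `matched_pixel_accuracy` is hypothetical, and a brute-force search over permutations stands in for Hungarian matching, which is practical only for a handful of classes.

```python
from itertools import permutations


def matched_pixel_accuracy(pred, gt, num_classes):
    """Pixel accuracy under the best one-to-one cluster-to-class mapping.

    pred and gt are flat lists of integer labels in [0, num_classes).
    Brute force over permutations; a stand-in for Hungarian matching.
    """
    # confusion[c][g] counts pixels predicted as cluster c with ground truth g.
    confusion = [[0] * num_classes for _ in range(num_classes)]
    for p, g in zip(pred, gt):
        confusion[p][g] += 1
    # Try every mapping cluster c -> class perm[c] and keep the best hit count.
    best_hits = max(
        sum(confusion[c][perm[c]] for c in range(num_classes))
        for perm in permutations(range(num_classes))
    )
    return best_hits / len(gt)


# Toy example: cluster IDs are permuted relative to the ground-truth classes.
toy_pred = [1, 1, 0, 0, 2, 2]
toy_gt = [0, 0, 1, 1, 2, 1]
toy_acc = matched_pixel_accuracy(toy_pred, toy_gt, num_classes=3)
```

In the toy example, the best mapping sends cluster 1 to class 0, cluster 0 to class 1, and cluster 2 to class 2, so five of the six pixels count as correct.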

Conclusion

U2Seg explores how far image segmentation can proceed without human-generated labels, a significant step toward AI systems that are more autonomous and less data-hungry. Because it performs multiple segmentation tasks within a single, noise-tolerant framework, U2Seg could pave the way for future models that further reduce the dependence on the extensive, dense, human-labeled data otherwise required for training. The underlying method also encourages the development of AI systems capable of more comprehensive scene understanding from images, with promising practical implications.