Scaling Open-Vocabulary Image Segmentation with Image-Level Labels
The paper "Scaling Open-Vocabulary Image Segmentation with Image-Level Labels" addresses the limitations of traditional semantic segmentation models by introducing OpenSeg, an open-vocabulary image segmentation model. The model organizes an image into semantically meaningful regions guided by arbitrary text queries.
Recent open-vocabulary models such as CLIP and ALIGN have demonstrated image-level classification from natural language inputs. However, these models struggle with pixel-level tasks such as image segmentation. The authors identify the cause: these models skip the visual grouping step that organizes an image into coherent pixel groups, a step crucial for aligning regions with semantics.
OpenSeg proposes a two-stage approach: first, it employs a segmentation mechanism to predict candidate segmentation masks; then, it aligns visual concepts by matching each word in an image caption to the predicted masks. Because image-level captions are far cheaper to collect than pixel-level labels, this design lets the model scale in both dataset size and vocabulary.
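The second stage can be sketched in a few lines. The following is a minimal illustration, not the paper's implementation: region features are average-pooled from per-pixel features under each predicted mask, and each caption word is then assigned to its most similar region by cosine similarity. All function names and shapes here are assumptions for the sketch.

```python
import numpy as np

def mask_pooled_features(pixel_feats, masks):
    """Average-pool per-pixel features inside each predicted mask.

    pixel_feats: (H, W, D) array of visual features.
    masks: (N, H, W) soft or binary mask proposals.
    Returns (N, D) region features, one per mask.
    """
    n, h, w = masks.shape
    flat_feats = pixel_feats.reshape(h * w, -1)          # (H*W, D)
    flat_masks = masks.reshape(n, h * w)                 # (N, H*W)
    # Normalize mask weights so each region feature is a weighted mean.
    weights = flat_masks / (flat_masks.sum(axis=1, keepdims=True) + 1e-6)
    return weights @ flat_feats                          # (N, D)

def match_words_to_masks(word_embs, region_feats):
    """Assign each caption word to its most similar mask region."""
    # Cosine similarity between words (T, D) and regions (N, D).
    w = word_embs / np.linalg.norm(word_embs, axis=1, keepdims=True)
    r = region_feats / np.linalg.norm(region_feats, axis=1, keepdims=True)
    sims = w @ r.T                                       # (T, N)
    return sims.argmax(axis=1), sims
```

In the actual model the word embeddings come from a text encoder and the alignment is learned end to end; this sketch only shows the pooling-and-matching structure of the stage.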
The experimental results are compelling: OpenSeg outperforms the recent open-vocabulary segmentation method LSeg on benchmarks such as the PASCAL dataset by a significant margin of 19.9 mIoU. OpenSeg's robustness is demonstrated through zero-shot transfer across different segmentation datasets, illustrating broad generalization without fine-tuning on specific datasets.
In technical terms, OpenSeg employs an architecture in which a feature pyramid network (FPN) and cross-attention mechanisms generate the mask proposals. The architecture supports both class-agnostic mask prediction and mask-to-caption alignment, achieved through mask pooling and a contrastive loss.
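The contrastive alignment can be illustrated schematically. In this hedged sketch (assumed names and shapes, not the paper's code), each caption word softly attends over the pooled region features to produce an image-caption grounding score, and an InfoNCE-style loss over a batch pushes each caption toward its own image's regions and away from the others':

```python
import numpy as np

def grounding_similarity(word_embs, region_feats, tau=0.07):
    """Score a (caption, image) pair: each word softly attends over regions."""
    w = word_embs / np.linalg.norm(word_embs, axis=1, keepdims=True)
    r = region_feats / np.linalg.norm(region_feats, axis=1, keepdims=True)
    sims = w @ r.T / tau                                  # (T, N) scaled cosines
    attn = np.exp(sims) / np.exp(sims).sum(axis=1, keepdims=True)
    # Attention-weighted region similarity, averaged over caption words.
    return (attn * sims).sum(axis=1).mean() * tau

def contrastive_loss(batch_words, batch_regions, tau=0.07):
    """InfoNCE-style loss: each caption should match its own image's regions."""
    b = len(batch_words)
    scores = np.array([[grounding_similarity(batch_words[i], batch_regions[j], tau)
                        for j in range(b)] for i in range(b)]) / tau
    # Log-softmax over images for each caption; the diagonal is the match.
    logp = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(logp))
```

Only the captions supervise this loss, which is what allows training without any pixel-level labels; the softmax temperature `tau` and the exact pooling are illustrative choices.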
Using captions as the supervisory signal lets the training process eschew costly pixel-level labeling. The visual-semantic alignments are learned with weak supervision, which makes the approach scalable. The methodology not only achieves superior segmentation but also retains the semantic richness afforded by natural language descriptions.
Ablation studies detail the architectural choices, isolating the contributions of mask proposals and visual-semantic alignment. For instance, initializing the model from a strong pre-trained checkpoint such as ALIGN provides a better foundation, although models initialized from the NoisyStudent checkpoint also perform well, indicating the method's robustness.
Future research may extend OpenSeg by incorporating larger unlabeled datasets or by refining mask prediction accuracy. Potential societal impacts include applications in domains requiring interactive image understanding, though the authors caution that careful study is needed to mitigate biases inherent in large-scale datasets.
Overall, OpenSeg represents a significant stride in open-vocabulary segmentation. By utilizing scalable supervisory techniques, it circumvents the limitations of existing models that require closed-set categories, setting a precedent for future models focusing on seamless integration of language into visual tasks.