- The paper introduces LSeg, a novel model using language-driven contrastive learning to enable zero-shot segmentation of previously unseen categories.
- It employs a dual encoder architecture with a state-of-the-art text encoder and transformer-based image encoder to align pixel-level and semantic embeddings.
- LSeg demonstrates strong performance with 61.3% mIoU on PASCAL and up to 87.8% on FSS-1000, underscoring its robust generalization capabilities.
Language-driven Semantic Segmentation: A Technical Overview
Introduction and Motivation
The paper introduces LSeg, a model for language-driven semantic segmentation that addresses a central limitation of conventional semantic segmentation methods. Traditional approaches depend on a fixed, limited set of class labels determined by the training dataset, which constrains generalization to novel categories, and annotating images with ever-larger label sets is labor-intensive and costly. A considerable opportunity therefore lies in leveraging the text encoders of pre-trained vision-language models to expand this capability, enabling zero-shot semantic segmentation in which the model performs effectively on previously unseen categories.
Methodology
LSeg integrates a text encoder with a transformer-based image encoder. The text encoder computes embeddings for the input label set, and the image encoder produces dense pixel-level features that are aligned with those label embeddings. Training uses a contrastive-style objective: each pixel embedding is pulled toward the text embedding of its ground-truth class and pushed away from the embeddings of the other labels in the set. Because the text encoder places semantically related labels, such as “cat” and “furry,” close together in the embedding space, the model can accept new label sets at test time and perform zero-shot segmentation without additional samples or retraining.
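A minimal sketch of this per-pixel alignment, assuming the dense image features and label text embeddings are already computed; names such as `pixel_feats`, `text_feats`, and `temperature` are illustrative rather than taken from the paper's code:

```python
import torch
import torch.nn.functional as F

def pixel_text_alignment_loss(pixel_feats, text_feats, target, temperature=0.07):
    """Align per-pixel embeddings with label text embeddings.

    pixel_feats: (B, C, H, W) dense image embeddings
    text_feats:  (N, C) one embedding per label in the current label set
    target:      (B, H, W) ground-truth label indices in [0, N)
    """
    # L2-normalize both modalities so the inner product is a cosine similarity.
    pixel_feats = F.normalize(pixel_feats, dim=1)
    text_feats = F.normalize(text_feats, dim=1)

    # Similarity of every pixel with every label embedding -> (B, N, H, W).
    logits = torch.einsum("bchw,nc->bnhw", pixel_feats, text_feats) / temperature

    # Pixelwise softmax cross-entropy over the label dimension.
    return F.cross_entropy(logits, target)
```

Because the label set enters only through `text_feats`, a different list of label names can be supplied at test time without retraining, which is the basis of the zero-shot behavior described above.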
Architecture
The architecture consists of the following components:
- Text Encoder: Uses the text encoder of a pre-trained vision-language model such as CLIP, whose text and image representations are co-trained. It embeds the input label set into a shared vector space, providing the semantic targets against which pixels are classified.
- Image Encoder: Builds on the dense prediction transformer (DPT) to compute per-pixel embeddings in the same space as the label embeddings. During training, a pixelwise softmax objective over pixel-label similarities encourages each pixel to lie closest to the embedding of its ground-truth label.
- Spatial Regularization: Applies depthwise or bottleneck blocks to spatially regularize the per-label score maps. Because the order of the labels is arbitrary, these blocks are designed to be equivariant to permutations of the label set (a schematic forward pass combining these components is sketched after this list).
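The sketch below shows one way these components could fit together; it is a hedged approximation rather than the authors' implementation. The `clip.load`, `clip.tokenize`, and `encode_text` calls follow OpenAI's public CLIP package, while `LSegLikeModel`, `DepthwiseBlock`, the generic `image_encoder`, and the fixed `temperature` are illustrative stand-ins for the paper's DPT-based encoder, spatial regularization block, and learned logit scale:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import clip  # OpenAI's CLIP package; supplies the pre-trained text encoder


class DepthwiseBlock(nn.Module):
    """Spatial regularization over the per-label score maps.

    A single shared 1-channel convolution is applied to each label channel
    independently, so the block accepts any number of labels and its output
    is equivariant to permutations of the label set.
    """

    def __init__(self, kernel_size=3):
        super().__init__()
        self.conv = nn.Conv2d(1, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, scores):  # scores: (B, N, H, W)
        channels = [self.conv(scores[:, i:i + 1]) for i in range(scores.shape[1])]
        return torch.cat(channels, dim=1)


class LSegLikeModel(nn.Module):
    """Illustrative assembly of text encoder, dense image encoder, and regularizer."""

    def __init__(self, image_encoder, clip_name="ViT-B/32", temperature=0.07):
        super().__init__()
        # image_encoder is assumed to output features with the same channel
        # dimension as the CLIP text embeddings (512 for ViT-B/32).
        self.image_encoder = image_encoder
        # Load on CPU and move together with the rest of the module via .to(device).
        self.clip_model, _ = clip.load(clip_name, device="cpu")
        self.spatial_block = DepthwiseBlock()
        self.temperature = temperature

    def forward(self, images, labels):
        # Embed the (arbitrary, possibly unseen) label names with the frozen text encoder.
        # The whole model is assumed to live on the same device as `images`.
        tokens = clip.tokenize(labels).to(images.device)
        with torch.no_grad():
            text_feats = self.clip_model.encode_text(tokens).float()  # (N, C)

        pixel_feats = self.image_encoder(images)                      # (B, C, H, W)

        # Cosine similarity between every pixel and every label embedding.
        pixel_feats = F.normalize(pixel_feats, dim=1)
        text_feats = F.normalize(text_feats, dim=-1)
        scores = torch.einsum("bchw,nc->bnhw", pixel_feats, text_feats)

        # Spatially regularize and scale before the pixelwise softmax / argmax.
        return self.spatial_block(scores) / self.temperature
```

At inference, `labels` can be any list of category names, and `scores.argmax(dim=1)` yields the predicted segmentation; no component of this assembly depends on the number or identity of the labels.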
Experimental Results
LSeg shows competitive performance against state-of-the-art models on several benchmarks, including PASCAL-5i, COCO-20i, and FSS-1000. Notably, it surpasses existing zero-shot segmentation models and rivals certain few-shot methods without requiring any labeled examples of the test categories. With strong quantitative results (61.3% mIoU on PASCAL and up to 87.8% on FSS-1000), LSeg demonstrates robust generalization to unseen categories.
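For reference, the mean intersection-over-union (mIoU) metric quoted above can be computed from predicted and ground-truth label maps as in this small sketch (the function name and the handling of absent classes are illustrative choices, not specified by the paper):

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Mean IoU over classes present in the prediction or the ground truth.

    pred, target: integer label maps of identical shape.
    """
    ious = []
    for c in range(num_classes):
        pred_c, target_c = (pred == c), (target == c)
        union = np.logical_or(pred_c, target_c).sum()
        if union == 0:  # class absent from both prediction and ground truth
            continue
        intersection = np.logical_and(pred_c, target_c).sum()
        ious.append(intersection / union)
    return float(np.mean(ious))
```

Benchmark protocols such as PASCAL-5i additionally average this score over folds of held-out classes.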
Implications and Future Directions
The practical implications of LSeg are extensive. By allowing semantic segmentation models to interpret a dynamic and expansive set of labels, LSeg reduces reliance on exhaustive annotations, enhancing the adaptability of AI systems in real-world applications such as autonomous driving and medical imaging.
Theoretically, the work paves the way for new research into integrating pre-trained language and vision-language models with traditional computer vision tasks. Future studies may focus on improving the shared embedding space and on enhancing the flexibility and efficiency of such models.
Conclusion
The paper makes a substantial contribution to zero-shot semantic segmentation through LSeg. By effectively utilizing pre-trained vision-language models such as CLIP, the work demonstrates the feasibility of building flexible and general semantic segmentation models. This flexibility comes while remaining competitive with existing zero-shot approaches, marking a meaningful advance for computer vision applications and opening new avenues for further exploration in the AI community.