- The paper introduces a novel framework that integrates textual embeddings within a transformer-based segmentation model to effectively handle extreme domain shifts.
- It pairs text-to-pixel attention, which enhances pixel semantic clarity, with three regularization losses that maintain strong vision-language alignment.
- The method achieves state-of-the-art performance, improving mIoU on the GTA5→Cityscapes benchmark by 2.5 points over the previous best (68.9 mIoU).
Textual Query-Driven Mask Transformer for Domain Generalized Segmentation
Overview
The paper "Textual Query-Driven Mask Transformer for Domain Generalized Segmentation" by Pak, Woo, Kim, et al. proposes a novel method for domain generalized semantic segmentation (DGSS) that leverages the semantic robustness of vision-language models (VLMs). The approach integrates text embeddings directly into a transformer-based segmentation framework as "textual object queries," achieving significant improvements in generalization across varied domains.
Key Contributions
The main contributions of the paper are threefold:
- Language-Driven DGSS Approach: According to the authors, this is the first work to directly utilize text embeddings from VLMs for DGSS, effectively addressing extreme domain shifts.
- Textual Query-Driven Mask Transformer (tqdm): The authors introduce a novel segmentation framework that utilizes textual object queries enhanced by three regularization losses to support robust vision-language alignment.
- State-of-the-Art Performance: The proposed method achieves exceptional performance across multiple benchmarks, e.g., achieving 68.9 mIoU on GTA5→Cityscapes, surpassing the previous state-of-the-art by 2.5 mIoU.
Methodology
Textual Object Queries
The core idea is to leverage text embeddings from pre-trained VLMs as object queries within a transformer-based segmentation framework. These "textual object queries" are designed to encapsulate domain-invariant semantic knowledge, making them robust against domain shifts. The pipeline consists of three steps, sketched in code after the list below:
- Initial Textual Query Generation: Text embeddings for target classes are obtained from a pre-trained text encoder and processed through a multi-layer perceptron to generate initial textual object queries.
- Pixel Semantic Clarity Enhancement: The framework employs text-to-pixel attention mechanisms within the pixel decoder to enhance the semantic clarity of pixel features. This process ensures that pixel features are aligned more closely with the textual clustering centers, thereby improving their grouping by textual object queries.
- Transformer Decoder for Mask Prediction: The textual object queries are refined through multiple layers of a transformer decoder to generate final mask predictions.
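To make these steps concrete, here is a minimal PyTorch sketch. The module name (`TextualQueryMaskHead`), dimensions, and layer choices are assumptions for illustration; it shows the general flow of textual queries through text-to-pixel attention and a transformer decoder, not the authors' actual tqdm implementation, and it uses random tensors in place of frozen VLM text embeddings.

```python
# Minimal sketch of textual object queries (assumed names/shapes; not the authors' code).
import torch
import torch.nn as nn


class TextualQueryMaskHead(nn.Module):
    def __init__(self, num_classes: int, text_dim: int = 512, embed_dim: int = 256,
                 num_decoder_layers: int = 3):
        super().__init__()
        self.num_classes = num_classes
        # MLP that maps frozen VLM text embeddings to initial textual object queries.
        self.query_proj = nn.Sequential(
            nn.Linear(text_dim, embed_dim), nn.ReLU(inplace=True),
            nn.Linear(embed_dim, embed_dim),
        )
        # Text-to-pixel cross-attention used to sharpen pixel semantics toward
        # the textual clustering centers.
        self.text_to_pixel_attn = nn.MultiheadAttention(embed_dim, num_heads=8, batch_first=True)
        # Transformer decoder layers that refine the textual queries against pixel features.
        layer = nn.TransformerDecoderLayer(d_model=embed_dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_decoder_layers)

    def forward(self, pixel_feats: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        # pixel_feats: (B, C, H, W) features from the visual backbone / pixel decoder
        # text_embeds: (K, text_dim) frozen text embeddings, one per target class
        assert text_embeds.size(0) == self.num_classes
        B, C, H, W = pixel_feats.shape
        pixels = pixel_feats.flatten(2).transpose(1, 2)                         # (B, H*W, C)
        queries = self.query_proj(text_embeds).unsqueeze(0).expand(B, -1, -1)   # (B, K, C)

        # Pixel semantic clarity enhancement: pixel features attend to the textual queries.
        enhanced, _ = self.text_to_pixel_attn(pixels, queries, queries)
        pixels = pixels + enhanced

        # Refine textual queries with a transformer decoder over the pixel features.
        refined = self.decoder(tgt=queries, memory=pixels)                      # (B, K, C)

        # Per-class masks from the dot product of refined queries and pixel features.
        masks = torch.einsum("bkc,bnc->bkn", refined, pixels).view(B, -1, H, W)
        return masks


if __name__ == "__main__":
    head = TextualQueryMaskHead(num_classes=19)
    feats = torch.randn(2, 256, 32, 32)   # stand-in for pixel-decoder features
    text = torch.randn(19, 512)           # stand-in for frozen VLM text embeddings
    print(head(feats, text).shape)        # torch.Size([2, 19, 32, 32]) per-class mask logits
```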
Regularization Strategies
In addition to the core framework components, the authors implement three regularization strategies to maintain robust vision-language alignment; a hedged sketch of how such losses might be composed follows the list:
- Language Regularization: Ensures that learnable text prompts retain the semantic integrity of text embeddings.
- Vision-Language Regularization: Promotes pixel-level alignment between visual features and textual embeddings through an auxiliary segmentation loss.
- Vision Regularization: Maintains the image-level alignment capabilities of the visual backbone from the pre-trained VLM.
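The sketch below shows one plausible way such regularizers could be written in PyTorch. The function names, loss forms (MSE for the language and vision terms, a temperature-scaled pixel-text cross-entropy for the auxiliary term), and weights are assumptions for illustration, not the paper's exact formulations.

```python
# Hypothetical forms of the three regularizers (assumed, not the paper's exact losses).
import torch
import torch.nn.functional as F


def language_reg(prompt_text_embeds, frozen_text_embeds):
    # Keep embeddings produced from learnable prompts close to the frozen VLM text embeddings.
    return F.mse_loss(prompt_text_embeds, frozen_text_embeds)


def vision_language_reg(pixel_feats, text_embeds, labels, temperature=0.07):
    # Auxiliary per-pixel classification against text embeddings (cosine-similarity logits).
    pixel_feats = F.normalize(pixel_feats, dim=1)                       # (B, C, H, W)
    text_embeds = F.normalize(text_embeds, dim=1)                       # (K, C)
    logits = torch.einsum("bchw,kc->bkhw", pixel_feats, text_embeds) / temperature
    return F.cross_entropy(logits, labels, ignore_index=255)            # labels: (B, H, W)


def vision_reg(backbone_feats, frozen_vlm_feats):
    # Keep the fine-tuned backbone's image-level features close to the frozen VLM encoder's.
    return F.mse_loss(backbone_feats, frozen_vlm_feats)


def total_loss(seg_loss, l_lang, l_vl, l_vis, w_lang=1.0, w_vl=1.0, w_vis=1.0):
    # Combine the main segmentation loss with the three regularization terms.
    return seg_loss + w_lang * l_lang + w_vl * l_vl + w_vis * l_vis
```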
Empirical Results
The proposed method has been extensively evaluated under both synthetic-to-real (GTA5→{Cityscapes, BDD100K, Mapillary}) and real-to-real (Cityscapes→{BDD100K, Mapillary}) settings.
- Synthetic-to-Real Performance: The tqdm framework demonstrates substantial improvements over existing methods, achieving 68.9 mIoU on GTA5→Cityscapes. It outperforms strong baselines by leveraging domain-invariant semantic knowledge and robust pixel grouping abilities.
- Unseen Domain Generalization: Qualitative analysis shows that the method identifies target classes even under extreme domain shifts, such as hand-drawn images and unfamiliar game scenes.
Implications and Future Developments
The introduction of textual object queries represents a significant step forward in leveraging the semantic robustness of VLMs for DGSS tasks. By effectively combining textual semantics and dense visual predictions, the paper opens up new avenues for research in domain generalization and cross-modal semantic segmentation.
Future research could explore multiple directions:
- Extended Vocabulary: Leveraging larger and more varied text corpora to enrich the semantic knowledge embedded in textual object queries.
- Deployment in Real-World Applications: Implementing and testing the framework in diverse real-world scenarios, such as autonomous driving and robotic vision, where domain shifts are prevalent.
- Further Optimization: Enhancing the efficiency and scalability of the framework to handle higher-resolution images and more complex segmentation tasks.
Overall, the proposed textual query-driven mask transformer sets a solid foundation for future advancements in domain generalized semantic segmentation by embracing the inherent semantic alignment capabilities of vision-language models.