- The paper introduces Text2Seg, a novel pipeline that leverages text-guided visual foundation models for semantic segmentation of remote sensing images.
- The methodology combines Grounding DINO’s text-driven bounding-box extraction, SAM’s zero-shot segmentation, and CLIP’s image–text semantic alignment to improve segmentation precision for categories named in a text prompt.
- Experiments across multiple datasets show reliable segmentation for clear categories like buildings while exposing challenges with abstract classes such as vegetation.
Text2Seg: Remote Sensing Image Semantic Segmentation via Text-Guided Visual Foundation Models
The paper explores the application of foundation models in remote sensing, focusing on semantic segmentation of remote sensing imagery through a text-guided approach. The authors introduce Text2Seg, a pipeline that orchestrates multiple visual foundation models to perform semantic segmentation driven by text prompts. Its key components are the Segment Anything Model (SAM), Grounding DINO, and CLIP, each contributing a distinct capability to the overall task.
Motivation and Context
Recent advances in foundation models such as GPT-4 and LLaMA have demonstrated outstanding zero-shot capabilities across many domains. Parallel developments in computer vision have produced models like Grounding DINO and SAM, which offer open-set object detection and promptable segmentation, respectively. Despite these advances, remote sensing images pose distinct challenges due to their heterogeneity and dissimilarity from the standard datasets such models are trained on. This paper aims to broaden the application of visual foundation models to remote sensing by proposing a pipeline that minimizes the need for extensive model tuning.
Methods Overview
Text2Seg is structured to integrate various foundation models, each pre-trained on diverse datasets:
- SAM: Developed by Meta AI, SAM is a promptable segmentation model with strong zero-shot performance. Because it can be guided by point and bounding-box prompts, SAM serves as the mask generator at the core of the Text2Seg pipeline.
- Grounding DINO: An open-set detector that grounds language in images, Grounding DINO produces bounding boxes for the objects referred to in a text prompt, telling SAM where the target objects are (a minimal sketch of this box-to-mask handoff follows this list).
- CLIP and CLIP Surgery: CLIP excels at zero-shot image–text matching, while CLIP Surgery generates explanation maps that can serve as weak segmentation cues.
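Below is a minimal sketch of this box-prompting handoff, assuming the publicly released GroundingDINO and segment_anything packages. The checkpoint paths, the "building" prompt, and the detection thresholds are placeholders for illustration, not the paper's settings.

```python
# Sketch: text prompt -> Grounding DINO boxes -> SAM masks.
# Paths, prompt, and thresholds are illustrative placeholders.
import numpy as np
import torch
from groundingdino.util.inference import load_model, load_image, predict
from segment_anything import sam_model_registry, SamPredictor

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# Load the open-set detector and the segmenter (checkpoint paths assumed local).
dino = load_model("GroundingDINO_SwinT_OGC.py", "groundingdino_swint_ogc.pth", device=DEVICE)
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth").to(DEVICE)
predictor = SamPredictor(sam)

# Grounding DINO returns boxes for the phrase in normalized cxcywh format.
image_source, image = load_image("tile_0001.png")
boxes, logits, phrases = predict(
    model=dino, image=image, caption="building",
    box_threshold=0.3, text_threshold=0.25, device=DEVICE,
)

# Convert boxes to absolute xyxy pixel coordinates for SAM.
h, w, _ = image_source.shape
boxes_xyxy = boxes * torch.tensor([w, h, w, h])
boxes_xyxy[:, :2] -= boxes_xyxy[:, 2:] / 2   # center -> top-left corner
boxes_xyxy[:, 2:] += boxes_xyxy[:, :2]       # width/height -> bottom-right corner

# Prompt SAM with each detected box and merge the masks into one class map.
predictor.set_image(image_source)
class_mask = np.zeros((h, w), dtype=bool)
for box in boxes_xyxy:
    masks, scores, _ = predictor.predict(box=box.numpy(), multimask_output=False)
    class_mask |= masks[0].astype(bool)
```

Running this once per category prompt yields one binary mask per class, which can then be compared against the ground-truth annotations.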
The proposed architecture combines pre-SAM and post-SAM strategies. In the pre-SAM strategies, Grounding DINO and CLIP Surgery supply bounding-box and point prompts, respectively, that steer SAM toward the queried category. In the post-SAM strategy, SAM first segments all instances in the image, and CLIP then filters the resulting masks by their semantic alignment with the text prompt; a sketch of this filtering step follows below.
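The post-SAM path can be sketched as follows, assuming the segment_anything automatic mask generator and OpenAI's clip package. The crop-based scoring and the 0.25 similarity threshold are illustrative assumptions, not the paper's exact procedure.

```python
# Sketch: post-SAM path -- segment everything, then keep masks whose crops
# CLIP judges similar to the text prompt. Threshold and cropping strategy
# are illustrative assumptions.
import clip
import numpy as np
import torch
from PIL import Image
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth").to(DEVICE)
mask_generator = SamAutomaticMaskGenerator(sam)
clip_model, preprocess = clip.load("ViT-B/32", device=DEVICE)

def post_sam_segment(image_rgb: np.ndarray, prompt: str, threshold: float = 0.25) -> np.ndarray:
    """Return a binary mask of all SAM regions CLIP aligns with `prompt`.

    image_rgb: HxWx3 uint8 RGB array.
    """
    proposals = mask_generator.generate(image_rgb)  # class-agnostic masks
    text = clip.tokenize([prompt]).to(DEVICE)
    keep = np.zeros(image_rgb.shape[:2], dtype=bool)

    with torch.no_grad():
        text_feat = clip_model.encode_text(text)
        text_feat /= text_feat.norm(dim=-1, keepdim=True)

        for prop in proposals:
            x, y, w, h = map(int, prop["bbox"])      # crop around the region
            crop = Image.fromarray(image_rgb[y:y + h, x:x + w])
            image_in = preprocess(crop).unsqueeze(0).to(DEVICE)
            img_feat = clip_model.encode_image(image_in)
            img_feat /= img_feat.norm(dim=-1, keepdim=True)

            similarity = (img_feat @ text_feat.T).item()  # cosine similarity
            if similarity > threshold:
                keep |= prop["segmentation"]              # accept this region
    return keep
```

As in the pre-SAM sketch, the result is a per-category binary mask derived from the text prompt alone, here obtained by filtering SAM's class-agnostic proposals rather than by guiding SAM's prompts.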
Experiments and Results
The pipeline was tested across several standard remote sensing datasets, including UAVid, LoveDA, Vaihingen, and Potsdam. The experiments revealed varied performance across datasets and semantic categories:
- UAVid and LoveDA: Demonstrated strong segmentation results for well-defined categories like buildings and roads. The combination of Grounding DINO and SAM often yielded the most precise segmentations.
- Vaihingen and Potsdam: Performance was more variable, particularly on Vaihingen, whose near-infrared imagery poses challenges for generic visual models. Strong results were still observed for well-defined categories such as buildings, but more abstract classes like vegetation were difficult to segment accurately.
The research also identifies challenges inherent to remote sensing data, which vary across geographic regions, acquisition times, and sensors. This heterogeneity underscores the need for foundation models that generalize to, or can be adapted for, such specific contexts.
Conclusion and Future Directions
The paper presents a notable contribution by illustrating the integration of existing visual foundation models to address the unique challenges in remote sensing image segmentation. While the pipeline shows promise, the inherent characteristics of remote sensing imagery still present obstacles that require tailored solutions. Future research directions may involve further refining foundation models for domain-specific tasks and advancing visual prompt engineering to optimize performance in this field. These developments could considerably enhance the applicability and efficiency of foundation models in a wider array of real-world tasks.