Towards Realistic Open-Vocabulary Remote Sensing Segmentation: Benchmark and Baseline

Published 17 Apr 2026 in cs.CV | (2604.15652v1)

Abstract: Open-vocabulary remote sensing image segmentation (OVRSIS) remains underexplored due to fragmented datasets, limited training diversity, and the lack of evaluation benchmarks that reflect realistic geospatial application demands. Our previous OVRSISBenchV1 established an initial cross-dataset evaluation protocol, but its limited scope is insufficient for assessing realistic open-world generalization. To address this issue, we propose OVRSISBenchV2, a large-scale and application-oriented benchmark for OVRSIS. We first construct OVRSIS95K, a balanced dataset of about 95K image-mask pairs covering 35 common semantic categories across diverse remote sensing scenes. Built upon OVRSIS95K and 10 downstream datasets, OVRSISBenchV2 contains 170K images and 128 categories, substantially expanding scene diversity, semantic coverage, and evaluation difficulty. Beyond standard open-vocabulary segmentation, it further includes downstream protocols for building extraction, road extraction, and flood detection, thereby better reflecting realistic geospatial application demands and complex deployment scenarios. We also propose Pi-Seg, a baseline for OVRSIS. Pi-Seg improves transferability through a positive-incentive noise mechanism, where learnable and semantically guided perturbations broaden the visual-text feature space during training. Extensive experiments on OVRSISBenchV1, OVRSISBenchV2, and downstream tasks show that Pi-Seg delivers strong and consistent results, particularly on the more challenging OVRSISBenchV2 benchmark. Our results highlight both the importance of realistic benchmark design and the effectiveness of perturbation-based transfer for OVRSIS. The code and datasets are available at https://github.com/LiBingyu01/RSKT-Seg/tree/Pi-Seg.

Authors (7)

Summary

The paper presents a comprehensive benchmark (OVRSIS95K and OVRSISBenchV2) alongside Pi-Seg, a novel baseline tailored for open-vocabulary remote sensing segmentation.
It utilizes positive-incentive noise perturbations via Text-SPM and Image-SPM modules to enhance feature alignment and improve generalization across diverse RS data.
Extensive experiments show that Pi-Seg achieves strong performance gains, efficient computation, and robust transferability to tasks like building and road extraction.

Problem Formulation and Motivation

Open-vocabulary remote sensing image segmentation (OVRSIS) differs fundamentally from conventional open-vocabulary segmentation (OVS) on natural images because of the distinctive visual and geometric properties of remote sensing (RS) data. Standard OVS models, built primarily on vision-language models (VLMs) such as CLIP, are insufficient for the RS domain: they lack the rotation invariance and robust multi-scale modeling required by the top-down perspectives and highly diverse object scales inherent in RS imagery. Furthermore, fragmented datasets and non-unified evaluation protocols impede progress in OVRSIS, as prior benchmarks suffer from insufficient scene diversity, class imbalance, and limited applicability to real-world geospatial tasks.

This work systematizes and extends the OVRSIS task by proposing a comprehensive benchmarking platform and a strong baseline method. The chief contributions are: (1) the construction of OVRSIS95K, a large and balanced training set; (2) establishment of OVRSISBenchV2, a unified and realistic evaluation suite covering standard and downstream geospatial tasks; and (3) introduction of Pi-Seg, a noise-aware, perturbation-injected baseline for OVRSIS.

OVRSIS95K and OVRSISBenchV2: Benchmark Construction

To address the severe limitations of prior datasets, OVRSIS95K is constructed as a new large-scale dataset containing ~95,000 image-mask pairs, balanced over 35 semantic categories and structured into five core scene domains: town, industrial, forest, waterfront, and wasteland. Data curation relies on a scalable semi-automated pipeline: caption-driven category parsing, automated mask generation and human-audited correction. Audits ensure 97.25% positive category acceptance and 91.66% mask acceptance rates, with robust correction procedures for the remainder, ensuring high annotation quality and balanced semantic/scene representation.

Building on OVRSIS95K, OVRSISBenchV2 aggregates ten diverse downstream remote sensing datasets (e.g., DLRSD, UAVid, FLAIR, VDD, LoveDA) for a total of 170,000+ annotated images and 128 semantic categories, ensuring wide coverage of sensing platforms, spatial resolutions, and scene distributions. The protocol enforces open-vocabulary training and cross-dataset transfer, with test sets whose categories partly overlap with and partly fall outside the training categories, so that generalization is evaluated rigorously. Importantly, it incorporates three downstream, application-oriented protocols (building extraction, road extraction, and flood detection) that mirror real-world geospatial decision tasks.

Pi-Seg: Perturbation-Injected Framework for OVRSIS

Pi-Seg is designed as a lightweight, efficient, and highly transferable baseline for open-vocabulary RS segmentation. The core idea is to regularize the vision-language embedding alignment via semantically guided stochastic perturbations during training, referred to as positive-incentive noise (Pi-Noise). This prevents overfitting to narrow feature distributions, improves generalization to unseen classes, and removes the need for heavy external encoders (as used in, e.g., RSKT-Seg).
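The perturbation idea can be illustrated with a minimal sketch. It assumes a reparameterized form in which standard Gaussian noise is multiplied by a scale and shifted by a bias, and it is applied only during training. The function name `perturb`, the per-dimension `scale` and `bias` lists, and the choice of Gaussian noise are illustrative assumptions, not the paper's actual Text-SPM or Image-SPM implementation, which is additionally semantically guided.

```python
import random


def perturb(features, scale, bias, training=True, rng=None):
    """Return features + scale * eps + bias, with eps ~ N(0, 1) per dimension.

    `scale` and `bias` stand in for learnable parameters (hypothetical
    per-dimension lists). At inference time the features pass through
    unchanged, so the noise only acts as a training-time regularizer.
    """
    if not training:
        return list(features)
    rng = rng or random.Random()
    return [
        f + s * rng.gauss(0.0, 1.0) + b
        for f, s, b in zip(features, scale, bias)
    ]


if __name__ == "__main__":
    text_embedding = [0.5, -1.0, 2.0]
    noisy = perturb(text_embedding, scale=[0.1] * 3, bias=[0.0] * 3,
                    rng=random.Random(0))
    print(noisy)
```

In a real training setup the scale and bias would be updated by gradient descent alongside the rest of the model, which is what would let the noise become task-beneficial rather than arbitrary.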

The architecture consists of:

CLIP Encoders: Extraction of text and dense visual features.
Text-SPM and Image-SPM Modules: Learnable perturbation modules that, respectively, broaden the neighborhood of each text (semantic) prototype and inject adaptive spatial noise into the visual features. The noise can be drawn from Gaussian, Laplace, Student-t, or uniform distributions with learnable scale and bias, so the mechanism does not depend on one particular distribution.
Cost Volume Construction and Aggregation: Dense pixel-text similarity maps are constructed and refined with spatial and class aggregation modules for robust alignment and smoothness.
Decoder: Upsampling and final segmentation prediction.
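The cost volume step can be sketched as follows, assuming that "dense pixel-text similarity" means the cosine similarity between each pixel feature and each class text embedding. The helper names and the nested-list layout are illustrative assumptions. The final argmax is only a stand-in for the aggregation modules and decoder, which are not reproduced here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)


def cost_volume(pixel_feats, text_feats):
    """Build an H x W x C similarity volume.

    pixel_feats: H x W x D nested lists of dense visual features.
    text_feats:  C x D nested lists, one embedding per class name.
    """
    return [[[cosine(p, t) for t in text_feats] for p in row]
            for row in pixel_feats]


def argmax_labels(volume):
    """Pick the most similar class per pixel (stand-in for the decoder)."""
    return [[max(range(len(scores)), key=scores.__getitem__)
             for scores in row] for row in volume]


if __name__ == "__main__":
    pixels = [[[1.0, 0.0], [0.0, 1.0]]]   # 1 x 2 image, D = 2
    classes = [[1.0, 0.0], [0.0, 1.0]]    # 2 class embeddings
    print(argmax_labels(cost_volume(pixels, classes)))
```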

Extensive ablation shows that regularization via both branches is essential: Text-SPM alone can harm performance due to uncoordinated semantic deviations, while Image-SPM alone provides moderate benefits. Their combination yields consistent performance boosts.

Experimental Analysis

Quantitative results establish Pi-Seg as a strong and robust baseline. On both OVRSISBenchV1 and the substantially more challenging OVRSISBenchV2, Pi-Seg achieves the best or highly competitive mean mIoU and mACC across both ViT-B and ViT-L configurations. For instance, on OVRSISBenchV2 with ViT-L, Pi-Seg achieves m-mIoU = 44.40 and m-mACC = 63.16, outperforming prior state-of-the-art heavy frameworks (2604.15652). Performance gains are robust across multiple perturbation distributions and parameter regimes, and remain stable under varying random seeds.
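For readers unfamiliar with the metrics, the sketch below shows one common way to compute mIoU on a single dataset and then average it across datasets to obtain a mean-of-means score such as m-mIoU. The exact averaging convention used in the paper (for example, how absent classes are handled) is an assumption here, and the function names are illustrative.

```python
def miou(pred, gt, num_classes):
    """Mean intersection-over-union over classes present in pred or gt.

    pred, gt: flat lists of integer class labels of equal length.
    Classes that appear in neither list are skipped (an assumed convention).
    """
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, g in zip(pred, gt) if p == c and g == c)
        union = sum(1 for p, g in zip(pred, gt) if p == c or g == c)
        if union > 0:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


def mean_over_datasets(scores):
    """Average a per-dataset metric, e.g. to obtain m-mIoU from mIoU values."""
    return sum(scores.values()) / len(scores)


if __name__ == "__main__":
    pred = [0, 0, 1, 1]
    gt = [0, 1, 1, 1]
    # Class 0: IoU 1/2; class 1: IoU 2/3.
    print(miou(pred, gt, num_classes=2))
```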

On downstream tasks, Pi-Seg attains new best results (e.g., mIoU 85.88 on WHUAerial for building extraction), validating transfer to application-level scenarios. The model generalizes across image scales, demonstrating particular efficacy in high-resolution regimes that challenge feature continuity and boundary preservation.

Efficiency analyses reveal that Pi-Seg is considerably more parameter- and computation-efficient than RSKT-Seg. It maintains inference complexity on par with CAT-Seg but avoids the severe overhead of prior methods that require sliding-window evaluation, making it compatible with high-resolution practical deployment.
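To give a rough sense of why sliding-window evaluation is costly, the sketch below counts how many forward passes a tiled evaluation needs for one image. The window size and stride are hypothetical values chosen for illustration and are not taken from the paper or from any specific prior method.

```python
import math


def num_windows(height, width, window, stride):
    """Number of square windows needed to cover an image with a given stride.

    Each window corresponds to one forward pass of the model, so this is a
    simple proxy for the inference overhead of sliding-window evaluation.
    """
    rows = max(1, math.ceil((height - window) / stride) + 1)
    cols = max(1, math.ceil((width - window) / stride) + 1)
    return rows * cols


if __name__ == "__main__":
    # Hypothetical setting: 1024 x 1024 image, 384-pixel window, stride 256.
    print(num_windows(1024, 1024, window=384, stride=256))
```

Under these assumed settings a single 1024 x 1024 image requires 16 forward passes, compared with one pass for a model that processes the whole image at once.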

Qualitative studies and dynamic correlation analysis demonstrate that Pi-Seg: (1) yields spatially complete, semantically coherent segmentations with superior boundary awareness and background suppression; (2) adapts the feature alignment during training to maximize target responses and minimize non-target activation, which suggests that the injected noise acts as a semantic incentive rather than as random corruption.

Limitations and Future Research Directions

Despite its advantages, Pi-Seg still confounds visually similar, fine-grained categories due to reliance on brief label prompts and absence of explicit high-order context modeling. Class confusions (e.g., ship vs airplane vs vehicle) persist, underscoring the need for richer semantic prompts, multi-level taxonomic supervision, or advanced context/attribute modeling. Another open avenue is the targeted study of perturbation strategies better tuned to RS-specific intra-class structure and inter-class relationships. Additionally, Pi-Seg does not explicitly enforce rotation equivariance, instead relying on feature-space smoothing that indirectly supports orientation invariance.

Theoretical and Practical Implications

The introduction of OVRSIS95K and OVRSISBenchV2 supplies the field with a much-needed, unified foundation for open-vocabulary geospatial semantic segmentation, analogous to the role played by COCO/ADE20K in natural image domains, and will likely catalyze advances in robust, scalable RS scene understanding. The positive-incentive perturbation approach of Pi-Seg formalizes a regularization mechanism with general utility for cross-domain transfer and open-set recognition, potentially influencing future VLM adaptation methodologies.

From an application perspective, Pi-Seg's robustness and transferability enhance the feasibility of deploying segmentation models for dynamic, real-world geospatial tasks (e.g., disaster monitoring, infrastructure mapping, environmental change detection) without requiring extensive task-specific retraining or annotation.

Conclusion

This work establishes new standards for open-vocabulary remote sensing segmentation by providing a large-scale, unified benchmark (OVRSISBenchV2), robust data foundation (OVRSIS95K), and an efficient, perturbation-injected baseline (Pi-Seg) facilitating strong transfer performance under realistic geospatial demands. The systematic benchmarking and novel regularization strategies set a foundation for future studies in robust, open-set semantic segmentation across Earth observation domains and beyond.
