- The paper presents a comprehensive benchmark (OVRSIS95K and OVRSISBenchV2) alongside Pi-Seg, a novel baseline tailored for open-vocabulary remote sensing segmentation.
- It utilizes positive-incentive noise perturbations via Text-SPM and Image-SPM modules to enhance feature alignment and improve generalization across diverse RS data.
- Extensive experiments show that Pi-Seg achieves strong performance gains, efficient computation, and robust transferability to tasks like building and road extraction.
Towards Realistic Open-Vocabulary Remote Sensing Segmentation: Benchmark and Baseline
Open-vocabulary remote sensing image segmentation (OVRSIS) differs fundamentally from conventional open-vocabulary segmentation (OVS) on natural images because of the distinctive visual and geometric properties of remote sensing (RS) data. Standard OVS models, built primarily on vision-language models (VLMs) such as CLIP, fall short in the RS domain: they lack the rotation invariance and robust multi-scale modeling that top-down viewpoints and highly varied object scales require. Progress is further impeded by fragmented datasets and inconsistent evaluation protocols, since prior benchmarks suffer from insufficient scene diversity, class imbalance, and limited applicability to real-world geospatial tasks.
This work systematizes and extends the OVRSIS task by proposing a comprehensive benchmarking platform and a strong baseline method. The chief contributions are: (1) the construction of OVRSIS95K, a large and balanced training set; (2) establishment of OVRSISBenchV2, a unified and realistic evaluation suite covering standard and downstream geospatial tasks; and (3) introduction of Pi-Seg, a noise-aware, perturbation-injected baseline for OVRSIS.
OVRSIS95K and OVRSISBenchV2: Benchmark Construction
To address the limitations of prior datasets, OVRSIS95K is constructed as a new large-scale dataset of roughly 95,000 image-mask pairs, balanced over 35 semantic categories and organized into five core scene domains: town, industrial, forest, waterfront, and wasteland. Data curation relies on a scalable semi-automated pipeline: caption-driven category parsing, automated mask generation, and human-audited correction. Audits report a 97.25% positive-category acceptance rate and a 91.66% mask acceptance rate, and the rejected remainder goes through correction procedures, which supports high annotation quality and balanced semantic and scene representation.
Building on OVRSIS95K, OVRSISBenchV2 aggregates ten diverse downstream remote sensing datasets (e.g., DLRSD, UAVid, FLAIR, VDD, LoveDA), totaling more than 170,000 annotated images and 128 semantic categories and covering a wide range of sensing platforms, spatial resolutions, and scene distributions. The protocol enforces open-vocabulary training and cross-dataset transfer, with test vocabularies that partly overlap with, and are partly disjoint from, the training categories, so that generalization is evaluated rigorously. It also incorporates three application-oriented downstream protocols (building extraction, road extraction, and flood detection) that mirror real-world geospatial decision tasks.
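To make the overlap between training and test vocabularies concrete, the following minimal sketch splits a test dataset's categories into those seen during training and those that are new. The function name `split_vocabulary` and all category names are hypothetical placeholders, not taken from the benchmark, and the case-insensitive string match is an assumption about how labels would be compared.

```python
def split_vocabulary(train_classes, test_classes):
    """Split test categories into seen and unseen relative to training.

    Matching is case-insensitive on the raw label string (an assumption).
    """
    train = {c.lower() for c in train_classes}
    seen = sorted(c for c in test_classes if c.lower() in train)
    unseen = sorted(c for c in test_classes if c.lower() not in train)
    return seen, unseen


# Hypothetical category lists, for illustration only.
train_classes = ["building", "road", "water", "forest"]
test_classes = ["Building", "road", "solar panel", "runway"]

seen, unseen = split_vocabulary(train_classes, test_classes)
unseen_ratio = len(unseen) / len(test_classes)
```

In this toy case half of the test categories never appear in training, which is the situation an open-vocabulary protocol is meant to probe.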
Pi-Seg: Perturbation-Injected Framework for OVRSIS
Pi-Seg is designed as a lightweight, efficient, and highly transferable baseline for open-vocabulary RS segmentation. The core idea is to regularize the vision-language embedding alignment through semantically guided stochastic perturbations during training, referred to as positive-incentive noise (Pi-Noise). This regularization is intended to prevent overfitting to narrow feature distributions, improve generalization to unseen classes, and remove the need for heavy external encoders (as used in, e.g., RSKT-Seg).
The architecture consists of:
- CLIP Encoders: Extraction of text and dense visual features.
- Text-SPM and Image-SPM Modules: Learnable perturbation modules. Text-SPM broadens the neighborhood of each semantic (text) prototype, and Image-SPM injects adaptive spatial noise into the visual features. The noise can be drawn from Gaussian, Laplace, Student-t, or uniform distributions with a learnable scale and bias, so the mechanism is not tied to a single noise family.
- Cost Volume Construction and Aggregation: Dense pixel-text similarity maps are constructed and refined with spatial and class aggregation modules for robust alignment and smoothness.
- Decoder: Upsampling and final segmentation prediction.
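The components listed above can be illustrated with a small NumPy sketch. It is not the authors' implementation: the additive form `features + scale * noise + bias`, the fixed scale and bias values (learnable in the paper), the Student-t degrees of freedom, and all function names are assumptions made for illustration. Random arrays stand in for CLIP features, and the aggregation and decoder stages are replaced by a plain argmax.

```python
import numpy as np


def sample_noise(rng, shape, dist="gaussian"):
    """Draw zero-centred noise from one of the families named in the text."""
    if dist == "gaussian":
        return rng.standard_normal(shape)
    if dist == "laplace":
        return rng.laplace(0.0, 1.0, shape)
    if dist == "student_t":
        return rng.standard_t(df=5, size=shape)  # df=5 is an arbitrary choice
    if dist == "uniform":
        return rng.uniform(-1.0, 1.0, shape)
    raise ValueError(f"unknown distribution: {dist}")


def perturb(features, scale, bias, rng, dist="gaussian", training=True):
    """Add scaled, shifted noise during training; identity at inference."""
    if not training:
        return features
    return features + scale * sample_noise(rng, features.shape, dist) + bias


def l2_normalize(x, eps=1e-8):
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def cost_volume(pixel_feats, text_feats):
    """Cosine similarity of every pixel with every class embedding.

    pixel_feats: (H, W, D), text_feats: (C, D) -> (H, W, C)
    """
    return l2_normalize(pixel_feats) @ l2_normalize(text_feats).T


rng = np.random.default_rng(0)
H, W, D, C = 4, 4, 8, 3
pixel_feats = rng.standard_normal((H, W, D))  # stand-in for dense visual features
text_feats = rng.standard_normal((C, D))      # stand-in for class text embeddings

# Fixed here for simplicity; in a trained model these would be learned.
text_scale, text_bias = np.full(D, 0.05), np.zeros(D)
image_scale, image_bias = np.full(D, 0.10), np.zeros(D)

noisy_text = perturb(text_feats, text_scale, text_bias, rng, "laplace")
noisy_pixels = perturb(pixel_feats, image_scale, image_bias, rng, "gaussian")

cv = cost_volume(noisy_pixels, noisy_text)  # (H, W, C) similarity map
pred = cv.argmax(axis=-1)                   # per-pixel class index
```

Passing `training=False` returns the features unchanged, reflecting that the perturbations are described as a training-time regularizer.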
Extensive ablation shows that regularization via both branches is essential: Text-SPM alone can harm performance due to uncoordinated semantic deviations, while Image-SPM alone provides moderate benefits. Their combination yields consistent performance boosts.
Experimental Analysis
Quantitative results establish Pi-Seg as a strong and robust baseline. On both OVRSISBenchV1 and the substantially more challenging OVRSISBenchV2, Pi-Seg achieves the best or highly competitive mean mIoU and mACC across both ViT-B and ViT-L configurations. For instance, on OVRSISBenchV2 with ViT-L, Pi-Seg achieves m-mIoU = 44.40 and m-mACC = 63.16, outperforming prior state-of-the-art heavy frameworks (2604.15652). Performance gains are robust across multiple perturbation distributions and parameter regimes, and remain stable under varying random seeds.
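For readers unfamiliar with the metrics, the sketch below computes mIoU and mACC from a confusion matrix and then averages them over datasets. Reading m-mIoU and m-mACC as the unweighted mean of per-dataset scores is an assumption based on the phrase "mean mIoU and mACC"; the labels are toy data, and skipping classes that are absent from the ground truth is one common convention, not necessarily the benchmark's.

```python
import numpy as np


def confusion_matrix(gt, pred, num_classes):
    """Rows index ground-truth classes, columns index predicted classes."""
    gt = np.asarray(gt).ravel()
    pred = np.asarray(pred).ravel()
    idx = gt * num_classes + pred
    counts = np.bincount(idx, minlength=num_classes ** 2)
    return counts.reshape(num_classes, num_classes)


def miou_macc(cm):
    """Return (mIoU, mACC) in percent, over classes present in the ground truth."""
    tp = np.diag(cm).astype(float)
    gt_count = cm.sum(axis=1)
    pred_count = cm.sum(axis=0)
    union = gt_count + pred_count - tp
    present = gt_count > 0
    iou = tp[present] / union[present]
    acc = tp[present] / gt_count[present]
    return 100.0 * iou.mean(), 100.0 * acc.mean()


# Toy "dataset A": two classes, one mistake per class.
gt_a = [[0, 0, 1, 1], [0, 0, 1, 1]]
pred_a = [[0, 0, 1, 0], [0, 1, 1, 1]]
# Toy "dataset B": perfect prediction.
gt_b = [0, 1, 1, 1]
pred_b = [0, 1, 1, 1]

scores = [
    miou_macc(confusion_matrix(gt_a, pred_a, 2)),
    miou_macc(confusion_matrix(gt_b, pred_b, 2)),
]
m_miou = float(np.mean([s[0] for s in scores]))
m_macc = float(np.mean([s[1] for s in scores]))
```

Dataset A has a per-class IoU of 3/5 and a per-class accuracy of 3/4, so the averaged scores across the two toy datasets come out at about 80 and 87.5.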
On downstream tasks, Pi-Seg attains new best results (e.g., mIoU 85.88 on WHUAerial for building extraction), validating transfer to application-level scenarios. The model generalizes across image scales, demonstrating particular efficacy in high-resolution regimes that challenge feature continuity and boundary preservation.
Efficiency analyses reveal that Pi-Seg is considerably more parameter- and computation-efficient than RSKT-Seg. It maintains inference complexity on par with CAT-Seg but avoids the severe overhead of prior methods that require sliding-window evaluation, making it compatible with high-resolution practical deployment.
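The overhead of sliding-window evaluation can be estimated by counting how many forward passes one image requires. The sketch below is a back-of-the-envelope calculation only; the window and stride values are hypothetical and are not settings reported for any of the methods mentioned.

```python
import math


def num_windows(length, window, stride):
    """Windows needed to cover one axis, with the last window reaching the border."""
    if length <= window:
        return 1
    return math.ceil((length - window) / stride) + 1


def sliding_window_passes(height, width, window, stride):
    """Forward passes needed to cover a full image with square windows."""
    return num_windows(height, window, stride) * num_windows(width, window, stride)


# Hypothetical setting: 1024x1024 image, 384-pixel window, 256-pixel stride.
passes = sliding_window_passes(1024, 1024, window=384, stride=256)  # 4 * 4 = 16
```

Under these assumed settings a single image costs 16 forward passes, whereas a model that processes the whole image at once needs one, which is the kind of gap the efficiency comparison refers to.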
Qualitative studies and dynamic correlation analysis indicate that Pi-Seg (1) yields spatially complete, semantically coherent segmentations with better boundary awareness and background suppression, and (2) adapts the feature alignment during training to maximize target responses and minimize non-target activations, which suggests that the injected noise acts as a semantic incentive rather than as pure randomness.
Limitations and Future Research Directions
Despite its advantages, Pi-Seg still confuses visually similar, fine-grained categories because it relies on brief label prompts and lacks explicit high-order context modeling. Class confusions (e.g., ship vs. airplane vs. vehicle) persist, underscoring the need for richer semantic prompts, multi-level taxonomic supervision, or more advanced context and attribute modeling. Another open avenue is the targeted study of perturbation strategies better tuned to RS-specific intra-class structure and inter-class relationships. Additionally, Pi-Seg does not explicitly enforce rotation equivariance; it relies instead on feature-space smoothing that only indirectly supports orientation invariance.
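One simple way to probe the rotation behavior mentioned above is to compare the segmentation of a rotated image with the rotated segmentation of the original. The sketch below is an illustrative diagnostic, not an evaluation from the paper; `rotation_consistency` and the two toy models are hypothetical, and square images are assumed so that shapes match after a 90-degree rotation.

```python
import numpy as np


def rotation_consistency(segment, image, k=1):
    """Fraction of pixels where segment(rotate(x)) equals rotate(segment(x)).

    `segment` maps an (H, W, C) image to an (H, W) label map. A score of 1.0
    means the model is equivariant to this rotation on this image.
    Assumes a square image so both label maps have the same shape.
    """
    rotated_pred = segment(np.rot90(image, k, axes=(0, 1)))
    pred_rotated = np.rot90(segment(image), k, axes=(0, 1))
    return float((rotated_pred == pred_rotated).mean())


def threshold_model(image):
    """Toy per-pixel rule: depends only on pixel values, so it is equivariant."""
    return (image.mean(axis=-1) > 0.5).astype(int)


def row_biased_model(image):
    """Toy rule that depends on absolute row position: not equivariant."""
    rows = np.arange(image.shape[0])[:, None]
    return np.broadcast_to(rows % 2, image.shape[:2]).copy()


rng = np.random.default_rng(0)
image = rng.uniform(0.0, 1.0, (4, 4, 3))

score_equivariant = rotation_consistency(threshold_model, image)
score_biased = rotation_consistency(row_biased_model, image)
```

The per-pixel rule scores 1.0, while the position-dependent rule agrees on only half of the pixels in this 4x4 case. A real model would fall somewhere in between, and a lower score points to stronger orientation sensitivity.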
Theoretical and Practical Implications
The introduction of OVRSIS95K and OVRSISBenchV2 supplies the field with a much-needed, unified foundation for open-vocabulary geospatial semantic segmentation, analogous to the role played by COCO/ADE20K in natural image domains, and will likely catalyze advances in robust, scalable RS scene understanding. The positive-incentive perturbation approach of Pi-Seg formalizes a regularization mechanism with general utility for cross-domain transfer and open-set recognition, potentially influencing future VLM adaptation methodologies.
From an application perspective, Pi-Seg's robustness and transferability enhance the feasibility of deploying segmentation models for dynamic, real-world geospatial tasks (e.g., disaster monitoring, infrastructure mapping, environmental change detection) without requiring extensive task-specific retraining or annotation.
Conclusion
This work establishes new standards for open-vocabulary remote sensing segmentation by providing a large-scale, unified benchmark (OVRSISBenchV2), robust data foundation (OVRSIS95K), and an efficient, perturbation-injected baseline (Pi-Seg) facilitating strong transfer performance under realistic geospatial demands. The systematic benchmarking and novel regularization strategies set a foundation for future studies in robust, open-set semantic segmentation across Earth observation domains and beyond.