
RemoteSAM: Towards Segment Anything for Earth Observation (2505.18022v3)

Published 23 May 2025 in cs.CV

Abstract: We aim to develop a robust yet flexible visual foundation model for Earth observation. It should possess strong capabilities in recognizing and localizing diverse visual targets while providing compatibility with various input-output interfaces required across different task scenarios. Current systems cannot meet these requirements, as they typically utilize task-specific architecture trained on narrow data domains with limited semantic coverage. Our study addresses these limitations from two aspects: data and modeling. We first introduce an automatic data engine that enjoys significantly better scalability compared to previous human annotation or rule-based approaches. It has enabled us to create the largest dataset of its kind to date, comprising 270K image-text-mask triplets covering an unprecedented range of diverse semantic categories and attribute specifications. Based on this data foundation, we further propose a task unification paradigm that centers around referring expression segmentation. It effectively handles a wide range of vision-centric perception tasks, including classification, detection, segmentation, grounding, etc., using a single model without any task-specific heads. Combining these innovations on data and modeling, we present RemoteSAM, a foundation model that establishes new SoTA on several earth observation perception benchmarks, outperforming other foundation models such as Falcon, GeoChat, and LHRS-Bot with significantly higher efficiency. Models and data are publicly available at https://github.com/1e12Leon/RemoteSAM.

Summary

Overview of RemoteSAM: Segmenting Anything for Earth Observation

The paper "RemoteSAM: Towards Segment Anything for Earth Observation" proposes a novel approach to developing a segment anything model tailored for Earth observation tasks. This research is centered on addressing the inherent limitations of current visual recognition systems, which predominantly feature task-specific architectures with narrow data scopes and limited semantic coverage. The authors introduce RemoteSAM, a comprehensive vision foundation model that unifies multiple vision-centric perception tasks through robust data and modeling strategies.

Key contributions include an automatic data engine that scales Earth observation training data well beyond the narrow, template-based datasets used previously. The resulting dataset comprises approximately 270,000 image-text-mask triplets, significantly broadening the semantic coverage and attribute diversity of available remote sensing data; a minimal sketch of such a triplet record appears below. This expanded dataset serves as the cornerstone for a foundation model that handles diverse tasks such as classification, detection, segmentation, and grounding without task-specific modifications.
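To make the data format concrete, here is a minimal sketch of what one image-text-mask triplet record could look like in code. The `Triplet` class, its field names, and the consistency check are illustrative assumptions, not the paper's actual schema.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class Triplet:
    """One illustrative image-text-mask record (hypothetical schema).

    The paper's data engine pairs each image with a referring expression
    and a pixel mask for the referred target; the exact field names here
    are assumptions for illustration.
    """
    image: np.ndarray    # H x W x 3 remote sensing image
    expression: str      # e.g. "the large storage tank near the dock"
    mask: np.ndarray     # H x W binary mask of the referred target


def is_consistent(t: Triplet) -> bool:
    """Basic sanity check: the mask must match the image's spatial size."""
    return t.mask.shape == t.image.shape[:2]
```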

RemoteSAM employs an architectural paradigm centered on referring expression segmentation, which serves as a versatile interface accommodating the inputs and outputs of many tasks (see the sketch after this paragraph). This approach allows RemoteSAM to establish new state-of-the-art results across multiple benchmarks, surpassing existing models such as Falcon, GeoChat, and LHRS-Bot, particularly on tasks requiring pixel-level understanding.
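The following sketch illustrates, in principle, how a single referring-segmentation model can subsume other tasks: detection boxes and image-level labels are derived from the predicted mask rather than from task-specific heads. The `segment` function, the thresholds, and the output conventions are hypothetical stand-ins, not the paper's API.

```python
import numpy as np


def segment(image: np.ndarray, expression: str) -> np.ndarray:
    """Stand-in for a referring-segmentation model: returns an H x W
    binary mask for the target described by `expression` (hypothetical)."""
    raise NotImplementedError


def detect(image: np.ndarray, expression: str) -> tuple | None:
    """Detection via segmentation: the bounding box of the predicted mask."""
    mask = segment(image, expression)
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return None  # referred target not found
    return (int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max()))


def classify(image: np.ndarray, category: str, min_pixels: int = 10) -> bool:
    """Classification via segmentation: the category counts as present if
    its mask covers at least `min_pixels` pixels (threshold is an assumption)."""
    return int(segment(image, category).sum()) >= min_pixels
```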

Numerical Results and Bold Claims

RemoteSAM demonstrates substantial improvements on pixel-level tasks, particularly referring expression segmentation, with reported gains of over 3% in mean Intersection over Union (mIoU) over previous methods on several benchmark datasets; a short sketch of how this metric is computed follows. It also achieves state-of-the-art performance in semantic segmentation, outperforming recent foundation models while using far fewer parameters (millions versus billions). The results indicate that RemoteSAM resolves existing deficiencies in fine-grained pixel-wise prediction while remaining resource-efficient.
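For readers unfamiliar with the metric, here is a short, self-contained sketch of how mIoU is conventionally computed for segmentation; it reflects the standard definition, not code from the paper.

```python
import numpy as np


def mean_iou(pred: np.ndarray, target: np.ndarray, num_classes: int) -> float:
    """Mean Intersection-over-Union for H x W integer label maps.

    Classes absent from both prediction and ground truth are skipped so
    that they neither reward nor penalize the score.
    """
    ious = []
    for c in range(num_classes):
        p, t = (pred == c), (target == c)
        union = np.logical_or(p, t).sum()
        if union == 0:
            continue
        ious.append(np.logical_and(p, t).sum() / union)
    return float(np.mean(ious))
```

On this scale, a gain of "over 3% mIoU", read as percentage points, corresponds to moving from, say, 0.68 to above 0.71 on the same benchmark.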

Implications and Future Directions

The development of RemoteSAM opens new avenues for flexible foundation models in Earth observation. By harnessing semantically rich datasets and a unified architecture, it adapts to varied application scenarios and supports better integration of knowledge across domains. Practically, these advances promise improvements in remote sensing tasks central to urban planning, agricultural monitoring, and disaster management.

The approach also underscores the potential for leveraging multimodal data and semantic diversity to create models capable of inferring complex spatial dynamics and relationships at high resolutions. The research suggests that this segmentation-centered paradigm can be further refined and expanded, potentially serving as a blueprint for future developments in vision models targeting other domains beyond remote sensing.

In conclusion, RemoteSAM marks a substantial step toward versatile Earth observation systems capable of segmenting diverse elements within complex scenes. The paper lays critical groundwork for the task-unified architectures needed for advanced AI applications in spatial and environmental analytics.
