Rotated Multi-Scale Interaction Network for Referring Remote Sensing Image Segmentation (2312.12470v3)

Published 19 Dec 2023 in cs.CV

Abstract: Referring Remote Sensing Image Segmentation (RRSIS) is a new challenge that combines computer vision and natural language processing, delineating specific regions in aerial images as described by textual queries. Traditional Referring Image Segmentation (RIS) approaches have been impeded by the complex spatial scales and orientations found in aerial imagery, leading to suboptimal segmentation results. To address these challenges, we introduce the Rotated Multi-Scale Interaction Network (RMSIN), an innovative approach designed for the unique demands of RRSIS. RMSIN incorporates an Intra-scale Interaction Module (IIM) to effectively address the fine-grained detail required at multiple scales and a Cross-scale Interaction Module (CIM) for integrating these details coherently across the network. Furthermore, RMSIN employs an Adaptive Rotated Convolution (ARC) to account for the diverse orientations of objects, a novel contribution that significantly enhances segmentation accuracy. To assess the efficacy of RMSIN, we have curated an expansive dataset comprising 17,402 image-caption-mask triplets, which is unparalleled in terms of scale and variety. This dataset not only presents the model with a wide range of spatial and rotational scenarios but also establishes a stringent benchmark for the RRSIS task, ensuring a rigorous evaluation of performance. Our experimental evaluations demonstrate the exceptional performance of RMSIN, surpassing existing state-of-the-art models by a significant margin. All datasets and code are made available at https://github.com/Lsan2401/RMSIN.

Overview of the Rotated Multi-Scale Interaction Network for Referring Remote Sensing Image Segmentation

The paper "Rotated Multi-Scale Interaction Network for Referring Remote Sensing Image Segmentation" by Sihan Liu et al. presents a new approach to referring image segmentation in the remote sensing domain. Referring Remote Sensing Image Segmentation (RRSIS) requires segmenting the regions of an aerial image specified by a textual description, combining computer vision and natural language processing. Because aerial imagery exhibits complex spatial scales and diverse object orientations, traditional RIS methods fall short on this task; the paper addresses these limitations with a novel framework, the Rotated Multi-Scale Interaction Network (RMSIN).

Key Contributions and Methodology

  1. Dataset Introduction: The authors introduce a comprehensive RRSIS benchmark dataset, RRSIS-D. This dataset comprises 17,402 image-caption-mask triplets, providing a broad range of spatial scales and orientations for training and evaluating segmentation models. This dataset surpasses previous datasets such as RefSegRS in terms of volume and diversity.
  2. Rotated Multi-Scale Interaction Network (RMSIN): At the core of RMSIN are several architectural innovations designed to manage multi-scale and rotational variations in remote sensing imagery. The network employs an Intra-scale Interaction Module (IIM) to harness fine-grained details within individual layers and a Cross-scale Interaction Module (CIM) for comprehensive feature integration across various scales. The Adaptive Rotated Convolution (ARC) is specifically developed to tackle diverse orientations, enabling the network to capture and align features with superior adaptability to rotations.
  3. Architectural Design: RMSIN advances existing segmentation architectures through its Compound Scale Interaction Encoder (CSIE), which facilitates stage-wise feature fusion while preserving the semantic and spatial coherence required for remote sensing applications. Additionally, the use of a specialized Oriented-Aware Decoder integrates ARC to dynamically adjust convolutional kernels based on learned object orientations, promoting enhanced specificity in capturing object borders.
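The core idea behind rotated convolution can be sketched in plain NumPy: a standard kernel is resampled on a rotated coordinate grid before being convolved with the image. Note this is an illustrative simplification, not the paper's implementation: in RMSIN the rotation angles are predicted per input by a learned routing function, whereas here `theta` is a fixed argument and the resampling uses nearest-neighbour interpolation.

```python
import numpy as np

def rotate_kernel(kernel, theta):
    """Resample a k x k kernel on a grid rotated by theta (nearest neighbour)."""
    k = kernel.shape[0]
    c = (k - 1) / 2.0  # centre of the kernel
    out = np.zeros_like(kernel)
    cos, sin = np.cos(theta), np.sin(theta)
    for i in range(k):
        for j in range(k):
            # inverse-rotate each target coordinate back into the source kernel
            y, x = i - c, j - c
            ys, xs = cos * y + sin * x, -sin * y + cos * x
            yi, xi = int(round(ys + c)), int(round(xs + c))
            if 0 <= yi < k and 0 <= xi < k:
                out[i, j] = kernel[yi, xi]
    return out

def arc_conv(image, kernel, theta):
    """Valid 2-D cross-correlation with a kernel rotated by a per-image angle."""
    rk = rotate_kernel(kernel, theta)
    k = rk.shape[0]
    h, w = image.shape
    out = np.zeros((h - k + 1, w - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + k, j:j + k] * rk)
    return out
```

Rotating the kernel rather than the feature map keeps the spatial layout of the image intact while still aligning the filter's receptive pattern with oriented objects, which is the property ARC exploits.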

Evaluation and Results

The proposed RMSIN outperforms existing state-of-the-art methods on the RRSIS-D dataset. Specifically, RMSIN demonstrates a 3.64% to 6.10% improvement in mean Intersection-over-Union (mIoU) over its closest competitor, LAVT, on validation and test sets. Precision metrics across varying Intersection-over-Union (IoU) thresholds further reinforce the performance gains, especially in complex scenes involving small or oriented objects.
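For concreteness, the metrics reported above can be computed from binary masks as follows. This is a generic sketch of mIoU and precision-at-threshold (Pr@X), not the paper's evaluation code.

```python
import numpy as np

def iou(pred, gt):
    """Intersection-over-Union between two binary masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union > 0 else 1.0

def mean_iou(preds, gts):
    """mIoU: per-sample IoU averaged over the dataset."""
    return float(np.mean([iou(p, g) for p, g in zip(preds, gts)]))

def precision_at(preds, gts, thresh=0.5):
    """Pr@X: fraction of samples whose IoU exceeds the threshold."""
    return float(np.mean([iou(p, g) > thresh for p, g in zip(preds, gts)]))
```

Higher IoU thresholds (e.g. Pr@0.9) reward tight boundary alignment, which is why gains at those thresholds are most telling for small or oriented objects.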

Implications and Future Directions

The robust performance of RMSIN on a newly established benchmark underscores its potential impact in advancing the field of referring image segmentation in remote sensing. Practically, the enhanced segmentation accuracy facilitated by RMSIN could significantly benefit diverse applications, including urban planning, environmental monitoring, and land-use analysis. Theoretically, the introduction of ARC and its integration into the multi-scale interaction framework offer new avenues for further refining object detection and segmentation tasks in remote sensing and beyond.

Future research can explore extensions of RMSIN with recent advances in vision-language co-learning, as well as its adaptability to complex image domains beyond remote sensing. Components such as ARC could also be integrated into broader architectures that face rotation-specific challenges in other vision tasks, potentially spurring further work in multi-domain image analysis.

Authors (7)
  1. Sihan Liu (17 papers)
  2. Yiwei Ma (24 papers)
  3. Xiaoqing Zhang (30 papers)
  4. Haowei Wang (32 papers)
  5. Jiayi Ji (51 papers)
  6. Xiaoshuai Sun (91 papers)
  7. Rongrong Ji (315 papers)
Citations (20)