
InstructSAM: A Training-Free Framework for Instruction-Oriented Remote Sensing Object Recognition (2505.15818v1)

Published 21 May 2025 in cs.CV

Abstract: Language-guided object recognition in remote sensing imagery is crucial for large-scale mapping and automated data annotation. However, existing open-vocabulary and visual grounding methods rely on explicit category cues, limiting their ability to handle complex or implicit queries that require advanced reasoning. To address this issue, we introduce a new suite of tasks, including Instruction-Oriented Object Counting, Detection, and Segmentation (InstructCDS), covering open-vocabulary, open-ended, and open-subclass scenarios. We further present EarthInstruct, the first InstructCDS benchmark for earth observation. It is constructed from two diverse remote sensing datasets with varying spatial resolutions and annotation rules across 20 categories, necessitating models to interpret dataset-specific instructions. Given the scarcity of semantically rich labeled data in remote sensing, we propose InstructSAM, a training-free framework for instruction-driven object recognition. InstructSAM leverages large vision-language models to interpret user instructions and estimate object counts, employs SAM2 for mask proposal, and formulates mask-label assignment as a binary integer programming problem. By integrating semantic similarity with counting constraints, InstructSAM efficiently assigns categories to predicted masks without relying on confidence thresholds. Experiments demonstrate that InstructSAM matches or surpasses specialized baselines across multiple tasks while maintaining near-constant inference time regardless of object count, reducing output tokens by 89% and overall runtime by over 32% compared to direct generation approaches. We believe the contributions of the proposed tasks, benchmark, and effective approach will advance future research in developing versatile object recognition systems.

Summary

InstructSAM: A Training-Free Framework for Remote Sensing Object Recognition

The paper "InstructSAM: A Training-Free Framework for Instruction-Oriented Remote Sensing Object Recognition" presents a novel application of artificial intelligence to remote sensing imagery. The authors introduce InstructSAM, a framework designed to overcome the challenges of language-guided object recognition in remote sensing, where queries often lack explicit category cues and advanced reasoning is required to resolve complex or implicit instructions. The paper is significant for its goal of enabling large-scale mapping and automated data annotation without the need for extensive model training.

The core contribution is the InstructSAM framework, which operates without task-specific training. It uses existing large vision-language models (VLMs) to interpret natural-language user instructions and predict object categories and counts, then employs SAM2 (Segment Anything Model 2) to generate class-agnostic mask proposals. Assigning these masks to the predicted categories is formulated as a binary integer programming problem that combines semantic similarity scores with global object-counting constraints to optimize the category assignment.
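
This counting-constrained assignment can be sketched as a small binary integer program. The snippet below is an illustrative formulation, not the authors' exact implementation: binary variables x[i, j] assign mask i to category j, the objective maximizes total mask-category semantic similarity, each mask receives at most one label, and each category is used exactly as many times as the VLM-predicted count. The function name, variable layout, and unassigned-mask convention (label -1) are assumptions for the sketch.

```python
# Illustrative binary integer program for counting-constrained
# mask-label assignment (a sketch, not the paper's exact code).
import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds

def assign_labels(similarity, counts):
    """similarity: (M, K) mask-category similarity scores.
    counts: length-K predicted object counts per category.
    Returns (M,) category indices; -1 marks an unassigned mask."""
    M, K = similarity.shape
    c = similarity.ravel()  # decision vars x[i, j], flattened row-major

    # Row constraint: each mask gets at most one label, sum_j x[i, j] <= 1.
    row_sum = np.kron(np.eye(M), np.ones((1, K)))
    # Column constraint: category j used exactly counts[j] times,
    # sum_i x[i, j] == counts[j].
    col_sum = np.kron(np.ones((1, M)), np.eye(K))

    res = milp(
        -c,  # milp minimizes, so negate to maximize total similarity
        integrality=np.ones(M * K),
        bounds=Bounds(0, 1),
        constraints=[
            LinearConstraint(row_sum, ub=np.ones(M)),
            LinearConstraint(col_sum, lb=counts, ub=counts),
        ],
    )
    x = np.round(res.x).reshape(M, K)
    return np.where(x.sum(axis=1) > 0, x.argmax(axis=1), -1)
```

Because the counts act as hard equality constraints, no per-mask confidence threshold is needed: low-similarity masks are simply left unassigned once the predicted counts are satisfied, which matches the abstract's claim of threshold-free assignment.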

InstructSAM is evaluated on the newly proposed EarthInstruct benchmark, which is constructed from two diverse remote sensing datasets with different spatial resolutions and annotation rules across 20 categories. The benchmark aims to reflect the diversity and complexity of real-world remote sensing tasks and requires models to interpret detailed, dataset-specific instructions.

Experimental results reveal that InstructSAM consistently matches or surpasses specialized baseline models in performance across multiple tasks, such as object counting, detection, and segmentation, while maintaining near-constant inference time regardless of the number of objects in the image. The framework also notably reduces output tokens by 89% and runtime by over 32% when compared to direct generation approaches. This efficiency is critical for practical applications, where computational resources and time are often constrained.

From a theoretical standpoint, InstructSAM sets a precedent for instruction-driven object recognition systems by demonstrating that robust performance can be achieved without conventional model training. This paradigm can significantly reduce the costs associated with pre-training and model development while offering flexibility and scalability across diverse remote sensing tasks. The framework leverages the capabilities of foundation models without further adaptation, suggesting it can transfer to other image domains by swapping in suitable semantic embedding models.
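
Since the assignment step only consumes a mask-category similarity matrix, the embedding model behind it is interchangeable. A minimal sketch, assuming mask-crop image embeddings and category-text embeddings from any CLIP-style model (the function and argument names here are hypothetical):

```python
# Sketch: build the (M, K) mask-category similarity matrix from any
# image-text embedding model via cosine similarity. Embeddings are
# assumed to come from a CLIP-style encoder; only the geometry is shown.
import numpy as np

def similarity_matrix(mask_embeds, text_embeds):
    """mask_embeds: (M, D) embeddings of mask crops.
    text_embeds: (K, D) embeddings of category descriptions.
    Returns (M, K) cosine similarities."""
    m = mask_embeds / np.linalg.norm(mask_embeds, axis=1, keepdims=True)
    t = text_embeds / np.linalg.norm(text_embeds, axis=1, keepdims=True)
    return m @ t.T  # rows: masks, columns: categories
```

Swapping the encoder changes only how these two embedding arrays are produced; the downstream integer-programming assignment is untouched.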

Looking forward, the theoretical implications of InstructSAM suggest potential advances in AI for remote sensing, where improved architectural designs and more refined VLMs could be integrated to enhance performance. Additionally, the EarthInstruct benchmark provides a foundation for testing and developing future remote sensing models.

Practically, InstructSAM's contributions can be instrumental in domains requiring automated object recognition from aerial imagery, such as disaster management, wildlife monitoring, and urban planning. As satellite imagery continues to evolve with higher resolutions and increased availability, frameworks like InstructSAM present an opportunity to harness these data effectively and efficiently, thereby supporting various applications aligned with the United Nations' Sustainable Development Goals.

In summary, this paper introduces an innovative, training-free approach to remote sensing object recognition, setting a standard for future research in integrating instructional guidance with AI-driven analysis in complex data environments. The InstructSAM framework and EarthInstruct benchmark together create a meaningful contribution to advancing the field of remote sensing, enabling more dynamic and responsive AI systems.
