
TinyRS-R1: Compact Multimodal Language Model for Remote Sensing (2505.12099v1)

Published 17 May 2025 in cs.CV

Abstract: Remote-sensing applications often run on edge hardware that cannot host today's 7B-parameter multimodal LLMs. This paper introduces TinyRS, the first 2B-parameter multimodal small language model (MSLM) optimized for remote sensing tasks, and TinyRS-R1, its reasoning-augmented variant. Built upon Qwen2-VL-2B, TinyRS is trained through a four-stage pipeline: pre-training on a million satellite images, instruction tuning on visual instruction examples, fine-tuning with Chain-of-Thought (CoT) annotations from the proposed reasoning dataset, and alignment via Group Relative Policy Optimization (GRPO). TinyRS-R1 matches or surpasses the performance of recent 7B-parameter remote sensing models across classification, VQA, visual grounding, and open-ended question answering, while requiring just one-third of the memory and latency. Our analysis shows that CoT reasoning substantially benefits spatial grounding and scene understanding, while the non-reasoning TinyRS excels in concise, latency-sensitive VQA tasks. TinyRS-R1 represents the first domain-specialized MSLM with GRPO-aligned CoT reasoning for general-purpose remote sensing.

Summary

TinyRS-R1: A Compact Multimodal LLM for Remote Sensing

The paper presents TinyRS and its variant TinyRS-R1, two 2-billion-parameter multimodal small language models (MSLMs) optimized for remote sensing tasks. They address the challenge of deploying large-scale models on resource-limited platforms, such as the edge devices common in remote sensing applications. TinyRS builds on the Qwen2-VL-2B framework, adopting a modular, efficient architecture suited to satellite image analysis and related tasks.

Methodology

The models undergo a comprehensive four-stage training pipeline:

  1. Pre-training: The models are initially pre-trained on a vast set of satellite images to establish a foundational understanding of remote sensing imagery.
  2. Instruction Tuning: This stage involves fine-tuning the model using a visual instruction dataset, enhancing its capability to handle vision-language tasks effectively.
  3. Chain-of-Thought (CoT) Fine-tuning: This novel step involves fine-tuning with Chain-of-Thought annotations from a specially curated reasoning dataset. This method significantly improves the model's ability to perform reasoning tasks, particularly in spatial grounding and scene comprehension.
  4. Group Relative Policy Optimization (GRPO) Alignment: The final training stage involves alignment via GRPO, a reinforcement learning approach, to optimize the model’s reasoning capabilities further.
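The GRPO stage can be illustrated by its core operation, the group-relative advantage: several responses are sampled for each prompt, scored by a reward function, and each response's advantage is its reward normalized against the others in the same group. The sketch below follows the standard GRPO formulation; the paper's exact reward design is not reproduced here, and the reward scheme in the example (1.0 for a correct answer, 0.0 otherwise) is hypothetical.

```python
# Sketch of GRPO's group-relative advantage computation (stage 4 of
# the pipeline), under the standard formulation: advantages are
# rewards normalized within each sampled group of responses.
from statistics import mean, stdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize per-response rewards within one sampled group.

    rewards: scalar rewards, one per sampled completion for the same
             prompt (e.g. correctness of a VQA answer).
    Returns advantages: (r - group mean) / (group std + eps).
    """
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: four sampled answers to one remote-sensing question,
# two of them judged correct (hypothetical reward scheme).
advs = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Because the baseline is the group's own mean reward, this scheme needs no separately trained value network, which is what makes GRPO comparatively cheap to run during alignment.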

Evaluation and Results

TinyRS-R1 and its base variant, TinyRS, are evaluated against standard remote sensing benchmarks covering classification, visual question answering (VQA), visual grounding, and open-ended question answering. TinyRS-R1 not only matches but often surpasses the performance of 7-billion-parameter remote sensing models across these tasks. In particular, TinyRS-R1 excels on complex tasks that benefit from CoT reasoning, while the base TinyRS is better suited to latency-sensitive VQA thanks to its concise outputs.
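The roughly one-third memory figure is consistent with a simple parameter-count estimate. The sketch below counts only 16-bit weight memory; the paper's actual measurements would also include activations, KV cache, and runtime overhead, so this is a rough check rather than a reproduction of their numbers.

```python
# Back-of-envelope check of the memory claim: weight memory for a
# 2B- vs 7B-parameter model stored in 16-bit precision.
BYTES_PER_PARAM_FP16 = 2

def weight_memory_gb(n_params):
    """Approximate weight memory in GB for fp16/bf16 weights."""
    return n_params * BYTES_PER_PARAM_FP16 / 1e9

tinyrs_gb = weight_memory_gb(2e9)    # 2B params -> ~4 GB
baseline_gb = weight_memory_gb(7e9)  # 7B params -> ~14 GB
ratio = tinyrs_gb / baseline_gb      # ~0.29, i.e. roughly one-third
```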

Implications and Future Directions

The successful implementation of a compact yet capable multimodal language model for remote sensing shows that model size can be reduced substantially without sacrificing performance. This broadens the scope for deploying efficient AI in real-world settings where computational resources are constrained, and points to a promising future for domain-specialized small language models.

Further investigations might involve enhancing the knowledge retrieval capabilities of these models to improve their performance in general knowledge tasks. The development of retrieval-augmented models, coupled with the efficient framework of TinyRS, could yield even higher accuracy and robustness in various complex tasks faced in remote sensing.

In conclusion, TinyRS and TinyRS-R1 represent significant strides in the development of resource-efficient AI models tailored for domain-specific applications. This research opens new avenues for the practical deployment of AI in sectors reliant on edge computing and positions these models as vital tools for advancing the field of remote sensing.