RS-GPT4V Dataset for Remote Sensing
- RS-GPT4V is a unified multimodal dataset designed for remote sensing image understanding via instruction-following (Q, A) pairs.
- It employs a novel annotation adaptation using GPT-4V to convert legacy labels into detailed, multi-turn dialogue and chain-of-thought reasoning structures.
- Empirical results show improvements in captioning, VQA, and visual grounding tasks, setting new benchmarks for remote sensing vision-language models.
RS-GPT4V is a unified, multimodal instruction-following dataset explicitly designed for remote sensing image (RSI) understanding in the era of multi-modal LLMs (MLLMs). The dataset addresses the limitations of legacy domain models by enabling generalization, fine-grained scene understanding, and high-level reasoning capabilities vital for the next generation of adaptive domain models in remote sensing. RS-GPT4V leverages GPT-4V's instruction-following abilities and synthesizes diverse remote sensing tasks within a single (Question, Answer) paradigm, supporting model training across image captioning, visual question answering (VQA), visual grounding, region-level captioning, and multi-turn dialogue (Xu et al., 2024).
1. Motivation and Conceptual Foundations
RS-GPT4V was conceived in response to the shift from learning domain-specific models from scratch (LaDM) to a two-stage paradigm in which a pre-trained general foundation model is adapted to the remote sensing domain (LaGD). Previous datasets, while pivotal for classic RSI analysis tasks, do not meet the generalization, complex scene understanding, and reasoning requirements imposed by instruction-following MLLMs. The RS-GPT4V design criteria emphasize:
- Generalization: Architecture-neutral training signals to promote cross-task knowledge sharing and ease task adaptation.
- Fine-Grained Understanding: Hierarchical instructions enable models to discern object attributes and spatial relationships, fostering detailed natural language scene descriptions.
- Reasoning: Dialogic multi-turn QA structures support explicit high-level visual reasoning, including chain-of-thought workflows, object set identification, attribute extraction, and deductive inference (Xu et al., 2024).
2. Dataset Construction and Task Unification
RS-GPT4V encompasses 91,937 training images and 15,999 test images, yielding 991,206 and 258,419 (Q, A) instance pairs, respectively. Six major task categories are unified:
- Image captioning: NWPU-Captions, RSICD, RSITMD, Sydney-Captions, UCM-Captions.
- Visual QA: RSVQA-LR, RSVQA-HR, FloodNet, RSIVQA.
- Visual grounding: DIOR-RSVG.
- Region-level captioning: DIOR-RSVG.
- Multi-turn conversation/detailed description: RS-GPT4V-Instruct.
The construction workflow employs two main strategies:
- Instruction-Annotation Adaptation: Existing dataset annotations (labels, bounding boxes, class indices, captions) are reformulated into instruction templates, e.g., "Provide a one-sentence caption for this RSI," resulting in (Q, A) pairs (see the sketch after this list).
- Instruction-Response Generation: GPT-4V is prompted with image data and geometric (rotated bounding box) coordinates to generate high-quality (Q, A) pairs, eliciting object attributes, spatial relations, and logical reasoning (Xu et al., 2024).
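To make the Instruction-Annotation Adaptation strategy concrete, the following minimal sketch converts legacy captioning and grounding labels into (Q, A) records; the template wording, field names, and file identifiers are illustrative assumptions rather than the paper's exact pipeline:

```python
# Illustrative sketch of the Instruction-Annotation Adaptation strategy.
# The record schema, template text, and file names are assumptions for
# demonstration; they are not taken verbatim from RS-GPT4V.
import random

CAPTION_TEMPLATES = [
    "Provide a one-sentence caption for this RSI.",
    "Briefly describe the remote sensing image.",
]

GROUNDING_TEMPLATE = "Give the bounding box of the {category} in the image."

def caption_to_qa(image_id: str, caption: str) -> dict:
    """Turn a legacy captioning label into an instruction-following (Q, A) pair."""
    return {
        "image": image_id,
        "question": random.choice(CAPTION_TEMPLATES),
        "answer": caption,
    }

def grounding_to_qa(image_id: str, category: str, box: tuple) -> dict:
    """Turn a detection/grounding label (category + box) into a (Q, A) pair."""
    x1, y1, x2, y2 = box
    return {
        "image": image_id,
        "question": GROUNDING_TEMPLATE.format(category=category),
        "answer": f"[{x1}, {y1}, {x2}, {y2}]",
    }

if __name__ == "__main__":
    print(caption_to_qa("rsicd_00042.jpg", "Many planes are parked near the terminal."))
    print(grounding_to_qa("dior_01311.jpg", "tennis court", (120, 88, 260, 190)))
```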
3. Instructional and Annotation Formalism
Each instance is formalized as a tuple $(Q, A)$, where $Q$ encodes a system prompt, optional task-specific instruction, and user query, and $A$ contains the corresponding GPT-4V response or a manually verified answer. Multi-turn dialogues instantiate chain-of-thought reasoning across turns, e.g.:
- $Q_1$: "List all tennis and basketball courts in the image."
- $A_1$: "There are two tennis courts at bottom-left and one basketball court top-right."
- $Q_2$: "Describe the color and surroundings of the basketball court."
- $A_2$: "It has orange flooring, surrounded by green fields and a fence."
The annotation design incorporates:
- Hierarchical Descriptions:
  - Local strategy: For a set of objects $\{o_1, \dots, o_n\}$, each object $o_i$ is associated with attributes $a_i$ and pairwise spatial relations $r_{ij}$ (e.g., "left_of," "adjacent_to").
  - Global strategy: Aggregates the local object descriptions to produce a coherent, context-rich scene description.
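A small sketch of the local/global strategy follows; the object names, attribute lists, and relation vocabulary are invented for illustration, and the composition rule is deliberately simplistic:

```python
# Sketch of the hierarchical (local -> global) description strategy.
# Objects, attributes, and relations are invented for illustration only.
from dataclasses import dataclass

@dataclass
class SceneObject:
    name: str
    attributes: list[str]

objects = [
    SceneObject("tennis court", ["green surface", "white boundary lines"]),
    SceneObject("basketball court", ["orange flooring"]),
]
relations = [("basketball court", "adjacent_to", "tennis court")]

def local_descriptions(objs: list[SceneObject]) -> list[str]:
    """One sentence per object: attributes only (local strategy)."""
    return [f"A {o.name} with {', '.join(o.attributes)}." for o in objs]

def global_description(objs: list[SceneObject], rels: list[tuple]) -> str:
    """Aggregate local sentences and pairwise relations into a scene-level caption."""
    parts = local_descriptions(objs)
    parts += [f"The {a} is {rel.replace('_', ' ')} the {b}." for a, rel, b in rels]
    return " ".join(parts)

print(global_description(objects, relations))
```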
The autoregressive fine-tuning objective is defined as
$$\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta\left(A_t \mid V, Q, A_{<t}\right),$$
where $V$ is the visual input and $A_t$ the answer tokens (Xu et al., 2024).
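In practice this objective is typically implemented as a cross-entropy loss masked to answer tokens only; the sketch below follows that common LLaVA-style convention and is not the paper's exact training code:

```python
# Minimal sketch of the autoregressive instruction-tuning loss: cross-entropy is
# computed only on answer tokens, with prompt/question positions masked out.
# This mirrors common practice (e.g., LLaVA-style training), not the paper's code.
import torch
import torch.nn.functional as F

def answer_only_loss(logits: torch.Tensor, input_ids: torch.Tensor,
                     answer_mask: torch.Tensor) -> torch.Tensor:
    """logits: (B, T, V); input_ids: (B, T); answer_mask: (B, T) bool, True on answer tokens."""
    labels = input_ids.clone()
    labels[~answer_mask] = -100            # ignore prompt/question positions
    shift_logits = logits[:, :-1, :]       # predict token t from tokens < t
    shift_labels = labels[:, 1:]
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=-100,
    )
```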
4. Dataset Breakdown and Statistics
RS-GPT4V integrates and restructures multiple foundational datasets; a representative subset is summarized below:
| Task Type | Source Dataset | Train Images | Train QA | Test Images | Test QA |
|---|---|---|---|---|---|
| Image Captioning | NWPU-Captions | 25,200 | 125,894 | 3,150 | 1,093 |
| | RSICD | 8,734 | 17,813 | 1,093 | 1,093 |
| Visual QA | RSVQA-LR | 572 | 57,223 | 100 | 10,004 |
| | RSVQA-HR | 6,251 | 625,340 | 2,226 | 222,684 |
| Visual Grounding/Region Captioning | DIOR-RSVG | 9,466 | 19,643 | 7,936 | 18,677 |
| Multi-turn & Detailed Description | RS-GPT4V-Instruct | 9,466 | 62,067 (MT) / 9,465 (DD) | 613 | 3,987 (MT) / 613 (DD) |
Note: (MT) = multi-turn QA, (DD) = detailed description (Xu et al., 2024).
5. Multi-Turn Reasoning and Dialogue Paradigms
Multi-turn dialogue sequences are engineered to instantiate chain-of-thought reasoning:
- Object Set Identification: "List objects of interest."
- Attribute Extraction: "What is the color and surface material of each?"
- Deductive Inference: "Based on the ship’s wake, is it moving or stationary?"
The formal reasoning process is
$$\{o_i\} \;\rightarrow\; \{a_i, r_{ij}\} \;\rightarrow\; C,$$
where $C$ summarizes high-level conclusions drawn from visual evidence (Xu et al., 2024).
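The turn structure can be sketched as follows; the prompt wording and the helper function are illustrative assumptions, not the dataset's generation code:

```python
# Sketch of assembling a multi-turn chain-of-thought dialogue following the
# identification -> attribute extraction -> deductive inference pattern.
# Prompt wording and helper are illustrative, not the paper's generation code.
REASONING_TURNS = [
    "List the objects of interest in this remote sensing image.",
    "For each listed object, what is its color and surface material?",
    "Based on the ship's wake, is it moving or stationary? Explain your reasoning.",
]

def build_dialogue(image_token: str = "<image>") -> list[dict]:
    """Return alternating human/model turns; model answers would be filled in by
    GPT-4V (during dataset construction) or by the fine-tuned MLLM (at inference)."""
    turns = []
    for i, question in enumerate(REASONING_TURNS):
        prefix = f"{image_token}\n" if i == 0 else ""
        turns.append({"from": "human", "value": prefix + question})
        turns.append({"from": "gpt", "value": None})  # placeholder for the answer
    return turns

for turn in build_dialogue():
    print(turn)
```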
6. Empirical Evaluations and Comparative Performance
Fine-tuning was conducted using LLaVA-1.5-7B and compared across full fine-tuning, LoRA, and MoE-LoRA protocols (rank 128, 4 experts, 1 epoch; learning rate as reported in Xu et al., 2024). A LoRA configuration sketch follows at the end of this section. Major results include:
- Image Captioning (NWPU-Captions): BLEU-4 improved from 15 to 26, CIDEr from 65 to 112, SPICE from 8 to 14.
- Visual QA (RSVQA-HR): MoE-LoRA achieved an average accuracy of 78% vs. 65% for the Bi-Modal and SHRNet baselines; accuracy on Presence and Comparison question types improved by 10–15 percentage points.
- Visual Grounding (DIOR-RSVG, Acc@0.5): Qwen-vl-Chat 25.05%, LLaVA-1.5 9.52%, Full-FT 36.31%, LoRA 33.15%, MoE-LoRA 37.86%.
- Dialogue Evaluation (GPT-4V scoring, 1–10): Complex reasoning—Full-FT 6.27, LoRA 6.06, MoE-LoRA 6.11; Baselines (LLaVA-1.5 5.21, Qwen-vl-Chat 2.65). Detailed description—Full-FT 6.53, LoRA 6.37, MoE-LoRA 6.47; baseline range 4–5.
These results demonstrate consistent quantitative and qualitative improvements in captioning, VQA, grounding, and dialogue when training on RS-GPT4V (Xu et al., 2024).
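For reference, a minimal sketch of the rank-128 LoRA protocol using the Hugging Face peft library is given below; target modules, alpha, and dropout are assumed values, and the MoE-LoRA variant (4 experts) would require a mixture-of-experts adapter implementation that stock peft does not provide:

```python
# Minimal LoRA configuration sketch (Hugging Face peft), approximating the
# rank-128 protocol described above. Target modules, alpha, and dropout are
# assumed values; MoE-LoRA (4 experts) is not part of stock peft and is omitted.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("lmsys/vicuna-7b-v1.5")  # LLaVA-1.5-7B language backbone
lora_cfg = LoraConfig(
    r=128,                      # rank used in the RS-GPT4V experiments
    lora_alpha=256,             # assumed: 2x rank is a common heuristic
    lora_dropout=0.05,          # assumed
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()
```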
7. Context within Remote Sensing Vision-Language Resources
RS-GPT4V is distinct from other contemporary multimodal RS datasets such as MMM-RS (Luo et al., 2024) and GAIA (Zavras et al., 2025). While MMM-RS focuses on large-scale text-to-image pairs for generative diffusion model benchmarking (2.1M pairs), and GAIA emphasizes global multi-modal, multi-scale retrieval and captioning with five synthetic captions per image, RS-GPT4V uniquely prioritizes unified instruction-following, fine-grained spatial annotation, hierarchical local/global scene description, and explicit multi-turn reasoning.
A plausible implication is that RS-GPT4V serves as an enabling resource for instruction-following vision-language models that require broad generalization across RS tasks involving hierarchical reasoning, rather than only generative fidelity or global description coverage (Xu et al., 2024).
References:
- RS-GPT4V: Xu et al., 2024
- MMM-RS: Luo et al., 2024
- GAIA: Zavras et al., 2025