RS-M-CLIP: Multilingual Remote Sensing Model
- The paper introduces RS-M-CLIP, a vision-and-language model that extends CLIP with multilingual and self-distillation capabilities for remote sensing tasks.
- It employs a ViT-B/32 image encoder and an XLM-RoBERTa text encoder, using a DINO-inspired self-distillation loss to align local and global features.
- Empirical results demonstrate state-of-the-art performance in cross-modal retrieval, zero-shot classification, and multilingual retrieval across diverse remote sensing benchmarks.
Remote Sensing Multilingual CLIP (RS-M-CLIP) is a vision-and-language model developed specifically for the remote sensing domain, extending contrastive language–image pre-training (CLIP) with multilingual and self-distillation capabilities. RS-M-CLIP jointly optimizes a multilingual CLIP framework with a self-supervised objective, leveraging automated translation to support cross-lingual vision–language tasks such as retrieval and zero-shot classification, and achieves state-of-the-art (SOTA) performance across multiple remote sensing benchmarks (Silva et al., 2024).
1. Model Architecture and Training Objectives
RS-M-CLIP integrates a multilingual backbone with an explicit self-distillation mechanism for enhanced local–global visual understanding. The core architectural and optimization components are as follows:
- Image Encoder: A Vision Transformer (ViT-B/32) initialized from the XLM-RoBERTa–based CLIP checkpoint ("laion5b_s13b_b90k"), pre-trained on approximately 5B multilingual web image–text pairs. The output is a 512-dimensional, ℓ₂-normalized global [CLS] embedding, which is used in the InfoNCE contrastive loss.
- Text Encoder: A multilingual XLM-RoBERTa-base model that projects the [CLS] token of the input caption into a 512-dimensional, ℓ₂-normalized embedding.
- Self-Distillation Projector: For both student and teacher ViT encoders, a 3-layer MLP followed by ℓ₂ normalization and a weight-normalized fully connected layer outputs a probability distribution over “codes.” This structure facilitates local–global alignment via a DINO-inspired self-distillation loss.
- Contrastive CLIP Loss (InfoNCE): With ℓ₂-normalized image embeddings $v_i$ and text embeddings $t_i$, and a learned temperature $\tau$, the symmetric contrastive loss is

  $$\mathcal{L}_{\text{CLIP}} = -\frac{1}{2N}\sum_{i=1}^{N}\left[\log\frac{\exp(v_i^{\top} t_i/\tau)}{\sum_{j=1}^{N}\exp(v_i^{\top} t_j/\tau)} + \log\frac{\exp(v_i^{\top} t_i/\tau)}{\sum_{j=1}^{N}\exp(v_j^{\top} t_i/\tau)}\right],$$

  where $N$ is the batch size.
- Self-Distillation (Local–Global Alignment): For each image, $S = 10$ views are sampled—two global crops and eight local crops. The teacher, updated via an exponential moving average (EMA) of the student weights, receives only the global crops; the student receives all views. With $P_t$ and $P_s$ denoting the teacher's and student's output distributions over codes, the self-distillation loss is

  $$\mathcal{L}_{\text{SD}} = \sum_{g \in \{x^{g}_{1},\, x^{g}_{2}\}} \; \sum_{\substack{v \in V \\ v \neq g}} H\!\left(P_t(g),\, P_s(v)\right), \qquad H(a, b) = -\sum_{k} a_k \log b_k,$$

  with teacher outputs centered (mean-subtracted) and sharpened (low softmax temperature) to prevent collapse.
- Joint Objective: The total training loss is $\mathcal{L} = \mathcal{L}_{\text{CLIP}} + \mathcal{L}_{\text{SD}}$.
This configuration ensures alignment between local and global visual features while maintaining strong cross-modal contrastive alignment between image and text modalities.
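As a concrete illustration, the two objectives can be sketched in PyTorch. This is a minimal sketch under stated assumptions: the temperatures, the centering term, and all function names here are illustrative choices, not values taken from the paper.

```python
import torch
import torch.nn.functional as F

def clip_infonce_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch: matched image/text pairs sit on the
    diagonal of the similarity matrix; all other pairs act as negatives."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature          # (N, N)
    targets = torch.arange(img_emb.size(0))
    return 0.5 * (F.cross_entropy(logits, targets) +      # image -> text
                  F.cross_entropy(logits.t(), targets))   # text -> image

def dino_self_distillation_loss(student_logits, teacher_logits, center,
                                student_temp=0.1, teacher_temp=0.04):
    """DINO-style loss: cross-entropy between the centered, sharpened teacher
    distribution over codes and the student's distribution for another view."""
    s_log_probs = F.log_softmax(student_logits / student_temp, dim=-1)
    t_probs = F.softmax((teacher_logits - center) / teacher_temp, dim=-1).detach()
    return -(t_probs * s_log_probs).sum(dim=-1).mean()
```

A joint training step would sum the two terms, mirroring the joint objective above.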
2. Multilingual Fine-Tuning Strategy
RS-M-CLIP employs a systematic multilingual augmentation pipeline and data sampling approach:
- Languages and Dataset Composition: The base English dataset (Cap-5) concatenates five remote-sensing caption collections, totaling approximately 49,900 images and 249,385 English captions. Machine translation with TowerInstruct-13B-v0.1 (a state-of-the-art LLM for high-resource machine translation) generates nine additional language variants: German, French, Spanish, Chinese, Portuguese, Italian, Russian, Korean, and Dutch, each with an equivalent number of captions.
- Batch Construction: At each training step, a batch of 128 image IDs is sampled. For each image, one of its 10 available captions (English or translation) is chosen at random. Positive pairs are formed (image, caption), with all other cross-pairs in a batch treated as negatives in the contrastive loss.
- Translation Quality: No explicit automated quality control is performed, other than manual spot checking. Short and simple remote-sensing captions minimize translation errors.
This multilingual augmentation serves as both cross-lingual supervision and text-based data augmentation, with empirical evidence that even English-only retrieval performance is enhanced by the inclusion of translated captions.
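The batch construction above can be sketched as follows; `sample_batch` and the data layout are hypothetical illustrations, not the paper's code:

```python
import random

def sample_batch(image_ids, captions_by_image, batch_size=128, rng=None):
    """Draw a batch of distinct image IDs, then pick one caption per image
    uniformly at random from its 10 language variants (English + 9 translations)."""
    rng = rng or random.Random()
    batch_ids = rng.sample(list(image_ids), batch_size)
    return [(img_id, rng.choice(captions_by_image[img_id])) for img_id in batch_ids]
```

Within a batch, every non-matching (image, caption) pair then serves as a negative for the contrastive loss.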
3. Training Protocol and Hyperparameters
Training is performed using PyTorch with the OpenCLIP library on research-grade GPUs (e.g., NVIDIA A100/Tesla-V100), requiring approximately 1–2 days. Specifics include:
- Optimization: AdamW optimizer, with linear warmup of the base learning rate over the first 10 epochs and 100 training epochs in total.
- Batching and Sampling: Batch size of 128, with random caption selection over the 10 language variants per image.
- Self-Distillation Head: Hidden dimension of 2048, with a weight-normalized output layer producing the distribution over codes.
- Image Augmentation: Follows DINO protocol with color jitter, Gaussian blur, and solarization; two global crops (224 × 224), eight local crops (96 × 96 upscaled to 224 × 224).
- Text Augmentation: Achieved by random language sampling per caption.
- EMA Momentum: Teacher model weights are updated as an exponential moving average of the student weights.
This regimen establishes both broad cross-modal alignment and robust multilingual transfer.
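The teacher's EMA update can be sketched as below; this is a generic sketch, and the momentum value and update cadence used in training are not reproduced here:

```python
import torch

@torch.no_grad()
def ema_update(teacher, student, momentum):
    """Update each teacher parameter as an exponential moving average of the
    corresponding student parameter: t <- m * t + (1 - m) * s."""
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(momentum).add_(s_param, alpha=1.0 - momentum)
```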
4. Evaluation and Empirical Results
RS-M-CLIP is evaluated on three vision-language tasks across multiple benchmarks:
A. Cross-Modal Retrieval
Datasets: RSICD, UCM, RSITMD; metrics: Recall@1, Recall@5, Recall@10, and mean Recall (mR).
| Model | RSICD mR | RSITMD mR | UCM mR |
|---|---|---|---|
| Zero-shot CLIP-XLM | 18.7 | 27.5 | 36.5 |
| CLIP/Cap-5 (English only) | 45.6 | 61.4 | 55.1 |
| RS-M-CLIP (English only) | 61.6 | 71.4 | 47.4 |
| RS-M-CLIP (+translations) | 63.7 | 74.6 | 49.5 |
Performance gains over prior SOTA are up to +16.4 percentage points (pp) mean Recall (mR) on RSICD and +17.1 pp on RSITMD, with multilingual augmentation yielding an additional +2–3 pp.
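For reference, mean Recall (mR) averages Recall@{1, 5, 10} over both retrieval directions (image→text and text→image). A minimal sketch, assuming one ground-truth match per query at the same gallery index (benchmarks like RSICD attach five captions per image; handling multiple ground truths is omitted for brevity):

```python
import numpy as np

def recall_at_k(sim, ks=(1, 5, 10)):
    """Recall@k for a query-by-gallery similarity matrix where the true match
    for query i is gallery item i."""
    ranks = (-sim).argsort(axis=1)                    # gallery sorted by similarity
    gt = np.arange(sim.shape[0])[:, None]
    pos = (ranks == gt).argmax(axis=1)                # rank of the true match
    return {k: float((pos < k).mean()) for k in ks}

def mean_recall(sim_i2t, sim_t2i):
    """mR in percent: mean of the six Recall@k values across both directions."""
    vals = list(recall_at_k(sim_i2t).values()) + list(recall_at_k(sim_t2i).values())
    return 100.0 * sum(vals) / len(vals)
```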
B. Zero-Shot Image Classification
Benchmarks: 12 datasets (RSI-CB128/RSI-CB256, WHU-Earth, EuroSAT, MLRSNet, PatternNet, RESISC45, AID, RSSCN7, OPTIMAL-31, RSC11, WHU-RS19). Metric: top-1 accuracy with prompt “a satellite photo of {class name}.”
| Model | Mean Accuracy (%) |
|---|---|
| CLIP-ViT-B | 48.0 |
| RemoteCLIP-ViT-B | 61.5 |
| CLIP-XLM RoBERTa-B | 59.8 |
| RS-M-CLIP-B | 63.2 |
RS-M-CLIP surpasses original CLIP and matches or exceeds RemoteCLIP on most datasets.
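Given pre-computed embeddings, this zero-shot protocol reduces to nearest-prototype matching against the prompt embeddings; a minimal sketch (encoder calls omitted, function names hypothetical):

```python
import torch
import torch.nn.functional as F

TEMPLATE = "a satellite photo of {}."

def build_prompts(class_names):
    """One prompt per class, following the paper's template."""
    return [TEMPLATE.format(name) for name in class_names]

def zero_shot_classify(image_embs, class_text_embs):
    """Assign each image the class whose prompt embedding has the highest
    cosine similarity (all embeddings are L2-normalized first)."""
    image_embs = F.normalize(image_embs, dim=-1)
    class_text_embs = F.normalize(class_text_embs, dim=-1)
    return (image_embs @ class_text_embs.t()).argmax(dim=-1)
```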
C. Multilingual Retrieval
RSICD and RSITMD are evaluated with test captions in all 10 languages. On RSICD, RS-M-CLIP performs comparably across languages, e.g., mR = 59.9 (German) and mR = 62.8 (French) versus mR = 63.7 (English), indicating robust cross-lingual alignment, with most languages within a few percentage points of English.
5. Analysis and Practical Insights
Multiple analyses highlight the broader impact and distinctive properties of RS-M-CLIP:
- Performance does not degrade with multilingual queries: cross-lingual retrieval on EuroSAT with prompts in English, Portuguese, Chinese, and French yields nearly identical top-1 accuracies (variation within ±1 pp), and in some instances, non-English prompts slightly outperform English.
- Augmenting with translated captions consistently boosts retrieval in English by 2–3 pp mR, demonstrating that cross-lingual supervision acts as effective text augmentation.
- When integrated with NACLIP for open-vocabulary segmentation, RS-M-CLIP’s local features yield sharper object masks for classes such as “plane,” “ground track field,” and “storage tank” compared with standard CLIP models.
- Qualitative retrieval examples (e.g., Figure 1 of the source) show highly semantically coherent false positives and correct image retrieval at rank 1 regardless of query language.
A plausible implication is that the combination of multilingual and local–global self-supervision leads to feature spaces that are both semantically rich and language-independent.
6. Implications for Remote Sensing Vision-Language Research
RS-M-CLIP demonstrates the efficacy of combining a multilingual CLIP backbone, self-distillation aligning local and global crops, and translated caption augmentations for remote sensing tasks. The joint InfoNCE and self-distillation objective provides simultaneously strong cross-modal retrieval (up to 63.7 mR on RSICD), robust zero-shot classification (63.2% mean accuracy), and seamless multilingual support.
This approach validates the synergy between self-supervised local–global visual alignment and multilingual pre-training, presenting a versatile and scalable paradigm for building remote sensing vision–language models adaptable across languages and modalities (Silva et al., 2024).