GeoMELT: Efficient Multimodal Learning Transformer

Updated 24 December 2025
  • GeoMELT is a multimodal transformer designed for remote sensing, efficiently unifying image captioning, visual question answering, and cross-modal retrieval.
  • It employs a 12-layer encoder based on BEiT-3 with specialized feed-forward modules for modality-specific processing of images and text.
  • The model leverages dynamic loss balancing and a two-stage training protocol to optimize performance while drastically reducing computational demands compared to large LVLMs.

GeoMELT (Geo Multi-task Efficient Learning Transformer) is a multimodal encoder-only transformer model designed for efficient, unified handling of vision-and-language tasks in the remote sensing domain. Unlike conventional Large Vision-Language Models (LVLMs), which require billions of parameters and extensive computational resources, GeoMELT implements an architecture optimized for multiple tasks, such as image captioning, visual question answering (VQA), visual grounding, and cross-modal retrieval, while maintaining a compact model size of 271 million parameters. The framework achieves competitive results on standard benchmarks, balancing task performance and computational efficiency, and is specifically adapted for image–language multi-tasking across a diverse set of remote-sensing datasets (Silva et al., 17 Dec 2025).

1. Model Architecture

GeoMELT adopts an encoder-only backbone based on the BEiT-3 architecture, consisting of 12 transformer blocks. Each block comprises Multi-Head Self-Attention (MHSA) with 12 heads and a hidden dimension of 768, accompanied by modality-specific feed-forward network (FFN) experts of size 3072 with GeLU activations. The top three transformer layers incorporate additional vision–language expert FFN modules. Images are tokenized by dividing them into non-overlapping 16×16 patches, each mapped to a 768-dimensional embedding and augmented with learnable 2D positional embeddings. Text is encoded with byte pair encoding (BPE) over a standard 30k vocabulary, with each token mapped to a 768-dimensional embedding plus a 1D positional encoding.
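A minimal sketch of this token construction, assuming a PyTorch-style implementation (module and variable names are illustrative, not taken from the paper), is shown below; for simplicity the image and text tokens are concatenated rather than interleaved:

```python
# Hedged sketch of GeoMELT-style token construction (PyTorch assumed).
import torch
import torch.nn as nn

D, PATCH, VOCAB, TXT_LEN = 768, 16, 30000, 64   # hidden size, patch side, approx. vocab, text length

patch_embed = nn.Conv2d(3, D, kernel_size=PATCH, stride=PATCH)   # 16x16 patches -> 768-d tokens
pos_2d = nn.Parameter(torch.zeros(1, (224 // PATCH) ** 2, D))    # learnable 2D positional embeddings
tok_embed = nn.Embedding(VOCAB, D)                               # BPE token embeddings
pos_1d = nn.Parameter(torch.zeros(1, TXT_LEN, D))                # 1D positional encodings for text
cls_tok = nn.Parameter(torch.zeros(1, 1, D))                     # prepended [CLS] token

image = torch.randn(2, 3, 224, 224)                              # batch of RGB images
text = torch.randint(0, VOCAB, (2, TXT_LEN))                     # batch of BPE token ids

img_tokens = patch_embed(image).flatten(2).transpose(1, 2) + pos_2d   # (2, 196, 768)
txt_tokens = tok_embed(text) + pos_1d                                 # (2, 64, 768)
x = torch.cat([cls_tok.expand(2, -1, -1), img_tokens, txt_tokens], dim=1)
print(x.shape)   # torch.Size([2, 261, 768]) -> fed to the 12 transformer blocks
```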

Tokens from both modalities are interleaved and processed jointly; a prepended [CLS] token is used for pooled representations. Causal attention masks prevent text tokens from attending to future text tokens during generation. The MHSA operation for each attention head is given by

$$\mathrm{head}_i = \mathrm{softmax}\!\left(\frac{Q_i K_i^{\top}}{\sqrt{d/h}}\right) V_i$$

where $Q_i = X W^Q_i$, $K_i = X W^K_i$, $V_i = X W^V_i$, $d$ is the hidden dimension, and $h$ is the number of attention heads (Silva et al., 17 Dec 2025).
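A hedged sketch of a single attention head implementing this formula, with a causal mask applied only among the text positions, is given below (PyTorch assumed; the exact masking scheme is an illustration, not the paper's implementation):

```python
# Hedged sketch of one MHSA head: softmax(Q K^T / sqrt(d/h)) V with a
# text-only causal mask (PyTorch assumed; illustrative, not the paper's code).
import math
import torch
import torch.nn as nn

d, h = 768, 12            # hidden dimension and number of heads
dk = d // h               # per-head dimension d/h = 64
W_q, W_k, W_v = nn.Linear(d, dk, bias=False), nn.Linear(d, dk, bias=False), nn.Linear(d, dk, bias=False)

def attention_head(x, text_start=None):
    """x: (B, N, d). Text tokens (indices >= text_start) may not attend to
    later text tokens; [CLS] and image tokens attend freely."""
    Q, K, V = W_q(x), W_k(x), W_v(x)                    # each (B, N, dk)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(dk)    # (B, N, N)
    if text_start is not None:
        N = x.size(1)
        i = torch.arange(N).unsqueeze(1)                # query positions
        j = torch.arange(N).unsqueeze(0)                # key positions
        block = (i >= text_start) & (j >= text_start) & (j > i)
        scores = scores.masked_fill(block, float("-inf"))
    return torch.softmax(scores, dim=-1) @ V            # (B, N, dk)

out = attention_head(torch.randn(2, 261, d), text_start=197)   # [CLS] + 196 patches, then text
print(out.shape)   # torch.Size([2, 261, 64]); the 12 heads are concatenated back to 768-d
```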

2. Multi-Task Training Objectives

GeoMELT is trained with the following objectives:

  • Text Generation (MLM): Tasks like captioning, VQA, and grounding use a Masked Language Modeling head. A random subset $\mathcal{M}$ of tokens is masked (with probability $p_M$), and the cross-entropy loss is minimized:

$$\mathcal{L}_{\mathrm{MLM}} = -\sum_{i \in \mathcal{M}} \log P_\theta\bigl(y_i \mid \tilde{\mathbf{y}}, I\bigr)$$

Here $\tilde{\mathbf{y}}$ denotes the partially masked token sequence and $I$ the input image. Generation is performed autoregressively by appending a [MASK] token and sampling the predicted tokens iteratively.

  • Cross-Modal Retrieval (InfoNCE): For dual-encoder settings, pooled image ($\mathbf{v} \in \mathbb{R}^d$) and text ($\mathbf{u} \in \mathbb{R}^d$) embeddings are contrasted using normalized cosine similarity. The InfoNCE loss is:

$$\mathcal{L}_{\mathrm{InfoNCE}} = \tfrac{1}{2}\bigl(\mathcal{L}_{\mathrm{t2i}} + \mathcal{L}_{\mathrm{i2t}}\bigr)$$

where $\mathcal{L}_{\mathrm{t2i}}$ and $\mathcal{L}_{\mathrm{i2t}}$ are the text-to-image and image-to-text contrastive terms.

  • Joint Optimization: Dynamic Weight Averaging (DWA) regulates the relative importance of the MLM and InfoNCE losses across training epochs. Task weights $\lambda_i^{(k)}$ are computed adaptively from the recent history of task losses, with the temperature parameter $\gamma$ set to 2.0; a minimal sketch of the combined objective appears below.

This unified objective allows GeoMELT to learn representations that serve both retrieval and text generation, a combination rarely addressed in encoder-only models (Silva et al., 17 Dec 2025).
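The sketch below illustrates how the two task losses and the DWA weighting could be combined in code, assuming a PyTorch implementation. The contrastive temperature, the epoch-level loss bookkeeping, and all function names are illustrative assumptions rather than details taken from the paper.

```python
# Hedged sketch of the joint objective: masked-token cross-entropy, symmetric
# InfoNCE, and DWA task weights (PyTorch assumed; names are illustrative).
import torch
import torch.nn.functional as F

def mlm_loss(logits, targets, mask):
    """logits: (B, N, V); targets: (B, N); mask: (B, N) bool marking masked positions."""
    return F.cross_entropy(logits[mask], targets[mask])

def info_nce(img_emb, txt_emb, temperature=0.07):
    """Symmetric contrastive loss over L2-normalised pooled embeddings (B, d).
    The temperature value here is a common default, not the paper's setting."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    sim = img @ txt.t() / temperature                   # (B, B) cosine similarities
    labels = torch.arange(sim.size(0), device=sim.device)
    loss_i2t = F.cross_entropy(sim, labels)             # image -> matching text
    loss_t2i = F.cross_entropy(sim.t(), labels)         # text -> matching image
    return 0.5 * (loss_i2t + loss_t2i)

def dwa_weights(prev_losses, curr_losses, gamma=2.0):
    """Dynamic Weight Averaging: tasks whose loss decreases more slowly get
    larger weights. Inputs are per-task losses from the two previous epochs;
    gamma is the temperature (2.0 as reported)."""
    ratios = curr_losses / prev_losses.clamp_min(1e-8)
    return len(curr_losses) * torch.softmax(ratios / gamma, dim=0)

# Example: weight the two task losses adaptively at epoch k.
prev = torch.tensor([2.3, 1.1])        # [MLM, InfoNCE] losses at epoch k-2
curr = torch.tensor([2.0, 1.0])        # [MLM, InfoNCE] losses at epoch k-1
lam = dwa_weights(prev, curr)          # lambda_i^(k), summing to the number of tasks
# total_loss = lam[0] * mlm_loss(...) + lam[1] * info_nce(...)
```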

3. Datasets, Training Protocol, and Implementation

GeoMELT is trained on approximately 282,000 image–text pairs drawn from a mixture of remote sensing datasets, including NWPU-Captions, RSICD, UCM, Sydney, and RSITMD for captioning; RSICap and VRSBench for detailed captioning; RSVQA-HR and RSVQA-LR for VQA; and DIOR-RSVG for visual grounding.

Training proceeds in two stages:

  1. Stage 1: Warm-up on DIOR-RSVG (visual grounding) for 30 epochs (batch size 64, learning rate $4\times 10^{-5}$), including a 1-epoch linear learning-rate warmup.
  2. Stage 2: Multi-task training on the complete mixture for 15 epochs (batch size 128, learning rate $2\times 10^{-4}$, AdamW optimizer).

Initialization uses a BEiT-3 checkpoint pretrained for CLIP-style retrieval tasks. After both training stages, the weights are merged using a simple WiSE-FT fusion (equal averaging, $\alpha = 0.5$) to balance retrieval and generation capabilities. GeoMELT requires approximately 16 GFLOPs per forward pass (for 224×224 images) and around 8 GB of GPU memory at batch size 64 (Silva et al., 17 Dec 2025).
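A minimal sketch of the equal-averaging merge, assuming standard PyTorch state_dicts (the checkpoint file names are hypothetical):

```python
# Hedged sketch of WiSE-FT-style checkpoint merging (PyTorch assumed).
import torch

def wise_ft_merge(state_a, state_b, alpha=0.5):
    """Linear interpolation of two checkpoints with identical architectures:
    theta = alpha * theta_a + (1 - alpha) * theta_b; alpha=0.5 is the equal
    averaging described above."""
    return {k: alpha * state_a[k] + (1 - alpha) * state_b[k] for k in state_a}

# Usage (file names are hypothetical placeholders):
# retrieval_sd = torch.load("beit3_retrieval_init.pth", map_location="cpu")
# multitask_sd = torch.load("geomelt_stage2.pth", map_location="cpu")
# model.load_state_dict(wise_ft_merge(retrieval_sd, multitask_sd, alpha=0.5))
```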

Task             | Datasets Used          | #Train Examples | #Test Examples
Image Captioning | NWPU, RSICD, UCM, etc. | 126,000–2,485   | 15,750–290
VQA              | RSVQA-HR, RSVQA-LR     | 20,000–10,000   | 131,468–7,057
Visual Grounding | DIOR-RSVG              | 26,991          | 7,500

4. Experimental Results and Benchmarks

GeoMELT achieves strong results compared to both compact specialist models and very large vision-language models:

  • Cross-Modal Retrieval (RSICD, UCM, RSITMD): Outperforms specialized small models and larger CLIP-based variants. Example Recall@1/5/10 for RSICD (GeoMELT): I→T: 22.87/44.10/56.63; T→I: 17.20/43.59/59.93.
  • Image Captioning (RSICD): BLEU-1: 0.610, BLEU-4: 0.365, CIDEr: 2.652, on par with or better than much larger models (e.g., RSGPT at 13B parameters, GeoChat at 7B).
  • Long Captioning (VRSBench): BLEU-1: 45.8, CIDEr: 29.8 versus 48.1/33.9 (LLaVA-1.5, 7B) and 47.6/33.5 (Mini-Gemini, 7B).
  • Visual Grounding: On DIOR-RSVG, GeoMELT achieves mIoU=65.95 at 224×224, outperforming many LVLMs that use higher image resolutions.
  • VQA: Comparable or superior on presence-type questions (RSVQA-HR), but not on reasoning-intensive compare-type questions (where large LLMs dominate).
  • Zero-Shot Classification: 73.78% average top-1 accuracy across 12 remote-sensing datasets, ~8 points better than RemoteCLIP-L.
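The zero-shot protocol is not detailed in the article; a common dual-encoder approach, sketched below under that assumption, scores each image against text embeddings of prompted class names (the prompt template, class names, and encoder calls are illustrative placeholders):

```python
# Hedged sketch of CLIP-style zero-shot classification with pooled embeddings
# (PyTorch assumed; placeholders stand in for the actual GeoMELT encoders).
import torch
import torch.nn.functional as F

class_names = ["airport", "beach", "forest", "harbor"]           # example classes
prompts = [f"an aerial image of a {c}" for c in class_names]     # assumed prompt template

# Placeholders for pooled, L2-normalised embeddings from the model:
image_emb = F.normalize(torch.randn(8, 768), dim=-1)             # would be model.encode_image(images)
text_emb = F.normalize(torch.randn(len(prompts), 768), dim=-1)   # would be model.encode_text(prompts)

logits = image_emb @ text_emb.t()                                # cosine similarity to each class
pred = logits.argmax(dim=-1)                                     # top-1 class index per image
print([class_names[i] for i in pred.tolist()])
```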

GeoMELT thus demonstrates that encoder-only models, when properly trained and fused, can unify retrieval and captioning/generation without incurring the prohibitive compute costs of LVLMs (Silva et al., 17 Dec 2025).

5. Model Efficiency, Ablations, and Limitations

Ablation results reveal that starting from retrieval-focused checkpoints yields strong retrieval but severely underperforms on grounding. Conversely, grounding-only fine-tuning improves mIoU but hinders retrieval and captioning. The two-stage protocol with WiSE-FT merging restores state-of-the-art performance across tasks.

GeoMELT's architecture and training regimen offer substantial efficiency gains: approximately 16 GFLOPs per forward pass and inference times down to about 40 ms per image on an A100 GPU. This is orders of magnitude faster, and uses well over an order of magnitude fewer parameters, than LVLMs of 7B parameters or more.

Limitations include:

  • Inferior performance on reasoning-heavy VQA tasks compared to large LLMs, due to the absence of a generative decoder or explicit reasoning modules.
  • Reduced accuracy on very low-resolution Sentinel-2 imagery (RSVQA-LR), indicating a need for additional low-resolution training data (Silva et al., 17 Dec 2025).

6. Extensions and Future Directions

Planned directions include:

  • Adapting GeoMELT to support multisensor and variable spatial resolution modalities, such as multispectral and radar inputs.
  • Introducing change-captioning and composed image retrieval to leverage the cross-encoder in complex scene understanding.
  • Using GeoMELT as a fast retrieval/scoring module within larger LVLM pipelines, particularly in retrieval-augmented generation frameworks.
  • Incorporating lightweight reasoning augmentations (e.g., adapters, LoRA) to address the VQA reasoning gap (Silva et al., 17 Dec 2025).

7. Significance within Remote Sensing Vision-Language Research

GeoMELT represents a step towards practical, multi-task multimodal learning in remote sensing with tractable computational demands. It unifies generative and retrieval tasks within a single encoder framework, leveraging dynamic loss balancing and fused checkpoints. The model's design demonstrates that extensive scaling of parameters is not strictly required to close much of the gap in remote sensing vision-language benchmarks, provided data curation, architecture specialization, and training protocols are meticulously tuned.

GeoMELT's architecture and training methods offer a robust, scalable foundation for future work in efficient multimodal learning, with implications for both research and practical deployment in geospatial analytics (Silva et al., 17 Dec 2025).
