REMSA: Remote Sensing Model Selection Agent
- REMSA is a language model-based agent that automates the selection of remote sensing foundation models using natural language queries.
- It integrates a rigorously structured Remote-Sensing Foundation Model Database with semantic retrieval built on Sentence-BERT embeddings and FAISS indexing.
- The infrastructure enforces data integrity through schema validation, confidence thresholds, and rule-based constraints to ensure transparent model selection.
REMSA refers to an LLM-based agent and associated infrastructure for foundation model selection in remote sensing, introduced as part of a unified framework that combines the Remote-Sensing Foundation Model Database (RS-FMD) with advanced retrieval and reasoning components. REMSA addresses the acute need for transparent, automated, and constraint-aware selection of Remote Sensing Foundation Models (RSFMs) in a domain characterized by heterogeneous modality support, varied metadata formats, and task-specific performance requirements (Chen et al., 21 Nov 2025).
1. Scope and Definition
REMSA (Remote-sensing foundation-model Selection Agent) is the first LLM-based agent system constructed for automated RSFM selection via natural language queries. It is built upon the RS-FMD, a curated, schema-verified database documenting over 150 publicly released foundation models within the remote sensing (RS) domain. RS-FMD encompasses models spanning multiple data modalities (optical, multispectral, hyperspectral, SAR, LiDAR, and image-text), resolutions, and pretraining paradigms. REMSA interprets unstructured user instructions, resolves missing or ambiguous constraints, retrieves candidate models, ranks them with transparent justification, and outputs rationales for each selection (Chen et al., 21 Nov 2025).
2. RS-FMD: Structure, Schema, and Coverage
RS-FMD (“Remote-Sensing Foundation Model Database”) provides the structured backbone for REMSA. Each entry is represented as a JSON object conforming to a rigorously defined schema (type-checked via pydantic and detailed in Appendix A of (Chen et al., 21 Nov 2025)). The schema consists of high-granularity top-level fields and two nested structures for pretraining phases and evaluation benchmarks.
Key Top-Level Fields:
- model_id, model_name, version, release_date, last_updated
- short_description, paper_link, citations, repository, weights
- backbone, num_layers, num_parameters
- pretext_training_type, masking_strategy, pretraining
- domain_knowledge, backbone_modifications, supported_sensors
- modality_integration_type, modalities
- spectral_alignment, temporal_alignment
- spatial_resolution, temporal_resolution, bands
Nested Structures (abbreviated):
- PretrainingPhase (list): dataset, regions_coverage, time_range, num_images, token_size, image_resolution, epochs, batch_size, learning_rate, augmentations, processing, sampling, processing_level, cloud_cover, missing_data, masking_ratio
- Benchmark (list): task, application, dataset, metrics, metrics_value, sensor, regions, original_samples, num_samples, sampling_percentage, num_classes, classes, image_resolution, spatial_resolution, bands_used, augmentations, optimizer, batch_size, learning_rate, epochs, loss_function, split_ratio
LaTeX-formatted schema tables (see Appendix A in (Chen et al., 21 Nov 2025)) formalize field types and constraints for data integrity.
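To make the schema's shape concrete, here is a minimal pydantic sketch covering a handful of the fields above. The field names follow the schema, but the types, optionality, and defaults are assumptions rather than the published definition (see Appendix A for the authoritative version):

```python
# Minimal, illustrative sketch of an RS-FMD record using pydantic v2.
from typing import Optional
from pydantic import BaseModel, ConfigDict

class Benchmark(BaseModel):
    task: str
    dataset: str
    metrics: list[str]
    metrics_value: list[float]  # parallel to `metrics`

class PretrainingPhase(BaseModel):
    dataset: str
    num_images: Optional[int] = None
    epochs: Optional[int] = None

class RSFMRecord(BaseModel):
    # Allow field names starting with "model_" (pydantic v2 reserves
    # that namespace by default).
    model_config = ConfigDict(protected_namespaces=())

    model_id: str
    model_name: str
    modalities: list[str]
    supported_sensors: list[str] = []
    pretraining: list[PretrainingPhase] = []
    benchmarks: list[Benchmark] = []
```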
Coverage:
- ≃150 RSFMs
- Modalities: Optical (RGB), Multispectral, Hyperspectral, SAR, LiDAR, Vision–Language
- Supported tasks (as documented in nested Benchmarks): Semantic segmentation, classification, change detection, visual question answering
3. Model Categorization and Metrics
RSFMs in RS-FMD are stratified by both learning paradigm and application task:
Learning Paradigms:
- Supervised (fine-tuned on labeled RS data)
- Self-supervised/masked modeling (MAE-style pretraining)
- Contrastive Vision–Language (e.g., CLIP-style on image–text pairs)
- Multimodal fusion architectures integrating heterogeneous modalities
Relevant schema fields: pretext_training_type, masking_strategy, modality_integration_type.
Tasks and Evaluation:
- Each Benchmark object in the schema captures a unique combination of task (e.g., change detection), dataset, sensor, metric(s), and result value(s).
- Key metrics include Intersection-over-Union (IoU):

$$\mathrm{IoU} = \frac{TP}{TP + FP + FN}$$

and F1-score:

$$F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$

- Parallel arrays (metrics, metrics_value) associate each metric with its measured value for each task.
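As a reference implementation of the two formulas (not REMSA code), both metrics follow directly from confusion-matrix counts:

```python
def iou(tp: int, fp: int, fn: int) -> float:
    # Intersection-over-Union: overlap divided by the union of
    # prediction and ground truth (assumes tp + fp + fn > 0).
    return tp / (tp + fp + fn)

def f1(tp: int, fp: int, fn: int) -> float:
    # Harmonic mean of precision and recall.
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```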
4. Query Processing and Retrieval Pipeline
REMSA operationalizes model selection through a multi-stage information processing pipeline:
Data Structures and Indexing:
- Records are stored in JSONL, validated via pydantic, and versioned with DVC.
- Each model record is embedded using Sentence-BERT, incorporating specialized prefix tokens (e.g., [MODALITY], [APPLICATION]) to enhance semantic retrieval.
- Embeddings are indexed in FAISS for efficient approximate nearest-neighbor search based on cosine similarity.
- Explicit rule-based filters enforce hard constraints (e.g., required modality, minimum spatial resolution, performance thresholds).
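The sketch below illustrates this indexing step under stated assumptions: it uses the sentence-transformers and faiss packages, a stand-in encoder name, and an invented record; the prefix scheme mirrors the idea above but is not REMSA's exact configuration.

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

records = [
    {"model_name": "ExampleFM", "modalities": ["SAR"],
     "application": "change detection"},  # hypothetical RS-FMD record
]

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in encoder

def record_to_text(rec: dict) -> str:
    # Prefix tokens ([MODALITY], [APPLICATION]) tag metadata fields,
    # mirroring the strategy of sharpening semantic retrieval.
    return (f"[MODALITY] {', '.join(rec['modalities'])} "
            f"[APPLICATION] {rec['application']} {rec['model_name']}")

# Normalized embeddings make inner product equal to cosine similarity.
vecs = encoder.encode([record_to_text(r) for r in records],
                      normalize_embeddings=True)
index = faiss.IndexFlatIP(vecs.shape[1])
index.add(np.asarray(vecs, dtype="float32"))

# Hard constraints run as rule-based filters before ranking,
# e.g. dropping candidates that lack the required modality.
def hard_filter(recs: list[dict], required_modality: str) -> list[dict]:
    return [r for r in recs if required_modality in r["modalities"]]
```

A query is encoded the same way and matched with index.search before the rule-based filters and ranking stages run.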
Semantic Parsing and Retrieval:
- Free-text user queries are parsed and normalized into a structured constraint dictionary (JSON) with fields such as application, modality, sensor, spatial_resolution, priority_metrics, and min_performance.
- Example structured query:
```json
{
  "application": "change detection",
  "modality": "SAR",
  "sensor": ["Sentinel-1"],
  "spatial_resolution": 10,
  "priority_metrics": ["F1"],
  "min_performance": {
    "metric": ["F1"],
    "value": [0.80]
  }
}
```

- The retrieval phase applies hard filters, then ranks the remaining candidates with a hybrid similarity and LLM-based scoring function:

$$\mathrm{score}(m) = \lambda \cdot s_{\mathrm{emb}}(m) + (1 - \lambda) \cdot s_{\mathrm{LLM}}(m)$$

where $s_{\mathrm{emb}}(m)$ is the embedding similarity from FAISS, $s_{\mathrm{LLM}}(m)$ is the ordinal ranking from in-context LLM prompts, and $\lambda$ is a task-tunable balance factor.
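A one-line sketch of this combination, assuming both scores are normalized to [0, 1]; the default value of lambda here is illustrative, not the paper's tuned setting:

```python
def hybrid_score(sim_emb: float, rank_llm: float, lam: float = 0.5) -> float:
    # lam is the task-tunable balance factor from the formula above;
    # rank_llm is assumed to be an ordinal rank rescaled to [0, 1].
    return lam * sim_emb + (1 - lam) * rank_llm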
Explanation, Clarification, and Output:
- If rule-based filtering yields a candidate set that is too large or too small, REMSA's Clarification Generator poses up to three disambiguation questions to the user.
- The top-k results are returned through an Explanation Generator, which produces a JSON array listing each model, its rationale, and documentation links.
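An illustrative shape for this output, in the spirit of the structured query example above; the field names and values here are assumptions, not the exact output schema:

```json
[
  {
    "model_name": "ExampleFM",
    "rationale": "Matches the SAR modality and Sentinel-1 sensor constraints and reports F1 = 0.83 on a change-detection benchmark.",
    "paper_link": "https://example.org/paper",
    "repository": "https://example.org/repo"
  }
]
```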
5. Agent Integration, Ranking, and Confidence
Pipeline Overview:
- Interpreter: Converts natural-language query to structured schema.
- Retriever: FAISS+Sentence-BERT produces similarity-based shortlist.
- Ranker: LLM ranks candidates using in-context learning and structured metadata.
- Explanation system: Provides reasoned output, with model metadata and links.
Confidence Mechanism:
- During database population, each field $f$ receives an extraction-confidence score $c_f \in [0, 1]$, with a mandatory threshold $\tau$ for automatic acceptance.
- Only fields with $c_f \geq \tau$ are admitted automatically; all others are flagged for human review at the field level.
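A minimal sketch of this gating step; the threshold value is an assumption, since the paper's exact $\tau$ is not quoted here:

```python
TAU = 0.9  # assumed acceptance threshold, for illustration only

def triage_fields(confidences: dict[str, float], tau: float = TAU):
    # Fields at or above tau are auto-accepted; the rest are flagged
    # for field-level human review, mirroring the described mechanism.
    accepted = {f: c for f, c in confidences.items() if c >= tau}
    flagged = {f: c for f, c in confidences.items() if c < tau}
    return accepted, flagged
```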
Consistency constraints:
- Arrays like benchmarks.metrics and benchmarks.metrics_value are enforced to be of equal length.
- modalities must be a subset of the enumerated valid types: {Optical, Multispectral, Hyperspectral, SAR, LiDAR, Text}.
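These two constraints translate naturally into pydantic v2 validators; the enum contents follow the text, while the validator wiring below is an illustrative sketch rather than REMSA's actual code:

```python
from pydantic import BaseModel, model_validator

VALID_MODALITIES = {"Optical", "Multispectral", "Hyperspectral",
                    "SAR", "LiDAR", "Text"}

class BenchmarkEntry(BaseModel):
    metrics: list[str]
    metrics_value: list[float]

    @model_validator(mode="after")
    def metrics_aligned(self):
        # Parallel arrays must stay in one-to-one correspondence.
        if len(self.metrics) != len(self.metrics_value):
            raise ValueError("metrics and metrics_value must have equal length")
        return self

class ModelEntry(BaseModel):
    modalities: list[str]

    @model_validator(mode="after")
    def modalities_valid(self):
        # modalities must be a subset of the enumerated valid types.
        unknown = set(self.modalities) - VALID_MODALITIES
        if unknown:
            raise ValueError(f"unknown modalities: {unknown}")
        return self
```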
6. Extension, Maintenance, and Guarantees
Update and Addition Pipeline:
- Automated metadata extraction: LLM-based (OneKE-inspired) methods parse papers, model cards, and repositories to generate candidate JSON schemas.
- Community submission interface: Authors may upload new documentation, triggering auto-extraction and verification UI for schema validation.
- Versioning: All records managed with DVC for robust audit tracking.
Quality Guarantees:
- Schema validation is mandatory for any new record.
- Confidence thresholding and type-checking are implemented systematically.
- Consistency and domain-specific constraints (e.g., modalities, metrics alignment) are verified at ingestion time.
7. Significance and Benchmarking
By linking a schema-rich, auto-updating RSFM database with LLM-driven retrieval, REMSA enables fully automated, transparent model selection for use cases such as environmental monitoring, change detection, disaster assessment, and semantic land-cover mapping. REMSA is benchmarked on 75 expert-verified RS query scenarios, yielding 900 query-model configurations. Empirically, it outperformed naive selection agents, dense retrieval methods, and unstructured RAG-based LLMs (Chen et al., 21 Nov 2025). The infrastructure operates exclusively on public metadata, ensuring compatibility with open-science mandates and avoiding any handling of sensitive data. This approach provides a scalable, extensible template for foundation model selection agents across data-driven scientific domains.