OceanSAR-2: Compact ViT for SAR Ocean Analysis
- OceanSAR-2 is a compact Vision Transformer-based foundation model designed for SAR ocean observation tasks, offering enhanced transferability and computational efficiency.
- It integrates advanced self-supervised learning objectives—including DINO v2, iBOT, and KoLeo regularization—to optimize feature extraction from Sentinel-1 Wave Mode imagery.
- The model employs dynamic data curation and a parameter-efficient design (<21M parameters) to achieve state-of-the-art performance in classification, regression, and object detection benchmarks.
OceanSAR-2 is a compact Vision Transformer (ViT)-based foundation model for synthetic aperture radar (SAR) ocean observation tasks, designed to serve as a universal feature extractor for Sentinel-1 Wave Mode imagery. Introduced as the successor to OceanSAR-1, it improves transferability, data efficiency, and computational resource usage through a combination of upgraded self-supervised learning objectives and dynamic data curation strategies. OceanSAR-2 demonstrates state-of-the-art performance across classification, regression, and object detection benchmarks in ocean SAR analysis, and is released with standardized datasets and protocols to facilitate systematic evaluation and comparison (Tuel et al., 12 Jan 2026).
1. Model Architecture and Representation
OceanSAR-2 adopts a Vision Transformer architecture specifically adapted for SAR data. Compared to OceanSAR-1, three central modifications define the model:
- Input Representation: Each input vignette is standardized to calibrated backscatter in decibels or linear units, supplanting the use of raw digital number (DN) amplitudes.
- ViT Backbone: The model employs a ViT-S-16 backbone (patch size ), with transformer blocks. Patch embeddings and class token are L2-normalized prior to projection.
- Forward Pass:
The final class token serves as the input for downstream probing or fine-tuning.
The overall parameter count is advertised as less than 21 million, positioning OceanSAR-2 as one of the most parameter-efficient SAR-native backbones for large-scale ocean applications.
2. Self-Supervised Learning Objectives
OceanSAR-2 utilizes the DINO v2 student–teacher framework, which incorporates multiple objective terms:
- DINO Global Cross-Entropy Loss: For multi-crop views , student and teacher projections and each yield normalized logits. The cross-entropy between teacher (for view ) and student (for view ) softmax predictions over prototypes is:
- iBOT Patch Prediction Loss: Patch-level consistency is enforced via the iBOT loss:
- KoLeo Prototype Regularizer: To ensure diversity across learned prototypes, the KoLeo regularizer maximizes entropy of teacher assignments:
- Overall Objective: The components are combined into a total self-supervised objective:
with weighting factors tuned via experiment.
This integrated objective demonstrably improves convergence and downstream generalization relative to prior DINO-based protocols.
3. Dynamic Data Curation
OceanSAR-2 implements a dynamic pruning strategy for enhanced data diversity during training cycles:
- Embedding-Based Selection: For each candidate sample , a minimal-distance criterion in teacher embedding space
identifies samples that expand the diversity of the retained training set.
- Subsampling Policy: The samples with the highest scores are selected to ensure underrepresented oceanographic phenomena are well-covered, thereby avoiding redundancy from "pure-ocean" scenes and accelerating convergence.
No closed-form sampling distribution or explicit threshold for is specified, as explicit guidelines are not tabulated in the reference.
4. Pretraining Setup and Computational Resources
OceanSAR-2 is pretrained on millions of pixel crops from the Sentinel-1A/B/C/D Wave Mode archive, calibrated to . Training details include:
- Training Epochs and Batch Sizes: DINO v2 defaults are followed (hundreds of epochs, cosine-decay learning rate, several thousand crops per batch aggregating multi-crop views).
- Compute Environment: Typically, modest clusters of commodity GPUs (8–16 NVIDIA A100s) over several days.
- Model Efficiency: Thanks to dynamic pruning, dimensionality reduction of embeddings (), and improved loss functions, pretraining cost is reduced by an estimated 30–50% compared to OceanSAR-1, though exact FLOP or GPU-hour savings are not numerically specified.
- Model Scale: The total parameter count remains under 21 million, which is substantially smaller than contemporary end-to-end models for SAR or multimodal vision tasks.
5. Downstream Evaluation and Benchmark Results
Performance is evaluated using the SAR Ocean Workbench—a suite of four standardized benchmarks:
| Task | Dataset | Metric | OceanSAR-2 (Zero-shot / Fine-tuned) | Best Baseline (Zero-shot / Fine-tuned) |
|---|---|---|---|---|
| TenGeoP (10-class geo. patterns) | 37,553 | Acc [%] | 94.0 / 98.5 | DINO v3: 91.9 / 98.5; WV-Net: 91.5 / 98.3 |
| WV-SWH (wave height regression) | 50,000 | RMSE [m] | 0.52 / 0.40 | DINO v3: 0.55 / 0.39; WV-Net: 0.64 / 0.427 |
| WV-wind (surf. wind regression) | 50,000 | RMSE [m/s]; Dir MAE | 1.32 / 1.01; 16.9 (MAE,°) | WV-Net: 1.71 / 1.23; 21.4; DINO v3: 1.68 / 1.12; 17.9 |
| YOLOIB (iceberg detection) | 2,062 | F₁@IoU≥0.1 | -- / 0.865 | WV-Net: -- / 0.855 |
OceanSAR-2 delivers highest or co-highest accuracy in all evaluated tasks, both in zero-shot (k-NN probe) and fine-tuning settings, and surpasses prior domain-specific and generic baselines including TerraMind, WV-Net (Glaser et al., 2024), and DINO v3.
6. Benchmarks, Protocols, and Open Resources
OceanSAR-2 is coupled with four standardized datasets (SAR Ocean Workbench): TenGeoP (classification), WV-SWH (wave height regression), WV-wind (wind regression), and YOLOIB (iceberg detection). Datasets are derived from the public Sentinel-1 Wave Mode archive, using reproducible co-location and splitting protocols.
- Prescribed Protocols:
- Zero-shot probing: k-NN or linear/logistic regression on frozen class token.
- Fine-tuning: Task-specific heads (small MLPs for classification/regression, DETR-style heads for detection) for a set schedule.
- Open-Source Resources: Code for data loading, split construction, standardized evaluation metrics, and scripts are publicly available, providing a foundation for reproducible benchmarking across future SAR foundation models.
7. Quantitative Improvements and Impact
Relative to OceanSAR-1 and other leading models, OceanSAR-2 provides the following benchmarked improvements:
- Classification Accuracy (TenGeoP, zero-shot k-NN): Improvement from ~92% (OceanSAR-1) to 94%.
- Wave Height RMSE (WV-SWH, zero-shot): Decrease from ≃0.60 m (OceanSAR-1) to 0.52 m.
- Wind Speed RMSE (WV-wind, zero-shot): Reduction from ≃1.5 m/s to 1.32 m/s.
- Fine-tuned Performance: 5–10% relative improvement, e.g., wave height RMSE drops from 0.45 m to 0.40 m.
- Resource Efficiency: Estimated 30–50% reduction in pretraining GPU-hours compared to the first model generation.
The cumulative effect is a highly transferable, SAR-native ViT backbone (21 M parameters) supporting robust, efficient, and reproducible SAR-ocean downstream analytics, matching or exceeding much larger models and enabling systematic comparison through standardized tasks and metrics (Tuel et al., 12 Jan 2026).