OceanSAR-2: Compact ViT for SAR Ocean Analysis

Updated 20 March 2026
  • OceanSAR-2 is a compact Vision Transformer-based foundation model designed for SAR ocean observation tasks, offering enhanced transferability and computational efficiency.
  • It integrates advanced self-supervised learning objectives—including DINO v2, iBOT, and KoLeo regularization—to optimize feature extraction from Sentinel-1 Wave Mode imagery.
  • The model employs dynamic data curation and a parameter-efficient design (<21M parameters) to achieve state-of-the-art performance in classification, regression, and object detection benchmarks.

OceanSAR-2 is a compact Vision Transformer (ViT)-based foundation model for synthetic aperture radar (SAR) ocean observation tasks, designed to serve as a universal feature extractor for Sentinel-1 Wave Mode imagery. Introduced as the successor to OceanSAR-1, it improves transferability, data efficiency, and computational resource usage through a combination of upgraded self-supervised learning objectives and dynamic data curation strategies. OceanSAR-2 demonstrates state-of-the-art performance across classification, regression, and object detection benchmarks in ocean SAR analysis, and is released with standardized datasets and protocols to facilitate systematic evaluation and comparison (Tuel et al., 12 Jan 2026).

1. Model Architecture and Representation

OceanSAR-2 adopts a Vision Transformer architecture specifically adapted for SAR data. Compared to OceanSAR-1, three central modifications define the model:

  • Input Representation: Each input vignette $x\in\mathbb R^{H\times W}$ is standardized to calibrated backscatter $\sigma^0$ in decibels or linear units, supplanting the use of raw digital number (DN) amplitudes.
  • ViT Backbone: The model employs a ViT-S-16 backbone (patch size $16 \times 16$), with $L=12$ transformer blocks. Patch embeddings $p_i\in\mathbb R^{384}$ and class token $z_{\mathrm{cls}}\in\mathbb R^{384}$ are L2-normalized prior to projection.
  • Forward Pass:

$$Z^{(0)} = [z_{\mathrm{cls}}; p_1+\mathrm{PE}_1; \dots; p_N+\mathrm{PE}_N]$$

$$Z^{(\ell+1)} = \mathrm{TransformerBlock}_\ell(Z^{(\ell)}),\quad \ell=0,\ldots,L-1$$

The final class token $z_{\mathrm{cls}}^{(L)}$ serves as the input for downstream probing or fine-tuning.
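As a rough illustration of the input handling and token layout described above, the sketch below converts linear backscatter to decibels and counts the tokens that a patch size of 16 produces. The function names, the numerical floor, and the 256×256 example size are illustrative assumptions and are not taken from the OceanSAR-2 code base.

```python
import math

def sigma0_to_db(sigma0_linear: float, floor: float = 1e-10) -> float:
    """Convert linear sigma^0 to decibels; `floor` is an assumed guard against log(0)."""
    return 10.0 * math.log10(max(sigma0_linear, floor))

def token_sequence_shape(height: int, width: int, patch: int = 16, dim: int = 384):
    """Return (N + 1, dim): N patch tokens plus one class token, as in Z^(0)."""
    if height % patch or width % patch:
        raise ValueError("image sides must be multiples of the patch size")
    n_patches = (height // patch) * (width // patch)
    return (n_patches + 1, dim)

print(sigma0_to_db(0.01))              # about -20 dB
print(token_sequence_shape(256, 256))  # (257, 384)
```

For a 256×256 vignette this gives 256 patch tokens plus the class token, each of width 384.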

The overall parameter count is advertised as less than 21 million, positioning OceanSAR-2 as one of the most parameter-efficient SAR-native backbones for large-scale ocean applications.

2. Self-Supervised Learning Objectives

OceanSAR-2 utilizes the DINO v2 student–teacher framework, which incorporates multiple objective terms:

  • DINO Global Cross-Entropy Loss: For multi-crop views $\{\tilde{x}_v\}$, student and teacher projections $f_s(\cdot)$ and $f_t(\cdot)$ each yield normalized logits. The cross-entropy between teacher (for view $v'$) and student (for view $v$) softmax predictions over $K$ prototypes is:

$$\mathcal L_\mathrm{DINO} = -\frac{1}{V_s}\sum_{v\in \mathcal V_s} \sum_{k=1}^K q_{t,k}^{(v')}\log p_{s,k}^{(v)}$$

  • iBOT Patch Prediction Loss: Patch-level consistency is enforced via the iBOT loss:

$$\mathcal L_\mathrm{iBOT} = \frac{1}{M}\sum_{m=1}^M \mathrm{KL}\left(\mathrm{softmax}(f_t^m/\tau)\;\|\;\mathrm{softmax}(f_s^m/\tau)\right)$$

  • KoLeo Prototype Regularizer: To ensure diversity across learned prototypes, the KoLeo regularizer maximizes entropy of teacher assignments:

$$\mathcal L_\mathrm{KoLeo} = -\frac{1}{K}\sum_{k=1}^K \log\|c_k\|_2$$

  • Overall Objective: The components are combined into a total self-supervised objective:

$$\mathcal L = \mathcal L_\mathrm{DINO} + \lambda_1\,\mathcal L_\mathrm{iBOT} + \lambda_2\,\mathcal L_\mathrm{KoLeo}$$

with weighting factors $\lambda_1, \lambda_2$ tuned via experiment.

This integrated objective is reported to improve convergence and downstream generalization relative to prior DINO-based protocols.
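To make the combination of terms concrete, here is a minimal pure-Python sketch of the three losses and their weighted sum, operating on small lists rather than tensors. All function names, the temperature, and the default weights are placeholders chosen for illustration. The KoLeo term follows the formula exactly as printed in this section; note that the KoLeo regularizer in DINO v2 is usually stated in terms of nearest-neighbour distances between features in a batch, so the printed form may be a simplification.

```python
import math

def softmax(logits, tau=1.0):
    """Temperature-scaled softmax over a list of floats."""
    m = max(l / tau for l in logits)
    exps = [math.exp(l / tau - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(q, p):
    """-sum_k q_k log p_k: teacher distribution q against student distribution p."""
    return -sum(qk * math.log(pk) for qk, pk in zip(q, p) if qk > 0)

def kl_divergence(q, p):
    """KL(q || p) for two discrete distributions."""
    return sum(qk * math.log(qk / pk) for qk, pk in zip(q, p) if qk > 0)

def dino_loss(teacher_probs, student_probs):
    """Average cross-entropy over paired (teacher view, student view) distributions."""
    pairs = list(zip(teacher_probs, student_probs))
    return sum(cross_entropy(q, p) for q, p in pairs) / len(pairs)

def ibot_loss(teacher_patch_logits, student_patch_logits, tau=0.1):
    """Mean KL between teacher and student softmax outputs over M patches."""
    kls = [kl_divergence(softmax(t, tau), softmax(s, tau))
           for t, s in zip(teacher_patch_logits, student_patch_logits)]
    return sum(kls) / len(kls)

def koleo_term(prototypes):
    """-(1/K) sum_k log ||c_k||_2, following the formula as written in the text."""
    norms = [math.sqrt(sum(v * v for v in c)) for c in prototypes]
    return -sum(math.log(n) for n in norms) / len(norms)

def total_loss(l_dino, l_ibot, l_koleo, lambda1=1.0, lambda2=0.1):
    """L = L_DINO + lambda1 * L_iBOT + lambda2 * L_KoLeo (weights are placeholders)."""
    return l_dino + lambda1 * l_ibot + lambda2 * l_koleo

uniform = [0.25] * 4
l_dino = dino_loss([uniform], [uniform])                  # log(4) for uniform distributions
l_ibot = ibot_loss([[1.0, 2.0, 3.0]], [[1.0, 2.0, 3.0]])  # 0 when student matches teacher
l_koleo = koleo_term([[1.0, 0.0], [0.0, 1.0]])            # 0 for unit-norm prototypes
print(total_loss(l_dino, l_ibot, l_koleo))
```

In a real training loop the teacher outputs would be treated as constants (no gradient), and the sums would run over batches of views and patches.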

3. Dynamic Data Curation

OceanSAR-2 implements a dynamic pruning strategy for enhanced data diversity during training cycles:

  • Embedding-Based Selection: For each candidate sample $x_i$, a minimal-distance criterion in teacher embedding space

$$d_i = \min_{j<i} \|f_t(x_i) - f_t(x_j)\|_2$$

identifies samples that expand the diversity of the retained training set.

  • Subsampling Policy: The $N_{\mathrm{sub}}$ samples with the highest $d_i$ scores are selected to ensure underrepresented oceanographic phenomena are well-covered, thereby avoiding redundancy from "pure-ocean" scenes and accelerating convergence.

The reference does not specify a closed-form sampling distribution or an explicit value for $N_{\mathrm{sub}}$.
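The selection rule above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the authors' implementation: embeddings are plain tuples, the first sample (which has no predecessor) is given an infinite score, and the names `novelty_scores` and `select_diverse` are hypothetical.

```python
import math

def novelty_scores(embeddings):
    """d_i = min_{j<i} ||e_i - e_j||_2; d_0 is set to inf by assumption."""
    scores = []
    for i, e_i in enumerate(embeddings):
        if i == 0:
            scores.append(math.inf)
            continue
        scores.append(min(math.dist(e_i, e_j) for e_j in embeddings[:i]))
    return scores

def select_diverse(embeddings, n_sub):
    """Return indices of the n_sub samples with the largest d_i, in original order."""
    scores = novelty_scores(embeddings)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_sub])

# Three near-duplicate points around the origin and one distant point.
toy = [(0.0, 0.0), (0.0, 0.1), (5.0, 5.0), (0.05, 0.0)]
print(select_diverse(toy, n_sub=2))  # [0, 2]
```

The near-duplicates at indices 1 and 3 receive small scores and are dropped, which mirrors the stated goal of reducing redundant "pure-ocean" scenes. The quadratic cost of this naive loop would need an approximate nearest-neighbour index at archive scale.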

4. Pretraining Setup and Computational Resources

OceanSAR-2 is pretrained on millions of $256\times256$ pixel crops from the Sentinel-1A/B/C/D Wave Mode archive, calibrated to $\sigma^0$. Training details include:

  • Training Epochs and Batch Sizes: DINO v2 defaults are followed (hundreds of epochs, cosine-decay learning rate, several thousand crops per batch aggregating multi-crop views).
  • Compute Environment: Training typically runs on a modest GPU cluster (8–16 NVIDIA A100s) for several days.
  • Model Efficiency: Thanks to dynamic pruning, dimensionality reduction of embeddings ($n=384$), and improved loss functions, pretraining cost is reduced by an estimated 30–50% compared to OceanSAR-1, though exact FLOP or GPU-hour savings are not numerically specified.
  • Model Scale: The total parameter count remains under 21 million, which is substantially smaller than contemporary end-to-end models for SAR or multimodal vision tasks.
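The cosine-decay learning rate mentioned in the training details can be written as a short function. The sketch below shows the standard cosine schedule only; the base rate, minimum rate, and step count are hypothetical values, since the section does not list the actual hyperparameters.

```python
import math

def cosine_lr(step: int, total_steps: int, base_lr: float, min_lr: float = 0.0) -> float:
    """Cosine decay from base_lr at step 0 to min_lr at total_steps."""
    progress = min(max(step / max(1, total_steps), 0.0), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Hypothetical settings, for illustration only.
for step in (0, 250, 500, 750, 1000):
    print(step, cosine_lr(step, total_steps=1000, base_lr=1e-3, min_lr=1e-6))
```

The rate starts at `base_lr`, passes through the midpoint of the two rates halfway through training, and ends at `min_lr`.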

5. Downstream Evaluation and Benchmark Results

Performance is evaluated using the SAR Ocean Workbench—a suite of four standardized benchmarks:

| Task | Dataset size | Metric | OceanSAR-2 (zero-shot / fine-tuned) | Best baselines (zero-shot / fine-tuned) |
|---|---|---|---|---|
| TenGeoP (10-class geo. patterns) | 37,553 | Acc [%] | 94.0 / 98.5 | DINO v3: 91.9 / 98.5; WV-Net: 91.5 / 98.3 |
| WV-SWH (wave height regression) | 50,000 | RMSE [m] | 0.52 / 0.40 | DINO v3: 0.55 / 0.39; WV-Net: 0.64 / 0.427 |
| WV-wind (surf. wind regression) | 50,000 | RMSE [m/s]; Dir MAE [°] | 1.32 / 1.01; 16.9 | WV-Net: 1.71 / 1.23; 21.4; DINO v3: 1.68 / 1.12; 17.9 |
| YOLOIB (iceberg detection) | 2,062 | F₁@IoU≥0.1 | -- / 0.865 | WV-Net: -- / 0.855 |

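OceanSAR-2 achieves the best or near-best result on every evaluated task, in both zero-shot (k-NN probe) and fine-tuning settings. In the table above it ties DINO v3 on fine-tuned TenGeoP accuracy (98.5%) and trails it marginally on fine-tuned WV-SWH (0.40 m versus 0.39 m RMSE), while leading on all other reported entries. The compared baselines include TerraMind, WV-Net (Glaser et al., 2024), and DINO v3.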
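For readers reproducing the table, the metrics can be computed as in the sketch below. The function names are illustrative, and the directional error assumes that angles are compared with wrap-around at 360°, which the section does not state explicitly.

```python
import math

def rmse(pred, target):
    """Root-mean-square error between two equal-length sequences."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred))

def direction_mae_deg(pred_deg, target_deg):
    """Mean absolute angular error, using the shortest difference in [0, 180] degrees."""
    diffs = [abs((p - t + 180.0) % 360.0 - 180.0) for p, t in zip(pred_deg, target_deg)]
    return sum(diffs) / len(diffs)

def accuracy_percent(pred_labels, true_labels):
    """Share of exactly matching labels, in percent."""
    hits = sum(p == t for p, t in zip(pred_labels, true_labels))
    return 100.0 * hits / len(true_labels)

print(rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
print(direction_mae_deg([350.0], [10.0]))               # 20.0, not 340.0
print(accuracy_percent([0, 1, 2, 2], [0, 1, 2, 3]))     # 75.0
```

Without the wrap-around, a prediction of 350° against a target of 10° would count as a 340° error instead of 20°.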

6. Benchmarks, Protocols, and Open Resources

OceanSAR-2 is coupled with four standardized datasets (SAR Ocean Workbench): TenGeoP (classification), WV-SWH (wave height regression), WV-wind (wind regression), and YOLOIB (iceberg detection). Datasets are derived from the public Sentinel-1 Wave Mode archive, using reproducible co-location and splitting protocols.

  • Prescribed Protocols:
    • Zero-shot probing: k-NN or linear/logistic regression on frozen class token.
    • Fine-tuning: Task-specific heads (small MLPs for classification/regression, DETR-style heads for detection) are trained on a fixed schedule.
  • Open-Source Resources: Code for data loading, split construction, standardized evaluation metrics, and scripts are publicly available, providing a foundation for reproducible benchmarking across future SAR foundation models.
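The k-NN variant of the probing protocol can be sketched as follows. This is a toy illustration on 2-D points standing in for frozen class-token embeddings; the function names, the Euclidean distance, the value of `k`, and the labels are all assumptions rather than details from the released evaluation code.

```python
import math
from collections import Counter

def nearest_indices(train_feats, query, k):
    """Indices of the k training embeddings closest to the query (Euclidean)."""
    order = sorted(range(len(train_feats)), key=lambda i: math.dist(train_feats[i], query))
    return order[:k]

def knn_predict(train_feats, train_labels, query, k=3):
    """Classification probe: majority vote among the k nearest neighbours."""
    votes = Counter(train_labels[i] for i in nearest_indices(train_feats, query, k))
    return votes.most_common(1)[0][0]

def knn_regress(train_feats, train_targets, query, k=3):
    """Regression probe: mean target of the k nearest neighbours."""
    idx = nearest_indices(train_feats, query, k)
    return sum(train_targets[i] for i in idx) / len(idx)

# Toy stand-ins for frozen embeddings, class labels, and regression targets.
feats = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]
labels = ["A", "A", "A", "B", "B", "B"]
targets = [1.0, 1.2, 1.1, 3.0, 3.2, 3.1]

print(knn_predict(feats, labels, (0.2, 0.1)))    # A
print(knn_regress(feats, targets, (5.5, 5.5)))   # mean of the three "B" targets
```

Because the backbone stays frozen, this kind of probe measures the quality of the pretrained features directly, without any gradient updates.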

7. Quantitative Improvements and Impact

Relative to OceanSAR-1 and other leading models, OceanSAR-2 provides the following benchmarked improvements:

  • Classification Accuracy (TenGeoP, zero-shot k-NN): Improvement from ~92% (OceanSAR-1) to 94%.
  • Wave Height RMSE (WV-SWH, zero-shot): Decrease from ≃0.60 m (OceanSAR-1) to 0.52 m.
  • Wind Speed RMSE (WV-wind, zero-shot): Reduction from ≃1.5 m/s to 1.32 m/s.
  • Fine-tuned Performance: Relative improvements on the order of 5–10%; the quoted example, wave height RMSE dropping from 0.45 m to 0.40 m, corresponds to about 11%.
  • Resource Efficiency: Estimated 30–50% reduction in pretraining GPU-hours compared to the first model generation.
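The relative gains implied by the figures in this list can be checked with a few lines of arithmetic. The sketch uses only the numbers quoted above; the OceanSAR-1 values are approximate in the source, so the resulting percentages are indicative rather than exact.

```python
def relative_reduction_percent(before: float, after: float) -> float:
    """Percentage by which an error metric decreased."""
    return 100.0 * (before - after) / before

# (OceanSAR-1, OceanSAR-2) pairs as quoted in the list above.
reported = {
    "WV-SWH zero-shot RMSE [m]": (0.60, 0.52),
    "WV-wind zero-shot RMSE [m/s]": (1.5, 1.32),
    "WV-SWH fine-tuned RMSE [m]": (0.45, 0.40),
}

for name, (old, new) in reported.items():
    print(f"{name}: {relative_reduction_percent(old, new):.1f}% lower")

accuracy_gain_points = 94.0 - 92.0
print(f"TenGeoP zero-shot accuracy: +{accuracy_gain_points:.1f} percentage points")
```

The three RMSE reductions come out at roughly 13%, 12%, and 11% respectively.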

The cumulative effect is a highly transferable, SAR-native ViT backbone (fewer than 21 M parameters) that supports robust, efficient, and reproducible SAR ocean downstream analytics. It matches or exceeds much larger models and enables systematic comparison through standardized tasks and metrics (Tuel et al., 12 Jan 2026).
