Self-Supervised Learning on Ocean-SAR Imagery

Updated 20 March 2026

Self-supervised learning on ocean SAR imagery reconstructs or aligns masked and augmented views of SAR data to extract robust, high-level ocean features.
Modern methods pair transformer and convolutional backbones with advanced masking, augmentation, and physics-informed losses to counteract speckle noise.
These approaches improve classification, regression, and detection performance and transfer well across ocean monitoring applications.

Self-supervised representation learning on spaceborne ocean Synthetic Aperture Radar (SAR) imagery addresses the challenge of extracting robust, high-level features from vast unlabeled SAR archives, particularly for ocean observation, detection, and geophysical analysis. SAR data is characterized by speckle noise, multiplicative intensity statistics, polarization diversity, and, in the open ocean, strong spatial redundancy. Modern approaches leverage large-scale transformer or convolutional backbones, advanced masking and augmentation strategies, physically guided loss functions, and dynamic dataset curation to train domain-specific foundation models optimized for transfer across ocean-SAR downstream tasks.

1. Methodological Foundations: Masked Autoencoders and Contrastive Frameworks

The dominant paradigms for self-supervised representation learning in spaceborne ocean-SAR are Masked Autoencoders (MAE), their feature-guided variants (FG-MAE), contrastive architectures, and joint-embedding predictive frameworks. Each lets the model build representations from the data itself by reconstructing or aligning information in masked or augmented views.

Masked Autoencoders operate by masking random patches of the input and training a transformer encoder (e.g., ViT) to reconstruct the missing information. Standard MAE objectives use mean squared error (MSE) in raw pixel space but are often sub-optimal under SAR speckle characteristics, especially over ocean (Wang et al., 2023, Pu et al., 20 Jan 2025).
Feature-Guided MAE (FG-MAE) substitutes hand-crafted features (e.g., Histogram of Oriented Gradients, HOG) for raw pixels as the reconstruction target. HOG’s spatial-gradient focus is especially suited for ocean scenes, where it filters speckle and emphasizes semantically meaningful backscatter gradients from wave patterns, ship wakes, and boundaries (Wang et al., 2023).
Contrastive SSL (e.g., SimCLR or DINO-style) aligns augmented views of the same sample in latent space, often via cosine similarity and NT-Xent loss. This approach is robust to SAR’s adverse noise statistics and achieves domain-appropriate invariances, which are crucial for tasks such as wave height regression. WV-Net exemplifies large-scale contrastive pretraining, leveraging 10 million Sentinel-1 WV-mode ocean patches (Glaser et al., 2024, Tuel et al., 12 Jan 2026).
Joint-Embedding Predictive Architectures (e.g., SAR-JEPA) employ local masked reconstruction and physics-informed feature targets (multi-scale gradient-by-ratio), overcoming the pitfalls of pixel-level loss by explicitly encoding speckle-robust, shape-aware features suited for small target recognition such as ships (Li et al., 2023).
Siamese and DINOv2-style ViT frameworks (SAFE, OceanSAR-2) decouple student and teacher networks, enforce consistency across global/local crops, and regularize via prototype or KoLeo constraints, further benefiting transferability and robustness (Muzeau et al., 2024, Tuel et al., 12 Jan 2026).
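
The contrastive objective mentioned above can be sketched in a few lines of NumPy. This is a minimal illustration of the generic NT-Xent loss, not the training code of WV-Net or any other cited model. The function name, the temperature of 0.1, and the random vectors standing in for encoder outputs are all assumptions made for the example.

```python
import numpy as np

def nt_xent_loss(z1, z2, temperature=0.1):
    """NT-Xent loss for paired embeddings: z1[i] and z2[i] are two views of sample i."""
    n = z1.shape[0]
    z = np.concatenate([z1, z2], axis=0)                # (2n, d)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)    # unit norm, so dot product = cosine
    sim = (z @ z.T) / temperature                       # temperature-scaled similarities
    np.fill_diagonal(sim, -np.inf)                      # a view is never compared with itself
    # Row i (view 1) has its positive at i + n; row i + n (view 2) has it at i.
    positives = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    # Numerically stable log-softmax over each row.
    row_max = sim.max(axis=1, keepdims=True)
    log_prob = sim - row_max - np.log(np.exp(sim - row_max).sum(axis=1, keepdims=True))
    return float(-log_prob[np.arange(2 * n), positives].mean())

# Hypothetical stand-ins for encoder outputs of two augmented views.
rng = np.random.default_rng(0)
view_a = rng.normal(size=(8, 16))
view_b = view_a + 0.01 * rng.normal(size=view_a.shape)   # nearly identical views
unrelated = rng.normal(size=view_a.shape)                # views of different content

print("aligned views:  ", nt_xent_loss(view_a, view_b))
print("unrelated views:", nt_xent_loss(view_a, unrelated))
```

The loss is low when each embedding is closest to its own second view and high when the pairing carries no information, which is the signal that pushes the encoder toward augmentation-invariant features.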

2. Loss Functions and Target Representations

The precise form of the self-supervised signal is critical in spaceborne ocean-SAR SSL:

Raw Pixel MSE: Simple but sensitive to speckle noise; it wastes model capacity on reconstructing physically meaningless noise patterns (Wang et al., 2023).
HOG Feature Reconstruction: Guides the network’s decoder toward local gradient/statistical cues, de-emphasizing per-pixel intensity and enhancing semantic salience for ocean spatial structures (Wang et al., 2023).
Gradient-by-Ratio Multi-Scale Features: Used by SAR-JEPA, these targets are computed by comparing averaged intensities across multiple window sizes and directions and then logarithmically encoding their ratio. This suppresses speckle (by virtue of local averaging and log transforms) and highlights ship contours or wave patterns (Li et al., 2023).
Physically Inspired Decomposition Losses: In complex-valued SAR models, the pretext task simulates polarimetric decomposition (such as Yamaguchi’s four-component model) via learnable bases, using pixelwise cross-entropy and power conservation losses to force correspondence with interpretable physical scattering mechanisms (Wang et al., 16 Apr 2025).
Speckle-Aware and Semantic Anchor Losses: SARMAE augments reconstruction targets by synthetically injecting calibrated speckle noise and, when possible, constraining SAR features to align with optical counterparts using cross-modal cosine distance (Liu et al., 18 Dec 2025).
Contrastive Alignment and Prototype Regularization: SAFE and OceanSAR-2 enforce cross-view and patchwise consistency with temperature-scaled cross-entropy, mean entropy regularization, prototype norm constraints, and the KoLeo regularizer, which spreads embeddings apart within a batch. Together these maintain information diversity and resistance to collapse (Muzeau et al., 2024, Tuel et al., 12 Jan 2026).
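
To make the gradient-by-ratio idea concrete, the sketch below computes a simplified multi-scale target: box-averaged intensities on opposite sides of each pixel are compared through the logarithm of their ratio. It is an illustration under stated assumptions, not SAR-JEPA's implementation. The function names, the window sizes, the restriction to horizontal and vertical directions, and the wrap-around edge handling are choices made for the example.

```python
import numpy as np

def box_mean(img, k):
    """Mean over a k x k window centred on each pixel (k odd, reflect padding)."""
    pad = k // 2
    padded = np.pad(img, pad, mode="reflect")
    windows = np.lib.stride_tricks.sliding_window_view(padded, (k, k))
    return windows.mean(axis=(-2, -1))

def gradient_by_ratio(intensity, scales=(3, 5, 9), eps=1e-6):
    """Log-ratio gradient magnitude at several window sizes; returns (len(scales), H, W)."""
    feats = []
    for k in scales:
        m = box_mean(intensity, k)
        shift = k // 2 + 1   # the two windows sit on either side of the pixel, not on it
        # np.roll wraps around the image border; a real pipeline would crop or pad instead.
        left, right = np.roll(m, shift, axis=1), np.roll(m, -shift, axis=1)
        up, down = np.roll(m, shift, axis=0), np.roll(m, -shift, axis=0)
        g_h = np.log((right + eps) / (left + eps))   # horizontal log-ratio
        g_v = np.log((down + eps) / (up + eps))      # vertical log-ratio
        feats.append(np.hypot(g_h, g_v))
    return np.stack(feats)

# Synthetic scene: a vertical edge (mean 1 vs. mean 4) under 4-look multiplicative speckle.
rng = np.random.default_rng(0)
clean = np.ones((64, 64))
clean[:, 32:] = 4.0
speckled = clean * rng.gamma(shape=4.0, scale=0.25, size=clean.shape)  # unit-mean gamma speckle

feats = gradient_by_ratio(speckled)
print("mean response on the edge column:   ", feats[-1][:, 32].mean())
print("mean response in homogeneous water: ", feats[-1][:, 10].mean())
```

Because the target is a ratio of local means, rescaling the whole image leaves it essentially unchanged, and the averaging damps speckle. That is the property the text attributes to these features.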

3. Data Curation, Preprocessing, and Augmentation Strategies

Ocean-SAR datasets are extremely redundant, with millions of similar open-water observations and heavy-tailed distributions of rare phenomena. Efficient data curation and augmentation are thus necessary to ensure sample diversity, computational tractability, and invariant feature learning:

Dynamic Pruning and Sampling: OceanSAR-1 and OceanSAR-2 employ adaptive coreset selection and dynamic sampling, respectively. This involves iterative clustering of feature embeddings (e.g., via k-means or hierarchical clustering), followed by sampling strategies that over-represent rare clusters or increase pairwise distances. For example, OceanSAR-2 assigns sampling weights $w_i = f(c_i)^{-\beta}$, where $c_i$ is the cluster containing sample $i$ and $f$ measures how populated that cluster is, so that underpopulated clusters are oversampled (Kerdreux et al., 9 Apr 2025, Tuel et al., 12 Jan 2026).
Physics-Based Normalization and Calibration: All state-of-the-art pipelines standardize input backscatter to σ° (sigma-nought), sometimes applying incidence angle correction (e.g., via CMOD5N inversion) for Sentinel-1 (Glaser et al., 2024, Tuel et al., 12 Jan 2026).
Augmentation Protocols: Common augmentations include random rotations, intensity jitter, speckle injection (matching SAR sensor statistics), random erasing, patch masking, and sub-aperture decomposition (Muzeau et al., 2024, Liu et al., 18 Dec 2025). SAFE enforces invariance to resolution, polarization, and viewing geometry via multi-band pretraining and augmentation (Muzeau et al., 2024).
Task-Specific Masking: Masking is both global (MAE, FG-MAE) and local (SAR-JEPA), with the latter focusing predictive tasks around small targets and difficult contexts in otherwise homogeneous scenes (Li et al., 2023).
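
The cluster-based reweighting described under Dynamic Pruning and Sampling can be sketched as follows. The source does not define $f$, so this example assumes $f(c_i)$ is the relative frequency of the cluster containing sample $i$. The function name, the value of $\beta$, and the toy cluster assignment are likewise illustrative. The clustering step itself (e.g., k-means on embeddings) is omitted and replaced by given cluster ids.

```python
import numpy as np

def cluster_balanced_weights(cluster_ids, beta=0.5):
    """Sampling probabilities w_i proportional to f(c_i)^(-beta).

    f(c) is ASSUMED here to be the relative frequency of cluster c.
    beta = 0 gives uniform sampling; beta = 1 gives every cluster equal total mass.
    """
    cluster_ids = np.asarray(cluster_ids).ravel()
    _, inverse, counts = np.unique(cluster_ids, return_inverse=True, return_counts=True)
    freq = counts / counts.sum()             # f(c) for each distinct cluster
    weights = freq[inverse] ** (-beta)       # one weight per sample
    return weights / weights.sum()           # normalise to a probability vector

# Toy archive: 90 open-water patches (cluster 0) and 10 rare-phenomenon patches (cluster 1).
cluster_ids = np.array([0] * 90 + [1] * 10)
probs = cluster_balanced_weights(cluster_ids, beta=0.5)

rng = np.random.default_rng(0)
batch = rng.choice(len(cluster_ids), size=32, replace=True, p=probs)
print("share of rare cluster in the archive:", (cluster_ids == 1).mean())
print("sampling mass given to rare cluster: ", probs[cluster_ids == 1].sum())
print("rare samples drawn in one batch:     ", (cluster_ids[batch] == 1).sum())
```

Intermediate values of $\beta$ trade off between preserving the natural distribution and flattening it, which is one way to keep rare phenomena visible without discarding the bulk of open-water scenes.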

4. Model Architectures and Training Protocols

Transformer-based architectures, especially Vision Transformers (ViTs), are the backbone of recent advances in ocean-SAR SSL:

ViT Backbones: OceanSAR-1/2, FG-MAE, SAR-JEPA, SAFE, and SARMAE all employ ViT-family backbones of varying widths and depths, sometimes with optimized lightweight decoders for efficiency (Wang et al., 2023, Liu et al., 18 Dec 2025, Tuel et al., 12 Jan 2026).
Encoder-Decoder Asymmetry: Encoders are parameter-rich and process only unmasked or augmented tokens, while decoders are purposely minimal to avoid learning trivial reconstruction tasks (Wang et al., 2023, Pu et al., 20 Jan 2025).
Multi-Head Attention and Multi-Scale Features: Models such as SAR-JEPA and the complex-valued foundation models augment ViTs with deep pyramid or multi-scale heads, and explicitly encode multi-scale physics-transformed targets (Li et al., 2023, Wang et al., 16 Apr 2025).
Fine-tuning and Freezing Policies: For downstream evaluations, it is standard to freeze the pretrained encoder and train only lightweight task-specific heads (classification, regression, detection, segmentation) with modest labeled data (Kerdreux et al., 9 Apr 2025, Tuel et al., 12 Jan 2026).
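
The encoder-decoder asymmetry rests on a simple mechanism: the image is cut into patch tokens, a random subset is hidden, and only the visible tokens reach the encoder. The sketch below shows that bookkeeping in NumPy. It is a generic MAE-style illustration rather than the code of any cited model. The function names, the 16-pixel patch size, and the 75% mask ratio are assumptions for the example.

```python
import numpy as np

def patchify(img, patch):
    """Split an (H, W) image into non-overlapping patches, one flattened patch per row."""
    h, w = img.shape
    assert h % patch == 0 and w % patch == 0, "image size must be a multiple of the patch size"
    gh, gw = h // patch, w // patch
    tokens = img.reshape(gh, patch, gw, patch).transpose(0, 2, 1, 3)
    return tokens.reshape(gh * gw, patch * patch)

def random_mask(tokens, mask_ratio, rng):
    """Return the visible tokens, their indices, and a boolean mask (True = hidden)."""
    n = tokens.shape[0]
    n_keep = int(round(n * (1.0 - mask_ratio)))
    keep_idx = np.sort(rng.permutation(n)[:n_keep])
    mask = np.ones(n, dtype=bool)
    mask[keep_idx] = False
    return tokens[keep_idx], keep_idx, mask

rng = np.random.default_rng(0)
image = rng.gamma(shape=4.0, scale=0.25, size=(64, 64))   # stand-in for a SAR intensity patch
tokens = patchify(image, patch=16)                         # 16 tokens of 256 pixels each
visible, keep_idx, mask = random_mask(tokens, mask_ratio=0.75, rng=rng)

print("tokens in total:      ", tokens.shape[0])
print("tokens seen by encoder:", visible.shape[0])
print("tokens to reconstruct: ", int(mask.sum()))
# In an MAE-style objective the reconstruction loss is evaluated on tokens[mask] only.
```

With most tokens hidden, the expensive encoder runs on a small fraction of the input, and a light decoder handles the full token sequence.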

5. Benchmarks, Evaluation Protocols, and Empirical Results

Evaluation leverages a suite of standardized datasets, protocols, and metrics with direct relevance to ocean-geophysical analysis:

Classification: TenGeoP (10 geophysical ocean classes, 37,553 Sentinel-1 WV images), scene-type tasks (BigEarthNet-SAR, EuroSAT-SAR), and few-shot ship classification (FUSAR-Ship) are used extensively (Wang et al., 2023, Li et al., 2023, Kerdreux et al., 9 Apr 2025, Tuel et al., 12 Jan 2026).
Regression: Significant Wave Height (SWH) and wind speed/direction prediction tasks leverage co-located altimetry or scatterometer data, measured via RMSE and MAE. OceanSAR-2 achieves 1.01 m/s RMSE for wind speed, improving by 17.9% over the baseline (Tuel et al., 12 Jan 2026).
Detection and Segmentation: Ship, iceberg, and sea-ice detection (SARDet-100k, YOLOIB, water-body segmentation on AIR-PolSAR-Seg) serve as primary detection and segmentation tasks. SARMAE ViT-B attains 92.31% IoU for water extraction, and FG-MAE improves fine-tune F1 for scene classification by >1% over pixel MAE (Wang et al., 2023, Liu et al., 18 Dec 2025, Tuel et al., 12 Jan 2026).
Zero-shot and Low-Label Regimes: OceanSAR-2 achieves 94% kNN classification accuracy on TenGeoP, outperforming prior work under zero-shot conditions (frozen embeddings, no fine-tuning) (Tuel et al., 12 Jan 2026). WV-Net embeddings preserve over 0.92 AUROC in classification with only 900 labeled samples (Glaser et al., 2024).
Cross-Modal and Few-Shot Transfer: SARMAE and SAFE demonstrate that SAR models pretrained on large, mixed-mode datasets transfer well to unseen ocean-SAR sensors and tasks with minimal adaptation (Liu et al., 18 Dec 2025, Muzeau et al., 2024).
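
The frozen-encoder protocols above reduce to simple operations on precomputed embeddings. The sketch below shows a cosine-similarity kNN classifier and the RMSE metric in NumPy. It is an illustration only: the function names, `k=5`, and the synthetic two-class embeddings are assumptions, and the evaluation settings of the cited benchmarks are not reproduced here.

```python
import numpy as np

def knn_predict(train_emb, train_labels, test_emb, k=5):
    """Majority vote among the k most cosine-similar training embeddings."""
    train_labels = np.asarray(train_labels)
    a = train_emb / np.linalg.norm(train_emb, axis=1, keepdims=True)
    b = test_emb / np.linalg.norm(test_emb, axis=1, keepdims=True)
    sim = b @ a.T                                    # (n_test, n_train) cosine similarities
    neighbours = np.argsort(-sim, axis=1)[:, :k]     # indices of the k nearest neighbours
    votes = train_labels[neighbours]                 # (n_test, k) neighbour labels
    n_classes = int(train_labels.max()) + 1
    return np.array([np.bincount(v, minlength=n_classes).argmax() for v in votes])

def rmse(pred, target):
    """Root-mean-square error, as used for wave height or wind speed regression."""
    pred, target = np.asarray(pred, dtype=float), np.asarray(target, dtype=float)
    return float(np.sqrt(np.mean((pred - target) ** 2)))

# Synthetic stand-ins for frozen-encoder embeddings of two classes.
rng = np.random.default_rng(0)
centers = np.array([[5.0, 0.0, 0.0, 0.0], [0.0, 5.0, 0.0, 0.0]])
y_train, y_test = np.repeat([0, 1], 50), np.repeat([0, 1], 20)
x_train = centers[y_train] + 0.5 * rng.normal(size=(100, 4))
x_test = centers[y_test] + 0.5 * rng.normal(size=(40, 4))

pred = knn_predict(x_train, y_train, x_test, k=5)
print("kNN accuracy:", (pred == y_test).mean())
print("RMSE example:", rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
```

Because nothing is trained in the kNN step, its accuracy reflects the structure of the embedding space directly, which is why it is a common probe for pretrained representations.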

Model	Backbone	Ocean-SAR Benchmark	Metric/Result
OceanSAR-2	ViT (DINOv2)	TenGeoP, SWH, Wind, YOLOIB	98.5% acc., 0.40 m RMSE, 1.01 m/s RMSE, 0.865 detection score
FG-MAE	ViT-S	BigEarthNet-SAR	Fine-tune F1: 74.0%
WV-Net	ResNet-50	GOALI, Wave Ht., Temp.	AUROC: 0.958, RMSE: 0.50 m/0.902 °C
SAFE	ViT	HRSID, AIR-PolSAR-Seg	Water IoU: 63.09% (vs. top ANN 75.95%)
SARMAE	ViT-B/L	AIR-PolSAR-Seg	Water IoU: 92.31–93.06%
SAR-JEPA	ViT-Base	FUSAR-Ship	10-shot acc.: 81.3%

6. Physical Interpretability and Domain Alignment

Robust transfer to ocean observation tasks derives from physical domain alignment in both architectural choices and loss design:

Physics-Informed Targets: FG-MAE's HOG, SAR-JEPA's multi-scale gradients, and complex-valued scattering queries all encode physically meaningful features: they attenuate speckle, capture edge and surface anisotropies, and directly model Bragg, double-bounce, and volume scattering in oceanic contexts (Wang et al., 2023, Li et al., 2023, Wang et al., 16 Apr 2025).
Input and Preprocessing: Pretraining in sigma-nought units, using actual backscatter, and matching SAR noise models ensures robustness and domain invariance across sensors and observation geometries (Glaser et al., 2024, Liu et al., 18 Dec 2025).
Dynamic Curation: Up-weighting rare geophysical patterns through dynamic resampling encourages the foundation model to capture atypical ocean events (icebergs, slicks, wave anomalies) that are key for downstream monitoring (Kerdreux et al., 9 Apr 2025, Tuel et al., 12 Jan 2026).

7. Outlook: Scaling, Foundation Models, and Future Directions

Foundation-Scale Models: Large models (ViT-B/8, ViT-H) pretrained on multi-year, multi-sensor ocean SAR archives are establishing themselves as universal feature extractors for diverse marine tasks, outperforming ImageNet- or multi-modal pretrained models by substantial margins in all tested protocols (Kerdreux et al., 9 Apr 2025, Tuel et al., 12 Jan 2026).
Dynamic and Modular Pipelines: Adaptive dataset curation, patch-level and global objectives, and modular backbones are converging toward workbench-driven development (e.g., the OceanSAR-2 "ocean workbench") that standardizes evaluation, benchmarks, and model comparison (Tuel et al., 12 Jan 2026).
Open Problems: Scaling data diversity, incorporating complex-valued and polarimetric information at global scale, improving interpretability of foundation models, and extending SSL for direct estimation of higher-level geophysical fields (currents, submesoscale structures) remain prominent directions. A plausible implication is that future SSL pipelines will increasingly integrate multi-modal (SAR, optical, scatterometer) observations and jointly learn spatial-temporal representations.

Self-supervised learning on spaceborne ocean-SAR imagery enables robust, transferable, and physically meaningful representations, critical for advancing operational marine monitoring and foundational oceanography (Wang et al., 2023, Liu et al., 18 Dec 2025, Glaser et al., 2024, Kerdreux et al., 9 Apr 2025, Tuel et al., 12 Jan 2026).