Any Optical Model (AOM) for Remote Sensing
- Any Optical Model (AOM) is a universal remote sensing foundation model that adapts to arbitrary optical configurations using a spectrum-independent tokenizer.
- It employs multi-scale adaptive patch embedding and contrastive semantic alignment to ensure robust feature extraction across varying spatial resolutions.
- AOM demonstrates strong band-missing tolerance and cross-sensor fusion, achieving state-of-the-art results in segmentation and classification tasks.
Any Optical Model (AOM) is a universal remote sensing foundation model (RSFM) designed to operate across arbitrary optical satellite band layouts, sensor types, and spatial resolution scales. Unlike prior RSFMs that require fixed band configurations and resolutions, AOM achieves robustness to missing bands, cross-sensor fusion, and novel resolution inputs by leveraging a spectrum-independent tokenizer, multi-scale adaptive feature extraction, and explicit semantic alignment across scales. These core architectural innovations allow AOM to serve as a generalizable foundation model for remote sensing applications ranging from land cover mapping to emergency response using data from platforms such as Sentinel-2, Landsat, and high-resolution commercial satellites (Li et al., 19 Dec 2025).
1. Model Architecture and Spectrum-Independent Tokenizer
AOM’s architecture centers on its spectrum-independent tokenizer (“SiTok”), which processes each input channel separately to ensure resilience to missing or unfamiliar bands. Given an input tensor $X \in \mathbb{R}^{C \times H \times W}$ with $C$ channels, SiTok performs a band-wise patch embedding:
- Each channel $x_i$ ($i = 1, \dots, C$) is convolved with a shared single-channel kernel $W$, producing spatial tokens $t_i$ for the $i$-th band.
- A dedicated, learnable band embedding $e_i$ is added to the spatial tokens of each channel: $z_i = t_i + e_i$.
- All bands’ tokens are concatenated: $Z = [z_1; z_2; \dots; z_C]$.
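The band-wise embedding above can be sketched in a few lines of PyTorch. This is a minimal illustration rather than the paper's reference implementation; the module name `BandWiseTokenizer`, the maximum band count, and the default dimensions are assumptions.

```python
import torch
import torch.nn as nn

class BandWiseTokenizer(nn.Module):
    """Minimal sketch of SiTok-style band-wise tokenization (illustrative).

    Every channel is embedded with a *shared* single-channel patch
    convolution, then tagged with a learnable per-band embedding that
    encodes its spectral identity.
    """

    def __init__(self, dim=768, patch=16, max_bands=32):
        super().__init__()
        # Shared patch-embedding kernel, reused across all bands.
        self.proj = nn.Conv2d(1, dim, kernel_size=patch, stride=patch)
        # One learnable spectral-identity embedding per band index (assumed cap).
        self.band_emb = nn.Parameter(torch.zeros(max_bands, dim))

    def forward(self, x, band_ids):
        # x: (B, C, H, W); band_ids: length-C list of global band indices.
        tokens = []
        for i, b in enumerate(band_ids):
            t = self.proj(x[:, i:i + 1])         # (B, dim, H/p, W/p)
            t = t.flatten(2).transpose(1, 2)     # (B, N, dim) spatial tokens
            tokens.append(t + self.band_emb[b])  # add band identity e_i
        return torch.cat(tokens, dim=1)          # concatenate all bands' tokens
```

Because bands are embedded independently and simply concatenated, a missing band is handled by omitting its entry from `band_ids`, and a new band only needs a fresh embedding row rather than a retrained kernel.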
This approach enables three key properties:
- Extensibility to novel bands: New channels are indexed and assigned embeddings without retraining the core convolution kernel.
- Band-missing tolerance: Omitting the tokens of absent channels allows AOM to keep functioning, with minimal performance degradation, when some bands are unavailable.
- Spectral identity encoding: Explicit band embeddings preserve each channel’s spectral semantics across sensors (Li et al., 19 Dec 2025).
2. Multi-Scale Adaptive Patch Embedding
To maintain high performance across a broad range of spatial resolutions (from sub-meter to hundred-meter pixel sizes), AOM incorporates a Multi-Scale Adaptive Patch Embedding (MAPE) mechanism:
- A bank of convolutional kernels covering a range of patch sizes is maintained.
- For a required patch size $p$, the kernel with the closest native patch size is selected, and an optional pseudo-inverse resize (PI-resize) is applied if an exact match is unavailable.
- Each band is embedded using the selected kernel, ensuring tokens capture texture and context at the appropriate scale for the input image’s native ground sampling distance.
This dynamic adaptation yields stable feature representations (accuracy and mIoU variations <1% across patch sizes from 16 to 64), allowing the model to generalize across resolutions without performance loss (Li et al., 19 Dec 2025).
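A sketch of the kernel-selection step is below; the dictionary layout and the use of bilinear weight interpolation as a stand-in for PI-resize are simplifying assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def select_and_resize_kernel(kernel_bank, target_patch):
    """Pick the bank kernel closest to the required patch size.

    kernel_bank: dict mapping native patch size -> conv weights of shape
    (dim, 1, p, p). If no exact match exists, the weights are resized;
    bilinear interpolation is used here as a simplified stand-in for the
    pseudo-inverse (PI) resize described for AOM.
    """
    nearest = min(kernel_bank, key=lambda p: abs(p - target_patch))
    w = kernel_bank[nearest]
    if nearest != target_patch:
        w = F.interpolate(w, size=(target_patch, target_patch),
                          mode="bilinear", align_corners=False)
    return w  # apply with F.conv2d(band, w, stride=target_patch)
```

Selection happens per input, so imagery at very different ground sampling distances can be embedded with appropriately sized patches while sharing the same backbone.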
3. Multi-Scale Semantic Alignment
AOM addresses the semantic consistency challenge posed by multi-resolution training via a contrastive alignment pretraining objective:
- After masking and encoding, features are aggregated for each scale, and a small projection head produces a vector $v_s$ representing the global semantics of the input at scale $s$.
- An InfoNCE loss minimizes cross-scale discrepancies by maximizing similarity among the $\{v_s\}$ of the same image, regardless of spatial scale:

$\mathcal{L}_{\text{align}} = -\log \dfrac{\exp(\mathrm{sim}(v_s, v_{s'})/\tau)}{\sum_{v \in \{v_{s'}\} \cup \mathcal{N}} \exp(\mathrm{sim}(v_s, v)/\tau)}$

where $v_s$ and $v_{s'}$ are projections of the same image at two scales, $\mathcal{N}$ contains the projections of the other images in the batch, $\mathrm{sim}$ is cosine similarity, and $\tau$ is a temperature.
This loss function pulls together the feature distributions for alternate resolutions, enforcing that the learned representations preserve scene semantics across patch sizes (Li et al., 19 Dec 2025).
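A minimal cross-scale InfoNCE sketch follows, assuming per-scale global projection batches of shape (B, D); the temperature value is an assumed default, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def cross_scale_infonce(views, temperature=0.07):
    """Cross-scale contrastive alignment (illustrative sketch).

    views: list of per-scale projection batches, each of shape (B, D).
    Projections of the same image at different scales are positives;
    other images in the batch serve as negatives.
    """
    losses = []
    for a in range(len(views)):
        for b in range(len(views)):
            if a == b:
                continue
            za = F.normalize(views[a], dim=-1)
            zb = F.normalize(views[b], dim=-1)
            logits = za @ zb.t() / temperature               # (B, B) similarities
            targets = torch.arange(za.size(0), device=za.device)
            losses.append(F.cross_entropy(logits, targets))  # diagonal = positives
    return torch.stack(losses).mean()
```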
4. Channel-Wise Self-Supervised Masking and Reconstruction
Following the Masked Autoencoder paradigm, AOM incorporates a dual-level masking and reconstruction objective:
- At each scale, a random fraction of the band-patch tokens is masked.
- The encoder operates only on visible tokens; the decoder reconstructs all channels and locations, minimizing mean squared error (MSE) over the masked patches:

$\mathcal{L}_{\text{rec}} = \dfrac{1}{|\mathcal{M}|} \sum_{(i,p) \in \mathcal{M}} \| \hat{x}_{i,p} - x_{i,p} \|_2^2$

where $\mathcal{M}$ is the set of masked band-patch positions and $\hat{x}_{i,p}$, $x_{i,p}$ are the reconstructed and original patches of band $i$ at position $p$.
- The pretraining loss is a weighted sum of the reconstruction and alignment losses, $\mathcal{L} = \lambda_{\text{rec}} \mathcal{L}_{\text{rec}} + \lambda_{\text{align}} \mathcal{L}_{\text{align}}$.
Combined, these enable the model to jointly learn spectral-spatial dependencies, improving transferability and robustness to missing or corrupted bands (Li et al., 19 Dec 2025).
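A compact sketch of the combined objective is given below, under the assumption that the reconstruction target is organized as per-token pixel patches; the loss weights are illustrative placeholders, since the reported values are not reproduced in this summary.

```python
import torch

def masked_recon_loss(pred, target, mask):
    """MAE-style MSE computed over masked band-patch tokens only.

    pred, target: (B, T, P) reconstructed / original patch pixels.
    mask: (B, T) float tensor with 1 where a token was masked.
    """
    per_token = ((pred - target) ** 2).mean(dim=-1)        # (B, T) per-token MSE
    return (per_token * mask).sum() / mask.sum().clamp(min=1)

def pretrain_loss(pred, target, mask, align_loss, lam_rec=1.0, lam_align=1.0):
    """Weighted sum of reconstruction and alignment terms.

    `align_loss` can be the cross-scale InfoNCE sketched earlier; both
    weights are illustrative placeholders.
    """
    return lam_rec * masked_recon_loss(pred, target, mask) + lam_align * align_loss
```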
5. Training Corpus, Hyperparameters, and Pretraining Regime
AOM is pretrained on a large, heterogeneous optical remote sensing corpus:
- Sentinel-2 (SSL4EO-S12): 1.004 million samples at 10–60 m GSD
- Landsat-8 (Activefire): 146,000 samples at 30–100 m GSD
- High-resolution RGB (GeoPile, fMoW, OpenEarthMap): 108,000 samples at 0.1–30 m GSD
The aggregated 1.56 million images span ground sampling distances from 0.1 m to 100 m, covering diverse sensor types and spectral band layouts.
Key training parameters:
- 220 epochs, batch size 1024
- ViT-Base encoder, 4 decoder layers per scale
- Mask ratio
- Multi-scale kernel bank, with patch sizes cycled from 16 to 64
- Learning rate
- Simple augmentations: random flip and crop, resized to native input resolution
This pretraining scheme is designed to cover the operational domain of optical satellites across sensors, spectral layouts, and ground sampling distances, facilitating generalization (Li et al., 19 Dec 2025).
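The reported settings can be captured in a small configuration sketch; field names are illustrative, and values elided in this summary (mask ratio, learning rate, exact kernel sizes) are left as placeholders rather than guessed.

```python
# Pretraining configuration as reported for AOM (illustrative field names).
PRETRAIN_CFG = {
    "epochs": 220,
    "batch_size": 1024,
    "encoder": "vit-base",
    "decoder_layers_per_scale": 4,
    "patch_size_range": (16, 64),   # patch sizes cycled across this range
    "mask_ratio": None,             # value not given in this summary
    "learning_rate": None,          # value not given in this summary
    "augmentations": ["random_flip", "random_crop"],
    "resize_to": "native_input_resolution",
}
```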
6. Performance and Evaluation: Robustness Across Missing Bands and Sensors
AOM demonstrates state-of-the-art performance on semantic segmentation, classification, and cross-sensor adaptation benchmarks:
- Geo-Bench semantic segmentation (UPerNet head, 20 epochs fine-tuning): Mean IoU of 63.98%, a +4.53% improvement over the next best method (DOFA); on the cashew-plantation subset, 68.3% vs. 55.6%.
- Cross-sensor segmentation:
- SPARCS (Landsat-8): IoU 68.5% (+10.9% vs. SpectralGPT)
- HLS Burn Scars: IoU 85.4% (+3.0% vs. CROMA)
- Linear probe classification:
- UCM (RGB): 93.57% (+3.48% vs. DOFA)
- BigEarthNet (Sentinel-2): 85.02% mAP (+1.61% vs. CROMA)
- Band-missing robustness: On EuroSAT, accuracy remains above 95% with only three bands—an improvement of 2.3–19.1% over baselines.
- Patch-size ablation: mIoU and accuracy stable (<1% variation) across patch sizes from 16 to 64.
- Loss ablation: Adding the InfoNCE alignment loss to MSE-only pretraining yields significant gains on all evaluated datasets, illustrating the benefit of the alignment objective.
These results demonstrate that spectral-independence, multi-scale adaptation, and semantic alignment enable consistent performance across sensors, spatial resolutions, and spectral configurations (Li et al., 19 Dec 2025).
7. Implications and Applications
AOM’s universal design addresses longstanding challenges in remote sensing model transferability:
- Band-missing: Explicit band-level tokenization and embedding allow out-of-the-box operation on novel or incomplete band sets.
- Cross-sensor: Semantic alignment and multi-scale embedding enable generalization to previously unseen sensor platforms, supporting fusion and transfer.
- Resolution-adaptive: Dynamic patch selection and PI-resize support spatial scales from centimeters to hundreds of meters.
Principal application domains include land use/land cover mapping, ecosystem monitoring, burned area detection, cross-sensor fusion in environmental monitoring, and scenarios demanding rapid adaptation to unconventional or novel sensor inputs. A plausible implication is that AOM establishes a practicable template for universal foundation models in other sensing modalities with variable input structure (Li et al., 19 Dec 2025).