- The paper introduces MOMO, a foundation model for Mars orbital imagery that merges sensor-specific pre-trained MAEs using Equal Validation Loss (EVL) checkpoint selection, improving generalization and downstream performance.
- It employs separate MAE pre-training for HiRISE, CTX, and THEMIS data followed by parameter-wise model merging, validated through loss landscape analysis.
- MOMO consistently outperforms ImageNet pre-training and existing Earth Observation FMs on both classification and segmentation, with the largest margins on segmentation tasks.
MOMO: Mars Orbital Model Foundation Model for Mars Orbital Applications
Introduction and Motivation
The development of foundation models (FMs) for remote sensing has predominantly focused on Earth Observation (EO), resulting in a proliferation of models tailored to various terrestrial applications. However, equivalent advancements have been lacking for Martian remote sensing, where domain-specific pre-training is expected to enhance transferability and performance over standard ImageNet pre-training. MOMO addresses this deficiency, being the first FM specifically designed for Mars orbital applications (“MOMO: Mars Orbital Model Foundation Model for Mars Orbital Applications” (2604.02719)).
The central challenge in constructing Mars-specific FMs lies in the heterogeneity of Martian orbital sensors, which differ substantially in coverage, spatial resolution, spectral characteristics, and data distributions. The stacking and heterogeneous data combination approaches prevalent in EO FMs are impractical for Mars data due to the paucity of coincident high-resolution imagery and the diversity of sensor modalities. MOMO consequently introduces a methodology in which masked autoencoder (MAE) models are first pre-trained individually on each sensor's data, followed by fusion via a checkpoint-aligned model-merging strategy that maximizes compatibility and generalization.
Figure 1: MOMO achieves strong generalization across a wide range of spatial resolutions and Martian remote sensing tasks, with a unified model capable of handling applications from large-scale landform mapping to precise boulder localization.
Methodology
Multi-sensor Pre-training and Model Fusion
MOMO draws on three primary Martian orbital sensors: HiRISE (0.25 m/pixel), CTX (5 m/pixel), and THEMIS (100 m/pixel). A stratified data curation pipeline keeps the samples representative of Mars's geologic diversity, and an automated filtering process removes low-quality images based on SSIM and noise statistics, so that only high-fidelity data is used for pre-training.
Figure 2: Example images illustrate rigorous curation, filtering out low-quality (artifacts, noise, blur) samples to retain only data suitable for high-fidelity pre-training from HiRISE, CTX, and THEMIS sensors.
Separate MAEs are pre-trained on each sensor’s dataset using a customized loss function:
- Pixel-wise MSE loss for intensity reconstruction,
- SSIM and LPIPS for perceptual similarity and higher-level feature preservation,
- Gradient-based terms to enforce boundary continuity and morphological fidelity.
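One way to read the three components above is as a weighted sum. The form below is only a sketch: the weights \lambda, the use of 1 - SSIM as the structural term, and the L1 norm on image gradients are assumptions made for illustration, since the source names the terms but does not give their exact form or weighting.

```latex
% x: input image, \hat{x}: MAE reconstruction, N: number of reconstructed pixels.
% All \lambda weights are hypothetical hyperparameters.
\mathcal{L}(x, \hat{x})
  = \lambda_{\mathrm{mse}} \, \frac{1}{N} \lVert x - \hat{x} \rVert_2^2
  + \lambda_{\mathrm{ssim}} \, \bigl( 1 - \mathrm{SSIM}(x, \hat{x}) \bigr)
  + \lambda_{\mathrm{lpips}} \, \mathrm{LPIPS}(x, \hat{x})
  + \lambda_{\mathrm{grad}} \, \lVert \nabla x - \nabla \hat{x} \rVert_1
```

Under this reading, the first term penalizes intensity errors, the second and third penalize perceptual and structural differences, and the last penalizes mismatched edges, which is where boundary and morphology errors show up.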
Optimal Checkpoint Selection: Equal Validation Loss (EVL)
A critical innovation of MOMO is the model-fusion procedure. Rather than combining final or early-stopping checkpoints from each pre-trained sensor model, MOMO introduces the EVL strategy: checkpoints are aligned across sensors such that their validation losses fall within a small ϵ-tolerance, and are closest to the respective early-stopping epochs. This ensures merged models are at comparable stages of generalization, minimizing overfitting/underfitting risk during fusion.
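The selection rule described above can be sketched as a small search over per-sensor validation curves. This is an illustrative reading, not the authors' implementation: the function names, the toy loss curves, the choice of the lowest-validation-loss epoch as the early-stopping epoch, and the use of summed epoch distance as the "closeness" measure are all assumptions.

```python
from itertools import product


def early_stopping_epoch(val_losses):
    """Epoch with the lowest validation loss (assumed early-stopping rule)."""
    return min(range(len(val_losses)), key=val_losses.__getitem__)


def select_evl_checkpoints(curves, eps):
    """Pick one epoch per sensor whose validation losses agree within eps.

    curves: dict mapping sensor name -> list of per-epoch validation losses.
    Among all epoch combinations whose losses differ by at most eps, return
    the one with the smallest total distance to each sensor's early-stopping
    epoch, or None if no combination qualifies.
    """
    sensors = sorted(curves)
    es = {s: early_stopping_epoch(curves[s]) for s in sensors}
    best, best_cost = None, None
    # Brute-force search: fine for a sketch, but it grows as epochs ** sensors.
    for epochs in product(*(range(len(curves[s])) for s in sensors)):
        losses = [curves[s][e] for s, e in zip(sensors, epochs)]
        if max(losses) - min(losses) > eps:
            continue
        cost = sum(abs(e - es[s]) for s, e in zip(sensors, epochs))
        if best_cost is None or cost < best_cost:
            best, best_cost = dict(zip(sensors, epochs)), cost
    return best


# Hypothetical validation curves, one loss value per epoch.
curves = {
    "hirise": [0.90, 0.60, 0.45, 0.40, 0.42],
    "ctx": [0.80, 0.55, 0.41, 0.35, 0.30],
    "themis": [0.70, 0.50, 0.39, 0.41, 0.44],
}
selected = select_evl_checkpoints(curves, eps=0.03)
print(selected)
```

In this toy example the early-stopping checkpoints have losses 0.40, 0.30, and 0.39, which are too far apart, so the rule moves the CTX checkpoint back to an earlier epoch where its loss matches the other two.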
The merging itself is performed by a parameter-wise addition (“task arithmetic” [ilharco2022editing]), producing a unified model that captures sensor-specific and cross-sensor representations. Loss landscape analysis further demonstrates that EVL selects checkpoints lying in adjacent loss basins, leading to merged models with both stability and generalizable performance.
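A minimal sketch of parameter-wise merging in the task-arithmetic style follows. It assumes that all sensor models share one architecture and a common base set of weights, that parameters are stored as flat lists of floats, and that a single scaling factor is applied to every task vector. These choices, and all names and values, are hypothetical, since the source does not specify the base model or the scaling.

```python
def task_vector(model, base):
    """Parameter-wise difference between a trained model and the shared base."""
    return {k: [m - b for m, b in zip(model[k], base[k])] for k in base}


def merge_task_arithmetic(base, models, scale=1.0):
    """Add the scaled task vector of every model to the base parameters."""
    merged = {k: list(v) for k, v in base.items()}
    for model in models:
        delta = task_vector(model, base)
        for k in merged:
            merged[k] = [w + scale * d for w, d in zip(merged[k], delta[k])]
    return merged


# Toy parameters standing in for the three sensor-specific encoders.
base = {"w": [0.0, 0.0], "b": [1.0]}
hirise = {"w": [1.0, 0.0], "b": [1.0]}
ctx = {"w": [0.0, 2.0], "b": [2.0]}
themis = {"w": [1.0, 1.0], "b": [1.0]}

merged = merge_task_arithmetic(base, [hirise, ctx, themis], scale=0.5)
print(merged)
```

Because the merge only adds per-sensor differences to a shared base, a new sensor would contribute one more task vector without retraining the existing ones, which is consistent with the incremental-addition property claimed for MOMO.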
Figure 3: Loss landscape visualizations confirm that EVL checkpoint selection yields merged models in flatter, lower-loss basins compared to naive merging at last- or early-stopping epochs, resulting in superior stability and generalization.
Experimental Evaluation
Baseline Comparison
MOMO is evaluated on the comprehensive Mars-Bench suite, which spans nine downstream tasks across all three sensor modalities, including classification (e.g., AtmosDust, Frost, DoMars16k, Landmark) and segmentation (e.g., Boulder, ConeQuest, multi-type Crater Segmentation, MMLS for landslides). Baselines encompass:
- Random initialization (training from scratch),
- ImageNet pre-training,
- State-of-the-art Earth Observation FMs (SatMAE, CROMA, TerraFM, DINOv3, Prithvi),
- Sensor-specific pre-training and joint-data pre-training (DM),
- Model merging with alternative checkpoint selection (Early Stopping, Last Epoch).
MOMO consistently outperforms ImageNet pre-trained and EO-FM models, especially on segmentation tasks that demand precise morphological and spatial reasoning. Notably, MOMO achieves an average improvement of ~1.25% (F1-score) on classification and ~1% (mIoU) on segmentation tasks over ImageNet pre-training, with larger margins observed in tasks requiring detailed spatial localization (e.g., Boulder mIoU improvement ~4%).
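For readers less familiar with the segmentation metric quoted above, the following sketch computes mean intersection-over-union (mIoU) on flattened label maps. It assumes class-wise averaging that skips classes absent from both prediction and ground truth. The benchmark's exact averaging convention is not stated in the source, and the labels here are toy values.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over classes that appear in the prediction or the target."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union:  # skip classes absent from both maps
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy 4-pixel example with two classes (0 = background, 1 = boulder).
pred = [0, 0, 1, 1]
target = [0, 1, 1, 1]
miou = mean_iou(pred, target, num_classes=2)
print(round(miou, 4))
```

Here class 0 has IoU 1/2 and class 1 has IoU 2/3, so the mean is 7/12. A change of a few percentage points in this metric, such as the Boulder result above, reflects noticeably better overlap on small objects.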
Sensor-specific pre-training achieves slightly better results only on downstream tasks from its native sensor; it generalizes poorly across sensors, offers less flexibility, and requires maintaining a separate model per sensor. The joint-data approach (DM) suffers in both scalability and performance, particularly when integrating new sensors or modalities, whereas MOMO enables efficient incremental sensor addition without retraining the full model.
Ablation of Checkpoint Selection Strategies
Direct comparison of checkpoint selection strategies on segmentation tasks confirms that EVL yields the highest mIoU and F1-scores, with an average mIoU improvement of ~2.5% over naive early-stopping or last-epoch merging. Loss landscape visualization further indicates that EVL merges models from well-aligned regions, producing minima that are both flat and generalizable, properties that are desirable for robust downstream performance.
Implications and Future Directions
MOMO establishes a strong precedent in planetary science for the construction of domain-specific, multi-sensor foundation models. Its model-merging framework not only delivers performance gains and parameter efficiency but, critically, provides a modular, scalable pathway for incorporating novel Martian instruments or for extension to other planetary domains (e.g., lunar, asteroidal).
Practically, MOMO’s architecture enables planetary researchers to pursue advanced geomorphologic mapping, multi-scale landform segmentation, and automated geospatial analysis with reduced annotation requirements and improved accuracy. Theoretically, the EVL-based model merging augments the FM literature by demonstrating that checkpoint alignment at the granularity of validation loss yields more transferable and stable foundation models when merging heterogeneous data distributions.
Limitations are acknowledged: MOMO’s merging efficacy assumes linear mode connectivity, which may break down when the models are highly divergent, or when functionally equivalent models sit in different regions of parameter space because of permutation symmetry. Advanced model-alignment, permutation symmetry resolution, or re-basing techniques may further improve merge stability ([zhang2025beyond], [theus2025generalized], [ainsworth2023git]).
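The linear mode connectivity assumption can be probed by measuring the loss barrier along the straight line between two sets of weights. The sketch below uses one common definition of the barrier (the largest excess of the path loss over the linear interpolation of the endpoint losses) and a hypothetical one-parameter toy loss with two symmetric minima. Neither is taken from the source.

```python
def interpolate(theta_a, theta_b, alpha):
    """Point on the straight line between two parameter vectors."""
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(theta_a, theta_b)]


def loss_barrier(loss_fn, theta_a, theta_b, steps=10):
    """Largest excess of the path loss over the interpolated endpoint losses."""
    loss_a, loss_b = loss_fn(theta_a), loss_fn(theta_b)
    barrier = 0.0
    for i in range(steps + 1):
        alpha = i / steps
        path_loss = loss_fn(interpolate(theta_a, theta_b, alpha))
        baseline = (1.0 - alpha) * loss_a + alpha * loss_b
        barrier = max(barrier, path_loss - baseline)
    return barrier


def toy_loss(theta):
    """Hypothetical loss with equivalent minima at theta[0] = -1 and +1."""
    return (theta[0] ** 2 - 1.0) ** 2


separated = loss_barrier(toy_loss, [-1.0], [1.0])  # minima related by symmetry
connected = loss_barrier(toy_loss, [1.0], [1.0])   # same basin
print(separated, connected)
```

The two minima of the toy loss are functionally equivalent, yet averaging them lands on a high-loss point. This is the failure mode that alignment and permutation-resolution techniques aim to remove before merging.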
Future research can explore the integration of transformer-based architectures, cross-modal pre-training (e.g., combining orbital, atmospheric, and subsurface datasets), and fine-grained control over model fusion operations for sub-task specific customization. The implications extend to other planetary environments, setting the stage for universal foundation models within the planetary science community.
Conclusion
MOMO is the first unified foundation model engineered specifically for Martian orbital remote sensing. By introducing a scalable, model-merging approach underpinned by optimal checkpoint selection (EVL), MOMO efficiently integrates multi-resolution, multi-sensor data to yield robust performance gains over traditional and EO-based foundation models. Its design is inherently modular and extensible, providing a practical and theoretical template for the construction of planetary-scale FMs. This work is poised to drive future developments in planetary science, foundation model research, and automated Martian geospatial analysis, closing the methodological gap between terrestrial and extraterrestrial FM research.