Astronomical Foundation Models
- Astronomical foundation models are large-scale machine learning systems that create general, reusable representations of astronomical data from multiple modalities.
- They combine architectures such as transformers with training strategies such as contrastive learning to integrate images, spectra, and time series for robust scientific inference.
- Their application has led to significant improvements in tasks such as stellar parameter regression, galaxy morphology classification, and time-domain classification.
Astronomical foundation models are large-scale machine learning systems, often based on neural architectures such as transformers or vision-language encoders, trained to provide general, reusable representations of astronomical data across diverse tasks, modalities, and observational regimes. These models are distinguished from traditional task-specific approaches by their capacity to encode broad scientific priors, capture inter-modal and physical relationships, and streamline adaptation to new instruments, wavelength ranges, or scientific questions. The development and application of such models span a spectrum from astrometry-based reference frames to transformer and contrastive frameworks trained on images, spectra, time series, or multi-modal datasets, in fields ranging from galaxy morphology and stellar physics to solar forecasting and radio astronomy.
1. Foundations in Astrometry and the Precedent for Generalizable Models
Accurate astrometric data (the measurement of stellar positions, proper motions, and distances via parallax) is the classical foundation upon which all quantitative astrophysical inference is built. ESA's Hipparcos mission (1989–1993) and, more recently, Gaia (launched 2013) have provided the field-defining datasets enabling the precise mapping of the Milky Way's three-dimensional geometry, the kinematics of stars and stellar systems, and the spatial distribution of dark matter. Canonical relationships, such as $d = 1/\varpi$ (distance in parsecs from parallax $\varpi$ in arcseconds) and $v_t = 4.74\,\mu/\varpi$ (tangential velocity in km/s from proper motion $\mu$ in arcsec/yr and parallax), remain central in translating raw measurements into physically meaningful coordinates and velocities. These reference datasets and their derived catalogues (such as Tycho-2 and the Gaia Data Releases) have catalyzed the vision for universal, infrastructure-like astronomical models. The astrometric paradigm demonstrates that maximal utility arises when foundational measurements and models are designed for endurance, cross-disciplinarity, and extensibility, setting the conceptual precedent for modern data-driven foundation models (Høg, 2014).
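These canonical relations translate directly into arithmetic; a minimal worked example with illustrative (not catalogue) values:

```python
# Canonical astrometric relations:
#   d [pc] = 1 / parallax [arcsec]
#   v_t [km/s] = 4.74 * mu [arcsec/yr] / parallax [arcsec]
parallax_arcsec = 0.010   # 10 mas parallax
pm_arcsec_yr = 0.050      # 50 mas/yr total proper motion

distance_pc = 1.0 / parallax_arcsec                  # -> 100.0 pc
v_tan_km_s = 4.74 * pm_arcsec_yr / parallax_arcsec   # -> 23.7 km/s
print(distance_pc, v_tan_km_s)
```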
2. Model Architectures: Representation Learning Across Modalities
Recent astronomical foundation models typically employ deep architectures, notably large transformers, that generalize beyond fixed input types. For instance, models constructed for stellar inference encode "tokens" such as Gaia XP spectral coefficients, photometric colors, stellar labels, and dust extinction parameters as inputs to self-attention-based encoders and decoders. Each observation is nonlinearly embedded (e.g., $h_i = f_\theta(x_i)$ for input $x_i$) and aggregated via multi-head attention, $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(QK^{\top}/\sqrt{d_k}\right)V$, a mechanism that facilitates complex cross-modal or cross-parameter reasoning (Leung et al., 2023). Architectures in galaxy modeling implement hybrid contrastive and supervised losses, employing frameworks adapted from BYOL or CLIP, and integrating specialized objectives such as the Dirichlet loss over multi-question Galaxy Zoo labels (Walmsley et al., 2022).
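As an illustration of this token-based design, the sketch below (a hypothetical layout, not the published architecture) embeds each scalar input with a small MLP, adds a learned token-type embedding, and mixes tokens with a transformer encoder:

```python
import torch
import torch.nn as nn

class StellarTokenEncoder(nn.Module):
    """Sketch: embed heterogeneous stellar 'tokens' (XP coefficients,
    colors, extinction, ...) and mix them with self-attention."""
    def __init__(self, n_token_types, d_model=128, n_heads=8, n_layers=4):
        super().__init__()
        # nonlinear value embedding h_i = f_theta(x_i)
        self.value_embed = nn.Sequential(
            nn.Linear(1, d_model), nn.GELU(), nn.Linear(d_model, d_model))
        self.type_embed = nn.Embedding(n_token_types, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, values, token_ids):
        # values: (batch, n_tokens, 1); token_ids: (batch, n_tokens) ints
        h = self.value_embed(values) + self.type_embed(token_ids)
        return self.encoder(h)  # (batch, n_tokens, d_model)

enc = StellarTokenEncoder(n_token_types=120)
vals = torch.randn(4, 120, 1)            # one scalar observation per token
ids = torch.arange(120).expand(4, 120)   # token-type index per position
z = enc(vals, ids)                       # contextualized embeddings (4, 120, 128)
```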
Multi-modal approaches such as AstroCLIP align transformer-based image and spectrum encoders into a shared latent space through cross-modal contrastive learning (Parker et al., 2023). These designs often include mask-predictive pretraining (for spectra) and cross-attention aggregation heads to compactly represent heterogeneous input structures. In the time domain, models like FALCO and Astromer 2 use transformer stacks with extensive self-attention across light curve sequences, while causal foundation models explicitly disentangle physical signals from instrumental effects using dual encoders and structured triplet losses (Audenaert et al., 7 Jul 2025).
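The cross-modal contrastive objective behind such alignment can be written compactly. The sketch below assumes the two encoders have already produced paired embeddings and applies a symmetric InfoNCE loss (a generic formulation, not AstroCLIP's exact code):

```python
import torch
import torch.nn.functional as F

def infonce_loss(img_emb, spec_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired image/spectrum embeddings.
    Matching rows are positives; every other row in the batch is a negative."""
    img = F.normalize(img_emb, dim=-1)
    spec = F.normalize(spec_emb, dim=-1)
    logits = img @ spec.T / temperature               # cosine similarity matrix
    targets = torch.arange(len(img), device=img.device)
    # cross-entropy pulls matched pairs together, pushes mismatches apart
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))
```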
3. Self-Supervised, Contrastive, and Hybrid Training Strategies
Astronomical foundation models are increasingly trained with self-supervised or hybrid training objectives, which circumvent the scarcity of labeled data and enable robust transfer. Typical methods include:
- Contrastive Learning: Imposing invariance by maximizing agreement between differently augmented versions of the same observation, e.g., BYOL-style losses for galaxies or InfoNCE losses for cross-modal alignment (Walmsley et al., 2022, Parker et al., 2023).
- Supervised and Weakly Supervised Objectives: Integrating annotated crowd-sourced data via probabilistic or distributional losses (such as a Dirichlet loss over Galaxy Zoo volunteer responses), enabling the model both to learn canonical features and to handle missing labels natively (Walmsley et al., 2022).
- Mask Prediction and Autoencoding: Time-series and spectral models frequently use masked reconstruction objectives, in which randomly omitted segments of light curves or spectra must be predicted by the network, enforcing the learning of both global and local dependencies (Donoso-Oliva et al., 4 Feb 2025, Zuo et al., 28 Apr 2025); a minimal sketch follows this list.
- Causal Disentanglement: Recent causal foundation models use structured training data (“anchor / same-star / same-instrument” triplets) and dual encoder–decoder pipelines to explicitly segregate physical and instrumental factors, delivering representations that are robust under distribution shifts and enable efficient adaptation in low-data regimes (Audenaert et al., 7 Jul 2025).
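A toy version of the masked-reconstruction objective referenced above, assuming (time, flux) light-curve tokens and a small transformer (names and shapes are illustrative, not those of any published model):

```python
import torch
import torch.nn as nn

class MaskedLightCurveModel(nn.Module):
    """Toy masked-reconstruction pretraining for light curves."""
    def __init__(self, d_model=64, n_heads=4, n_layers=2):
        super().__init__()
        self.embed = nn.Linear(2, d_model)   # (time, flux) step -> token
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, 1)    # predict the hidden flux

    def forward(self, x, mask):
        # x: (batch, seq, 2); mask: (batch, seq) bool, True where flux is hidden
        x = x.clone()
        x[..., 1] = torch.where(mask, torch.zeros_like(x[..., 1]), x[..., 1])
        return self.head(self.encoder(self.embed(x))).squeeze(-1)

model = MaskedLightCurveModel()
x = torch.randn(8, 200, 2)                    # toy batch of light curves
mask = torch.rand(8, 200) < 0.15              # hide ~15% of flux values
loss = ((model(x, mask) - x[..., 1])[mask] ** 2).mean()  # MSE on masked fluxes
loss.backward()
```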
The hybridization of self-supervised and supervised objectives, combined with dataset designs that maximize diversity and leverage crowdsourcing (e.g., Galaxy Zoo–Evo with millions of volunteer responses), has been shown to yield generalizable representations that outperform purely supervised or purely self-supervised baselines, particularly in label-scarce downstream settings.
4. Cross-Modality, Vision-Language, and Multi-Domain Generalization
A distinguishing aspect of contemporary astronomical foundation models is their explicit multi-modality and ability to generalize across domains:
- Galaxy Models: AstroCLIP demonstrates that a cross-modal InfoNCE loss can be used to align galaxy image and spectrum representations into a physically meaningful shared latent space, with downstream tasks (such as photometric redshift estimation) achieved via zero-shot regression or minimal fine-tuning, as sketched after this list (Parker et al., 2023).
- Solar Physics: Foundation models for solar data such as SDO-FM and Solaris integrate heterogeneous measurements (multi-wavelength images, magnetic field maps, and spectral irradiances) into unified embedding spaces via autoencoders or 3D Swin transformer architectures, enabling efficient forecasting, instrument fusion, and missing data reconstruction (Walsh et al., 3 Oct 2024, Majid et al., 25 Nov 2024).
- Vision-Language Applications: Models like PAPERCLIP, CosmoCLIP, and AstroLLaVA leverage paired image–text datasets and vision–language contrastive training to build systems capable of natural language query, image retrieval, and descriptive captioning across astronomical archives, achieving superior alignment and retrieval metrics relative to baseline CLIP configurations (Mishra-Sharma et al., 13 Mar 2024, Imam et al., 10 Jul 2024, Zaman et al., 11 Apr 2025).
- Time-Domain Astronomy: Foundation models trained on light curves (FALCO, Astromer 2) or structured time series with causal separation (dual-encoder models (Audenaert et al., 7 Jul 2025)) excel in classification, regression, and anomaly detection across tasks, with performance that scales favorably with input sequence length and model capacity (Zuo et al., 28 Apr 2025, Donoso-Oliva et al., 4 Feb 2025).
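The zero-shot regression pattern above reduces to nearest-neighbour lookup in the frozen latent space. A sketch with placeholder arrays (a real pipeline would substitute embeddings from the pretrained encoder and spectroscopic redshift labels):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
train_emb = rng.normal(size=(1000, 512))   # frozen embeddings with known redshifts
train_z = rng.random(1000)                 # spectroscopic redshift labels
test_emb = rng.normal(size=(100, 512))     # embeddings of new galaxies

# No encoder fine-tuning: distance-weighted k-NN regression in latent space
knn = KNeighborsRegressor(n_neighbors=16, weights="distance")
knn.fit(train_emb, train_z)
z_pred = knn.predict(test_emb)             # photometric redshift estimates
```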
These results collectively show that well-architected foundation models can perform discriminative and generative tasks across survey boundaries and observational modalities, with cross-instrument transfer and robust uncertainty quantification.
5. Performance Benchmarks, Scaling Laws, and Representational Convergence
Empirical evaluation of astronomical foundation models centers on several conventional regimes:
- Classification (F1 Score): For time-domain models, F1 scores for downstream classification rise substantially when using learned embeddings compared to models trained from scratch (e.g., a 15% improvement on ATLAS data with Astromer 2 (Donoso-Oliva et al., 4 Feb 2025)).
- Regression (RMSE, Scatter): Stellar parameter inference with transformer-based models achieves RMSE/MAD values as low as 19 K for $T_{\mathrm{eff}}$ and 0.07 dex for $\log g$ (Leung et al., 2023), while $\log g$ estimation from FALCO attains an RMSE of 0.1305 dex (falling to 0.02–0.08 dex in optimal regimes) (Zuo et al., 28 Apr 2025).
- Retrieval and Alignment: Cross-modal models are assessed by retrieval accuracy (e.g., image–text retrieval top-k metrics with PAPERCLIP (Mishra-Sharma et al., 13 Mar 2024)) and mutual k-nearest neighbor (MKNN) alignment (UniverseTBD et al., 23 Sep 2025).
- Scaling Trends: Across foundation model architectures, representational alignment (measured by MKNN) increases monotonically with model capacity, providing empirical support for the Platonic Representation Hypothesis that sufficiently scaled models, trained on diverse astronomical data, converge toward a shared latent structure representing universal astrophysical phenomena (UniverseTBD et al., 23 Sep 2025).
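One way to realize the MKNN alignment score is to measure, for each object, the overlap between its k-nearest-neighbour sets in the two embedding spaces; a sketch under that assumption (not the paper's reference implementation):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def mutual_knn_alignment(emb_a, emb_b, k=10):
    """Mean fraction of shared k-nearest neighbours for the SAME objects
    embedded by two different models (rows of emb_a and emb_b correspond)."""
    def knn_sets(emb):
        nbrs = NearestNeighbors(n_neighbors=k + 1).fit(emb)
        idx = nbrs.kneighbors(emb, return_distance=False)[:, 1:]  # drop self
        return [set(row) for row in idx]
    sets_a, sets_b = knn_sets(emb_a), knn_sets(emb_b)
    return float(np.mean([len(a & b) / k for a, b in zip(sets_a, sets_b)]))
```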
The table below summarizes representative performance metrics for various foundation models:
| Model | Domain | Metric | Value |
|---|---|---|---|
| FALCO | Light curve (Kepler) | Classification accuracy | 95% |
| SpectraFM | Stellar spectra | [Fe/H] RMSE | 0.094 dex |
| AstroCLIP | Galaxies (image/spectrum) | Redshift regression | 0.97–0.99 |
| CosmoCLIP | Imaging (SpaceNet) | Zero-shot top-1 accuracy | +64% over CLIP |
| Astromer 2 | Light curve (ATLAS) | F1 score | +15% over prior |
6. Adaptation, Fine-Tuning, and Best Practices
Optimal adaptation of general-purpose foundation models to astronomy requires iterative, task-aware strategies:
- Head Replacement and Fine-Tuning: Standard practice involves removing pre-trained classification heads and attaching custom projectors or MLPs, followed by selective fine-tuning of backbones and adaptors. Approaches such as LoRA (Low-Rank Adaptation) allow parameter-efficient adaptation while minimizing catastrophic forgetting (Lastufka et al., 17 Sep 2024, Riggi et al., 31 Mar 2025); a minimal LoRA sketch follows this list.
- Preprocessing and Augmentation: Domain-specific input preprocessing (e.g., cropping or upsampling to match model patch sizes, augmentation matching the physical symmetries of celestial images) is critical for strong performance, especially in noise-dominated radio or high-dynamic-range imaging (Lastufka et al., 17 Sep 2024).
- Label Space Re-definition: Redefining labeling schemes to match the statistical properties of astronomical data (e.g., using “number of bright peaks” instead of complex morphological classes) improves transferability and detection accuracy in heterogeneous regimes (Lastufka et al., 17 Sep 2024).
- Cross-Modality and Dataset Curation: Leveraging paired image–text or cross-instrument data (with carefully curated high-quality captions or labels) enhances multi-modal alignment, which is foundational for downstream generalization and usability in science pipelines (Mishra-Sharma et al., 13 Mar 2024, Imam et al., 10 Jul 2024).
- Mitigating Synthetic Gaps: For models bridging synthetic and real data (e.g., SpectraFM), careful staged fine-tuning on real observations is required to transition from idealized, noise-free pretraining to robust, physically meaningful inference in real-world settings (Koblischke et al., 7 Nov 2024).
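A minimal LoRA adapter of the kind referenced in the first item above: the pretrained linear weight is frozen and only a low-rank update is trained (a sketch for illustration, not the peft library API):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen pretrained nn.Linear with a trainable low-rank update:
    y = W x + (alpha / r) * B A x, training only A and B."""
    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                    # freeze pretrained weights
        self.lora_a = nn.Parameter(0.01 * torch.randn(r, base.in_features))
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, r))  # no-op at init
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)

# usage: swap a backbone projection for its LoRA-wrapped version
backbone_proj = nn.Linear(768, 768)   # stands in for a pretrained layer
adapted = LoRALinear(backbone_proj, r=8)
out = adapted(torch.randn(4, 768))
```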
7. Scientific Impact, Open Questions, and Future Prospects
Astronomical foundation models already underpin a wide array of inferential, discovery, and data mining tasks, with significant implications for observational astronomy, instrument design, and scientific reproducibility. Their impact includes:
- Enabling rapid, label-efficient adaptation to new instruments, wavelengths, and tasks, especially as new surveys emerge (e.g., Rubin/LSST, JWST, SKA).
- Supporting robust science cases ranging from the dynamical mapping of the Milky Way to rapid classification of variable stars and solar event forecasting.
- Facilitating multi-modal scientific workflows, including instrument fusion and cross-modal searches, by providing shared embedding spaces.
Key unresolved questions and future research avenues include:
- Determining the optimal scaling laws and limits for universal representations, especially as data volumes and diversity expand (UniverseTBD et al., 23 Sep 2025).
- Improving causal disentanglement (physics vs. instrumentation) in noisy or complex observational regimes (Audenaert et al., 7 Jul 2025).
- Developing architectures and training objectives that naturally integrate multi-modality (imaging, spectroscopy, time series, language) and admit efficient downstream customization.
- Open-sourcing pre-trained weights, embeddings, and community datasets to democratize model reuse and avoid the environmental cost of redundant training (Donoso-Oliva et al., 4 Feb 2025, Walsh et al., 3 Oct 2024, Zaman et al., 11 Apr 2025).
- Advancing interpretability and uncertainty quantification methods to better match the decision-critical workflows of survey-scale astronomy.
Taken together, contemporary astronomical foundation models signal a paradigm shift toward general, reusable, and extensible inference engines in astronomy, providing a foundation for decades of scientific discovery and instrument innovation.