AstroCLIP: Cross-Modal Astronomy Learning
- AstroCLIP is a family of cross-modal contrastive learning models that embed diverse astronomical data (images, spectra, time series, text) into a shared latent space.
- The models leverage modality-specific encoders and paired contrastive objectives to achieve state-of-the-art performance in zero-shot classification, regression, and anomaly detection.
- Extensive benchmarks and ablation studies highlight the importance of dataset balance, caption quality, and hyperparameter tuning in optimizing cross-modal alignment and downstream tasks.
AstroCLIP refers to a family of cross-modal contrastive learning models extending the Contrastive Language–Image Pretraining (CLIP) paradigm to domain-specific astronomical data. These models embed multiple scientific modalities—such as images, spectra, time series, and text—into a shared latent space using paired (or multimodal) self-supervised or supervised contrastive objectives. The resulting representations support zero- and few-shot semantic transfer, property regression, anomaly detection, and retrieval across both observed and synthetic astronomical datasets. The core implementations and benchmarks are detailed in several works, including CosmoCLIP (Imam et al., 2024), AstroM (Rizhko et al., 2024), and "AstroCLIP: A Cross-Modal Foundation Model for Galaxies" (Parker et al., 2023).
1. Model Architectures and Modalities
AstroCLIP systems utilize tailored encoders for each modality, with architecture variations according to data type and scientific objective.
- Imaging–Text Foundation Models: CosmoCLIP adapts the vanilla CLIP framework, using a Vision Transformer (ViT-B/16) image encoder and a 12-layer Transformer for caption encoding. Images are cutouts, and text inputs are content-rich captions generated by BLIP.
- Imaging–Spectra Foundation Models: The original AstroCLIP model employs a ResNet-50 image encoder alongside a six-layer transformer for 1D galaxy spectra, both individually pretrained and later aligned through a contrastive loss. Final embeddings are 128-dimensional for each path.
- Trimodal and Multimodal Extensions: AstroM generalizes CLIP to three modalities (photometric time series, 1D spectra, and astrophysical metadata) via separate backbones: an Informer-based encoder for irregular time series, a four-layer 1D CNN (GalSpecNet variant) for spectra, and a two-layer MLP for metadata. All outputs are projected into a 512-dimensional shared embedding space through additional learned linear layers.
Each encoder is followed by a projection head that maps raw features into a common embedding space; L2 normalization of the outputs is standard prior to contrastive similarity computation.
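The encoder-plus-projection-head pattern common to these models can be sketched as follows; the class names, dimensions, and pooling conventions are illustrative assumptions rather than the cited implementations:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProjectionHead(nn.Module):
    """Maps a modality-specific feature vector into the shared embedding space."""
    def __init__(self, in_dim: int, embed_dim: int = 512):
        super().__init__()
        self.proj = nn.Linear(in_dim, embed_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # L2-normalize so that dot products equal cosine similarities.
        return F.normalize(self.proj(x), dim=-1)

class CrossModalModel(nn.Module):
    """Wraps per-modality backbones (e.g., a ViT for images, a transformer for
    spectra) and projects their pooled features into a common space."""
    def __init__(self, backbones: dict[str, nn.Module],
                 feat_dims: dict[str, int], embed_dim: int = 512):
        super().__init__()
        self.backbones = nn.ModuleDict(backbones)
        self.heads = nn.ModuleDict({
            name: ProjectionHead(feat_dims[name], embed_dim) for name in backbones
        })

    def embed(self, modality: str, batch: torch.Tensor) -> torch.Tensor:
        features = self.backbones[modality](batch)   # (N, feat_dim) pooled features
        return self.heads[modality](features)        # (N, embed_dim), unit norm
```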
2. Datasets, Curation, and Preprocessing Pipelines
AstroCLIP models require large-scale, domain-specific datasets with cross-modal linking:
- CosmoCLIP/SpaceNet (Imam et al., 2024): The SpaceNet dataset, constructed via the FLARE framework, provides 12,900 images drawn from surveys such as SDSS together with synthetic augmentations. Images are evenly partitioned among eight object classes to optimize representational balance, and each image is paired with a natural-language caption generated by the BLIP model, yielding a corpus of (image, caption) pairs.
- AstroM/ASAS-SN–LAMOST (Rizhko et al., 2024): Cross-matches ASAS-SN variable star catalogs with Gaia EDR3 (astrometry and metadata) and LAMOST DR9 spectra, producing a trimodal dataset of 21,440 variable objects over the ten most populous classes. Sequence and scalar features are uniformly preprocessed (median absolute deviation normalization, log transforms, and standardization as appropriate).
- AstroCLIP–Legacy/DESI EDR (Parker et al., 2023): Galaxies from the DESI Legacy Imaging Survey (g, r, z cutouts; millions of objects available, with a subset cross-matched to spectra) paired with spectra from the DESI Early Data Release. Imaging cutouts are center-cropped and normalized; each spectrum is z-scored per object and tokenized with its mean and variance appended.
Augmentation pipelines include geometrical and photometric transforms on images (random rotations, flips, point-spread function modeling, noise), and temporal random masking of input windows for spectra in self-supervised pretraining.
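A schematic of these preprocessing and augmentation steps, assuming tensor image cutouts (C, H, W) and 1D flux arrays; crop sizes, noise levels, and masking fractions are illustrative placeholders rather than values from the cited pipelines:

```python
import torch
import torchvision.transforms as T

# Illustrative image augmentations of the kind described above (rotations,
# flips, mild additive noise); all parameter values are placeholders.
image_augment = T.Compose([
    T.RandomRotation(degrees=180),
    T.RandomHorizontalFlip(),
    T.RandomVerticalFlip(),
    T.CenterCrop(96),
    T.Lambda(lambda x: x + 0.01 * torch.randn_like(x)),
])

def zscore_spectrum(flux: torch.Tensor):
    """Per-object standardization; the mean and scale can be appended as extra
    tokens so the model retains the absolute flux level."""
    mean, std = flux.mean(), flux.std()
    return (flux - mean) / (std + 1e-8), mean, std

def random_mask(flux: torch.Tensor, mask_frac: float = 0.15, window: int = 20) -> torch.Tensor:
    """Zero out random contiguous windows, as used in mask-filling pretraining."""
    flux = flux.clone()
    n_windows = max(1, int(mask_frac * flux.numel() / window))
    for _ in range(n_windows):
        start = torch.randint(0, max(1, flux.numel() - window), (1,)).item()
        flux[start:start + window] = 0.0
    return flux
```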
3. Contrastive Pretraining Objectives
Central to AstroCLIP is multimodal contrastive alignment, typically via the InfoNCE loss. For two modalities (e.g., image–text):

$$\mathcal{L}_{\text{InfoNCE}} = -\frac{1}{2N}\sum_{i=1}^{N}\left[\log\frac{\exp(s_{ii}/\tau)}{\sum_{j=1}^{N}\exp(s_{ij}/\tau)} + \log\frac{\exp(s_{ii}/\tau)}{\sum_{j=1}^{N}\exp(s_{ji}/\tau)}\right],$$

where $s_{ij}$ denotes the pairwise cosine similarity between normalized projected embeddings of the $i$-th and $j$-th batch elements, and $\tau$ is a (typically learnable) temperature parameter. In trimodal and higher-order cases (AstroM), all pairwise contrastive objectives are summed, drawing together matching representations across all modalities and pushing non-matching pairs apart:

$$\mathcal{L}_{\text{total}} = \sum_{m < m'} \mathcal{L}_{m,m'},$$

with $\mathcal{L}_{m,m'}$ as the symmetric contrastive loss for modality pair $(m, m')$. Encoder and projection head weights are typically updated only during contrastive fine-tuning; initial self-supervised pretraining for single-modality paths may employ other losses, such as mask-filling MSE for spectra or MoCo-style image contrastive objectives.
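A compact, generic implementation of this objective (standard CLIP-style code, not the cited authors' releases) might look like the following, with a single learnable temperature shared across modality pairs:

```python
import itertools
import torch
import torch.nn.functional as F

def clip_loss(z_a: torch.Tensor, z_b: torch.Tensor, log_temp: torch.Tensor) -> torch.Tensor:
    """Symmetric InfoNCE between two batches of L2-normalized embeddings.
    Matching pairs share the same row index."""
    logits = (z_a @ z_b.T) * log_temp.exp()        # cosine similarity / temperature
    targets = torch.arange(z_a.shape[0], device=z_a.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))

def multimodal_loss(embeddings: dict[str, torch.Tensor], log_temp: torch.Tensor) -> torch.Tensor:
    """Sum the symmetric contrastive loss over all modality pairs (trimodal case)."""
    total = 0.0
    for (_, z_a), (_, z_b) in itertools.combinations(embeddings.items(), 2):
        total = total + clip_loss(z_a, z_b, log_temp)
    return total

# Single learnable temperature, initialized as in CLIP (log(1/0.07) ≈ 2.659).
log_temp = torch.nn.Parameter(torch.tensor(2.659))
```

In practice the temperature parameter is simply included in the optimizer alongside the encoder and projection-head weights during the contrastive stage.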
4. Performance Benchmarks and Downstream Tasks
AstroCLIP variants achieve state-of-the-art results on both in-domain and out-of-domain generalization tasks:
- Zero-Shot Classification and Retrieval (Imam et al., 2024): CosmoCLIP surpasses the CLIP baseline by wide margins: in-domain top-1 accuracy increases from 6.45% (CLIP) to 70.87% (CosmoCLIP with BLIP captions); out-of-domain accuracy rises similarly (6.63% to 71.72%). Image–image retrieval cosine@1 jumps from 54.02 to 93.60.
- Property Inference (Parker et al., 2023): For galaxy property estimation, cross-modal embeddings enable zero-shot photometric redshift and stellar mass regression. AstroCLIP's spectral embeddings yield redshift $R^2$ of 0.97–0.99 (vs. 0.69 for (g, r, z) photometry + MLP) and stellar mass $R^2$ of 0.86.
- Multimodal Variable Star Classification (Rizhko et al., 2024): AstroM pretraining increases photometric classification accuracy (reaching 91.5%; see the table below) and boosts spectral classification by up to 12.6 points when labeled data are scarce (Table 3 therein).
- Semantic Manifold Structure: UMAP projections of embeddings reveal physically meaningful class separation (e.g., Mira subtypes, rotational variable subclasses) without explicit subtype supervision.
- Anomaly and Outlier Detection: Embedding-space methods including DBSCAN and cosine thresholds flag physical outliers and catalog mislabels, supporting robust quality assurance in large surveys.
| Model/Dataset | Modality | Zero-shot Top-1 (%) | Redshift $R^2$ (spec/image) | Retrieval (cos@1) |
|---|---|---|---|---|
| CosmoCLIP/SpaceNet | image–text | 70.87 | N/A | 93.60 |
| AstroM/ASAS-SN | trimodal (photometry/spectra/metadata) | 91.5 (photometry) | N/A | N/A |
| AstroCLIP/Legacy | spectrum–image | N/A | 0.97 / 0.71 | N/A |
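In the shared space, zero-shot classification and retrieval reduce to cosine similarity against prompt or gallery embeddings. The sketch below assumes precomputed, L2-normalized embeddings; how class prompts are worded and encoded is left to the specific model and is not specified by the cited works:

```python
import torch

@torch.no_grad()
def zero_shot_classify(image_embs: torch.Tensor, class_prompt_embs: torch.Tensor) -> torch.Tensor:
    """Assign each image to the class whose prompt embedding is most similar.
    Inputs are assumed L2-normalized: (N, D) image embeddings, (C, D) class prompts."""
    return (image_embs @ class_prompt_embs.T).argmax(dim=-1)

@torch.no_grad()
def retrieve_top1(query_embs: torch.Tensor, gallery_embs: torch.Tensor) -> torch.Tensor:
    """Cosine@1 retrieval: index of the most similar gallery item for each query."""
    return (query_embs @ gallery_embs.T).argmax(dim=-1)
```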
5. Ablation Studies and Design Choices
Empirical studies highlight critical design parameters:
- Captioning Quality (Imam et al., 2024): BLIP-generated captions encode rich semantic detail, yielding 2.5× higher accuracy than LLaVA-generated captions for CosmoCLIP. The diversity and morphological specificity of BLIP outputs are essential for alignment.
- Dataset Balance and Realism: Balanced, artifact-augmented datasets (SpaceNet/FLARE) outperform raw web-scraped or naively synthetic alternatives; zero-shot accuracy drops substantially when the ViT-B/16 backbone is trained on raw data instead.
- Learning Rate/Batch Size: Learning rates that are too small slow convergence, while overly large ones lead to overfitting at this moderate dataset scale. A batch size of 32 maintains gradient stability under hardware constraints.
- Projection Head and Temperature Tuning: Shared 512-dimensional projection space (CosmoCLIP, AstroM) and single learnable temperature yield stable optimization. Fixed temperature (AstroCLIP/Legacy) can improve cross-modal alignment.
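These findings can be gathered into a training configuration; the sketch below restates only the choices reported above, and any value not mentioned in the text (e.g., the learning rate and epoch count) is an illustrative placeholder:

```python
from dataclasses import dataclass

@dataclass
class ContrastiveTrainConfig:
    """Collects the design choices discussed above; unstated values are placeholders."""
    embed_dim: int = 512                 # shared projection space (CosmoCLIP, AstroM)
    batch_size: int = 32                 # keeps gradients stable under hardware limits
    learnable_temperature: bool = True   # single learnable temperature (fixed for AstroCLIP/Legacy)
    learning_rate: float = 1e-4          # placeholder: too small slows convergence, too large overfits
    max_epochs: int = 100                # illustrative placeholder

config = ContrastiveTrainConfig()
```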
6. Scientific Impact and Domain Insights
AstroCLIP-style models offer several scientific and practical advantages:
- Label Efficiency: Embeddings support zero- and few-shot transfer in settings where labeled data are costly (e.g. rare object classes, new surveys), with substantial gains under label scarcity (Rizhko et al., 2024).
- Physical Meaningfulness: Cross-modal latent representations align with physical properties (redshift, stellar mass, morphological class), even without explicit physical labels (Parker et al., 2023).
- Community Reusability: Frozen embedding heads may function as public foundation models for diverse analyses, including similarity search, anomaly detection, regression, and conditional generative inference (a minimal probing example follows this list).
- Extensibility: The frameworks generalize beyond two modalities (AstroM), with clear paths to include additional photometric bands, human-annotated texts, gravitational-wave triggers, or multiwavelength image cubes.
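As a concrete example of this label-efficient reuse, a simple k-nearest-neighbour probe on frozen embeddings performs property regression without touching the encoders; the file names and hyperparameters below are hypothetical:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import r2_score

# Frozen embeddings (e.g., spectral or image embeddings from a pretrained model)
# and a physical label such as redshift; file names are hypothetical placeholders.
train_embs = np.load("train_embeddings.npy")   # (N_train, D)
train_z    = np.load("train_redshift.npy")     # (N_train,)
test_embs  = np.load("test_embeddings.npy")
test_z     = np.load("test_redshift.npy")

# k-NN regression in the embedding space: no fine-tuning of the encoders.
knn = KNeighborsRegressor(n_neighbors=16, weights="distance")
knn.fit(train_embs, train_z)
print("redshift R^2:", r2_score(test_z, knn.predict(test_embs)))
```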
Domain-specific considerations remain: rare-class under-representation (e.g., black holes), caption hallucination by BLIP (e.g., spurious spiral arms), and adaptation to new bands or modalities (which requires new or fine-tuned captioning or spectral encoders) all demand ongoing attention.
7. Outlook and Future Directions
AstroCLIP and its variants exemplify the rapid evolution of domain-adapted multimodal foundation models for astronomy:
- Temporal and Spectral Expansion: Ongoing efforts extend models to handle temporal stacks (video), time series, and integral field (3D) spectral cubes via additional encoders and contrastive losses (Imam et al., 2024, Rizhko et al., 2024).
- Continual Learning: Incorporating adapter modules or periodic retraining strategies enables continual model updates in response to large-scale new data releases (e.g., LSST, Euclid).
- Handling Missing Modalities: Extensions to handle missing data at training or inference illustrate the need for robust multimodal representation even with incomplete coverage (Rizhko et al., 2024).
- Integration with Simulation and Causality: Future work aims to use latent embeddings as priors for generative or causal inference in galaxy evolution, and for sim-to-real transfer.
AstroCLIP establishes the cross-modal contrastive paradigm as a foundation for data-driven discovery and efficient knowledge transfer in modern astronomical research, providing architecture blueprints and performance baselines for subsequent foundation model development in the physical sciences (Parker et al., 2023, Imam et al., 2024, Rizhko et al., 2024).