AstroLLaVA: Astronomy Vision-Language Model
- AstroLLaVA is a vision-language model specialized for astronomy that integrates a frozen CLIP ViT-L/14 encoder with a LLaMA language model using learnable projection layers.
- It employs a two-stage fine-tuning workflow—first for image captioning and then for visual Q&A using synthetic dialogue—to align visual features with astronomical vocabulary.
- The model shows marginal improvements in galaxy morphology classification and offers a roadmap for extending multi-modal applications in astronomy.
AstroLLaVA is a vision-language model (VLM) for astronomy that extends the LLaVA 1.5 "visual instruction tuning" paradigm to enable natural language interaction with astronomical imagery. By leveraging a two-stage fine-tuning approach on a curated dataset of approximately 30,000 images with captions and synthetic question–answer pairs sourced from NASA's Astronomy Picture of the Day (APOD), the European Southern Observatory (ESO), and the NASA/ESA Hubble Space Telescope (HST), AstroLLaVA is tailored to answer open-ended questions about visually depicted astronomical phenomena. Its architecture, training methodology, evaluation, and prospective roadmaps are detailed below (Zaman et al., 11 Apr 2025).
1. Model Architecture and Visual-Linguistic Integration
AstroLLaVA adopts the LLaVA 1.5 dual-tower design, which consists of a frozen CLIP ViT-L/14 vision encoder, denoted $f_v$, and a frozen LLaMA 7B LLM, $f_\ell$. The vision encoder maps an input astronomical image $x$ to a sequence of patch embeddings $Z = f_v(x) \in \mathbb{R}^{N \times d}$, where $N$ is the number of image patches and $d$ the feature dimensionality.
A key innovation is the introduction of learnable linear projection layers $W$, which map visual features into the language embedding space, yielding $H_v = W Z$. These projected visual tokens are prepended to the token embeddings of a language prompt and passed through the LLaMA transformer. Multi-head attention mechanisms, $\mathrm{MHA}(\cdot)$, operate over the concatenated sequence of visual and textual embeddings. This forms the multimodal context for subsequent autoregressive text generation, allowing cross-attention layers to learn grounding between linguistic queries and image content.
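The project-and-prepend step above can be sketched in a few lines of PyTorch. The tensor sizes here are toy placeholders (the real CLIP ViT-L/14 and LLaMA 7B hidden sizes are 1024 and 4096), and all variable names are illustrative:

```python
import torch
import torch.nn as nn

# Toy dimensions standing in for CLIP ViT-L/14 (d_v) and LLaMA 7B (d_l).
d_v, d_l, num_patches, seq_len = 16, 32, 6, 5

# Learnable linear projection W mapping visual features into the language
# embedding space; this is the only new module in the dual-tower design.
projection = nn.Linear(d_v, d_l)

patch_embeddings = torch.randn(1, num_patches, d_v)  # Z = f_v(x)
text_embeddings = torch.randn(1, seq_len, d_l)       # embedded prompt tokens

# H_v = W Z: projected visual tokens, prepended to the text tokens to form
# the multimodal context consumed by the LLaMA transformer.
visual_tokens = projection(patch_embeddings)
multimodal_context = torch.cat([visual_tokens, text_embeddings], dim=1)

print(multimodal_context.shape)  # torch.Size([1, 11, 32])
```

The concatenated sequence is then processed by ordinary autoregressive attention; no architectural change to the LLM itself is required.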
2. Two-Stage Fine-Tuning Workflow
AstroLLaVA employs a domain-adaptive two-stage fine-tuning protocol designed to specialize the VLM for astronomical tasks:
- Stage 1: Image Captioning. Training utilizes 9,962 images from NASA APOD, 14,617 from ESO, and 5,204 from HST, each accompanied by human-written captions. The vision and language backbones ($f_v$, $f_\ell$) remain frozen, while only the projection layers $W$ are optimized to minimize the standard autoregressive cross-entropy loss over caption tokens $t_1, \dots, t_T$:

$$\mathcal{L}_{\text{caption}} = -\sum_{i=1}^{T} \log p_\theta\!\left(t_i \mid t_{<i}, H_v\right)$$
The captioning stage aims to align visual features with astronomical vocabulary. Training uses the Adam optimizer with a batch size of 64 for roughly three epochs.
- Stage 2: Visual Question Answering (VQA). The caption corpus is used to prompt GPT-4, which generates synthetic "conversation" turns—3–5 Q&A pairs per image—resulting in ~100,000 dialog turns. In this phase, the entire model (including CLIP and LLaMA) is unfrozen and fine-tuned end-to-end, again minimizing the autoregressive cross-entropy, now over answer tokens $a_1, \dots, a_T$ conditioned on the question $q$:

$$\mathcal{L}_{\text{VQA}} = -\sum_{i=1}^{T} \log p_\theta\!\left(a_i \mid a_{<i}, q, H_v\right)$$
Stage 2 training uses a batch size of 8–16 for roughly ten epochs.
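The two-stage freeze/unfreeze schedule can be sketched with toy linear layers standing in for the real backbones; the module sizes and the training-step internals below are illustrative assumptions, not the actual AstroLLaVA code:

```python
import torch
import torch.nn as nn

# Toy stand-ins for the three components (not the real CLIP/LLaMA sizes).
vision_encoder = nn.Linear(16, 16)   # f_v
projection = nn.Linear(16, 32)       # W, the learnable projection
language_model = nn.Linear(32, 100)  # f_l with a 100-token toy vocabulary

def training_step(optimizer):
    # One step: encode, project, score next-token logits, and minimize the
    # autoregressive cross-entropy against (here random) target tokens.
    optimizer.zero_grad()
    features = vision_encoder(torch.randn(4, 16))
    logits = language_model(projection(features))
    loss = nn.functional.cross_entropy(logits, torch.randint(0, 100, (4,)))
    loss.backward()
    optimizer.step()

# Stage 1: freeze both backbones; only the projection receives gradients.
for module in (vision_encoder, language_model):
    for p in module.parameters():
        p.requires_grad_(False)
training_step(torch.optim.Adam(projection.parameters()))
print(vision_encoder.weight.grad)  # None: the frozen encoder got no gradient

# Stage 2: unfreeze everything and fine-tune end-to-end.
all_params = [p for m in (vision_encoder, projection, language_model)
              for p in m.parameters()]
for p in all_params:
    p.requires_grad_(True)
training_step(torch.optim.Adam(all_params))
print(vision_encoder.weight.grad is not None)  # True: now updated end-to-end
```

The only difference between the stages, from an optimization standpoint, is which parameters carry `requires_grad=True` and therefore appear in the optimizer.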
3. Dataset Composition and Preparation
The fine-tuning resources integrate three prominent public image–caption archives specifically selected for outreach relevance:
| Source | Image–Caption Pairs | Description |
|---|---|---|
| NASA APOD | 9,962 | Daily astronomical images with human captions |
| ESO Public Images | 14,617 | Scraped astronomical images and English captions |
| NASA/ESA HST | 5,204 | Hubble Space Telescope outreach imagery |
All captions are normalized (removal of non-image entries, punctuation standardization) and then supplied as GPT-4 prompts to generate Q&A pairs per image, culminating in approximately 100,000 question–answer dialogs. The aggregated dataset is partitioned 80/10/10 for training, validation, and testing, respectively.
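As a sketch, the 80/10/10 partition of the aggregated corpus (9,962 + 14,617 + 5,204 = 29,783 image–caption records) might look like the following; the record structure is hypothetical, and only the counts and split ratios come from the text:

```python
import random

# Hypothetical records: only the total count matches the aggregated corpus.
records = [{"image_id": i} for i in range(9_962 + 14_617 + 5_204)]

rng = random.Random(0)   # fixed seed for a reproducible partition
rng.shuffle(records)

# 80/10/10 split into training, validation, and test sets.
n = len(records)
train = records[: int(0.8 * n)]
val = records[int(0.8 * n): int(0.9 * n)]
test = records[int(0.9 * n):]

print(len(train), len(val), len(test))  # 23826 2978 2979
```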
4. Evaluation and Comparative Benchmarking
AstroLLaVA's capacity for astronomical visual reasoning is assessed using the Galaxy 10 DECaLS (G10) benchmark, which comprises 1,770 test images annotated for 10 galaxy morphology classes. The following prompt is used for each image: "Describe the following galaxy image in detail. What type of galaxy is it and what are its key features?"
Model outputs are embedded using the all-MiniLM-L6-v2 text encoder. The evaluation metric is the mean cosine similarity between generated descriptions and class-specific keywords (e.g., "spiral arms", "barred structure", "edge-on disk"):
| Model | Cosine Similarity (G10) |
|---|---|
| LLaVA 1.5 7B | 0.594 |
| LLaVA 1.5 13B | 0.591 |
| AstroLLaVA 7B | 0.597 |
AstroLLaVA exhibits a marginal improvement in semantic alignment over baseline LLaVA models on this morphology classification task (0.597 vs. 0.594 for LLaVA 1.5 7B), suggesting that domain-specific adaptation yields a modest gain in astronomy-relevant VQA.
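The cosine-similarity metric itself is straightforward to reproduce. The sketch below computes it over random placeholder arrays standing in for 384-dimensional all-MiniLM-L6-v2 embeddings; the function name and the data are illustrative assumptions:

```python
import numpy as np

def mean_cosine_similarity(gen_embs, kw_embs):
    """Mean cosine similarity between row-paired embedding matrices."""
    gen = gen_embs / np.linalg.norm(gen_embs, axis=1, keepdims=True)
    kw = kw_embs / np.linalg.norm(kw_embs, axis=1, keepdims=True)
    return float(np.mean(np.sum(gen * kw, axis=1)))

# Placeholder 384-dim embeddings for the 1,770 G10 test images, standing in
# for the encoded model descriptions and class-keyword strings.
rng = np.random.default_rng(0)
descriptions = rng.normal(size=(1770, 384))
keywords = rng.normal(size=(1770, 384))

# Identical pairs give similarity 1.0 by construction.
print(round(mean_cosine_similarity(descriptions, descriptions), 3))  # 1.0
```

In the actual evaluation, `descriptions` and `keywords` would be the MiniLM embeddings of the generated text and of each class's keyword string, respectively.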
5. Implementation Details and Resource Utilization
Training is conducted on four NVIDIA A100 (40 GB) GPUs at the ITER Teide HPC facility, powered entirely by renewable energy. The full two-stage fine-tuning procedure consumes approximately 5 hours and ~5 kWh (estimated using the ML CO₂ Impact Calculator). Implementation utilizes PyTorch; optimization employs Adam with minimal hyperparameter tuning, adhering mostly to LLaVA defaults. Released artifacts—model weights, code, and datasets—are licensed under MIT and made available at https://w3id.org/UniverseTBD/AstroLLaVA.
6. Prospective Directions and Extension Roadmap
The AstroLLaVA research outlines a strategic vision for the development of a domain-general, multi-modal astronomical foundation model. Proposed advances include integrating modality-specific encoders for time-domain data (e.g., TESS photometric light curves), spectra (as from SDSS), radio interferometric maps, and N-dimensional data cubes. These modalities would be projected into a unified latent representation via cross-attention interfaced with a central LLM "lingua franca."
Potential scientific and accessibility applications are highlighted, such as galaxy classification, anomaly detection, object detection on sky survey imagery, and multimodal sonification (inspired by projects like herakoi). Community participation is solicited via Discord (discord.gg/PUR2FbFRZ4) and ongoing contributions to the UniverseTBD repository, aiming to democratize and extend the utility of language-aligned astronomical data frameworks (Zaman et al., 11 Apr 2025).