
AstroLLaVA: towards the unification of astronomical data and natural language (2504.08583v1)

Published 11 Apr 2025 in astro-ph.IM and cs.LG

Abstract: We present AstroLLaVA, a vision LLM for astronomy that enables interaction with astronomical imagery through natural dialogue. By fine-tuning the LLaVA model on a diverse dataset of $\sim$30k images with captions and question-answer pairs sourced from NASA's `Astronomy Picture of the Day', the European Southern Observatory, and the NASA/ESA Hubble Space Telescope, we create a model capable of answering open-ended questions about astronomical concepts depicted visually. Our two-stage fine-tuning process adapts the model to both image captioning and visual question answering in the astronomy domain. We demonstrate AstroLLaVA's performance on an astronomical visual question answering benchmark and release the model weights, code, and training set to encourage further open source work in this space. Finally, we suggest a roadmap towards general astronomical data alignment with pre-trained LLMs, and provide an open space for collaboration towards this end for interested researchers.

Summary

AstroLLaVA: Integrating Astronomy with Vision LLMs

AstroLLaVA is a focused effort to adapt vision LLMs (VLMs) to astronomy, enabling interaction with astronomical imagery through natural language dialogue. The project builds on the LLaVA model, which pairs a vision encoder with an LLM, and fine-tunes it to support dialogue about astronomical images.

Methodological Approach

AstroLLaVA interprets images of astronomical objects via a two-stage fine-tuning strategy within the LLaVA framework. In the first stage, the model is trained on image captioning using a dataset of approximately 30,000 images with captions sourced from NASA's Astronomy Picture of the Day (APOD), the European Southern Observatory (ESO), and the Hubble Space Telescope archives; this diverse dataset grounds the model's visual comprehension. In the second stage, the model undergoes instruction tuning on synthetic question-answer pairs generated by a text-only GPT-4 from those captions, which improves AstroLLaVA's ability to handle open-ended questions about astronomy.
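The following is a minimal sketch of how training examples for the two stages might be laid out, assuming LLaVA's conversation-style JSON format; the field names and prompt wording are illustrative assumptions, not the authors' exact data pipeline.

```python
import json

def stage1_caption_example(record_id, image_path, caption):
    """Stage 1: image captioning -- one human turn asking for a description,
    one assistant turn containing the archival caption."""
    return {
        "id": record_id,
        "image": image_path,
        "conversations": [
            {"from": "human", "value": "<image>\nDescribe this astronomical image."},
            {"from": "gpt", "value": caption},
        ],
    }

def stage2_vqa_example(record_id, image_path, qa_pairs):
    """Stage 2: visual question answering -- a multi-turn dialogue built from
    synthetic question-answer pairs derived from the caption."""
    conversations = []
    for i, (question, answer) in enumerate(qa_pairs):
        prefix = "<image>\n" if i == 0 else ""  # image token only on the first turn
        conversations.append({"from": "human", "value": prefix + question})
        conversations.append({"from": "gpt", "value": answer})
    return {"id": record_id, "image": image_path, "conversations": conversations}

if __name__ == "__main__":
    ex1 = stage1_caption_example("apod_0001", "apod/0001.jpg",
                                 "A spiral galaxy seen face-on ...")
    ex2 = stage2_vqa_example("apod_0001", "apod/0001.jpg",
                             [("What type of galaxy is shown?", "A face-on spiral galaxy.")])
    print(json.dumps([ex1, ex2], indent=2))
```

Keeping the two stages in the same conversation format means the same fine-tuning code can serve both, with only the data swapped.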

Dataset Composition

The datasets leveraged to fine-tune AstroLLaVA predominantly consist of:

  • APOD Archive: NASA's collection of annotated images, with incompatible file formats filtered out, yielding 9,962 image-caption pairs (a minimal filtering sketch follows this list).
  • ESO and HST Archives: web-scraped collections from these repositories, yielding 14,617 and 5,204 image-caption pairs, respectively.
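
The file-format filtering step can be as simple as keeping only records whose media URL points to a still image. The sketch below illustrates the idea on hypothetical (url, caption) records; it is not the authors' scraping code.

```python
from urllib.parse import urlparse

# Hypothetical (url, caption) records standing in for scraped archive entries;
# the real pipeline scrapes APOD, ESO, and HST pages.
records = [
    ("https://apod.nasa.gov/apod/image/2404/ngc1300.jpg", "Barred spiral galaxy NGC 1300 ..."),
    ("https://apod.nasa.gov/apod/ap240411.html", "An explanatory page, not an image."),
    ("https://www.eso.org/public/images/eso1907a.mp4", "A video, filtered out."),
]

IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}  # formats the captioning stage can consume

def is_compatible(url: str) -> bool:
    """Keep only plain still-image formats; drop videos, HTML pages, etc."""
    path = urlparse(url).path.lower()
    return any(path.endswith(ext) for ext in IMAGE_EXTENSIONS)

image_caption_pairs = [(url, cap) for url, cap in records if is_compatible(url)]
print(f"kept {len(image_caption_pairs)} of {len(records)} records")
```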

Synthetic conversation generation does not use the images themselves: the question-answer pairs are produced from the text captions alone, and these conversations then serve as instruction-tuning data.
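A minimal sketch of caption-conditioned question-answer generation with a text-only model is shown below, assuming the OpenAI chat API; the prompt wording and model name are illustrative assumptions, not the authors' exact prompts.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_qa_pairs(caption: str, n_pairs: int = 3) -> str:
    """Ask a text-only model to invent question-answer pairs grounded in the
    caption, as if a user were asking about the (unseen) image."""
    prompt = (
        f"Here is the caption of an astronomical image:\n\n{caption}\n\n"
        f"Write {n_pairs} question-and-answer pairs that a curious user might ask "
        "about the image, answerable from the caption alone. "
        "Format each pair as 'Q: ...' and 'A: ...'."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
    )
    return response.choices[0].message.content

caption = "The Horsehead Nebula, a dark nebula in Orion silhouetted against glowing hydrogen gas."
print(generate_qa_pairs(caption))
```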

Performance Evaluation

AstroLLaVA's effectiveness was assessed on the Galaxy10 DECaLS dataset, a galaxy morphology classification task. Each model generated a description of a galaxy image, and the cosine similarity between an embedding of that description and an embedding of the true class label served as the metric. AstroLLaVA scored marginally higher (0.597) than the baseline LLaVA models, hinting at a benefit from its astronomy-specific fine-tuning. All models perform similarly because they share the same vision encoder architecture, but the result suggests AstroLLaVA could gain more in nuanced domain applications with further architectural changes targeted at scientific-grade astronomical imagery.
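A minimal sketch of this kind of embedding-based scoring is given below, assuming the sentence-transformers library and an off-the-shelf embedding model; the paper's actual embedding model and label phrasing may differ.

```python
from sentence_transformers import SentenceTransformer, util

# Off-the-shelf embedding model; an assumption, not necessarily the one used in the paper.
embedder = SentenceTransformer("all-MiniLM-L6-v2")

def label_similarity(generated_description: str, true_label: str) -> float:
    """Cosine similarity between a model's description of a galaxy image
    and the text of its true class label."""
    emb = embedder.encode([generated_description, true_label], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item()

description = "A smooth, round elliptical galaxy with no visible spiral arms."
label = "Round Smooth Galaxy"
print(f"similarity: {label_similarity(description, label):.3f}")
```

Averaging this score over a test set gives a single number per model, which is how the 0.597 figure for AstroLLaVA can be compared against the LLaVA baselines.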

Future Directions

AstroLLaVA is a first step toward more comprehensive multi-modal astronomical foundation models. Extending the model beyond imagery to other astronomical data modalities, such as spectra, time series, and radio interferometry, would let it address the cross-modal analytical tasks common in astronomy. This ambition depends on community-driven collaboration, which the authors support by releasing the model weights, code, and training set for open-source development.

The authors also suggest that multi-sensory interfaces, such as data sonification and text-to-speech synthesis, could improve accessibility, enriching engagement and educational outreach for astronomy enthusiasts and professionals alike.

In summary, AstroLLaVA lays the groundwork for a versatile astronomical data analysis tool, encouraging interdisciplinary collaboration while aiming to align natural language processing capabilities with the demands of astronomical research.
