AstroLLaVA: Integrating Astronomy with Vision LLMs
AstroLLaVA is a focused effort to tailor vision LLMs (VLMs) to astronomy, enabling interaction with astronomical data through natural-language dialogue. The project adapts the LLaVA model, which pairs a vision encoder with a large language model, and fine-tunes it to support natural dialogue about astronomical imagery.
Methodological Approach
AstroLLaVA interprets images of astronomical objects via a two-stage fine-tuning of the LLaVA framework. In the first stage, the model is trained on image captioning using a dataset of roughly 30,000 images with captions drawn from NASA's Astronomy Picture of the Day (APOD), the European Southern Observatory (ESO), and the Hubble Space Telescope (HST) archives; this diverse dataset grounds the model's visual comprehension. In the second stage, the model is instruction-tuned on synthetic question-answer pairs generated by a text-only GPT-4 from those same captions, which strengthens AstroLLaVA's ability to handle open-ended questions about astronomy.
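The section does not tie this recipe to a particular training library; the sketch below illustrates the two-stage idea using the Hugging Face transformers LLaVA integration. The base checkpoint, the choice of which modules to unfreeze in each stage, the learning rates, and the `caption_loader`/`qa_loader` objects are all illustrative assumptions, not the authors' exact setup.

```python
import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration

# Assumed base checkpoint for illustration; the authors fine-tune within the LLaVA framework.
MODEL_ID = "llava-hf/llava-1.5-7b-hf"

# The processor builds pixel_values / input_ids / labels for each (image, text) example.
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = LlavaForConditionalGeneration.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)

def run_stage(model, dataloader, trainable_keys, lr, epochs):
    """Generic stage runner: train only parameters whose names contain one of trainable_keys."""
    for name, p in model.named_parameters():
        p.requires_grad = any(key in name for key in trainable_keys)
    optim = torch.optim.AdamW([p for p in model.parameters() if p.requires_grad], lr=lr)
    model.train()
    for _ in range(epochs):
        for batch in dataloader:
            # batch: dict with pixel_values, input_ids, attention_mask, labels
            loss = model(**batch).loss
            loss.backward()
            optim.step()
            optim.zero_grad()

# Stage 1: caption alignment on the ~30k APOD/ESO/HST image-caption pairs
# (training only the vision-language projector here is an assumption mirroring the LLaVA recipe):
# run_stage(model, caption_loader, trainable_keys=["multi_modal_projector"], lr=1e-3, epochs=1)

# Stage 2: instruction tuning on the synthetic astronomy conversations
# (the language-model weights are also unfrozen):
# run_stage(model, qa_loader, trainable_keys=["multi_modal_projector", "language_model"], lr=2e-5, epochs=1)
```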
Dataset Composition
The datasets leveraged to fine-tune AstroLLaVA predominantly consist of:
- APOD Archive: NASA's annotated image collection, with incompatible file formats filtered out, yielding 9,962 image-caption pairs (see the scraping sketch after this list).
- ESO and HST Archives: comprehensive collections compiled via web scraping, yielding 14,617 and 5,204 image-caption pairs, respectively.
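As a concrete, hypothetical illustration of how such image-caption pairs can be collected, the sketch below queries NASA's public APOD API and drops non-image media; the authors' actual scraping and filtering pipeline may differ.

```python
import requests

# Illustrative only: NASA's public APOD API (requires an api.nasa.gov key; DEMO_KEY is rate-limited).
API_URL = "https://api.nasa.gov/planetary/apod"
API_KEY = "DEMO_KEY"

def fetch_apod_pairs(n=50):
    """Fetch n random APOD entries and keep only still images paired with their captions."""
    resp = requests.get(API_URL, params={"api_key": API_KEY, "count": n}, timeout=30)
    resp.raise_for_status()
    pairs = []
    for entry in resp.json():
        # Skip non-image media (e.g. videos), analogous to filtering out incompatible file formats.
        if entry.get("media_type") != "image":
            continue
        pairs.append({"image_url": entry["url"], "caption": entry["explanation"]})
    return pairs

if __name__ == "__main__":
    for pair in fetch_apod_pairs(5):
        print(pair["image_url"], "--", pair["caption"][:80])
```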
Synthetic conversation generation does not use the images themselves; it approximates the visual content from the text captions alone, producing the conversations used for instruction tuning.
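A minimal sketch of this caption-only conversation synthesis is given below, using the OpenAI chat completions API with a text-only GPT-4 model; the prompt wording and the three-pair format are assumptions for illustration, not the authors' released prompt.

```python
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

# Hypothetical prompt; the authors' actual prompt template is not reproduced here.
SYSTEM_PROMPT = (
    "You are given the caption of an astronomical image, but not the image itself. "
    "Write three question-answer pairs that a curious user might ask about the image, "
    "answering only from information contained in the caption."
)

def synthesize_conversation(caption: str) -> str:
    """Turn one image caption into a short synthetic Q&A conversation."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": caption},
        ],
        temperature=0.7,
    )
    return response.choices[0].message.content

# Example usage with a shortened APOD-style caption:
# print(synthesize_conversation("A wide-field view of the Orion Nebula, a nearby stellar nursery..."))
```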
Evaluation
AstroLLaVA's effectiveness was assessed with a task-based evaluation on the Galaxy 10 DECaLS dataset, in which galaxy types are classified from model-generated descriptions. The cosine distance between embeddings of these descriptions and embeddings of the true class labels served as the performance metric. AstroLLaVA scored marginally higher than baseline LLaVA models, with a similarity score of 0.597, suggesting a modest benefit from its astronomy-specific fine-tuning. All models show similar accuracy, likely because they share the same vision encoder architecture, but AstroLLaVA holds promise for more nuanced domain applications if the architecture is further adapted to scientific-grade astronomical imagery.
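The metric is straightforward to reproduce in outline. The sketch below embeds a generated description and a few candidate Galaxy 10 DECaLS class names with the sentence-transformers library and picks the most similar label; the embedding model and the abbreviated label strings are assumptions, since the exact encoder behind the reported 0.597 score is not specified here.

```python
from sentence_transformers import SentenceTransformer, util

# Placeholder encoder; the exact embedding model used in the evaluation is an assumption here.
encoder = SentenceTransformer("all-MiniLM-L6-v2")

# A few (abbreviated) Galaxy 10 DECaLS class names and one model-generated description.
labels = ["disturbed galaxy", "merging galaxy", "round smooth galaxy", "barred spiral galaxy"]
description = "The image shows a barred spiral galaxy with two prominent arms and a bright central bar."

label_emb = encoder.encode(labels, convert_to_tensor=True)
desc_emb = encoder.encode(description, convert_to_tensor=True)

# Cosine similarity between the description and each candidate label;
# the predicted class is the label with the highest similarity (smallest cosine distance).
scores = util.cos_sim(desc_emb, label_emb)[0]
predicted = labels[int(scores.argmax())]
print(predicted, float(scores.max()))
```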
Future Directions
AstroLLaVA opens a path toward more comprehensive multi-modal astronomical foundation models. By extending the model beyond images to other astronomical data modalities, such as spectra, time series, and radio interferometry, AstroLLaVA can begin to address the complex, cross-modal analytical tasks common in astronomy. This ambition depends on community-driven collaboration, which the authors support by releasing model weights, code, and datasets for open-source development.
Moreover, integrating multi-sensory approaches such as data sonification and text-to-speech synthesis could improve accessibility, enriching user engagement and educational outreach for astronomy enthusiasts and professionals alike.
In summary, AstroLLaVA lays the groundwork for a versatile astronomical data analysis tool, encouraging interdisciplinary collaboration while aiming to align sophisticated natural language processing with the demands of astronomical research.