Benchmark-Topping Performance in Astronomy Q&A with AstroSage-Llama-3.1-70B
The paper "AstroMLab 4: Benchmark-Topping Performance in Astronomy Q&A with a 70B-Parameter Domain-Specialized Reasoning Model" presents the development and capabilities of AstroSage-Llama-3.1-70B, an LLM specialized in astronomy and related fields. The model builds on the earlier AstroSage-Llama-3.1-8B and scales the approach to 70 billion parameters, combining the larger capacity with refined training methodology to achieve benchmark-topping performance on astronomy-specific tasks.
Model Development and Training Methodologies
AstroSage-Llama-3.1-70B was derived from the Meta-Llama-3.1-70B foundation model, inheriting the base model's scale and architecture. The model then went through two training phases, continued pre-training (CPT) and supervised fine-tuning (SFT), which were pivotal in building its domain-specific proficiency. Both phases involved careful dataset curation and hyperparameter tuning:
- Continued Pre-training (CPT): This phase leveraged a corpus of astronomical literature, such as arXiv preprints and encyclopedic articles, mixed with general web text to balance specialization against general language ability. Training ran for 2.5 epochs on large-scale high-performance computing infrastructure, a significant resource investment.
- Supervised Fine-Tuning (SFT): After CPT, the model was fine-tuned to follow instructions and produce explicit reasoning. The SFT mixture combined domain-specific Q&A with human-preference alignment data, both crucial for conversational and educational applications in astronomy. A minimal sketch of this two-stage data preparation appears after this list.
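To make the two-stage recipe concrete, here is a minimal sketch of how a CPT corpus and SFT chat examples might be prepared. The document pools, mixture fraction, and helper names are hypothetical illustrations, not the paper's actual data or ratios.

```python
import random

# Hypothetical document pools; in practice these would be tokenized
# arXiv preprints, encyclopedia articles, and general web text.
astro_docs = ["arXiv preprint on stellar winds ...", "Encyclopedia entry on AGN ..."]
general_docs = ["General web text sample ..."]

def build_cpt_mixture(domain_docs, general_docs, domain_fraction=0.8, seed=0):
    """Sample a CPT stream that is mostly domain text plus some general text.

    domain_fraction is an illustrative value, not the paper's actual ratio.
    """
    rng = random.Random(seed)
    n = len(domain_docs) + len(general_docs)
    return [
        rng.choice(domain_docs if rng.random() < domain_fraction else general_docs)
        for _ in range(n)
    ]

def to_sft_example(question, answer, reasoning=None):
    """Wrap a Q&A pair in chat format; explicit reasoning, when present,
    precedes the final answer to encourage reasoning traces."""
    content = f"{reasoning}\n\nAnswer: {answer}" if reasoning else answer
    return [
        {"role": "user", "content": question},
        {"role": "assistant", "content": content},
    ]

cpt_stream = build_cpt_mixture(astro_docs, general_docs)
sft_example = to_sft_example(
    "What powers a Type Ia supernova?",
    "Thermonuclear runaway in a carbon-oxygen white dwarf.",
    reasoning="A white dwarf gaining mass approaches the Chandrasekhar limit, "
              "igniting runaway carbon fusion.",
)
```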
Training was implemented with the GPT-NeoX framework, which shards the model and data across many GPUs to maximize throughput and hardware efficiency; an illustrative parallelism configuration is sketched below. The scale of this setup underscores the computational demands of training a model of this magnitude.
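The dict below mimics the pipeline- and tensor-parallel knobs exposed by GPT-NeoX-style YAML configs. Key names mirror GPT-NeoX conventions, but the values and GPU counts are assumptions, not the paper's configuration.

```python
# Illustrative GPT-NeoX-style parallelism settings, written as a Python dict
# for readability. Values and GPU counts are hypothetical.
parallel_config = {
    "pipe-parallel-size": 8,   # split the layer stack into 8 pipeline stages
    "model-parallel-size": 4,  # shard each layer's tensors across 4 GPUs
}

def data_parallel_degree(total_gpus, cfg):
    """Data-parallel replicas left over after model sharding."""
    model_gpus = cfg["pipe-parallel-size"] * cfg["model-parallel-size"]
    assert total_gpus % model_gpus == 0, "GPU count must divide evenly"
    return total_gpus // model_gpus

# With 256 GPUs: 256 / (8 * 4) = 8 full model replicas training in parallel.
print(data_parallel_degree(256, parallel_config))  # -> 8
```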
Benchmark Evaluation and Performance
AstroSage-Llama-3.1-70B was evaluated on the AstroMLab-1 benchmark, comprising 4,425 high-quality questions spanning the subfields of astrophysics. The model scored 86.2% accuracy, outperforming other leading proprietary and open-weight LLMs, including models with far higher operational costs; a minimal scoring sketch follows. This demonstrates that domain specialization can deliver higher performance without a proportionate increase in inference cost.
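For intuition about how such a benchmark is scored, here is a minimal accuracy calculation for multiple-choice responses. The answer-extraction regex is a simplistic assumption; only the benchmark size (4,425 questions) and the reported 86.2% come from the paper.

```python
import re

def extract_choice(model_output):
    """Pull the first standalone choice letter (A-D) from a response.

    Real harnesses use more careful extraction; this regex is a toy rule.
    """
    match = re.search(r"\b([ABCD])\b", model_output)
    return match.group(1) if match else None

def accuracy(responses, answer_key):
    correct = sum(extract_choice(r) == a for r, a in zip(responses, answer_key))
    return correct / len(answer_key)

# At the benchmark's scale, 86.2% of 4,425 questions corresponds to roughly
# 0.862 * 4425 ≈ 3,814 correct answers.
print(accuracy(["The answer is B.", "C"], ["B", "D"]))  # -> 0.5
```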
The evaluation highlighted AstroSage-Llama-3.1-70B's ability to provide nuanced understanding and accurate responses to complex questions, validating its utility as a domain-specific AI assistant in scholarly and practical contexts. The findings indicate that targeted training on large-scale models enhances domain-specific capabilities, positioning AstroSage-Llama-3.1-70B as a leading tool in astronomy research.
Implications and Future Directions
AstroSage-Llama-3.1-70B's release as an open-weight model supports broad access and application, facilitating adoption across educational and research institutions; a hypothetical usage sketch follows. Its superior performance on astronomy-specific benchmarks underscores the potential of domain-specialized LLMs to complement traditional research methodologies, potentially accelerating the pace of discovery.
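As an example of what open-weight access enables, below is a minimal inference sketch using Hugging Face transformers. The repository id and prompt are assumptions; check the official release for the actual repo name and recommended chat template.

```python
# Hypothetical usage sketch; the repo id below is an assumption and should
# be checked against the official release.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "AstroMLab/AstroSage-70B"  # assumed repository name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",   # a 70B model needs several GPUs or quantization
    torch_dtype="auto",
)

messages = [{"role": "user", "content": "Why are O-type stars short-lived?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```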
Future work may involve expanding its integration with scientific workflows, enhancing the model with real-world tools and datasets, and developing benchmarks to further test its reasoning abilities in dynamic scenarios. Additionally, extending the methodology to other scientific domains could unlock similar specialized capabilities.
AstroSage-Llama-3.1-70B exemplifies the impact of domain-specialized AI in scientific research, suggesting a promising avenue for improving end-to-end knowledge work in astronomy and beyond. The model's open release sets the stage for collaborative enhancement and adaptation, providing a foundation for future advancements in AI-driven scientific inquiry.