A Natural Language Interface for Efficient Data Retrieval in SDSS

Published 29 Oct 2025 in astro-ph.IM | (2510.25953v1)

Abstract: Modern astronomical surveys such as the Sloan Digital Sky Survey (SDSS) provide extensive astronomical databases enabling researchers to access vast amounts of diverse data. However, retrieving data from archives requires knowledge of query languages and familiarity with their schema, which presents a barrier for non-experts. This work investigates the use of Microsoft Phi-2, a compact yet powerful transformer-based LLM, fine-tuned on natural language-SQL pairs constructed from SDSS query examples. We develop an interface that translates user queries in natural language into SQL commands compatible with SDSS SkyServer. Preliminary evaluation shows that the fine-tuned model produces syntactically valid and largely semantically correct queries across a variety of astronomy-related requests. Our results show that even small-scale models, when carefully fine-tuned, can provide effective domain-specific natural language interfaces for large scientific databases.

Authors (1)

Prathamesh Tamhane

Summary

The paper presents a domain-specific natural language interface using a fine-tuned Microsoft Phi-2 model to convert plain language queries into SQL for SDSS.
It employs LoRA adapter-based fine-tuning on approximately 2,500 NL-SQL pairs, achieving roughly 94% syntactic accuracy (the share of generated queries that execute on SkyServer).
The study demonstrates that intuitive NL interfaces can lower technical barriers for astronomers, although complex queries still pose semantic challenges.

This paper presents a domain-specific implementation of a Natural Language Interface for querying the Sloan Digital Sky Survey (SDSS) database. The work utilizes a fine-tuned version of the Microsoft Phi-2 transformer model trained on a dataset of natural language (NL) and SQL pairs tailored to SDSS. Aiming to facilitate data access for astronomers unfamiliar with SQL, the study investigates the effectiveness of small-scale LLM fine-tuning.

Methodology

Dataset Construction

The authors constructed an NL-SQL dataset derived from SDSS SkyServer tutorials. They expanded the dataset by generating paraphrases with larger LLMs to increase linguistic diversity. Synthetic query examples were also generated by a script that randomly sampled SDSS parameters and constructed matching SQL statements with corresponding natural language descriptions.
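The synthetic-generation step can be pictured with a minimal sketch like the one below. The paper's actual script is not shown, so the function names (`make_pair`, `make_dataset`), the sampled parameter ranges, and the question template are all assumptions made for illustration. The query targets the `SpecObj` view and uses `TOP` because SkyServer runs on SQL Server.

```python
import random

# Hypothetical sampling choices; the ranges used in the paper are not given.
CLASSES = ["GALAXY", "QSO"]
ROW_LIMITS = [10, 50, 100]


def make_pair(rng):
    """Return one (question, sql) pair built from the same sampled parameters."""
    obj_class = rng.choice(CLASSES)
    z_min = round(rng.uniform(0.0, 0.5), 2)
    z_max = round(z_min + rng.uniform(0.05, 0.3), 2)
    limit = rng.choice(ROW_LIMITS)

    question = (
        f"Show {limit} objects classified as {obj_class} "
        f"with redshift between {z_min} and {z_max}."
    )
    # The SQL is assembled from the same values, so the pair is consistent by construction.
    sql = (
        f"SELECT TOP {limit} specObjID, ra, dec, z "
        f"FROM SpecObj "
        f"WHERE class = '{obj_class}' AND z BETWEEN {z_min} AND {z_max}"
    )
    return question, sql


def make_dataset(n, seed=0):
    """Generate n pairs reproducibly from a seeded random generator."""
    rng = random.Random(seed)
    return [make_pair(rng) for _ in range(n)]


if __name__ == "__main__":
    for question, sql in make_dataset(3):
        print(question)
        print(sql)
        print()
```

Because both strings come from one set of sampled values, every generated pair is correct by construction; the paraphrasing step described above would then vary only the wording of the question.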

Model Fine-tuning

The model was fine-tuned using the HuggingFace transformers library with LoRA adapters, which enable parameter-efficient updates by training small low-rank matrices that are added to the frozen model weights. Training was performed on approximately 2,500 NL-SQL pairs with a 90-10 train-validation split. Training ran primarily on an NVIDIA T4 GPU and was designed to be computationally efficient.
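To make the parameter-efficiency argument concrete, the sketch below counts the trainable parameters a LoRA adapter adds to a single weight matrix and reproduces the 90-10 split. The LoRA rank and the matrix size are not stated in this summary, so the values `r = 8` and `d = 2560` are assumptions, as are the helper names.

```python
import random


def lora_trainable_params(d_out, d_in, rank):
    """Parameters in the two low-rank factors: B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_out + d_in)


def full_params(d_out, d_in):
    """Parameters in the dense weight matrix W (d_out x d_in)."""
    return d_out * d_in


def train_val_split(pairs, val_fraction=0.1, seed=0):
    """Shuffle a copy of the pairs and hold out val_fraction of them for validation."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    n_val = round(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]


if __name__ == "__main__":
    d = 2560  # assumed width of one square projection matrix
    r = 8     # assumed LoRA rank; not taken from the paper
    lora = lora_trainable_params(d, d, r)
    dense = full_params(d, d)
    print(f"LoRA: {lora} trainable vs {dense} dense ({100 * lora / dense:.3f}%)")

    train, val = train_val_split(range(2500))
    print(len(train), len(val))
```

Under these assumed sizes the adapter trains 40,960 parameters for a matrix that holds 6,553,600, which is why a single T4 GPU can be sufficient. With 2,500 pairs, the split yields 2,250 training and 250 validation examples.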

Evaluation Metrics

The evaluation focused on syntactic and semantic accuracy. Syntactic accuracy was determined by executing the SQL queries on SDSS SkyServer, while semantic accuracy was manually assessed to verify the correct translation of NL instructions into SQL queries.
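The syntactic metric amounts to "the fraction of generated queries that execute". The sketch below illustrates that idea with an in-memory SQLite database as a local stand-in. This is an assumption for the sake of a runnable example: the paper executed queries on SDSS SkyServer, which uses SQL Server syntax (for example `TOP`), so SQLite would reject some queries that SkyServer accepts. The table is a hypothetical subset of `SpecObj` columns.

```python
import sqlite3


def make_stand_in_db():
    """Create an in-memory table mimicking a few SpecObj columns (illustrative only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE SpecObj "
        "(specObjID INTEGER, ra REAL, dec REAL, z REAL, class TEXT)"
    )
    conn.execute("INSERT INTO SpecObj VALUES (1, 180.0, 0.5, 0.12, 'GALAXY')")
    return conn


def is_executable(conn, sql):
    """Return True if the database parses and runs the statement without error."""
    try:
        conn.execute(sql).fetchall()
        return True
    except sqlite3.Error:
        return False


def syntactic_accuracy(conn, queries):
    """Fraction of queries that execute; 0.0 for an empty list."""
    if not queries:
        return 0.0
    return sum(is_executable(conn, q) for q in queries) / len(queries)


if __name__ == "__main__":
    generated = [
        "SELECT specObjID, z FROM SpecObj WHERE class = 'GALAXY' AND z < 0.2",
        "SELECT COUNT(*) FROM SpecObj",
        "SELECT specObjID FROM SpecObj WHERE z BETWEEN 0.1 AND",  # truncated
        "SELECT redshift FROM SpecObj",  # column does not exist
    ]
    print(syntactic_accuracy(make_stand_in_db(), generated))
```

A query that executes can still answer the wrong question, which is why semantic accuracy was assessed separately and by hand.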

Results and Observations

The fine-tuned Phi-2 model achieved a syntactic accuracy of approximately 94% and a semantic accuracy in the range of 60-70%. The high syntactic accuracy indicates that most outputs are valid, executable SQL, while the lower semantic accuracy shows that a substantial share of those queries do not fully match the request. Errors typically occurred in complex queries with multiple conditions, where the model misrepresented parameter bounds or the intent of the request.
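Errors in parameter bounds are one kind of semantic mistake that can be partly screened automatically. The sketch below is not the paper's procedure (semantic accuracy was assessed manually); it is a hypothetical helper that extracts numeric `BETWEEN` ranges from a generated query and compares them with the ranges the request intended. It handles only simple `column BETWEEN a AND b` clauses with literal numbers.

```python
import re

# Matches e.g. "z BETWEEN 0.1 AND 0.2" or "s.z BETWEEN 0.1 AND 0.2".
BETWEEN_RE = re.compile(
    r"(\w+(?:\.\w+)?)\s+BETWEEN\s+(-?\d+(?:\.\d+)?)\s+AND\s+(-?\d+(?:\.\d+)?)",
    re.IGNORECASE,
)


def extract_ranges(sql):
    """Map each column used in a BETWEEN clause to its (low, high) bounds."""
    return {
        col.lower(): (float(low), float(high))
        for col, low, high in BETWEEN_RE.findall(sql)
    }


def bounds_match(sql, intended, tol=1e-9):
    """True if every intended range appears in the SQL with the same bounds.

    Keys of `intended` must be written as the column appears in the SQL,
    including any table alias (e.g. "s.z").
    """
    found = extract_ranges(sql)
    for col, (low, high) in intended.items():
        if col.lower() not in found:
            return False
        found_low, found_high = found[col.lower()]
        if abs(found_low - low) > tol or abs(found_high - high) > tol:
            return False
    return True


if __name__ == "__main__":
    sql = (
        "SELECT TOP 10 specObjID FROM SpecObj "
        "WHERE class = 'GALAXY' AND z BETWEEN 0.1 AND 0.2"
    )
    print(bounds_match(sql, {"z": (0.1, 0.2)}))
    print(bounds_match(sql, {"z": (0.1, 0.3)}))
```

A check like this could flag candidates for manual review, but it cannot judge intent, so it complements rather than replaces human assessment.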

In one example, a natural language request for galaxies selected by redshift and photometric conditions was translated into a semantically accurate SQL query. Nonetheless, the model occasionally struggled with more complex joins or parameter settings, highlighting the need for careful dataset curation and refinement.

Discussion

The work underscores the practical utility of lightweight, domain-specific natural language interfaces to databases (NLIDBs) and shows that Microsoft Phi-2 can be adapted to generate SQL queries for SDSS. By accepting plain language instructions, the interface lowers the technical barrier to retrieving data, supporting both academic research and education.

Additionally, this approach suggests scalability to larger databases like those from future surveys (e.g., LSST, DESI), positioning NLIDBs as vital tools in the field of astronomical data management. Considerations for future work include refining dataset quality, enhancing schema-awareness, and exploring larger model variants to address semantic limitations.
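One simple reading of "schema-awareness" is a post-generation check that a query only references tables that exist. The sketch below is an illustration of that idea, not something described in the paper; `KNOWN_TABLES` is a small hypothetical subset of SDSS table and view names, and the regular expression covers only names that directly follow `FROM` or `JOIN`.

```python
import re

# Illustrative subset of SDSS table/view names; a real check would load the full schema.
KNOWN_TABLES = {"photoobj", "specobj", "galaxy", "star"}

TABLE_RE = re.compile(r"\b(?:FROM|JOIN)\s+([A-Za-z_]\w*)", re.IGNORECASE)


def referenced_tables(sql):
    """Lower-cased names that directly follow FROM or JOIN in the query."""
    return {name.lower() for name in TABLE_RE.findall(sql)}


def unknown_tables(sql, known=KNOWN_TABLES):
    """Referenced tables that are absent from the known schema."""
    return referenced_tables(sql) - set(known)


if __name__ == "__main__":
    good = (
        "SELECT p.objID, s.z FROM PhotoObj AS p "
        "JOIN SpecObj AS s ON s.bestObjID = p.objID"
    )
    bad = "SELECT objID FROM Galaxies WHERE r < 17"
    print(unknown_tables(good))
    print(unknown_tables(bad))
```

Queries that reference unknown tables could be rejected or regenerated before they reach the database. Extending the same idea to column names would require parsing the query properly rather than using a regular expression.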

Conclusion

The study demonstrates the feasibility of adapting small-scale pre-trained LLMs for domain-specific tasks in astronomy. It effectively simplifies interactions with complex databases such as the SDSS, providing an accessible tool for researchers and educators. Future efforts could enhance model performance through improved training data, schema-informed decoding strategies, and leveraging additional support from the LLM community.
