- The paper presents a domain-specific natural language interface using a fine-tuned Microsoft Phi-2 model to convert plain language queries into SQL for SDSS.
- It employs LoRA adapter-based fine-tuning on approximately 2,500 NL-SQL pairs, achieving roughly 94% syntactic accuracy in generating executable SQL queries.
- The study demonstrates that intuitive NL interfaces can lower technical barriers for astronomers, although complex queries still pose semantic challenges.
A Natural Language Interface for Efficient Data Retrieval in SDSS
This paper presents a domain-specific Natural Language Interface for querying the Sloan Digital Sky Survey (SDSS) database. The work uses a fine-tuned version of the Microsoft Phi-2 transformer model, trained on a dataset of natural language (NL) and SQL pairs tailored to SDSS. To ease data access for astronomers unfamiliar with SQL, the study investigates how effective fine-tuning a small-scale LLM can be.
Methodology
Dataset Construction
The authors constructed an NL-SQL dataset derived from SDSS SkyServer tutorials. They expanded it through paraphrase generation with larger LLMs to increase linguistic diversity. Additional synthetic examples were generated by a script that randomly sampled SDSS parameters and constructed SQL statements paired with matching natural language descriptions.
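The synthetic generation step can be illustrated with a minimal sketch. The paper does not publish its script, so everything here is an assumption made for illustration: the parameter pools, the question template, the function names `make_pair` and `make_dataset`, and the choice of the `Galaxy` and `Star` views with magnitude columns.

```python
import random

# Hypothetical parameter pools; the paper does not list the fields it sampled.
OBJECT_TYPES = {"galaxies": "Galaxy", "stars": "Star"}
BANDS = ["u", "g", "r", "i", "z"]


def make_pair(rng):
    """Build one (natural language, SQL) pair from randomly sampled parameters."""
    label, view = rng.choice(sorted(OBJECT_TYPES.items()))
    band = rng.choice(BANDS)
    limit = rng.choice([10, 50, 100])
    mag = round(rng.uniform(15.0, 22.0), 1)
    # A smaller magnitude means a brighter object, hence the "<" comparison.
    question = f"Find {limit} {label} brighter than magnitude {mag} in the {band} band."
    sql = f"SELECT TOP {limit} objID, ra, dec, {band} FROM {view} WHERE {band} < {mag}"
    return question, sql


def make_dataset(n, seed=0):
    """Generate n pairs reproducibly from a fixed seed."""
    rng = random.Random(seed)
    return [make_pair(rng) for _ in range(n)]


if __name__ == "__main__":
    for question, sql in make_dataset(3):
        print(question)
        print("  ", sql)
```

Because the question and the SQL are built from the same sampled values, each pair is consistent by construction. A template like this yields little linguistic variety on its own, which is presumably why the authors also relied on paraphrasing.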
Model Fine-tuning
The model was fine-tuned using the HuggingFace transformers library with LoRA adapters, which enable parameter-efficient updates by keeping the pre-trained weights frozen and training only small low-rank matrices that are added to them. Training used approximately 2,500 NL-SQL pairs with a 90-10 train-validation split. The primary hardware was an NVIDIA T4 GPU, and training was designed to be computationally efficient.
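To give a sense of why LoRA is cheap enough for a single T4, the sketch below counts trainable parameters for one weight matrix. In LoRA, a frozen weight matrix W of shape d_out x d_in receives an update B·A, where B is d_out x r and A is r x d_in, so only r·(d_out + d_in) values are trained. The matrix size and rank used here are illustrative assumptions, not settings reported in the paper. The split helper simply applies the stated 90-10 ratio to the approximate dataset size.

```python
def full_params(d_out, d_in):
    """Number of entries in a dense weight matrix of shape d_out x d_in."""
    return d_out * d_in


def lora_trainable_params(d_out, d_in, rank):
    """Entries in the low-rank factors B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_out + d_in)


def split_counts(n, train_fraction=0.9):
    """Return (train, validation) sizes for n examples."""
    n_train = int(round(n * train_fraction))
    return n_train, n - n_train


if __name__ == "__main__":
    d = 2560  # assumed projection size, for illustration only
    r = 8     # assumed LoRA rank, not taken from the paper
    dense = full_params(d, d)
    lora = lora_trainable_params(d, d, r)
    print(f"dense: {dense}, LoRA: {lora}, ratio: {lora / dense:.4%}")
    print("train/validation:", split_counts(2500))
```

Under these assumed values, the adapter for one matrix holds well under 1% of the parameters of the matrix it adapts, which is the source of the memory savings.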
Evaluation Metrics
The evaluation focused on syntactic and semantic accuracy. Syntactic accuracy was determined by executing the SQL queries on SDSS SkyServer, while semantic accuracy was manually assessed to verify the correct translation of NL instructions into SQL queries.
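The execution-based check for syntactic accuracy can be sketched as follows. The authors ran queries against SDSS SkyServer. This sketch substitutes an in-memory SQLite database with a toy `Galaxy` table as a stand-in, so the table schema, the sample queries, and the helper names `executes` and `syntactic_accuracy` are all assumptions. SQLite's dialect also differs from SkyServer's, so this only illustrates the counting logic.

```python
import sqlite3


def executes(conn, sql):
    """Return True if the query runs without raising a database error."""
    try:
        conn.execute(sql).fetchall()
        return True
    except sqlite3.Error:
        return False


def syntactic_accuracy(conn, queries):
    """Fraction of queries that execute successfully."""
    if not queries:
        return 0.0
    return sum(executes(conn, q) for q in queries) / len(queries)


# Toy stand-in table; the real evaluation ran against SDSS SkyServer.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Galaxy (objID INTEGER, ra REAL, r REAL)")

# Hypothetical model outputs: two valid queries and two malformed ones.
generated = [
    "SELECT objID, ra FROM Galaxy WHERE r < 17.5",
    "SELECT objID FROM Galaxy WHERE r BETWEEN 15 AND 16",
    "SELEC objID FROM Galaxy",         # misspelled keyword
    "SELECT objID FROM Galaxy WHERE",  # incomplete condition
]
accuracy = syntactic_accuracy(conn, generated)

if __name__ == "__main__":
    print(f"syntactic accuracy: {accuracy:.0%}")
```

A query that executes can still return the wrong rows, which is why semantic accuracy needed a separate manual review.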
Results and Observations
The fine-tuned Phi-2 model achieved a syntactic accuracy of approximately 94% and a semantic accuracy in the range of 60-70%. The high syntactic accuracy indicates that most outputs are valid, executable SQL, while the lower semantic accuracy shows that a substantial share of those queries do not fully capture the user's intent. Errors typically occurred in complex queries with multiple conditions, where the model misrepresented parameter bounds or intent.
In one example, a natural language request for galaxies selected by redshift and photometric conditions was translated into a semantically accurate SQL query. The model nonetheless struggled at times with more complex joins or parameter settings, highlighting the need for careful dataset curation and refinement.
Discussion
The work underscores the practical utility of lightweight, domain-specific natural language interfaces to databases (NLIDBs) and highlights Microsoft Phi-2's adaptability in generating SQL queries for SDSS. By accepting plain language instructions, the interface lowers the technical barrier to retrieving data, supporting both academic research and education.
The approach may also scale to larger databases such as those from future surveys (e.g., LSST, DESI), positioning NLIDBs as vital tools in astronomical data management. Directions for future work include improving dataset quality, enhancing schema-awareness, and exploring larger model variants to address the semantic limitations.
Conclusion
The study demonstrates the feasibility of adapting small-scale pre-trained LLMs for domain-specific tasks in astronomy. The resulting interface simplifies interactions with complex databases such as the SDSS, providing an accessible tool for researchers and educators. Future efforts could improve model performance through better training data, schema-informed decoding strategies, and additional support from the LLM community.