Fine-tuning OpenLLaMA for Enhanced SPARQL Generation in Life Sciences
Introduction
In the rapidly evolving field of Question Answering Systems (QAS) over Knowledge Graphs (KGs), large language models (LLMs) offer a promising avenue for direct natural language interaction with data. This paper evaluates several fine-tuning strategies for OpenLLaMA, an open-source LLM, aimed at translating natural language questions into SPARQL queries for the life sciences domain. Through an end-to-end data augmentation approach, this research extends the utility of SPARQL for querying the Bgee gene expression knowledge graph, a pivotal resource in the life sciences.
Background and Related Works
The intersection of LLMs and KGs has seen notable advances, yet SPARQL query generation remains challenging because of the specificity and complexity involved. Previous work has identified limitations in LLMs' ability to generate semantically correct SPARQL queries. Models such as ChatGPT have demonstrated potential, but the intricate domain-specific requirements of scientific knowledge bases such as Bgee demand highly accurate query translations, underscoring the importance of fine-tuning strategies and dataset augmentation in overcoming these challenges.
Methodology
The core methodology centers on two primary objectives: augmenting the existing set of question-to-SPARQL query pairs and fine-tuning OpenLLaMA to improve SPARQL query generation accuracy. Through a systematic approach, the paper extends the range and semantic richness of the training data, incorporating variable names that carry semantic "clues" as well as inline comments, to explore their impact on model performance. The fine-tuning leverages QLoRA (Quantized Low-Rank Adaptation) and PEFT (Parameter-Efficient Fine-Tuning) techniques, without extensive hyperparameter optimization, showcasing the practical feasibility of the approach.
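The augmentation idea can be illustrated with a minimal sketch: generic SPARQL variable names are replaced with semantically meaningful ones, alongside inline comments that describe the query's intent. The example query, the name mapping, and the helper function below are hypothetical illustrations, not taken from the paper's actual pipeline.

```python
import re

def add_semantic_clues(query, renames):
    """Replace generic SPARQL variable names (e.g. ?s) with
    semantically meaningful ones (e.g. ?gene).

    Hypothetical helper illustrating the augmentation idea; the
    paper's pipeline may implement this differently."""
    for old, new in renames.items():
        # re.escape handles the leading '?'; \b keeps ?s from
        # also matching a longer variable such as ?species.
        query = re.sub(re.escape(old) + r"\b", new, query)
    return query

# A toy query with an inline comment adding further context.
query = (
    "SELECT ?s ?o WHERE {\n"
    "  # anatomical entity in which the gene is expressed\n"
    "  ?s orth:organ ?o .\n"
    "}"
)
augmented = add_semantic_clues(query, {"?s": "?gene", "?o": "?anatomicalEntity"})
print(augmented)
```

The same pair of questions and queries can thus appear in the training set in both a "bare" and a "semantically enriched" form, which is what allows the paper to measure the contribution of each kind of added context.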
Evaluation and Discussion
The paper's evaluation, grounded in the life sciences domain with the Bgee gene expression knowledge graph, employs robust metrics including BLEU, SP-BLEU, METEOR, and ROUGE-L. The findings reveal that incorporating context through meaningful variable names and inline comments significantly enhances model performance across all metrics. Interestingly, pre-fine-tuning the model on a diverse dataset such as KQA Pro does not conclusively improve performance, and in some instances may even degrade it, when the model is subsequently fine-tuned on a domain-specific dataset such as Bgee. This highlights the nuanced interplay between domain-general and domain-specific knowledge in LLM fine-tuning.
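To make one of these metrics concrete, ROUGE-L scores a generated query against a reference by the length of their longest common subsequence (LCS) of tokens. The following is a minimal token-level sketch, not the paper's evaluation harness, and the example queries are hypothetical:

```python
def lcs_length(a, b):
    """Longest common subsequence length via dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l(candidate, reference):
    """Token-level ROUGE-L F-measure (whitespace tokenization)."""
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

# Hypothetical reference and generated queries differing in one token.
ref = "SELECT ?gene WHERE { ?gene orth:organism ?species }"
cand = "SELECT ?gene WHERE { ?gene orth:organism ?taxon }"
print(round(rouge_l(cand, ref), 3))  # → 0.875
```

Because all four metrics reward surface overlap rather than semantic equivalence, two SPARQL queries that return identical results can still score below 1.0, which is why the paper reports several complementary metrics.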
Conclusion
This paper presents a comprehensive approach to fine-tuning LLMs for SPARQL query generation within the life sciences domain, particularly over the Bgee knowledge graph. The results underscore the value of dataset augmentation and strategic fine-tuning in achieving notable improvements in query generation accuracy. The nuanced findings on pre-fine-tuning with general datasets versus direct domain-specific fine-tuning offer valuable guidance for future research. Moving forward, expanding the dataset augmentation techniques and exploring their applicability across a broader range of scientific knowledge bases remains a promising direction, with the potential to significantly advance the capabilities of QAS over KGs in the life sciences and beyond.
Acknowledging the complexity of, and the critical need for, accurate data querying in life sciences research, this paper takes a significant step toward harnessing the full potential of LLMs in bridging natural language queries and SPARQL, thus enhancing the accessibility and utility of invaluable data resources in the domain.