
Fine-tuning Large Enterprise Language Models via Ontological Reasoning (2306.10723v2)

Published 19 Jun 2023 in cs.CL, cs.DB, and cs.LO

Abstract: LLMs exploit fine-tuning as a technique to adapt to diverse goals, thanks to task-specific training data. Task specificity should go hand in hand with domain orientation, that is, the specialization of an LLM to accurately address the tasks of a given realm of interest. However, models are usually fine-tuned over publicly available data or, at most, over ground data from databases, ignoring business-level definitions and domain experience. On the other hand, Enterprise Knowledge Graphs (EKGs) are able to capture and augment such domain knowledge via ontological reasoning. With the goal of combining LLM flexibility with the domain orientation of EKGs, we propose a novel neurosymbolic architecture that leverages the power of ontological reasoning to build task- and domain-specific corpora for LLM fine-tuning.


Summary

  • The paper demonstrates a neurosymbolic pipeline that augments enterprise LLMs by embedding domain-specific knowledge from EKGs.
  • It introduces a reasoning verbalization technique that transforms logical rules into natural language for secure and efficient fine-tuning.
  • Proof-of-concept tests on a T5-large model show significant improvements in handling complex domain queries, validating the approach.

Fine-tuning Large Enterprise LLMs via Ontological Reasoning

Introduction to Ontological Reasoning in LLMs

The paper "Fine-tuning Large Enterprise LLMs via Ontological Reasoning" addresses the significant limitations of current LLMs in specialized enterprise domains. Despite the adaptability afforded by task-specific fine-tuning, models often fall short in integrating detailed domain-specific knowledge, which is crucial for enterprise applications ranging from finance to genomics. Enterprise Knowledge Graphs (EKGs) offer a way to remedy this by embedding sophisticated domain knowledge within a structured ontology, which this study leverages through a novel neurosymbolic framework to enhance LLM performance.

Motivation and Context

The proliferation of generative AI tools such as ChatGPT has spotlighted the need for refined NLP strategies that are not only task-specific but also domain-specific. While general-purpose LLMs like T5 and GPT deliver impressive generalization and human-like interaction, they struggle in areas requiring intricate domain expertise. The paper argues that traditional fine-tuning methods, which predominantly utilize publicly available or factual data from databases, fail to capture business-level definitions and domain-specific reasoning that EKGs can provide.

A crucial example highlighted is BloombergGPT, which combines internal data with public datasets but is constrained to questions answerable by existing dataset facts. This scenario illustrates the demand for LLMs capable of augmented reasoning to address questions and tasks involving complex logical deductions beyond mere data retrieval.

The Proposed Neurosymbolic Pipeline

The core contribution of the paper is the introduction of a neurosymbolic pipeline that integrates ontological reasoning within the fine-tuning process for LLMs. The architecture capitalizes on EKGs constructed using the Vadalog system, a Datalog-based reasoning engine. This pipeline operates by synthesizing a fine-tuning corpus derived from ontological reasoning, thus providing a structured and domain-specific augmentation to the base LLM training data.

Figure 1: Neurosymbolic pipeline for reasoning-based LLM fine-tuning.

Pipeline Details

The pipeline begins with the generation of a "chase," expanding the base database facts with all derivations entailed by the domain rules specified in Vadalog. This results in a comprehensive set integrating both extensional and intensional knowledge, which guides the verbalization of domain-specific data. The process employs a novel reasoning verbalization technique that transforms logical rules and their derivations into natural-language text, ready for LLM fine-tuning.
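The chase and verbalization steps described above can be sketched as a simple forward-chaining fixpoint computation. This is a minimal illustrative sketch, not the Vadalog implementation: the fact encoding, the `transitive_control` rule, and the verbalization template are hypothetical stand-ins for the paper's company-control running example.

```python
# Minimal sketch of "chase" expansion: forward-chain a rule set over base
# facts until no new facts are derived, then verbalize the result.
def chase(facts, rules):
    """Expand base facts with all derivations entailed by the rules."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for rule in rules:
            for new_fact in rule(derived):
                if new_fact not in derived:
                    derived.add(new_fact)
                    changed = True
    return derived

# Illustrative rule: controls(x, y), controls(y, z) -> controls(x, z)
def transitive_control(facts):
    out = set()
    for (p1, a, b) in facts:
        for (p2, c, d) in facts:
            if p1 == p2 == "controls" and b == c:
                out.add(("controls", a, d))
    return out

def verbalize(fact):
    """Turn a derived fact into a natural-language training sentence."""
    pred, x, y = fact
    return f"{x} {pred} {y}."

# Extensional (base) knowledge; the chase adds the intensional part.
base = {("controls", "HoldCo", "MidCo"), ("controls", "MidCo", "OpCo")}
expanded = chase(base, [transitive_control])
corpus = sorted(verbalize(f) for f in expanded)
```

Here the derived fact `HoldCo controls OpCo` exists only after the chase, which is exactly the intensional knowledge a model fine-tuned on raw ground facts would miss.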

Figure 2: From plans to fine-tuning corpus, in our running example.

A second critical component is the corpus generation phase, which leverages a tokenized representation of logical plans to minimize direct interactions with pre-trained LLMs, thus ensuring data privacy and cost-efficiency. The paper outlines an optimization strategy to enhance the corpus through NLP-driven paraphrasing and quality checks, securing high fidelity and relevance of the data for fine-tuning.
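A lightweight version of the corpus-generation quality checks might look like the following. The normalization, length cap, and prompt template are illustrative assumptions, not the paper's actual pipeline.

```python
# Sketch of corpus post-processing: normalize whitespace, drop duplicates
# and degenerate sentences, and wrap each surviving sentence in a simple
# (prompt, answer) template for fine-tuning.
def build_corpus(sentences, max_len=512):
    seen, examples = set(), []
    for s in sentences:
        s = " ".join(s.split())          # normalize internal whitespace
        if not s or s in seen or len(s) > max_len:
            continue                     # drop empties, duplicates, overlong
        seen.add(s)
        examples.append({"prompt": f"Is the following true? {s}",
                         "answer": "yes"})
    return examples
```

In the paper's pipeline this stage would additionally run NLP-driven paraphrasing to diversify surface forms; the deduplication shown here is only the simplest of the quality checks.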

Proof-of-Concept and Validation

A proof-of-concept implemented within the Vadalog system demonstrates the pipeline's efficacy in producing LLMs with improved performance for domain-specific tasks. The study employs a T5-large model fine-tuned using both standard ground facts and the expanded chase methodology, observing improved capabilities in answering domain-specific queries.
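Before fine-tuning, the verbalized chase output must be shaped into the input/target pairs a T5-style seq2seq model consumes. The task prefix and field names below are illustrative assumptions; the paper does not specify its exact serialization format.

```python
# Sketch: turn chase-derived facts into seq2seq (input, target) pairs of
# the kind a T5-style encoder-decoder model is fine-tuned on.
def to_seq2seq_pairs(facts):
    pairs = []
    for pred, x, y in sorted(facts):
        # Hypothetical task prefix, in the spirit of T5's text-to-text format.
        src = f"truefalse: {x} {pred} {y}"
        pairs.append({"input_text": src, "target_text": "true"})
    return pairs
```

Each pair would then be tokenized and fed to the model as encoder input and decoder label, e.g. via a standard seq2seq training loop.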

Figure 3: Proof-of-concept for our fine-tuning pipeline.

The ablation study reveals significant enhancements in model responses for complex queries involving trader activities—a testament to the advantage offered by chase-derived fine-tuning. This substantiates the claim that incorporating logical reasoning mechanisms can augment the predictive and reasoning capacities of LLMs beyond conventional methodologies.

Implications and Future Directions

The paper delineates significant implications for enterprise AI deployment, suggesting that integrating ontological reasoning into LLM fine-tuning can yield substantial gains in comprehension and performance on domain-specific tasks. This neurosymbolic approach not only bolsters LLM functionality but also fosters a pathway towards more holistic AI systems capable of versatile reasoning across diverse applications.

The findings encourage further exploration into hybrid neurosymbolic models, potentially merging explicit logic reasoning with deep learning to achieve enhanced cognitive tasks. Future research could explore broadening the scope of ontological datasets and refining verbalization techniques to cover more complex business scenarios.

Conclusion

This work offers critical insights into improving LLM capabilities through domain-specific ontological reasoning, utilizing EKGs. The neurosymbolic pipeline presents a compelling alternative to traditional fine-tuning, addressing the need for higher accuracy and richer domain integration. The implications for AI in enterprise applications are vast, highlighting the necessity for continued development of neurosymbolic models in bridging the divide between raw data processing and deep semantic understanding.
