
Golden-Retriever: High-Fidelity Agentic Retrieval Augmented Generation for Industrial Knowledge Base (2408.00798v1)

Published 20 Jul 2024 in cs.IR, cs.AI, cs.CL, and cs.DL

Abstract: This paper introduces Golden-Retriever, designed to efficiently navigate vast industrial knowledge bases, overcoming challenges in traditional LLM fine-tuning and RAG frameworks with domain-specific jargon and context interpretation. Golden-Retriever incorporates a reflection-based question augmentation step before document retrieval, which involves identifying jargon, clarifying its meaning based on context, and augmenting the question accordingly. Specifically, our method extracts and lists all jargon and abbreviations in the input question, determines the context against a pre-defined list, and queries a jargon dictionary for extended definitions and descriptions. This comprehensive augmentation ensures the RAG framework retrieves the most relevant documents by providing clear context and resolving ambiguities, significantly improving retrieval accuracy. Evaluations using three open-source LLMs on a domain-specific question-answer dataset demonstrate Golden-Retriever's superior performance, providing a robust solution for efficiently integrating and querying industrial knowledge bases.

Golden-Retriever: Enhancing Retrieval Augmented Generation for Industrial Knowledge Bases

The paper presents Golden-Retriever, a framework designed to optimize Retrieval Augmented Generation (RAG) workflows for navigating extensive industrial knowledge bases. It addresses the challenges of applying LLMs in domains laden with specialized jargon and context-specific interpretations. Golden-Retriever introduces a reflection-based question-augmentation method that mitigates these challenges more effectively than conventional LLM fine-tuning or standard RAG pipelines.

Methodology

Golden-Retriever incorporates an essential pre-retrieval step focused on reflecting upon and augmenting user inquiries to clarify domain-specific jargon and context. The framework operates through both offline and online processes:

  1. Offline LLM-Driven Document Augmentation: Initially, industrial documents, often in varied formats, are processed through Optical Character Recognition (OCR) to extract text. The text is then split into smaller, processable chunks fed into an LLM, which generates summaries enriched with domain-expert perspectives. This ensures the document database is well-structured and contextually relevant, increasing retrieval accuracy and efficiency.
  2. Real-time Jargon and Context Identification: When a user submits a query, the system identifies and lists the jargon it contains using an LLM-guided prompt template, capturing even imprecise or previously unseen terms. In parallel, the question's context is matched against predefined context descriptions, improving understanding and retrieval accuracy.
  3. Augmented Question Generation: The identified jargon and context are queried against a jargon dictionary to clarify terms further, resulting in an augmented user query. This refined query is then used within the RAG framework to retrieve more relevant documents efficiently.
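The online steps above can be sketched as a small pipeline. This is a minimal illustration, not the paper's implementation: the jargon extractor here is a simple uppercase-token heuristic standing in for the paper's LLM-guided prompt, and the dictionary entries are invented placeholders.

```python
# Hypothetical sketch of Golden-Retriever's online question augmentation:
# (1) extract jargon, (2) look up definitions, (3) augment the query.
# The jargon dictionary contents and the heuristic extractor are
# placeholders for the paper's LLM-driven components.

JARGON_DICT = {
    "OCR": "Optical Character Recognition: extracting text from scanned documents.",
    "RAG": "Retrieval Augmented Generation: grounding LLM answers in retrieved documents.",
}

def extract_jargon(question: str) -> list[str]:
    # Stand-in for the LLM-guided jargon-identification prompt:
    # treat all-uppercase tokens as candidate abbreviations.
    tokens = (tok.strip("?.,") for tok in question.split())
    return [t for t in tokens if t.isupper() and len(t) > 1]

def augment_question(question: str) -> str:
    # Append dictionary definitions for recognized terms so the
    # downstream retriever sees an unambiguous, context-rich query.
    terms = extract_jargon(question)
    definitions = [JARGON_DICT[t] for t in terms if t in JARGON_DICT]
    if not definitions:
        return question
    return question + "\n\nTerm definitions:\n" + "\n".join(definitions)

augmented = augment_question("How does RAG handle OCR output?")
```

The augmented query, rather than the raw one, is then passed to the RAG retriever, which is what lets retrieval resolve ambiguous abbreviations.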

Evaluation and Results

The framework's efficacy was validated using a bespoke dataset composed of domain-specific multiple-choice questions derived from training materials for engineers. Golden-Retriever demonstrated superiority over both standalone LLMs and traditional RAG frameworks across several contemporary models, including Meta-Llama-3, Mistral, and Shisa. Quantitatively, the method yielded a 57.3% improvement over vanilla LLM implementations and a 35.0% enhancement over standard RAG results, underscoring its capability to navigate ambiguities in jargon and provide contextually accurate answers reliably.

Additionally, the paper includes a focused experiment evaluating LLMs' capability to identify new or random abbreviations accurately. Meta-Llama-3 and Mistral displayed robust performance, further validating the methodology's effectiveness in the jargon identification phase.

Implications and Future Directions

Golden-Retriever optimizes industrial knowledge navigation by leveraging precise jargon identification and context-based query augmentation. The proposed system holds significant implications for enhancing the retrieval efficacy of industrial databases without the need for elaborate and costly LLM fine-tuning. This approach enables the seamless integration of dynamic and complex knowledge bases into LLM queries, maintaining high fidelity and relevance.

Future work should explore the scalability of Golden-Retriever across varied industrial applications and its adaptability to more comprehensive data management systems. Additionally, expanding the framework to incorporate real-time learning from user feedback could further refine the retrieval process, enhancing its applicability and accuracy in diverse, evolving industrial domains.

In summary, Golden-Retriever presents an innovative and practical approach to overcoming traditional RAG shortcomings in knowledge-dense industrial environments, paving the way for more efficient and accurate data retrieval systems in AI-driven applications.

Authors (6)
  1. Zhiyu An (6 papers)
  2. Xianzhong Ding (12 papers)
  3. Yen-Chun Fu (1 paper)
  4. Cheng-Chung Chu (1 paper)
  5. Yan Li (505 papers)
  6. Wan Du (21 papers)
Citations (1)