- The paper introduces a unified one-pass approach that simultaneously handles generation and retrieval in large language models.
- OneGen employs autoregressive retrieval tokens and contrastive learning, achieving notable gains in both QA and entity linking tasks.
- The framework reduces computational costs and simplifies deployment, making it well-suited to real-time and low-resource applications.
OneGen: Efficient One-Pass Unified Generation and Retrieval for LLMs
The paper "OneGen: Efficient One-Pass Unified Generation and Retrieval for LLMs" authored by researchers from Zhejiang University and Ant Group, presents an innovative framework—OneGen—for enhancing LLMs with the capability to seamlessly integrate generation and retrieval tasks within a unified forward pass. The paper addresses existing limitations in LLMs' abilities to handle retrieval tasks efficiently when generating contextually accurate and factual responses.
Overview
In contemporary NLP, LLMs are predominantly used for generation tasks across a wide range of applications thanks to their exceptional language generation capabilities. However, such models are prone to hallucinations and factual inaccuracies because they rely on implicit, parametric knowledge. Retrieval-Augmented Generation (RAG) is a prominent remedy, in which retrieval enriches the input with relevant passages. Prior work, however, typically employs a separate retriever, which requires additional forward passes and increases computational overhead.
The OneGen framework proposed in this paper bridges the traditionally separate training paradigms for generation and retrieval by incorporating autoregressively generated retrieval tokens within a single LLM context. This approach enables a single LLM to manage both tasks simultaneously in a unified forward pass, optimizing both training and inference processes.
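To make the idea concrete, here is a minimal PyTorch/Transformers sketch of a retrieval token whose hidden state doubles as a dense query embedding while the very same forward pass produces the generation logits. The token name `[RQ]` and the `gpt2` checkpoint are illustrative assumptions for the sketch, not the paper's exact identifiers.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative setup: add a special retrieval token (the name "[RQ]" and
# the gpt2 checkpoint are assumptions, not the paper's exact choices).
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer.add_special_tokens({"additional_special_tokens": ["[RQ]"]})
model.resize_token_embeddings(len(tokenizer))

prompt = "Who founded the company that makes the iPhone? [RQ]"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model(**inputs, output_hidden_states=True)

# The last-layer hidden state at the [RQ] position serves as the dense
# retrieval query, while generation logits come from the same forward pass.
rq_id = tokenizer.convert_tokens_to_ids("[RQ]")
rq_pos = (inputs.input_ids == rq_id).nonzero()[-1, 1]
query_embedding = outputs.hidden_states[-1][0, rq_pos]  # (hidden_dim,)
next_token_logits = outputs.logits[0, -1]               # same pass
```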
Methodology
OneGen introduces retrieval tokens that are generated autoregressively and represent retrieval queries within the same context used for generation. During training, these tokens are optimized with a contrastive objective for the retrieval task alongside the standard language-modeling objective for the generative task. At inference, when a retrieval token is generated, its hidden state acts as the query embedding: cosine similarity against the document index selects relevant passages on demand, and the retrieved content is integrated seamlessly into the generative context.
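A hedged sketch of what the joint objective might look like in PyTorch: standard cross-entropy for the generation side plus an InfoNCE-style contrastive loss over the retrieval token's hidden state. The loss weighting `ret_weight`, the temperature, and the positive/negative passage setup are common contrastive-learning conventions assumed here, not necessarily the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def onegen_style_loss(lm_logits, labels, query_emb, pos_doc_emb, neg_doc_embs,
                      temperature=0.05, ret_weight=1.0):
    """Sketch of a joint generation + retrieval objective (assumed form).

    lm_logits:    (seq_len, vocab)   next-token logits from the LLM
    labels:       (seq_len,)         shifted target token ids
    query_emb:    (dim,)             hidden state of the retrieval token
    pos_doc_emb:  (dim,)             embedding of the gold passage
    neg_doc_embs: (num_neg, dim)     embeddings of negative passages
    """
    # Standard causal-LM cross-entropy for the generation side.
    gen_loss = F.cross_entropy(lm_logits, labels, ignore_index=-100)

    # InfoNCE-style contrastive loss for the retrieval side: the positive
    # passage should score highest under cosine similarity.
    q = F.normalize(query_emb, dim=-1)
    docs = F.normalize(torch.cat([pos_doc_emb.unsqueeze(0), neg_doc_embs]), dim=-1)
    sims = docs @ q / temperature                       # (1 + num_neg,)
    ret_loss = F.cross_entropy(sims.unsqueeze(0),
                               torch.zeros(1, dtype=torch.long))

    return gen_loss + ret_weight * ret_loss
```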
Experimental Validation
The proposed framework is validated across two primary types of composite tasks: RAG (including both single-hop and multi-hop QA) and Entity Linking (EL). Experiments illustrate that OneGen outperforms existing state-of-the-art methods in terms of both effectiveness and efficiency. Notably, OneGen achieves:
- A 1.5-point average improvement across four single-hop QA datasets.
- A 3.3-point F1 improvement across two multi-hop QA datasets.
- A 3.2-point average accuracy improvement across six out-of-domain EL datasets.
These results underscore OneGen’s capability to enhance retrieval within generation tasks without compromising the generative performance of LLMs.
Implications and Future Work
Practical Implications: The ability to unify generation and retrieval in a single model simplifies deployment and reduces computational costs by eliminating the need for separate retrieval models and query rewriting. This enables more efficient inference, particularly beneficial in low-resource environments or applications requiring real-time processing.
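As a rough illustration of this single-model deployment story, on-demand retrieval can be folded into an ordinary greedy decoding loop. The function below is a sketch under assumed interfaces: `retrieval_token_id`, the pre-encoded `doc_embs` index, and the passage-splicing behavior are hypothetical simplifications.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def generate_with_retrieval(model, tokenizer, input_ids, doc_embs, docs,
                            retrieval_token_id, max_new_tokens=128):
    """Single-model decoding loop (sketch): when the retrieval token is
    emitted, its hidden state is matched against a pre-encoded document
    index by cosine similarity and the top passage is spliced back in.
    """
    for _ in range(max_new_tokens):
        out = model(input_ids)
        next_id = out.logits[0, -1].argmax()
        input_ids = torch.cat([input_ids, next_id.view(1, 1)], dim=1)
        if next_id.item() == retrieval_token_id:
            # Re-encode so the retrieval token's own hidden state is
            # available; it serves as the dense query vector.
            hs = model(input_ids, output_hidden_states=True).hidden_states[-1]
            q = F.normalize(hs[0, -1], dim=-1)
            best = (F.normalize(doc_embs, dim=-1) @ q).argmax().item()
            passage = tokenizer(docs[best], return_tensors="pt").input_ids
            input_ids = torch.cat([input_ids, passage], dim=1)
        if next_id.item() == tokenizer.eos_token_id:
            break
    return input_ids
```

In a real deployment one would reuse cached key-value states instead of re-encoding the full context at each retrieval point, which is part of how the single-pass design saves computation relative to a pipelined retriever.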
Theoretical Implications: From a theoretical perspective, OneGen opens the door to more sophisticated task integrations within LLM architectures. It substantiates that retrieval capabilities can be inherently embedded within LLMs, allowing for more natural and coherent generation.
Conclusions
The OneGen framework is a compelling enhancement to LLMs, efficiently integrating retrieval and generation within a single unified forward pass. It improves both computational efficiency and performance across a range of NLP tasks, underscoring the benefits of joint training and motivating further research into seamless task integration within LLM contexts.
Future Directions: The paper suggests several avenues for future research:
- Extending OneGen to multimodal domains to handle tasks such as multimodal RAG and multimodal EL.
- Enhancing the framework with additional datasets to improve the LLMs’ abilities to manage complex retrieval and generation tasks.
- Investigating parameter-efficient fine-tuning methods like LoRA and QLoRA and their benefits for OneGen.
- Exploring the application of OneGen within Mixture of Experts (MoE) models to leverage dynamic routing for further efficiency gains.
In sum, this paper marks a significant stride in the efficient integration of retrieval within generative models, highlighting transformative implications for LLM deployment and performance in diverse NLP applications.