OneGen: Efficient One-Pass Unified Generation and Retrieval for LLMs (2409.05152v2)

Published 8 Sep 2024 in cs.CL, cs.AI, cs.DB, cs.IR, and cs.LG

Abstract: Despite the recent advancements in LLMs, which have significantly enhanced the generative capabilities for various NLP tasks, LLMs still face limitations in directly handling retrieval tasks. However, many practical applications demand the seamless integration of both retrieval and generation. This paper introduces a novel and efficient One-pass Generation and retrieval framework (OneGen), designed to improve LLMs' performance on tasks that require both generation and retrieval. The proposed framework bridges the traditionally separate training approaches for generation and retrieval by incorporating retrieval tokens generated autoregressively. This enables a single LLM to handle both tasks simultaneously in a unified forward pass. We conduct experiments on two distinct types of composite tasks, RAG and Entity Linking, to validate the pluggability, effectiveness, and efficiency of OneGen in training and inference. Furthermore, our results show that integrating generation and retrieval within the same context preserves the generative capabilities of LLMs while improving retrieval performance. To the best of our knowledge, OneGen is the first to enable LLMs to conduct vector retrieval during generation.

Summary

  • The paper introduces a unified one-pass approach that simultaneously handles generation and retrieval in large language models.
  • OneGen employs autoregressive retrieval tokens and contrastive learning, achieving notable gains in both QA and entity linking tasks.
  • The framework reduces computational costs and simplifies deployment, making it ideal for real-time and low-resource applications.

OneGen: Efficient One-Pass Unified Generation and Retrieval for LLMs

The paper "OneGen: Efficient One-Pass Unified Generation and Retrieval for LLMs" authored by researchers from Zhejiang University and Ant Group, presents an innovative framework—OneGen—for enhancing LLMs with the capability to seamlessly integrate generation and retrieval tasks within a unified forward pass. The paper addresses existing limitations in LLMs' abilities to handle retrieval tasks efficiently when generating contextually accurate and factual responses.

Overview

In contemporary NLP, LLMs are predominantly leveraged for generation tasks across various applications due to their exceptional language generation capabilities. However, such models often stumble with hallucinations and factual inaccuracies stemming from their reliance on implicit, parametric knowledge. Retrieval-Augmented Generation (RAG) stands out as a prominent solution, wherein retrieval enriches the input with relevant passages. Prior works, however, often employ a separate retriever, necessitating multiple forward passes and resulting in increased computational overhead.

The OneGen framework proposed in this paper bridges the traditionally separate training paradigms for generation and retrieval by incorporating autoregressively generated retrieval tokens within a single LLM context. This approach enables a single LLM to manage both tasks simultaneously in a unified forward pass, optimizing both training and inference processes.
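
To make the mechanism concrete, the following is a minimal sketch of the one-pass idea, assuming a Hugging Face-style causal LM: a single forward pass yields both next-token logits for generation and a dense query embedding read from the hidden state of a special retrieval token. The token name [RQ], the GPT-2 placeholder backbone, and the pooling and normalization choices are illustrative assumptions rather than the paper's exact implementation.

```python
# Minimal sketch of the one-pass idea: one causal-LM forward pass returns both
# next-token logits (generation) and a dense query embedding taken from the
# hidden state of a special retrieval token (retrieval). "[RQ]", the GPT-2
# backbone, and L2 normalization are illustrative assumptions.
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # placeholder backbone
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Register a special retrieval token; its hidden state will act as the query vector.
tokenizer.add_special_tokens({"additional_special_tokens": ["[RQ]"]})
model.resize_token_embeddings(len(tokenizer))
rq_id = tokenizer.convert_tokens_to_ids("[RQ]")

def one_pass(text):
    """Single forward pass producing generation logits and retrieval embeddings."""
    inputs = tokenizer(text, return_tensors="pt")
    out = model(**inputs, output_hidden_states=True)
    hidden = out.hidden_states[-1][0]                       # (seq_len, hidden_dim)
    rq_pos = (inputs["input_ids"][0] == rq_id).nonzero(as_tuple=True)[0]
    query_embeds = F.normalize(hidden[rq_pos], dim=-1)      # retrieval side
    return out.logits, query_embeds                         # generation side

logits, query_embeds = one_pass("Which author is being referred to? [RQ]")
```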

Methodology

OneGen introduces retrieval tokens that are generated autoregressively and represent retrieval queries within the same context used for generation. During training, these tokens are optimized with a contrastive objective for the retrieval task alongside the standard language-modeling objective for generation. At inference, whenever a retrieval token is generated, its hidden representation is compared against pre-computed document embeddings via cosine similarity, enabling efficient on-demand retrieval whose results are integrated seamlessly into the generative context.
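
A hedged sketch of this joint objective follows: a standard language-modeling loss on the generated tokens combined with an InfoNCE-style contrastive loss over the retrieval-token embeddings, scored by cosine similarity. The temperature, in-batch negatives, and equal weighting of the two terms are assumptions for illustration, not values reported in the paper.

```python
# Hedged sketch of the joint training objective: the usual language-modeling loss
# plus an InfoNCE-style contrastive loss over [RQ] embeddings and positive passage
# embeddings, scored by cosine similarity with in-batch negatives. Temperature and
# the 1:1 weighting are assumptions, not values reported in the paper.
import torch
import torch.nn.functional as F

def contrastive_loss(query_embeds, doc_embeds, temperature=0.05):
    """InfoNCE with in-batch negatives: query i should match passage i."""
    q = F.normalize(query_embeds, dim=-1)          # (B, d) from [RQ] hidden states
    d = F.normalize(doc_embeds, dim=-1)            # (B, d) positive passages
    sims = (q @ d.t()) / temperature               # pairwise cosine similarities
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(sims, labels)

def joint_loss(lm_loss, query_embeds, doc_embeds, retrieval_weight=1.0):
    """Generation and retrieval optimized together from the same forward pass."""
    return lm_loss + retrieval_weight * contrastive_loss(query_embeds, doc_embeds)
```

Because both losses are computed from the same forward pass, no separate retriever needs to be trained or queried.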

Experimental Validation

The proposed framework is validated across two primary types of composite tasks: RAG (including both single-hop and multi-hop QA) and Entity Linking (EL). Experiments illustrate that OneGen outperforms existing state-of-the-art methods in terms of both effectiveness and efficiency. Notably, OneGen achieves:

  • A 1.5-point average improvement across four Single-hop QA datasets.
  • A 3.3-point F1 improvement across two Multi-hop QA datasets.
  • A 3.2-point average accuracy improvement across six out-of-domain EL datasets.

These results underscore OneGen’s capability to enhance retrieval within generation tasks without compromising the generative performance of LLMs.

Implications and Future Work

Practical Implications: The ability to unify generation and retrieval in a single model simplifies deployment and reduces computational costs by eliminating the need for separate retrieval models and query rewriting. This enables more efficient inference, particularly beneficial in low-resource environments or applications requiring real-time processing.
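
The deployment simplification can be illustrated with a sketch of an in-context retrieval loop, reusing the model, tokenizer, and [RQ] token from the earlier sketch together with a pre-computed passage index: decoding proceeds until the retrieval token is emitted, the token's hidden state scores the passage embeddings by cosine similarity, and the top passage is spliced into the context before decoding resumes. Greedy decoding and the helper names `passages` and `passage_embeds` are illustrative assumptions.

```python
# Hedged sketch of an inference loop consistent with the description above: decode
# until the retrieval token appears, use its hidden state as the query against a
# pre-computed passage index (cosine similarity), splice in the best passage, and
# keep decoding. Greedy decoding and the names `passages` / `passage_embeds` are
# illustrative assumptions; no separate retriever or query rewriting is involved.
import torch
import torch.nn.functional as F

@torch.no_grad()
def generate_with_retrieval(model, tokenizer, rq_id, prompt,
                            passages, passage_embeds, max_new_tokens=128):
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    for _ in range(max_new_tokens):
        out = model(input_ids)
        next_id = out.logits[0, -1].argmax()                  # greedy decoding
        input_ids = torch.cat([input_ids, next_id.view(1, 1)], dim=-1)
        if next_id.item() == rq_id:
            # The just-emitted [RQ] token's hidden state is the retrieval query.
            rq_out = model(input_ids, output_hidden_states=True)
            query = F.normalize(rq_out.hidden_states[-1][0, -1], dim=-1)
            scores = F.normalize(passage_embeds, dim=-1) @ query
            best_passage = passages[scores.argmax().item()]
            doc_ids = tokenizer(best_passage, return_tensors="pt").input_ids
            input_ids = torch.cat([input_ids, doc_ids], dim=-1)
        elif next_id.item() == tokenizer.eos_token_id:
            break
    return tokenizer.decode(input_ids[0])
```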

Theoretical Implications: From a theoretical perspective, OneGen opens the door to more sophisticated task integrations within LLM architectures. It substantiates that retrieval capabilities can be inherently embedded within LLMs, allowing for more natural and coherent generation.

Conclusions

The OneGen framework presents a compelling enhancement to the capabilities of LLMs, efficiently integrating retrieval and generation tasks within a single unified forward pass. This innovation offers significant advancements in both computational efficiency and performance across various NLP tasks, emphasizing the benefits of joint training paradigms and the potential for further research in seamless task integration within LLM contexts.

Future Directions: The paper suggests several avenues for future research:

  1. Extending OneGen to multimodal domains to handle tasks such as multimodal RAG and multimodal EL.
  2. Enhancing the framework with additional datasets to improve the LLMs’ abilities to manage complex retrieval and generation tasks.
  3. Investigating parameter-efficient fine-tuning methods like LoRA and QLoRA and their benefits for OneGen (see the sketch after this list).
  4. Exploring the application of OneGen within Mixture of Experts (MoE) models to leverage dynamic routing for further efficiency gains.
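
As a concrete illustration of direction 3, the following is a minimal sketch of attaching LoRA adapters to a unified model with the peft library; the rank, alpha, and target modules are illustrative assumptions, since the paper does not report a specific parameter-efficient configuration.

```python
# Hedged sketch of future direction 3: wrapping the unified model with LoRA
# adapters via the peft library so that only low-rank matrices are trained for
# the joint generation+retrieval objective. The rank, alpha, and target modules
# are illustrative assumptions; the paper reports no specific configuration.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")  # placeholder base model
lora_config = LoraConfig(
    r=16,                       # low-rank dimension (assumption)
    lora_alpha=32,
    target_modules=["c_attn"],  # attention projections in GPT-2-style models
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only adapter weights require gradients
```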

In sum, this paper marks a significant stride in the efficient integration of retrieval within generative models, highlighting transformative implications for LLM deployment and performance in diverse NLP applications.