One Token Can Help! Learning Scalable and Pluggable Virtual Tokens for Retrieval-Augmented Large Language Models (2405.19670v4)

Published 30 May 2024 in cs.CL

Abstract: Retrieval-augmented generation (RAG) is a promising way to improve LLMs for generating more factual, accurate, and up-to-date content. Existing methods either optimize prompts to guide LLMs in leveraging retrieved information or directly fine-tune LLMs to adapt to RAG scenarios. Although fine-tuning can yield better performance, it often compromises the LLMs' general generation capabilities by modifying their parameters. This limitation poses challenges in practical applications, especially when LLMs are already deployed, as parameter adjustments may affect their original functionality. To address this, we propose a novel method that involves learning scalable and pluggable virtual tokens for RAG. By maintaining the LLMs' original parameters and fine-tuning only the embeddings of these pluggable tokens, our approach not only enhances LLMs' performance but also preserves their general generation capabilities. Furthermore, we design several training strategies to improve the scalability, flexibility, and generalizability of our method. Comprehensive experiments across 12 question-answering tasks demonstrate the superiority of our approach.

This paper introduces SPRING (Scalable and Pluggable virtual Tokens for Retrieval-augmented Generation), a novel method to enhance the performance of LLMs in Retrieval-Augmented Generation (RAG) scenarios without compromising their general capabilities (Zhu et al., 30 May 2024). Current RAG approaches either use prompt engineering, which can be suboptimal, or fine-tune the LLM (e.g., using LoRA), which improves RAG performance but often degrades performance on non-RAG tasks by altering the model's parameters.

SPRING addresses this by introducing a small number of trainable "virtual tokens" into the input sequence. Specifically, these virtual tokens are inserted between the retrieved documents ($R$) and the user query ($Q$). During training, only the embeddings of these virtual tokens ($\delta$) are updated, while the parameters of the backbone LLM ($\theta$) remain frozen. This makes the method highly parameter-efficient (e.g., adding 50 tokens to Mistral-7b only adds 0.2M trainable parameters). The input format becomes $[R; t_1, t_2, \dots, t_n; Q]$, where the $t_i$ are the virtual tokens.
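
A minimal PyTorch sketch of this setup follows. The class and function names, the initialization scale, and the learning rate are illustrative assumptions, not taken from the paper's released code; it also assumes a Hugging Face-style model exposing `get_input_embeddings()`.

```python
import torch
import torch.nn as nn

class VirtualTokens(nn.Module):
    """Trainable embedding table for the n pluggable virtual tokens."""

    def __init__(self, n_tokens: int, d_model: int):
        super().__init__()
        # The only trainable parameters: n_tokens x d_model values
        # (roughly 0.2M for n=50 and d=4096, as with Mistral-7b).
        self.embeddings = nn.Parameter(torch.randn(n_tokens, d_model) * 0.02)

def build_rag_inputs(llm, doc_ids, query_ids, virtual, k):
    """Assemble the input [R; t_1, ..., t_k; Q] at the embedding level."""
    embed = llm.get_input_embeddings()         # frozen token-embedding lookup
    r = embed(doc_ids)                         # (1, |R|, d): retrieved documents
    q = embed(query_ids)                       # (1, |Q|, d): user query
    t = virtual.embeddings[:k].unsqueeze(0)    # (1, k, d): first k virtual tokens
    return torch.cat([r, t, q], dim=1)

def freeze_backbone(llm, virtual):
    """Freeze theta (the LLM) so gradients flow only into delta (the tokens)."""
    for p in llm.parameters():
        p.requires_grad_(False)
    return torch.optim.AdamW(virtual.parameters(), lr=1e-3)  # lr is illustrative
```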

Key features of SPRING include:

  1. Scalability: A unique training strategy is proposed where, for each training sample, a random number $k$ (less than or equal to the total number of virtual tokens $n$) is chosen, and only the first $k$ virtual tokens ($t_{1:k}$) are used (see the code sketch after this list, which also illustrates items 2 and 4). This allows the trained tokens to be effective even when only a subset is used during inference, enabling dynamic adjustment based on context length constraints or desired performance trade-offs. Experiments show that even a single virtual token ($k=1$) can significantly improve RAG performance.
  2. Pluggability: Since the base LLM's parameters are untouched, the learned virtual tokens act as a plug-and-play module. For RAG tasks, the tokens (represented as special tokens like [r1], [r2], etc., added to the vocabulary with the learned embeddings) are included in the input. For non-RAG tasks, they are simply omitted, preserving the LLM's original performance on general tasks.
  3. Effectiveness: Experiments conducted on nine Question Answering (QA) datasets (including TQA, NQ, HQA, SQuAD, PopQA) using various LLMs (Mistral-7b, LLaMA-2-7b, Phi-3-4b, Qwen-1.8b) demonstrate significant improvements in RAG performance (average +33% EM, +12% F1 over prompt-based methods for Mistral-7b). While LoRA achieves slightly higher RAG scores, it drastically degrades performance on non-RAG and general capability benchmarks (BoolQ, CommonsenseQA, GSM8K, MMLU), unlike SPRING which preserves the original LLM's performance.
  4. Generalizability: SPRING shows robustness across different retrievers (BM25, BGE-base, E5-base, E5-large) and varying numbers of retrieved passages. Training is performed by mixing data from multiple QA datasets and randomly varying the number of retrieved passages used ($m \in [1, 5]$), enhancing adaptability. The method also generalizes well to unseen datasets (PopQA).

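Below is a hedged sketch of three of these mechanics: the random-prefix training trick behind scalability (item 1), the special-token registration behind pluggability (item 2), and the passage-count sampling behind generalizability (item 4). The function names are illustrative assumptions; the tokenizer and model calls assume the Hugging Face `transformers` API, and this is not the authors' implementation.

```python
import random
import torch

def sample_virtual_prefix(virtual_embeddings: torch.Tensor) -> torch.Tensor:
    """Scalability (item 1): keep a random prefix t_1..t_k per training sample.

    Because every prefix length is seen during training, any k <= n tokens
    can be plugged in at inference, trading prompt length against accuracy.
    """
    k = random.randint(1, virtual_embeddings.size(0))
    return virtual_embeddings[:k]

def sample_passages(retrieved: list, m_max: int = 5) -> list:
    """Generalizability (item 4): vary the retrieved-passage count per sample."""
    m = random.randint(1, min(m_max, len(retrieved)))
    return retrieved[:m]

def plug_in_tokens(tokenizer, model, learned_embeddings: torch.Tensor):
    """Pluggability (item 2): register trained embeddings as tokens [r1]..[rn].

    RAG prompts place these tokens between the documents and the query;
    non-RAG prompts never emit them, leaving general behavior untouched.
    """
    names = [f"[r{i + 1}]" for i in range(learned_embeddings.size(0))]
    tokenizer.add_special_tokens({"additional_special_tokens": names})
    model.resize_token_embeddings(len(tokenizer))
    with torch.no_grad():
        # New vocabulary rows are appended at the end of the embedding matrix.
        model.get_input_embeddings().weight[-len(names):] = learned_embeddings
```
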
The authors position SPRING as a lightweight, efficient, and practical solution for enhancing deployed LLMs with RAG capabilities without disrupting their existing functionalities. The code and trained virtual tokens are made publicly available.

Authors (4)
  1. Yutao Zhu (63 papers)
  2. Zhaoheng Huang (3 papers)
  3. Zhicheng Dou (113 papers)
  4. Ji-Rong Wen (299 papers)