xRAG: Extreme Context Compression for Retrieval-augmented Generation with One Token (2405.13792v2)

Published 22 May 2024 in cs.CL, cs.AI, and cs.IR

Abstract: This paper introduces xRAG, an innovative context compression method tailored for retrieval-augmented generation. xRAG reinterprets document embeddings in dense retrieval, traditionally used solely for retrieval, as features from the retrieval modality. By employing a modality fusion methodology, xRAG seamlessly integrates these embeddings into the LLM representation space, effectively eliminating the need for their textual counterparts and achieving an extreme compression rate. In xRAG, the only trainable component is the modality bridge, while both the retriever and the LLM remain frozen. This design choice allows for the reuse of offline-constructed document embeddings and preserves the plug-and-play nature of retrieval augmentation. Experimental results demonstrate that xRAG achieves an average improvement of over 10% across six knowledge-intensive tasks, adaptable to various LLM backbones, ranging from a dense 7B model to an 8x7B Mixture of Experts configuration. xRAG not only significantly outperforms previous context compression methods but also matches the performance of uncompressed models on several datasets, while reducing overall FLOPs by a factor of 3.53. Our work pioneers new directions in retrieval-augmented generation from the perspective of multimodality fusion, and we hope it lays the foundation for future efficient and scalable retrieval-augmented systems.

Efficient Retrieval-Augmented Generation with xRAG

Introduction

Retrieval-Augmented LLMs (RALMs) have become quite effective at tackling knowledge-intensive tasks by pulling in relevant information that isn't explicitly stored within the model. But there's a catch—packing entire documents into the context for generation can be computationally intense and might even blow past LLMs' context limits. This is where the new approach, xRAG, steps in. The idea? Compress the context without losing the valuable information.

How It Works

Traditional vs. xRAG Approach

In typical retrieval-augmented generation, the model retrieves a document relevant to the query and concatenates this document with the query before feeding it into the LLM, creating a long input. This can be pretty expensive in terms of computation and memory. Enter xRAG, which tackles this problem from a fresh angle—modality fusion.

Instead of placing the entire document text alongside the query, xRAG reinterprets document embeddings (the high-dimensional vectors used in dense retrieval) as features directly usable by the LLM. Essentially, xRAG bypasses the need to include the actual text of the document in the context by leveraging these embeddings as a compact representation. It adds just one extra token to the input, compared to potentially hundreds with traditional methods.
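
To make the contrast concrete, here is an illustrative (not paper-exact) view of the two inputs. The placeholder name `<XRAG>` and the prompt template are assumptions; the key point is that the retrieved passage collapses into a single position whose embedding comes from the retriever rather than from text:

```python
# Illustrative comparison of how the two approaches build the LLM input.
# <XRAG> is a hypothetical placeholder token; in xRAG its embedding is supplied
# from the retriever's dense vector instead of being looked up from passage text.
document = "The Eiffel Tower, completed in 1889, was the tallest structure in the world ..."
question = "When was the Eiffel Tower completed?"

# Traditional RAG: the full passage text is prepended, costing hundreds of tokens.
rag_input = f"Background: {document}\nQuestion: {question}\nAnswer:"

# xRAG: the passage occupies a single placeholder position.
xrag_input = f"Background: <XRAG>\nQuestion: {question}\nAnswer:"
```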

The Architecture of xRAG

xRAG uses a modality bridge—a simple two-layer MLP (multi-layer perceptron)—which is the only trainable component, leaving both the retriever and the LLM frozen. This bridge maps the dense embeddings into the LLM's representation space. The design keeps things lightweight and avoids full-parameter tuning that might disrupt the LLM's other abilities.
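
A minimal PyTorch sketch of such a bridge is shown below; the dimensions, activation function, and layer names are assumptions for illustration, since the summary only specifies a two-layer MLP:

```python
import torch
import torch.nn as nn

class ModalityBridge(nn.Module):
    """Projects a retriever embedding into the LLM's input-embedding space."""
    def __init__(self, retriever_dim: int = 768, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(retriever_dim, llm_dim),
            nn.GELU(),                      # activation choice is an assumption
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, doc_embedding: torch.Tensor) -> torch.Tensor:
        # doc_embedding: (batch, retriever_dim) -> (batch, 1, llm_dim), i.e. one soft token
        return self.proj(doc_embedding).unsqueeze(1)

bridge = ModalityBridge()
doc_emb = torch.randn(2, 768)    # stand-in for precomputed dense retrieval embeddings
xrag_token = bridge(doc_emb)     # one projected token per document
print(xrag_token.shape)          # torch.Size([2, 1, 4096])
```

Because only these two linear layers receive gradients, the bridge is cheap to train and can be attached to a retriever-LLM pair without touching either frozen model.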

Training Strategy

  1. Paraphrase Pretraining: xRAG first learns a paraphrasing task on an unlabeled corpus: the model is trained to reconstruct the original document from its dense embedding, which aligns the dense features with textual content.
  2. Context-aware Instruction Tuning: Next, xRAG undergoes instruction tuning on labeled data (tasks such as reading comprehension, summarization, and open-domain QA), so that the model learns to make effective use of the dense embedding during generation. Both phases share the training setup sketched below.
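
The sketch below shows that shared setup, under the assumption of a PyTorch loop with a HuggingFace-style causal LM and the ModalityBridge above: only the bridge's parameters sit in the optimizer, and the loss is ordinary next-token prediction on the target text (the paraphrased document in phase 1, the instruction-tuning answer in phase 2).

```python
import torch

def training_step(llm, bridge, optimizer, doc_embedding, prompt_ids, target_ids):
    # doc_embedding: (B, retriever_dim); prompt_ids / target_ids: (B, T) token ids.
    xrag_embed = bridge(doc_embedding)                       # (B, 1, llm_dim), trainable path
    token_embeds = llm.get_input_embeddings()(
        torch.cat([prompt_ids, target_ids], dim=1)
    )
    inputs_embeds = torch.cat([xrag_embed, token_embeds], dim=1)

    # Supervise only the target tokens; the xRAG position and the prompt are masked out.
    ignore = torch.full(
        (prompt_ids.size(0), 1 + prompt_ids.size(1)), -100, dtype=torch.long
    )
    labels = torch.cat([ignore, target_ids], dim=1)

    loss = llm(inputs_embeds=inputs_embeds, labels=labels).loss
    loss.backward()                   # gradients reach only bridge.parameters()
    optimizer.step()                  # optimizer was built over bridge.parameters()
    optimizer.zero_grad()
    return loss.item()

# Freezing is done once, outside the loop (hyperparameters are illustrative):
#   for p in llm.parameters(): p.requires_grad_(False)
#   optimizer = torch.optim.AdamW(bridge.parameters(), lr=1e-4)
```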

Experimental Results

Strong Numerical Results: xRAG showed impressive results in experiments across six knowledge-intensive tasks—yielding an average improvement of over 10%. It also achieved performance close to that of uncompressed models while significantly reducing the computation overhead.

Noteworthy Efficiency: In terms of efficiency, xRAG reduces total FLOPs (floating-point operations) by a factor of 3.53 compared to the uncompressed approach. This means faster processing and less computational load, a crucial factor for scaling up.

Practical Implications

Plug-and-Play Flexibility: Since the only trainable component is the modality bridge, xRAG maintains the plug-and-play advantage of existing retrieval-augmented systems. You can reuse precomputed document embeddings, making integration straightforward with minimal retraining required.
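
As a rough illustration of that plug-and-play flow, the sketch below reuses an offline embedding at query time. Prepending the projected token at the front of the prompt is a simplification of where the token is actually spliced in, and the function and prompt template are hypothetical:

```python
import torch

@torch.no_grad()
def xrag_generate(llm, tokenizer, bridge, doc_embeddings, doc_id, question, max_new_tokens=64):
    # doc_embeddings: matrix of dense vectors computed offline by the frozen retriever.
    prompt_ids = tokenizer(
        f"Question: {question}\nAnswer:", return_tensors="pt"
    ).input_ids

    xrag_embed = bridge(doc_embeddings[doc_id].unsqueeze(0))   # reuse the offline embedding
    prompt_embeds = llm.get_input_embeddings()(prompt_ids)
    inputs_embeds = torch.cat([xrag_embed, prompt_embeds], dim=1)

    # Requires a transformers version whose generate() accepts inputs_embeds.
    out = llm.generate(inputs_embeds=inputs_embeds, max_new_tokens=max_new_tokens)
    return tokenizer.decode(out[0], skip_special_tokens=True)
```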

Lightweight and Scalable: Unlike some other context compression methods, xRAG does not need to store large activation memories for its compressed tokens, which makes it suitable for systems with vast retrieval corpora.

Future Directions

Given its efficient and effective performance, xRAG sets the stage for further advancements in retrieval-augmented systems. Here's what the future might hold:

  • Enhanced Multi-Hop Reasoning: While xRAG shows great promise, multi-hop reasoning tasks still present a challenge. Future developments could focus on refining modality fusion techniques to boost performance in these complex tasks.
  • Adaptive Retrieval: Integrating adaptive retrieval techniques might further optimize the relevance of the retrieved documents, minimizing the noise and improving the robustness of the generation.
  • Expanded Modality Integration: Extending xRAG to handle multiple modalities could be another exciting direction, merging not just text but also images, sounds, and more for richer, more informative outputs.

Conclusion

xRAG is a sleek, efficient take on retrieval-augmented generation. By using dense embeddings and a modality fusion strategy, it achieves impressive compression rates while retaining high performance across various tasks. It's a promising approach for those looking to scale their LLM systems without breaking the bank on computational costs. Whether you're already working with dense embeddings or looking for a more efficient context compression method, xRAG is worth paying attention to.

Authors (8)
  1. Xin Cheng (89 papers)
  2. Xun Wang (96 papers)
  3. Xingxing Zhang (65 papers)
  4. Tao Ge (53 papers)
  5. Si-Qing Chen (22 papers)
  6. Furu Wei (291 papers)
  7. Huishuai Zhang (64 papers)
  8. Dongyan Zhao (144 papers)
Citations (16)