The paper introduces MemoRAG, a new retrieval-augmented generation paradigm designed to enhance the ability of LLMs to handle complex tasks involving ambiguous information needs and unstructured knowledge. Traditional RAG systems often struggle with such tasks, as they are primarily effective for straightforward question-answering scenarios. MemoRAG addresses these limitations by incorporating a long-term memory component that enables the system to form a global understanding of the database and generate retrieval clues to locate relevant information.
MemoRAG employs a dual-system architecture comprising a light, long-range LLM for global memory formation and a more expressive LLM for final answer generation. The light LLM generates draft answers that serve as clues for the retrieval tools, while the heavy LLM refines these clues using retrieved information. This framework is optimized through enhancements to the cluing mechanism and memorization capacity.
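A minimal sketch of this dual-system flow, in Python, with hypothetical `memory_model`, `retriever`, and `generator` objects standing in for the light memory LLM, the retrieval tool, and the expressive generator (these names and methods are illustrative, not the authors' actual interfaces):

```python
# Minimal sketch of the dual-system MemoRAG flow described above.
# The objects below are hypothetical stand-ins, not the released API.

def memorag_answer(query: str, database: list[str],
                   memory_model, retriever, generator) -> str:
    """Answer a query by drafting clues from global memory, then refining."""
    # 1. The light, long-range LLM holds a compressed global memory of the
    #    database and drafts a staging answer (the "clues").
    clues = memory_model.generate_clues(query)

    # 2. The clues, rather than the raw query, drive retrieval over the database.
    passages = retriever.search(clues, corpus=database, top_k=5)

    # 3. The expressive LLM produces the final answer from query + evidence.
    return generator.generate(query=query, context=passages)
```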
Key aspects of the MemoRAG framework:
- Memory Module: A light LLM that memorizes the global information of the database, providing retrieval clues. This module is designed to be both retentive and instructive.
- Dual-System Architecture: Uses a light LLM for memory and a heavy LLM for generation, balancing cost-effectiveness with expressiveness.
- Fine-tuning of Memory: The memory module is fine-tuned to generate clues that optimize retrieval quality.
The authors define standard RAG as:
$\mathcal{Y} = \Theta(q, \mathcal{C} \mid \theta)$,
$\mathcal{C} = \Gamma(q, \mathcal{D} \mid \gamma)$,
where
- $q$ is the input query
- $\mathcal{C}$ is the context retrieved from the database $\mathcal{D}$
- $\mathcal{Y}$ is the final answer
- $\Theta(\cdot)$ is the generation model
- $\Gamma(\cdot)$ is the retrieval model
MemoRAG is then formally defined as:
$\mathcal{Y} = \Theta(q, \mathcal{C} \mid \theta)$,
$\mathcal{C} = \Gamma(y, \mathcal{D} \mid \gamma)$,
$y = \Theta_{\text{mem}}(q, \mathcal{D} \mid \theta_{\text{mem}})$,
where
- $y$ represents the staging answer (clues)
- $\Theta_{\text{mem}}(\cdot)$ is the memory model
- $\mathcal{D}$ is the database
The memory model progressively compresses input tokens into memory tokens using a transformer-based model $\Theta(\cdot)$. The attentive interaction at each layer is defined as:
$Q = X W_Q$,
$K = X W_K$,
$V = X W_V$,
$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{Q K^T}{\sqrt{d_k}}\right) V$,
$\Theta(X) = \text{Attention}(Q, K, V)$,
where
- $X$ is the sequence of input token states
- $W_Q, W_K, W_V$ are the weight matrices for the query, key, and value projections
- $d_k$ is the dimension of the key vectors
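For reference, a plain NumPy sketch of this standard single-head scaled dot-product attention (weight shapes are illustrative; multi-head structure and masking are omitted):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(X, W_Q, W_K, W_V):
    """Single-head scaled dot-product self-attention over token states X."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V      # project into query/key/value
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)          # pairwise token similarities
    return softmax(scores, axis=-1) @ V      # attention-weighted sum of values
```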
Memory tokens $x^m$ are introduced to serve as information carriers for long-term memory. After each context window of length $l$, $k$ memory tokens are appended:
$X = \{x_1, \cdots, x_l, x^m_1, \cdots, x^m_k, x_{l+1}, \cdots\}, \quad k \ll l.$
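A toy sketch of this interleaving, where placeholder strings stand in for the learned memory-token embeddings and the window size and $k$ are illustrative:

```python
def interleave_memory_tokens(tokens: list, window: int = 8, k: int = 2) -> list:
    """Append k memory-token placeholders after every window of `window` raw tokens."""
    out = []
    for start in range(0, len(tokens), window):
        out.extend(tokens[start:start + window])        # raw tokens of this window
        out.extend(f"<mem_{i}>" for i in range(k))      # k memory tokens, k << window
    return out

# Example:
# interleave_memory_tokens(list("abcdefghij"), window=4, k=1)
# -> ['a','b','c','d','<mem_0>', 'e','f','g','h','<mem_0>', 'i','j','<mem_0>']
```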
The attentive interactions for memory formation are defined as:
$Q^m = X^m W^m_Q$,
$K^m = X^m W^m_K$,
$V^m = X^m W^m_V$,
$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{[Q; Q^m]\,[K; K^m; K^m_{\text{cache}}]^T}{\sqrt{d_k}}\right) [V; V^m; V^m_{\text{cache}}]$,
where
- $Q^m, K^m, V^m$ are the query, key, and value projections for the memory tokens $x^m$
- $W^m_Q, W^m_K, W^m_V$ are the corresponding weight matrices for the memory tokens
- $K^m_{\text{cache}}$ and $V^m_{\text{cache}}$ refer to the KV cache of the memory tokens from previous context windows
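A NumPy sketch of this memory-side attention under the definitions above, with separate (assumed) projection matrices for memory tokens and a cache of keys/values from earlier windows; masking and multi-head details are omitted:

```python
import numpy as np

def memory_attention(X, X_m, W, W_m, K_cache, V_cache):
    """Memory tokens attend over the current window plus cached memory KVs.

    X        : raw token states of the current window
    X_m      : states of the current memory tokens
    W, W_m   : dicts of projection matrices ("Q", "K", "V") for raw / memory tokens
    K_cache, V_cache : keys/values of memory tokens from earlier windows
    """
    Q,  K,  V  = X   @ W["Q"],   X   @ W["K"],   X   @ W["V"]
    Qm, Km, Vm = X_m @ W_m["Q"], X_m @ W_m["K"], X_m @ W_m["V"]

    Q_all = np.concatenate([Q, Qm], axis=0)              # [Q; Q^m]
    K_all = np.concatenate([K, Km, K_cache], axis=0)     # [K; K^m; K^m_cache]
    V_all = np.concatenate([V, Vm, V_cache], axis=0)     # [V; V^m; V^m_cache]

    scores = Q_all @ K_all.T / np.sqrt(K_all.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V_all
```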
The memory module is trained in two stages: pre-training on randomly sampled long contexts from the RedPajama dataset, followed by supervised fine-tuning (SFT) on task-specific data. The objective maximizes the generation probability of the next token given the key-value (KV) cache of previous memory tokens and the preceding raw tokens of the current window:
$\max_{\Theta_{\text{mem}}} P\left(x_{i,j} \mid x^m_{1,1}, \cdots, x^m_{i-1,\,k_{i-1}}, x_{i,1}, \cdots, x_{i,j-1}\right)$.
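A hedged PyTorch-style sketch of this objective for one context window; the `past_key_values` slot is an assumption about how cached memory KVs would be passed, not the released training code:

```python
import torch
import torch.nn.functional as F

def memory_lm_loss(model, window_tokens, memory_kv_cache):
    """Next-token loss for one context window, conditioned on cached memory KVs.

    window_tokens   : LongTensor of shape (1, seq_len) -- raw tokens x_{i,1..l}
    memory_kv_cache : cached keys/values of memory tokens from windows 1..i-1
                      (assumed to be accepted via `past_key_values`)
    """
    logits = model(window_tokens, past_key_values=memory_kv_cache).logits
    # Predict token j from the memory cache plus tokens 1..j-1 of the same window.
    return F.cross_entropy(
        logits[:, :-1, :].reshape(-1, logits.size(-1)),
        window_tokens[:, 1:].reshape(-1),
    )
```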
MemoRAG addresses ambiguous information needs by creating a global memory across the relevant database, enabling it to infer the underlying intent of implicit queries. For information seeking with distributed evidence, MemoRAG connects and integrates relevant information across multiple steps within the database.
The authors developed a benchmark called UltraDomain to evaluate the effectiveness of MemoRAG. UltraDomain consists of complex RAG tasks with long input contexts drawn from diverse domains, including law, finance, education, and healthcare. The tasks involve implicit information needs, distributed evidence gathering, and high-level understanding of the entire database.
The system implementation of MemoRAG is available in a public repository. Two memory models have been released, memorag-qwen2-7b-inst and memorag-mistral-7b-inst, based on Qwen2-7B-Instruct and Mistral-7B-Instruct-v0.2, respectively. The memory models support compression ratios from 2 to 16, allowing them to handle different context lengths. The system can integrate sparse retrieval, dense retrieval, and reranking methods, with dense retrieval as the default. MemoRAG can also use any generative LLM as the generator, supporting initialization from HuggingFace models or commercial APIs.
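The exact interface of the released package is not reproduced here; the configuration below is only an assumed shape that collects the knobs mentioned above (memory model, compression ratio, retrieval method, generator):

```python
# Illustrative configuration only -- parameter names are assumptions,
# not the repository's exact API.
config = {
    "memory_model": "memorag-mistral-7b-inst",   # or "memorag-qwen2-7b-inst"
    "compression_ratio": 8,                      # released models support 2 to 16
    "retriever": "dense",                        # "sparse", "dense", or reranking
    "generator": "any HuggingFace model or commercial API endpoint",
}
```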
Experiments were conducted on UltraDomain and other benchmarks, comparing MemoRAG against baselines such as Full context input, BGE-M3, Stella-v5, RQ-RAG, and HyDE. The generators used were Llama3-8B-Instruct-8K, Mistral-7B-Instruct-v0.2-32K, and Phi-3-mini-128K.
The results indicated that MemoRAG outperforms all baselines across most datasets, demonstrating strong domain generalization capabilities. MemoRAG consistently surpasses the performance of directly using the full context, illustrating its ability to bridge the gap between processing super-long contexts and addressing complex tasks. Specifically, MemoRAG showed significant improvements in domain-specific tasks and tasks requiring information aggregation.