The paper introduces MemoRAG, a new retrieval-augmented generation paradigm designed to enhance the ability of LLMs to handle complex tasks involving ambiguous information needs and unstructured knowledge. Traditional RAG systems often struggle with such tasks, as they are primarily effective for straightforward question-answering scenarios. MemoRAG addresses these limitations by incorporating a long-term memory component that enables the system to form a global understanding of the database and generate retrieval clues to locate relevant information.
MemoRAG employs a dual-system architecture comprising a light, long-range LLM for global memory formation and a more expressive LLM for final answer generation. The light LLM generates draft answers that serve as clues for the retrieval tools, while the heavy LLM refines these clues using retrieved information. This framework is optimized through enhancements to the cluing mechanism and memorization capacity.
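A minimal sketch of this dual-system flow, in Python, with hypothetical `memory_model`, `retriever`, and `generator` objects standing in for the light memory LLM, the retrieval tool, and the expressive generator (these names and methods are illustrative, not the authors' actual interfaces):

```python
# Minimal sketch of the dual-system MemoRAG flow described above.
# The objects below are hypothetical stand-ins, not the released API.

def memorag_answer(query: str, database: list[str],
                   memory_model, retriever, generator) -> str:
    """Answer a query by drafting clues from global memory, then refining."""
    # 1. The light, long-range LLM holds a compressed global memory of the
    #    database and drafts a staging answer (the "clues").
    clues = memory_model.generate_clues(query)

    # 2. The clues, rather than the raw query, drive retrieval over the database.
    passages = retriever.search(clues, corpus=database, top_k=5)

    # 3. The expressive LLM produces the final answer from query + evidence.
    return generator.generate(query=query, context=passages)
```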
Key aspects of the MemoRAG framework:
- Memory Module: A light LLM that memorizes the global information of the database, providing retrieval clues. This module is designed to be both retentive and instructive.
- Dual-System Architecture: Uses a light LLM for memory and a heavy LLM for generation, balancing cost-effectiveness with expressiveness.
- Fine-tuning of Memory: The memory module is fine-tuned to generate clues that optimize retrieval quality.
The authors define standard RAG as:
$\mathcal{Y} = \Theta(q, \mathcal{C} \mid \theta)$,
$\mathcal{C} = \Gamma(q, \mathcal{D} \mid \gamma)$,
where
- $q$ is the input query
- $\mathcal{C}$ is the context retrieved from the database $\mathcal{D}$
- $\mathcal{Y}$ is the final answer
- $\Theta(\cdot)$ is the generation model
- $\Gamma(\cdot)$ is the retrieval model
MemoRAG is then formally defined as:
$\mathcal{Y} = \Theta(q, \mathcal{C} \mid \theta)$,
$\mathcal{C} = \Gamma(y, \mathcal{D} \mid \gamma)$,
$y = \Theta_{\text{mem}}(q, \mathcal{D} \mid \theta_{\text{mem}})$,
where
- $y$ represents the staging answer (clues)
- $\Theta_{\text{mem}}(\cdot)$ is the memory model
- $\mathcal{D}$ is the database
The memory model progressively compresses input tokens into memory tokens using a transformer-based model $\Theta(\cdot)$. The attentive interaction at each layer is defined as:
$Q = X W_Q$,
$K = X W_K$,
$V = X W_V$,
$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{Q K^T}{\sqrt{d_k}}\right) V$,
$\Theta(X) = \text{Attention}(Q, K, V)$,
where
- $X$ is the sequence of input token states
- $W_Q, W_K, W_V$ are the weight matrices for the query, key, and value projections
- $d_k$ is the dimension of the key vectors
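For reference, a plain NumPy sketch of this standard single-head scaled dot-product attention (weight shapes are illustrative; multi-head structure and masking are omitted):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(X, W_Q, W_K, W_V):
    """Single-head scaled dot-product self-attention over token states X."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V      # project into query/key/value
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)          # pairwise token similarities
    return softmax(scores, axis=-1) @ V      # attention-weighted sum of values
```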
Memory tokens $x^m$ are introduced to serve as information carriers for long-term memory. After each context window of length $l$, $k$ memory tokens are appended:
$X = \{x_1, \cdots, x_l, x^m_1, \cdots, x^m_k, x_{l+1}, \cdots\}, \quad k \ll l.$
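A toy sketch of this interleaving, where placeholder strings stand in for the learned memory-token embeddings and the window size and $k$ are illustrative:

```python
def interleave_memory_tokens(tokens: list, window: int = 8, k: int = 2) -> list:
    """Append k memory-token placeholders after every window of `window` raw tokens."""
    out = []
    for start in range(0, len(tokens), window):
        out.extend(tokens[start:start + window])        # raw tokens of this window
        out.extend(f"<mem_{i}>" for i in range(k))      # k memory tokens, k << window
    return out

# Example:
# interleave_memory_tokens(list("abcdefghij"), window=4, k=1)
# -> ['a','b','c','d','<mem_0>', 'e','f','g','h','<mem_0>', 'i','j','<mem_0>']
```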
The attentive interactions for memory formation are defined as:
$Q^m = X^m W^m_Q$,
$K^m = X^m W^m_K$,
$V^m = X^m W^m_V$,
$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{[Q; Q^m]\,[K; K^m; K^m_{\text{cache}}]^T}{\sqrt{d_k}}\right) [V; V^m; V^m_{\text{cache}}]$,
where
- $Q^m, K^m, V^m$ are the query, key, and value projections for the memory tokens $x^m$
- $W^m_Q, W^m_K, W^m_V$ are the corresponding weight matrices for the memory tokens
- $K^m_{\text{cache}}$ and $V^m_{\text{cache}}$ refer to the KV cache of the memory tokens from previous context windows
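A NumPy sketch of this memory-side attention under the definitions above, with separate (assumed) projection matrices for memory tokens and a cache of keys/values from earlier windows; masking and multi-head details are omitted:

```python
import numpy as np

def memory_attention(X, X_m, W, W_m, K_cache, V_cache):
    """Memory tokens attend over the current window plus cached memory KVs.

    X        : raw token states of the current window
    X_m      : states of the current memory tokens
    W, W_m   : dicts of projection matrices ("Q", "K", "V") for raw / memory tokens
    K_cache, V_cache : keys/values of memory tokens from earlier windows
    """
    Q,  K,  V  = X   @ W["Q"],   X   @ W["K"],   X   @ W["V"]
    Qm, Km, Vm = X_m @ W_m["Q"], X_m @ W_m["K"], X_m @ W_m["V"]

    Q_all = np.concatenate([Q, Qm], axis=0)              # [Q; Q^m]
    K_all = np.concatenate([K, Km, K_cache], axis=0)     # [K; K^m; K^m_cache]
    V_all = np.concatenate([V, Vm, V_cache], axis=0)     # [V; V^m; V^m_cache]

    scores = Q_all @ K_all.T / np.sqrt(K_all.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V_all
```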
The memory module is trained in two stages: pre-training on randomly sampled long contexts from the RedPajama dataset, followed by supervised fine-tuning (SFT) on task-specific data. The objective maximizes the generation probability of the next token given the key-value (KV) cache of previous memory tokens and the preceding raw tokens of the current window:
$\max_{\Theta_{\text{mem}}} P\left(x_{i,j} \mid x^m_{1,1}, \cdots, x^m_{i-1,\,k_{i-1}}, x_{i,1}, \cdots, x_{i,j-1}\right)$.
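A hedged PyTorch-style sketch of this objective for one context window; the `past_key_values` slot is an assumption about how cached memory KVs would be passed, not the released training code:

```python
import torch
import torch.nn.functional as F

def memory_lm_loss(model, window_tokens, memory_kv_cache):
    """Next-token loss for one context window, conditioned on cached memory KVs.

    window_tokens   : LongTensor of shape (1, seq_len) -- raw tokens x_{i,1..l}
    memory_kv_cache : cached keys/values of memory tokens from windows 1..i-1
                      (assumed to be accepted via `past_key_values`)
    """
    logits = model(window_tokens, past_key_values=memory_kv_cache).logits
    # Predict token j from the memory cache plus tokens 1..j-1 of the same window.
    return F.cross_entropy(
        logits[:, :-1, :].reshape(-1, logits.size(-1)),
        window_tokens[:, 1:].reshape(-1),
    )
```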
MemoRAG addresses ambiguous information needs by creating a global memory across the relevant database, enabling it to infer the underlying intent of implicit queries. For information seeking with distributed evidence, MemoRAG connects and integrates relevant information across multiple steps within the database.
The authors developed a benchmark called UltraDomain to evaluate the effectiveness of MemoRAG. UltraDomain consists of complex RAG tasks with long input contexts drawn from diverse domains, including law, finance, education, and healthcare. The tasks involve implicit information needs, distributed evidence gathering, and high-level understanding of the entire database.
The system implementation of MemoRAG is available in a public repository. Two memory models have been released, memorag-qwen2-7b-inst and memorag-mistral-7b-inst, based on Qwen2-7B-Instruct and Mistral-7B-Instruct-v0.2, respectively. The memory models support compression ratios from 2 to 16, allowing them to handle different context lengths. The system can integrate sparse retrieval, dense retrieval, and reranking methods, with dense retrieval as the default. MemoRAG can also use any generative LLM as the generator, supporting initialization from HuggingFace models or commercial APIs.
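The exact interface of the released package is not reproduced here; the configuration below is only an assumed shape that collects the knobs mentioned above (memory model, compression ratio, retrieval method, generator):

```python
# Illustrative configuration only -- parameter names are assumptions,
# not the repository's exact API.
config = {
    "memory_model": "memorag-mistral-7b-inst",   # or "memorag-qwen2-7b-inst"
    "compression_ratio": 8,                      # released models support 2 to 16
    "retriever": "dense",                        # "sparse", "dense", or reranking
    "generator": "any HuggingFace model or commercial API endpoint",
}
```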
Experiments were conducted on UltraDomain and other benchmarks, comparing MemoRAG against baselines such as Full context input, BGE-M3, Stella-v5, RQ-RAG, and HyDE. The generators used were Llama3-8B-Instruct-8K, Mistral-7B-Instruct-v0.2-32K, and Phi-3-mini-128K.
The results indicated that MemoRAG outperforms all baselines across most datasets, demonstrating strong domain generalization capabilities. MemoRAG consistently surpasses the performance of directly using the full context, illustrating its ability to bridge the gap between processing super-long contexts and addressing complex tasks. Specifically, MemoRAG showed significant improvements in domain-specific tasks and tasks requiring information aggregation.