
HippoRAG 2: Enhanced Memory for LLMs

Updated 14 December 2025
  • The paper introduces HippoRAG 2, a framework that unifies dense and sparse retrieval to boost factual recall, associative reasoning, and sense-making in LLMs.
  • It employs a dual-node knowledge graph with passage and phrase nodes, enhanced by Personalized PageRank and LLM-based triple filtering for deep contextual retrieval.
  • Benchmark experiments demonstrate a 7-point F1 gain over embedding retrievers on associative tasks while significantly reducing LLM token usage.

HippoRAG 2 is a non-parametric continual learning framework for LLMs that augments retrieval-augmented generation (RAG) with explicit, context-rich memory mechanisms, specifically constructed knowledge graphs (KGs) and Personalized PageRank (PPR), achieving superior performance in factual recall, sense-making, and associative retrieval. Designed to mimic the dynamic, interconnected character of human long-term memory, HippoRAG 2 builds upon its predecessor HippoRAG by incorporating deeper passage integration, more contextualized query–triple linking, and an online LLM loop both for knowledge extraction/filtering and answer generation. The system addresses deficiencies in previous graph-augmented RAG architectures, which often compromised factual recall or sense-making for associativity, by unifying dense and sparse representations, adding passage nodes to the KG, and leveraging LLM-powered recognition filtering. In benchmark experiments, HippoRAG 2 lifts associative QA F1 by 7 points over state-of-the-art embedding retrievers, while also excelling in factual and discourse-oriented tasks (Gutiérrez et al., 20 Feb 2025).

1. Motivation and Design Principles

The core motivation for HippoRAG 2 is to endow LLMs with a non-parametric continual memory system capable of context-sensitive knowledge acquisition, recall, and integration, which are key features of human memory. Fine-tuning suffers from catastrophic forgetting, while standard RAG workflows, reliant on nearest-neighbor vector retrieval, lack the capacity for multi-hop associations (“associativity”) and deep context interpretation (“sense-making”). Structure-augmented approaches with knowledge graphs partly address associativity but have yielded trade-offs, usually reducing performance on basic factual QA. HippoRAG 2 aims to unify memory recall mechanisms to optimize all three memory modalities (factual, associative, and sense-making) through innovations including:

  • Dense–sparse integration: Passage nodes and phrase nodes co-exist in the KG.
  • Deep contextualization: Queries directly link to full triples.
  • Recognition memory: LLM-based filtering of relevant triples.
  • Online LLM loop: LLM handles both knowledge graph maintenance and final-answer reading.

2. Memory Graph Construction

The HippoRAG 2 knowledge graph includes both phrase nodes $p \in P$ (text spans from OpenIE triple extraction) and passage nodes $d \in D$ (full passages or documents from the corpus). This dual-node structure enables dense–sparse integration. Edges fall into three categories:

  • Relation edges: Connect the subject and object phrase nodes of each KG triple $(s, r, o)$ with an undirected edge of weight $w_{so} = 1$.
  • Synonym edges: Between phrase nodes whose embeddings $e_i, e_j$ satisfy $\textrm{sim}(e_i, e_j) \geq \tau$, with $\tau = 0.8$; the edge weight is $w_{ij} = \textrm{sim}(e_i, e_j)$.
  • Context edges: Link every phrase node extracted from a passage to that passage node, with weight $w_{dp} = 1$.

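The graph is represented as an adjacency matrix $A$ over the full node set $V = P \cup D$, normalized row-wise:

$$\tilde{A} = \Delta^{-1} A$$

where $\Delta_{ii} = \sum_j A_{ij}$ is the diagonal matrix of row sums (written $\Delta$ to avoid a clash with the passage set $D$). The KG is static offline; online, only the personalization vector for PPR changes following LLM-driven triple filtering.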
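The construction above can be illustrated with a minimal sketch. It assumes cosine similarity for the synonym test, a dense NumPy matrix, and that an existing relation edge is not overwritten by a synonym edge; the function name `build_memory_graph` and the toy data are hypothetical and not taken from the paper.

```python
import numpy as np

def build_memory_graph(phrases, passages, triples, phrase_emb, passage_phrases, tau=0.8):
    """Return (nodes, A_norm): the node list and a row-normalized adjacency matrix.

    phrases / passages: lists of distinct node names (phrase nodes come first).
    triples: (subject, relation, object) tuples whose ends are in `phrases`.
    phrase_emb: array of shape (len(phrases), dim).
    passage_phrases: passage name -> phrases extracted from that passage.
    """
    nodes = list(phrases) + list(passages)
    idx = {name: i for i, name in enumerate(nodes)}
    A = np.zeros((len(nodes), len(nodes)))

    # Relation edges: undirected, weight 1, between subject and object.
    for s, _r, o in triples:
        A[idx[s], idx[o]] = A[idx[o], idx[s]] = 1.0

    # Synonym edges: weight = cosine similarity when it reaches tau.
    # Assumption: an existing relation edge keeps its weight.
    unit = phrase_emb / np.linalg.norm(phrase_emb, axis=1, keepdims=True)
    sim = unit @ unit.T
    for i in range(len(phrases)):
        for j in range(i + 1, len(phrases)):
            if sim[i, j] >= tau and A[i, j] == 0.0:
                A[i, j] = A[j, i] = sim[i, j]

    # Context edges: weight 1 between a passage and each phrase extracted from it.
    for d, extracted in passage_phrases.items():
        for p in extracted:
            A[idx[d], idx[p]] = A[idx[p], idx[d]] = 1.0

    # Row-wise normalization; rows without edges stay all-zero.
    row_sums = A.sum(axis=1, keepdims=True)
    A_norm = np.divide(A, row_sums, out=np.zeros_like(A), where=row_sums > 0)
    return nodes, A_norm

# Toy example with made-up 2-d embeddings.
phrases = ["Paris", "France", "City of Paris"]
passages = ["doc1"]
triples = [("Paris", "capital of", "France")]
phrase_emb = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]])
passage_phrases = {"doc1": ["Paris", "France"]}
nodes, A_norm = build_memory_graph(phrases, passages, triples, phrase_emb, passage_phrases)
```

In this toy graph, "Paris" and "City of Paris" are joined by a synonym edge, while "doc1" splits its outgoing weight evenly between the two phrases it contains. A sparse matrix would be the natural choice for a real corpus; the dense array here only keeps the sketch short.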

3. Personalized PageRank and Retrieval

HippoRAG 2 applies PPR over the row-normalized adjacency matrix $\tilde{A}$, producing a contextually ranked retrieval:

$$\pi^{(t+1)} = (1 - \alpha)\,\tilde{A}^{\top} \pi^{(t)} + \alpha\, v$$

where $\alpha$ is fixed and $v$ is a personalization vector defined via relevant phrase and passage seed nodes informed by the query and triple scores. The vector $v$ is:

$$v_i = \frac{s_i}{\sum_j s_j}$$

with $s_i$ being the average retrieval score for triple-generating phrase nodes or weighted embedding similarity for passage nodes (and $s_i = 0$ for non-seed nodes). PPR is solved by power iteration. The top-ranked passage nodes select the contextual passages for the LLM reader downstream.
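Power iteration for PPR can be sketched as follows. The sketch assumes the common update rule "new scores = (1 - alpha) × transposed normalized adjacency × old scores + alpha × personalization vector"; the value `alpha=0.5`, the tolerance, and the function name are illustrative assumptions, not values given in the text.

```python
import numpy as np

def personalized_pagerank(A_norm, v, alpha=0.5, tol=1e-10, max_iter=1000):
    """Power iteration for PPR on a row-normalized adjacency matrix.

    A_norm: (n, n) array whose non-empty rows sum to 1.
    v: non-negative seed weights of length n (normalized here to sum to 1).
    alpha: restart probability (illustrative value, an assumption).
    """
    v = np.asarray(v, dtype=float)
    v = v / v.sum()
    pi = v.copy()
    for _ in range(max_iter):
        nxt = (1.0 - alpha) * (A_norm.T @ pi) + alpha * v
        if np.abs(nxt - pi).sum() < tol:  # L1 change below tolerance
            return nxt
        pi = nxt
    return pi

# Three-node chain 0 - 1 - 2, seeded entirely on node 0.
A_norm = np.array([[0.0, 1.0, 0.0],
                   [0.5, 0.0, 0.5],
                   [0.0, 1.0, 0.0]])
scores = personalized_pagerank(A_norm, v=[1.0, 0.0, 0.0])
ranking = np.argsort(-scores)  # node indices, highest score first
```

For this chain the fixed point can be solved by hand as (7/12, 1/3, 1/12), so probability mass decays with distance from the seed. In a retrieval setting, only the entries belonging to passage nodes would be sorted to pick the passages passed to the reader.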

4. Deep Passage Integration and Prompt Construction

“Deep” passage integration involves encoding both passage and query into dense vectors $h_d$ and $h_q$, concatenated to form the final context-aware prompt embedding:

$$h = [h_d ; h_q]$$

In transformer-based LLMs, this concatenation is consumed in encoder layers or injected into cross-attention for the memory bank at each generation step:

$$\textrm{Attn}(Q, K, V) = \textrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V$$

with keys $K$ and values $V$ computed from the memory bank and $d_k$ the key dimension.

Practically, passages are prepended (delimited) in natural language before the query for answering.
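A minimal sketch of this practical form, in which delimited passages precede the query, is given below. The "Passage i:" / "Question:" / "Answer:" labels and the function name `build_prompt` are assumptions made for illustration; the actual prompt template is not specified in this text.

```python
def build_prompt(query, passages):
    """Prepend delimited passages to the query in a single prompt string."""
    # Each passage becomes its own labeled block, in retrieval order.
    blocks = [f"Passage {i}:\n{text.strip()}" for i, text in enumerate(passages, start=1)]
    # The question comes last so the reader sees all context first.
    blocks.append(f"Question: {query.strip()}\nAnswer:")
    return "\n\n".join(blocks)

prompt = build_prompt(
    "Where was the author of Dubliners born?",
    ["James Joyce wrote Dubliners.", "James Joyce was born in Dublin."],
)
```

Blank lines serve as the delimiter here; any delimiter that the reader model handles reliably would do.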

5. Online Retrieval and Generation Workflow

The online operational logic comprises four sequential steps:

a. Query–Triple Linking and Passage Ranking: Compute embeddings for the query and match against KG triple texts and passage embeddings; retrieve the top-$k$ triples $T$ and passages $D_q$.

b. Recognition Memory (Triple Filtering): Feed the query plus the candidate triples $T$ into a secondary LLM (e.g., Llama-3.3-70B-Instruct) with a prompt designed for filtering out triples irrelevant to the query (see paper Appendix A). The resulting set is $T' \subseteq T$.

c. Seed Node Selection and PPR: Extract up to five phrase nodes from $T'$, scored based on triple score, plus all passage nodes with scaled embedding similarity; construct the personalization vector $v$ for PPR, returning the top-$n$ passages $D^{*}$.

d. Final Generation: Concatenate $D^{*}$ as context and prompt the LLM for the answer output.
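Step c can be sketched as below, under several stated assumptions: a phrase node's score is the mean score of the filtered triples it appears in, passage similarities are scaled by a constant `passage_weight` (the value 0.05 is illustrative), and the seed weights are normalized to sum to 1. The function name and the input format are hypothetical.

```python
from collections import defaultdict

def build_personalization(filtered_triples, passage_sims, max_phrases=5, passage_weight=0.05):
    """Build seed weights for PPR from filtered triples and passage similarities.

    filtered_triples: list of ((subject, relation, object), score) kept by the filter.
    passage_sims: passage name -> query/passage embedding similarity.
    Returns a dict node name -> weight; weights sum to 1 when any seed is positive.
    """
    # Collect the scores of every triple each phrase takes part in.
    per_phrase = defaultdict(list)
    for (s, _r, o), score in filtered_triples:
        per_phrase[s].append(score)
        per_phrase[o].append(score)

    # Average per phrase, then keep the highest-scoring phrases only.
    averaged = {p: sum(v) / len(v) for p, v in per_phrase.items()}
    top = sorted(averaged.items(), key=lambda kv: kv[1], reverse=True)[:max_phrases]
    seeds = dict(top)

    # Every passage node is a seed, with its similarity scaled down.
    for d, sim in passage_sims.items():
        seeds[d] = passage_weight * sim

    total = sum(seeds.values())
    return {n: w / total for n, w in seeds.items()} if total > 0 else seeds

triples = [
    (("Paris", "capital of", "France"), 0.9),
    (("Paris", "located in", "Europe"), 0.7),
]
v = build_personalization(triples, {"doc1": 0.8, "doc2": 0.4})
v_top2 = build_personalization(triples, {"doc1": 0.8, "doc2": 0.4}, max_phrases=2)
```

The resulting dictionary can be mapped onto node indices to form the personalization vector. Scaling passage seeds down keeps the filtered phrase nodes dominant while still letting dense similarity contribute when few triples survive filtering.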

Optional post-processing allows addition of new high-confidence facts back into the KG via OpenIE and synonym detection, supporting continual learning.

6. Experimental Protocol and Evaluation

HippoRAG 2 was empirically validated on three major task types:

Task Type      | Benchmarks                        | Metrics
Factual Recall | NaturalQuestions (NQ), PopQA      | Recall@5, EM, F1
Associativity  | MuSiQue, 2Wiki, HotpotQA, LV-Eval | Recall@5, EM, F1
Sense-making   | NarrativeQA                       | EM, F1

Passage Recall@5 is the percentage of queries for which a supporting passage appears among the top 5 retrieved passages. Exact Match (EM) and F1 measure the accuracy of the generated answer. Key result: on associative benchmarks, HippoRAG 2 achieves a mean gain of 7 F1 points over NV-Embed-v2, the embedding retriever baseline. Factual and sense-making tasks also show slight improvements.
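These metrics can be sketched per query as follows. The sketch assumes a simplified answer normalization (lowercasing, punctuation removal, whitespace collapsing) and token-overlap F1 in the style commonly used for QA evaluation; the benchmarks' official scripts may normalize differently, and the function names are hypothetical.

```python
import string
from collections import Counter

def normalize(text):
    """Lowercase, drop ASCII punctuation, collapse whitespace (simplified)."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())

def exact_match(pred, gold):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(pred) == normalize(gold))

def token_f1(pred, gold):
    """Harmonic mean of token-level precision and recall."""
    p, g = normalize(pred).split(), normalize(gold).split()
    overlap = sum((Counter(p) & Counter(g)).values())  # shared tokens, with multiplicity
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)

def hit_at_k(ranked_ids, gold_ids, k=5):
    """1.0 if any supporting passage is among the top-k retrieved ids."""
    return float(any(pid in gold_ids for pid in ranked_ids[:k]))

em = exact_match("Paris.", "paris")
f1 = token_f1("the city of Paris", "Paris")
hit = hit_at_k(["d3", "d7", "d1"], {"d1"})
miss = hit_at_k(["d3", "d7", "d1", "d2", "d4", "d9"], {"d9"})
```

Averaging `hit_at_k` over all queries and multiplying by 100 gives Recall@5 as defined above. In the F1 example, one of four predicted tokens matches the single gold token, so precision is 0.25 and recall is 1.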

7. Comparative Evaluation and Limitations

Compared to state-of-the-art pure-embedding RAG (NV-Embed-v2 + LLM), HippoRAG 2 lifts multi-hop QA F1 (e.g., MuSiQue: 44.8→51.9) and Recall@5 (e.g., MuSiQue: 69.7%→74.7%, 2Wiki: 76.5%→90.4%). Structure-based approaches (RAPTOR, GraphRAG, LightRAG, HippoRAG) may improve associativity or sense-making, but generally reduce performance on simple QA by 5–10 F1, which HippoRAG 2 avoids. HippoRAG 2 also requires significantly fewer LLM tokens for indexing (e.g., 9M versus 115M for MuSiQue). Nevertheless, the LLM triple filter exhibits a miss rate of roughly 7%, and sparse seeds can limit PPR effectiveness.

8. Future Directions

Future work concentrates on:

  • Integrating episodic memory for extended dialogue contexts.
  • Automatic consolidation/pruning of memory over large document collections.
  • Dynamic graph adaptation reflecting ongoing conversation context.

These directions aim to further approximate human-like conversational memory and scalability in continual learning (Gutiérrez et al., 20 Feb 2025).

