HaluMem: Hallucination in Memory Benchmark
- The paper introduces formal definitions, metrics, and gold standards to systematically quantify memory hallucinations at each operational stage in LLMs.
- It decomposes memory processing into extraction, updating, and QA tasks, revealing how upstream errors propagate to downstream failures.
- Evaluations on HaluMem-Medium and HaluMem-Long datasets highlight significant challenges in scaling memory accuracy and underscore the need for robust memory validation.
The Hallucination in Memory Benchmark (HaluMem) is an operation-level evaluation framework for external memory modules in LLMs and AI agents. It targets the characterization and quantification of memory hallucinations—not only via the system’s end outputs, but at each internal stage of long-term memory processing: extraction, updating, and question answering. HaluMem introduces formal definitions, metrics, and gold standards for these stages and provides large-scale, user-centric multi-turn datasets enabling systematic study of hallucination accumulation and propagation in agent memory systems.
1. Motivation and Formal Framework
HaluMem’s core objective is to pinpoint and measure memory hallucinations—fabrication, errors, conflicts, omissions—by directly inspecting the outputs of each internal operation, in contrast to end-to-end QA-only benchmarks such as LoCoMo, LongMemEval, and PersonaMem. Memory hallucinations frequently originate in upstream operations (e.g., extraction or update), then propagate and amplify, leading to cascading failures in downstream tasks. HaluMem’s explicit operational decomposition reveals the precise locus and nature of such errors, supporting targeted architectural improvements.
The memory system receives a multi-turn dialogue

$$D = \bigl((u_1, a_1), (u_2, a_2), \dots, (u_N, a_N)\bigr)$$

and executes a pipeline of operations: $\mathcal{O}_{\mathrm{ext}}$ denotes extraction of candidate memories, $\mathcal{O}_{\mathrm{upd}}$ denotes memory updating, $\mathcal{O}_{\mathrm{ret}}$ stands for memory retrieval, and $\mathcal{O}_{\mathrm{qa}}$ for QA generation.
A memory hallucination is any incorrect or unsupported memory operation in $\mathcal{O}_{\mathrm{ext}}$ or $\mathcal{O}_{\mathrm{upd}}$, categorized as:
- Fabrication: Extracted memory not present in gold set.
- Error: Semantic deviation from gold memory or gold update.
- Conflict: Inconsistent or logically incoherent coexisting memories.
- Omission: Failure to extract required memories or apply required updates.
The core error metrics include the per-stage error rate

$$e_o = \frac{\#\{\text{hallucinated or omitted operations at stage } o\}}{\#\{\text{operations at stage } o\}}$$

and the total memory-operation error rate over the operation set $\mathcal{O} = \{\mathrm{ext}, \mathrm{upd}, \mathrm{qa}\}$:

$$E = \frac{\sum_{o \in \mathcal{O}} \#\{\text{erroneous operations at } o\}}{\sum_{o \in \mathcal{O}} \#\{\text{operations at } o\}}.$$
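To make the taxonomy concrete, the sketch below (not from the paper; the matcher is a simple placeholder for the LLM judge the benchmark actually uses) buckets extracted memories against a gold set and computes the resulting error rate:

```python
# Illustrative sketch: bucket extracted memories into the hallucination
# categories defined above. "Error" and "Conflict" require pairwise
# semantic checks and are omitted here for brevity.

def semantically_matches(pred: str, gold: str) -> bool:
    # Placeholder: real evaluation would use an LLM judge (e.g., GPT-4o)
    # or embedding similarity rather than string equality.
    return pred.strip().lower() == gold.strip().lower()

def classify_extraction(extracted: list[str], gold: list[str]) -> dict[str, list[str]]:
    report: dict[str, list[str]] = {"correct": [], "fabrication": [], "omission": []}
    matched_gold: set[str] = set()
    for mem in extracted:
        hit = next((g for g in gold if semantically_matches(mem, g)), None)
        if hit is None:
            report["fabrication"].append(mem)   # not grounded in the gold set
        else:
            report["correct"].append(mem)
            matched_gold.add(hit)
    report["omission"] = [g for g in gold if g not in matched_gold]
    return report

def error_rate(report: dict[str, list[str]]) -> float:
    # Share of erroneous operations among all scored operations.
    errors = len(report["fabrication"]) + len(report["omission"])
    total = errors + len(report["correct"])
    return errors / total if total else 0.0
```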
2. Benchmark Task Design
HaluMem decomposes evaluation into three distinct operational tasks, each with dedicated inputs, outputs, gold standards, and metrics:
2.1 Memory Extraction
- Input: Session dialogue $D^s$
- Gold: Gold extraction memories $G^{\mathrm{ext}}_s$
- System output: Extracted memory set $M_{\mathrm{ext}}$
- Metrics: Precision, Recall, F1, Memory Recall (anti-omission), Memory Accuracy (anti-fabrication), Target Precision, False Memory Resistance (FMR), and Weighted Memory Recall.
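A minimal sketch of the core extraction metrics, assuming set-level matching and reading False Memory Resistance as the share of planted distractor facts the system declines to store (both simplifications for illustration; the benchmark judges semantic matches with an LLM):

```python
def extraction_metrics(extracted: set[str], gold: set[str],
                       distractors: set[str]) -> dict[str, float]:
    tp = len(extracted & gold)                      # correctly extracted memories
    precision = tp / len(extracted) if extracted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    # Assumed FMR reading: fraction of planted distractors NOT stored.
    fmr = 1 - len(extracted & distractors) / len(distractors) if distractors else 1.0
    return {"precision": precision, "recall": recall, "f1": f1, "fmr": fmr}
```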
2.2 Memory Updating
- Input: Pre-extracted memories $M_{\mathrm{ext}}$, new dialogue $D^s$
- Gold: Gold updates $G^{\mathrm{upd}}_s$
- Output: Updated memory store $M$
- Metrics: Update Accuracy (Upd Acc ↑), Update Hallucination Rate (Upd Hall ↓), Update Omission (Upd Omit ↓)
2.3 Memory Question Answering
- Input: Question $q$, retrieved memories $R$
- Gold: Reference answer $y^*$
- Output: Generated answer $\hat{y}$
- Metrics: QA Accuracy (QA Acc ↑), QA Hallucination Rate (QA Hall ↓), QA Omission (QA Omit ↓)
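The update and QA tasks share the same triad of rates. A small sketch, assuming each judged operation carries exactly one outcome label:

```python
# Sketch: compute the accuracy / hallucination / omission triad from
# per-operation judgments; the three rates are simply label frequencies.
from collections import Counter

def outcome_rates(outcomes: list[str]) -> dict[str, float]:
    """outcomes: one of 'correct', 'hallucination', 'omission' per operation."""
    counts = Counter(outcomes)
    n = len(outcomes) or 1
    return {label: counts[label] / n
            for label in ("correct", "hallucination", "omission")}

# outcome_rates(["correct", "omission", "correct"])
# -> {'correct': 0.67, 'hallucination': 0.0, 'omission': 0.33} (approx.)
```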
3. Dataset Composition and Annotation
HaluMem provides two user-centric, multi-turn interaction datasets at different scales: HaluMem-Medium and HaluMem-Long. Both support operation-level memory hallucination analysis, enabling scaling studies and ablations by memory type.
| Metric | HaluMem-Medium | HaluMem-Long |
|---|---|---|
| Avg. context length (tokens/user) | 159,910.95 | 1,007,264.65 |
| Avg. sessions per user | 69.35 | 120.85 |
| Avg. turns per session | 21.68 | 22.14 |
| Total turns | 30,073 | 53,516 |
| Total memory points | 14,948 | 14,948 |
| Total QA pairs | 3,467 | 3,467 |
The dataset construction pipeline comprises six stages:
- Persona seeds from PersonaHub, structured and refined via GPT-4o.
- Construction of life skeletons (career anchors, evolving preferences).
- Event flow planning with before/after states mapped.
- Session summaries and explicit memory points with metadata.
- Multi-turn dialogues augmented with adversarial distractors and self-verification.
- QA pair generation, with six QA types (Basic Recall, Multi-hop, Dynamic Update, Memory Boundary, Memory Conflict, Generalization & Application), each traceable to precise evidence.
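For illustration, a hypothetical schema for memory points and QA pairs with evidence tracing, mirroring the pipeline stages above; the field names are assumptions for this sketch, not the benchmark's actual data format:

```python
# Hypothetical data-model sketch: a memory point tied to its session, and
# a QA pair traceable to precise evidence turns.
from dataclasses import dataclass, field

@dataclass
class MemoryPoint:
    text: str                       # the memory statement itself
    mem_type: str                   # "event" | "persona" | "relationship"
    session_id: int                 # session where the memory is introduced
    supersedes: str | None = None   # id of an earlier memory this updates

@dataclass
class QAPair:
    question: str
    answer: str
    qa_type: str                    # one of the six QA types listed above
    evidence: list[int] = field(default_factory=list)  # supporting turn indices
```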
A subset of 700 sessions (≈50% of Medium) was annotated by eight human raters with high reliability: 95.70% correctness, average relevance 9.58/10, and consistency 9.45/10.
4. Protocol, Metrics, and Error Propagation
HaluMem standardizes an evaluation workflow:
- Sessions are processed sequentially.
- After each evaluation-relevant session, the appropriate operation (Extraction, Update, QA) is executed, metrics are computed, and errors are recorded.
- Key API abstractions: AddDialogue, GetDialogueMemory, and RetrieveMemory for pipeline orchestration.
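The three abstractions, sketched as a typed adapter protocol; the method names come from the benchmark, while the signatures are assumptions chosen for illustration:

```python
# Adapter sketch: any memory system exposing these three operations can
# be plugged into the evaluation pipeline.
from typing import Protocol

class MemorySystem(Protocol):
    def AddDialogue(self, user_id: str, turns: list[tuple[str, str]]) -> None:
        """Ingest one session of (user, assistant) turns; triggers extraction and updating."""
        ...

    def GetDialogueMemory(self, user_id: str, session_id: int) -> list[str]:
        """Return the memories the system attributes to a given session."""
        ...

    def RetrieveMemory(self, user_id: str, query: str, k: int = 10) -> list[str]:
        """Return the top-k memories relevant to a question."""
        ...
```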
HaluMem uniquely enables error propagation analysis by expressing downstream QA hallucination in terms of upstream errors. Given extraction error $e_{\mathrm{ext}}$ and update error $e_{\mathrm{upd}}$, the QA hallucination rate is empirically increasing in both,

$$H_{\mathrm{qa}} = g(e_{\mathrm{ext}}, e_{\mathrm{upd}}), \qquad \frac{\partial g}{\partial e_{\mathrm{ext}}} > 0, \quad \frac{\partial g}{\partial e_{\mathrm{upd}}} > 0,$$

demonstrating that reducing hallucinations in the extraction and updating stages directly mitigates QA-level hallucinations.
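One illustrative way to quantify this relationship (not the paper's exact procedure) is a least-squares fit of per-user QA hallucination rates against upstream error rates; positive fitted weights indicate upstream errors propagating downstream:

```python
# Propagation analysis sketch with dummy per-user data.
import numpy as np

e_ext = np.array([0.35, 0.50, 0.70, 0.90])   # extraction error rate per user (dummy)
e_upd = np.array([0.60, 0.75, 0.85, 0.95])   # update error rate per user (dummy)
h_qa  = np.array([0.18, 0.22, 0.27, 0.33])   # QA hallucination rate per user (dummy)

# Fit H_qa ≈ w1 * e_ext + w2 * e_upd + b by ordinary least squares.
X = np.column_stack([e_ext, e_upd, np.ones_like(e_ext)])
w, *_ = np.linalg.lstsq(X, h_qa, rcond=None)
print(f"weights (ext, upd, bias): {w.round(3)}")
```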
5. Experimental Results and Analysis
Four memory systems were evaluated: Mem0, Mem0-Graph, Memobase, and Supermemory. Extraction and update steps were scored with GPT-4o; for QA, GPT-4o generated answers from the retrieved memories.
Key results on overall metrics:
| Dataset | System | Mem Recall ↑ | Weighted Recall ↑ | Target Prec. ↑ | Mem Acc. ↑ | FMR ↑ | Upd Acc ↑ | Upd Hall ↓ | Upd Omit ↓ | QA Acc ↑ | QA Hall ↓ | QA Omit ↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Medium | Mem0 | 42.91% | 65.03% | 86.26% | 60.86% | 56.80% | 25.50% | 0.45% | 74.02% | 53.02% | 19.17% | 27.81% |
| Medium | Mem0-Graph | 43.28% | 65.52% | 87.20% | 61.86% | 55.70% | 24.50% | 0.26% | 75.24% | 54.66% | 19.28% | 26.06% |
| Medium | Memobase | 14.55% | 25.88% | 92.24% | 32.29% | 80.78% | 5.20% | 0.55% | 94.25% | 35.33% | 29.97% | 34.71% |
| Medium | Supermemory | 41.53% | 64.76% | 90.32% | 60.83% | 51.77% | 16.37% | 1.15% | 82.47% | 54.07% | 22.24% | 23.69% |
| Long | Mem0 | 3.23% | 11.89% | 88.01% | 46.01% | 87.65% | 1.45% | 0.03% | 98.51% | 28.11% | 17.29% | 54.60% |
| Long | Mem0-Graph | 2.24% | 10.76% | 87.32% | 41.26% | 88.36% | 1.47% | 0.04% | 98.40% | 32.44% | 21.82% | 45.74% |
| Long | Memobase | 6.18% | 14.68% | 88.56% | 25.61% | 85.39% | 4.10% | 0.36% | 95.38% | 33.60% | 29.46% | 36.96% |
| Long | Supermemory | 53.02% | 70.73% | 85.82% | 29.71% | 36.86% | 17.01% | 0.58% | 82.42% | 53.77% | 22.21% | 24.02% |
Extraction accuracy by memory type (Medium / Long):
| System | Event | Persona | Relationship |
|---|---|---|---|
| Mem0 | 29.7% / 0.9% | 33.7% / 3.0% | 27.8% / 2.2% |
| Mem0-Graph | 30.0% / 1.1% | 33.7% / 2.0% | 26.6% / 1.6% |
| Memobase | 5.1% / 4.1% | 13.4% / 5.3% | 6.8% / 4.2% |
| Supermemory | 28.7% / 38.5% | 32.1% / 40.9% | 20.7% / 32.6% |
Notable observations:
- All systems experience sharp declines in coverage and accuracy when scaling from Medium to Long.
- Extraction failures are primary; poor recall in extraction severely limits update candidates, which in turn damages QA performance.
- Update accuracy remains below 26% on Medium and, except for Supermemory (17.01%), approaches zero on Long.
- QA hallucination rates exceed 19% in Medium; omission and hallucination error rates are substantial.
Time-performance trade-offs are also notable, with evaluation on the longest contexts sometimes exceeding 1,800 minutes in aggregate.
6. Recommendations and Directions
Advancing external memory robustness for LLMs and agents requires targeted design interventions, as HaluMem's results reveal:
- Interpretable and Constrained Memory Operations: Promoting reliability through explicit validation checks in extraction and update modules, rule-based conflict detection (e.g., version control, temporal constraints), and soft extraction constraints (e.g., fact confidence thresholds).
- Stage Linkage: Restricting updates to memories already present in the extracted set and employing “memory watchdog” modules to block questionable or unsupported updates.
- Algorithmic Protocol: The evaluation protocol can be summarized as the following pseudocode:
```text
initialize M = ∅
for each session D^s:
    M_ext = ExtractMemories(D^s)
    scoreExtraction(M_ext, G^ext_s)
    for each gold update (old → new) in G^upd_s:
        if old ∈ M_ext:
            M = UpdateMemory(M, old → new)  # only update if old exists
        else:
            recordOmission(old → new)
    scoreUpdates(M, G^upd_s)
    for each question q:
        R = Retrieve(M, q, k)
        ŷ = Answer(q, R)
        scoreQA(ŷ, y^*)
aggregate all metrics and analyze error propagation
```
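A runnable Python rendering of this pseudocode, with the memory system and the judge supplied as callables; the helper signatures are assumptions for this sketch:

```python
# Protocol sketch: sessions are processed sequentially; extraction,
# updating, and session-level QA are each scored as they occur.

def run_protocol(sessions, extract, update, retrieve, answer, judge):
    """sessions: iterable of (dialogue, gold_ext, gold_upd, qa_pairs) tuples,
    where qa_pairs holds (question, gold_answer) pairs tied to the session."""
    memory: set[str] = set()
    logs = {"extraction": [], "updates": [], "qa": []}

    for dialogue, gold_ext, gold_upd, qa_pairs in sessions:
        m_ext = extract(dialogue)                       # memory extraction
        logs["extraction"].append(judge("extraction", m_ext, gold_ext))
        memory |= set(m_ext)
        for old, new in gold_upd:                       # memory updating
            if old in memory:                           # only update if old exists
                memory = update(memory, old, new)
            else:
                logs["updates"].append(("omission", (old, new)))
        logs["updates"].append(judge("update", memory, gold_upd))
        for question, gold_answer in qa_pairs:          # session-level QA
            retrieved = retrieve(memory, question, 10)  # top-k retrieval
            prediction = answer(question, retrieved)
            logs["qa"].append(judge("qa", prediction, gold_answer))
    return logs
```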
- Error Propagation Control: Optimization objectives should minimize a weighted sum of error rates in sequential operations, and uncertainty estimates should be propagated to both update and QA modules, so low-confidence memories are flagged (e.g., “I’m not sure, please confirm”).
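A minimal sketch combining these recommendations (confidence-gated extraction, a watchdog that blocks unsupported updates, and a weighted multi-stage error objective); the threshold and weights are illustrative assumptions:

```python
CONF_THRESHOLD = 0.8  # soft extraction constraint: drop low-confidence facts

def gated_extract(candidates: list[tuple[str, float]]) -> list[str]:
    kept, flagged = [], []
    for fact, conf in candidates:
        (kept if conf >= CONF_THRESHOLD else flagged).append(fact)
    # Low-confidence facts are surfaced for confirmation, not stored.
    for fact in flagged:
        print(f"I'm not sure about: {fact!r} - please confirm.")
    return kept

def watchdog_update(memory: set[str], old: str, new: str) -> set[str]:
    if old not in memory:
        # Stage linkage: refuse updates whose target was never extracted.
        print(f"blocked unsupported update: {old!r} -> {new!r}")
        return memory
    return (memory - {old}) | {new}

def total_error(e_ext: float, e_upd: float, e_qa: float,
                weights: tuple[float, float, float] = (0.4, 0.3, 0.3)) -> float:
    # Optimization objective: weighted sum of per-stage error rates.
    return sum(w * e for w, e in zip(weights, (e_ext, e_upd, e_qa)))
```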
A plausible implication is that, by making memory operations transparent and stage-wise measurable, future memory architectures can directly suppress spurious or unsupported content and improve long-term reliability in AI agents. HaluMem sets a precedent for rigorous operational evaluation, supporting methodical progress in robust memory system design for agentic LLM deployments.