
HaluMem: Hallucination in Memory Benchmark

Updated 11 November 2025
  • The paper introduces formal definitions, metrics, and gold standards to systematically quantify memory hallucinations at each operational stage in LLMs.
  • It decomposes memory processing into extraction, updating, and QA tasks, revealing how upstream errors propagate to downstream failures.
  • Evaluations on HaluMem-Medium and HaluMem-Long datasets highlight significant challenges in scaling memory accuracy and underscore the need for robust memory validation.

The Hallucination in Memory Benchmark (HaluMem) is an operation-level evaluation framework for external memory modules in LLMs and AI agents. It characterizes and quantifies memory hallucinations not only in the system's end outputs, but at each internal stage of long-term memory processing: extraction, updating, and question answering. HaluMem introduces formal definitions, metrics, and gold standards for these stages and provides large-scale, user-centric multi-turn datasets enabling systematic study of hallucination accumulation and propagation in agent memory systems.

1. Motivation and Formal Framework

HaluMem’s core objective is to pinpoint and measure memory hallucinations—fabrication, errors, conflicts, omissions—by directly inspecting the outputs of each internal operation, in contrast to end-to-end QA-only benchmarks such as LoCoMo, LongMemEval, and PersonaMem. Memory hallucinations frequently originate in upstream operations (e.g., extraction or update), then propagate and amplify, leading to cascading failures in downstream tasks. HaluMem’s explicit operational decomposition reveals the precise locus and nature of such errors, supporting targeted architectural improvements.

The memory system $S$ receives a multi-turn dialogue

$$D = \bigl((u_1, a_1), (u_2, a_2), \dots, (u_N, a_N)\bigr)$$

and executes the following operations:

$$E: D \to \widehat{M}^{\mathrm{ext}},\quad U: (\widehat{M}^{\mathrm{ext}}, D) \to \widehat{M},\quad R: (\widehat{M}, q) \to \widehat{R},\quad Q: (\widehat{R}, q) \to \hat{y},$$

where $E$ denotes extraction of candidate memories, $U$ memory updating, $R$ memory retrieval, and $Q$ QA generation.

A memory hallucination is any incorrect or unsupported memory operation in $E$ or $U$, categorized as:

  • Fabrication: Extracted memory not present in gold set.
  • Error: Semantic deviation from gold memory or gold update.
  • Conflict: Inconsistent or logically incoherent coexisting memories.
  • Omission: Failure to extract required memories or apply required updates.

The core error metrics are

$$\mathrm{Halluc}_E = \frac{\#\{\text{fabrications}\} + \#\{\text{errors}\}}{|\widehat{M}^{\mathrm{ext}}|}, \qquad \mathrm{Omission}_E = \frac{\#\{\text{omitted gold}\}}{|G^{\mathrm{ext}}|},$$

together with the total memory-operation error rate over an operation set $\mathcal{O}$:

$$\mathrm{ErrorRate}_{\mathcal{O}} = \frac{\#\{\text{hallucinated points over } \mathcal{O}\}}{\#\{\text{total memory operations in } \mathcal{O}\}}.$$
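As a minimal sketch of the extraction error metrics, the following function computes $\mathrm{Halluc}_E$ and $\mathrm{Omission}_E$ for one session using exact set comparison of memory strings; HaluMem itself uses LLM-based semantic matching, so the function name and exact-match assumption here are illustrative only.

```python
def extraction_error_rates(extracted, gold, n_errors=0):
    """Halluc_E and Omission_E for one session.

    extracted: set of extracted memory strings (the system's M^ext)
    gold:      set of gold memory strings (G^ext)
    n_errors:  count of semantic errors among matched memories
    """
    fabrications = len(extracted - gold)   # extracted but absent from gold
    omitted = len(gold - extracted)        # gold memories never extracted
    halluc_e = (fabrications + n_errors) / len(extracted) if extracted else 0.0
    omission_e = omitted / len(gold) if gold else 0.0
    return halluc_e, omission_e
```

For example, with extracted memories {"likes green tea", "lives in Oslo", "owns a dog"} and gold memories {"likes green tea", "lives in Oslo", "plays chess"}, one fabrication and one omission each yield a rate of 1/3.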

2. Benchmark Task Design

HaluMem decomposes evaluation into three distinct operational tasks, each with dedicated inputs, outputs, gold standards, and metrics:

2.1 Memory Extraction

  • Input: Session dialogue $D^s$
  • Gold: Gold extraction memories $G_s^{\mathrm{ext}} = \{m_i^s\}_{i=1}^{K_s}$
  • System output: $\widehat{M}_s^{\mathrm{ext}} = \{\hat{m}_j^s\}_{j=1}^{\widehat{K}_s}$
  • Metrics: $\mathrm{Precision}_{\mathrm{ext}}$, $\mathrm{Recall}_{\mathrm{ext}}$, $\mathrm{F1}_{\mathrm{ext}}$, Memory Recall (anti-omission), Memory Accuracy (anti-fabrication), Target Precision, False Memory Resistance (FMR), and Weighted Memory Recall.
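The precision/recall/F1 family above follows the standard definitions over memory points; a small helper, assuming the number of system memories judged to match a gold memory is already known (the matching itself is the LLM-judged step), might look like:

```python
def extraction_prf(n_matched, n_extracted, n_gold):
    """Standard precision/recall/F1 over memory points, given the number
    of system memories judged to match a gold memory."""
    precision = n_matched / n_extracted if n_extracted else 0.0
    recall = n_matched / n_gold if n_gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

For instance, 8 matches out of 10 extracted memories against 16 gold memories gives precision 0.80 and recall 0.50.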

2.2 Memory Updating

  • Input: Pre-extracted memories $\widehat{M}^{\mathrm{ext}}$, new dialogue $D^s$
  • Gold: Gold updates $G_s^{\mathrm{upd}} = \{(m^{\mathrm{old}} \to m^{\mathrm{new}})\}$
  • Output: Updated memories $\widehat{G}_s^{\mathrm{upd}}$
  • Metrics: Update Accuracy ($\mathrm{UpdAcc}$), Update Hallucination Rate ($\mathrm{UpdHall}$), Update Omission ($\mathrm{UpdOmit}$)

$$\mathrm{UpdAcc} = \frac{N_{\mathrm{correct\text{-}upd}}}{N_{\mathrm{target\text{-}upd}}}, \qquad \mathrm{UpdHall} = \frac{N_{\mathrm{wrong\text{-}upd}}}{N_{\mathrm{target\text{-}upd}}}, \qquad \mathrm{UpdOmit} = \frac{N_{\mathrm{missed\text{-}upd}}}{N_{\mathrm{target\text{-}upd}}}.$$
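Since the three update rates share one denominator ($N_{\mathrm{target\text{-}upd}}$), they can be computed from a per-gold-update outcome label; the label names below are an assumption, not HaluMem's exact schema.

```python
from collections import Counter

def update_metrics(outcomes):
    """outcomes: one label per gold target update, each in
    {'correct', 'wrong', 'missed'} (judged against G^upd)."""
    c = Counter(outcomes)
    n_target = len(outcomes)
    return (c["correct"] / n_target,   # UpdAcc
            c["wrong"] / n_target,     # UpdHall
            c["missed"] / n_target)    # UpdOmit
```

Note that by construction the three rates sum to 1 over the gold update set.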

2.3 Memory Question Answering

  • Input: Question $q_j$, retrieved memories $\widehat{R}(q_j)$
  • Gold: Reference answer $y_j^*$
  • Output: Generated answer $\hat{y}_j$
  • Metrics: QA Accuracy ($\mathrm{QA\text{-}Acc}$), QA Hallucination Rate ($\mathrm{QA\text{-}Hall}$), QA Omission ($\mathrm{QA\text{-}Omit}$)

$$\mathrm{QA\text{-}Acc} = \frac{N_{\mathrm{correct}}}{N_{\mathrm{total}}}, \qquad \mathrm{QA\text{-}Hall} = \frac{N_{\mathrm{hallucinated}}}{N_{\mathrm{total}}}, \qquad \mathrm{QA\text{-}Omit} = \frac{N_{\mathrm{omitted}}}{N_{\mathrm{total}}}.$$
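The QA metrics likewise partition answers into three classes. HaluMem uses GPT-4o as the judge; the keyword heuristic below is a purely illustrative stand-in that shows the three-way classification and aggregation, not the benchmark's actual judging logic.

```python
def judge_answer(answer, reference):
    """Toy stand-in for the LLM judge: classify a generated answer as
    'correct', 'hallucinated', or 'omitted'. Keyword matching here is
    illustrative only."""
    if not answer.strip() or "i don't know" in answer.lower():
        return "omitted"
    return "correct" if reference.lower() in answer.lower() else "hallucinated"

def qa_metrics(pairs):
    """pairs: iterable of (generated answer, reference answer)."""
    labels = [judge_answer(a, r) for a, r in pairs]
    n = len(labels)
    return (labels.count("correct") / n,       # QA-Acc
            labels.count("hallucinated") / n,  # QA-Hall
            labels.count("omitted") / n)       # QA-Omit
```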

3. Dataset Composition and Annotation

HaluMem makes use of two user-centric, multi-turn interaction datasets at different scales: HaluMem-Medium and HaluMem-Long. Both support operation-level memory hallucination analysis, enabling scaling studies and ablation by memory type.

| Metric | HaluMem-Medium | HaluMem-Long |
|---|---|---|
| Avg. context length (tokens/user) | 159,910.95 | 1,007,264.65 |
| Avg. sessions per user | 69.35 | 120.85 |
| Avg. turns per session | 21.68 | 22.14 |
| Total turns | 30,073 | 53,516 |
| Total memory points | 14,948 | 14,948 |
| Total QA pairs | 3,467 | 3,467 |

The dataset construction pipeline comprises six stages:

  1. Persona seeds from PersonaHub, structured and refined via GPT-4o.
  2. Construction of life skeletons (career anchors, evolving preferences).
  3. Event flow planning with before/after states mapped.
  4. Session summaries and explicit memory points with metadata.
  5. Multi-turn dialogues augmented with adversarial distractors and self-verification.
  6. QA pair generation, with six QA types (Basic Recall, Multi-hop, Dynamic Update, Memory Boundary, Memory Conflict, Generalization & Application), each traceable to precise evidence.

A subset of 700 sessions (≈50% of Medium) was annotated by eight human raters with high reliability: 95.70% correctness, average relevance 9.58/10, and consistency 9.45/10.

4. Protocol, Metrics, and Error Propagation

HaluMem standardizes an evaluation workflow:

  • Sessions $D^1, \dots, D^S$ are processed sequentially.
  • After each evaluation-relevant session, the appropriate operation (Extraction, Update, QA) is executed, metrics are computed, and errors are recorded.
  • Key API abstractions: AddDialogue, GetDialogueMemory, and RetrieveMemory for pipeline orchestration.
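The three named API abstractions suggest an interface that any memory system under evaluation would implement; the signatures below are assumptions for illustration, since the protocol only names the operations.

```python
from abc import ABC, abstractmethod

class MemorySystem(ABC):
    """Hypothetical interface for a system under HaluMem evaluation."""

    @abstractmethod
    def AddDialogue(self, session: list[tuple[str, str]]) -> None:
        """Ingest one session of (user, assistant) turns; the system
        performs its own extraction and updating internally."""

    @abstractmethod
    def GetDialogueMemory(self, session_id: int) -> list[str]:
        """Return the memories stored for a session, for scoring
        extraction and updates against the gold sets."""

    @abstractmethod
    def RetrieveMemory(self, query: str, k: int = 5) -> list[str]:
        """Return the top-k memories relevant to a question."""
```

Defining the harness against this abstract interface is what lets heterogeneous systems (Mem0, Memobase, etc.) be scored under one protocol.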

HaluMem uniquely enables error propagation analysis by expressing downstream QA hallucination in terms of upstream errors. Given extraction error $E_{\mathrm{ex}} = 1 - \mathrm{MemAcc}$ and update error $E_{\mathrm{upd}} = \mathrm{UpdHall}$, empirically

$$E_{\mathrm{QA}} = 1 - \mathrm{QA\text{-}Acc} \approx \alpha\,E_{\mathrm{ex}} + \beta\,E_{\mathrm{upd}}, \qquad \alpha, \beta > 0,$$

demonstrating that reduction of hallucinations in extraction and updating stages directly mitigates QA-level hallucinations.
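Such coefficients can be estimated by least squares over per-system measurements. The sketch below fits $\alpha$ and $\beta$ using the Medium-dataset figures from the results table in Section 5 (columns MemAcc, Upd H, and QA C); the paper does not specify its exact fitting procedure, so this regression is an illustrative assumption.

```python
import numpy as np

# Per-system errors on HaluMem-Medium, from the Section 5 results table:
# rows are Mem0, Mem0-Graph, Memobase, Supermemory.
E_ex  = np.array([1 - 0.6086, 1 - 0.6186, 1 - 0.3229, 1 - 0.6083])  # 1 - MemAcc
E_upd = np.array([0.0045, 0.0026, 0.0055, 0.0115])                  # UpdHall
E_qa  = np.array([1 - 0.5302, 1 - 0.5466, 1 - 0.3533, 1 - 0.5407])  # 1 - QA-Acc

# Fit E_qa ~ alpha * E_ex + beta * E_upd (no intercept) by least squares.
X = np.column_stack([E_ex, E_upd])
(alpha, beta), *_ = np.linalg.lstsq(X, E_qa, rcond=None)
print(f"alpha = {alpha:.2f}, beta = {beta:.2f}")
```

With only four systems the fit is underdetermined in practice; the point is that $E_{\mathrm{ex}}$ dominates, consistent with extraction being the primary failure stage.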

5. Experimental Results and Analysis

Four memory systems were evaluated: Mem0, Mem0-Graph, Memobase, and Supermemory. Extraction and update steps were scored with GPT-4o; for QA, GPT-4o generated answers from the retrieved memories.

Key results on overall metrics:

| Dataset | System | MemIntegrity (R↑) | Weighted R↑ | Target P↑ | MemAcc↑ | FMR↑ | Upd C↑ | Upd H↓ | Upd O↓ | QA C↑ | QA H↓ | QA O↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Medium | Mem0 | 42.91% | 65.03% | 86.26% | 60.86% | 56.80% | 25.50% | 0.45% | 74.02% | 53.02% | 19.17% | 27.81% |
| Medium | Mem0-Graph | 43.28% | 65.52% | 87.20% | 61.86% | 55.70% | 24.50% | 0.26% | 75.24% | 54.66% | 19.28% | 26.06% |
| Medium | Memobase | 14.55% | 25.88% | 92.24% | 32.29% | 80.78% | 5.20% | 0.55% | 94.25% | 35.33% | 29.97% | 34.71% |
| Medium | Supermemory | 41.53% | 64.76% | 90.32% | 60.83% | 51.77% | 16.37% | 1.15% | 82.47% | 54.07% | 22.24% | 23.69% |
| Long | Mem0 | 3.23% | 11.89% | 88.01% | 46.01% | 87.65% | 1.45% | 0.03% | 98.51% | 28.11% | 17.29% | 54.60% |
| Long | Mem0-Graph | 2.24% | 10.76% | 87.32% | 41.26% | 88.36% | 1.47% | 0.04% | 98.40% | 32.44% | 21.82% | 45.74% |
| Long | Memobase | 6.18% | 14.68% | 88.56% | 25.61% | 85.39% | 4.10% | 0.36% | 95.38% | 33.60% | 29.46% | 36.96% |
| Long | Supermemory | 53.02% | 70.73% | 85.82% | 29.71% | 36.86% | 17.01% | 0.58% | 82.42% | 53.77% | 22.21% | 24.02% |

Extraction accuracy by memory type (Medium / Long):

| System | Event | Persona | Relationship |
|---|---|---|---|
| Mem0 | 29.7% / 0.9% | 33.7% / 3.0% | 27.8% / 2.2% |
| Mem0-Graph | 30.0% / 1.1% | 33.7% / 2.0% | 26.6% / 1.6% |
| Memobase | 5.1% / 4.1% | 13.4% / 5.3% | 6.8% / 4.2% |
| Supermemory | 28.7% / 38.5% | 32.1% / 40.9% | 20.7% / 32.6% |

Notable observations:

  • All systems experience sharp declines in coverage and accuracy when scaling from Medium to Long.
  • Extraction failures are primary; poor recall in extraction severely limits update candidates, which in turn damages QA performance.
  • Update accuracy remains below 26% on Medium and, apart from Supermemory (17.01%), falls below 5% on Long.
  • QA hallucination rates exceed 19% in Medium; omission and hallucination error rates are substantial.

Time-performance trade-offs are also notable, with evaluation on the longest contexts sometimes exceeding 1,800 minutes in aggregate.

6. Recommendations and Directions

Advancing external memory robustness for LLMs and agents requires specifically targeted design interventions, as revealed by HaluMem:

  • Interpretable and Constrained Memory Operations: Promoting reliability through explicit validation checks in extraction and update modules, rule-based conflict detection (e.g., version control, temporal constraints), and soft extraction constraints (e.g., fact confidence thresholds).
  • Stage Linkage: Restricting updates to memories already present in the extracted set and employing “memory watchdog” modules to block questionable or unsupported updates.
  • Algorithmic Protocol: The evaluation protocol can be summarized as the following pseudocode:

        initialize M = ∅
        for each session D^s:
            M_ext = ExtractMemories(D^s)
            scoreExtraction(M_ext, G^ext_s)
            for each gold update (old → new) in G^upd_s:
                if old ∈ M_ext:
                    M = UpdateMemory(M, old → new)  # only update if old exists
                else:
                    recordOmission(old → new)
            scoreUpdates(M, G^upd_s)
            for each question q:
                R = Retrieve(M, q, k)
                ŷ = Answer(q, R)
                scoreQA(ŷ, y^*)
        aggregate all metrics and analyze error propagation
  • Error Propagation Control: Optimization objectives should minimize a weighted sum of error rates in sequential operations, and uncertainty estimates should be propagated to both update and QA modules, so low-confidence memories are flagged (e.g., “I’m not sure, please confirm”).
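The uncertainty-flagging idea in the last bullet can be sketched as a confidence gate: memories below a threshold are withheld from updates and QA and surfaced for user confirmation. The threshold value and the (content, confidence) record shape are illustrative assumptions.

```python
CONF_THRESHOLD = 0.7  # illustrative cutoff for "trusted" memories

def gate_memories(memories):
    """Split (content, confidence) pairs into usable vs. flagged."""
    usable = [m for m, c in memories if c >= CONF_THRESHOLD]
    flagged = [m for m, c in memories if c < CONF_THRESHOLD]
    return usable, flagged

usable, flagged = gate_memories([("lives in Oslo", 0.92),
                                 ("allergic to nuts", 0.41)])
for m in flagged:
    print(f"I'm not sure about '{m}', please confirm.")
```

Only `usable` memories would feed the update and QA stages; `flagged` ones trigger the confirmation prompt instead of silently propagating low-confidence content.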

A plausible implication is that, by making memory operations transparent and stage-wise measurable, future memory architectures can directly suppress spurious or unsupported content and improve long-term reliability in AI agents. HaluMem sets a precedent for rigorous operational evaluation, supporting methodical progress in robust memory system design for agentic LLM deployments.
