HaluMem: Hallucination in Memory Benchmark
- The paper introduces formal definitions, metrics, and gold standards to systematically quantify memory hallucinations at each operational stage in LLMs.
- It decomposes memory processing into extraction, updating, and QA tasks, revealing how upstream errors propagate to downstream failures.
- Evaluations on HaluMem-Medium and HaluMem-Long datasets highlight significant challenges in scaling memory accuracy and underscore the need for robust memory validation.
The Hallucination in Memory Benchmark (HaluMem) is an operation-level evaluation framework for external memory modules in LLMs and AI agents. It targets the characterization and quantification of memory hallucinations—not only via the system’s end outputs, but at each internal stage of long-term memory processing: extraction, updating, and question answering. HaluMem introduces formal definitions, metrics, and gold standards for these stages and provides large-scale, user-centric multi-turn datasets enabling systematic study of hallucination accumulation and propagation in agent memory systems.
1. Motivation and Formal Framework
HaluMem’s core objective is to pinpoint and measure memory hallucinations—fabrication, errors, conflicts, omissions—by directly inspecting the outputs of each internal operation, in contrast to end-to-end QA-only benchmarks such as LoCoMo, LongMemEval, and PersonaMem. Memory hallucinations frequently originate in upstream operations (e.g., extraction or update), then propagate and amplify, leading to cascading failures in downstream tasks. HaluMem’s explicit operational decomposition reveals the precise locus and nature of such errors, supporting targeted architectural improvements.
The memory system receives a multi-turn dialogue

$$D = \bigl((u_1, a_1), (u_2, a_2), \dots, (u_N, a_N)\bigr)$$

and executes a pipeline of operations: $\mathcal{O}_{\mathrm{ext}}$ denotes extraction of candidate memories, $\mathcal{O}_{\mathrm{upd}}$ denotes memory updating, $\mathcal{O}_{\mathrm{ret}}$ stands for memory retrieval, and $\mathcal{O}_{\mathrm{qa}}$ for QA generation.
A memory hallucination is any incorrect or unsupported memory operation in $\mathcal{O}_{\mathrm{ext}}$ or $\mathcal{O}_{\mathrm{upd}}$, categorized as:
- Fabrication: Extracted memory not present in gold set.
- Error: Semantic deviation from gold memory or gold update.
- Conflict: Inconsistent or logically incoherent coexisting memories.
- Omission: Failure to extract required memories or apply required updates.
The core error metrics include the per-stage error rate

$$e_o = \frac{\#\{\text{hallucinated or omitted operations at stage } o\}}{\#\{\text{operations at stage } o\}}$$

and the total memory-operation error rate over the operation set $\mathcal{O} = \{\mathrm{ext}, \mathrm{upd}, \mathrm{qa}\}$:

$$E = \frac{\sum_{o \in \mathcal{O}} \#\{\text{erroneous operations at } o\}}{\sum_{o \in \mathcal{O}} \#\{\text{operations at } o\}}.$$
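To make the taxonomy concrete, the sketch below (not from the paper; the matcher is a simple placeholder for the LLM judge the benchmark actually uses) buckets extracted memories against a gold set and computes the resulting error rate:

```python
# Illustrative sketch: bucket extracted memories into the hallucination
# categories defined above. "Error" and "Conflict" require pairwise
# semantic checks and are omitted here for brevity.

def semantically_matches(pred: str, gold: str) -> bool:
    # Placeholder: real evaluation would use an LLM judge (e.g., GPT-4o)
    # or embedding similarity rather than string equality.
    return pred.strip().lower() == gold.strip().lower()

def classify_extraction(extracted: list[str], gold: list[str]) -> dict[str, list[str]]:
    report: dict[str, list[str]] = {"correct": [], "fabrication": [], "omission": []}
    matched_gold: set[str] = set()
    for mem in extracted:
        hit = next((g for g in gold if semantically_matches(mem, g)), None)
        if hit is None:
            report["fabrication"].append(mem)   # not grounded in the gold set
        else:
            report["correct"].append(mem)
            matched_gold.add(hit)
    report["omission"] = [g for g in gold if g not in matched_gold]
    return report

def error_rate(report: dict[str, list[str]]) -> float:
    # Share of erroneous operations among all scored operations.
    errors = len(report["fabrication"]) + len(report["omission"])
    total = errors + len(report["correct"])
    return errors / total if total else 0.0
```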
2. Benchmark Task Design
HaluMem decomposes evaluation into three distinct operational tasks, each with dedicated inputs, outputs, gold standards, and metrics:
2.1 Memory Extraction
- Input: Session dialogue $D^s$
- Gold: Gold extraction memories $G^{\mathrm{ext}}_s$
- System output: Extracted memory set $M_{\mathrm{ext}}$
- Metrics: Precision, Recall, F1, Memory Recall (anti-omission), Memory Accuracy (anti-fabrication), Target Precision, False Memory Resistance (FMR), and Weighted Memory Recall.
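A minimal sketch of the core extraction metrics, assuming set-level matching and reading False Memory Resistance as the share of planted distractor facts the system declines to store (both simplifications for illustration; the benchmark judges semantic matches with an LLM):

```python
def extraction_metrics(extracted: set[str], gold: set[str],
                       distractors: set[str]) -> dict[str, float]:
    tp = len(extracted & gold)                      # correctly extracted memories
    precision = tp / len(extracted) if extracted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    # Assumed FMR reading: fraction of planted distractors NOT stored.
    fmr = 1 - len(extracted & distractors) / len(distractors) if distractors else 1.0
    return {"precision": precision, "recall": recall, "f1": f1, "fmr": fmr}
```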
2.2 Memory Updating
- Input: Pre-extracted memories $M_{\mathrm{ext}}$, new dialogue $D^s$
- Gold: Gold updates $G^{\mathrm{upd}}_s$
- Output: Updated memory store $M$
- Metrics: Update Accuracy (Upd Acc ↑), Update Hallucination Rate (Upd Hall ↓), Update Omission (Upd Omit ↓)
2.3 Memory Question Answering
- Input: Question $q$, retrieved memories $R$
- Gold: Reference answer $y^*$
- Output: Generated answer $\hat{y}$
- Metrics: QA Accuracy (QA Acc ↑), QA Hallucination Rate (QA Hall ↓), QA Omission (QA Omit ↓)
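The update and QA tasks share the same triad of rates. A small sketch, assuming each judged operation carries exactly one outcome label:

```python
# Sketch: compute the accuracy / hallucination / omission triad from
# per-operation judgments; the three rates are simply label frequencies.
from collections import Counter

def outcome_rates(outcomes: list[str]) -> dict[str, float]:
    """outcomes: one of 'correct', 'hallucination', 'omission' per operation."""
    counts = Counter(outcomes)
    n = len(outcomes) or 1
    return {label: counts[label] / n
            for label in ("correct", "hallucination", "omission")}

# outcome_rates(["correct", "omission", "correct"])
# -> {'correct': 0.67, 'hallucination': 0.0, 'omission': 0.33} (approx.)
```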
3. Dataset Composition and Annotation
HaluMem provides two user-centric, multi-turn interaction datasets at different scales: HaluMem-Medium and HaluMem-Long. Both support operation-level memory hallucination analysis, enabling scaling studies and ablations by memory type.
| Metric | HaluMem-Medium | HaluMem-Long |
|---|---|---|
| Avg. context length (tokens/user) | 159,910.95 | 1,007,264.65 |
| Avg. sessions per user | 69.35 | 120.85 |
| Avg. turns per session | 21.68 | 22.14 |
| Total turns | 30,073 | 53,516 |
| Total memory points | 14,948 | 14,948 |
| Total QA pairs | 3,467 | 3,467 |
The dataset construction pipeline comprises six stages:
- Persona seeds from PersonaHub, structured and refined via GPT-4o.
- Construction of life skeletons (career anchors, evolving preferences).
- Event flow planning with before/after states mapped.
- Session summaries and explicit memory points with metadata.
- Multi-turn dialogues augmented with adversarial distractors and self-verification.
- QA pair generation, with six QA types (Basic Recall, Multi-hop, Dynamic Update, Memory Boundary, Memory Conflict, Generalization & Application), each traceable to precise evidence.
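For illustration, a hypothetical schema for memory points and QA pairs with evidence tracing, mirroring the pipeline stages above; the field names are assumptions for this sketch, not the benchmark's actual data format:

```python
# Hypothetical data-model sketch: a memory point tied to its session, and
# a QA pair traceable to precise evidence turns.
from dataclasses import dataclass, field

@dataclass
class MemoryPoint:
    text: str                       # the memory statement itself
    mem_type: str                   # "event" | "persona" | "relationship"
    session_id: int                 # session where the memory is introduced
    supersedes: str | None = None   # id of an earlier memory this updates

@dataclass
class QAPair:
    question: str
    answer: str
    qa_type: str                    # one of the six QA types listed above
    evidence: list[int] = field(default_factory=list)  # supporting turn indices
```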
A subset of 700 sessions (≈50% of Medium) was annotated by eight human raters with high reliability: 95.70% correctness, average relevance 9.58/10, and consistency 9.45/10.
4. Protocol, Metrics, and Error Propagation
HaluMem standardizes an evaluation workflow:
- Sessions are processed sequentially.
- After each evaluation-relevant session, the appropriate operation (Extraction, Update, QA) is executed, metrics are computed, and errors are recorded.
- Key API abstractions: AddDialogue, GetDialogueMemory, and RetrieveMemory for pipeline orchestration.
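The three abstractions, sketched as a typed adapter protocol; the method names come from the benchmark, while the signatures are assumptions chosen for illustration:

```python
# Adapter sketch: any memory system exposing these three operations can
# be plugged into the evaluation pipeline.
from typing import Protocol

class MemorySystem(Protocol):
    def AddDialogue(self, user_id: str, turns: list[tuple[str, str]]) -> None:
        """Ingest one session of (user, assistant) turns; triggers extraction and updating."""
        ...

    def GetDialogueMemory(self, user_id: str, session_id: int) -> list[str]:
        """Return the memories the system attributes to a given session."""
        ...

    def RetrieveMemory(self, user_id: str, query: str, k: int = 10) -> list[str]:
        """Return the top-k memories relevant to a question."""
        ...
```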
HaluMem uniquely enables error propagation analysis by expressing downstream QA hallucination in terms of upstream errors. Given extraction error $e_{\mathrm{ext}}$ and update error $e_{\mathrm{upd}}$, the QA hallucination rate is empirically increasing in both,

$$H_{\mathrm{qa}} = g(e_{\mathrm{ext}}, e_{\mathrm{upd}}), \qquad \frac{\partial g}{\partial e_{\mathrm{ext}}} > 0, \quad \frac{\partial g}{\partial e_{\mathrm{upd}}} > 0,$$

demonstrating that reducing hallucinations in the extraction and updating stages directly mitigates QA-level hallucinations.
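One illustrative way to quantify this relationship (not the paper's exact procedure) is a least-squares fit of per-user QA hallucination rates against upstream error rates; positive fitted weights indicate upstream errors propagating downstream:

```python
# Propagation analysis sketch with dummy per-user data.
import numpy as np

e_ext = np.array([0.35, 0.50, 0.70, 0.90])   # extraction error rate per user (dummy)
e_upd = np.array([0.60, 0.75, 0.85, 0.95])   # update error rate per user (dummy)
h_qa  = np.array([0.18, 0.22, 0.27, 0.33])   # QA hallucination rate per user (dummy)

# Fit H_qa ≈ w1 * e_ext + w2 * e_upd + b by ordinary least squares.
X = np.column_stack([e_ext, e_upd, np.ones_like(e_ext)])
w, *_ = np.linalg.lstsq(X, h_qa, rcond=None)
print(f"weights (ext, upd, bias): {w.round(3)}")
```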
5. Experimental Results and Analysis
Four memory systems were evaluated: Mem0, Mem0-Graph, Memobase, and Supermemory. Extraction and update steps were scored with GPT-4o; for QA, GPT-4o generated answers from the retrieved memories.
Key results on overall metrics:
| Dataset | System | Mem Recall ↑ | Weighted Recall ↑ | Target Prec. ↑ | Mem Acc. ↑ | FMR ↑ | Upd Acc ↑ | Upd Hall ↓ | Upd Omit ↓ | QA Acc ↑ | QA Hall ↓ | QA Omit ↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Medium | Mem0 | 42.91% | 65.03% | 86.26% | 60.86% | 56.80% | 25.50% | 0.45% | 74.02% | 53.02% | 19.17% | 27.81% |
| Medium | Mem0-Graph | 43.28% | 65.52% | 87.20% | 61.86% | 55.70% | 24.50% | 0.26% | 75.24% | 54.66% | 19.28% | 26.06% |
| Medium | Memobase | 14.55% | 25.88% | 92.24% | 32.29% | 80.78% | 5.20% | 0.55% | 94.25% | 35.33% | 29.97% | 34.71% |
| Medium | Supermemory | 41.53% | 64.76% | 90.32% | 60.83% | 51.77% | 16.37% | 1.15% | 82.47% | 54.07% | 22.24% | 23.69% |
| Long | Mem0 | 3.23% | 11.89% | 88.01% | 46.01% | 87.65% | 1.45% | 0.03% | 98.51% | 28.11% | 17.29% | 54.60% |
| Long | Mem0-Graph | 2.24% | 10.76% | 87.32% | 41.26% | 88.36% | 1.47% | 0.04% | 98.40% | 32.44% | 21.82% | 45.74% |
| Long | Memobase | 6.18% | 14.68% | 88.56% | 25.61% | 85.39% | 4.10% | 0.36% | 95.38% | 33.60% | 29.46% | 36.96% |
| Long | Supermemory | 53.02% | 70.73% | 85.82% | 29.71% | 36.86% | 17.01% | 0.58% | 82.42% | 53.77% | 22.21% | 24.02% |
Extraction accuracy by memory type (Medium / Long):
| System | Event | Persona | Relationship |
|---|---|---|---|
| Mem0 | 29.7% / 0.9% | 33.7% / 3.0% | 27.8% / 2.2% |
| Mem0-Graph | 30.0% / 1.1% | 33.7% / 2.0% | 26.6% / 1.6% |
| Memobase | 5.1% / 4.1% | 13.4% / 5.3% | 6.8% / 4.2% |
| Supermemory | 28.7% / 38.5% | 32.1% / 40.9% | 20.7% / 32.6% |
Notable observations:
- All systems experience sharp declines in coverage and accuracy when scaling from Medium to Long.
- Extraction failures are primary; poor recall in extraction severely limits update candidates, which in turn damages QA performance.
- Update accuracy remains below 26% on Medium and, except for Supermemory (17.01%), approaches zero on Long.
- QA hallucination rates exceed 19% in Medium; omission and hallucination error rates are substantial.
Time-performance trade-offs are also notable, with evaluation on the longest contexts sometimes exceeding 1,800 minutes in aggregate.
6. Recommendations and Directions
Advancing external memory robustness for LLMs and agents requires targeted design interventions, as HaluMem's results reveal:
- Interpretable and Constrained Memory Operations: Promoting reliability through explicit validation checks in extraction and update modules, rule-based conflict detection (e.g., version control, temporal constraints), and soft extraction constraints (e.g., fact confidence thresholds).
- Stage Linkage: Restricting updates to memories already present in the extracted set and employing “memory watchdog” modules to block questionable or unsupported updates.
- Algorithmic Protocol: The evaluation protocol can be summarized as the following pseudocode:
```text
initialize M = ∅
for each session D^s:
    M_ext = ExtractMemories(D^s)
    scoreExtraction(M_ext, G^ext_s)
    for each gold update (old → new) in G^upd_s:
        if old ∈ M_ext:
            M = UpdateMemory(M, old → new)  # only update if old exists
        else:
            recordOmission(old → new)
    scoreUpdates(M, G^upd_s)
    for each question q:
        R = Retrieve(M, q, k)
        ŷ = Answer(q, R)
        scoreQA(ŷ, y^*)
aggregate all metrics and analyze error propagation
```
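A runnable Python rendering of this pseudocode, with the memory system and the judge supplied as callables; the helper signatures are assumptions for this sketch:

```python
# Protocol sketch: sessions are processed sequentially; extraction,
# updating, and session-level QA are each scored as they occur.

def run_protocol(sessions, extract, update, retrieve, answer, judge):
    """sessions: iterable of (dialogue, gold_ext, gold_upd, qa_pairs) tuples,
    where qa_pairs holds (question, gold_answer) pairs tied to the session."""
    memory: set[str] = set()
    logs = {"extraction": [], "updates": [], "qa": []}

    for dialogue, gold_ext, gold_upd, qa_pairs in sessions:
        m_ext = extract(dialogue)                       # memory extraction
        logs["extraction"].append(judge("extraction", m_ext, gold_ext))
        memory |= set(m_ext)
        for old, new in gold_upd:                       # memory updating
            if old in memory:                           # only update if old exists
                memory = update(memory, old, new)
            else:
                logs["updates"].append(("omission", (old, new)))
        logs["updates"].append(judge("update", memory, gold_upd))
        for question, gold_answer in qa_pairs:          # session-level QA
            retrieved = retrieve(memory, question, 10)  # top-k retrieval
            prediction = answer(question, retrieved)
            logs["qa"].append(judge("qa", prediction, gold_answer))
    return logs
```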
- Error Propagation Control: Optimization objectives should minimize a weighted sum of error rates in sequential operations, and uncertainty estimates should be propagated to both update and QA modules, so low-confidence memories are flagged (e.g., “I’m not sure, please confirm”).
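A minimal sketch combining these recommendations (confidence-gated extraction, a watchdog that blocks unsupported updates, and a weighted multi-stage error objective); the threshold and weights are illustrative assumptions:

```python
CONF_THRESHOLD = 0.8  # soft extraction constraint: drop low-confidence facts

def gated_extract(candidates: list[tuple[str, float]]) -> list[str]:
    kept, flagged = [], []
    for fact, conf in candidates:
        (kept if conf >= CONF_THRESHOLD else flagged).append(fact)
    # Low-confidence facts are surfaced for confirmation, not stored.
    for fact in flagged:
        print(f"I'm not sure about: {fact!r} - please confirm.")
    return kept

def watchdog_update(memory: set[str], old: str, new: str) -> set[str]:
    if old not in memory:
        # Stage linkage: refuse updates whose target was never extracted.
        print(f"blocked unsupported update: {old!r} -> {new!r}")
        return memory
    return (memory - {old}) | {new}

def total_error(e_ext: float, e_upd: float, e_qa: float,
                weights: tuple[float, float, float] = (0.4, 0.3, 0.3)) -> float:
    # Optimization objective: weighted sum of per-stage error rates.
    return sum(w * e for w, e in zip(weights, (e_ext, e_upd, e_qa)))
```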
A plausible implication is that, by making memory operations transparent and stage-wise measurable, future memory architectures can directly suppress spurious or unsupported content and improve long-term reliability in AI agents. HaluMem sets a precedent for rigorous operational evaluation, supporting methodical progress in robust memory system design for agentic LLM deployments.