Simulating Human Memory with Language Models

Published 25 May 2026 in cs.CL and cs.AI | (2605.25680v1)

Abstract: LLMs are increasingly being deployed as user simulators, but their memory is far more reliable than that of real users. To measure this gap, we run a series of classic memory experiments from psychology on both humans and LLMs. Across tasks, we find that out-of-the-box LLMs exhibit better memory than humans, even when prompted to imitate human behavior. We then show that better prompting strategies and the use of a compactor can cause LLMs to forget content in a more human-like way. Using these methods, we show preliminary evidence that LLMs with human-like memory constraints can function as more effective user simulators in a downstream education task. Finally, we release human reference data and benchmarks to support future work on simulating human memory with LLMs.

Abstract PDF Upgrade to Chat

Authors (5)

Summary

The paper demonstrates that current LLMs vastly exceed human recall, exposing limits of prompt-based methods in replicating human memory failures.
It introduces the Compactor, a capacity-constrained key-value memory module that forces LLMs to mimic human-like forgetting through chunked representations.
Rigorous benchmarking across ten memory tasks reveals that explicit architectural constraints are more effective than prompts in simulating human memory patterns.

Simulating Human Memory with LLMs: A Comprehensive Evaluation

Introduction

The paper "Simulating Human Memory with LLMs" (2605.25680) presents a rigorous investigation into the alignment between current LLM behavior and human memory limitations. The authors construct a benchmark suite of ten canonical memory tasks derived from the psychology and cognitive science literatures and systematically compare human data ( $N=50$ ) against the performance of a spectrum of state-of-the-art LLMs under varied simulation protocols. The results indicate stark discrepancies in recall abilities, with even human-simulation-prompted LLMs achieving near-ceiling scores, thereby fundamentally diverging from human-like forgetting patterns. The authors demonstrate that architectural interventions, such as the introduction of a capacity-limited key-value memory module (Compactor), are necessary to induce more human-congruent performance distributions. This work closes by examining the practical implications of memory fidelity for user simulation in downstream educational applications.

Memory Benchmark: Tasks and Experimental Setup

A central contribution of the paper is the formulation of a memory benchmarking suite encompassing:

Working memory tasks: forward digit span, reverse digit span, N-back, variable mapping
Recognition and recall: word recognition, narrative free recall
Episodic and relational memory: factual QA, narrative QA, map task, craft task

Tasks were deliberately constructed with "text in, text out" designs to enable direct comparability between human subjects and LLMs. Human data were collected via a rigorously controlled online platform (Figure 1). The evaluation metric for human-likeness is based on normalized Wasserstein distances between the empirical distributions of human and LLM task performance.

Figure 2: Three representative benchmark tasks, illustrating human memory limitations versus LLM ceiling performance.

Human vs. Model Memory: Quantitative Comparisons

The results unambiguously demonstrate that default LLM behavior disregards human-like recall constraints. Across all ten tasks, models attain ceiling-level performance even with only baseline task descriptions (TaskPr) and irrespective of model size or architecture. This massively overshoots human ability. For example, in forward digit span, models reliably recall 20 digits versus the human mean of 6.88; in reverse span the discrepancy is analogous.

Figure 3: Human-model performance comparison for Qwen3-8B and GPT-5.4; LLMs consistently achieve normalized scores near 1.0, while human performance is far lower.

Explicit human simulation prompting (HumPr, MemPr) generally has negligible or unreliable effect. Only in a small subset of tasks—especially narrative free recall—do such prompts somewhat increase alignment, but overall, performance distribution mismatch prevails.

Mechanisms for Human-like Memory: Prompting and Architectures

The authors propose that explicit architectural constraints are necessary to force human-like performance regimes. The central intervention is Compactor: an agentic LLM equipped with a key-value store implementing a strong capacity bottleneck ( $K=4$ ), mirroring the classic "four chunk" limit in human working memory literature.

Under Compactor, LLMs process input in an encode phase, populate the memory store with interpretable compressed chunks, and subsequently answer questions relying only on those chunks, with the original context scrubbed.

Figure 4: Human-model similarity (i.e., humanlikeness) across models and prompt conditions, with Compactor yielding consistently higher alignment.

This method produces substantial, statistically significant increases in humanlikeness, especially in working-memory-dominated tasks (digit span, word recognition, N-back). In forward digit span, similarity with human distributions improves as much as $+\!0.61$ for Qwen3-8B Standard.

Ablation experiments confirm that a structured chunked memory module is more optimal for simulating humanlike recall patterns than unstructured abstractive summaries.

Figure 5: Compactor outperforms unstructured summarizer agents in recapitulating human memory behavior.

Analysis of Prompt-based Simulation: Few-Shot ICL, Forgetting Patterns

The authors probe whether in-context learning with few-shot demonstrations of human performance can bridge the simulation gap. Results show that only in-domain demonstrations of specific task behavior improve model humanlikeness; out-of-domain errors do not confer transfer benefits, indicating an absence of generalizable human-like forgetting schemas within the models.

Figure 6: Few-shot in-domain prompting improves GPT-5.4’s alignment to human behavior, but out-of-domain examples do not.

Furthermore, analyses of error patterns reveal important qualitative distinctions. When errors are induced, LLMs forget in non-humanlike ways: for example, they often produce nearly complete lists with rare item substitutions (MemPr) or abrupt truncations (Compactor), in contrast to the more graded and distributed errors observed in human recall.

Application: Educational Document Reranking with User Simulators

The study applies these insights to an educational application, wherein LLM user simulators are tasked with reranking documents by predicted human memorability. By introducing distractors, increasing redundancy, and manipulating reading difficulty, the experiment assesses which LLM configuration best predicts document difficulty for human learners.

Compactor-enhanced user simulators yield improved pairwise reranking accuracy and higher humanlikeness compared to prompt-only models, but the alignment is still suboptimal in absolute terms.

Figure 7: Left—Pairwise reranking accuracy for Llama 3 8B Instruct. Right—Humanlikeness of LLM user simulators across conditions.

Figure 8: Despite gains with Compactor, model-inferred document difficulty rankings remain discordant with actual human behavior, highlighting the remaining gap.

Theoretical and Practical Implications

This work robustly establishes that current LLMs, even when explicitly instructed, do not spontaneously simulate human memory failures at granular or distributional levels. For realistic user simulation in applications such as education or policy modeling, naive use of "out-of-the-box" LLMs is inappropriate, as they systematically overpredict recall and comprehension.

The empirical observation that explicit architectural interventions grounded in cognitive science improve humanlikeness has important implications. It suggests future LLM-based simulators ought to be built with resource and process constraints reflective of actual human cognitive architectures. These insights generalize beyond memory—other dimensions such as reasoning and attention should be similarly revisited.

Additionally, the study's diagnostic approach, integrating empirical benchmarks, human data, and distribution-matching metrics, sets a methodological precedent for human-AI cognitive alignment research.

Conclusion

The investigation reveals that LLMs fundamentally diverge from human memory limitations, with significant practical implications for user simulation and the evaluation of human-AI interaction systems. Prompt engineering alone is insufficient to induce realistic levels of forgetting. Instead, integrating cognitive architectural constraints (such as the Compactor's chunk-limited memory) substantially increases model–human alignment, although perfect simulation remains elusive, especially in detailed error structure and downstream ranking tasks. Future work should continue to refine such mechanisms and extend them to broader cognitive domains, aiming for principled, empirically validated human-AI alignment.

Markdown Report Issue

Paper to Video (Beta)

No one has generated a video about this paper yet.

Whiteboard

Simulating Human Memory with Language Models

Paper Prompts

Top Community Prompts

Explain it Like I'm 14

off on

Knowledge Gaps

off on

Practical Applications

off on

Glossary

off on

Conceptual Simplification

off on

Explain it Like I'm 14

Simulating Human Memory with LLMs — A Simple Explanation

Overview

This paper studies how well AI LLMs can pretend to have human-like memory. The authors compare how people and AI remember information in classic memory tests. They find that most AI models remember far better than humans, which makes them bad “user simulators” for tasks like teaching. Then they test ways to make AI forget in a more human-like way, and show that doing this can make the AI more useful for simulating how real people learn.

Key Questions

The paper asks three main questions:

How different is AI memory from human memory on well-known psychology tests?
Can we make AI “forget” more like humans do, just by giving it better instructions (prompts)?
If we force the AI to use a small, human-like working memory, does it become a more realistic stand-in for a human in practical tasks, like choosing the best study materials?

Methods and Approach

Here’s how the researchers tested their ideas:

They built a set of 10 memory tasks used in psychology. Examples include:
- Digit span: remember a growing list of numbers (like 3, then 5, then 7 numbers).
- Reverse digit span: remember the numbers, but say them backwards.
- N-back: check whether the current letter matches the one from 1, 2, or 3 steps earlier.
- Story tasks: read a short passage, then answer questions or retell the story.
- Map task: study a map, then answer questions about routes after the map disappears.
- Craft task: learn simple “recipes” for making items from parts, then answer questions after the recipes are hidden.
They collected human results from 50 people and compared them to nine different LLMs.
They tried three types of instructions (prompts) for the AI:
- Task Prompt: just do the task.
- Human Prompt: pretend you are a human in a psychology experiment.
- Memory Prompt: pretend you are a human and remember that humans sometimes forget.
They also built a small “memory tool” for the AI called Compactor:
- Think of it like a backpack with only four pockets. The AI must cram what it thinks is most important into four labeled chunks (like “characters,” “events,” “key facts,” “rules”).
- After packing, the AI must answer questions using only what’s in those four pockets.
- This imitates human “working memory,” which is often described as holding about four “chunks” of information.
To measure how human-like the AI’s memory was, they used a similarity score. You can think of it like comparing two piles of sand (the humans’ scores vs. the AI’s scores) and asking: how much sand would you need to move to make the piles look the same? Smaller differences mean more human-like behavior.

Main Findings

Out-of-the-box AI remembers too well: On all ten tasks, AIs scored near perfect, while humans made normal memory mistakes (for example, humans can usually recall around 7 digits, AIs handled 20 easily).
Just telling AI to “act human” doesn’t fix it: Prompts that say “simulate a human” or “humans forget” didn’t reliably make the AI behave more like people across tasks.
Limiting AI’s working memory helps: Using the Compactor (the four-pocket memory) made AI scores much closer to human scores, especially on the short-term memory tasks like digit span, reverse digit span, N-back, and word recognition.
Example-based training is task-specific: Showing the AI real examples of human performance helped only when the examples matched the same task. Examples from a different task didn’t transfer well.
AI and humans forget differently: Even when the AI made mistakes, the types of mistakes weren’t the same as humans’. For instance, AIs often reproduced nearly the whole sequence with one wrong digit in the middle, while humans showed different error patterns.
Practical test in education: In a small teaching experiment, the human-like memory AI (with Compactor) was better at predicting which version of a reading passage people would remember best. It wasn’t perfect, but it was clearly more useful than the standard models.

Why It Matters

If we want to use AI to design lessons, train professionals, or test policies, the AI should behave like real people—including having realistic limits on memory. This paper shows:

Today’s AIs are too good at remembering compared to humans, which can mislead designers and teachers.
Simple memory limits (like forcing the AI to keep only four chunks) can make the AI a better stand-in for humans.
More work is needed to match not just how much AIs forget, but the way they forget, so their mistakes look human too.

In short, building AIs with human-like memory can make them more trustworthy and useful when they’re used to simulate real users, especially in education and training.

View Paper Prompt View All Prompts

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a single, concrete list of what remains missing, uncertain, or unexplored in the paper, formulated to guide actionable future research.

Benchmark coverage: No tasks directly assess semantic memory, false memories (e.g., DRM paradigm), prospective memory, or visual/spatial memory beyond the map task; extending the suite to these domains is needed to claim broader “human memory” coverage.
Modality constraint: All tasks are “text in, text out”; auditory presentation (canonical for digit span) and mixed-modality encoding/retrieval were not examined, limiting ecological validity and comparability to classic human results.
Time/attention constraints: LLMs were not subjected to timing, attentional load, or dual-task interference comparable to humans (e.g., presentation rate, enforced delays, distractors), which are central to human forgetting; effects of such constraints on model behavior remain unexplored.
Retention interval and consolidation: No tests of delayed recall or multi-session retention (minutes/hours/days) were included; human forgetting curves and consolidation dynamics (e.g., sleep effects) are unaddressed.
Serial position effects: The benchmark does not analyze whether models reproduce human primacy/recency or transposition gradients in list recall; methods to induce and measure these effects in LLMs are missing.
Interference phenomena: Proactive/retroactive interference and similarity-based confusions (core to human memory) were not systematically probed; tasks that manipulate item similarity or update frequency are needed.
Error typologies: Although the paper notes LLM error patterns differ from humans (e.g., truncated vs mid-sequence substitution), there is no systematic taxonomy or cross-task analysis of error types to target specific mechanisms.
Individual variability: The approach compares aggregate distributions; it does not model individual differences (e.g., variance across participants, age, cognitive ability) or whether LLMs can simulate distinct “personas” with different memory capacities.
Calibration and metacognition: No assessment of confidence, calibration, or metacognitive judgments (e.g., feel-of-knowing, judgments of learning) was included, despite their importance for user simulation in education.
Metric limitations: Humanlikeness is defined via 1D Wasserstein distance on normalized task scores, which ignores within-task structure (e.g., serial position curves, item-wise confusions) and may compress rich patterns into a single scalar.
LLM distributional sampling: It is unclear how LLM score “distributions” were obtained (number of seeds, temperature, sampling strategies); within-model variability and stochasticity were not systematically analyzed.
Embedding-based free recall scoring: Narrative recall was evaluated with BLEU and a single embedding model; sensitivity to scoring method (e.g., ROUGE-L, BERTScore, event-based scoring, human grading) and potential embedding biases remains untested.
Few-shot ICL generalization: In-domain examples help but fail to transfer across tasks; the paper leaves open whether larger or structured ICL corpora, meta-prompts, or task-agnostic “memory-style” schemas could enable cross-task generalization.
Prompt dependence and compliance: The effect of different instruction framings, refusal tendencies, verbosity constraints, and decoding settings (temperature/top-p) on memory-like behavior was not explored.
Model coverage: Evaluation focuses on transformer LLMs; architectures with explicit recurrence or state (e.g., RWKV, Mamba), or memory-augmented models, were not compared to test whether native inductive biases yield more human-like memory.
Compactor capacity K: The memory slot capacity is fixed at K=4 (motivated by Cowan), but no ablation probes optimal K, adaptive K per task, or individual/persona-conditioned K; sensitivity to K and slot management policies is unknown.
Chunking strategy and content: Compactor’s chunking is unconstrained and not validated against human chunking behavior (e.g., event segmentation, semantic clustering); how to induce human-typical chunk boundaries and content remains open.
Memory operations: Compactor only supports write/overwrite/delete; there is no mechanism for graded decay, interference-based degradation, retrieval latency, rehearsal, or priority-based retention, all central to human memory.
Generalization beyond summarization: The paper ablates against a single free-form summary but does not test other bottlenecks (e.g., event schemas, scripts, graph-structured memory, compressed rehearsal traces) or learning the compaction policy end-to-end.
Long-horizon interaction: Tasks are single-session and mostly single-turn; whether human-like memory constraints improve fidelity in multi-turn, long-horizon dialogues and agent settings is not evaluated.
Education application scope: The reranking proof-of-concept uses 10 biographies and multiple-choice questions; effects may not transfer to authentic curricula, open-ended responses, or real learning outcomes (delayed tests, transfer tasks).
Baseline comparisons for reranking: The study lacks comparisons to simple heuristics (e.g., readability metrics, length, redundancy counts) or classical cognitive models (e.g., ACT-R–style memory predictions) for document selection.
Cross-linguistic and cross-cultural generality: All human participants were U.S.-based native English speakers; memory simulation across languages, scripts, reading directions, and cultural familiarity is untested.
Population diversity: Effects of age, working-memory capacity, neurodiversity, language proficiency, and domain expertise on alignment are not examined; simulators that capture population heterogeneity remain to be developed.
Task realism and content familiarity: Some materials are LLM-generated; how prior knowledge, schema consistency, and real-world familiarity modulate human–LLM alignment is unmeasured.
Fairness and robustness: The paper does not examine whether memory simulation quality varies with content domain, demographic references, or sensitive attributes.
Data sufficiency and power: Human sample sizes (N=50 for benchmark; N=100 for application) may be underpowered for fine-grained distributional analyses and subgroup effects; replication with larger, stratified cohorts is needed.
Training vs prompting: No attempt was made to fine-tune or reinforcement-learn models to match human memory distributions or error types; whether learning from human traces yields better cross-task alignment remains open.
Mechanistic alignment: The work evaluates external behavior but does not probe whether internal model dynamics (activations/attention patterns) mirror cognitive mechanisms (e.g., gating, maintenance, interference).
Transfer to other downstream tasks: Beyond reranking texts, it is unknown whether human-like memory constraints improve user simulation in tutoring, programming assistance, health counseling, or policy simulations.
Safety implications: The trade-off between making models forgetful for realism and potentially losing critical safety-relevant recall is not analyzed; mechanisms for task-conditional memory constraints are needed.
Reproducibility details: Tool-call specifications, prompt templates, and seed control are provided in appendices, but complete protocols for sampling-based evaluations (e.g., number of runs per task/model) are not fully standardized for exact replication.
Open question on optimality: The paper shows no consistent benefit of “frontier” models in memory simulation; it remains unknown which capabilities (e.g., longer context, better reasoning) actually hinder or help human-like forgetting and why.

View Paper Prompt View All Prompts

Practical Applications

Immediate Applications

Below are deployable use cases that build directly on the paper’s methods (Compactor memory bottleneck, prompts, benchmark, and evaluation) and findings (LLMs over-remember; human-like memory improves user simulation and document reranking).

Education and Training

Memory-aware user simulators for content selection in LMSs and tutoring systems
- Sector: Education/EdTech
- What: Use the Compactor agent (4-slot key–value memory) to predict which reading variant (easier, redundant, distractor) students will best recall; rerank or adapt materials accordingly.
- Tools/workflow: Wrap existing LLMs with Compactor’s encode→recall flow; batch-score content variants; deploy as a reranking API before lesson delivery.
- Assumptions/dependencies: Access to tool-calling LLMs; content is text; four-chunk capacity generalizes; student cohort resembles benchmark population.
Curriculum design assistants that enforce memory-friendly scaffolding
- Sector: Education/Instructional Design
- What: Authoring assistants that “lint” lessons for cognitive load, recommending chunking, redundancy, and removal of distractors to match human memory limits.
- Tools/workflow: Content linting plugin using the benchmark’s tasks as heuristics; Compactor-based “student persona” to spot overload.
- Assumptions/dependencies: Domain calibration required; authoring teams accept automated edits; evaluation based on Wasserstein similarity metric.
Realistic mock students for teacher training and assessment
- Sector: Education/Teacher Prep
- What: Simulated students that forget in human-like ways, making practice exams, Socratic dialogues, and feedback sessions more authentic.
- Tools/workflow: Compactor personas for varied working-memory capacity; scenario scripts that limit recall access to key–value memories only.
- Assumptions/dependencies: Prompt libraries for domains; oversight to avoid reinforcing incorrect pedagogy.

Software, AI, and Product UX

“Memory-constrained” user simulators for product research and A/B testing
- Sector: Software/Product Research
- What: Evaluate UI copy, onboarding flows, and help docs with simulated users that miss or forget steps like real users do.
- Tools/workflow: Instrumented tasks (e.g., multi-step signup) with Compactor users; compare completion and error patterns; iterate on UX microcopy.
- Assumptions/dependencies: Text-first tasks; mapping from benchmark tasks (digit span, word recognition) to product flows is valid.
Middleware for memory-aware agents
- Sector: AI Platforms/Developer Tools
- What: Package the Compactor agent as a drop-in wrapper (K=4 key–value store) to optionally enforce human-like memory in any LLM workflow.
- Tools/workflow: OpenAI-style tool-calling functions write_memory/delete_key; recall-only answers; configurable K for personas.
- Assumptions/dependencies: Model/tool-calling support; governance to prevent misuse (e.g., faking incompetence).
Evaluation and model selection using the Human Memory Simulation Benchmark
- Sector: AI Evaluation/ML Ops
- What: Integrate the released human data, tasks, and Wasserstein-based humanlikeness metric into model eval dashboards to pick simulators that match target users.
- Tools/workflow: CI pipelines that run the 10 tasks; report humanlikeness scores alongside accuracy.
- Assumptions/dependencies: Benchmark relevance to domain; licensing of benchmark assets.

Customer Support and Enterprise Enablement

Script and knowledge-base optimization for recall under load
- Sector: CX/Customer Support/Enterprise Docs
- What: Rewrite support scripts and KB articles with redundancy and chunking tuned to human memory constraints; evaluate with Compactor user simulators.
- Tools/workflow: Pre/post A/B via simulator reranking; memory-linting for long docs; add “recap inserts” aligned to four slots.
- Assumptions/dependencies: Textual channels (chat, email); stable mapping between simulator preferences and human outcomes.
Compliance and safety training with realistic “forgetting”
- Sector: Enterprise L&D/Compliance
- What: Simulated trainees that miss steps or forget policy details at human-like rates to stress-test training materials and assessments.
- Tools/workflow: Compactor trainees; reporting on missed/retained items; targeted remediation content generation.
- Assumptions/dependencies: Domain adaption needed; legal validation for training equivalence.

Healthcare Communication and Patient Education (non-diagnostic)

Memory-informed patient education materials
- Sector: Healthcare/Patient Education
- What: Rerank or rewrite discharge instructions and medication guides to maximize recall; insert critical redundancy, limit distractors.
- Tools/workflow: Compactor-based pretesting of materials; “must-remember” facts highlighted and duplicated.
- Assumptions/dependencies: Not a clinical decision tool; requires clinician review; text-only applicability.

Research and Academia

Cognitive modeling baselines for experiments
- Sector: Cognitive Science/Psycholinguistics
- What: Use the benchmark and Compactor to set “human-likeness” baselines, generate hypotheses, and design stimuli with controlled memory load.
- Tools/workflow: Wasserstein similarity for model–human fits; task scripts for replication.
- Assumptions/dependencies: IRB constraints; generalization beyond U.S.-based English speakers.

Long-Term Applications

These require further research, validation, scaling, or domain-specific integration to be reliable and safe.

Education and Workforce Development

Personalized memory-aware tutoring agents at scale
- Sector: Education/EdTech
- What: Tutors that estimate individual working-memory profiles, adapt chunking, pacing, and redundancy over time, and predict retention trajectories.
- Dependencies: Longitudinal calibration; privacy-preserving learner models; alignment of forgetting patterns (beyond current Compactor) to real human errors.
Automated curriculum optimization via simulator-in-the-loop training
- Sector: EdTech/Content Platforms
- What: Reinforcement learning with memory-constrained simulators as reward models to evolve lesson sequences and assessments.
- Dependencies: Proven correlation between simulator rewards and student outcomes; guardrails against gaming.

Healthcare and Clinical Settings

Preclinical testing of digital therapeutics and cognitive interventions
- Sector: Digital Health/Neuropsychology
- What: Use increasingly human-like memory simulators to screen cognitive training regimens or medication adherence aids before clinical trials.
- Dependencies: Regulatory acceptance; diverse, clinically validated memory models (beyond text tasks); multimodal stimuli.
Synthetic patients for clinician training with realistic memory variability
- Sector: Medical Education/Simulation
- What: Simulated patients who forget timelines, misremember doses, or confabulate details consistent with human error distributions.
- Dependencies: Domain-grounded error taxonomies; integration with standardized patient protocols; rigorous bias and safety reviews.

Policy, Communication, and Public Health

Population-level simulations incorporating memory constraints
- Sector: Public Policy/Public Health
- What: Forecast uptake/retention of guidance (e.g., emergency instructions, health campaigns) under realistic forgetting; optimize message design.
- Dependencies: Demographic calibration (age, education, language); multilingual and multimodal memory modeling; ethical governance.
Misinformation resilience modeling
- Sector: Policy/Platform Safety
- What: Simulate how people retain or forget corrections vs. falsehoods; test interventions (repetition, spacing, salience) before deployment.
- Dependencies: Empirically matched forgetting patterns; longitudinal dynamics; platform integration.

Robotics and Human–Robot Interaction

HRI policies that respect human memory limits
- Sector: Robotics/Industrial Operations
- What: Robots and cobots that pace instructions, repeat critical steps, and chunk directions anticipating human recall failures.
- Dependencies: Embedding text-based memory models into multimodal HRI; real-world validation in safety-critical environments.

Software Engineering and Knowledge Work

“Human-memory-calibrated” coding assistants and documentation copilots
- Sector: Software/DevTools
- What: Assistants that avoid overloading users, schedule reiterations of key API details, and predict which context snippets a developer will forget.
- Dependencies: Mapping from psych tasks to developer workflows; IDE telemetry and privacy-safe personalization; validated benefit in productivity studies.
Organizational knowledge retention simulators
- Sector: Enterprise/Knowledge Management
- What: Simulate how teams forget process steps after handoffs; design SOPs with redundancy and recap triggers.
- Dependencies: Domain adaptation; change-management buy-in; measurement of downstream error reduction.

AI Systems and Alignment

Training assistants with human-like cognitive resource constraints
- Sector: AI Alignment/Agent Design
- What: Bake in bounded memory as an architectural prior to produce assistants that collaborate more naturally with humans.
- Dependencies: Better models of forgetting types (not just capacity); task-generalization beyond few-shot; benchmarks extended beyond text-only memory.
Standardized “humanlikeness” metrics in AI model cards
- Sector: AI Governance/Evaluation
- What: Report memory humanlikeness alongside accuracy and safety for downstream selection in user-facing apps.
- Dependencies: Community adoption; broader benchmarks (semantic/visual memory); consensus on acceptable ranges.

Notes on Assumptions and Dependencies

Human memory capacity proxy: The Compactor enforces $K=4$ “chunks” based on Cowan (2001); domain- and population-specific variability may require tuning.
Modality constraints: The benchmark is text-in/text-out; extending to audio, vision, or mixed modalities will require new tasks and validation.
Population generalization: Human data comes from N≈50 U.S.-based native English speakers; broader demographics are needed for policy/healthcare use.
Error pattern mismatch: Current methods improve score distributions but still diverge in error types from humans; applications depending on fine-grained error realism need further R&D.
Tooling requirements: Best results rely on tool/function-calling LLMs and controllable agent frameworks; some closed models or environments may limit this.
Ethics and safety: Simulating cognitive limitations can be misused; governance is required to prevent manipulative practices and to ensure accessibility and fairness.

View Paper Prompt View All Prompts

Glossary

ablation: An experimental analysis technique where components of a system are removed or modified to assess their contribution. "We ablate Compactor in Appendix~\ref{app:ablation}, comparing it to an unstructured summarizer agent that receives the same input but produces a single free-form prose summary instead of $K$ keyed chunks."
BLEU score: A metric for evaluating text generation by comparing n-gram overlap between candidate and reference texts. "evaluated by a BLEU score and an embedding-based similarity score computed using sentence-transformers/all-MiniLM-L6-v2."
bootstrapped confidence intervals: Uncertainty estimates computed by resampling data with replacement to form confidence intervals. "Error bars indicate bootstrapped 95\% confidence intervals."
chain-of-thought reasoning: A prompting or reasoning approach where models generate explicit step-by-step reasoning. "regardless of model size or whether the model uses chain-of-thought reasoning."
cognitive map formation: The process of building an internal representation of spatial environments and relations. "This task relates to cognitive map formation and spatial relational memory."
Compactor: The paper’s agent that enforces a human-like working memory bottleneck using a fixed-size key–value store. "we construct an agent \cite{yao2023react,sumers2024cognitive} that we refer to as Compactor."
computational psycholinguistics: The study of language processing using computational models and methods. "Computational psycholinguistics."
context window: The span of tokens a LLM can attend to at once; effectively its short-term memory during inference. "Longer context windows do not guarantee reliable memory use"
directed acyclic graph (DAG): A directed graph with no cycles, often used to represent dependencies or workflows. "that form a directed acyclic graph (DAG), and then answer questions about how items combine after the rules are removed."
embedding similarity: A measure of closeness between texts using vector representations in an embedding space. "Performance is evaluated using embedding similarity."
empirical CDF: The empirical cumulative distribution function, which summarizes the distribution of observed data. "the empirical CDFs of humans and LLMs."
episodic memory: Memory for specific events or experiences, including when and where they occurred. "This task can be viewed as a form of episodic memory assessment"
executive function: Cognitive processes that control and regulate other abilities and behaviors (e.g., updating, inhibition, task-switching). "This is a widely used paradigm for studying cognitive control and executive function, including working memory processes."
few-shot prompting: Supplying a model with a small number of examples in the prompt to guide its behavior on a task. "This suggests that the benefit of few-shot prompting is not attributable to exposure to generic human-like mistakes, but rather to elements of task-specific structure."
hippocampal indexing theory: A neurobiological theory proposing that the hippocampus indexes distributed cortical representations to enable episodic memory retrieval. "inspired by the neurobiological hippocampal indexing theory"
humanlikeness (similarity measure): The paper’s metric for how closely model performance matches humans, defined as 1 minus the normalized Wasserstein distance. "Finally, we convert $W_1$ into a similarity measure $Humanlikeness = 1 - W_1$ , where larger values indicate closer human--model alignment."
in-context learning (ICL): A model’s ability to infer task behavior from examples provided in its input, without parameter updates. "These results highlight a key limitation of in-context learning (ICL) for human simulation"
key–value memory module: A structured memory mechanism storing compressed information as key–value pairs with fixed capacity. "This agent explicitly implements a human-like working memory bottleneck via a key-value memory module."
multi-hop QA: Question answering that requires combining information across multiple pieces of evidence or reasoning steps. "improve performance on multi-hop QA benchmarks"
N-back: A working-memory task where one decides if the current item matches the one shown n steps earlier. "N-back. Participants see a sequence of letters and must identify whether each letter matches the one shown $n$ steps earlier (1-back, 2-back, 3-back)."
OpenAI tool-calling format: A standardized interface for LLMs to invoke external tools or functions by structured calls. "two tool-use functions in the OpenAI tool-calling format"
pairwise reranking accuracy: An evaluation metric measuring how often a system correctly prefers the same item as humans when comparing pairs. "We also compute pairwise reranking accuracy, as follows."
persona-conditioned prompting: Prompting that conditions a model’s responses on a specified persona to induce diverse or human-like behavior. "techniques like verbalized sampling \citep{zhang2025verbalized} or persona-conditioned prompting \citep{park2023generative}"
positional interpolation: A method to extend or modify positional encodings so models can handle longer contexts than trained on. "such as positional interpolation \cite{chen2023extending} and rotary embeddings \cite{su2024roformer}."
retrieval augmented generation: Systems that fetch external information at inference time to augment the model’s generation. "retrieval augmented generation systems inspired by the neurobiological hippocampal indexing theory"
rotary embeddings: A positional encoding technique that applies rotations in embedding space to encode token positions. "such as positional interpolation \cite{chen2023extending} and rotary embeddings \cite{su2024roformer}."
sparse embedding lookups: Memory mechanisms that retrieve information using sparse indexing into embedding stores rather than dense attention alone. "sparse embedding lookups \cite{cheng2026conditional,lin2025sparsememory}"
verbalized sampling: A technique where models explicitly articulate sampled choices or uncertainties to produce diverse outputs. "verbalized sampling \citep{zhang2025verbalized}"
Wasserstein distance (earth mover's distance): A distance between probability distributions measuring the minimal “work” to transform one into the other. "we use the Wasserstein distance, or earth mover's distance, between the distributions of human and LLM performance on the task"
working memory capacity: The limited number of items or chunks that can be actively maintained and manipulated in mind. "Humans are typically understood to have a working memory capacity of about four chunks, each of which can consist of a compressed, complex concept \cite{cowan2001}."
working memory updating paradigms: Experimental tasks that require maintaining bindings while replacing outdated information with new inputs. "working memory updating paradigms, which require maintaining and dynamically updating arbitrary bindings while discarding outdated information"

View Paper Prompt View All Prompts

Open Problems

We haven't generated a list of open problems mentioned in this paper yet.

Generate Now

Simulating Human Memory with Language Models

Summary

Simulating Human Memory with LLMs: A Comprehensive Evaluation

Introduction

Memory Benchmark: Tasks and Experimental Setup

Human vs. Model Memory: Quantitative Comparisons

Mechanisms for Human-like Memory: Prompting and Architectures

Analysis of Prompt-based Simulation: Few-Shot ICL, Forgetting Patterns

Application: Educational Document Reranking with User Simulators

Theoretical and Practical Implications

Conclusion

Paper to Video (Beta)

Whiteboard

Paper Prompts

Top Community Prompts

Explain it Like I'm 14

Simulating Human Memory with LLMs — A Simple Explanation

Overview

Key Questions

Methods and Approach

Main Findings

Why It Matters

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Practical Applications

Immediate Applications

Education and Training

Software, AI, and Product UX

Customer Support and Enterprise Enablement

Healthcare Communication and Patient Education (non-diagnostic)

Research and Academia

Long-Term Applications

Education and Workforce Development

Healthcare and Clinical Settings

Policy, Communication, and Public Health

Robotics and Human–Robot Interaction

Software Engineering and Knowledge Work

AI Systems and Alignment

Notes on Assumptions and Dependencies

Glossary

Open Problems

Continue Learning

Collections

Tweets

Don't miss out on important new AI/ML research

Sign up for free to explore the frontiers of research