PERMA: Benchmarking Personalized Memory Agents via Event-Driven Preference and Realistic Task Environments

Published 24 Mar 2026 in cs.AI | (2603.23231v1)

Abstract: Empowering LLMs with long-term memory is crucial for building agents that adapt to users' evolving needs. However, prior evaluations typically interleave preference-related dialogues with irrelevant conversations, reducing the task to needle-in-a-haystack retrieval while ignoring relationships between events that drive the evolution of user preferences. Such settings overlook a fundamental characteristic of real-world personalization: preferences emerge gradually and accumulate across interactions within noisy contexts. To bridge this gap, we introduce PERMA, a benchmark designed to evaluate persona consistency over time beyond static preference recall. Additionally, we incorporate (1) text variability and (2) linguistic alignment to simulate erratic user inputs and individual idiolects in real-world data. PERMA consists of temporally ordered interaction events spanning multiple sessions and domains, with preference-related queries inserted over time. We design both multiple-choice and interactive tasks to probe the model's understanding of persona along the interaction timeline. Experiments demonstrate that by linking related interactions, advanced memory systems can extract more precise preferences and reduce token consumption, outperforming traditional semantic retrieval of raw dialogues. Nevertheless, they still struggle to maintain a coherent persona across temporal depth and cross-domain interference, highlighting the need for more robust personalized memory management in agents. Our code and data are open-sourced at https://github.com/PolarisLiu1/PERMA.

Abstract PDF Upgrade to Chat

Summary

The paper's main contribution is the PERMA benchmark that shifts evaluation from static fact-recall to dynamic, event-driven memory integration.
It introduces a robust two-phase pipeline that decomposes user interactions into fine-grained events to simulate authentic persona evolution.
It compares LLMs and memory systems, revealing insights on token efficiency, noise resilience, and challenges in cross-domain personalization.

PERMA: Event-Driven Benchmarking of Personalized Memory Agents

Motivation and Limitations of Prior Work

The PERMA benchmark addresses critical limitations in the evaluation of LLM-based personalized agents. Existing frameworks overwhelmingly rely on scenario formulations that reduce the personalized memory challenge to "needle-in-a-haystack" retrieval. In these settings, static user preferences are presented as isolated statements, and the principal evaluation metric becomes factual recall in the presence of token-level distractors. This approach fails to capture the dynamic emergence of preferences, the integration of behavioral signals across sessions, and the myriad ambiguities and inconsistencies present in real interaction environments.

PERMA aims to capture the full complexity of real-world persona modeling by shifting the focus from static fact- or preference-recall to the maintenance of a temporally evolving persona state. The benchmark incentivizes and evaluates both continual preference integration and robust persona-state construction throughout protracted, noisy, and cross-domain interactions (Figure 1).

Figure 1: Comparison of evaluation paradigms—conventional benchmarks assess isolated preference retrieval, while PERMA integrates event-driven, temporally compounded preference tracking.

PERMA Data Construction and Event-Driven Evaluation

PERMA's dataset is constructed by leveraging authentic user profiles sampled from diverse geographies, demographics, and interests. User interaction histories are recast into temporally-ordered event sequences. Each interaction event is explicit or implicit in the emergence or supplementation of preferences, and all dialogues are systematically validated by LLM-based and human-in-the-loop pipelines for both coverage and linguistic naturalness.

The pipeline operates in two phases (Figure 2). First, a planner decomposes long-term user histories into fine-grained events, generating event-specific descriptions and objectives. These are then expanded into naturalistic, multi-turn dialogues by an LLM-based generator. Each session is categorized by type—emergence, supplement, or task probe. Task events (i.e., checkpoints) are interleaved at key positions to probe memory integration and resilience to cross-session/contextual interference.

Notably, text variability and user idiolects are explicitly simulated: prompt defects, ambiguous/inconsistent user behaviors, colloquialisms, multi-lingual fragments, and style-aligned turns are programmatically injected to reflect real usage distributions.

Figure 2: PERMA pipeline—timeline construction from user summaries, noise and style variability injection, and two-stage evaluation.

Evaluation Protocol and Metrics

PERMA comprehensively disentangles and probes memory fidelity, persona consistency, and retrieval efficiency:

MCQ Probing: Task events are evaluated through carefully designed multiple-choice queries, ablating axes of task completion, preference consistency, and information confidence. This setup ensures models must demonstrate long-horizon synthesis rather than exploiting superficial cues.
Interactive Evaluation: A user simulator, using the ground-truth persona state, provides iterative feedback until satisfactory agent responses are produced. This allows fine-grained measurement of Turn=1 (one-shot success), Turn≤2 (minor clarification), and overall Completion rates.
Memory Fidelity & Efficiency: BERT-f1, LLM-as-a-judge memory scores, search duration, and token usage are used to assess the reconstruction and deployment cost of memory representations. These metrics reveal bottlenecks and trade-offs between token compression and retrieval quality.

Three temporal checkpoints are used: Type 1 ("zero-memory" baseline), Type 2 (post-domain session, maximal recall opportunity), and Type 3 (post-interference, maximal noise). Performance is further dissected by context depth and position (positional probing), allowing diagnosis of recency bias, catastrophic forgetting, and context-window saturation.

Model and System Comparison

Models evaluated include standalone LLMs (reasoning and chat variants), standard RAG pipelines, and state-of-the-art structured memory systems such as MemOS, Mem0, LightMem, Memobase, EverMemOS, and Supermemory. Across clean, noisy, and long-context conditions, experiments elucidate four primary dynamics:

Reasoning LLMs excel at explicit persona consistency but degrade sharply when context length exceeds effective window size or when idiosyncratic lexical distribution increases (Figure 3).
Memory systems demonstrate extreme token efficiency (up to 300 $\times$ lower token usage) and higher stability across temporal drift, due to persistent persona representations. MemOS in particular provides robust performance across both MCQ and interactive settings with minimal recency collapse.
Cross-domain and multi-session queries remain challenging (Figure 4), with all evaluated systems exhibiting significant accuracy and stability drops, especially under adversarial in-session noise or divergent style. This is attributed to structural limitations in retrieval segmentation (fixed top-k, absence of task-adaptive selection).
Marginal utility of increased retrieval depth is highly system dependent (Figure 5): vanilla RAG pipelines benefit from expanded search due to coarse-grained chunk selection, whereas over-retrieval for memory agents introduces noise and degrades the downstream reasoning fidelity.

Figure 6: Memory system performance and robustness—MCQ accuracy and memory score trends across event checkpoints in clean single-domain evaluation.

Figure 7: Cross-setting (clean/noise) MCQ and interactive Turn=1 performance for LLMs and memory systems.

Implications and Future Directions

PERMA's event-driven, temporally-evolving paradigm exposes crucial deficiencies in current personalization architectures. LLMs, despite large parameterization, exhibit severe context window limitations and cannot reliably track or infer emerging preferences under non-canonical linguistic patterns. Memory-augmented agents, though token efficient and resilient to synthetic context bloat, remain brittle under high cross-domain task entropy and lack flexible retrieval granularity. The presence of ambiguity- and style-induced in-session noise paradoxically improves some memory systems, but only up to the threshold where integration errors accumulate and persona drift undermines consistency.

Three primary implications emerge:

Headroom for dynamic, task-adaptive memory management: Static segment retrieval and monolithic persona state modeling are insufficient for realistic long-horizon agentic applications. Hierarchically adaptive, retriever-planner architectures or RL-based memory optimization may address retrieval quality and adaptive filtering under context drift.
Rigorous evaluation must occur at multiple abstraction layers: Option-based MCQ accuracy alone is inadequate; interactive, user-centered benchmarks, multi-metric breakdowns, and longitudinal drift analysis are required for meaningful diagnosis of personalization capacity.
Robust lifelong memory for agentic LLMs remains unsolved: True lifelong "companionship" demands mechanisms capable of integrating events, inferring latent cross-domain persona traits, and dynamically adjusting to both user volatility and environmental noise—a research challenge not subsumed by advances in vanilla long-context LLMs or retrieval augmentation.
Figure 4: Multi-domain MCQ and interactive performance—compressed memory and persona-state agents exhibit marked degradation in the presence of cross-domain noise, unlike single-domain or clean conditions.

Conclusion

PERMA defines a new benchmark for assessing the full spectrum of challenges in event-driven, evolving persona memory—beyond static recall or "needle-in-a-haystack" search. It demonstrates that token compression and context window expansion alone do not address the complexity of personalization; structured, event-indexed, and evolution-aware memory architectures are essential. While MemOS and similar systems demonstrate marked gains in efficiency and resilience, robust multi-domain, high-entropy personalized agents remain a modeling and engineering frontier. Future progress will require advances in memory representation, event abstraction, and adaptive retrieval conditioned jointly on temporal evolution, cross-domain dependencies, and user behavioral noise.

(End.)

Markdown Report Issue