PERMA: Event-Driven Benchmarking of Personalized Memory Agents
This presentation explores PERMA, a groundbreaking benchmark that shifts personalized AI evaluation from static fact-recall to dynamic, temporally-evolving persona tracking. Unlike conventional frameworks that reduce personalization to isolated preference retrieval, PERMA assesses how agents maintain coherent persona states across noisy, multi-domain interactions over extended timelines. The talk examines PERMA's event-driven construction pipeline, multi-layered evaluation protocol, and comparative analysis of reasoning LLMs versus structured memory systems, revealing that token efficiency and context expansion alone cannot solve the personalization challenge: event-indexed, evolution-aware architectures are essential for robust lifelong agentic memory.

Script
Most AI personalization benchmarks test whether agents can find a needle in a haystack. But real users don't express preferences as isolated facts—they reveal them gradually, inconsistently, across conversations that span weeks and domains. PERMA is the first benchmark designed to capture that messy reality.
Traditional evaluation treats user preferences like database entries to be looked up. An agent reads a preference statement, then later proves it can recall that fact amid distractors. This fundamentally misses how preferences actually emerge—through fragmented hints, contradictions, and behavioral drift over time.
PERMA reconstructs personalization as a temporal challenge.
The benchmark constructs timelines from real user profiles, then decomposes them into fine-grained events—some explicit, others implicit. Each event becomes a naturalistic multi-turn dialogue, deliberately polluted with the defects you'd encounter in production: vague phrasing, contradictory signals, even mid-sentence language switching. Task probes are interleaved to measure whether agents can synthesize preferences across sessions or collapse under interference.
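The pipeline above can be sketched in a few lines. This is a minimal illustration, not PERMA's actual code: the `Event`, `pollute`, and `interleave_probes` names, the specific defect injections, and the probe cadence are all assumptions invented for this sketch.

```python
import random
from dataclasses import dataclass


@dataclass
class Event:
    day: int
    domain: str
    preference: str
    explicit: bool  # explicit statement vs. implicit behavioral hint


@dataclass
class Turn:
    role: str
    text: str


def pollute(text: str, rng: random.Random) -> str:
    """Inject a production-style defect: vague phrasing, a contradictory
    signal, or a mid-utterance language switch (illustrative examples only)."""
    defects = [
        lambda t: t.replace("love", "kind of like"),   # vague phrasing
        lambda t: t + " ...or maybe not, actually.",   # contradictory signal
        lambda t: t + " C'est compliqué.",             # language switching
    ]
    return rng.choice(defects)(text)


def build_dialogue(event: Event, rng: random.Random) -> list[Turn]:
    """Turn one timeline event into a short, noisy dialogue session."""
    hint = (f"I love {event.preference}" if event.explicit
            else f"I ended up choosing {event.preference} again")
    return [Turn("user", pollute(hint, rng)), Turn("agent", "Noted.")]


def interleave_probes(events: list[Event], rng: random.Random,
                      every: int = 2) -> list[object]:
    """Order events on the timeline, render each as a dialogue, and
    interleave task probes that force cross-session preference synthesis."""
    sessions: list[object] = []
    for i, ev in enumerate(sorted(events, key=lambda e: e.day)):
        sessions.append(build_dialogue(ev, rng))
        if (i + 1) % every == 0:
            sessions.append(("probe", f"Recommend something in the {ev.domain} domain."))
    return sessions
```

The essential property is that preferences never appear as clean, queryable facts: every signal passes through `pollute`, and probes arrive only after several sessions of interference.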
The evaluation reveals a stark divide. Reasoning-focused language models excel at explicit consistency but degrade sharply when context exceeds their effective window or when linguistic idiosyncrasies accumulate. Memory systems like MemOS, by contrast, compress persona state into persistent representations, achieving up to 300 times lower token usage while maintaining stability across temporal drift. The gap widens further under noise: memory architectures prove far more resilient to in-session ambiguity than raw context stuffing.
PERMA exposes a fundamental asymmetry. Memory-augmented agents can compress and persist persona state efficiently, but the moment you introduce cross-domain tasks or adversarial style noise, performance collapses. Fixed top-k retrieval and monolithic state modeling break down. True lifelong personalization—agents that evolve with users across months, inferring latent traits and adapting to volatility—remains an open engineering and modeling frontier.
Token compression buys you efficiency, but it doesn't buy you understanding. PERMA proves that robust personalized agents require architectures that can reason about time, ambiguity, and cross-domain coherence—not just store facts. To explore PERMA and create your own research video, visit EmergentMind.com.