ES-MemEval: A Memory Benchmark for Emotional Support Agents
- ES-MemEval is a benchmark that evaluates agents’ long-term memory in personalized, evolving emotional support interactions across multi-session dialogues.
- It assesses five core memory competencies—information extraction, temporal reasoning, conflict detection, abstention, and user modeling—using precise quantitative and qualitative metrics.
- The EvoEmo dataset underpins evaluations in question answering, summarization, and dialogue generation, highlighting the impact of retrieval-augmented strategies.
ES-MemEval is a comprehensive benchmark for the systematic evaluation of conversational agents’ long-term memory capabilities, specifically in the domain of personalized, long-duration emotional support (ES) interactions. Addressing the deficiencies of previous benchmarks—which often focus on static and explicit fact retrieval—ES-MemEval targets scenarios where user information is dispersed, evolving, and often implicit. By integrating five core long-term memory competencies and introducing the multi-session EvoEmo dataset, ES-MemEval establishes new protocols for the quantitative and qualitative assessment of agent memory, temporal understanding, conflict management, abstention, and user modeling in question answering, summarization, and generative dialogue tasks (Chen et al., 2 Feb 2026).
1. Motivation and Scope
ES-MemEval was conceived in response to three critical shortcomings in legacy long-term dialogue evaluations: (1) their emphasis on static facts rather than dynamic personal histories, (2) insufficiency in measuring agents’ handling of fragmented or conflicting information, and (3) inability to assess adaptive user modeling over prolonged, evolving interactions. In the ES domain, agents must synthesize content spread over extended timelines, reason about event causality and chronology, detect and resolve information inconsistencies, abstain when memory is inadequate, and continuously update models of the user’s emotional baseline and latent traits. ES-MemEval’s central goal is to provide a scalable, diagnostic framework for these tasks within data drawn from realistic multi-session ES scenarios (Chen et al., 2 Feb 2026).
2. Core Long-Term Memory Capabilities
Five core memory skills are precisely formalized for evaluation in ES-MemEval:
- Information Extraction (IE): Extraction of explicit facts (entities, events, preferences) from the entire user history. Task: given a question q and the full history H, output an answer a; token-level F1 is the principal metric.
- Temporal Reasoning (TR): Chronological or causal inference over user event timelines. Task: generate a summary that preserves event order. Metrics: ROUGE scores and event-based F1, computed over discrete events extracted from the summary.
- Conflict Detection (CD): Identification and resolution of cross-session contradictions. Task: given a statement s, decide whether it is consistent with the history H (YES/NO/UNKNOWN). Metrics: accuracy or specialized F1.
- Abstention (Abs): Withholding an answer when historical evidence is lacking. Task: output “Cannot answer” where appropriate; measured by abstention precision (the fraction of abstentions that were warranted) and abstention recall (the fraction of unanswerable questions on which the model abstained).
- User Modeling (UM): Inference of latent personality traits, goals, and emotional states. Evaluated in both QA and open-ended dialogue by “LLM-as-Judge,” scoring outputs for user-profile fidelity on a 0–2 scale (QA) and a 0–5 scale (other tasks) (Chen et al., 2 Feb 2026).
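Two of the metrics above, token-level F1 (for IE) and abstention precision/recall, can be sketched as follows. This is a minimal illustration, not the benchmark's published scoring code: tokenization by whitespace and the function names are assumptions.

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between a predicted and a gold answer (IE metric sketch)."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def abstention_pr(predicted_abstain: list, should_abstain: list) -> tuple:
    """Abstention precision (how often 'Cannot answer' was warranted) and
    recall (how many unanswerable questions the model declined)."""
    tp = sum(p and g for p, g in zip(predicted_abstain, should_abstain))
    pred_pos = sum(predicted_abstain)
    gold_pos = sum(should_abstain)
    precision = tp / pred_pos if pred_pos else 0.0
    recall = tp / gold_pos if gold_pos else 0.0
    return precision, recall
```

A model that abstains on two of four questions, only one of which truly lacked evidence, would score 0.5 on both abstention precision and recall.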
3. The EvoEmo Dataset
EvoEmo is a synthetically constructed, annotation-rich dataset of 18 “virtual” users, each with evolving emotional states across up to 33 sessions (mean: 22.3 sessions per user, spanning 15 months and 13.3K tokens per conversation). Profiles derive from ESConv dialogues and controlled timelines of 25 event annotations per user, generated and expanded with GPT-4o plus human-in-the-loop validation. Dialogue generation is conditioned on current events, user profiles, and session summaries, and every turn is annotated for emotion, topic, summary, and “observation” (salient-detail) consistency. The dataset supports 1,209 QA samples (stratified by the five memory capabilities), 125 summaries, and 34 customized dialogue scenarios (Chen et al., 2 Feb 2026).
4. Benchmark Task Suite: Protocols and Metrics
Question Answering
- Inputs: Multi-session history or retrieved context; capability-labeled question.
- Outputs: Short-form answer or YES/NO/“Cannot answer” judgment.
- Ground Truth: Human-annotated answers plus supporting sessions.
- Metrics: Token F1, BERTScore, LLM-as-Judge (0–2), retrieval accuracy (Recall@k, nDCG@k).
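The retrieval-accuracy metrics in the QA protocol (Recall@k, nDCG@k) can be computed over retrieved session IDs as below. Binary relevance and these function names are assumptions for illustration; the benchmark's exact formulation may differ.

```python
import math

def recall_at_k(retrieved: list, gold: set, k: int) -> float:
    """Fraction of gold supporting sessions found among the top-k retrieved."""
    return len(set(retrieved[:k]) & gold) / len(gold)

def ndcg_at_k(retrieved: list, gold: set, k: int) -> float:
    """Binary-relevance nDCG@k: discounted gain of hits over the ideal ranking."""
    dcg = sum(1 / math.log2(i + 2) for i, s in enumerate(retrieved[:k]) if s in gold)
    ideal = sum(1 / math.log2(i + 2) for i in range(min(len(gold), k)))
    return dcg / ideal if ideal else 0.0
```

With a single gold session retrieved at rank 2, nDCG@2 is 1/log2(3) ≈ 0.63, reflecting the rank-position discount that plain Recall@k ignores.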
Summarization
- Inputs: Thematically or temporally grouped sessions per user.
- Outputs: Cross-session summary abstracting state evolution and conflict resolution.
- Ground Truth: Human summaries plus explicit event annotations.
- Metrics: ROUGE-1/2/L, event-based Precision/Recall/F1, LLM-as-Judge (0–5).
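The event-based Precision/Recall/F1 used for summaries compares events extracted from the generated summary against the gold event annotations. A minimal sketch, assuming exact string matching of event labels (the benchmark likely uses a softer matcher):

```python
def event_prf(pred_events: set, gold_events: set) -> tuple:
    """Event-based precision/recall/F1 for cross-session summaries:
    predicted events vs. human-annotated gold events, exact match."""
    tp = len(pred_events & gold_events)
    p = tp / len(pred_events) if pred_events else 0.0
    r = tp / len(gold_events) if gold_events else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```

A summary that mentions two of four annotated events, and nothing spurious, scores precision 1.0, recall 0.5, F1 ≈ 0.67.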
Dialogue Generation
- Inputs: Scenario specs (topic, user state, prior session summaries, turn-level observations).
- System: Model converses with a GPT-4o “simulated user” agent.
- Metrics: Observation recall (identification of relevant user observations), weighted accuracy, and LLM-as-Judge ratings of memory, personalization, and ES quality (all 1–5). Reliability is assessed by weighted Cohen’s κ (≥ 0.6 for QA/Sum) and LLM/human Spearman ρ (≥ 0.7) (Chen et al., 2 Feb 2026).
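The weighted Cohen's κ used for judge reliability can be sketched as quadratic-weighted kappa over ordinal ratings; the quadratic weighting scheme and function name are assumptions, since the paper does not specify the weighting here.

```python
import numpy as np

def quadratic_weighted_kappa(rater_a, rater_b, n_levels):
    """Quadratic-weighted Cohen's kappa for ordinal ratings in 0..n_levels-1."""
    a, b = np.asarray(rater_a), np.asarray(rater_b)
    idx = np.arange(n_levels)
    # Disagreement weights: 0 on the diagonal, growing quadratically with distance.
    w = ((idx[:, None] - idx[None, :]) / (n_levels - 1)) ** 2
    observed = np.zeros((n_levels, n_levels))
    for i, j in zip(a, b):
        observed[i, j] += 1
    observed /= len(a)
    # Chance agreement from the two raters' independent marginal distributions.
    expected = np.outer(np.bincount(a, minlength=n_levels) / len(a),
                        np.bincount(b, minlength=n_levels) / len(b))
    return 1.0 - (w * observed).sum() / (w * expected).sum()
```

Perfect agreement yields κ = 1; systematic maximal disagreement on a 3-level scale yields κ = -1, which is why κ ≥ 0.6 is conventionally read as substantial agreement.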
5. Experimental Paradigms and Comparative Results
Tested approaches include:
- Open-Source Long-Context LLMs: Mistral-8B-Instruct-2410, Phi-3-Medium-128k-Instruct, Mistral-24B-Instruct-2503.
- Commercial LLMs: GPT-3.5-turbo (4K context), GPT-4o (16K context).
- Retrieval-Augmented Generation (RAG): All models augmented via bge-m3 dense retrieval with FAISS, defaulting to session-level granularity and top-k=4 retrievals.
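The RAG setup above (dense session embeddings, top-k retrieval) can be illustrated with a toy sketch. A bag-of-words embedding and brute-force cosine scan stand in for bge-m3 vectors and the FAISS index; the vocabulary, session texts, and function names are invented for the example.

```python
import numpy as np

def bow_embed(text: str, vocab: list) -> np.ndarray:
    """Toy bag-of-words embedding standing in for bge-m3 dense vectors."""
    v = np.array([text.lower().split().count(w) for w in vocab], dtype=float)
    n = np.linalg.norm(v)
    return v / n if n else v

def retrieve_sessions(query: str, sessions: list, vocab: list, k: int = 4) -> list:
    """Session-level retrieval: rank whole sessions by cosine similarity
    to the query (brute-force scan in place of a FAISS index)."""
    q = bow_embed(query, vocab)
    scores = [float(bow_embed(s, vocab) @ q) for s in sessions]
    return sorted(range(len(sessions)), key=lambda i: -scores[i])[:k]

vocab = ["breakup", "job", "guitar", "sad", "promotion"]
sessions = [
    "she talked about the breakup and felt sad",
    "she got a promotion at her job",
    "she started learning guitar",
]
top = retrieve_sessions("breakup sad", sessions, vocab, k=1)
```

Retrieving whole sessions (rather than single turns or multi-session chunks) mirrors the benchmark's default session-level granularity with top-k = 4.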
Main Findings
- QA Performance: Baselines without retrieval struggle. RAG improves open-source models (e.g., token F1 15.5% → 18.8% for Mistral-24B) and LLM-as-Judge scores (1.01 → 1.27). Best: GPT-4o+RAG (token F1 = 23.9%, LLM-as-Judge = 1.33).
- Capability Breakdown: IE, TR, CD, and UM all benefit from RAG, but TR and UM remain below 20% token F1. Abstention outputs are inconsistent; GPT-4o becomes overconfident (LLM-as-Judge drops from 1.67 to 1.30).
- Retrieval Strategy: Session-level memory consistently outperforms finer- or coarser-grained slices, with Recall@k peaking at 62%; increasing k further improves Recall@k.
- Context-Length Effects: Smaller models (e.g., Mistral-8B) cannot exploit very long contexts; performance plateaus or degrades as the history grows. RAG mitigates this limitation.
- Summarization: RAG nearly doubles ROUGE-L and event F1 for open-source LLMs, e.g., Mistral-24B ROUGE-L 10.9 → 21.0 and event F1 26.8 → 48.1.
- Dialogue Generation: Observation recall increases for all models given memory or RAG (Mistral-24B recall 0.20 → 0.35). LLM-as-Judge ratings of memory and personalization climb from roughly 2.5 (no memory) to 4.5 (full history/RAG); emotional support quality is less memory-sensitive (3.0 to 4.7).
- Qualitative Examples: Models without explicit memory fail to connect current user states to previous events (e.g., failing to link breakup with later emotional triggers). RAG-empowered agents can reference long-past disclosures and demonstrate nuanced empathy. Conflict detection and nuanced temporal reasoning remain open challenges, as even full-context LLMs misinterpret or ignore subtle contradictions.
- Reliability: Agreement between human and LLM judges is substantial (weighted Cohen’s κ ≥ 0.6, Spearman ρ ≥ 0.7) (Chen et al., 2 Feb 2026).
6. Diagnostic Insights and Ablation Analyses
Ablation of retrieval granularity and memory storage reveals session-level segmentation is most effective for capturing the sparse, shifting signals characteristic of ES data. RAG narrows the performance gap between open-source and commercial LLMs, though commercial systems remain superior in overall scores. Notably, increasing context length alone does not suffice—effective agent memory requires explicit retrieval and integration mechanisms.
7. Conclusions and Future Research Directions
Key conclusions established by ES-MemEval include:
- Explicit, structured long-term memory is essential for minimizing hallucinations and tailoring agent responses to evolving user states.
- RAG mechanisms substantially improve retrieval of factual and personalized content but struggle with subtler temporal and conflictive phenomena.
- Effective personalization is strongly dependent on robust memory architectures, while the affective (emotional support) dimension is less memory-dependent.
- Coarse memory granularity (session-level) aligns with the sparse disclosure patterns in ES conversations.
- Model size and context capacity alone do not resolve long-term coherence; hybrid memory–retrieval solutions are required.
- RAG and segmentation-based approaches decrease the performance gap between open and proprietary LLMs (Chen et al., 2 Feb 2026).
Future avenues highlighted include retrieval-adaptive calibration (for dynamic timelines), adaptive and hybrid neural-symbolic memory schemas (for conflict detection and user-state graphs), and expansion of EvoEmo to encompass greater cultural, linguistic, and domain diversity. ES-MemEval and EvoEmo thus set a rigorous standard for the development of personalized, contextually aware, and temporally robust conversational agents.