EvoEmo Dataset Overview

Updated 14 February 2026

EvoEmo is a multi-session dataset that simulates extended emotional support dialogues with detailed temporal and causal annotations.
It employs synthetic dialogue generation and rigorous annotation protocols to capture implicit user disclosures and evolving emotional states.
The dataset underpins the ES-MemEval benchmark, enabling robust evaluation of memory, temporal reasoning, and personalization in conversational agents.

EvoEmo is a multi-session dataset created for the systematic evaluation of long-term memory capabilities in conversational agents providing personalized emotional support (ES). Engineered to address deficiencies in existing benchmarks—particularly the lack of emphasis on implicit, fragmented user disclosures and evolving user trajectories—it underpins the ES-MemEval benchmark suite. EvoEmo models user interactions over extended timeframes, enabling the assessment of information extraction, temporal reasoning, conflict detection, abstention, and user modeling in emotionally supportive dialogues (Chen et al., 2 Feb 2026).

1. Motivation and Objectives

The primary motivation behind EvoEmo is the recognition that real-world emotional support frequently spans days to months, requiring conversational agents to recall, abstract, and reason across fragmented or evolving user disclosures. Existing dialogue resources tend to focus on static knowledge or short-term exchanges, thereby neglecting the complexity presented by longitudinal, personalized ES scenarios. EvoEmo was specifically constructed to:

Model causally coherent, temporally longitudinal user trajectories.
Capture the fragmentation and implicitness of user disclosures—details that may recur or shift semantic weight across sessions.
Supply a high-quality, richly annotated corpus for the empirical study of personalization, temporal abstraction, conflict detection, and abstention within ES dialogues (Chen et al., 2 Feb 2026).

2. Data Construction and Session Organization

EvoEmo consists of 18 virtual users, each represented by detailed demographic, personality, and relational profiles. The dataset comprises 401 sessions, averaging 22.3 sessions (and ∼14.9 months) per user. Each session contains approximately 23.4 conversational turns, for ≈9,400 turns in total, and conversations average ≈13,291.6 tokens.
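
The derived figures can be cross-checked against the reported totals. The following is a minimal sketch that uses only the numbers quoted in this section; the variable names are illustrative and not part of the dataset.

```python
# Figures quoted in the text above.
total_users = 18
total_sessions = 401
avg_turns_per_session = 23.4

# Average sessions per user: total sessions divided by total users.
avg_sessions_per_user = total_sessions / total_users

# Approximate total turns: sessions times average turns per session.
approx_total_turns = total_sessions * avg_turns_per_session

print(f"{avg_sessions_per_user:.1f} sessions per user")
print(f"{approx_total_turns:.0f} turns in total (approx.)")
```

This gives roughly 22.3 sessions per user and about 9,383 turns, consistent with the rounded figure of ≈9,400 turns.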

Session Generation Protocol

Seed Data: Initial sessions are sampled from the public ESConv dataset.
Synthetic Expansion: GPT-4o is employed to generate individualized event timelines with an average of 24.8 events per user (indexed by timestamps $t_i$ and event labels $e_i$). Each event timeline is verified by trained annotators for temporal and causal coherence. Subsequent novel ES sessions are generated by GPT-4o conditioned on the current event, the user profile, and summaries of prior sessions; all outputs are then reviewed by six annotators for safety, coherence, and fidelity to user profiles.
Ethical Guidelines: All dialogues are fully synthetic and grounded in anonymized seed scenarios, with explicit protocols ensuring annotator compensation and ethical standards; no real user data is disclosed.

Fragmented and Deferred Disclosure

Sessions permit partial, implicit, or deferred reference to user events. For example, a brief mention of a breakup may occur in session one, whereas its emotional ramifications are only articulated in session seven. This construction supports the evaluation of agents’ ability to contextualize indirect emotional expressions based on long-term memory (Chen et al., 2 Feb 2026).

3. Annotation Framework and Taxonomy

EvoEmo is comprehensively annotated to facilitate diverse evaluation protocols:

Emotion Category: Each session is tagged with a primary emotion (e.g., sadness, anxiety).
Dialogue Topic: Assigned to one of eight thematic categories: Emotion & Mood, Career & Study, Relationships, Love & Intimacy, Family Issues, Self-Growth, Treatment & Help-Seeking, Behavior Issues.
Turn-Level Observations: Fine-grained atomic statements reflecting user states or preferences (e.g., “prefers alone time after work”).
Event Timeline Annotations: Each event $e_i$ is timestamped $t_i$ and includes causal links and state-change descriptors.
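
The annotation layers listed above can be pictured as a small data model. The sketch below is an assumption for illustration only: the class names, field names, and the `causes_precede_effects` helper are hypothetical and do not reflect the dataset's actual release format, and the example events and dates are invented.

```python
from dataclasses import dataclass, field
from datetime import date

# The eight dialogue topics named in the annotation taxonomy.
TOPICS = {
    "Emotion & Mood", "Career & Study", "Relationships", "Love & Intimacy",
    "Family Issues", "Self-Growth", "Treatment & Help-Seeking", "Behavior Issues",
}


@dataclass
class Event:
    """One timeline entry e_i with timestamp t_i (hypothetical field names)."""
    event_id: str
    timestamp: date                                      # t_i
    label: str                                           # e_i
    caused_by: list[str] = field(default_factory=list)   # ids of causing events
    state_change: str = ""                               # state-change descriptor


@dataclass
class Session:
    """One annotated session (hypothetical field names)."""
    session_id: str
    emotion: str                                         # primary emotion
    topic: str                                           # one of TOPICS
    observations: list[str] = field(default_factory=list)  # turn-level observations


def causes_precede_effects(timeline: list[Event]) -> bool:
    """Return True if every causal link points to a known, not-later event."""
    by_id = {e.event_id: e for e in timeline}
    return all(
        cause in by_id and by_id[cause].timestamp <= e.timestamp
        for e in timeline
        for cause in e.caused_by
    )


# Invented example data, for illustration only.
timeline = [
    Event("e1", date(2025, 1, 10), "breakup"),
    Event("e2", date(2025, 3, 2), "withdraws from friends",
          caused_by=["e1"], state_change="social contact decreases"),
]
session = Session("s1", emotion="sadness", topic="Love & Intimacy",
                  observations=["prefers alone time after work"])

print(causes_precede_effects(timeline), session.topic in TOPICS)
```

A check such as `causes_precede_effects` mirrors, in a very reduced form, the temporal and causal coherence that the annotators verify by hand.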

QA and Summarization Labels

Each QA instance is annotated by its question type from the set {IE (Information Extraction), TR (Temporal Reasoning), CD (Conflict Detection), Abs (Abstention), UM (User Modeling)}. Summarization cases emphasize temporal reasoning and user modeling.
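
The five question types form a closed label set, which makes per-type bookkeeping straightforward. The sketch below is illustrative: `QuestionType`, `QAInstance`, and `count_by_type` are hypothetical names, and the example questions are invented rather than drawn from ES-MemEval.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class QuestionType(Enum):
    """The five QA question types named in the text."""
    IE = "Information Extraction"
    TR = "Temporal Reasoning"
    CD = "Conflict Detection"
    ABS = "Abstention"
    UM = "User Modeling"


@dataclass
class QAInstance:
    """A single QA item with its question-type label (hypothetical layout)."""
    question: str
    answer: str
    qtype: QuestionType


def count_by_type(instances):
    """Count how many QA instances carry each question type."""
    return Counter(inst.qtype for inst in instances)


# Invented examples, for illustration only.
qa_items = [
    QAInstance("What did the user say about their job?", "They were laid off.",
               QuestionType.IE),
    QAInstance("Did the breakup happen before the move?", "Yes.", QuestionType.TR),
    QAInstance("What is the user's favorite film?", "Not mentioned.",
               QuestionType.ABS),
]

for qtype, n in count_by_type(qa_items).items():
    print(qtype.name, n)
```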

Metrics

Event-based summaries are evaluated using set-based precision $P = |E_{ref} \cap E_{gen}|/|E_{gen}|$, recall $R = |E_{ref} \cap E_{gen}|/|E_{ref}|$, and $F_1$. Retrieval efficacy in QA is measured by Recall@k and nDCG@k, using their standard definitions (Chen et al., 2 Feb 2026).
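
The metrics above can be sketched in a few lines. This is a minimal illustration under stated assumptions: events are compared by exact set membership (the benchmark's actual event-matching procedure is not specified here), relevance for Recall@k and nDCG@k is binary, $F_1$ is the usual harmonic mean $2PR/(P+R)$, and all function names and example data are hypothetical.

```python
import math


def event_prf(ref_events, gen_events):
    """Set-based precision, recall, and F1 over reference and generated events."""
    ref, gen = set(ref_events), set(gen_events)
    overlap = len(ref & gen)
    precision = overlap / len(gen) if gen else 0.0
    recall = overlap / len(ref) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of relevant items that appear in the top-k of the ranking."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    return len(set(ranked_ids[:k]) & relevant) / len(relevant)


def ndcg_at_k(ranked_ids, relevant_ids, k):
    """nDCG@k with binary relevance and a log2 position discount."""
    relevant = set(relevant_ids)
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, item in enumerate(ranked_ids[:k])
        if item in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg else 0.0


# Invented example data, for illustration only.
ref = {"breakup", "job loss", "moved cities"}
gen = {"breakup", "job loss", "adopted a dog", "new hobby"}
p, r, f1 = event_prf(ref, gen)
print(f"P={p:.3f} R={r:.3f} F1={f1:.3f}")

ranked = ["s3", "s7", "s1", "s9"]
relevant = {"s1", "s7"}
print(f"Recall@2={recall_at_k(ranked, relevant, 2):.3f}")
print(f"nDCG@3={ndcg_at_k(ranked, relevant, 3):.3f}")
```

In this toy example two of the four generated events match the three reference events, so $P = 0.5$, $R \approx 0.667$, and $F_1 \approx 0.571$.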

4. Corpus Statistics and Evaluation Structure

Key EvoEmo and ES-MemEval statistics are summarized below:

Statistic	Value
Avg. time span per user (months)	14.9
Avg. sessions per user	22.3
Avg. turns per session	23.4
Avg. tokens per conversation	13,291.6
Total users	18
Total sessions	401
QA samples (ES-MemEval)	1,209
Summarization cases	125
Dialogue scenarios	34

Rather than traditional train/validation/test splits, the EvoEmo dataset forms the substrate for three held-out evaluation sets targeting zero-shot or few-shot evaluation: question answering, summarization, and dialogue generation.

5. Supported Tasks and Memory Benchmarks

EvoEmo underlies the ES-MemEval evaluation suite, which encompasses:

Question Answering: Multi-session context retrieval and reasoning for the five question types (IE, TR, CD, Abs, UM).
Summarization: Abstraction across session boundaries to produce temporally coherent and personality-aware overviews.
Dialogue Generation: Open-ended ES dialogues necessitating the integration of evolving user history for effective personalization.

Each task leverages EvoEmo’s suite of annotated profiles, event timelines, and state descriptors to systematically evaluate the five core memory-oriented competencies (Chen et al., 2 Feb 2026).

6. Applications, Limitations, and Prospects

EvoEmo supports several research directions:

Benchmarking architectures—especially the contrast between long-context LLMs and retrieval-augmented generation (RAG)—with respect to memory orientation in personalized ES.
Development and evaluation of memory modules that model state transitions and personality evolution across prolonged interactions.
Experimentation with conflict-aware and abstention-savvy agent behaviors to enhance conversational safety and reliability.

Identified Limitations

Synthetic Data: All dialogues are GPT-generated; linguistic and psychological characteristics of real human interactions may be underrepresented.
Limited Scale and Diversity: Contains only 18 users and 401 sessions; demographic and topical breadth is circumscribed. Self-growth and treatment seeking are underrepresented topic areas.
Memory Granularity: Utilizes session-level memory units; finer granularities (turn- or summary-level) and alternative retrieval algorithms (e.g., BM25, DRAGON) remain to be explored.
Ethical Risk: While synthetic, the risk of hallucinated sensitive personal history exists. Adversarial/safety-stress scenarios and stringent human oversight protocols are recommended for future iterations.
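
To make the memory-granularity point concrete, the sketch below ranks session-level memory units with a plain Okapi BM25 scorer, one of the retrieval algorithms named above as unexplored. It is an assumption-laden illustration, not the benchmark's retrieval pipeline: the whitespace tokenizer, the parameter defaults `k1=1.5` and `b=0.75`, the function names, and the session texts are all hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (a deliberate simplification)."""
    return text.lower().split()


def bm25_rank(query, sessions, k1=1.5, b=0.75):
    """Rank session ids by Okapi BM25 score against the query, best first.

    `sessions` maps a session id to the full text of that session, so each
    session is treated as one memory unit.
    """
    docs = {sid: tokenize(text) for sid, text in sessions.items()}
    if not docs:
        return []
    n_docs = len(docs)
    avg_len = (sum(len(toks) for toks in docs.values()) / n_docs) or 1.0
    # Document frequency: number of sessions containing each term.
    df = Counter(term for toks in docs.values() for term in set(toks))
    query_terms = tokenize(query)

    scores = {}
    for sid, toks in docs.items():
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores[sid] = score
    return sorted(scores, key=scores.get, reverse=True)


# Invented session texts, for illustration only.
sessions = {
    "s1": "we broke up last week and i do not want to talk about it",
    "s2": "work has been stressful and my manager keeps adding tasks",
    "s7": "since the breakup i have been avoiding my friends",
}
print(bm25_rank("breakup friends", sessions))
```

In this toy example the query matches session `s7` lexically but not `s1`, where the same event is phrased as "broke up". Purely lexical retrieval can miss such indirect mentions, which is one reason finer memory granularities and alternative retrievers are listed as open questions.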

A plausible implication is that expanding the dataset's scale and diversity, and integrating real ES logs (de-identified and shared with consent), would further enhance its ecological validity and benchmarking utility.

7. Significance and Future Directions

EvoEmo, together with the ES-MemEval framework, constitutes a foundational resource for advancing the study and deployment of conversational AI in personalized, long-term emotional support. It enables systematic benchmarking of agents’ abilities to aggregate and reason over extended user trajectories characterized by fragmented, implicit, and dynamic personal history. The gap between synthetic and naturalistic dialogue, as well as the need for more balanced topic and demographic coverage, represents key frontiers for future dataset releases and research (Chen et al., 2 Feb 2026).

References (1)

ES-MemEval: Benchmarking Conversational Agents on Personalized Long-Term Emotional Support (2026)
