EmpatheticDialogues Dataset Overview

Updated 27 April 2026

EmpatheticDialogues is a large-scale corpus focused on training empathetic conversational agents, featuring multi-turn dialogues anchored in personal emotional situations.
The dataset was collected through two-stage crowdsourcing: one worker writes a short narrative prompted by one of 32 emotion labels, and a second worker replies with free-form empathetic responses without seeing that label.
It underpins benchmark evaluations in affective computing, using metrics like BLEU, BERTScore, and human empathy ratings to advance dialogue system research.

EmpatheticDialogues (ED) is a large-scale, open-domain corpus designed to benchmark and train conversational agents in the skill of empathetic response generation. Each ED conversation is anchored by a personal, first-person description of an emotion-laden situation, paired with free-form, multi-turn chat between two human participants. The dataset systematically covers 32 fine-grained emotions and provides both contextual richness and standardization, making it foundational for affective computing and dialogue system research (Rashkin et al., 2018, Raamkumar et al., 2022, Sotolar et al., 2024).

1. Dataset Construction and Annotation Protocol

Construction of EmpatheticDialogues followed a two-stage crowdsourcing methodology on Amazon Mechanical Turk (Rashkin et al., 2018):

Speaker prompt: Workers were assigned one of 32 discrete emotion labels (drawn from resources such as ISEAR, SemEval, and DailyDialog), then composed a brief narrative recounting a real-world incident embodying that emotion (mean 19.8 words).
Dialogue elicitation: The narrative was forwarded to another worker (“listener”), who was tasked with responding empathetically through unscripted text chat, with the two workers alternating for 4–8 turns (mean ≈4.31). Listeners never saw the original emotion label, so their responses were grounded solely in the conversational context.

Annotation yielded a single categorical label per dialogue, corresponding to the eliciting emotion. No secondary annotation (multi-label, span annotation, or agreement/adjudication) was performed (Raamkumar et al., 2022). Workers were capped in their contributions to ensure emotional and participant diversity.

2. Corpus Statistics and Structure

EmpatheticDialogues covers a broad range of personal events—such as milestones, setbacks, relationships, and daily incidents—systematically sampled across emotion space. Key corpus statistics are:

Dataset Metric	Value
Dialogues	24,850
Turns per Dialogue	4–8 (capped; mean ≈4.31)
Total Utterances	~107,000
Emotion Classes	32 (nearly uniform)
Avg. Utterance Length	15.2 words
Avg. Situation Length	19.8 words
Data Splits (train/dev/test)	19,533 / 2,770 / 2,547

The JSON format encodes each dialogue as ordered pairs of “speaker” and “listener” turns; the initial turn is always the speaker’s paraphrase of the situation, followed by alternating roles (Rashkin et al., 2018, Raamkumar et al., 2022).
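
As a rough illustration of the structure described above, the following Python sketch models a single dialogue record and checks the speaker-first alternation. The class and field names (`Turn`, `Dialogue`, `emotion`, `situation`, `turns`, `role`, `text`) and the sample content are assumptions made for illustration, not the schema of the released files.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Turn:
    role: str   # "speaker" or "listener" (assumed role names)
    text: str


@dataclass
class Dialogue:
    emotion: str      # single dialogue-level emotion label
    situation: str    # speaker's written description of the event
    turns: List[Turn]


def roles_alternate(dialogue: Dialogue) -> bool:
    """Check that the speaker opens and the roles strictly alternate."""
    expected = ("speaker", "listener")
    return all(t.role == expected[i % 2] for i, t in enumerate(dialogue.turns))


def mean_turns(dialogues: List[Dialogue]) -> float:
    """Average number of turns per dialogue."""
    return sum(len(d.turns) for d in dialogues) / len(dialogues)


# Invented example record, for illustration only.
example = Dialogue(
    emotion="proud",
    situation="My daughter graduated at the top of her class.",
    turns=[
        Turn("speaker", "My daughter just graduated at the top of her class!"),
        Turn("listener", "That is wonderful, you must be so proud of her."),
        Turn("speaker", "I am. She worked hard for years."),
        Turn("listener", "It sounds like all that effort really paid off."),
    ],
)

print(roles_alternate(example))   # True
print(mean_turns([example]))      # 4.0
```

Keeping the emotion label and situation at the dialogue level, separate from the turns, mirrors the single-label-per-dialogue annotation described in Section 1.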

3. Emotion Taxonomy and Distribution

The 32 emotion categories originate from established affective taxonomies, mapped to include both basic (e.g., “joy,” “fear,” “anger”) and nuanced (e.g., “nostalgic,” “sentimental,” “hopeful,” “proud,” “apprehensive”) affective states. Dialogue collection was stratified to promote near-uniform representation across all classes; per-class deviations are reported within ±2% (Rashkin et al., 2018, Lin et al., 2022). Each dialogue inherits a single emotion label, with no provision for concurrent or graded affect annotation.
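
One way to check near-uniformity is to compare each class's share of dialogues with the uniform share of 1/32. The sketch below does this on a toy label list; reading the ±2% figure as a deviation in share is an assumption, and the function name and data are invented for illustration.

```python
from collections import Counter
from typing import Dict, Iterable


def share_deviation(labels: Iterable[str], num_classes: int) -> Dict[str, float]:
    """Each observed label's share of dialogues minus the uniform share 1/num_classes."""
    counts = Counter(labels)
    total = sum(counts.values())
    uniform = 1.0 / num_classes
    return {label: count / total - uniform for label, count in counts.items()}


# Toy label list with 4 classes instead of 32, invented for illustration.
toy = ["sad"] * 26 + ["proud"] * 25 + ["afraid"] * 25 + ["joyful"] * 24
dev = share_deviation(toy, num_classes=4)
print(max(abs(d) for d in dev.values()))   # about 0.01, i.e. one percentage point
```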

An abridged sample of emotion labels includes: “afraid,” “anxious,” “confident,” “disgusted,” “excited,” “grateful,” “jealous,” “joyful,” “proud,” “relieved,” “sad,” “surprised,” “terrified,” “trusting.”

4. Methodological Role in Empathetic Response Generation

EmpatheticDialogues is the de facto benchmark for evaluating affect-sensitive dialogue models (Raamkumar et al., 2022, Lin et al., 2022, Sotolar et al., 2024). Typical machine learning scenarios include:

Response Generation: Given a dialogue context, predict the next “listener” reply. Architectures are either retrieval-based (ranking candidate responses by encoder similarity) or generative (transformer-based, producing text token-by-token). The canonical training objective is negative log-likelihood:

$\mathcal{L} = -\sum_t \log P(y_t | y_{<t}, X)$

where $X$ is the dialogue context.

Emotion Classification: Identify the correct emotion label from a dialogue snippet. Because each dialogue carries exactly one label, baselines are multiclass classifiers, typically built on BERT or RoBERTa, and are evaluated by accuracy.
Preference Optimization: Recent work constructs positive/negative preference pairs using ED’s emotion labels as ground-truth for reward-based model alignment (Direct Preference Optimization), leveraging “polar-opposite” label mappings derived from Plutchik’s theory (Sotolar et al., 2024).
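
The negative log-likelihood objective given for response generation can be computed by hand once the model's per-token probabilities are known. The following minimal sketch assumes those probabilities are already available as a plain list (the values are made up); it also derives per-token perplexity, which is the exponential of the mean token-level loss.

```python
import math
from typing import Sequence


def sequence_nll(token_probs: Sequence[float]) -> float:
    """Negative log-likelihood of one reply.

    token_probs[t] is P(y_t | y_<t, X): the probability the model assigned
    to the reference token at step t, given the earlier reply tokens and
    the dialogue context X.
    """
    if any(p <= 0.0 or p > 1.0 for p in token_probs):
        raise ValueError("probabilities must lie in (0, 1]")
    return -sum(math.log(p) for p in token_probs)


def perplexity(token_probs: Sequence[float]) -> float:
    """Per-token perplexity: exp of the mean token-level NLL."""
    return math.exp(sequence_nll(token_probs) / len(token_probs))


probs = [0.5, 0.25, 0.5]    # made-up per-token probabilities
print(sequence_nll(probs))  # ln(16), roughly 2.7726
print(perplexity(probs))    # 16 ** (1/3), roughly 2.5198
```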

For training, common practice is to tokenize and concatenate the dialogue history, optionally mapping “speaker”/“listener” to special tokens. In preference optimization settings, turns are masked such that the model only updates on “listener” turns, using matched/unmatched responses by emotion polarity as positive/negative targets.
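
The serialization and masking described above can be sketched as follows. Everything named here is hypothetical: the role markers `<spk>` and `<lst>`, the whitespace split standing in for a real tokenizer, the `prompt`/`chosen`/`rejected` keys, and the small `OPPOSITE` table, which only illustrates Plutchik-style opposites with ED-style labels and is not the mapping used in the cited work.

```python
from typing import Dict, List, Tuple

# Hypothetical role markers; real systems define their own special tokens.
ROLE_TOKENS = {"speaker": "<spk>", "listener": "<lst>"}

# Illustrative opposite pairs following Plutchik's wheel (assumed mapping).
OPPOSITE = {
    "joyful": "sad", "sad": "joyful",
    "trusting": "disgusted", "disgusted": "trusting",
    "afraid": "angry", "angry": "afraid",
    "surprised": "anticipating", "anticipating": "surprised",
}


def serialize(turns: List[Tuple[str, str]]) -> Tuple[List[str], List[int]]:
    """Flatten (role, text) turns into tokens plus a loss mask.

    The mask is 1 for words inside listener turns and 0 elsewhere (role
    markers included), so a trainer that multiplies the per-token loss by
    the mask updates only on listener content.
    """
    tokens: List[str] = []
    mask: List[int] = []
    for role, text in turns:
        words = text.split()  # stand-in for a subword tokenizer
        tokens.append(ROLE_TOKENS[role])
        mask.append(0)
        tokens.extend(words)
        mask.extend([1 if role == "listener" else 0] * len(words))
    return tokens, mask


def preference_pair(context: str, emotion: str,
                    replies_by_emotion: Dict[str, str]) -> Dict[str, str]:
    """Pair a reply matching the dialogue's emotion (chosen) with a reply
    written for the opposite emotion (rejected)."""
    return {
        "prompt": context,
        "chosen": replies_by_emotion[emotion],
        "rejected": replies_by_emotion[OPPOSITE[emotion]],
    }


tokens, mask = serialize([("speaker", "I lost my keys"), ("listener", "Oh no")])
print(tokens)  # ['<spk>', 'I', 'lost', 'my', 'keys', '<lst>', 'Oh', 'no']
print(mask)    # [0, 0, 0, 0, 0, 0, 1, 1]

pair = preference_pair(
    "I lost my dog today.",
    "sad",
    {"sad": "I am so sorry, that must hurt.", "joyful": "That is fantastic news!"},
)
print(pair["rejected"])  # That is fantastic news!
```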

5. Evaluation Protocols

Evaluation of models trained on EmpatheticDialogues employs both automatic and human-centered metrics (Rashkin et al., 2018, Raamkumar et al., 2022, Lin et al., 2022, Sotolar et al., 2024):

Automatic metrics: Perplexity (generative), BLEU-n, Distinct-n (diversity proxy), Precision@1 (retrieval), and BERTScore for semantic congruence. For empathy alignment, diff-EPITOME measures the mean absolute difference between the EPITOME scores of model responses and those of gold-standard human responses on each of the “Emotional Reactions,” “Explorations,” and “Interpretations” axes, with $k$ indexing the axis and $N$ the number of evaluated responses:

$\text{diff-k} = \frac{1}{N} \sum_{i=1}^N | \text{score}_k(\text{model}_i) - \text{score}_k(\text{gold}_i) |$

Human evaluation: Annotators rate fluency, empathy (showing understanding of affect), and relevance on a five-point Likert scale. Empathy ratings for fine-tuned transformer models reach 3.7–4.0, substantially greater than non-empathetic baselines (~2.3–2.8).
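
Two of the automatic metrics above are simple enough to compute directly. The sketch below implements the diff-k formula for one empathy axis and a corpus-level Distinct-n. The scores are made up, whitespace splitting stands in for real tokenization, and the corpus-level pooling of n-grams is one common variant rather than a confirmed detail of the cited evaluations.

```python
from typing import List, Sequence


def diff_k(model_scores: Sequence[float], gold_scores: Sequence[float]) -> float:
    """Mean absolute difference between model and gold scores on one axis k."""
    if len(model_scores) != len(gold_scores) or not model_scores:
        raise ValueError("need two non-empty score lists of equal length")
    total = sum(abs(m - g) for m, g in zip(model_scores, gold_scores))
    return total / len(model_scores)


def distinct_n(responses: List[str], n: int) -> float:
    """Distinct-n: unique n-grams divided by total n-grams over all responses."""
    ngrams = []
    for response in responses:
        words = response.split()
        ngrams.extend(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0


# Made-up scores on one empathy axis for three responses.
print(diff_k([1, 0, 2], [2, 0, 0]))                # (1 + 0 + 2) / 3 = 1.0
print(distinct_n(["i am sorry", "i am here"], 2))  # 3 unique / 4 total = 0.75
```

A lower diff-k means the model's empathy profile on that axis sits closer to the human reference, while a higher Distinct-n indicates less repetitive output.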

6. Impact, Variants, and Research Directions

EmpatheticDialogues is broadly credited as the foundational resource catalyzing empathetic conversational agent research (Raamkumar et al., 2022). Successors and related corpora include multilingual variants (e.g., Japanese EmpatheticDialogues with eight Plutchik-derived top-level emotions (Pang et al., 2024)), and more complex annotation schemas (e.g., emotion intensity, cause marking), though these remain less widespread.

Recommended extensions include incorporating multimodal signals (audio, video), moving beyond a single-label taxonomy to graded or multi-label affect, modeling dialogue acts and empathy mechanisms (consolation, mirroring), and annotating emotion-causing text spans. The restriction to dialogues of 4–8 turns and to text as the only modality remains a limitation for modeling real-world affect-rich dialogue.

EmpatheticDialogues continues to underpin state-of-the-art work on empathetic systems, including content-emotion disentanglement (Lin et al., 2022), preference-based LLM alignment (Sotolar et al., 2024), knowledge-infused and emotion-cause-aware models, and the co-design of new metrics for computational empathy.

7. Accessibility and Ethical Considerations

The data and splits are publicly available via the original authors’ GitHub and the ParlAI framework. Licensing details are included with the repository; crowdworker participation was capped to ensure topic and demographic diversity, though no participant-level demographics or inter-annotator agreement figures are published (Rashkin et al., 2018, Raamkumar et al., 2022). The dataset’s design—requiring workers to recount genuine emotional episodes and then respond to anonymized, contextually rich scenarios—has proven effective for naturalistic, high-variance affective response modeling.

EmpatheticDialogues represents the canonical resource for benchmarking advances in machine empathy, driving both architectural and evaluative innovation in affective open-domain dialogue systems (Rashkin et al., 2018, Raamkumar et al., 2022, Lin et al., 2022, Sotolar et al., 2024).