
DailyDialog Dataset Overview

Updated 4 December 2025
  • DailyDialog is a large-scale dialogue dataset composed of human-written, two-speaker interactions focused on daily-life topics and open-domain conversation modeling.
  • It features meticulous manual annotations for dialog acts and emotion categories, providing rich insights into conversational dynamics and linguistic quality.
  • DailyDialog++ extends the corpus with multi-reference responses and adversarial negatives, enhancing robust evaluations for dialog systems and metrics.

DailyDialog is a large-scale, high-quality multi-turn dialogue dataset constructed from human-written conversations between two speakers. Designed primarily for open-domain dialogue modeling, emotion recognition in conversations (ERC), and dialog act classification, it is characterized by careful manual annotation, topical diversity, and its adoption as a standard benchmark for open-domain dialog system research (Li et al., 2017).

1. Corpus Construction and Statistical Properties

DailyDialog was crawled from English-learning websites to capture naturalistic, daily two-speaker interactions. The canonical release comprises 13,118 dialogues with an average of 7.9 turns per dialogue and approximately 14.6 tokens per utterance, for a total of 102,979 utterances (Li et al., 2017; Pereira et al., 2023). The conversations are more formal and less noisy than movie subtitles or online forum threads, reflecting language commonly used in daily life. De-duplication, spelling correction, and exclusion of multi-party exchanges (three or more speakers) were applied to ensure high linguistic quality.

Vocabulary statistics indicate roughly 25,000 word types in the capped vocabulary used for modeling experiments, though the raw corpus exceeds 45,000 distinct surface forms (Li et al., 2017).

The topic distribution is dominated by daily-life subjects. The ten major topics and their proportions are:

| Topic | Percentage of Dialogues |
|---|---|
| Relationship | 33.33 % |
| Ordinary Life | 28.26 % |
| Work | 14.49 % |
| Tourism | 4.65 % |
| School | 4.11 % |
| Health | 3.22 % |
| Communication | 3.17 % |
| Finance | 2.12 % |
| Entertainment | 1.64 % |
| Politics | 1.01 % |

The corpus is split into training, validation (development), and test sets of 11,118, 1,000, and 1,000 dialogues, respectively (the “Yanran splits”) (Pereira et al., 2023). This corresponds to approximately 87,000, 7,000, and 8,000 utterances per split.
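
For orientation, the split sizes and per-dialogue statistics cited above can be recomputed from a publicly hosted copy of the corpus. The sketch below assumes the Hugging Face Hub mirror named `daily_dialog` with a `dialog` field holding the utterance list; the dataset identifier and field layout are assumptions to verify against the release actually used, and depending on the `datasets` version loading may additionally require `trust_remote_code=True`.

```python
# Minimal sketch: load a DailyDialog mirror and recompute the corpus statistics above.
# Assumes the Hugging Face Hub dataset "daily_dialog" with a "dialog" field per example.
from datasets import load_dataset

ds = load_dataset("daily_dialog")  # splits: train / validation / test

for split in ("train", "validation", "test"):
    dialogues = ds[split]["dialog"]                       # list of utterance lists
    n_utts = sum(len(d) for d in dialogues)
    avg_turns = n_utts / len(dialogues)
    avg_tokens = sum(len(u.split()) for d in dialogues for u in d) / n_utts
    print(f"{split}: {len(dialogues)} dialogues, {n_utts} utterances, "
          f"{avg_turns:.1f} turns/dialogue, {avg_tokens:.1f} tokens/utterance")
```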

2. Annotation Schema: Dialog Acts and Emotion Labels

DailyDialog provides manual utterance-level annotation along two axes: dialog acts (communicative intention) and emotion categories (Li et al., 2017).

Dialog Act Annotation

Based on conventions from Amanova et al. (2016) and ISO standards, every utterance is labeled as one of:

  • Inform (45.2%)
  • Questions (28.6%)
  • Directives (16.8%)
  • Commissive (9.4%)

Bi-turn and multi-turn flow patterns (e.g., Questions→Inform, Proposal→Commitment) are explicitly represented. Approximately 18.3% of the dialogues contain an “Answer + New Question” turn, and 9.2% feature “Proposal → Counter-proposal → Commitment”.

Emotion Category Annotation

Each utterance receives a single label drawn from Ekman’s six basic emotions plus a neutral/other category (Li et al., 2017; Pereira et al., 2023). The official set comprises:

  • Anger (Ang)
  • Disgust (Disg)
  • Fear (Fear)
  • Happiness (Hap)
  • Sadness (Sad)
  • Surprise (Sur)
  • Neutral/Other (Neu/Other)

Emotion label distribution (over all 102,979 utterances):

| Emotion | % of Utterances |
|---|---|
| Anger | 1.0 % |
| Disgust | 0.3 % |
| Fear | 0.2 % |
| Happiness | 12.5 % |
| Sadness | 1.1 % |
| Surprise | 1.8 % |
| Neutral | 83.1 % |

Manual annotation was employed, but published sources do not provide inter-annotator agreement statistics or details of the specific annotation protocol (Pereira et al., 2023).
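
Both annotation axes are distributed as integer sequences aligned with the utterances, so the act and emotion distributions above can be recomputed directly. The sketch below assumes the commonly distributed Hugging Face mirror `daily_dialog`; the integer codings used here (acts 1-4, emotions 0-6 with 0 = neutral/other) are an assumption to check against the copy in use.

```python
# Sketch: recompute dialog act and emotion label distributions over the full corpus.
# The integer-to-label mappings below are assumptions based on a commonly distributed release.
from collections import Counter
from datasets import load_dataset, concatenate_datasets

ACT_NAMES = {1: "inform", 2: "question", 3: "directive", 4: "commissive"}
EMOTION_NAMES = {0: "neutral/other", 1: "anger", 2: "disgust", 3: "fear",
                 4: "happiness", 5: "sadness", 6: "surprise"}

ds = load_dataset("daily_dialog")
full = concatenate_datasets([ds["train"], ds["validation"], ds["test"]])

def distribution(column: str, names: dict) -> dict:
    """Return the percentage of utterances per label for an utterance-level column."""
    counts = Counter(label for labels in full[column] for label in labels)
    total = sum(counts.values())
    return {names.get(k, str(k)): round(100 * v / total, 1)
            for k, v in sorted(counts.items())}

print(distribution("act", ACT_NAMES))
print(distribution("emotion", EMOTION_NAMES))
```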

3. Data Splitting, Overlap, and Cleaning

Initial public releases randomly distributed dialogues into the canonical train/dev/test splits, with context–response extraction applied for both single-turn and multi-turn modeling setups (e.g., 1-turn or 3-turn context → next utterance) (Wen et al., 2022). This produced ~76K training, ~7K validation, and ~7K test instances for both single- and multi-turn versions.

Wen et al. (2022) identified pervasive overlap between splits: 23.15% of test examples are identical to training samples (overlap ratio R = 1.0), and 37% have R ≥ 0.8. This overlap, stemming from the original source material, leads to inflated performance metrics due to memorization.

A deduplication pipeline was introduced:

  • Unit-level deduplication: Remove any dialogue whose overlap ratio with another unit is R ≥ 0.80.
  • Instance-level deduplication: Remove all duplicate (context, response) pairs.

The cleaned DailyDialog contains ~60K training instances (single-turn: 60,005; multi-turn: 60,138), a ~21% reduction relative to the original. Test and validation sets retain comparable scale.
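
Wen et al. (2022) define overlap at both the dialogue (unit) level and the (context, response) instance level; their exact overlap ratio is not reproduced here, but a simplified token-level variant conveys the filtering procedure. The sketch below uses a Jaccard-style word overlap as a stand-in for R, so it illustrates the logic of the two deduplication steps rather than the published pipeline.

```python
# Sketch of unit- and instance-level deduplication in the spirit of Wen et al. (2022).
# The overlap measure is a simplified token-level Jaccard ratio used as a stand-in
# for their overlap ratio R; it is NOT their exact definition.

def overlap_ratio(dialogue_a: list[str], dialogue_b: list[str]) -> float:
    tokens_a = {tok for utt in dialogue_a for tok in utt.lower().split()}
    tokens_b = {tok for utt in dialogue_b for tok in utt.lower().split()}
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

def deduplicate_units(dialogues: list[list[str]], threshold: float = 0.80) -> list[list[str]]:
    """Keep a dialogue only if its overlap with every previously kept dialogue is below threshold.
    (Quadratic scan; acceptable for ~13K dialogues in an illustrative script.)"""
    kept: list[list[str]] = []
    for dialogue in dialogues:
        if all(overlap_ratio(dialogue, other) < threshold for other in kept):
            kept.append(dialogue)
    return kept

def deduplicate_instances(pairs: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Remove exact duplicate (context, response) pairs while preserving order."""
    seen, unique = set(), []
    for pair in pairs:
        if pair not in seen:
            seen.add(pair)
            unique.append(pair)
    return unique
```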

4. Derived Resources and Extensions: DailyDialog++

DailyDialog++ augments the base dataset with multi-reference responses and adversarially constructed negative samples, primarily for robust evaluation of dialog metrics (Sai et al., 2020). Using the original dialogues as seeds, 19,071 contexts were defined. For each, five fluent, relevant human-authored responses were collected, and for 11,429 contexts, five adversarially irrelevant but lexically overlapping negatives were additionally written and verified.

Corpus statistics for DailyDialog++:

  • Average turns/context: 3.31
  • Average words/context: 45.32
  • Average words/relevant response: 10.13
  • Average words/adversarial response: 13.80

The dataset supports systematic evaluation of metrics such as BLEU, METEOR, ROUGE, BERTScore, ADEM, and model-based frameworks (e.g., RUBER, DEB). Adversarial negatives expose the brittleness of existing evaluation schemes: even state-of-the-art learned metrics such as DEB (Dialog Evaluation using BERT) drop in accuracy from 88.3% (random negatives) to 66.8% (adversarial), while untrained metrics perform barely above chance (Sai et al., 2020).
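
This evaluation reduces to a pairwise discrimination test: a metric is credited when it scores a human-authored relevant response above an adversarial negative for the same context. The sketch below is a generic harness around a placeholder `score(context, response)` function into which any metric (BLEU, BERTScore, DEB, etc.) could be plugged; the record layout is hypothetical rather than the exact DailyDialog++ file format.

```python
# Sketch: discrimination accuracy of a response-scoring metric on DailyDialog++-style data.
# `score` is a placeholder for any metric; the record layout is hypothetical.
from typing import Callable

def discrimination_accuracy(
    records: list[dict],            # each: {"context": str, "relevant": [str], "adversarial": [str]}
    score: Callable[[str, str], float],
) -> float:
    wins, comparisons = 0, 0
    for rec in records:
        for pos in rec["relevant"]:
            for neg in rec["adversarial"]:
                comparisons += 1
                if score(rec["context"], pos) > score(rec["context"], neg):
                    wins += 1
    return wins / max(comparisons, 1)

# Toy metric: plain word overlap with the context. Such a metric is expected to be
# fooled by adversarial negatives that deliberately share vocabulary with the context.
def word_overlap(context: str, response: str) -> float:
    c, r = set(context.lower().split()), set(response.lower().split())
    return len(c & r) / max(len(r), 1)
```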

Code and data are publicly available for reproducible multi-reference, adversarial evaluation protocols.

5. System Benchmarking and Evaluation Metrics

DailyDialog is widely used as a benchmark for retrieval-based and generation-based conversational models (Li et al., 2017). Common evaluation regimes include:

  • Perplexity (PPL): Standard formulation for language models.
  • BLEU-n: Measures n-gram overlap, computed as $\mathrm{BLEU} = \mathrm{BP} \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right)$ with brevity penalty $\mathrm{BP}$ and typically uniform weights $w_n$ (see the sketch after this list).
  • Equivalence rate: Percentage of retrieved responses that match the reference’s dialog act or emotion label.
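
Corpus-level BLEU with uniform weights and the brevity penalty can be computed with standard toolkits; the sketch below uses NLTK's implementation, which follows the formula above. Tokenization choices (here simple whitespace splitting) and smoothing affect the absolute scores, so numbers are only comparable under a fixed protocol.

```python
# Sketch: corpus-level BLEU-4 with uniform weights, as in the formula above.
# Uses NLTK; whitespace tokenization and method1 smoothing are simplifications.
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

references = [["i am fine , thank you ."]]   # one (or more) reference strings per example
hypotheses = ["i am fine thanks ."]

list_of_refs = [[ref.split() for ref in refs] for refs in references]
hyps = [hyp.split() for hyp in hypotheses]

bleu4 = corpus_bleu(
    list_of_refs, hyps,
    weights=(0.25, 0.25, 0.25, 0.25),                 # uniform w_n with N = 4
    smoothing_function=SmoothingFunction().method1,   # avoid zero n-gram counts on short outputs
)
print(f"BLEU-4 = {bleu4:.4f}")
```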

Baseline results:

| Model | BLEU-4 |
|---|---|
| Embedding-based | 0.150 |
| Feature-based | 0.194 |
| + I-E-Rerank | 0.164 |
| Seq2Seq (vanilla) | 0.006 |
| Attn-Seq2Seq | 0.006 |
| HRED | 0.009 |
| L+Attn-Seq2Seq | 0.009 |

Label conditioning (on dialog act/emotion) improves BLEU scores; naïve pretraining on subtitles reduces perplexity but not necessarily dialogic relevance.

Equivalence rates (feature-based): 46.3% (Intent); 73.7% (Emotion).

In DailyDialog++, evaluation includes model-based metrics such as ADEM, RUBER, and DEB, with performance systematically degraded by adversarial negatives (Sai et al., 2020).

6. Preprocessing, Protocols, and Best Practices

For modeling, input sequences are tokenized and upper-cased via RoBERTa’s HuggingFace tokenizer (Pereira et al., 2023). For context-aware tasks, the immediate preceding turn is concatenated with each target utterance; sequences that exceed model length limits are truncated per transformer conventions.
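
A minimal version of this preprocessing is sketched below, assuming the standard `roberta-base` checkpoint and truncation to the model's maximum length; the casing step and the exact context window should follow the cited work rather than this illustration.

```python
# Sketch: encode (previous turn, target utterance) pairs with a RoBERTa tokenizer,
# truncating to the model's maximum length. The "roberta-base" checkpoint and the
# single-turn context window are assumptions for illustration.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")

def encode_with_context(previous_turn: str, target_utterance: str):
    # Passing two strings produces a pair encoding with separator tokens between them.
    return tokenizer(
        previous_turn,
        target_utterance,
        truncation=True,                          # truncate per transformer conventions
        max_length=tokenizer.model_max_length,
        return_tensors="pt",
    )

batch = encode_with_context("How was your day ?", "Pretty good , I finished the report .")
print(batch["input_ids"].shape)
```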

Best practices, especially after identification of overlap-induced contamination (Wen et al., 2022), include:

  1. Use deduplicated, non-overlapping splits (unit-level threshold T = 0.80).
  2. Preserve original dialogue integrity within splits.
  3. Report overlap statistics for published splits.
  4. Supplement BLEU and Distinct metrics with human evaluations, especially when adversarial negatives are involved.
  5. Release code and cleaned data splits for replicability.

7. Limitations and Future Directions

DailyDialog’s main limitations stem from severe label and topic imbalance: 83.1% of utterances are neutral/other, only 12.5% are labeled “Happiness,” and each remaining emotion accounts for under 3%; over 60% of dialogues pertain to relationships or ordinary life (Li et al., 2017; Pereira et al., 2023). Because the corpus was sourced from English-learning websites, the language is relatively formal and lacks spontaneous disfluencies.

Proposed directions include collecting additional data for underrepresented emotions and topics, extending dialogue length, incorporating more spontaneous speech, and annotating richer meta-pragmatic dimensions (e.g., politeness, sarcasm). Cross-lingual and multi-party extensions are also identified as promising avenues (Li et al., 2017).

Adherence to rigorous deduplication and transparent evaluation protocols is essential to maintain DailyDialog’s value as a benchmark in conversational modeling. The release of DailyDialog++, with its multi-reference and adversarial framework, further strengthens the landscape for robust and reliable evaluation (Sai et al., 2020).
