DailyDialog++: Multi-Reference Evaluation Dataset
- DailyDialog++ is a multi-reference, adversarially augmented dataset that provides multiple valid human responses per context along with semantically invalid, high-overlap negatives.
- The methodology augments the original DailyDialog corpus using crowd-sourced responses and adversarial negative generation to probe metric–human correlations in both quality and diversity.
- The dataset enables robust benchmarking of automatic and model-based evaluation metrics, supporting the development of more reliable dialogue systems.
DailyDialog++ is a multi-reference, adversarially augmented evaluation dataset and resource suite for open-domain dialog systems, designed to address the limitations of traditional single-reference automatic evaluation. Originating as an augmentation of the DailyDialog corpus, DailyDialog++ supports robust benchmarking of both automatic and model-based dialog evaluation metrics by providing (i) multiple valid human responses per context and (ii) sets of adversarial, word-overlap-rich yet semantically invalid negatives. This structure enables detailed analyses of metric–human correlation for both quality and diversity, and facilitates the development and validation of sophisticated dialog evaluation models, including large-scale pretrained approaches.
1. Dataset Construction and Composition
DailyDialog++ builds on the original DailyDialog corpus of ~13k two-speaker, multi-turn dialogues. The dataset construction comprises two primary axes: multi-reference positive response collection and adversarial negative response generation.
- Multi-reference positives: For the open-domain test set, each context (a dialogue up to the next response) is supplemented with up to four additional crowd-sourced valid continuations beyond the original reference, yielding five references per context. Responses are authored under clearly scoped crowd-worker instructions (distinct, non-trivial, 8–10+ words, no generic replies), filtered for appropriateness, and deduplicated. In Gupta et al. (2019), the original test set comprised 1,000 dialogues, yielding 6,740 context–reference pairs, all augmented to five references (26,960 new responses plus the 6,740 originals).
- Adversarial irrelevant responses: A subset of contexts (11,429 of the 19,071 in Sai et al. (2020)) is assigned up to five adversarially constructed negatives per context. Each adversarial response is written to maximize n-gram overlap with the context (often exploiting synonyms, antonyms, homonyms, or topical co-occurrences) while remaining semantically incoherent, as judged by independent validators and tertiary annotators. Adversarial responses average 13.8 words, undergo rigorous multi-stage human validation (syntactic, topical, and semantic criteria), and thus share surface form with the context but not meaning.
- Quality control: Both responses and adversarial negatives undergo multi-level validation for length, appropriateness, and diversity. Inter-annotator agreement is monitored with Cohen’s κ, typically enforcing κ ≥ 0.2 and achieving mean κ ≈ 0.4.
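The κ-based filtering described above can be illustrated with a minimal sketch; the two annotators' binary labels below are invented for illustration, and the 0.2 threshold is the one reported for DailyDialog++.

```python
# Sketch: Cohen's kappa between two annotators, as used to filter
# low-agreement annotation batches. Labels here are invented examples.
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length label sequences."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement under independent annotator marginals.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Two annotators rating responses as valid (1) / invalid (0).
a = [1, 1, 0, 1, 0, 1, 1, 0]
b = [1, 1, 0, 0, 0, 1, 1, 1]
kappa = cohens_kappa(a, b)
keep_batch = kappa >= 0.2  # agreement threshold reported for DailyDialog++
```

Batches whose κ falls below the threshold are re-annotated or discarded, which is what pushes the mean agreement toward the reported κ ≈ 0.4.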
Summary Table: Main Statistics
| Attribute | Value (Gupta et al., 2019) | Value (Sai et al., 2020) |
|---|---|---|
| # Contexts (test set / all) | 6,740 pairs from 1,000 test dialogues | 19,071 |
| # Human references/context | 5 | 5 |
| # Adversarial negatives/context | – | 5 (on 11,429) |
| Avg. utterance/context length (words) | 13.55 | 13.55 |
2. Automatic Evaluation Metrics and Scoring Formulations
DailyDialog++ supports both traditional and model-based automatic evaluation metrics, enhanced by the multi-reference setting.
- Single- and multi-reference quality scoring: For a hypothesis $h$ and references $r_1, \dots, r_n$, with scoring function $s(\cdot,\cdot)$:
  - Single-reference: $\mathrm{score}(h) = s(h, r_1)$.
  - Multi-reference (quality): $\mathrm{score}(h) = \max_{1 \le i \le n} s(h, r_i)$, rewarding closeness to the best-matched reference.
- Multi-reference diversity (referenced recall): For outputs $h_1, \dots, h_k$,
  $\mathrm{recall} = \frac{1}{n} \sum_{j=1}^{n} \max_{1 \le i \le k} s(h_i, r_j)$,
  incentivizing systems to generate outputs covering all human references.
- Metrics catalogued:
- N-gram overlap: BLEU-N (with brevity penalty), METEOR, ROUGE-L (LCS-based).
- Embedding-based: average embedding cosine similarity, Vector Extrema, Greedy Matching, off-the-shelf sentence encoders (e.g. Skip-Thought, GenSen), BERTScore.
- Trained model-based metrics: ADEM, RUBER, BERT-based regressors, BERT + DNN.
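The scoring formulations above can be sketched in a few lines; the toy word-level F1 scorer below merely stands in for $s(\cdot,\cdot)$ (any of the catalogued metrics could be plugged in), and the example sentences are invented.

```python
# Sketch of single-reference, max-over-references, and referenced-recall
# scoring. The scorer s() is a toy word-overlap F1, not a real metric.
def s(hyp, ref):
    """Toy scorer: word-level F1 overlap between hypothesis and reference."""
    h, r = set(hyp.split()), set(ref.split())
    if not h or not r:
        return 0.0
    p, rec = len(h & r) / len(h), len(h & r) / len(r)
    return 2 * p * rec / (p + rec) if p + rec else 0.0

def multi_ref_quality(hyp, refs):
    """Max over references: credit the best-matched valid response."""
    return max(s(hyp, r) for r in refs)

def multi_ref_recall(hyps, refs):
    """Referenced recall: how well the output set covers all references."""
    return sum(max(s(h, r) for h in hyps) for r in refs) / len(refs)

refs = ["i am doing well thanks", "pretty good how about you"]
quality = multi_ref_quality("i am doing great thanks", refs)
coverage = multi_ref_recall(["i am doing great thanks", "pretty good and you"], refs)
```

Note that quality only needs the single best-matched reference, whereas recall averages over all references, so a system emitting near-duplicates scores well on the former but poorly on the latter.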
3. Correlation Analyses: Human Versus Automatic Metrics
Systematic correlation analyses have been performed at both utterance and system granularity, for both quality and diversity.
- Utterance-level quality: Annotators rate single model outputs (per context) on a 1–5 appropriateness scale (filtered for κ ≥ 0.2, mean κ ≈ 0.43), yielding mean human scores for comparison. All metrics (e.g., BLEU-2, METEOR, ROUGE-L, VectorExtrema) are evaluated in single- and multi-reference modes.
- System-level quality: For each model, average the human appropriateness scores and the automatic metric (across the test set), and compute Pearson/Spearman correlation across 5 models.
- Diversity correlation: Annotators mark outputs as “appropriate” and then enumerate semantic diversity among marked responses. Unreferenced metrics (Distinct-n, Self-BLEU) and referenced recall metrics are compared in terms of their alignment with human diversity judgements.
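Of the unreferenced diversity metrics named above, Distinct-n is the simplest to make concrete: the ratio of unique n-grams to total n-grams across a set of outputs. A minimal sketch, with invented responses:

```python
# Sketch of Distinct-n, an unreferenced diversity metric: the ratio of
# unique n-grams to total n-grams over a set of system outputs.
def distinct_n(responses, n):
    ngrams = []
    for r in responses:
        toks = r.split()
        ngrams += [tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)]
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0

# Repeated generic replies drag the score down.
outs = ["i do not know", "i do not know", "maybe we can try"]
d1 = distinct_n(outs, 1)  # unique unigrams / total unigrams
```

Because Distinct-n never consults the references, it rewards surface variety even when the varied outputs are inappropriate, which is one reason such metrics can correlate negatively with human diversity judgements.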
Key Correlation Results Snapshot ((Gupta et al., 2019), utterance level):
| Metric | Spearman ρ, 1-ref | Pearson r, 1-ref | Spearman ρ, 5-ref | Pearson r, 5-ref |
|---|---|---|---|---|
| BLEU-2 | 0.025 | 0.180 | 0.208 | 0.291 |
| METEOR | 0.106 | 0.187 | 0.225 | 0.286 |
| ROUGE-L | 0.072 | 0.141 | 0.220 | 0.280 |
| VectorExtrema | 0.192 | 0.211 | 0.279 | 0.295 |
- Multi-reference scoring consistently yields roughly two- to three-fold higher metric–human correlations than single-reference scoring.
- System-level correlations can reach r ≈ 0.87 (BLEU-2) and r ≈ 0.90 (METEOR) in multi-reference mode, while being near zero in single-reference.
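The system-level protocol (average per system, then correlate across systems) can be sketched as follows; the five human/metric score pairs are invented, and real studies would use `scipy.stats` rather than the hand-rolled, tie-free rank trick below.

```python
# Sketch of system-level correlation: one (mean human score, mean metric
# score) pair per system, correlated across systems. Scores are invented.
import math

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    # Pearson on ranks; assumes no tied values.
    rank = lambda v: [sorted(v).index(e) + 1 for e in v]
    return pearson(rank(x), rank(y))

human  = [2.1, 2.8, 3.0, 3.6, 4.2]       # mean human appropriateness per system
metric = [0.10, 0.14, 0.13, 0.21, 0.25]  # mean multi-ref metric score per system
r = pearson(human, metric)
rho = spearman(human, metric)
```

With only five systems, a single swapped pair of ranks moves Spearman ρ noticeably, which is why utterance-level and system-level correlations are reported separately.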
4. Model-Based Metrics, Adversarial Negatives, and Robustness
DailyDialog++ enables rigorous evaluation of model-based metrics (ADEM, RUBER, BERT-based, DEB) under both random and adversarial negative conditions.
- Adversarial negatives: Adversarial examples maximize n-gram overlap while remaining incoherent, challenging metrics that rely primarily on surface-form similarity. Construction is formalized as selecting a response $w$ maximizing the n-gram overlap $\mathrm{overlap}_n(w, c)$ with the context $c$, subject to human labelling of $w$ as "irrelevant".
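The overlap objective guiding this construction can be sketched as below. The context and candidates are invented, and in DailyDialog++ the adversarial responses are human-written rather than machine-selected; the code only illustrates what "overlap-rich yet irrelevant" means.

```python
# Sketch of the overlap objective behind adversarial negatives: among
# candidate responses, prefer the one sharing the most n-grams with the
# context. Candidates/context are invented examples.
def ngram_overlap(response, context, n=1):
    grams = lambda s: {tuple(s.split()[i:i + n])
                       for i in range(len(s.split()) - n + 1)}
    r, c = grams(response), grams(context)
    return len(r & c) / len(r) if r else 0.0

context = "did you water the plants in the garden today"
candidates = [
    "the water in the garden pump is cold today",  # overlap-rich, incoherent
    "no i completely forgot about it",             # valid but low overlap
]
best = max(candidates, key=lambda r: ngram_overlap(r, context))
```

A metric driven mainly by surface overlap scores the first candidate highly even though it is not a coherent reply, which is exactly the failure mode the adversarial set probes.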
- Accuracy and point-biserial correlation (ρ):
| Metric | Acc. (Random Neg) | Acc. (Adv. Neg) | ρ (Random) | ρ (Adv) |
|---|---|---|---|---|
| n-gram/Embedding | ≈70% | ≈55% | ≈0.4 | ≈0.15 |
| ADEM | 64.7% | 53.3% | 0.40 | 0.12 |
| RUBER-Large | 82.4% | 68.9% | 0.69 | 0.27 |
| DEB | 88.3% | 82.0% | 0.79 | 0.33 |
- Training with adversarial negatives: Inclusion of adversarial examples in training significantly raises metric robustness to such inputs (e.g., DEB: 66.8% → 70.9% adversarial accuracy with 1% negatives added, up to 92.7% with full adversarial inclusion).
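The adversarial-inclusion recipe amounts to replacing a fixed fraction of the random negatives with adversarial ones when building training pairs. A minimal sketch, with invented placeholder pools and simplified sampling:

```python
# Sketch of mixing adversarial negatives into a training pool at a
# fixed fraction (e.g. the 1% ablation above). Pools are placeholders.
import random

def build_negatives(random_negs, adversarial_negs, adv_fraction, seed=0):
    """Replace adv_fraction of the random negatives with adversarial ones."""
    rng = random.Random(seed)
    n_adv = int(round(adv_fraction * len(random_negs)))
    mixed = rng.sample(random_negs, len(random_negs) - n_adv)
    mixed += rng.sample(adversarial_negs, min(n_adv, len(adversarial_negs)))
    rng.shuffle(mixed)
    return mixed

rand_pool = [f"rand-{i}" for i in range(200)]
adv_pool = [f"adv-{i}" for i in range(50)]
negs = build_negatives(rand_pool, adv_pool, adv_fraction=0.01)
```

Even such a small adversarial fraction changes what the classifier's negative class looks like, which is consistent with the reported jump in adversarial accuracy from a 1% mix alone.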
5. Empirical Insights and Practical Recommendations
- Quality: Multi-reference evaluation substantially increases both utterance- and system-level human–metric correlation across all tested metrics. Max-over-references scoring should be adopted for all existing automatic metrics when multiple references are available.
- Diversity: Referenced-recall metrics provide moderate correlation (ρ ≈ 0.21) with human-perceived diversity, contrasting with unreferenced metrics (Distinct-n, Self-BLEU), which can correlate negatively.
- Robustness limits: Even large-scale pretrained evaluators (e.g., DEB, based on 727M Reddit utterance pairs) demonstrate marked performance drops on adversarial negatives despite state-of-the-art accuracy and correlation on random negatives.
- Reference count: Gains in correlation plateau with ~6 references per context, with the largest improvement observed when increasing from 1 to 4 references.
- Evaluation protocol: Employ a combination of 4–6 references per context, max-over-references quality evaluation, recall-based diversity metrics, and systematic human validation calibrated via inter-annotator agreement (Cohen's κ).
6. Resources, Availability, and Applications
DailyDialog++ is publicly released and maintained by the IIT-M NLP group. Dataset resources and code are available at https://iitmnlp.github.io/DailyDialog-plusplus/ and https://github.com/iitmnlp/Dialogue-Evaluation-with-BERT (Sai et al., 2020).
The dataset is widely adopted for evaluating open-domain dialog systems, especially for probing metric robustness, evaluating model-based metrics (including those leveraging adversarial training), and benchmarking large-scale pretrained evaluators. Its design directly supports research into the reliable quantification of both response quality and semantic diversity, and provides a basis for adversarial robustness studies in dialogue evaluation (Gupta et al., 2019, Sai et al., 2020).