
Unsupervised Evaluation of Interactive Dialog with DialoGPT (2006.12719v1)

Published 23 Jun 2020 in cs.CL, cs.AI, and cs.HC

Abstract: It is important to define meaningful and interpretable automatic evaluation metrics for open-domain dialog research. Standard language generation metrics have been shown to be ineffective for dialog. This paper introduces the FED metric (fine-grained evaluation of dialog), an automatic evaluation metric which uses DialoGPT, without any fine-tuning or supervision. It also introduces the FED dataset which is constructed by annotating a set of human-system and human-human conversations with eighteen fine-grained dialog qualities. The FED metric (1) does not rely on a ground-truth response, (2) does not require training data and (3) measures fine-grained dialog qualities at both the turn and whole dialog levels. FED attains moderate to strong correlation with human judgement at both levels.

Unsupervised Evaluation of Interactive Dialog with DialoGPT

The paper "Unsupervised Evaluation of Interactive Dialog with DialoGPT" introduces an innovative approach for dialog system evaluation with a focus on unsupervised fine-grained assessment. In response to the inadequacies of traditional metrics like BLEU and METEOR for dialog evaluation, this paper proposes the FED metric: Fine-grained Evaluation of Dialog. Built upon DialoGPT, the FED metric aligns dialog system evaluation closer to human-like performance by employing a model that inherently understands conversational cues without supervision or reference responses.

Key Contributions:

  1. FED Dataset: The researchers compiled the FED dataset by annotating human-human conversations and human-system conversations with the Meena and Mitsuku agents, labeling eighteen fine-grained dialog qualities at both the turn and dialog levels. This dataset enables benchmarking automatic metrics against human judgment.
  2. Predictive Capability of DialoGPT: DialoGPT's large-scale pre-training on conversational data makes it capable of assessing dialog quality. The FED metric leverages this by measuring the likelihood that DialoGPT would generate specific follow-up utterances reflecting various dialog qualities (see the sketch after this list).
  3. Correlation Analysis: The FED metric achieves moderate to strong correlation with human judgments, especially in dialog-level evaluations, signifying the potential of pre-trained models for unsupervised dialog quality assessment.
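
The sketch below illustrates the likelihood-based scoring idea behind contribution 2: a dialog quality is scored by comparing how likely DialoGPT is to produce positive versus negative follow-up utterances after the conversation. It is a minimal sketch, assuming the Hugging Face `transformers` library and the public `microsoft/DialoGPT-large` checkpoint; the follow-up utterances and the exact scoring function are illustrative assumptions, not the paper's precise sets.

```python
# Minimal sketch of FED-style follow-up likelihood scoring with DialoGPT.
# The follow-up utterances below are hypothetical examples, not the
# hand-crafted sets defined in the paper.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-large")
model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-large")
model.eval()

def follow_up_log_likelihood(context: str, follow_up: str) -> float:
    """Log-likelihood of DialoGPT generating `follow_up` after `context`."""
    ctx_ids = tokenizer.encode(context + tokenizer.eos_token, return_tensors="pt")
    full_ids = tokenizer.encode(
        context + tokenizer.eos_token + follow_up, return_tensors="pt"
    )
    with torch.no_grad():
        logits = model(full_ids).logits
    # Token-level log-probabilities, shifted so position i predicts token i+1.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = full_ids[0, 1:]
    token_ll = log_probs[torch.arange(targets.size(0)), targets]
    # Sum only over the follow-up tokens (those after the context).
    return token_ll[ctx_ids.size(1) - 1:].sum().item()

def fed_quality_score(context, positive, negative):
    """Higher when DialoGPT prefers positive follow-ups over negative ones."""
    pos = sum(follow_up_log_likelihood(context, u) for u in positive) / len(positive)
    neg = sum(follow_up_log_likelihood(context, u) for u in negative) / len(negative)
    return pos - neg

# Hypothetical follow-ups for the "interesting" quality.
dialog = "I went hiking in the Alps and saw a family of ibex up close."
print(fed_quality_score(
    dialog,
    positive=["Wow, that is really interesting!", "That's fascinating, tell me more!"],
    negative=["That's boring.", "I don't care about that."],
))
```

In the paper, each of the eighteen qualities has its own sets of positive and negative follow-up utterances, and the per-quality scores are derived from the difference in their likelihoods under DialoGPT, which is what makes the metric reference-free and training-free.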

Numerical Results and Observations:

  • The strongest correlations reported are 0.209 (Spearman) at the turn level and 0.443 at the dialog level, obtained with the largest DialoGPT model; the protocol behind these numbers is sketched after this list.
  • Meena outperformed both Mitsuku and humans on turn-level qualities but was surpassed by humans in dialog-level evaluations, highlighting the limitations of assessing systems on single-turn interactions alone.
  • The fine-grained qualities contributing most to the overall impression are 'interesting', 'relevant', and 'fluent' at the turn level, and 'coherent', 'understanding', and 'likeable' at the dialog level.
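
For context, the correlations above measure rank agreement between metric scores and human annotations. A hypothetical check of that protocol, using Spearman's rho from SciPy with invented numbers, looks like this:

```python
# Hypothetical illustration of the correlation protocol: Spearman's rho
# between automatic metric scores and human ratings. All data is invented.
from scipy.stats import spearmanr

fed_scores = [0.12, -0.30, 0.55, 0.08, 0.41]   # metric scores per dialog
human_ratings = [3, 1, 5, 2, 4]                # human quality judgments

rho, p = spearmanr(fed_scores, human_ratings)
print(f"Spearman rho = {rho:.3f} (p = {p:.3f})")
```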

Theoretical and Practical Implications:

The paper underscores the importance of evaluating dialog systems beyond traditional paradigms, stressing the need for metrics that reflect nuanced conversational qualities. By demonstrating that pre-trained models can discern dialog quality without explicit supervision, the research sets the stage for metrics that apply across diverse dialog systems without domain restrictions.

Future Directions:

While the FED metric shows promise for evaluating open-domain chit-chat, its applicability to goal-oriented dialog remains untested. Future work on fine-tuning the underlying model on targeted conversational domains, or on integrating more diverse data sources, may strengthen robustness across dialog types. The FED approach also opens the possibility of extending unsupervised metrics to new dialog qualities and to other areas of language processing.

Ultimately, the findings reinforce the potential of pre-trained models like DialoGPT in evolving dialog system evaluation strategies by harnessing their implicit understanding of conversation dynamics, paving the way for more insightful and effective assessments.

Authors (2)
  1. Shikib Mehri
  2. Maxine Eskenazi