Relevance of Unsupervised Metrics in Task-Oriented Dialogue for Evaluating Natural Language Generation
The paper explores the correlation between automated evaluation metrics and human judgments in the context of task-oriented dialogue response generation. Using automated metrics such as BLEU and METEOR, a common practice in machine translation, has shown limitations in non-task-oriented dialogue settings. This work empirically analyzes whether these metrics are more suitable for task-oriented dialogue, where responses are less diverse and generation more closely resembles a constrained translation task.
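For concreteness, here is a minimal sketch of how sentence-level BLEU and METEOR are computed with NLTK. It assumes NLTK is installed with the WordNet data downloaded and a recent version that expects pre-tokenized input; the example sentences are invented, not taken from the paper's datasets.

```python
# Sentence-level BLEU vs. METEOR on a single (reference, hypothesis) pair.
import nltk
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.translate.meteor_score import meteor_score

nltk.download("wordnet", quiet=True)   # METEOR uses WordNet for synonym matching
nltk.download("omw-1.4", quiet=True)   # needed by WordNet in some NLTK versions

reference = "there are no cheap restaurants in the north part of town".split()
hypothesis = "no inexpensive restaurants are located in the north of town".split()

# BLEU: pure n-gram overlap; smoothing avoids zero scores on short sentences.
bleu = sentence_bleu([reference], hypothesis,
                     smoothing_function=SmoothingFunction().method1)

# METEOR: aligns words via exact match, stems, and WordNet synonyms,
# so "cheap" vs. "inexpensive" still contributes to the score.
meteor = meteor_score([reference], hypothesis)

print(f"BLEU:   {bleu:.3f}")
print(f"METEOR: {meteor:.3f}")
```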
Key Insights
The authors posit that task-oriented dialogue systems generate responses with limited variability because they operate in narrow domains, such as restaurant booking. This contrasts with non-task-oriented (open-domain) dialogue, where responses exhibit high diversity. They therefore investigate whether word-overlap metrics like BLEU and METEOR better capture human judgment in task-oriented settings.
Their empirical study uses two popular datasets, DSTC2 and Restaurants, and finds that automated metrics correlate more strongly with human judgments in task-oriented contexts than has been reported for non-task-oriented dialogue. Notably, METEOR, which incorporates synonymy and paraphrase matching, consistently correlates better with human evaluations than BLEU.
Experimental Setup
Several neural generation models, including vanilla LSTM and sc-LSTM variants, were used to evaluate the metrics; a simplified sketch of this conditioning setup follows below. These models take input dialogue acts and slot types and generate the corresponding dialogue responses. The results show that even simple models such as plain LSTMs produce high-quality responses, suggesting that these datasets pose only a limited challenge. The paper therefore highlights the need for more complex datasets that better stress current machine learning approaches.
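The sketch below shows one common way to condition an LSTM generator on a dialogue act, in the spirit of the models above but much simplified: the dialogue act and slot types are encoded as a fixed feature vector concatenated to the word embedding at every decoding step. It is not the paper's sc-LSTM (which adds a dedicated reading gate); all names and sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ActConditionedLSTM(nn.Module):
    """Toy dialogue-act-conditioned LSTM response generator (illustrative only)."""
    def __init__(self, vocab_size, act_dim, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Input at each step = word embedding + dialogue-act/slot feature vector.
        self.lstm = nn.LSTM(embed_dim + act_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens, act_vec):
        # tokens: (batch, seq_len) token ids; act_vec: (batch, act_dim) 0/1 features
        emb = self.embed(tokens)                                # (B, T, E)
        act = act_vec.unsqueeze(1).expand(-1, emb.size(1), -1)  # (B, T, A)
        hidden, _ = self.lstm(torch.cat([emb, act], dim=-1))    # (B, T, H)
        return self.out(hidden)                                 # (B, T, |V|) logits

# Toy usage: two delexicalised responses conditioned on an "inform(food, area)"-style
# act vector; training would minimise cross-entropy against next-token targets.
model = ActConditionedLSTM(vocab_size=1000, act_dim=20)
tokens = torch.randint(0, 1000, (2, 12))
acts = torch.zeros(2, 20)
acts[:, [3, 7]] = 1.0
logits = model(tokens, acts)
print(logits.shape)  # torch.Size([2, 12, 1000])
```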
Quantitative Findings
Quantitatively, METEOR exhibits the most robust correlations with human ratings across both datasets. In contrast, BLEU, despite being the standard in machine translation, shows weaker and in some cases negative correlations, partly because only a single reference response is available per dialogue context.
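The kind of correlation analysis reported in the paper can be reproduced with a few lines of SciPy: compute Pearson and Spearman coefficients between per-response metric scores and human ratings. The numbers below are made-up placeholders, not the paper's data.

```python
from scipy.stats import pearsonr, spearmanr

human_ratings = [4.5, 3.0, 5.0, 2.5, 4.0, 1.5]        # e.g. mean adequacy judgments
metric_scores = [0.62, 0.41, 0.70, 0.35, 0.55, 0.20]  # e.g. METEOR per response

pearson_r, pearson_p = pearsonr(metric_scores, human_ratings)
spearman_r, spearman_p = spearmanr(metric_scores, human_ratings)

print(f"Pearson r    = {pearson_r:.2f} (p = {pearson_p:.3f})")
print(f"Spearman rho = {spearman_r:.2f} (p = {spearman_p:.3f})")
```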
Implications and Future Directions
The findings help guide evaluation strategies for future NLG systems in task-oriented dialogue. Automated metrics remain valuable for quick assessment and iteration, but their correlation with human judgment depends heavily on the characteristics and diversity of the dataset. Leveraging multiple reference sentences, when available, improves evaluation reliability, as sketched below.
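A minimal sketch of multi-reference evaluation using corpus-level BLEU from NLTK: each hypothesis is scored against every available reference, so valid paraphrases are not penalised as harshly as with a single reference. The sentences are illustrative, not drawn from DSTC2 or Restaurants.

```python
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

references = [
    [  # all acceptable references for the first system response
        "the phone number is 01223 356555".split(),
        "you can reach them on 01223 356555".split(),
    ],
    [  # all acceptable references for the second system response
        "there are no cheap restaurants in the north".split(),
        "i could not find any inexpensive restaurant in the north part of town".split(),
    ],
]
hypotheses = [
    "their phone number is 01223 356555".split(),
    "no cheap restaurants are in the north".split(),
]

score = corpus_bleu(references, hypotheses,
                    smoothing_function=SmoothingFunction().method1)
print(f"multi-reference BLEU: {score:.3f}")
```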
Looking forward, the dialogue community should move toward datasets with greater complexity and diversity, such as the recent Frames and E2E NLG Challenge datasets. These resources promise to extend the scope and difficulty of NLG, requiring evaluation mechanisms that go beyond current automated metrics.
In sum, this paper demonstrates that while automated metrics offer computational efficiency, their applicability and reliability in evaluating task-oriented dialogue systems necessitate careful consideration of dataset attributes and diversity. The paper affirms the continued relevance of human evaluation while encouraging the exploration of more complex datasets to further the development of robust NLG models.