Relevance of Unsupervised Metrics in Task-Oriented Dialogue for Evaluating Natural Language Generation
The paper explores the correlation between automated evaluation metrics and human judgments in the context of task-oriented dialogue response generation. Using automated metrics such as BLEU and METEOR, a common practice in machine translation, has shown limitations in non-task-oriented dialogue settings. This work empirically analyzes whether these metrics are more suitable for task-oriented dialogue, where responses are less diverse and generation more closely resembles a constrained translation task.
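For concreteness, here is a minimal sketch of how sentence-level BLEU and METEOR are computed with NLTK. It assumes NLTK is installed with the WordNet data downloaded and a recent version that expects pre-tokenized input; the example sentences are invented, not taken from the paper's datasets.

```python
# Sentence-level BLEU vs. METEOR on a single (reference, hypothesis) pair.
import nltk
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.translate.meteor_score import meteor_score

nltk.download("wordnet", quiet=True)   # METEOR uses WordNet for synonym matching
nltk.download("omw-1.4", quiet=True)   # needed by WordNet in some NLTK versions

reference = "there are no cheap restaurants in the north part of town".split()
hypothesis = "no inexpensive restaurants are located in the north of town".split()

# BLEU: pure n-gram overlap; smoothing avoids zero scores on short sentences.
bleu = sentence_bleu([reference], hypothesis,
                     smoothing_function=SmoothingFunction().method1)

# METEOR: aligns words via exact match, stems, and WordNet synonyms,
# so "cheap" vs. "inexpensive" still contributes to the score.
meteor = meteor_score([reference], hypothesis)

print(f"BLEU:   {bleu:.3f}")
print(f"METEOR: {meteor:.3f}")
```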
Key Insights
The authors posit that task-oriented dialogue systems generate responses with limited variability because they operate in narrow domains, such as restaurant booking. This contrasts with non-task-oriented (open-domain) dialogue, where responses exhibit high diversity. They therefore investigate whether word-overlap metrics like BLEU and METEOR better capture human judgment in task-oriented settings.
Their empirical study uses two popular datasets, DSTC2 and Restaurants, and finds that automated metrics correlate more strongly with human judgments in task-oriented contexts than has been reported for non-task-oriented dialogue. Notably, METEOR, which incorporates synonymy and paraphrase matching, consistently correlates better with human evaluations than BLEU.
Experimental Setup
Several neural generation models, including vanilla LSTM and sc-LSTM variants, were used to evaluate the metrics; a simplified sketch of this conditioning setup follows below. These models take input dialogue acts and slot types and generate the corresponding dialogue responses. The results show that even simple models such as plain LSTMs produce high-quality responses, suggesting that these datasets pose only a limited challenge. The paper therefore highlights the need for more complex datasets that better stress current machine learning approaches.
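The sketch below shows one common way to condition an LSTM generator on a dialogue act, in the spirit of the models above but much simplified: the dialogue act and slot types are encoded as a fixed feature vector concatenated to the word embedding at every decoding step. It is not the paper's sc-LSTM (which adds a dedicated reading gate); all names and sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ActConditionedLSTM(nn.Module):
    """Toy dialogue-act-conditioned LSTM response generator (illustrative only)."""
    def __init__(self, vocab_size, act_dim, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Input at each step = word embedding + dialogue-act/slot feature vector.
        self.lstm = nn.LSTM(embed_dim + act_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens, act_vec):
        # tokens: (batch, seq_len) token ids; act_vec: (batch, act_dim) 0/1 features
        emb = self.embed(tokens)                                # (B, T, E)
        act = act_vec.unsqueeze(1).expand(-1, emb.size(1), -1)  # (B, T, A)
        hidden, _ = self.lstm(torch.cat([emb, act], dim=-1))    # (B, T, H)
        return self.out(hidden)                                 # (B, T, |V|) logits

# Toy usage: two delexicalised responses conditioned on an "inform(food, area)"-style
# act vector; training would minimise cross-entropy against next-token targets.
model = ActConditionedLSTM(vocab_size=1000, act_dim=20)
tokens = torch.randint(0, 1000, (2, 12))
acts = torch.zeros(2, 20)
acts[:, [3, 7]] = 1.0
logits = model(tokens, acts)
print(logits.shape)  # torch.Size([2, 12, 1000])
```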
Quantitative Findings
Quantitatively, METEOR exhibits the most robust correlations with human ratings across both datasets. In contrast, BLEU, despite being the standard in machine translation, shows weaker and in some cases negative correlations, partly because only a single reference response is available per dialogue context.
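The kind of correlation analysis reported in the paper can be reproduced with a few lines of SciPy: compute Pearson and Spearman coefficients between per-response metric scores and human ratings. The numbers below are made-up placeholders, not the paper's data.

```python
from scipy.stats import pearsonr, spearmanr

human_ratings = [4.5, 3.0, 5.0, 2.5, 4.0, 1.5]        # e.g. mean adequacy judgments
metric_scores = [0.62, 0.41, 0.70, 0.35, 0.55, 0.20]  # e.g. METEOR per response

pearson_r, pearson_p = pearsonr(metric_scores, human_ratings)
spearman_r, spearman_p = spearmanr(metric_scores, human_ratings)

print(f"Pearson r    = {pearson_r:.2f} (p = {pearson_p:.3f})")
print(f"Spearman rho = {spearman_r:.2f} (p = {spearman_p:.3f})")
```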
Implications and Future Directions
The findings help guide evaluation strategies for future NLG systems in task-oriented dialogue. Automated metrics remain valuable for quick assessment and iteration, but their correlation with human judgment depends heavily on the characteristics and diversity of the dataset. Leveraging multiple reference sentences, when available, improves evaluation reliability, as sketched below.
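A minimal sketch of multi-reference evaluation using corpus-level BLEU from NLTK: each hypothesis is scored against every available reference, so valid paraphrases are not penalised as harshly as with a single reference. The sentences are illustrative, not drawn from DSTC2 or Restaurants.

```python
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

references = [
    [  # all acceptable references for the first system response
        "the phone number is 01223 356555".split(),
        "you can reach them on 01223 356555".split(),
    ],
    [  # all acceptable references for the second system response
        "there are no cheap restaurants in the north".split(),
        "i could not find any inexpensive restaurant in the north part of town".split(),
    ],
]
hypotheses = [
    "their phone number is 01223 356555".split(),
    "no cheap restaurants are in the north".split(),
]

score = corpus_bleu(references, hypotheses,
                    smoothing_function=SmoothingFunction().method1)
print(f"multi-reference BLEU: {score:.3f}")
```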
Looking forward, the dialogue community should move toward datasets with greater complexity and diversity, such as the recent Frames and E2E NLG Challenge datasets. These resources promise to extend the scope and difficulty of NLG, requiring evaluation mechanisms that go beyond current automated metrics.
In sum, this paper demonstrates that while automated metrics offer computational efficiency, their applicability and reliability in evaluating task-oriented dialogue systems necessitate careful consideration of dataset attributes and diversity. The paper affirms the continued relevance of human evaluation while encouraging the exploration of more complex datasets to further the development of robust NLG models.