Toward More Accurate and Generalizable Evaluation Metrics for Task-Oriented Dialogs (2306.03984v2)

Published 6 Jun 2023 in cs.CL and cs.LG

Abstract: Measurement of interaction quality is a critical task for the improvement of spoken dialog systems. Existing approaches to dialog quality estimation either focus on evaluating the quality of individual turns, or collect dialog-level quality measurements from end users immediately following an interaction. In contrast to these approaches, we introduce a new dialog-level annotation workflow called Dialog Quality Annotation (DQA). DQA expert annotators evaluate the quality of dialogs as a whole, and also label dialogs for attributes such as goal completion and user sentiment. In this contribution, we show that: (i) while dialog quality cannot be completely decomposed into dialog-level attributes, there is a strong relationship between some objective dialog attributes and judgments of dialog quality; (ii) for the task of dialog-level quality estimation, a supervised model trained on dialog-level annotations outperforms methods based purely on aggregating turn-level features; and (iii) the proposed evaluation model shows better domain generalization ability compared to the baselines. On the basis of these results, we argue that having high-quality human-annotated data is an important component of evaluating interaction quality for large industrial-scale voice assistant platforms.

Authors (7)

Abishek Komma
Nagesh Panyam Chandrasekarasastry
Timothy Leffel
Anuj Goyal
Angeliki Metallinou
Spyros Matsoukas
Aram Galstyan

Citations (1)
