
Toward More Accurate and Generalizable Evaluation Metrics for Task-Oriented Dialogs (2306.03984v2)

Published 6 Jun 2023 in cs.CL and cs.LG

Abstract: Measurement of interaction quality is a critical task for the improvement of spoken dialog systems. Existing approaches to dialog quality estimation either focus on evaluating the quality of individual turns, or collect dialog-level quality measurements from end users immediately following an interaction. In contrast to these approaches, we introduce a new dialog-level annotation workflow called Dialog Quality Annotation (DQA). DQA expert annotators evaluate the quality of dialogs as a whole, and also label dialogs for attributes such as goal completion and user sentiment. In this contribution, we show that: (i) while dialog quality cannot be completely decomposed into dialog-level attributes, there is a strong relationship between some objective dialog attributes and judgments of dialog quality; (ii) for the task of dialog-level quality estimation, a supervised model trained on dialog-level annotations outperforms methods based purely on aggregating turn-level features; and (iii) the proposed evaluation model shows better domain generalization ability compared to the baselines. On the basis of these results, we argue that having high-quality human-annotated data is an important component of evaluating interaction quality for large industrial-scale voice assistant platforms.

Authors (7)
  1. Abishek Komma
  2. Nagesh Panyam Chandrasekarasastry
  3. Timothy Leffel
  4. Anuj Goyal
  5. Angeliki Metallinou
  6. Spyros Matsoukas
  7. Aram Galstyan
Citations (1)
