
Better Automatic Evaluation of Open-Domain Dialogue Systems with Contextualized Embeddings (1904.10635v1)

Published 24 Apr 2019 in cs.CL

Abstract: Despite advances in open-domain dialogue systems, automatic evaluation of such systems is still a challenging problem. Traditional reference-based metrics such as BLEU are ineffective because there could be many valid responses for a given context that share no common words with reference responses. A recent work proposed Referenced metric and Unreferenced metric Blended Evaluation Routine (RUBER) to combine a learning-based metric, which predicts relatedness between a generated response and a given query, with reference-based metric; it showed high correlation with human judgments. In this paper, we explore using contextualized word embeddings to compute more accurate relatedness scores, thus better evaluation metrics. Experiments show that our evaluation metrics outperform RUBER, which is trained on static embeddings.
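As a rough illustration of the unreferenced, relatedness-based scoring described above, the sketch below pools per-token embeddings for a query and a candidate response into sentence vectors and scores them by cosine similarity. This is a simplified stand-in, not the paper's exact architecture: RUBER-style metrics train a learned scorer on top of the embeddings, and the toy random vectors here are placeholders for real contextualized embeddings such as BERT's.

```python
import numpy as np

def pool(token_embeddings: np.ndarray) -> np.ndarray:
    """Max-pool a (num_tokens, dim) matrix into one sentence vector."""
    return token_embeddings.max(axis=0)

def relatedness(query_tokens: np.ndarray, response_tokens: np.ndarray) -> float:
    """Cosine similarity between pooled query and response vectors.

    A generic stand-in for the trained relatedness scorer used in
    RUBER-style unreferenced metrics.
    """
    q, r = pool(query_tokens), pool(response_tokens)
    return float(q @ r / (np.linalg.norm(q) * np.linalg.norm(r)))

# Toy 4-dimensional "contextualized" embeddings (placeholders, not BERT).
rng = np.random.default_rng(0)
query = rng.normal(size=(5, 4))     # 5 query tokens
response = rng.normal(size=(7, 4))  # 7 response tokens

score = relatedness(query, response)
print(f"relatedness score: {score:.3f}")  # a value in [-1, 1]
```

In the actual metric, the cosine similarity would be replaced by a small neural network trained to distinguish true query-response pairs from randomly sampled negatives, which is what lets the learned score correlate with human judgments.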

Authors (4)
  1. Sarik Ghazarian (13 papers)
  2. Johnny Tian-Zheng Wei (9 papers)
  3. Aram Galstyan (142 papers)
  4. Nanyun Peng (205 papers)
Citations (89)