
Simple LLM Prompting is State-of-the-Art for Robust and Multilingual Dialogue Evaluation (2308.16797v2)

Published 31 Aug 2023 in cs.CL

Abstract: Despite significant research effort in the development of automatic dialogue evaluation metrics, little thought is given to evaluating dialogues other than in English. At the same time, ensuring that metrics are invariant to semantically similar responses is also an overlooked topic. To achieve the desired properties of robustness and multilinguality for dialogue evaluation metrics, we propose a novel framework that combines the strengths of current evaluation models with the newly established paradigm of prompting LLMs. Empirical results show that our framework achieves state-of-the-art results in terms of mean Spearman correlation scores across several benchmarks and ranks first on both the Robust and Multilingual tasks of the DSTC11 Track 4 "Automatic Evaluation Metrics for Open-Domain Dialogue Systems", demonstrating the evaluation capabilities of prompted LLMs.
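
The headline figure in the abstract is a mean Spearman correlation across benchmarks. As a rough illustration of how such a figure can be computed, the sketch below ranks a metric's scores and the human annotations for each benchmark, correlates the ranks, and averages the per-benchmark values. The function names, the dictionary layout, and the toy scores are assumptions made for this example; they are not taken from the paper or from the DSTC11 evaluation scripts.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(metric_scores, human_scores):
    """Spearman correlation: Pearson correlation of the two rank vectors.

    Assumes both inputs have the same length and neither is constant
    (a constant input would make the denominator zero).
    """
    rx = average_ranks(metric_scores)
    ry = average_ranks(human_scores)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def mean_spearman(benchmarks):
    """Average the per-benchmark Spearman correlations (unweighted)."""
    return mean(spearman(m, h) for m, h in benchmarks.values())


# Hypothetical toy data: benchmark name -> (metric scores, human scores).
toy_benchmarks = {
    "benchmark_a": ([0.9, 0.2, 0.5, 0.7], [5, 1, 3, 4]),
    "benchmark_b": ([0.1, 0.8, 0.4, 0.6], [2, 5, 1, 4]),
}
print(mean_spearman(toy_benchmarks))
```

The average here is unweighted, so each benchmark counts equally regardless of its size; whether the paper weights benchmarks differently is not stated in the abstract.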

Authors (6)
  1. John Mendonça (9 papers)
  2. Patrícia Pereira (10 papers)
  3. Helena Moniz (10 papers)
  4. João Paulo Carvalho (8 papers)
  5. Alon Lavie (12 papers)
  6. Isabel Trancoso (26 papers)
Citations (16)