A Comprehensive Analysis of the Effectiveness of Large Language Models as Automatic Dialogue Evaluators (2312.15407v2)

Published 24 Dec 2023 in cs.CL

Abstract: Automatic evaluation is an integral aspect of dialogue system research. Traditional reference-based NLG metrics are generally found to be unsuitable for dialogue assessment. Consequently, recent studies have proposed various reference-free neural metrics that better align with human evaluations. Notably, LLMs, particularly instruction-tuned variants such as ChatGPT, have been shown to be promising substitutes for human judges. Yet, existing work on using LLMs for automatic dialogue evaluation is limited in scope with respect to the number of meta-evaluation datasets, the mode of evaluation, the coverage of LLMs, and so on. Hence, it remains inconclusive how effective these LLMs are. To this end, we conduct a comprehensive study of the application of LLMs to automatic dialogue evaluation. Specifically, we analyze the multi-dimensional evaluation capability of 30 recent LLMs at both the turn and dialogue levels, using a comprehensive set of 12 meta-evaluation datasets. Additionally, we probe the robustness of the LLMs in handling various adversarial perturbations at both the turn and dialogue levels. Finally, we explore how model-level and dimension-level ensembles impact evaluation performance. All resources are available at https://github.com/e0397123/comp-analysis.
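
The meta-evaluation protocol the abstract describes (correlating LLM-assigned scores with human ratings) and the model-level and dimension-level ensembles can be illustrated with a short sketch. The snippet below is not the authors' released code: the judge scores are dummy numbers, and simple averaging is assumed as the ensembling rule; in practice each score would come from prompting an LLM to rate a response or a full dialogue on a quality dimension.

```python
# Minimal sketch (assumptions noted above) of meta-evaluating LLM judges
# and ensembling their scores. Requires scipy.
from statistics import mean
from scipy.stats import spearmanr

def meta_evaluate(metric_scores, human_scores):
    """Spearman correlation between an automatic metric's scores and human ratings."""
    rho, p_value = spearmanr(metric_scores, human_scores)
    return rho, p_value

def model_level_ensemble(scores_per_model):
    """Average the scores that several LLM judges assign to the same items."""
    return [mean(item_scores) for item_scores in zip(*scores_per_model)]

def dimension_level_ensemble(scores_per_dimension):
    """Combine per-dimension scores (e.g. coherence, engagingness) into one score per item."""
    return [mean(item_scores) for item_scores in zip(*scores_per_dimension.values())]

if __name__ == "__main__":
    human = [4, 2, 5, 3, 1]                      # human quality ratings (dummy data)
    judge_a = [3.5, 2.0, 4.5, 3.0, 1.5]          # hypothetical LLM judge A
    judge_b = [4.0, 2.5, 4.0, 2.5, 2.0]          # hypothetical LLM judge B
    ensembled = model_level_ensemble([judge_a, judge_b])
    print("Judge A vs. human:", meta_evaluate(judge_a, human))
    print("Ensemble vs. human:", meta_evaluate(ensembled, human))
```

A higher correlation for the ensembled scores than for any single judge is the kind of effect the model-level ensemble analysis in the paper probes.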

Authors (5)
  1. Chen Zhang (403 papers)
  2. Luis Fernando D'Haro (20 papers)
  3. Yiming Chen (106 papers)
  4. Malu Zhang (43 papers)
  5. Haizhou Li (285 papers)
Citations (18)