A Closer Look into Automatic Evaluation Using Large Language Models (2310.05657v1)

Published 9 Oct 2023 in cs.CL

Abstract: Using LLMs to evaluate text quality has recently gained popularity. Several prior works explore the idea of using LLMs for evaluation, though they differ in the details of the evaluation process. In this paper, we analyze LLM evaluation (Chiang and Lee, 2023) and G-Eval (Liu et al., 2023), and we discuss how those details of the evaluation process change how well the ratings given by LLMs correlate with human ratings. We find that the auto Chain-of-Thought (CoT) used in G-Eval does not always make G-Eval more aligned with human ratings. We also show that forcing the LLM to output only a numeric rating, as in G-Eval, is suboptimal. Last, we reveal that asking the LLM to explain its own ratings consistently improves the correlation between ChatGPT's ratings and human ratings, and achieves new state-of-the-art (SoTA) correlations on two meta-evaluation datasets.
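
As a rough illustration of the "explain, then rate" setup the abstract describes, the sketch below prompts a chat model to justify its judgment before emitting a 1-5 score, then parses the number out of the free-form reply. It assumes the OpenAI Python client and a toy coherence rubric; the prompt wording, model choice, and the helper name `rate_summary` are illustrative placeholders, not the paper's exact templates.

```python
import re
from openai import OpenAI  # pip install openai

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Illustrative rubric prompt; the authors' actual evaluation templates differ.
PROMPT = """You will be given a news article and a summary of that article.
Rate the coherence of the summary on a scale of 1 (incoherent) to 5 (highly coherent).

Article:
{article}

Summary:
{summary}

First explain your reasoning in a few sentences, then give the rating
on its own line in the form "Rating: <number>"."""


def rate_summary(article: str, summary: str, model: str = "gpt-3.5-turbo") -> float:
    """Ask the LLM to explain its judgment before emitting a numeric score."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[{"role": "user",
                   "content": PROMPT.format(article=article, summary=summary)}],
    )
    text = response.choices[0].message.content
    # Pull the numeric rating out of the free-form, explanation-first reply.
    match = re.search(r"Rating:\s*([1-5](?:\.\d+)?)", text)
    if match is None:
        raise ValueError(f"No rating found in model output:\n{text}")
    return float(match.group(1))
```

In practice, ratings like these would be collected for every system output in a meta-evaluation dataset and correlated with the human scores to compare prompting variants.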

References (25)
  1. Falcon-40B: an open large language model with state-of-the-art performance.
  2. A general language assistant as a laboratory for alignment. arXiv preprint arXiv:2112.00861.
  3. Training a helpful and harmless assistant with reinforcement learning from human feedback.
  4. Results of the WMT17 metrics shared task. In Proceedings of the Second Conference on Machine Translation, pages 489–513, Copenhagen, Denmark. Association for Computational Linguistics.
  5. Cheng-Han Chiang and Hung-yi Lee. 2023. Can large language models be an alternative to human evaluations? In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15607–15631, Toronto, Canada. Association for Computational Linguistics.
  6. SummEval: Re-evaluating summarization evaluation. Transactions of the Association for Computational Linguistics, 9:391–409.
  7. Topical-Chat: Towards knowledge-grounded open-domain conversations.
  8. Yvette Graham and Timothy Baldwin. 2014. Testing for significance of increased correlation with human judgment. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 172–176, Doha, Qatar. Association for Computational Linguistics.
  9. Accurate evaluation of segment-level machine translation metrics. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1183–1191, Denver, Colorado. Association for Computational Linguistics.
  10. Teaching machines to read and comprehend. Advances in neural information processing systems, 28.
  11. Is ChatGPT better than human annotators? Potential and limitations of ChatGPT in explaining implicit hate speech. arXiv preprint arXiv:2302.07736.
  12. Large language models are zero-shot reasoners. In Advances in Neural Information Processing Systems.
  13. GPTEval: NLG evaluation using GPT-4 with better human alignment. arXiv preprint arXiv:2303.16634.
  14. Matouš Macháček and Ondřej Bojar. 2014. Results of the WMT14 metrics shared task. In Proceedings of the Ninth Workshop on Statistical Machine Translation, pages 293–301, Baltimore, Maryland, USA. Association for Computational Linguistics.
  15. Shikib Mehri and Maxine Eskenazi. 2020. USR: An unsupervised and reference free evaluation metric for dialog generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 681–707.
  16. OpenAI. 2022. ChatGPT: Optimizing language models for dialogue. Accessed on January 10, 2023.
  17. OpenAI. 2023. GPT-4 technical report.
  18. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744.
  19. Multitask prompted training enables zero-shot task generalization. In International Conference on Learning Representations.
  20. Is ChatGPT a good NLG evaluator? A preliminary study. arXiv preprint arXiv:2303.04048.
  21. Finetuned language models are zero-shot learners. In International Conference on Learning Representations.
  22. Chain of thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems.
  23. Socratic models: Composing zero-shot multimodal reasoning with language. In The Eleventh International Conference on Learning Representations.
  24. BERTScore: Evaluating text generation with BERT. In International Conference on Learning Representations.
  25. Towards a unified multi-dimensional evaluator for text generation. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 2023–2038, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
Authors (2)
  1. Cheng-Han Chiang (18 papers)
  2. Hung-yi Lee (325 papers)
Citations (10)