
Abstract

Using LLMs to evaluate text quality has recently gained popularity. Several prior works explore the idea of using LLMs for evaluation but differ in the details of the evaluation process. In this paper, we analyze LLM evaluation (Chiang and Lee, 2023) and G-Eval (Liu et al., 2023), and we discuss how these details of the evaluation process affect how well the ratings given by LLMs correlate with human ratings. We find that the auto Chain-of-Thought (CoT) used in G-Eval does not always make G-Eval better aligned with human ratings. We also show that forcing the LLM to output only a numeric rating, as in G-Eval, is suboptimal. Finally, we show that asking the LLM to explain its own ratings consistently improves the correlation between ChatGPT's ratings and human ratings and yields state-of-the-art (SoTA) correlations on two meta-evaluation datasets.
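
To make the comparison concrete, the following is a minimal sketch (not the authors' exact prompts or code) of the two prompting strategies the abstract contrasts: forcing the LLM to output only a numeric score versus asking it to explain before giving a rating, plus the kind of correlation computation used in meta-evaluation. The prompt wording, the `query_llm` helper, and the use of SciPy are illustrative assumptions.

```python
# Illustrative sketch (assumptions): prompt templates and correlation computation
# for LLM-based text-quality evaluation. `query_llm` is a hypothetical helper that
# sends a prompt to a chat LLM (e.g., ChatGPT) and returns its reply as a string.
import re
from scipy.stats import kendalltau, spearmanr

# Strategy 1: force the model to output only a numeric rating.
SCORE_ONLY_PROMPT = (
    "Evaluate the coherence of the following summary on a scale of 1-5.\n"
    "Source article:\n{article}\n\nSummary:\n{summary}\n\n"
    "Output only the numeric rating."
)

# Strategy 2: ask the model to analyze/explain its judgment before rating.
ANALYZE_RATE_PROMPT = (
    "Evaluate the coherence of the following summary on a scale of 1-5.\n"
    "Source article:\n{article}\n\nSummary:\n{summary}\n\n"
    "First write a short analysis of the summary's coherence, then give your "
    "rating on the last line as 'Rating: <1-5>'."
)

def parse_rating(reply: str) -> float:
    """Pull the last number out of the model's reply."""
    numbers = re.findall(r"\d+(?:\.\d+)?", reply)
    return float(numbers[-1]) if numbers else float("nan")

def evaluate(samples, prompt_template, query_llm):
    """Score each (article, summary) pair with the given prompt template."""
    return [
        parse_rating(query_llm(prompt_template.format(article=a, summary=s)))
        for a, s in samples
    ]

def meta_evaluate(llm_scores, human_scores):
    """Correlate LLM ratings with human ratings, as in meta-evaluation benchmarks."""
    return {
        "spearman": spearmanr(llm_scores, human_scores).correlation,
        "kendall": kendalltau(llm_scores, human_scores).correlation,
    }
```

Under this sketch, the abstract's finding corresponds to ratings obtained with `ANALYZE_RATE_PROMPT` correlating better with human judgments than those obtained with the score-only prompt.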


References
  1. Falcon-40B: an open large language model with state-of-the-art performance
  2. A General Language Assistant as a Laboratory for Alignment
  3. Training a helpful and harmless assistant with reinforcement learning from human feedback
  4. Results of the WMT17 metrics shared task. In Proceedings of the Second Conference on Machine Translation, pages 489–513, Copenhagen, Denmark. Association for Computational Linguistics.
  5. Cheng-Han Chiang and Hung-yi Lee. 2023. Can large language models be an alternative to human evaluations? In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15607–15631, Toronto, Canada. Association for Computational Linguistics.
  6. SummEval: Re-evaluating summarization evaluation. Transactions of the Association for Computational Linguistics, 9:391–409.
  7. Topical-Chat: Towards knowledge-grounded open-domain conversations
  8. Yvette Graham and Timothy Baldwin. 2014. Testing for significance of increased correlation with human judgment. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 172–176, Doha, Qatar. Association for Computational Linguistics.
  9. Accurate evaluation of segment-level machine translation metrics. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1183–1191, Denver, Colorado. Association for Computational Linguistics.
  10. Teaching machines to read and comprehend. Advances in neural information processing systems, 28.
  11. Is ChatGPT better than Human Annotators? Potential and Limitations of ChatGPT in Explaining Implicit Hate Speech
  12. Large language models are zero-shot reasoners. In Advances in Neural Information Processing Systems.
  13. G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment
  14. Matouš Macháček and Ondřej Bojar. 2014. Results of the WMT14 metrics shared task. In Proceedings of the Ninth Workshop on Statistical Machine Translation, pages 293–301, Baltimore, Maryland, USA. Association for Computational Linguistics.
  15. Shikib Mehri and Maxine Eskenazi. 2020. USR: An unsupervised and reference-free evaluation metric for dialog generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 681–707.
  16. OpenAI. 2022. ChatGPT: Optimizing language models for dialogue. Accessed on January 10.
  17. OpenAI. 2023. GPT-4 technical report.
  18. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744.
  19. Multitask prompted training enables zero-shot task generalization. In International Conference on Learning Representations.
  20. Is ChatGPT a Good NLG Evaluator? A Preliminary Study
  21. Finetuned language models are zero-shot learners. In International Conference on Learning Representations.
  22. Chain of thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems.
  23. Socratic models: Composing zero-shot multimodal reasoning with language. In The Eleventh International Conference on Learning Representations.
  24. BERTScore: Evaluating text generation with BERT. In International Conference on Learning Representations.
  25. Towards a unified multi-dimensional evaluator for text generation. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 2023–2038, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
