Leveraging Large Language Models for NLG Evaluation: Advances and Challenges (2401.07103v2)

Published 13 Jan 2024 in cs.CL

Abstract: In the rapidly evolving domain of Natural Language Generation (NLG) evaluation, introducing LLMs has opened new avenues for assessing generated content quality, e.g., coherence, creativity, and context relevance. This paper aims to provide a thorough overview of leveraging LLMs for NLG evaluation, a burgeoning area that lacks a systematic analysis. We propose a coherent taxonomy for organizing existing LLM-based evaluation metrics, offering a structured framework to understand and compare these methods. Our detailed exploration includes critically assessing various LLM-based methodologies, as well as comparing their strengths and limitations in evaluating NLG outputs. By discussing unresolved challenges, including bias, robustness, domain-specificity, and unified evaluation, this paper seeks to offer insights to researchers and advocate for fairer and more advanced NLG evaluation techniques.

Introduction to LLM-based NLG Evaluation

Natural Language Generation (NLG) is a critical aspect of modern AI-driven communication, with applications spanning fields such as machine translation and content creation. With the advancement of LLMs, the quality of generated text has improved dramatically, which in turn demands robust evaluation methods that can accurately assess that quality. Traditional NLG evaluation metrics often fail to capture semantic coherence or to align with human judgments. In contrast, the emergent capabilities of LLMs offer promising new methods for NLG evaluation, with improved interpretability and closer alignment with human preferences.
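To make this limitation concrete, here is a minimal sketch (plain Python, with invented example sentences) of how a surface-overlap score in the spirit of BLEU or ROUGE can under-score a faithful paraphrase while over-scoring a defective near-copy, which is exactly the kind of mismatch with human judgment that motivates LLM-based evaluation:

```python
from collections import Counter

def unigram_f1(candidate: str, reference: str) -> float:
    """Crude surface-overlap score: F1 over shared word tokens."""
    cand, ref = Counter(candidate.lower().split()), Counter(reference.lower().split())
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

reference = "The cat sat on the mat."
paraphrase = "A feline was resting on the rug."   # semantically faithful, lexically different
degenerate = "The cat sat on the mat mat."        # defective repetition, high word overlap

print(unigram_f1(paraphrase, reference))   # low score despite good meaning
print(unigram_f1(degenerate, reference))   # high score despite the defect
```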

A Structured Framework for Evaluation

This paper presents a detailed overview of utilizing LLMs for NLG evaluation and establishes a formal taxonomy for categorizing LLM-based evaluation metrics. By identifying the core dimensions of evaluation tasks, references, and functions, it offers a structured perspective that clarifies how different approaches relate. The paper also examines the role of LLMs as evaluators, exploring how they can produce evaluation judgments directly, whether as continuous scores, likelihood estimates, or pairwise comparisons. The resulting taxonomy maps the landscape of LLM-based evaluators, distinguishing generative-based methods from matching-based approaches.
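To ground the taxonomy, the sketch below illustrates two of the generative protocols mentioned above, direct continuous scoring and pairwise comparison. The call_llm helper is a hypothetical stand-in for any chat-style LLM API, and the prompt wording and 1-to-10 scale are illustrative assumptions rather than details taken from the paper:

```python
def call_llm(prompt: str) -> str:
    """Placeholder for a chat-style LLM call; swap in your provider's client."""
    raise NotImplementedError

def score_output(source: str, output: str, aspect: str = "coherence") -> float:
    """Generative, reference-free scoring: ask the model for a 1-10 rating."""
    prompt = (
        f"Rate the {aspect} of the following summary on a scale of 1 to 10.\n"
        f"Source document:\n{source}\n\nSummary:\n{output}\n\n"
        "Reply with a single number."
    )
    return float(call_llm(prompt).strip())

def compare_outputs(source: str, output_a: str, output_b: str) -> str:
    """Pairwise comparison: ask the model which candidate is better overall."""
    prompt = (
        "Which summary of the source document is better overall?\n"
        f"Source document:\n{source}\n\n"
        f"Summary A:\n{output_a}\n\nSummary B:\n{output_b}\n\n"
        "Reply with exactly 'A' or 'B'."
    )
    return call_llm(prompt).strip()
```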

Advancement and Meta-Evaluation

Emphasizing alignment with human judgment, the survey reviews meta-evaluation benchmarks across diverse NLG tasks, including machine translation and text summarization. These benchmarks provide important testbeds for evaluator efficacy by incorporating human annotations and measuring agreement with human preferences. The paper also traces the evolution of LLMs on general generation tasks and outlines the development of multi-scenario benchmarks that give a richer picture of evaluator performance.
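A minimal sketch of the usual meta-evaluation recipe follows: collect human ratings and automatic evaluator scores for the same outputs, then measure rank agreement with Spearman or Kendall correlation (SciPy assumed; the numbers are invented for illustration):

```python
from scipy.stats import kendalltau, spearmanr

# Hypothetical per-output scores, e.g. from SummEval-style annotations.
human_ratings    = [4.5, 2.0, 3.5, 5.0, 1.5, 4.0]   # averaged human judgments
evaluator_scores = [8.0, 4.0, 6.5, 9.0, 3.0, 7.5]   # scores from an LLM evaluator

# Segment-level agreement: does the evaluator rank outputs the way humans do?
rho, _ = spearmanr(human_ratings, evaluator_scores)
tau, _ = kendalltau(human_ratings, evaluator_scores)
print(f"Spearman rho = {rho:.3f}, Kendall tau = {tau:.3f}")
```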

The Road Ahead for NLG Evaluation

Despite the progress, several challenges linger in the domain of LLM-based NLG evaluation, such as biases inherent in LLM evaluators, their robustness against adversarial inputs, the need for domain-specific evaluation, and the quest for unified evaluation across a variety of complex tasks. Addressing these challenges is crucial for advancing the field and developing more reliable and effective evaluators. The paper concludes by advocating for future research to tackle these open problems and propel the NLG evaluation landscape forward.
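As one illustration of how such biases can be probed, a common sanity check for pairwise LLM judges is to repeat each comparison with the candidate order swapped and keep only order-stable verdicts. The sketch below is a hypothetical helper that reuses the compare_outputs function from the earlier example:

```python
from typing import Optional

# Assumes compare_outputs(source, a, b) -> "A" or "B", as in the earlier sketch.
def consistent_preference(source: str, output_a: str, output_b: str) -> Optional[str]:
    """Ask the pairwise judge twice, swapping candidate order the second time.
    Return the winner only if both orderings agree; otherwise return None to
    flag a verdict that flips with position (a symptom of position bias)."""
    first = compare_outputs(source, output_a, output_b)    # candidates as given
    second = compare_outputs(source, output_b, output_a)   # candidate order swapped
    unswapped = {"A": "B", "B": "A"}.get(second)           # map back to original labels
    return first if first == unswapped else None
```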

Authors (8)
  1. Zhen Li
  2. Xiaohan Xu
  3. Tao Shen
  4. Can Xu
  5. Jia-Chen Gu
  6. Chongyang Tao
  7. Yuxuan Lai
  8. Shuai Ma