
CheckEval: A reliable LLM-as-a-Judge framework for evaluating text generation using checklists (2403.18771v2)

Published 27 Mar 2024 in cs.CL

Abstract: Existing LLM-as-a-Judge approaches for evaluating text generation suffer from rating inconsistencies, with low agreement and high rating variance across different evaluator models. We attribute this to subjective evaluation criteria combined with Likert-scale scoring in existing protocols. To address this issue, we introduce CheckEval, a checklist-based evaluation framework that improves rating reliability via decomposed binary questions. Through experiments with 12 evaluator models across multiple datasets, we first demonstrate that CheckEval strongly correlates with human judgments, improving the average correlation with human judgments by 0.10. More importantly, CheckEval dramatically improves the average agreement across evaluator models by 0.45 and reduces score variance. Furthermore, CheckEval scores are more interpretable because the framework decomposes evaluation criteria into traceable binary decisions, allowing analysis of the specific attributes that drive quality judgments.
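The checklist idea described in the abstract can be illustrated with a minimal sketch: each evaluation criterion is decomposed into binary yes/no questions, an evaluator model answers each question, and the criterion score is the fraction of "yes" answers. The checklist questions, prompt format, and aggregation below are illustrative assumptions rather than the paper's exact protocol, and ask_judge is a hypothetical stand-in for any evaluator LLM wrapped to return a boolean.

    # Sketch of checklist-style LLM-as-a-Judge scoring (assumed details, not the
    # paper's exact questions, prompts, or aggregation).
    from typing import Callable, Dict, List

    # Hypothetical checklist: each criterion is decomposed into binary questions.
    CHECKLIST: Dict[str, List[str]] = {
        "consistency": [
            "Does the generated text avoid stating facts that contradict the source?",
            "Is every named entity in the generated text supported by the source?",
        ],
        "coherence": [
            "Do the sentences follow a logical order?",
            "Is the text free of abrupt topic shifts?",
        ],
    }

    def checklist_score(
        source: str,
        output: str,
        ask_judge: Callable[[str], bool],  # wraps an evaluator LLM; True means "yes"
    ) -> Dict[str, float]:
        """Score `output` against `source` as the fraction of yes answers per criterion."""
        scores: Dict[str, float] = {}
        for criterion, questions in CHECKLIST.items():
            header = (
                f"Source:\n{source}\n\n"
                f"Generated text:\n{output}\n\n"
                "Answer yes or no: "
            )
            answers = [ask_judge(header + q) for q in questions]
            scores[criterion] = sum(answers) / len(answers)  # proportion of "yes"
        return scores

Because every binary decision is retained, a low criterion score can be traced back to the specific questions answered "no", which is the interpretability benefit the abstract describes.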

