
Calibrating LLM-Based Evaluator (2309.13308v1)

Published 23 Sep 2023 in cs.CL

Abstract: Recent advancements in large language models (LLMs) on language modeling and emergent capabilities make them a promising reference-free evaluator of natural language generation quality, and a competent alternative to human evaluation. However, hindered by the closed-source or high computational demand to host and tune, there is a lack of practice to further calibrate an off-the-shelf LLM-based evaluator towards better human alignment. In this work, we propose AutoCalibrate, a multi-stage, gradient-free approach to automatically calibrate and align an LLM-based evaluator toward human preference. Instead of explicitly modeling human preferences, we first implicitly encompass them within a set of human labels. Then, an initial set of scoring criteria is drafted by the LLM itself, leveraging in-context learning on different few-shot examples. To further calibrate this set of criteria, we select the best performers and re-draft them with self-refinement. Our experiments on multiple text quality evaluation datasets illustrate a significant improvement in correlation with expert evaluation through calibration. Our comprehensive qualitative analysis conveys insightful intuitions and observations on the essence of effective scoring criteria.

Calibrating LLM-Based Evaluator

The paper examines the calibration of LLM-based evaluators used to assess natural language generation (NLG) quality. The authors introduce AutoCalibrate, a multi-stage, gradient-free approach designed to align LLM-based evaluators more closely with human preferences. The work addresses notable gaps in current LLM-based evaluation practice: such evaluators are sensitive to prompt format and wording, and they drift from human judgment when the scoring criteria are left ambiguous.

Key Methodology

AutoCalibrate employs a novel calibration process comprising several stages:

  1. Data Labeling as Human Preference: Human preference is indirectly encoded through a set of expertly labeled sample-score pairs. This serves as the benchmark for aligning the LLM-based evaluator, ensuring that it mirrors human judgment more accurately.
  2. Criteria Drafting: Utilizing the powerful instruction-following capacity of LLMs, a diverse initial set of scoring criteria is generated through few-shot in-context examples. This step emphasizes leveraging the inherent learning capabilities of LLMs to infer scoring criteria, which are crucial for evaluating NLG tasks without the need for extensive reference outputs.
  3. Criteria Revisiting and Refinement: Criteria quality is improved by evaluating each draft against the expert labels, selecting the top performers, and re-drafting them with self-refinement, prompted with examples where the model's scores diverged from the human labels. This leverages the LLM's self-refinement capability to iteratively improve the scoring guidelines; a minimal sketch of the full loop follows this list.
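
A minimal Python sketch of this calibration loop is given below. It assumes a generic llm(prompt) -> str completion helper; the function names, prompt wording, and 1-5 scoring scale are illustrative assumptions, not the authors' released code.

    import random
    from scipy.stats import spearmanr  # rank correlation against expert labels

    def draft_criteria(llm, labeled, n_drafts=8, shots=4):
        """Stage 2 (criteria drafting): infer candidate scoring criteria via
        in-context learning on different few-shot subsets of the labeled pool."""
        drafts = []
        for _ in range(n_drafts):
            demo = random.sample(labeled, shots)
            prompt = (
                "Below are texts with expert quality scores (1-5).\n"
                + "\n".join(f"TEXT: {x}\nSCORE: {y}" for x, y in demo)
                + "\nWrite scoring criteria that would explain these scores."
            )
            drafts.append(llm(prompt))
        return drafts

    def agreement(llm, criteria, labeled):
        """Score each labeled sample under the candidate criteria and measure
        rank correlation with the expert scores (assumes the model replies
        with a bare number)."""
        preds = [
            float(llm(f"Scoring criteria:\n{criteria}\n\nScore this text from 1-5:\n{x}"))
            for x, _ in labeled
        ]
        rho, _ = spearmanr(preds, [y for _, y in labeled])
        return rho

    def autocalibrate(llm, labeled, keep=2):
        """Stage 3 (revisiting and refinement): keep the best-performing drafts,
        re-draft them with a self-refinement prompt, and return the criteria
        that correlate best with the human labels."""
        drafts = draft_criteria(llm, labeled)
        best = sorted(drafts, key=lambda c: agreement(llm, c, labeled), reverse=True)[:keep]
        refined = [
            llm(
                "The following scoring criteria sometimes disagree with expert scores. "
                f"Rewrite them so they explain the expert scores better:\n{c}"
            )
            for c in best
        ]
        return max(best + refined, key=lambda c: agreement(llm, c, labeled))

Note that the sketch collapses the paper's refinement step, which is prompted with the specific examples where model and expert scores disagreed, into a single generic rewrite prompt; the overall structure (draft, evaluate against labels, select, refine) is what matters here.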

Experimental Evaluation

The AutoCalibrate framework was tested on various NLG tasks, including text summarization, data-to-text generation, and hallucination evaluation across datasets such as NewsRoom, SummEval, SFRES, SFHOT, and QAGS. The focus was on enhancing the correlation between LLM-generated scores and expert human evaluations.

  • On text summarization tasks, AutoCalibrate demonstrated substantial improvements over both traditional metrics like ROUGE and advanced LLM-based evaluations without calibration, indicating that explicit criteria considerably boost evaluator performance.
  • In data-to-text generation evaluation, the framework notably surpassed other model-based methods, illustrating its efficacy in aligning LLM evaluations with human judgment.
  • It also showed robustness across datasets when evaluating hallucination, further suggesting its applicability to a range of NLG evaluation settings (see the measurement sketch after this list).
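
As a pointer to how such improvements are quantified, the sketch below computes the rank-correlation statistics commonly reported on these benchmarks (Spearman's rho and Kendall's tau) between evaluator scores and expert annotations. The numbers are made up, and this is not the authors' evaluation script.

    from scipy.stats import kendalltau, spearmanr

    def correlations(model_scores, human_scores):
        """Rank correlation between an evaluator's scores and expert
        annotations for one quality dimension."""
        rho, _ = spearmanr(model_scores, human_scores)
        tau, _ = kendalltau(model_scores, human_scores)
        return {"spearman": rho, "kendall": tau}

    # Toy example with invented scores for five summaries
    print(correlations([4.0, 2.5, 3.0, 5.0, 1.5], [5, 2, 3, 5, 1]))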

Implications and Future Directions

The findings underscore the potential of using LLMs as robust, reference-free evaluators once adequately calibrated. The gradient-free nature of AutoCalibrate supports its application in constrained environments where access to model weights or additional training is impractical.

Further research could extend AutoCalibrate to a broader range of language tasks, improve the criteria-induction techniques, or strengthen the self-refinement strategy to push alignment further. The framework is a foundational step toward more accurate and reliable automatic NLG evaluation that leverages state-of-the-art LLM capabilities while staying aligned with human evaluative standards.

Authors (9)
  1. Yuxuan Liu
  2. Tianchi Yang
  3. Shaohan Huang
  4. Zihan Zhang
  5. Haizhen Huang
  6. Furu Wei
  7. Weiwei Deng
  8. Feng Sun
  9. Qi Zhang