Calibrating LLM-Based Evaluator

(arXiv:2309.13308)
Published Sep 23, 2023 in cs.CL

Abstract

Recent advancements in large language models (LLMs) in language modeling and emergent capabilities make them a promising reference-free evaluator of natural language generation quality, and a competent alternative to human evaluation. However, because many strong LLMs are closed-source or computationally demanding to host and tune, there has been little work on further calibrating an off-the-shelf LLM-based evaluator toward better human alignment. In this work, we propose AutoCalibrate, a multi-stage, gradient-free approach to automatically calibrate and align an LLM-based evaluator toward human preference. Instead of explicitly modeling human preferences, we first implicitly encompass them within a set of human labels. Then, an initial set of scoring criteria is drafted by the language model itself, leveraging in-context learning on different few-shot examples. To further calibrate this set of criteria, we select the best performers and re-draft them with self-refinement. Our experiments on multiple text quality evaluation datasets show a significant improvement in correlation with expert evaluation after calibration. Our comprehensive qualitative analysis conveys insightful intuitions and observations on the essence of effective scoring criteria.
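The calibration pipeline the abstract describes (draft criteria via in-context learning, select the best performers, then self-refine) can be sketched as a simple gradient-free loop. This is a minimal illustration, not the authors' implementation: the `llm` callable is a hypothetical stand-in for an API call to a model such as GPT-4, the prompts are placeholders, and the alignment measure here is exact-match agreement rather than the rank-correlation metrics (e.g. Spearman) used in the paper.

```python
import random

def draft_criteria(llm, labeled_examples, n_drafts=4, shots=2):
    """Stage 1: have the model draft candidate scoring criteria,
    using different few-shot samples of human-labeled examples."""
    drafts = []
    for _ in range(n_drafts):
        prompt_shots = random.sample(labeled_examples, shots)
        drafts.append(llm(f"Infer a scoring rubric from: {prompt_shots}"))
    return drafts

def evaluate_alignment(llm, criterion, labeled_examples):
    """Score each example under a candidate criterion and measure
    agreement with the human labels (exact-match fraction here;
    the paper uses correlation with expert scores)."""
    hits = 0
    for text, human_score in labeled_examples:
        model_score = llm(f"Score '{text}' using rubric: {criterion}")
        hits += int(model_score == human_score)
    return hits / len(labeled_examples)

def autocalibrate(llm, labeled_examples, keep=2, rounds=1):
    """Multi-stage, gradient-free loop: draft -> select best -> self-refine."""
    candidates = draft_criteria(llm, labeled_examples)
    for _ in range(rounds):
        candidates.sort(
            key=lambda c: evaluate_alignment(llm, c, labeled_examples),
            reverse=True,
        )
        best = candidates[:keep]
        # Stage 2: ask the model to re-draft its best-performing criteria.
        refined = [llm(f"Refine this rubric: {c}") for c in best]
        candidates = best + refined
    candidates.sort(
        key=lambda c: evaluate_alignment(llm, c, labeled_examples),
        reverse=True,
    )
    return candidates[0]
```

Because no model weights are touched, the whole procedure works with a closed-source evaluator: the only "parameters" being optimized are the natural-language scoring criteria themselves.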

