PRD: Peer Rank and Discussion Improve Large Language Model based Evaluations

(2307.02762)
Published Jul 6, 2023 in cs.CL and cs.AI

Abstract

Nowadays, the quality of responses generated by different modern LLMs is hard to evaluate and compare automatically. Recent studies suggest and predominantly use LLMs as a reference-free metric for open-ended question answering. More specifically, they use the recognized "strongest" LLM as the evaluator, which conducts pairwise comparisons of candidate models' answers and provides a ranking score. However, this intuitive method has multiple problems, such as bringing in self-enhancement (favoring its own answers) and positional bias. We draw insights and lessons from the educational domain (Cho and MacArthur, 2011; Walsh, 2014) to improve LLM-based evaluations. Specifically, we propose (1) the peer rank (PR) algorithm, which takes into account each peer LLM's pairwise preferences over all answer pairs and outputs a final ranking of models; and (2) peer discussion (PD), where we prompt two LLMs to discuss and try to reach a mutual agreement on the preference between two answers. We conduct experiments on two benchmark datasets. We find that our approaches achieve higher accuracy and align better with human judgments, respectively. Interestingly, PR can induce a relatively accurate self-ranking of models under the anonymous setting, where each model's name is unrevealed. Our work provides space to explore evaluating models that are hard for humans to compare.
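
The peer rank idea sketched in the abstract can be illustrated in a few lines of code. The snippet below is a minimal, hypothetical sketch of the general scheme, not the authors' exact algorithm: every model reviews pairs of the other models' answers, each reviewer's votes count in proportion to its own current weight, and the weights are iterated until they stop changing. The function name `peer_rank`, the input format, and the convergence criterion are all assumptions made for this illustration.

```python
from collections import defaultdict


def peer_rank(models, preferences, iterations=100, tol=1e-6):
    """Aggregate pairwise peer preferences into a weighted model ranking.

    models: names of the LLMs, each acting as both contestant and reviewer.
    preferences: dict mapping (reviewer, model_a, model_b) -> fraction of
        questions on which `reviewer` preferred model_a's answer over
        model_b's (a hypothetical input format for this sketch).
    """
    # Start from uniform reviewer weights.
    weights = {m: 1.0 / len(models) for m in models}
    for _ in range(iterations):
        scores = defaultdict(float)
        for (reviewer, a, b), frac_a_wins in preferences.items():
            # A reviewer's vote counts in proportion to its current weight,
            # so models whose own answers rank higher get a larger say.
            scores[a] += weights[reviewer] * frac_a_wins
            scores[b] += weights[reviewer] * (1.0 - frac_a_wins)
        total = sum(scores.values()) or 1.0
        new_weights = {m: scores[m] / total for m in models}
        if max(abs(new_weights[m] - weights[m]) for m in models) < tol:
            return new_weights  # converged
        weights = new_weights
    return weights


# Toy example with three anonymized contestants and made-up win fractions.
prefs = {
    ("model_c", "model_a", "model_b"): 0.7,  # reviewer C prefers A over B 70% of the time
    ("model_b", "model_a", "model_c"): 0.8,
    ("model_a", "model_b", "model_c"): 0.6,
}
print(peer_rank(["model_a", "model_b", "model_c"], prefs))
```

Under this kind of weighted aggregation, reviewers whose own answers win more comparisons end up carrying more evaluator weight, which is one plausible reading of how PR can induce a self-ranking of models even when model names are anonymized.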

References
  1. A General Language Assistant as a Laboratory for Alignment
  2. Benchmarking Foundation Models with Language-Model-as-an-Examiner
  3. Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72.
  4. Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality. https://lmsys.org/blog/2023-03-30-vicuna/
  5. Kwangsu Cho and Charles MacArthur. 2011. Learning by reviewing. Journal of Educational Psychology, 103(1):73.
  6. Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge
  7. QLoRA: Efficient Finetuning of Quantized LLMs
  8. AlpacaFarm: A Simulation Framework for Methods that Learn from Human Feedback
  9. Evaluating coherence in dialogue systems using entailment. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3806–3812.
  10. Arpad E Elo. 1967. The proposed USCF rating system, its development, theory, and applications. Chess Life, 22(8):242–247.
  11. QAFactEval: Improved QA-based factual consistency evaluation for summarization. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2587–2601.
  12. ELI5: Long form question answering. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3558–3567.
  13. Joseph L Fleiss. 1971. Measuring nominal scale agreement among many raters. Psychological Bulletin, 76(5):378.
  14. Statistical methods for rates and proportions. John Wiley & Sons.
  15. GPTScore: Evaluate as You Desire
  16. Improving Language Model Negotiation with Self-Play and In-Context Learning from AI Feedback
  17. Enabling Large Language Models to Generate Text with Citations
  18. Measuring massive multitask language understanding. In International Conference on Learning Representations.
  19. Karen Sparck Jones and Julia R Galliers. 1995. Evaluating natural language processing systems: An analysis and review.
  20. Hurdles to progress in long-form question answering. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4940–4957, Online. Association for Computational Linguistics.
  21. Evaluating the factual consistency of abstractive text summarization. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 9332–9346.
  22. CAMEL: Communicative Agents for "Mind" Exploration of Large Language Model Society
  23. Holistic Evaluation of Language Models
  24. Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81.
  25. G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment
  26. WebGPT: Browser-assisted question-answering with human feedback
  27. Rethinking feedback practices in higher education: a peer review perspective. Assessment & Evaluation in Higher Education, 39(1):102–122.
  28. OpenAI. 2022. WebGPT annotation guidelines.
  29. GPT-4 Technical Report
  30. The PageRank citation ranking: Bringing order to the web. Technical report, Stanford InfoLab.
  31. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318.
  32. Generative Agents: Interactive Simulacra of Human Behavior
  33. Ehud Reiter and Anja Belz. 2009. An investigation into the validity of some metrics for automatically evaluating natural language generation systems. Computational Linguistics, 35(4):529–558.
  34. Toby Walsh. 2014. The PeerRank method for peer assessment. In Proceedings of the Twenty-First European Conference on Artificial Intelligence, pages 909–914.
  35. Asking and answering questions to evaluate the factual consistency of summaries. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5008–5020.
  36. Large Language Models are not Fair Evaluators
  37. How Far Can Camels Go? Exploring the State of Instruction Tuning on Open Resources
  38. A critical evaluation of evaluations for long-form question answering. In Proceedings of ACL.
  39. Benefits of peer review on students’ writing. Psychology Learning & Teaching, 18(3):317–325.
  40. BERTScore: Evaluating Text Generation with BERT
  41. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
  42. Towards a unified multi-dimensional evaluator for text generation. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 2023–2038.
  43. LIMA: Less Is More for Alignment
