PRD: Peer Rank and Discussion Improve Large Language Model based Evaluations (2307.02762v2)

Published 6 Jul 2023 in cs.CL and cs.AI

Abstract: Nowadays, the quality of responses generated by different modern LLMs is hard to evaluate and compare automatically. Recent studies suggest and predominantly use LLMs for reference-free evaluation of open-ended question answering. More specifically, they use the recognized "strongest" LLM as the evaluator, which conducts pairwise comparisons of candidate models' answers and provides a ranking score. However, this intuitive method has multiple problems, such as bringing in self-enhancement (favoring its own answers) and positional bias. We draw insights and lessons from the educational domain (Cho & MacArthur, 2011; Walsh, 2014) to improve LLM-based evaluations. Specifically, we propose (1) the peer rank (PR) algorithm that takes into account each peer LLM's pairwise preferences of all answer pairs, and outputs a final ranking of models; and (2) peer discussion (PD), where we prompt two LLMs to discuss and try to reach a mutual agreement on the preferences of two answers. We conduct experiments on two benchmark datasets. We find that our approaches achieve higher accuracy and align better with human judgments. Interestingly, PR can induce a relatively accurate self-ranking of models under the anonymous setting, where each model's name is unrevealed. Our work provides space to explore evaluating models that are hard to compare for humans.
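
The peer rank (PR) idea described in the abstract, aggregating every reviewer model's pairwise preferences and giving more weight to reviewers that are themselves ranked highly, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name peer_rank, the iterative re-weighting rule, and the toy judgments below are assumptions made only to show the general shape of such an aggregation.

```python
from collections import defaultdict

def peer_rank(pairwise_wins, models, iterations=10):
    """Aggregate reviewer preferences into a ranking (illustrative sketch).

    pairwise_wins[reviewer][(a, b)] = 1.0 if `reviewer` prefers answer a over b,
    0.0 if it prefers b, and 0.5 for a tie.
    """
    # Start with uniform reviewer weights.
    weights = {m: 1.0 / len(models) for m in models}
    for _ in range(iterations):
        # Weighted win credit for each candidate across all reviewers and pairs.
        scores = defaultdict(float)
        for reviewer, prefs in pairwise_wins.items():
            for (a, b), outcome in prefs.items():
                scores[a] += weights[reviewer] * outcome
                scores[b] += weights[reviewer] * (1.0 - outcome)
        # Feed the normalized scores back in as the next round's reviewer weights,
        # so higher-ranked models get more say as reviewers.
        total = sum(scores.values()) or 1.0
        weights = {m: scores.get(m, 0.0) / total for m in models}
    return sorted(models, key=lambda m: weights[m], reverse=True)

# Toy example with three hypothetical models reviewing each other's answers.
models = ["model_a", "model_b", "model_c"]
judgments = {
    "model_a": {("model_b", "model_c"): 1.0},
    "model_b": {("model_a", "model_c"): 1.0},
    "model_c": {("model_a", "model_b"): 0.5},
}
print(peer_rank(judgments, models))
```

In this toy run the weights converge after a couple of iterations. The self-referential weighting is the point of interest here: it connects to the abstract's observation that PR can induce a relatively accurate self-ranking of models even in the anonymous setting.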

References (43)
  1. A general language assistant as a laboratory for alignment. arXiv preprint arXiv:2112.00861.
  2. Benchmarking foundation models with language-model-as-an-examiner. arXiv preprint arXiv:2306.04181.
  3. Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72.
  4. Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality. https://lmsys.org/blog/2023-03-30-vicuna/.
  5. Kwangsu Cho and Charles MacArthur. 2011. Learning by reviewing. Journal of Educational Psychology, 103(1):73.
  6. Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge. arXiv preprint arXiv:1803.05457.
  7. QLoRA: Efficient finetuning of quantized LLMs. arXiv preprint arXiv:2305.14314.
  8. AlpacaFarm: A simulation framework for methods that learn from human feedback. arXiv preprint arXiv:2305.14387.
  9. Evaluating coherence in dialogue systems using entailment. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3806–3812.
  10. Arpad E. Elo. 1967. The proposed USCF rating system, its development, theory, and applications. Chess Life, 22(8):242–247.
  11. QAFactEval: Improved QA-based factual consistency evaluation for summarization. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2587–2601.
  12. ELI5: Long form question answering. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3558–3567.
  13. Joseph L. Fleiss. 1971. Measuring nominal scale agreement among many raters. Psychological Bulletin, 76(5):378.
  14. Statistical methods for rates and proportions. John Wiley & Sons.
  15. GPTScore: Evaluate as you desire. arXiv preprint arXiv:2302.04166.
  16. Improving language model negotiation with self-play and in-context learning from AI feedback. arXiv preprint arXiv:2305.10142.
  17. Enabling large language models to generate text with citations. arXiv preprint arXiv:2305.14627.
  18. Measuring massive multitask language understanding. In International Conference on Learning Representations.
  19. Karen Sparck Jones and Julia R Galliers. 1995. Evaluating natural language processing systems: An analysis and review.
  20. Hurdles to progress in long-form question answering. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4940–4957, Online. Association for Computational Linguistics.
  21. Evaluating the factual consistency of abstractive text summarization. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 9332–9346.
  22. CAMEL: Communicative agents for "mind" exploration of large scale language model society. arXiv preprint arXiv:2303.17760.
  23. Holistic evaluation of language models. arXiv preprint arXiv:2211.09110.
  24. Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81.
  25. GPTEval: NLG evaluation using GPT-4 with better human alignment. arXiv preprint arXiv:2303.16634.
  26. WebGPT: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332.
  27. Rethinking feedback practices in higher education: A peer review perspective. Assessment & Evaluation in Higher Education, 39(1):102–122.
  28. OpenAI. 2022. WebGPT annotation guidelines.
  29. OpenAI. 2023. GPT-4 technical report. arXiv preprint arXiv:2303.08774.
  30. The PageRank citation ranking: Bringing order to the web. Technical report, Stanford InfoLab.
  31. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318.
  32. Generative agents: Interactive simulacra of human behavior. arXiv preprint arXiv:2304.03442.
  33. Ehud Reiter and Anja Belz. 2009. An investigation into the validity of some metrics for automatically evaluating natural language generation systems. Computational Linguistics, 35(4):529–558.
  34. Toby Walsh. 2014. The PeerRank method for peer assessment. In Proceedings of the Twenty-first European Conference on Artificial Intelligence, pages 909–914.
  35. Asking and answering questions to evaluate the factual consistency of summaries. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5008–5020.
  36. Large language models are not fair evaluators.
  37. How far can camels go? Exploring the state of instruction tuning on open resources. arXiv preprint arXiv:2306.04751.
  38. A critical evaluation of evaluations for long-form question answering. In Proceedings of ACL.
  39. Benefits of peer review on students’ writing. Psychology Learning & Teaching, 18(3):317–325.
  40. BERTScore: Evaluating text generation with BERT. arXiv preprint arXiv:1904.09675.
  41. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena.
  42. Towards a unified multi-dimensional evaluator for text generation. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 2023–2038.
  43. LIMA: Less is more for alignment. arXiv preprint arXiv:2305.11206.
Authors (3)
  1. Ruosen Li (7 papers)
  2. Teerth Patel (3 papers)
  3. Xinya Du (41 papers)
Citations (78)