
Sample-Efficient Human Evaluation of Large Language Models via Maximum Discrepancy Competition (2404.08008v2)

Published 10 Apr 2024 in cs.LG, cs.CL, and cs.HC

Abstract: Reliable evaluation of LLMs is impeded by two key challenges: objective metrics often fail to reflect human perception of natural language, and exhaustive human labeling is prohibitively expensive. Here, we propose a sample-efficient human evaluation method for LLMs based on the principle of MAximum Discrepancy (MAD) Competition. Our method automatically and adaptively selects a compact set of input instructions that maximize semantic discrepancy between pairs of LLM responses. Human evaluators then perform three-alternative forced choices on these paired responses, which are aggregated into a global ranking using Elo rating. We apply our approach to compare eight widely used LLMs across four tasks: scientific knowledge understanding, mathematical reasoning, creative and functional writing, and code generation and explanation. Experimental results show that our sample-efficient evaluation method recovers "gold-standard" model rankings with a handful of MAD-selected instructions, reveals respective strengths and weaknesses of each LLM, and offers nuanced insights to guide future LLM development. Code is available at https://github.com/weiji-Feng/MAD-Eval.
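To make the selection step more concrete, below is a minimal sketch (in Python) of how MAD-style instruction selection could be approximated, assuming the two models' responses to a candidate instruction pool are already embedded with a sentence encoder and that semantic discrepancy is measured as cosine distance between the paired response embeddings. The function name, interface, and the choice of cosine distance are illustrative assumptions, not the authors' implementation.

import numpy as np

def select_mad_instructions(emb_a, emb_b, k=10):
    # emb_a, emb_b: (n_instructions, dim) response embeddings for models A and B.
    # Normalize rows so the dot product gives cosine similarity.
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    cosine_sim = np.sum(a * b, axis=1)   # per-instruction response similarity
    discrepancy = 1.0 - cosine_sim       # higher value = stronger disagreement
    # Return indices of the k instructions whose paired responses disagree most;
    # these are the instructions that would be sent to human evaluators.
    return np.argsort(-discrepancy)[:k]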

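Similarly, the aggregation step can be sketched as standard Elo updates over the three-alternative forced-choice outcomes (model A preferred, model B preferred, or tie). The K-factor, initial rating, and data layout below are illustrative assumptions rather than the paper's exact configuration.

def elo_ranking(judgments, models, k_factor=32.0, initial_rating=1000.0):
    # judgments: iterable of (model_a, model_b, outcome), outcome in {"a", "b", "tie"}.
    # k_factor and initial_rating are illustrative defaults, not the paper's settings.
    ratings = {m: initial_rating for m in models}
    for model_a, model_b, outcome in judgments:
        r_a, r_b = ratings[model_a], ratings[model_b]
        expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))   # expected score for A
        score_a = {"a": 1.0, "b": 0.0, "tie": 0.5}[outcome]      # observed score for A
        ratings[model_a] = r_a + k_factor * (score_a - expected_a)
        ratings[model_b] = r_b + k_factor * ((1.0 - score_a) - (1.0 - expected_a))
    # Sort models from highest to lowest rating to obtain the global ranking.
    return sorted(ratings.items(), key=lambda kv: kv[1], reverse=True)

For instance, calling elo_ranking with judgments such as ("model_1", "model_2", "a") and ("model_1", "model_3", "tie") over the hypothetical model list ["model_1", "model_2", "model_3"] returns the three models ordered by their final ratings.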