Sample-Efficient Human Evaluation of Large Language Models via Maximum Discrepancy Competition (2404.08008v2)
Abstract: Reliable evaluation of LLMs is impeded by two key challenges: objective metrics often fail to reflect human perception of natural language, and exhaustive human labeling is prohibitively expensive. Here, we propose a sample-efficient human evaluation method for LLMs based on the principle of MAximum Discrepancy (MAD) Competition. Our method automatically and adaptively selects a compact set of input instructions that maximize semantic discrepancy between pairs of LLM responses. Human evaluators then perform three-alternative forced choices on these paired responses, which are aggregated into a global ranking using Elo rating. We apply our approach to compare eight widely used LLMs across four tasks: scientific knowledge understanding, mathematical reasoning, creative and functional writing, and code generation and explanation. Experimental results show that our sample-efficient evaluation method recovers "gold-standard" model rankings with a handful of MAD-selected instructions, reveals respective strengths and weaknesses of each LLM, and offers nuanced insights to guide future LLM development. Code is available at https://github.com/weiji-Feng/MAD-Eval .
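Below is a minimal sketch of the pipeline the abstract describes, assuming cosine distance between sentence embeddings as the semantic-discrepancy measure and a standard Elo update for aggregating the pairwise human judgments. All function names and parameters are illustrative, not the authors' implementation.

```python
# Illustrative sketch (not the authors' code): MAD-style instruction selection
# followed by Elo aggregation of pairwise human judgments.
import itertools
import numpy as np


def semantic_discrepancy(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """Cosine distance between two response embeddings (one possible discrepancy measure)."""
    cos = np.dot(emb_a, emb_b) / (np.linalg.norm(emb_a) * np.linalg.norm(emb_b) + 1e-12)
    return 1.0 - cos


def select_mad_instructions(instructions, responses, embed, k=10):
    """For each model pair, pick the k instructions whose paired responses differ most.

    responses[model][i] is that model's response to instructions[i];
    embed(text) is any sentence encoder returning a fixed-size vector.
    """
    selected = {}
    for m1, m2 in itertools.combinations(responses.keys(), 2):
        scores = [
            semantic_discrepancy(embed(responses[m1][i]), embed(responses[m2][i]))
            for i in range(len(instructions))
        ]
        top = np.argsort(scores)[::-1][:k]  # most-discrepant instructions for this pair
        selected[(m1, m2)] = [instructions[i] for i in top]
    return selected


def elo_update(r_a, r_b, outcome, k_factor=32):
    """Standard Elo update; outcome = 1 (A preferred), 0 (B preferred), 0.5 (tie in the 3AFC)."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))
    r_a_new = r_a + k_factor * (outcome - expected_a)
    r_b_new = r_b + k_factor * ((1 - outcome) - (1 - expected_a))
    return r_a_new, r_b_new


def rank_models(judgments, models, rounds=20):
    """Aggregate pairwise 3AFC judgments [(model_a, model_b, outcome), ...] into a global ranking."""
    ratings = {m: 1000.0 for m in models}
    for _ in range(rounds):  # multiple passes reduce sensitivity to judgment order
        for a, b, outcome in judgments:
            ratings[a], ratings[b] = elo_update(ratings[a], ratings[b], outcome)
    return sorted(ratings.items(), key=lambda kv: kv[1], reverse=True)
```

In use, one would collect each model's responses to a large instruction pool, call `select_mad_instructions` to obtain a compact, maximally discrepant subset per model pair, gather human 3AFC judgments on those pairs, and then pass them to `rank_models` to recover the global ranking.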