Prediction-Powered Ranking of Large Language Models (2402.17826v3)
Abstract: LLMs are often ranked according to their level of alignment with human preferences -- a model is better than other models if its outputs are more frequently preferred by humans. A popular way to elicit human preferences relies on pairwise comparisons between the outputs that different models provide for the same inputs. However, since gathering pairwise comparisons from humans is costly and time-consuming, it has become common practice to gather pairwise comparisons from a strong LLM -- a model strongly aligned with human preferences. Surprisingly, practitioners cannot currently measure the uncertainty that any mismatch between human and model preferences may introduce into the constructed rankings. In this work, we develop a statistical framework to bridge this gap. Given a (small) set of pairwise comparisons by humans and a large set of pairwise comparisons by a model, our framework provides a rank-set -- a set of possible ranking positions -- for each of the models under comparison. Moreover, it guarantees that, with a probability greater than or equal to a user-specified value, the rank-sets asymptotically cover the true ranking consistent with the distribution of human pairwise preferences. Using pairwise comparisons made by humans in the LMSYS Chatbot Arena platform and pairwise comparisons made by three strong LLMs, we empirically demonstrate the effectiveness of our framework and show that the rank-sets constructed using only pairwise comparisons by the strong LLMs are often inconsistent with (the distribution of) human pairwise preferences.
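To make the idea concrete, below is a minimal sketch of how prediction-powered inference can be combined with rank-set construction. It is not the authors' implementation: for simplicity it scores each model with a single preference rate rather than the paper's pairwise preference probabilities, the function names (`ppi_mean_ci`, `rank_sets`) and data layout are hypothetical, and the data is synthetic.

```python
# Minimal sketch, NOT the paper's implementation: per-model "win rates" with
# prediction-powered confidence intervals, turned into rank-sets. The paper works
# with pairwise preference probabilities; here we simplify to one score per model.
# All data below is synthetic and the data layout is a hypothetical assumption.
import numpy as np
from scipy.stats import norm


def ppi_mean_ci(y_human, yhat_judge, yhat_judge_only, alpha):
    """Prediction-powered (1 - alpha) confidence interval for a mean.

    y_human         : human labels on the small, human-annotated comparisons
    yhat_judge      : LLM-judge labels on those same comparisons
    yhat_judge_only : LLM-judge labels on the large, unannotated comparisons
    """
    n, N = len(y_human), len(yhat_judge_only)
    rectifier = y_human - yhat_judge               # corrects the judge's bias
    theta = yhat_judge_only.mean() + rectifier.mean()
    var = yhat_judge_only.var(ddof=1) / N + rectifier.var(ddof=1) / n
    half = norm.ppf(1 - alpha / 2) * np.sqrt(var)
    return theta - half, theta + half


def rank_sets(intervals):
    """Map simultaneous score intervals to a (best rank, worst rank) set per model."""
    K = len(intervals)
    out = []
    for i, (lo_i, hi_i) in enumerate(intervals):
        better = sum(lo_j > hi_i for j, (lo_j, hi_j) in enumerate(intervals) if j != i)
        worse = sum(hi_j < lo_i for j, (lo_j, hi_j) in enumerate(intervals) if j != i)
        out.append((1 + better, K - worse))
    return out


# Toy usage: 3 models, overall miscoverage alpha split across models (union bound).
rng = np.random.default_rng(0)
alpha, true_scores = 0.05, [0.65, 0.60, 0.40]
intervals = []
for p in true_scores:
    y = rng.binomial(1, p, size=300)                      # small human-labeled set
    yhat = np.where(rng.random(300) < 0.9, y, 1 - y)      # judge agrees ~90% of the time
    yhat_only = rng.binomial(1, min(p + 0.05, 1.0), size=10000)  # large judge-only set
    intervals.append(ppi_mean_ci(y, yhat, yhat_only, alpha / len(true_scores)))

print(rank_sets(intervals))  # e.g. [(1, 2), (1, 2), (3, 3)]
```

The sketch illustrates the two ingredients the abstract describes: the small human-labeled set rectifies the bias of the LLM judge, and overlapping confidence intervals translate into rank-sets that contain more than one possible position when models cannot be distinguished at the requested confidence level.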