Prediction-Powered Ranking of Large Language Models (2402.17826v3)

Published 27 Feb 2024 in cs.LG, cs.AI, cs.CL, cs.CY, cs.HC, and stat.ML

Abstract: LLMs are often ranked according to their level of alignment with human preferences -- a model is better than other models if its outputs are more frequently preferred by humans. One of the popular ways to elicit human preferences utilizes pairwise comparisons between the outputs provided by different models to the same inputs. However, since gathering pairwise comparisons by humans is costly and time-consuming, it has become a common practice to gather pairwise comparisons by a strong LLM -- a model strongly aligned with human preferences. Surprisingly, practitioners cannot currently measure the uncertainty that any mismatch between human and model preferences may introduce in the constructed rankings. In this work, we develop a statistical framework to bridge this gap. Given a (small) set of pairwise comparisons by humans and a large set of pairwise comparisons by a model, our framework provides a rank-set -- a set of possible ranking positions -- for each of the models under comparison. Moreover, it guarantees that, with a probability greater than or equal to a user-specified value, the rank-sets cover the true ranking consistent with the distribution of human pairwise preferences asymptotically. Using pairwise comparisons made by humans in the LMSYS Chatbot Arena platform and pairwise comparisons made by three strong LLMs, we empirically demonstrate the effectivity of our framework and show that the rank-sets constructed using only pairwise comparisons by the strong LLMs are often inconsistent with (the distribution of) human pairwise preferences.


Summary

  • The paper introduces a statistical framework that quantifies uncertainty in LLM rankings using confidence ellipsoids and rank-sets from pairwise comparisons.
  • It employs a hybrid approach combining human and model-generated comparisons to calculate unbiased estimates of model preference probabilities.
  • The framework offers practical insights with robust coverage guarantees and an open-source implementation for efficient LLM evaluation.

Uncertainty Quantification in Ranking LLMs with Human and Model Pairwise Comparisons

Introduction

The evaluation of LLMs has conventionally focused on their alignment with human preferences, a paradigm that ranks models according to pairwise comparisons of their outputs. Gathering such comparisons from humans, however, is costly and time-consuming, so it has become common practice to let a strong, well-aligned LLM generate the pairwise comparisons instead. Nevertheless, the reliability of rankings derived from these automated comparisons remains questionable, both because the model's judgements may diverge from human judgement and because of the statistical uncertainty inherent in any finite sample of comparisons. This paper introduces a statistical framework that addresses these concerns by quantifying the uncertainty in model rankings through rank-sets, which capture the ranking positions a model could plausibly occupy given both human and automated pairwise comparisons.

Statistical Framework

The cornerstone of the framework is prediction-powered inference, used to construct a confidence ellipsoid for the true probabilities that each model is preferred; rank-sets are then read off from this ellipsoid. The rank-sets carry a probabilistic guarantee of covering the true model ranking under the distribution of human preferences. The framework's novelty lies in accounting for this uncertainty computationally, without stringent assumptions about the preference distribution or about how well the judge model is aligned with humans, which marks a significant advance in assessing where LLMs stand relative to one another.
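
For intuition, here is a minimal sketch of the prediction-powered estimate in illustrative notation (the symbols below are not taken from the paper). Given $n$ human-labelled comparisons with binary verdicts $y_j$, the strong LLM's verdicts $f(x_j)$ on those same comparisons, and its verdicts $f(\tilde{x}_i)$ on $N$ additional unlabelled comparisons, a model's estimated probability of being preferred is

$$
\hat{\theta}^{\mathrm{PP}} \;=\; \frac{1}{N}\sum_{i=1}^{N} f(\tilde{x}_i) \;+\; \frac{1}{n}\sum_{j=1}^{n}\bigl(y_j - f(x_j)\bigr),
$$

where the first term is the judge-only estimate and the second is a rectifier that corrects the judge's bias using the human labels; a confidence ellipsoid for the vector of such estimates across models is what the rank-sets are derived from.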

Methodology

Combining a small set of human pairwise comparisons with a large set of model-generated ones, the framework first computes an unbiased, prediction-powered estimate of each model's probability of being preferred. These estimates are used to build a confidence ellipsoid that covers the true preference probabilities with a prescribed probability, and the rank-sets are derived from it: for each LLM, the rank-set is obtained by evaluating distances from the ellipsoid to the hyperplanes on which two models' probabilities of being preferred are equal, yielding an enumeration of the ranking positions the model can occupy within statistically grounded confidence bounds.
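
To make the pipeline concrete, the following is a minimal, self-contained Python sketch. It is not the paper's implementation: for simplicity it replaces the confidence ellipsoid and hyperplane construction with conservative simultaneous (Bonferroni-corrected) intervals, and all function and variable names are illustrative.

```python
import numpy as np
from scipy.stats import norm

def pp_estimate(y_human, f_human, f_model):
    """Prediction-powered estimate of one model's preference probability.

    y_human : binary human verdicts (1 = this model won) on a small labelled set
    f_human : strong-LLM verdicts on those same labelled comparisons
    f_model : strong-LLM verdicts on a large set of unlabelled comparisons
    """
    rectifier = np.mean(y_human - f_human)        # corrects the judge's bias with human labels
    theta_hat = np.mean(f_model) + rectifier      # prediction-powered point estimate
    # Variance of the estimate: judge term + rectifier term (independent samples)
    var = (np.var(f_model, ddof=1) / len(f_model)
           + np.var(y_human - f_human, ddof=1) / len(y_human))
    return theta_hat, var

def rank_sets(theta_hat, var, alpha=0.05):
    """Rank-sets from simultaneous confidence intervals (a conservative stand-in
    for the paper's confidence-ellipsoid construction)."""
    m = len(theta_hat)
    z = norm.ppf(1 - alpha / (2 * m))             # Bonferroni-corrected quantile
    lo, hi = theta_hat - z * np.sqrt(var), theta_hat + z * np.sqrt(var)
    sets = []
    for k in range(m):
        surely_above = int(np.sum(lo > hi[k]))    # models that certainly beat model k
        surely_below = int(np.sum(hi < lo[k]))    # models that model k certainly beats
        sets.append((1 + surely_above, m - surely_below))
    return sets

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Crudely simulated data for three models (true win probabilities 0.70, 0.50, 0.30).
    theta, var = [], []
    for p in (0.70, 0.50, 0.30):
        f_model = rng.binomial(1, p, size=10_000)  # judge verdicts on unlabelled prompts
        y_human = rng.binomial(1, p, size=500)     # human verdicts on labelled prompts
        f_human = rng.binomial(1, p, size=500)     # judge verdicts on the same prompts
        t, v = pp_estimate(y_human, f_human, f_model)
        theta.append(t)
        var.append(v)
    print(rank_sets(np.array(theta), np.array(var)))  # one (lowest, highest) rank per model
```

The paper's construction instead works with a joint confidence ellipsoid and the hyperplanes on which two models' preference probabilities coincide, as described above, but the interface is the same: point estimates and their uncertainty in, one rank-set per model out.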

Theoretical Contributions

The formal analysis underpinning the framework shows that the constructed rank-sets enjoy a coverage guarantee: asymptotically, and with probability at least a user-specified level, they jointly contain the true ranking induced by the distribution of human preferences. This guarantee is what lends the rank-sets their credibility and reliability, directly addressing the uncertainty inherent in preference-based model evaluation.
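
In symbols, and again with illustrative rather than the paper's notation, with $\mathcal{R}_k$ the rank-set of model $k$ and $\alpha$ the user-specified error level, the guarantee takes the asymptotic simultaneous-coverage form

$$
\liminf_{n \to \infty}\; \mathbb{P}\bigl(\mathrm{rank}(k) \in \mathcal{R}_k \ \text{for all models } k\bigr) \;\ge\; 1 - \alpha,
$$

where $\mathrm{rank}(k)$ is the true ranking position of model $k$ under the distribution of human pairwise preferences.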

Practical Implications and Future Directions

Adopting this framework can substantially refine LLM evaluation by treating uncertainty systematically rather than ignoring it. Its practicality is reinforced by an open-source implementation, which makes the method accessible in the many settings where ranking LLMs matters. Looking ahead, the paper points to several directions for future work, such as further empirical validation, extending the coverage guarantees to finite-sample settings, and accounting for distribution shifts in the inputs.

Conclusions

The proposed statistical framework enriches the landscape of LLM evaluation by quantifying ranking uncertainty through rank-sets. It balances two objectives: exploiting the efficiency of model-generated comparisons while remaining faithful to human preferences. As LLMs continue to evolve and are applied to an expanding spectrum of tasks, methods that provide nuanced, uncertainty-aware rankings will remain indispensable, underscoring the relevance of this work.