
Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference (2403.04132v1)

Published 7 Mar 2024 in cs.AI and cs.CL

Abstract: LLMs have unlocked new capabilities and applications; however, evaluating the alignment with human preferences still poses significant challenges. To address this issue, we introduce Chatbot Arena, an open platform for evaluating LLMs based on human preferences. Our methodology employs a pairwise comparison approach and leverages input from a diverse user base through crowdsourcing. The platform has been operational for several months, amassing over 240K votes. This paper describes the platform, analyzes the data we have collected so far, and explains the tried-and-true statistical methods we are using for efficient and accurate evaluation and ranking of models. We confirm that the crowdsourced questions are sufficiently diverse and discriminating and that the crowdsourced human votes are in good agreement with those of expert raters. These analyses collectively establish a robust foundation for the credibility of Chatbot Arena. Because of its unique value and openness, Chatbot Arena has emerged as one of the most referenced LLM leaderboards, widely cited by leading LLM developers and companies. Our demo is publicly available at \url{https://chat.lmsys.org}.

An Open Platform for Evaluating LLMs by Human Preference

Introduction

The rapid development of LLMs has posed new challenges in evaluating their performance, particularly concerning alignment with human preferences. Traditional benchmarks, which are often static and limited in diversity, fail to fully capture the nuances of these advanced models. To address this gap, Chatbot Arena provides an open platform for evaluating LLMs based on human preferences. It leverages a pairwise comparison methodology and crowdsourcing to compile over 240K votes from a broad user base. This paper details the platform's design and the statistical machinery underpinning its model rankings, and discusses the implications of this work for the future of LLM evaluation.

Crowdsourced Data Collection

At the core of Chatbot Arena is its crowdsourced, pairwise-comparison approach to data collection: users converse with two anonymous models and vote for the response they prefer. To date, this methodology has amassed over 240K votes across more than 50 models and a diverse set of languages. The platform's design emphasizes diversity in user-generated prompts, ensuring an evaluation that mirrors real-world use cases. A simplified view of the resulting vote records is sketched below.
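
To make the vote data concrete, here is a minimal sketch of a single pairwise comparison record and a naive win tally. The `Battle` schema and `win_counts` helper are illustrative assumptions for this summary, not the platform's actual data format.

```python
from dataclasses import dataclass
from collections import Counter

@dataclass
class Battle:
    """One crowdsourced pairwise comparison (illustrative schema)."""
    prompt: str
    model_a: str
    model_b: str
    winner: str  # "model_a", "model_b", or "tie"

def win_counts(battles):
    """Tally outright wins per model, ignoring ties."""
    counts = Counter()
    for b in battles:
        if b.winner == "model_a":
            counts[b.model_a] += 1
        elif b.winner == "model_b":
            counts[b.model_b] += 1
    return counts

battles = [
    Battle("Explain E-values simply.", "gpt-4", "llama-2-70b-chat", "model_a"),
    Battle("Write a haiku about rain.", "llama-2-70b-chat", "gpt-4", "tie"),
]
print(win_counts(battles))  # Counter({'gpt-4': 1})
```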

Statistical Foundations for Model Evaluation

A suite of statistical tools underlies Chatbot Arena's evaluation process. Drawing on methods ranging from the Bradley-Terry model to E-values, the platform estimates rankings with improved efficiency and accuracy. This methodology not only supports robust model comparison but also enables strategic sampling of model pairs, speeding the convergence of rankings while maintaining statistical validity.
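
As one illustration of the Bradley-Terry component, the sketch below fits per-model strength scores to pairwise outcomes via logistic regression, a standard way to estimate Bradley-Terry coefficients. The toy battle data, the handling of ties (simply dropped), and the default regularization are assumptions of this example rather than the platform's exact implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_bradley_terry(battles, models):
    """Fit Bradley-Terry strengths from (model_a, model_b, winner) tuples
    by logistic regression on win/loss outcomes; ties are dropped here."""
    idx = {m: i for i, m in enumerate(models)}
    rows, outcomes = [], []
    for a, b, winner in battles:
        if winner == "tie":
            continue
        row = np.zeros(len(models))
        row[idx[a]], row[idx[b]] = 1.0, -1.0  # +1 for model A, -1 for model B
        rows.append(row)
        outcomes.append(1 if winner == "model_a" else 0)
    # Default L2 regularization keeps the fit stable on small or sparse data.
    clf = LogisticRegression(fit_intercept=False)
    clf.fit(np.array(rows), np.array(outcomes))
    return dict(zip(models, clf.coef_[0]))  # higher coefficient = stronger model

models = ["gpt-4", "claude-2", "llama-2-70b-chat"]
battles = [
    ("gpt-4", "claude-2", "model_a"),
    ("claude-2", "llama-2-70b-chat", "model_a"),
    ("llama-2-70b-chat", "gpt-4", "model_b"),
    ("claude-2", "gpt-4", "model_b"),
]
print(fit_bradley_terry(battles, models))
```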

Data Analysis and Insights

A thorough analysis of the collected data confirms the platform's capacity to generate diverse and challenging prompts that effectively discriminate between models. Additionally, a comparison against expert ratings reveals a high degree of agreement, validating the reliability of crowdsourced votes. The platform also enables the construction of challenging benchmarks that can accentuate the differences between leading models, further showcasing the effectiveness of Chatbot Arena's approach.
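
For intuition, the crowd-versus-expert check boils down to measuring how often the two groups pick the same winner on shared prompts. The snippet below is a minimal sketch with invented labels, not the paper's evaluation code.

```python
def agreement_rate(crowd_votes, expert_votes):
    """Fraction of prompts on which crowd and expert preferences agree."""
    assert len(crowd_votes) == len(expert_votes)
    matches = sum(c == e for c, e in zip(crowd_votes, expert_votes))
    return matches / len(crowd_votes)

# Invented labels for four shared prompts.
crowd  = ["model_a", "model_b", "model_a", "tie"]
expert = ["model_a", "model_b", "model_b", "tie"]
print(agreement_rate(crowd, expert))  # 0.75
```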

Efficient Ranking Estimation and Anomalous User Detection

Chatbot Arena introduces an adaptive sampling algorithm that significantly improves the efficiency of ranking estimation. In parallel, the paper outlines a method for detecting anomalous user behavior, protecting the integrity of the collected data. Together, these techniques mark a meaningful step forward in the methodology of LLM evaluation.
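
The sketch below shows one simple way to bias which model pair is served next toward under-explored comparisons. It is only a rough stand-in for the paper's adaptive scheme, which weights pairs by how much an additional battle is expected to reduce ranking uncertainty; the model names and counts here are made up.

```python
import random
from collections import Counter
from itertools import combinations

def sample_pair(models, battle_counts, smoothing=1.0):
    """Choose the next (model_a, model_b) pair to show users, favoring
    pairs with few past battles (a rough proxy for uncertainty-driven
    sampling)."""
    pairs = list(combinations(sorted(models), 2))
    # Fewer past battles -> larger weight -> more likely to be sampled.
    weights = [1.0 / (battle_counts[p] + smoothing) for p in pairs]
    return random.choices(pairs, weights=weights, k=1)[0]

models = ["gpt-4", "claude-2", "llama-2-70b-chat", "vicuna-13b"]
counts = Counter({("claude-2", "gpt-4"): 500, ("gpt-4", "vicuna-13b"): 20})
print(sample_pair(models, counts))  # most likely an under-sampled pair
```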

Implications and Forward Look

The establishment of Chatbot Arena as a leading platform for LLM evaluation marks a pivotal advance in the field. It addresses the need for a dynamic, human-centric evaluation mechanism and sets the stage for future developments in AI and machine learning evaluation. As Chatbot Arena evolves, it is set to incorporate more comprehensive features, including topic-specific leaderboards and support for multimodal and agent-based LLMs, promising an even richer evaluation landscape.

Conclusion

In conclusion, Chatbot Arena represents a significant step forward in the methodology of evaluating LLMs, fostering a more dynamic, accurate, and human-aligned approach. By harnessing crowdsourced human preferences and employing rigorous statistical methods, the platform enables a comprehensive and nuanced assessment of LLMs, paving the way for future innovations in AI evaluation.

Authors (11)
  1. Wei-Lin Chiang
  2. Lianmin Zheng
  3. Ying Sheng
  4. Anastasios Nikolas Angelopoulos
  5. Tianle Li
  6. Dacheng Li
  7. Hao Zhang
  8. Banghua Zhu
  9. Michael Jordan
  10. Joseph E. Gonzalez
  11. Ion Stoica