
Ranking Large Language Models without Ground Truth (2402.14860v4)

Published 21 Feb 2024 in cs.CL, cs.AI, and cs.LG

Abstract: Evaluation and ranking of LLMs have become an important problem with the proliferation of these models and their impact. Evaluation methods either require human responses, which are expensive to acquire, or use pairs of LLMs to evaluate each other, which can be unreliable. In this paper, we provide a novel perspective where, given a dataset of prompts (viz. questions, instructions, etc.) and a set of LLMs, we rank them without access to any ground truth or reference responses. Inspired by real life, where both an expert and a knowledgeable person can identify a novice, our main idea is to consider triplets of models, where each one evaluates the other two, correctly identifying the worst model in the triplet with high probability. We also analyze our idea and provide sufficient conditions for it to succeed. Applying this idea repeatedly, we propose two methods to rank LLMs. In experiments on different generative tasks (summarization, multiple-choice, and dialog), our methods reliably recover close to true rankings without reference data. This points to a viable low-resource mechanism for practical use.
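
The abstract describes the mechanism only at a high level: within a triplet, each model judges the other two, the model that loses the most judgments is flagged as the worst, and repeating this yields a ranking. The sketch below is a minimal, hypothetical Python illustration of that idea under assumed interfaces (judge callables, a greedy elimination loop); it is not the paper's actual algorithms, which the abstract does not specify.

```python
# Hypothetical sketch of the triplet-evaluation idea from the abstract.
# Each "model" is a judge callable that, given a prompt and two candidate
# responses, returns the index (0 or 1) of the response it prefers.
# The interfaces and the greedy elimination strategy are illustrative
# assumptions, not the paper's exact methods.
from itertools import combinations
from collections import defaultdict

def worst_in_triplet(models, responses, prompts):
    """Return the local index (0-2) of the model judged worst in a triplet.

    `models` is a list of three judge callables; `responses[m][p]` is model
    m's response to prompt p. Each model compares the other two, and the
    model whose responses lose the most comparisons is flagged as worst.
    """
    losses = defaultdict(int)
    for judge_idx, judge in enumerate(models):
        a, b = [i for i in range(3) if i != judge_idx]
        for p in prompts:
            preferred = judge(p, responses[a][p], responses[b][p])  # 0 -> a, 1 -> b
            loser = b if preferred == 0 else a
            losses[loser] += 1
    return max(losses, key=losses.get)

def rank_by_repeated_triplets(models, responses, prompts):
    """Greedy ranking: repeatedly vote out the worst model across all triplets."""
    remaining = list(range(len(models)))
    ranking_worst_first = []
    while len(remaining) >= 3:
        votes = defaultdict(int)
        for triplet in combinations(remaining, 3):
            local = worst_in_triplet(
                [models[i] for i in triplet],
                {k: responses[i] for k, i in enumerate(triplet)},
                prompts,
            )
            votes[triplet[local]] += 1
        worst = max(votes, key=votes.get)
        ranking_worst_first.append(worst)
        remaining.remove(worst)
    ranking_worst_first.extend(remaining)  # final two are left unordered here
    return list(reversed(ranking_worst_first))  # best model first
```

In practice, each judge callable would wrap an LLM call that compares two responses to a prompt; the sketch only captures the triplet-voting structure, not how the paper resolves the final two models or its second ranking method.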

Authors (5)
  1. Amit Dhurandhar (62 papers)
  2. Rahul Nair (26 papers)
  3. Moninder Singh (17 papers)
  4. Elizabeth Daly (16 papers)
  5. Karthikeyan Natesan Ramamurthy (68 papers)
Citations (2)