tinyBenchmarks: evaluating LLMs with fewer examples (2402.14992v2)

Published 22 Feb 2024 in cs.CL, cs.AI, cs.LG, and stat.ML

Abstract: The versatility of LLMs led to the creation of diverse benchmarks that thoroughly test a variety of LLMs' abilities. These benchmarks consist of tens of thousands of examples making evaluation of LLMs very expensive. In this paper, we investigate strategies to reduce the number of evaluations needed to assess the performance of an LLM on several key benchmarks. For example, we show that to accurately estimate the performance of an LLM on MMLU, a popular multiple-choice QA benchmark consisting of 14K examples, it is sufficient to evaluate this LLM on 100 curated examples. We release evaluation tools and tiny versions of popular benchmarks: Open LLM Leaderboard, MMLU, HELM, and AlpacaEval 2.0. Our empirical analysis demonstrates that these tools and tiny benchmarks are sufficient to reliably and efficiently reproduce the original evaluation results.

Efficient Evaluation of LLMs Using tinyBenchmarks

Introduction to Efficient Benchmarking

The evaluation of LLMs on comprehensive benchmarks has become a cornerstone for measuring advancements in the field of NLP. However, the extensive computational, environmental, and financial costs associated with these evaluations have ignited a search for more efficient methodologies. This paper introduces tinyBenchmarks, an approach that significantly reduces the number of examples needed to accurately estimate LLM performance across various key benchmarks. By curating a subset of 100 examples, this method achieves an average estimation error under 2%, effectively addressing the challenge of resource-intensive evaluation processes.
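
To put the savings in perspective, here is a small illustrative sketch (not from the paper): it contrasts a model's full-benchmark accuracy with a naive estimate from a random 100-example subset. The synthetic correctness vector and the 65% base accuracy are assumptions made purely for illustration; the point is that plain random subsampling at this size carries a binomial standard error of roughly 4-5 points, which is why curated, IRT-informed subsets are needed to reach roughly 2%.

```python
# A minimal sketch (not the paper's code): estimating a model's benchmark
# accuracy from a small random subset, to show why naive subsampling alone
# is noisy and why curated example selection helps.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-example correctness (1 = right, 0 = wrong) for one LLM on a
# 14K-example benchmark; in practice this comes from running the full benchmark.
full_scores = rng.binomial(1, 0.65, size=14_000)
true_accuracy = full_scores.mean()

# Naive estimate from a random subset of 100 examples.
subset = rng.choice(full_scores, size=100, replace=False)
estimate = subset.mean()

# Binomial standard error of a 100-example estimate: sqrt(p * (1 - p) / n),
# roughly 4-5 points here, noticeably larger than the ~2-point average error
# reported for the curated tinyBenchmarks subsets.
std_err = np.sqrt(estimate * (1 - estimate) / len(subset))
print(f"full: {true_accuracy:.3f}  subset estimate: {estimate:.3f}  +/- {std_err:.3f}")
```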

The Problem of Costly Evaluations

Evaluating LLMs involves testing models on numerous examples to assess their abilities comprehensively. Popular benchmarks, including MMLU, the Open LLM Leaderboard, HELM, and AlpacaEval 2.0, contain thousands to tens of thousands of examples. The detailed picture these benchmarks provide comes at a high cost: evaluating a single model can require thousands of GPU hours, or substantial spending when commercial models are used as part of the evaluation pipeline (for example, as judges).

Evaluation Strategies and Empirical Analysis

The research investigates three primary strategies for reducing the number of evaluation examples without compromising the reliability of performance estimation:

  • Stratified Random Sampling, the simplest approach, though it can result in larger estimation errors.
  • Clustering Based on Correctness Patterns, which performs well in some contexts but can be unreliable due to potential spurious correctness patterns, particularly with domain-specific LLMs.
  • Item Response Theory (IRT) Based Evaluation, which adapts methodology from standardized testing to identify robust anchor sets and to build tools that estimate performance accurately from any subset of examples (see the sketch after this list).

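Below is a minimal sketch of the IRT idea, under simplifying assumptions: a two-parameter logistic (2PL) model, item parameters (discrimination a_j, difficulty b_j) already calibrated on previously evaluated LLMs, synthetic responses for the new model, and a plain grid-search maximum-likelihood fit of the ability parameter. The paper's released estimators are more refined; this only shows how an ability estimated from roughly 100 anchor items can be turned into a prediction of full-benchmark accuracy.

```python
# A minimal sketch of IRT-based performance estimation. The item parameters
# and the new model's responses below are synthetic stand-ins; the released
# tinyBenchmarks tools implement a more refined estimator.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(1)

# Hypothetical calibrated 2PL item parameters for a 14K-example benchmark:
# a_j = discrimination, b_j = difficulty (assumed fit on prior LLMs' responses).
a = rng.lognormal(0.0, 0.3, size=14_000)
b = rng.normal(0.0, 1.0, size=14_000)

# The ~100 curated anchor items and the new LLM's observed correctness on them
# (simulated here from a "true" ability of 0.4).
anchor_idx = rng.choice(14_000, size=100, replace=False)
true_theta = 0.4
y_anchor = rng.binomial(1, sigmoid(a[anchor_idx] * (true_theta - b[anchor_idx])))

# Maximum-likelihood estimate of the LLM's ability theta from the anchors
# (simple grid search to keep the sketch dependency-free).
thetas = np.linspace(-4, 4, 801)
p = sigmoid(a[anchor_idx, None] * (thetas[None, :] - b[anchor_idx, None]))
loglik = (y_anchor[:, None] * np.log(p) + (1 - y_anchor[:, None]) * np.log(1 - p)).sum(axis=0)
theta_hat = thetas[np.argmax(loglik)]

# Predicted accuracy on the *full* benchmark: average predicted probability
# of a correct answer over all 14K items, none of which were run.
predicted_accuracy = sigmoid(a * (theta_hat - b)).mean()
print(f"theta_hat={theta_hat:.2f}  predicted full-benchmark accuracy={predicted_accuracy:.3f}")
```

The design point worth noting is that the anchor items' calibrated parameters, learned from other models' responses, carry information about the thousands of items the new model never sees.
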
The empirical analysis demonstrates the advantage of the IRT-based approach, which predicts LLM performance on all considered benchmarks from a small number of examples. The tiny benchmark versions released alongside the IRT-based tools underscore the practical applicability of these findings.

Theoretical and Practical Implications

The paper substantiates the potential of IRT methods in streamlining LLM evaluations, supporting the practical utility of tinyBenchmarks. This efficient evaluation facilitates more frequent testing across development cycles, especially during fine-tuning and prompt engineering, thereby expediting the iterative process of model improvement. Furthermore, the research proposes extensions to prompt evaluation and adaptive testing, indicating directions for future advancements in efficient LLM benchmarking strategies.
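
As a hypothetical illustration of that workflow, the sketch below uses a tiny benchmark to gate checkpoint selection during fine-tuning. The helper names `score_on_anchors` and `estimate_full_accuracy` are placeholders for user-supplied functions (the latter could be an IRT estimator like the one sketched above); they are not the released tinyBenchmarks API.

```python
# Hypothetical workflow sketch: pick the best fine-tuning checkpoint using
# only the ~100 curated anchor examples per evaluation round.
from typing import Callable, List

def select_best_checkpoint(
    checkpoints: List[str],
    score_on_anchors: Callable[[str], List[int]],        # placeholder: run one checkpoint on the anchor examples
    estimate_full_accuracy: Callable[[List[int]], float],  # placeholder: e.g. an IRT-based estimator
) -> str:
    """Return the checkpoint with the highest estimated full-benchmark accuracy."""
    best_ckpt, best_score = checkpoints[0], float("-inf")
    for ckpt in checkpoints:
        correctness = score_on_anchors(ckpt)   # ~100 model calls instead of ~14K
        estimate = estimate_full_accuracy(correctness)
        if estimate > best_score:
            best_ckpt, best_score = ckpt, estimate
    return best_ckpt
```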

Limitations and Future Directions

While tinyBenchmarks significantly mitigate evaluation costs, the approach faces challenges in scenarios involving severe distribution shifts, such as rapid advancements in model capabilities or significant changes in model architectures. To counteract these limitations, periodic updates to the example set and IRT model recalibrations are recommended.

Conclusion

This paper presents a significant step forward in the efficient evaluation of LLMs, offering the NLP research community a method to reduce the computational and financial burdens of benchmark testing. The release of tinyBenchmarks and related tools paves the way for more sustainable and frequent evaluations, contributing to the accelerated pace of innovation in LLM development.

Authors (6)
  1. Felipe Maia Polo
  2. Lucas Weber
  3. Leshem Choshen
  4. Yuekai Sun
  5. Gongjun Xu
  6. Mikhail Yurochkin