Efficient multi-prompt evaluation of LLMs (2405.17202v3)

Published 27 May 2024 in cs.CL, cs.AI, cs.LG, and stat.ML

Abstract: Most popular benchmarks for comparing LLMs rely on a limited set of prompt templates, which may not fully capture the LLMs' abilities and can affect the reproducibility of results on leaderboards. Many recent works empirically verify prompt sensitivity and advocate for changes in LLM evaluation. In this paper, we consider the problem of estimating the performance distribution across many prompt variants instead of finding a single prompt to evaluate with. We introduce PromptEval, a method for estimating performance across a large set of prompts that borrows strength across prompts and examples to produce accurate estimates under practical evaluation budgets. The resulting distribution can be used to obtain performance quantiles and construct robust performance metrics (e.g., the 95th quantile or the median). We prove that PromptEval consistently estimates the performance distribution and demonstrate its efficacy empirically on three prominent LLM benchmarks: MMLU, BIG-bench Hard, and LMentry; for example, PromptEval can accurately estimate performance quantiles across 100 prompt templates on MMLU with a budget equivalent to two single-prompt evaluations. Moreover, we show how PromptEval can be useful in LLM-as-a-judge and best-prompt identification applications.
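
The abstract does not spell out the estimator, but the setup can be pictured as a matrix-completion problem: treat the prompt-by-example correctness matrix as mostly unobserved, spend a small evaluation budget on a subset of (prompt, example) pairs, fit a model that borrows strength across rows and columns, and read performance quantiles off the completed matrix. The Python sketch below uses a Rasch-style logistic model for this; the choice of model, the budget allocation, and the names fit_rasch and performance_quantiles are my own illustrative assumptions, not the paper's PromptEval implementation.

# Minimal sketch (assumptions, not the paper's method): estimate the full
# prompt-by-example performance matrix from a sparse, budgeted sample by
# fitting a Rasch-style logistic model, then read off performance quantiles.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fit_rasch(Y, mask, lr=0.1, steps=2000, reg=1e-3):
    """Fit P(correct | prompt i, example j) = sigmoid(theta_i - beta_j)
    by penalized maximum likelihood on the observed entries only.

    Y    : (P, E) 0/1 matrix of graded outputs (unobserved entries are 0)
    mask : (P, E) boolean matrix, True where an evaluation was actually run
    """
    P, E = Y.shape
    theta = np.zeros(P)   # prompt-template "ability" parameters
    beta = np.zeros(E)    # example "difficulty" parameters
    for _ in range(steps):
        p = sigmoid(theta[:, None] - beta[None, :])
        resid = (Y - p) * mask  # Bernoulli log-likelihood gradient, observed cells only
        theta += lr * (resid.sum(axis=1) / np.maximum(mask.sum(axis=1), 1) - reg * theta)
        beta -= lr * (resid.sum(axis=0) / np.maximum(mask.sum(axis=0), 1) + reg * beta)
    return theta, beta

def performance_quantiles(theta, beta, qs=(0.05, 0.5, 0.95)):
    """Predicted per-prompt accuracy over all examples, plus its quantiles
    across prompt templates (e.g., the median and the 5th/95th percentiles)."""
    acc = sigmoid(theta[:, None] - beta[None, :]).mean(axis=1)  # (P,) per-prompt accuracy
    return acc, np.quantile(acc, qs)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # 100 templates, 500 examples, budget roughly equal to two single-prompt evaluations
    P, E, budget = 100, 500, 2 * 500
    true_theta = rng.normal(0.5, 0.7, P)
    true_beta = rng.normal(0.0, 1.0, E)
    Y_full = (rng.random((P, E)) < sigmoid(true_theta[:, None] - true_beta[None, :])).astype(float)

    mask = np.zeros((P, E), dtype=bool)  # spend the budget on random (prompt, example) pairs
    idx = rng.choice(P * E, size=budget, replace=False)
    mask.flat[idx] = True

    theta_hat, beta_hat = fit_rasch(Y_full * mask, mask)
    acc_hat, q_hat = performance_quantiles(theta_hat, beta_hat)
    print("estimated [5%, median, 95%] accuracy across prompts:", np.round(q_hat, 3))

In the abstract's MMLU example, the budget covers only a small fraction of the prompt-by-example matrix, which is exactly the regime where sharing parameters across prompts (rows) and examples (columns) is needed to recover the quantiles of the performance distribution.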
