
State of What Art? A Call for Multi-Prompt LLM Evaluation (2401.00595v3)

Published 31 Dec 2023 in cs.CL

Abstract: Recent advances in LLMs have led to the development of various evaluation benchmarks. These benchmarks typically rely on a single instruction template for evaluating all LLMs on a specific task. In this paper, we comprehensively analyze the brittleness of results obtained via single-prompt evaluations across 6.5M instances, involving 20 different LLMs and 39 tasks from 3 benchmarks. To improve robustness of the analysis, we propose to evaluate LLMs with a set of diverse prompts instead. We discuss tailored evaluation metrics for specific use cases (e.g., LLM developers vs. developers interested in a specific downstream task), ensuring a more reliable and meaningful assessment of LLM capabilities. We then implement these criteria and conduct evaluations of multiple models, providing insights into the true strengths and limitations of current LLMs.

Introduction to LLM Evaluation

Evaluating LLMs has become a critical component of understanding their capabilities and limitations. Typically, benchmarks use a single instruction template for assessing an LLM's performance on various tasks. However, given the diversity of natural language, even semantically equivalent instructions can alter an LLM's response, raising concerns about the reliability of these evaluations.

Delving into Multi-Prompt Evaluation

A new paper explores the impact of paraphrasing instructions on the performance of LLMs. The researchers evaluated a staggering 6.5 million instances involving 20 distinct LLMs across 39 tasks from three established benchmarks. Their empirical evidence shows that model performance can change significantly with different paraphrases, challenging previous conclusions drawn from single-prompt evaluations.

The paper introduces a dataset of over 5,000 varied instruction paraphrases, compiled through a mix of automated methods and manual validation. This new approach aims to measure LLM performance more holistically by considering how models respond to a range of possible instructions for each task, rather than to a single, fixed prompt.
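
To make the proposed setup concrete, here is a minimal sketch of a multi-prompt evaluation loop in Python. The `model_generate` call, the task-specific `is_correct` scorer, and the example format with "input" and "target" fields are hypothetical stand-ins, not interfaces defined in the paper.

```python
from statistics import mean

def evaluate_multi_prompt(model_generate, is_correct, paraphrases, examples):
    """Score a model on every instruction paraphrase of a single task.

    Returns a dict mapping each paraphrase to its accuracy over `examples`,
    so downstream metrics can aggregate over prompts rather than relying on
    one fixed template.
    """
    scores = {}
    for template in paraphrases:
        hits = []
        for ex in examples:
            # Fill the instruction template with the task input.
            prompt = template.format(input=ex["input"])
            prediction = model_generate(prompt)
            hits.append(is_correct(prediction, ex["target"]))
        scores[template] = mean(hits)
    return scores
```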

Rethinking Evaluation Metrics

The authors argue that a one-size-fits-all evaluation metric is insufficient to capture the nuanced ways in which LLMs can be used. They propose several targeted metrics for LLM evaluation reflecting different use cases—whether it's an LLM developer aiming for robust performance across multiple prompts or a product team seeking the best LLM for a specific downstream task.

The authors suggest four metrics tailored to these use cases: a maximum-performance metric capturing the best achievable result for a specific application, an average-performance metric for assessing robustness across prompts, a saturation metric for open-ended applications, and a combined performance score that accounts for both peak capability and consistency.
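
The sketch below shows one way to aggregate the per-prompt accuracies from the previous snippet into such metrics. The maximum and average follow directly from their names; the combined score shown here, the mean penalized by the spread across paraphrases, is an illustrative assumption rather than the paper's exact formulation, and the saturation metric is omitted because its definition is not reproduced here.

```python
from statistics import mean, pstdev

def aggregate_metrics(per_prompt_scores):
    """Aggregate per-prompt accuracies into use-case-oriented metrics."""
    vals = list(per_prompt_scores.values())
    return {
        # Best prompt: relevant when tuning for one downstream application.
        "max_performance": max(vals),
        # Mean over paraphrases: a proxy for robustness to instruction wording.
        "average_performance": mean(vals),
        # Assumed combined score: rewards a high average, penalizes variance.
        "combined_score": mean(vals) - pstdev(vals),
    }
```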

Real-World Implications

Applying these metrics, the researchers found that both absolute performance and model rankings can shift, at times diverging from the conventional results obtained with single-prompt evaluations. This observation suggests that the choice of evaluation metric should be aligned with the specific goals and practical needs of the end users.
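
As a toy illustration of how rankings can diverge, the snippet below compares a single-prompt score with a multi-prompt average using Kendall's tau as the agreement measure. The numbers and the choice of statistic are assumptions for illustration only, not results or methodology taken from the paper.

```python
from scipy.stats import kendalltau

# Toy scores: one fixed prompt vs. the average over many paraphrases.
single_prompt = {"model_a": 0.71, "model_b": 0.66, "model_c": 0.64}
multi_prompt_avg = {"model_a": 0.62, "model_b": 0.65, "model_c": 0.60}

models = sorted(single_prompt)
tau, _ = kendalltau(
    [single_prompt[m] for m in models],
    [multi_prompt_avg[m] for m in models],
)
print(f"Ranking agreement (Kendall tau): {tau:.2f}")
```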

Developers and researchers may now have to consider multi-prompt evaluations to better understand LLM strengths and weaknesses. Such a comprehensive assessment could inform decisions on model selection for different applications, offering a more nuanced and reliable basis for integration into real-world systems.

Conclusion

The paper presents a strong case for multi-prompt evaluation as a more robust and meaningful way to assess LLM performance. Acknowledging the variability inherent in natural language instructions, it encourages the adoption of evaluation practices that recognize this diversity, thereby aligning LLM assessment with real-world usage and needs. As the field progresses, this approach may prove crucial for accurately gauging model strengths and for deploying LLMs effectively across a broad range of tasks and applications.

Authors (6)
  1. Moran Mizrahi (3 papers)
  2. Guy Kaplan (3 papers)
  3. Dan Malkin (5 papers)
  4. Rotem Dror (14 papers)
  5. Dafna Shahaf (33 papers)
  6. Gabriel Stanovsky (61 papers)
Citations (84)