PromptBench: A Unified Library for Evaluation of Large Language Models (2312.07910v3)

Published 13 Dec 2023 in cs.AI, cs.CL, and cs.LG

Abstract: The evaluation of LLMs is crucial to assess their performance and mitigate potential security risks. In this paper, we introduce PromptBench, a unified library to evaluate LLMs. It consists of several key components that are easily used and extended by researchers: prompt construction, prompt engineering, dataset and model loading, adversarial prompt attack, dynamic evaluation protocols, and analysis tools. PromptBench is designed to be an open, general, and flexible codebase for research purposes that can facilitate original study in creating new benchmarks, deploying downstream applications, and designing new evaluation protocols. The code is available at: https://github.com/microsoft/promptbench and will be continuously supported.

Authors (5)
  1. Kaijie Zhu (19 papers)
  2. Qinlin Zhao (5 papers)
  3. Hao Chen (1006 papers)
  4. Jindong Wang (150 papers)
  5. Xing Xie (220 papers)
Citations (10)

Summary

Overview

The development and deployment of LLMs have profound implications across many sectors. Rigorous evaluation of these models is essential for understanding their capabilities, mitigating potential risks, and realizing their benefits. PromptBench is a unified codebase designed to support comprehensive, research-oriented evaluation of LLMs.

Key Features and Components

PromptBench is a modular Python library offering a broad array of tools and components that address diverse aspects of LLM evaluation. Key elements include the following (a short usage sketch follows the list):

  • Wide Range of Models and Datasets: Support for a variety of LLMs and datasets covering tasks such as sentiment analysis and duplicate sentence detection.
  • Prompts and Prompt Engineering: Provision of different prompt types and a module for integrating innovative prompt engineering methods.
  • Adversarial Prompt Attacks: Integration of attacks to assess model robustness, critical for understanding model performance under real-world conditions.
  • Dynamic Evaluation Protocols: Support for both standard and dynamic protocols; the latter generate test data on the fly, helping evaluations avoid data contamination.
  • Analysis Tools: Tools for interpreting and analyzing the outputs and performance of LLMs, essential for thorough benchmarking and evaluation.
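To make these components concrete, here is a minimal sketch of how datasets, models, and prompts are exposed through the library's top-level API. The identifiers (`pb.SUPPORTED_DATASETS`, `pb.SUPPORTED_MODELS`, `pb.Prompt`) follow the examples in the public repository, but exact names and arguments may differ across releases, so treat this as indicative rather than authoritative.

```python
import promptbench as pb

# Discover the datasets and models bundled with the library (attribute
# names follow the repository's quick-start examples and may change
# between versions).
print(pb.SUPPORTED_DATASETS)
print(pb.SUPPORTED_MODELS)

# Prompts are plain templates; {content} is filled with each example's text.
prompts = pb.Prompt([
    "Classify the sentence as positive or negative: {content}",
    "Determine the sentiment of the following sentence as positive or negative: {content}",
])
```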

Evaluation Pipeline Construction

PromptBench allows researchers to build an evaluation pipeline in four straightforward steps (a minimal end-to-end sketch follows the list):

  1. Loading the desired dataset through a streamlined API.
  2. Customizing LLMs for inference using a unified interface compatible with popular frameworks.
  3. Selecting or crafting prompts specific to the task and dataset at hand.
  4. Defining input/output processing functions and selecting appropriate evaluation metrics.
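A hedged end-to-end sketch of these four steps is shown below. It mirrors the quick-start example in the repository, but helper names such as `pb.DatasetLoader.load_dataset`, `pb.LLMModel`, `pb.InputProcess.basic_format`, `pb.OutputProcess.cls`, and `pb.Eval.compute_cls_accuracy`, as well as the dataset field names (`content`, `label`), are assumptions that should be checked against the installed version.

```python
import promptbench as pb

# Steps 1-3: load a dataset, a model, and a task-specific prompt template.
dataset = pb.DatasetLoader.load_dataset("sst2")
model = pb.LLMModel(model="google/flan-t5-large",
                    max_new_tokens=10, temperature=0.0001)
prompt = "Classify the sentence as positive or negative: {content}"

# Step 4: project the model's text answer onto SST-2's integer labels.
def proj_func(pred: str) -> int:
    return {"positive": 1, "negative": 0}.get(pred.strip().lower(), -1)

preds, labels = [], []
for data in dataset:
    input_text = pb.InputProcess.basic_format(prompt, data)  # fill the template
    raw_pred = model(input_text)                              # run inference
    preds.append(pb.OutputProcess.cls(raw_pred, proj_func))   # parse the answer
    labels.append(data["label"])

# Score the prompt with a classification accuracy metric.
print(f"accuracy: {pb.Eval.compute_cls_accuracy(preds, labels):.3f}")
```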

Research and Development Support

Tailored for the research community, PromptBench can be adapted and extended to fit various research topics in LLM evaluation. It covers several research directions, including benchmarks, evaluation scenarios, and protocols, with scope for expansion into areas such as bias and agent-based studies. Researchers are given a platform to compare results and contribute new findings, fostering collaboration in the field.

Conclusion and Future Directions

PromptBench aims to serve as a starting point for more comprehensively assessing the true capabilities and limits of LLMs. As an actively supported project, it invites contributions to evolve and keep pace with the rapidly progressing domain of AI and LLMs. The tool facilitates the exploration and design of more robust and human-aligned LLMs, ultimately contributing to advancements in the field.
