PromptBench: A Unified Library for Evaluation of Large Language Models (2312.07910v3)
Abstract: The evaluation of LLMs is crucial for assessing their performance and mitigating potential security risks. In this paper, we introduce PromptBench, a unified library for evaluating LLMs. It consists of several key components that researchers can easily use and extend: prompt construction, prompt engineering, dataset and model loading, adversarial prompt attacks, dynamic evaluation protocols, and analysis tools. PromptBench is designed as an open, general, and flexible codebase for research purposes, facilitating original work on creating new benchmarks, deploying downstream applications, and designing new evaluation protocols. The code is available at https://github.com/microsoft/promptbench and will be continuously supported.
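To make the component list concrete, the sketch below shows how a unified evaluation pipeline of the kind the abstract describes fits together: prompt construction from a template, dataset iteration, a model call, and metric computation. This is an illustrative toy, not PromptBench's actual API; all names here (`build_prompt`, `stub_model`, `evaluate`) are hypothetical, and the stub model stands in for a real LLM call.

```python
# Illustrative sketch of a unified LLM-evaluation pipeline in the spirit of
# PromptBench's components. Names are hypothetical, NOT the library's real API.

def build_prompt(template: str, example: dict) -> str:
    """Prompt construction: fill a template with one dataset example."""
    return template.format(**example)

def stub_model(prompt: str) -> str:
    """Stand-in for an LLM call; a real pipeline would query a model here."""
    return "positive" if "good" in prompt else "negative"

def evaluate(dataset, template, model) -> float:
    """Run the model over every example and return accuracy."""
    correct = 0
    for example in dataset:
        pred = model(build_prompt(template, example)).strip().lower()
        correct += pred == example["label"]
    return correct / len(dataset)

# A two-example toy dataset in the style of sentiment classification (SST-2).
dataset = [
    {"content": "a good movie", "label": "positive"},
    {"content": "a dull plot", "label": "negative"},
]
template = "Classify the sentence as positive or negative: {content}"
print(evaluate(dataset, template, stub_model))  # 1.0
```

Each component (prompt templates, datasets, models, metrics) is a separate, swappable piece, which is what lets a library like this support new benchmarks and evaluation protocols without changing the pipeline itself.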
Authors: Kaijie Zhu, Qinlin Zhao, Hao Chen, Jindong Wang, Xing Xie