PromptBench: A Unified Library for Evaluation of Large Language Models (2312.07910v3)

Published 13 Dec 2023 in cs.AI, cs.CL, and cs.LG

Abstract: The evaluation of LLMs is crucial to assess their performance and mitigate potential security risks. In this paper, we introduce PromptBench, a unified library to evaluate LLMs. It consists of several key components that are easily used and extended by researchers: prompt construction, prompt engineering, dataset and model loading, adversarial prompt attack, dynamic evaluation protocols, and analysis tools. PromptBench is designed to be an open, general, and flexible codebase for research purposes that can facilitate original study in creating new benchmarks, deploying downstream applications, and designing new evaluation protocols. The code is available at: https://github.com/microsoft/promptbench and will be continuously supported.

Authors (5)
  1. Kaijie Zhu (19 papers)
  2. Qinlin Zhao (5 papers)
  3. Hao Chen (1006 papers)
  4. Jindong Wang (150 papers)
  5. Xing Xie (220 papers)
Citations (10)

Summary

Overview

The development and deployment of LLMs have profound implications across many sectors. Rigorous evaluation of these models is essential for understanding their capabilities, mitigating potential risks, and realizing their benefits. PromptBench is a unified codebase designed to support comprehensive, research-oriented evaluation of LLMs.

Key Features and Components

PromptBench is a modular Python library offering a broad array of tools and components that address diverse aspects of LLM evaluation. Key elements include the following (a short usage sketch follows the list):

  • Wide Range of Models and Datasets: Support for a variety of LLMs and datasets covering tasks such as sentiment analysis and duplicate sentence detection.
  • Prompts and Prompt Engineering: Provision of different prompt types and a module for integrating innovative prompt engineering methods.
  • Adversarial Prompt Attacks: Integration of attacks to assess model robustness, critical for understanding model performance under real-world conditions.
  • Dynamic Evaluation Protocols: Support for both standard and dynamic protocols; the latter generate test data on the fly, helping evaluations avoid data contamination.
  • Analysis Tools: Tools for interpreting and analyzing the outputs and performance of LLMs, essential for thorough benchmarking and evaluation.
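To make these components concrete, here is a minimal sketch of how datasets, models, and prompts are exposed through the library's top-level API. The identifiers (`pb.SUPPORTED_DATASETS`, `pb.SUPPORTED_MODELS`, `pb.Prompt`) follow the examples in the public repository, but exact names and arguments may differ across releases, so treat this as indicative rather than authoritative.

```python
import promptbench as pb

# Discover the datasets and models bundled with the library (attribute
# names follow the repository's quick-start examples and may change
# between versions).
print(pb.SUPPORTED_DATASETS)
print(pb.SUPPORTED_MODELS)

# Prompts are plain templates; {content} is filled with each example's text.
prompts = pb.Prompt([
    "Classify the sentence as positive or negative: {content}",
    "Determine the sentiment of the following sentence as positive or negative: {content}",
])
```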

Evaluation Pipeline Construction

PromptBench allows researchers to build an evaluation pipeline in four straightforward steps (a minimal end-to-end sketch follows the list):

  1. Loading the desired dataset through a streamlined API.
  2. Customizing LLMs for inference using a unified interface compatible with popular frameworks.
  3. Selecting or crafting prompts specific to the task and dataset at hand.
  4. Defining input/output processing functions and selecting appropriate evaluation metrics.
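A hedged end-to-end sketch of these four steps is shown below. It mirrors the quick-start example in the repository, but helper names such as `pb.DatasetLoader.load_dataset`, `pb.LLMModel`, `pb.InputProcess.basic_format`, `pb.OutputProcess.cls`, and `pb.Eval.compute_cls_accuracy`, as well as the dataset field names (`content`, `label`), are assumptions that should be checked against the installed version.

```python
import promptbench as pb

# Steps 1-3: load a dataset, a model, and a task-specific prompt template.
dataset = pb.DatasetLoader.load_dataset("sst2")
model = pb.LLMModel(model="google/flan-t5-large",
                    max_new_tokens=10, temperature=0.0001)
prompt = "Classify the sentence as positive or negative: {content}"

# Step 4: project the model's text answer onto SST-2's integer labels.
def proj_func(pred: str) -> int:
    return {"positive": 1, "negative": 0}.get(pred.strip().lower(), -1)

preds, labels = [], []
for data in dataset:
    input_text = pb.InputProcess.basic_format(prompt, data)  # fill the template
    raw_pred = model(input_text)                              # run inference
    preds.append(pb.OutputProcess.cls(raw_pred, proj_func))   # parse the answer
    labels.append(data["label"])

# Score the prompt with a classification accuracy metric.
print(f"accuracy: {pb.Eval.compute_cls_accuracy(preds, labels):.3f}")
```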

Research and Development Support

Tailored for the research community, PromptBench can be adapted and extended to fit various research topics in LLM evaluation. It covers several research directions, including benchmarks, evaluation scenarios, and protocols, with scope for expansion into areas such as bias and agent-based studies. Researchers are given a platform to compare results and contribute new findings, fostering collaboration in the field.

Conclusion and Future Directions

PromptBench aims to serve as a starting point for more comprehensively assessing the true capabilities and limits of LLMs. As an actively supported project, it invites contributions to evolve and keep pace with the rapidly progressing domain of AI and LLMs. The tool facilitates the exploration and design of more robust and human-aligned LLMs, ultimately contributing to advancements in the field.
