OR-Bench: An Over-Refusal Benchmark for Large Language Models (2405.20947v2)

Published 31 May 2024 in cs.CL and cs.AI

Abstract: LLMs require careful safety alignment to prevent malicious outputs. While significant research focuses on mitigating harmful content generation, the enhanced safety often comes with the side effect of over-refusal, where LLMs may reject innocuous prompts and become less helpful. Although the issue of over-refusal has been empirically observed, a systematic measurement is challenging due to the difficulty of crafting prompts that appear harmful but are benign. This study proposes a novel method for automatically generating large-scale sets of "seemingly toxic prompts" (benign prompts likely rejected by LLMs). Leveraging this technique, we introduce OR-Bench, the first large-scale over-refusal benchmark. OR-Bench comprises 80,000 seemingly toxic prompts across 10 common rejection categories, a subset of around 1,000 hard prompts that are challenging even for state-of-the-art LLMs, and an additional 600 toxic prompts to prevent indiscriminate responses. We then conduct a comprehensive study to measure the over-refusal of 25 popular LLMs across 8 model families. Our datasets are available at https://huggingface.co/datasets/bench-LLM/or-bench and the demo can be found at https://huggingface.co/spaces/bench-LLM/or-bench. We hope this benchmark can help the community develop better safety-aligned models.
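
For readers who want to inspect the benchmark directly, the sketch below shows one way to load the released splits from the Hugging Face Hub with the `datasets` library. The subset (config) names and field names are assumptions inferred from the abstract, not confirmed by the paper; consult the dataset card at the URL above for the exact identifiers.

```python
# Minimal sketch of loading OR-Bench from the Hugging Face Hub.
# Assumed subset names ("or-bench-80k", "or-bench-hard-1k", "or-bench-toxic")
# mirror the three splits described in the abstract; verify against the
# dataset card at https://huggingface.co/datasets/bench-LLM/or-bench.
from datasets import load_dataset

# ~80,000 seemingly toxic (benign but likely-refused) prompts across 10 categories.
or_bench_80k = load_dataset("bench-LLM/or-bench", "or-bench-80k")

# ~1,000 hard prompts that still trigger refusals in state-of-the-art LLMs.
or_bench_hard = load_dataset("bench-LLM/or-bench", "or-bench-hard-1k")

# ~600 genuinely toxic prompts, included to catch models that answer everything.
or_bench_toxic = load_dataset("bench-LLM/or-bench", "or-bench-toxic")

# Field names below ("prompt", "category") are assumptions for illustration.
example = or_bench_80k["train"][0]
print(example["prompt"], "->", example["category"])
```

A typical evaluation loop would then send each prompt to the model under test and count refusals on the seemingly toxic sets while checking that the toxic set is still rejected.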

Authors (4)
  1. Justin Cui (9 papers)
  2. Wei-Lin Chiang (19 papers)
  3. Ion Stoica (177 papers)
  4. Cho-Jui Hsieh (211 papers)
Citations (13)