OR-Bench: An Over-Refusal Benchmark for Large Language Models (2405.20947v2)

Published 31 May 2024 in cs.CL and cs.AI

Abstract: LLMs require careful safety alignment to prevent malicious outputs. While significant research focuses on mitigating harmful content generation, the enhanced safety often comes with the side effect of over-refusal, where LLMs may reject innocuous prompts and become less helpful. Although the issue of over-refusal has been empirically observed, a systematic measurement is challenging due to the difficulty of crafting prompts that appear harmful but are benign. This study proposes a novel method for automatically generating large-scale sets of "seemingly toxic prompts" (benign prompts likely rejected by LLMs). Leveraging this technique, we introduce OR-Bench, the first large-scale over-refusal benchmark. OR-Bench comprises 80,000 seemingly toxic prompts across 10 common rejection categories, a subset of around 1,000 hard prompts that are challenging even for state-of-the-art LLMs, and an additional 600 toxic prompts to prevent indiscriminate responses. We then conduct a comprehensive study to measure the over-refusal of 25 popular LLMs across 8 model families. Our datasets are available at https://huggingface.co/datasets/bench-LLM/or-bench and the demo can be found at https://huggingface.co/spaces/bench-LLM/or-bench. We hope this benchmark can help the community develop better safety-aligned models.
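
For readers who want to inspect the benchmark directly, the sketch below shows one way to load the released splits from the Hugging Face Hub with the `datasets` library. The subset (config) names and field names are assumptions inferred from the abstract, not confirmed by the paper; consult the dataset card at the URL above for the exact identifiers.

```python
# Minimal sketch of loading OR-Bench from the Hugging Face Hub.
# Assumed subset names ("or-bench-80k", "or-bench-hard-1k", "or-bench-toxic")
# mirror the three splits described in the abstract; verify against the
# dataset card at https://huggingface.co/datasets/bench-LLM/or-bench.
from datasets import load_dataset

# ~80,000 seemingly toxic (benign but likely-refused) prompts across 10 categories.
or_bench_80k = load_dataset("bench-LLM/or-bench", "or-bench-80k")

# ~1,000 hard prompts that still trigger refusals in state-of-the-art LLMs.
or_bench_hard = load_dataset("bench-LLM/or-bench", "or-bench-hard-1k")

# ~600 genuinely toxic prompts, included to catch models that answer everything.
or_bench_toxic = load_dataset("bench-LLM/or-bench", "or-bench-toxic")

# Field names below ("prompt", "category") are assumptions for illustration.
example = or_bench_80k["train"][0]
print(example["prompt"], "->", example["category"])
```

A typical evaluation loop would then send each prompt to the model under test and count refusals on the seemingly toxic sets while checking that the toxic set is still rejected.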

Authors (4)
  1. Justin Cui (9 papers)
  2. Wei-Lin Chiang (19 papers)
  3. Ion Stoica (177 papers)
  4. Cho-Jui Hsieh (211 papers)
Citations (13)