OVERBENCH: Dynamic Benchmark for Oversensitivity

Updated 4 July 2026

OVERBENCH is a dynamic benchmark that measures oversensitivity in LLMs by identifying false refusals of benign prompts.
It uses a three-stage pipeline combining proxy modeling, feature attribution, and adversarial prompt generation to tailor evaluations to each model's behavior.
Empirical findings reveal distinct oversensitivity profiles across 25 LLMs spanning multiple model families, emphasizing the need for continuous, adaptive safety monitoring.

OVERBENCH is a dynamic benchmark for evaluating oversensitivity in LLMs: the tendency to wrongly refuse harmless prompts because they contain superficially risky wording or lexically resemble unsafe requests. It was introduced to address limitations of static safety benchmarks, which become outdated as models evolve and can suffer from contamination when fixed evaluation items enter training corpora. In the literature, the name also appears as a prompt-level alias for ONEBench, a sample-level benchmarking paradigm for open-ended capabilities. OVERBENCH proper, however, denotes the oversensitivity benchmark introduced in "Dynamic Evaluation for Oversensitivity in LLMs" (Pu et al., 21 Oct 2025), while the aliasing remark appears separately in "ONEBench to Test Them All" (Ghosh et al., 2024).

1. Concept and motivation

OVERBENCH is designed to measure false refusals: cases in which a model rejects a prompt whose semantic intent is benign. The benchmark’s motivating example is a query such as “How can I kill time on a long flight?”, where the token “kill” may trigger defensive refusal even though the request is harmless. The benchmark therefore treats oversensitivity as a usability and calibration problem rather than a conventional safety-success measure (Pu et al., 21 Oct 2025).

The benchmark was created because existing oversensitivity evaluations are described as static, quickly outdated, vulnerable to contamination, and narrow. The underlying concern is that as alignment systems become more restrictive, models may overgeneralize from hazardous content categories and reject benign prompts that share superficial features with harmful ones. This makes it harder to distinguish genuinely unsafe content from harmless requests and weakens the practical utility of refusal behavior in deployment (Pu et al., 21 Oct 2025).

A central design principle of OVERBENCH is that oversensitivity should be evaluated with model-specific challenging datasets rather than a single fixed test set. This suggests an evolving benchmark logic: the test distribution is regenerated to track each target model’s current refusal boundary rather than assuming that a static corpus remains informative over time (Pu et al., 21 Oct 2025).

2. Formalization of oversensitivity

The benchmark defines oversensitivity as a model rejecting a prompt that belongs to the benign subset of prompts. Let $M$ denote an LLM, $Q$ a prompt set, and $Q^{\text{benign}}$ the subset whose semantic intent is harmless. Let $Q_M^r$ be the prompts rejected by $M$, and $Q_M^a$ the prompts accepted by $M$. The benchmark defines oversensitivity at the prompt level as follows:

$\text{Oversensitivity}(q) = \begin{cases} 1 & \text{if } q \in Q^{\text{benign}} \land q \in Q^r_M \\ 0 & \text{otherwise.} \end{cases}$

The primary aggregate metric is the Oversensitivity Rate (OSR), defined as the fraction of benign prompts refused by the model:

$\text{OSR}(M)=\frac{|Q^{\text{benign}} \cap Q^r_M|}{|Q^{\text{benign}}|}$

This formulation makes the benchmark explicitly refusal-centric. It does not ask whether a refusal is safe in the abstract; it asks whether the refusal is incorrect given benign intent. A plausible implication is that OVERBENCH shifts the evaluation target from broad alignment compliance toward refusal calibration at the benign–harmful boundary (Pu et al., 21 Oct 2025).
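As a small illustration of the OSR definition above, the following sketch computes the metric from two sets of prompts. The function name and the toy prompts are hypothetical and are not taken from the paper; only the ratio itself follows the formula.

```python
def oversensitivity_rate(benign_prompts, rejected_prompts):
    """Fraction of benign prompts that the model refused (OSR)."""
    benign = set(benign_prompts)
    if not benign:
        raise ValueError("benign prompt set must be non-empty")
    rejected = set(rejected_prompts)
    # |Q_benign ∩ Q_rejected| / |Q_benign|
    return len(benign & rejected) / len(benign)


# Hypothetical toy data: four benign prompts, one of which was refused.
benign = {
    "How can I kill time on a long flight?",
    "What is a good recipe for banana bread?",
    "How do I shoot a photo in low light?",
    "Which knots are best for climbing?",
}
rejected = {
    "How can I kill time on a long flight?",
    "How do I build a weapon?",  # refused, but not part of the benign set
}

osr = oversensitivity_rate(benign, rejected)
print(f"OSR = {osr:.2f}")
```

Refusals of prompts outside the benign set do not affect the result, which matches the refusal-centric framing: only incorrect refusals are counted.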

3. Dynamic construction pipeline

OVERBENCH is built through a three-stage pipeline centered on proxy modeling, feature attribution, and LLM-based adversarial generation. The first stage trains, for each target model $M$, a lightweight proxy classifier $f_\theta$ that imitates whether the target model would accept or refuse a prompt. The proxy model is DeBERTa-v3-base, trained on query-response labels derived from the target model, and used as a cost-effective filter rather than as the final evaluator. Its objective can be written as:

$\theta^{*} = \arg\min_{\theta} \sum_{q \in Q} \mathcal{L}\big(f_\theta(q),\, y_M(q)\big)$

where $q \in Q$ is a prompt, $y_M(q) \in \{0, 1\}$ is the target model’s refusal decision (1 if $q \in Q_M^r$, 0 otherwise), and $\mathcal{L}$ is the classification loss (Pu et al., 21 Oct 2025).
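The idea of a proxy that imitates a target model's accept/refuse decisions can be sketched with a much simpler classifier. The example below deliberately swaps DeBERTa-v3-base for a bag-of-words logistic regression trained with plain SGD on cross-entropy loss, so that it runs with the standard library alone. The function names, the toy prompts, the labels, and the hyperparameters are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def tokenize(text):
    return text.lower().replace("?", " ").replace(".", " ").split()


def train_proxy(prompts, labels, epochs=200, lr=0.2):
    """Fit a bag-of-words logistic regression with SGD.

    labels[i] is 1 if the target model refused prompts[i], else 0.
    This is a stand-in for the DeBERTa-based proxy, not the paper's model.
    """
    weights = defaultdict(float)
    bias = 0.0
    for _ in range(epochs):
        for text, y in zip(prompts, labels):
            tokens = set(tokenize(text))
            z = bias + sum(weights[t] for t in tokens)
            p = 1.0 / (1.0 + math.exp(-z))
            grad = p - y  # derivative of cross-entropy w.r.t. the logit
            bias -= lr * grad
            for t in tokens:
                weights[t] -= lr * grad
    return weights, bias


def predict_refusal(model, text):
    """Predicted probability that the target model would refuse `text`."""
    weights, bias = model
    z = bias + sum(weights.get(t, 0.0) for t in set(tokenize(text)))
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical observations of a target model (1 = refused, 0 = answered).
prompts = [
    "how can i kill time on a flight",
    "how do i kill a process in linux",
    "how do i bake bread",
    "what time is my flight",
]
labels = [1, 1, 0, 0]

model = train_proxy(prompts, labels)
for text in prompts:
    print(f"{predict_refusal(model, text):.2f}  {text}")
```

Because the proxy is cheap to query, it can screen many candidate prompts before any of them is sent to the target model, which is the filtering role described above.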

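The second stage performs feature attribution to identify tokens that most strongly influence refusal. The benchmark uses Integrated Gradients, which in its standard form is

$\text{IG}_i(x) = (x_i - x'_i) \int_0^1 \frac{\partial F\big(x' + \alpha (x - x')\big)}{\partial x_i}\, d\alpha$

where $F$ is the proxy's refusal score, $x$ is the input representation, and $x'$ is a baseline input. It then applies a frequency correction to the attribution scores, with the correction hyperparameter held fixed in the reported experiments. This adjustment is intended to surface meaningful trigger features rather than merely frequent words (Pu et al., 21 Oct 2025).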
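To make the attribution step concrete, the sketch below approximates Integrated Gradients for a tiny logistic scorer using a midpoint Riemann sum. The scorer, its weights, the all-zeros baseline, and the binary token-presence encoding are assumptions chosen so the example runs without external libraries; they stand in for the DeBERTa proxy and its embeddings. The frequency correction is not shown.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def refusal_score(weights, bias, x):
    """Toy refusal score F(x) = sigmoid(w . x + b)."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def integrated_gradients(weights, bias, x, baseline=None, steps=200):
    """Approximate IG_i(x) along the straight path from baseline to x."""
    n = len(x)
    if baseline is None:
        baseline = [0.0] * n  # assumed baseline: no tokens present
    grad_sums = [0.0] * n
    for k in range(steps):
        alpha = (k + 0.5) / steps  # midpoint of the k-th sub-interval
        point = [b0 + alpha * (xi - b0) for xi, b0 in zip(x, baseline)]
        p = refusal_score(weights, bias, point)
        for i in range(n):
            # dF/dx_i for a logistic scorer is p * (1 - p) * w_i
            grad_sums[i] += p * (1.0 - p) * weights[i]
    return [(xi - b0) * g / steps for xi, b0, g in zip(x, baseline, grad_sums)]


# Hypothetical vocabulary and weights; x marks which tokens are present.
tokens = ["kill", "time", "flight"]
weights = [2.0, -1.0, 0.5]
bias = -0.5
x = [1.0, 1.0, 0.0]

attributions = integrated_gradients(weights, bias, x)
for token, score in zip(tokens, attributions):
    print(f"{token:>6}: {score:+.4f}")
```

A useful sanity check is the completeness property of Integrated Gradients: the attributions should sum to the difference between the score at the input and the score at the baseline.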

The third stage uses those attributed features to generate new benign prompts likely to provoke refusal. The generation step can be modeled as sampling a new prompt from a generator conditioned on the attributed features,

$q' \sim G(\cdot \mid t_1, \ldots, t_k)$

where $G$ is the generator LLM and $t_1, \ldots, t_k$ are the top-attributed tokens. In practice, it uses GPT-4o-mini as the generator with temperature 1.0, top-p 0.8, the top-3 attribution tokens as conditioning features (so $k = 3$), and a cap on how often each feature may be used. This closed-loop procedure repeatedly expands the prompt pool by generating benign prompts that are predicted by the proxy to remain rejectable, thereby tracking emerging refusal triggers for each model (Pu et al., 21 Oct 2025).
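The closed loop described above can be sketched as follows. This is a minimal illustration under stated assumptions: `generate` and `proxy_refuses` are hypothetical callables standing in for the GPT-4o-mini generator and the trained proxy, and the cap value and number of rounds are placeholders rather than the settings used in the paper.

```python
from collections import Counter


def expand_prompt_pool(ranked_features, generate, proxy_refuses,
                       top_k=3, usage_cap=2, rounds=4):
    """Grow a pool of candidate prompts the proxy predicts will be refused.

    ranked_features: tokens sorted by attribution score, highest first.
    generate: callable taking a list of features, returning one prompt.
    proxy_refuses: callable taking a prompt, returning True or False.
    """
    usage = Counter()
    pool = []
    for _ in range(rounds):
        # Use the highest-ranked features that are still under the cap.
        available = [f for f in ranked_features if usage[f] < usage_cap]
        features = available[:top_k]
        if not features:
            break
        candidate = generate(features)
        usage.update(features)
        # Keep only new prompts that the proxy predicts would be refused.
        if proxy_refuses(candidate) and candidate not in pool:
            pool.append(candidate)
    return pool, usage


# Stubs for illustration only; a real run would call an LLM and a proxy.
def stub_generate(features):
    return "benign prompt using: " + ", ".join(features)


def stub_proxy_refuses(prompt):
    return "kill" in prompt


ranked = ["kill", "sneak", "steal", "hack", "shoot"]
pool, usage = expand_prompt_pool(ranked, stub_generate, stub_proxy_refuses)
print(pool)
print(dict(usage))
```

The usage cap forces the loop to move on to lower-ranked features once the strongest triggers have been used, which keeps the generated pool from collapsing onto a handful of tokens.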

4. Benchmark composition and semantic taxonomy

OVERBENCH aggregates model-specific adversarial prompts across 25 LLMs from multiple families and contains 450,000 samples. The benchmark also defines OVERBENCH-Hard, a 30,000-sample distilled subset consisting of prompts that were rejected by at least five models, intended as a more difficult and cost-effective evaluation set (Pu et al., 21 Oct 2025).
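The selection rule for OVERBENCH-Hard (prompts rejected by at least five models) amounts to a simple count over per-model rejection sets. The sketch below uses hypothetical model names and prompt identifiers; only the threshold of five comes from the text.

```python
from collections import Counter


def select_hard_subset(rejections_by_model, min_models=5):
    """Return prompts rejected by at least `min_models` distinct models."""
    counts = Counter()
    for rejected in rejections_by_model.values():
        counts.update(set(rejected))  # count each model at most once per prompt
    return {prompt for prompt, n in counts.items() if n >= min_models}


# Hypothetical rejection sets for six models.
rejections = {
    "model_a": {"p1", "p2", "p3"},
    "model_b": {"p1", "p2", "p3"},
    "model_c": {"p1", "p2", "p3"},
    "model_d": {"p1", "p2", "p3"},
    "model_e": {"p1", "p2"},
    "model_f": {"p1"},
}

hard = select_hard_subset(rejections)
print(sorted(hard))
```

Here `p3` is excluded because only four models rejected it.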

The covered model families include GPT, Qwen, DeepSeek, Gemma, Llama, Phi, and Mistral. The listed models include gpt-4o-mini, gpt-3.5-turbo, Qwen-7B-Chat, Qwen-14B-Chat, Qwen-72B-Chat, Qwen3-0.6B, Qwen3-1.7B, Qwen3-8B, Qwen3-14B, Qwen3-32B, DeepSeek-V2-Lite, gemma-3-1b, gemma-3-4b, gemma-3-12b, gemma-3-27b, Llama-3.1-8B, Llama-3.1-70B, Llama-3.2-1B, Llama-3.2-3B, Llama-3.3-70B, Phi-3.5-MoE, Phi-3.5-mini, Phi-4, Mistral-Nemo-Instruct-2407, and Mistral-Small-3.1-24B-Instruct-2503 (Pu et al., 21 Oct 2025).

For semantic analysis, the benchmark groups prompts into four main categories: Illegal Activities, Privacy Invasion, Violence and Harm, and Bias and Discrimination. Low-frequency prompt types such as social engineering are grouped into Others. These categories are used to analyze which kinds of harmless prompts are over-refused and to inspect whether refusal triggers cluster around particular safety themes (Pu et al., 21 Oct 2025).

The benchmark also studies family-specific trigger patterns. The reported analysis notes that Qwen and Gemma often react to tokens such as “sneak” and “ians”, whereas Llama models exhibit a more diverse trigger distribution. It also reports that some tokens related to theft or insults appear salient across multiple families, suggesting partially shared refusal heuristics. This suggests that OVERBENCH is not merely a dataset of prompts, but also a diagnostic instrument for studying the lexical and family-level structure of defensive behavior (Pu et al., 21 Oct 2025).

5. Evaluation protocol and empirical findings

The primary evaluation metric is OSR, measured on both OVERBENCH and OVERBENCH-Hard. The proxy-training stage uses 30,000 prompts sampled from HH-RLHF and ToxiGen, split 90% train / 5% validation / 5% test, with the proxy trained for 3 epochs at a fixed learning rate. Refusal labeling uses phrase matching for clear refusals and GPT-4o-mini for ambiguous responses. Manual verification on 500 samples yields 94% precision and 91% recall for the automatic labeling procedure (Pu et al., 21 Oct 2025).
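The two-tier labeling procedure and its manual verification can be illustrated with a short sketch. The refusal phrases, the rule that any response without a matching phrase is treated as ambiguous, and the `judge` callable (standing in for the GPT-4o-mini call) are assumptions made for this example and are not the paper's actual phrase list or prompt.

```python
# Assumed phrase list; the paper's actual list is not given in this text.
REFUSAL_PHRASES = ("i'm sorry", "i cannot", "i can't", "i am unable")


def label_refusal(response, judge=None):
    """Return True if the response is labeled as a refusal."""
    text = response.lower().replace("\u2019", "'")  # normalize curly apostrophes
    if any(phrase in text for phrase in REFUSAL_PHRASES):
        return True  # clear refusal caught by phrase matching
    if judge is not None:
        return bool(judge(response))  # defer unmatched cases to a judge
    return False


def precision_recall(predicted, actual):
    """Precision and recall of predicted refusal labels against manual ones."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical responses with manual labels (True = refusal).
responses = [
    "I'm sorry, but I can't help with that.",
    "Sure! Try a podcast or a book.",
    "I would rather not discuss this topic.",
    "Here is a recipe.",
]
manual = [True, False, True, False]


def stub_judge(response):
    return "rather not" in response.lower()


phrase_only = [label_refusal(r) for r in responses]
with_judge = [label_refusal(r, judge=stub_judge) for r in responses]
print(precision_recall(phrase_only, manual))
print(precision_recall(with_judge, manual))
```

In this toy data, phrase matching alone misses the indirect refusal, so recall improves once the judge handles unmatched responses.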

The main empirical result is that oversensitivity varies substantially across model families. The paper reports that Gemma models show the most severe oversensitivity, Phi models are next most oversensitive, and Llama-70B models show the least tendency to reject harmless prompts. It also reports that scaling does not reliably reduce oversensitivity: in Llama, oversensitivity decreases from 1B to 70B, whereas in Gemma and Qwen the opposite trend appears. This indicates that model scale alone is not a stable predictor of refusal calibration (Pu et al., 21 Oct 2025).

The results also show substantial within-family similarity. Models from the same family and similar generation tend to exhibit comparable oversensitivity profiles, which the paper interprets as evidence of shared alignment strategies. At the same time, some trigger features are shared across families, implying that certain defensive heuristics may be widely learned rather than architecture-specific (Pu et al., 21 Oct 2025).

A broader conclusion drawn by the benchmark is that oversensitivity should be monitored continuously rather than assessed through one-time tests. Because prompts are generated to target each model’s current refusal behavior, OVERBENCH is structured as a benchmark that evolves with the models it evaluates. This contrasts with fixed-test paradigms and suggests a more adaptive approach to safety monitoring (Pu et al., 21 Oct 2025).

6. Significance, limitations, and benchmark context

The principal significance of OVERBENCH lies in its reframing of safety evaluation. Rather than asking only whether a model refuses harmful prompts, it asks whether the model unnecessarily refuses benign ones. Under this framing, stronger refusal behavior or increased defensiveness is not automatically desirable. In deployment terms, the benchmark positions oversensitivity as a failure mode that harms usability, obscures the true harmfulness boundary, and complicates the interpretation of safety improvements (Pu et al., 21 Oct 2025).

The benchmark’s main strengths are described as dynamic and evolving evaluation, model-specific prompt generation, efficient proxy-based filtering, explainable trigger analysis through attribution, and broad coverage across major LLM families. These features make OVERBENCH suitable for monitoring safety drift, diagnosing refusal triggers, and comparing refusal calibration across architectures and scales (Pu et al., 21 Oct 2025).

Its limitations are also explicit. The paper notes that it focuses mainly on false refusals and does not separately analyze true positive refusals, meaning cases where refusal is actually justified. It also depends on a proxy model and generated prompts, so coverage remains approximate, and attribution may not capture every cause of refusal. Dynamic generation mitigates staleness, but the benchmark still requires ongoing refreshment as models change (Pu et al., 21 Oct 2025).

Within the broader benchmark landscape, OVERBENCH stands apart from sample-level open-ended evaluation systems such as ONEBench, which aggregate heterogeneous measurements across reusable data pools and support capability-specific querying (Ghosh et al., 2024). By contrast, OVERBENCH is narrowly focused on a single failure mode—oversensitivity—but treats that failure mode as dynamic, model-aware, and continuously regenerable. This difference is consequential: ONEBench generalizes evaluation infrastructure, whereas OVERBENCH specializes evaluation pressure at a moving safety boundary.


References

Pu et al. Dynamic Evaluation for Oversensitivity in LLMs (2025)

Ghosh et al. ONEBench to Test Them All: Sample-Level Benchmarking Over Open-Ended Capabilities (2024)
