Please refuse to answer me! Mitigating Over-Refusal in Large Language Models via Adaptive Contrastive Decoding

Published 18 Apr 2026 in cs.CL | (2604.17132v1)

Abstract: Safety-aligned LLMs often generate refusal responses to harmless queries due to the over-refusal problem. However, existing methods for mitigating over-refusal cannot maintain a low refusal ratio for harmless queries while keeping a high refusal ratio for malicious ones. In this paper, we analyze how system prompts with varying safety levels affect LLM refusal behaviors when facing over-refusal queries. A key observation is that, when LLMs suffer from the over-refusal issue, non-refusal tokens remain present in the next-token candidate list, but the model systematically fails to select them, despite the generation of refusal tokens. Based on this observation, we propose a training-free and model-agnostic approach, Adaptive Contrastive Decoding (AdaCD), to mitigate over-refusal while maintaining LLM safety. First, AdaCD compares the output distributions of the LLM with or without an extreme safety system prompt to refine the refusal token distribution. Second, we introduce an adaptive contrastive decoding strategy that dynamically incorporates or removes the refusal token distribution, adaptively boosting the probability of selecting refusal or non-refusal tokens. Experimental results on five benchmark datasets show that, on average, AdaCD reduces the refusal ratio for over-refusal queries by 10.35%, yet still increases the refusal ratio for malicious queries by 0.13%. Code is available at https://github.com/OutdoorManofML/AdaCD.


Authors (5)

Summary

The paper introduces AdaCD, a novel adaptive contrastive decoding method that mitigates over-refusal in LLMs while preserving safety.
It dynamically adjusts the refusal token distribution using an agreement ratio and confidence constraint to switch decoding modes effectively.
Experiments demonstrate a 10.35% reduction in over-refusal ratios with minimal impact on safety, highlighting AdaCD’s practical, training-free design.

Mitigating Over-Refusal in LLMs via Adaptive Contrastive Decoding

Overview

The paper "Please refuse to answer me! Mitigating Over-Refusal in LLMs via Adaptive Contrastive Decoding" (2604.17132) addresses the challenge of over-refusal in safety-aligned LLMs, where LLMs refuse to answer benign queries that may overlap lexically with malicious prompts. While prior work has sought to mitigate this issue via training- or inference-time interventions, these solutions often fail to simultaneously reduce over-refusals on harmless queries and maintain high refusal rates on genuinely harmful queries. The authors introduce AdaCD, a training-free, model-agnostic inference algorithm that dynamically adjusts the refusal token distribution through an adaptive contrastive decoding mechanism, leveraging a novel agreement ratio and confidence constraint for decoding mode switching.

Problem: Over-Refusal in Safety-Aligned LLMs

Over-refusal occurs when safety-aligned LLMs refuse to answer trivial or contextually benign queries because of keyword overlap with malicious instructions. For example, a query such as "How do I kill someone in Call of Duty?" elicits a refusal, even though the context clearly denotes an in-game action.

Figure 1: Over-refusal Example. Here, "kill" refers to a gaming action rather than malicious intent, but the original model exhibits exaggerated safety behavior. With AdaCD, the model can generate a helpful response.

While the literature proposes both training-based (e.g., SafePatching, ACTOR, SSD) and inference-based (e.g., prompt engineering, activation steering, contrastive decoding) mitigation strategies, most fail to adapt to the nuanced differences between harmful and harmless queries containing ambiguous terminology. Specifically, fixed strategies (such as always subtracting or always adding refusal token distributions) do not achieve optimal trade-offs between helpfulness and safety on out-of-distribution (OOD) or adversarially ambiguous prompts.

Core Observations and Methodology

Empirical Analysis of Refusal Behavior

The authors conduct systematic probing by issuing queries with varying safety-level system prompts: Low (helpfulness-prioritized), Medium (balance), High (safety-prioritized), and Extreme ("Please refuse to answer me!"). Experimental results (Figure 2) show that increasing the safety emphasis monotonically increases the refusal ratio, yet non-refusal tokens persist in the next-token candidate list, implying that the model fails to select them despite their presence.

Figure 2: Refusal ratio under various safety level system prompts on over-refusal queries.
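
The refusal ratio reported in Figure 2 is the fraction of responses classified as refusals. A minimal sketch of that measurement follows, assuming a keyword-based refusal detector. The marker list, the `is_refusal` and `refusal_ratio` helpers, and the toy responses are illustrative assumptions, not the authors' evaluation code.

```python
# Hypothetical refusal markers; a real evaluation would use a longer list
# or an LLM judge.
REFUSAL_MARKERS = ("i'm sorry", "i am sorry", "i cannot", "i can't", "i apologize")


def is_refusal(response):
    """Return True if the response contains any refusal marker."""
    text = response.strip().lower()
    return any(marker in text for marker in REFUSAL_MARKERS)


def refusal_ratio(responses):
    """Fraction of responses classified as refusals (0.0 for an empty list)."""
    if not responses:
        return 0.0
    return sum(is_refusal(r) for r in responses) / len(responses)


# Toy responses grouped by the safety level of the system prompt (made up).
responses_by_level = {
    "low": ["Sure, here is a strategy.", "Here is how to do it in the game."],
    "medium": ["I cannot help with that.", "Sure, here is a strategy."],
    "extreme": ["I'm sorry, but I can't help with that.", "I cannot assist."],
}

ratios = {level: refusal_ratio(rs) for level, rs in responses_by_level.items()}
print(ratios)
```

On this toy data the ratio rises from 0.0 (low) through 0.5 (medium) to 1.0 (extreme), mirroring the monotonic trend the section describes.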

AdaCD Architecture

AdaCD consists of two main components: (1) Refusal Token Distribution Extraction, and (2) Adaptive Decoding Mode Switch.

Figure 3: AdaCD workflow. (a) Extraction of refusal token distribution using an extreme safety-oriented prompt; (b) Adaptive decoding mode switching based on agreement ratio and confidence constraint.

Extraction of Refusal Token Distribution: By contrasting logit outputs with and without an extreme "refusal-style" prompt, AdaCD isolates the refusal-directed component of the token distribution. Unlike prior work which does not fully decouple safety emphasis, AdaCD’s extreme prompt produces a clearer signal for the refusal-oriented dimension.
Adaptive Decoding Mode Switch: At each generation step, the system computes an agreement ratio (the inverse rank of the top token from the safety-emphasized distribution within the base distribution), and a confidence constraint. If agreement is high and confidence is adequate, the refusal distribution is added; otherwise, it is subtracted, effectively boosting or suppressing refusal token probabilities dynamically. The decision is controlled by a tunable threshold $\lambda$.
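
The two components above can be sketched for a single decoding step as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. It assumes that the refusal token distribution is the difference between the safety-prompted and base probabilities, that the confidence constraint checks the base probability of the safety-prompted top token against a threshold `min_conf`, and that the adjustment is scaled by a weight `alpha`. The names `adacd_step`, `agreement_ratio`, `alpha`, and `min_conf` are hypothetical; only the inverse-rank agreement ratio and the threshold $\lambda$ (`lam`) come from the description above.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    """Index of the largest value (first one on ties)."""
    return max(range(len(values)), key=values.__getitem__)


def agreement_ratio(p_base, p_safe):
    """Inverse rank (1 = top) of the safety-prompted top token in the base distribution."""
    top_safe = argmax(p_safe)
    rank = 1 + sum(1 for p in p_base if p > p_base[top_safe])
    return 1.0 / rank


def adacd_step(logits_base, logits_safe, lam=0.9, alpha=1.0, min_conf=0.5):
    """Pick the next token by adding or subtracting the refusal distribution."""
    p_base = softmax(logits_base)   # model without the extreme safety prompt
    p_safe = softmax(logits_safe)   # model with the extreme safety prompt
    # Assumed form of the refusal token distribution: probability difference.
    delta = [s - b for s, b in zip(p_safe, p_base)]
    agree = agreement_ratio(p_base, p_safe)
    # Assumed confidence constraint: base model must itself back the refusal token.
    confident = p_base[argmax(p_safe)] >= min_conf
    mode = "add" if (agree >= lam and confident) else "subtract"
    sign = 1.0 if mode == "add" else -1.0
    scores = [b + sign * alpha * d for b, d in zip(p_base, delta)]
    return argmax(scores), mode


# Toy vocabulary: index 0 = refusal token, 1 and 2 = non-refusal tokens.
print(adacd_step([3.0, 1.0, 0.0], [5.0, 0.0, 0.0]))  # base strongly refuses
print(adacd_step([1.0, 0.8, 0.0], [4.0, 0.0, 0.0]))  # base refuses only weakly
```

In the first call the base model ranks the refusal token first with high probability, so the refusal distribution is added and token 0 is kept. In the second call the refusal token leads only narrowly, the assumed confidence check fails, the refusal distribution is subtracted, and the non-refusal token 1 wins even though greedy decoding on the base distribution alone would have picked the refusal token.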

Experimental Evaluation

Over-Refusal and Safety Benchmarks

Extensive experiments were performed using Llama3-8B, Gemma2-9B, and Qwen3-8B on XSTest-Safe, ORBench, OKTest (over-refusal), and XSTest-UnSafe, AdvBench, JailBench (malicious queries). AdaCD demonstrates:

An average reduction of over-refusal ratios by 10.35% (absolute) compared to default inference, significantly outperforming surgical activation or contrastive decoding baselines such as SelfCD and SafeDecoding.
A marginal increase (+0.13%) in refusal ratio on genuinely malicious queries, indicating that the reduction in over-refusal does not come at the cost of safety on these benchmarks.

Figure 4: Average agreement ratio under over-refusal (dashed line) and malicious (solid line) scenarios.

Ablation studies reveal that only subtracting (or only adding) the refusal token distribution predictably fails to preserve both axes of performance, confirming the necessity of AdaCD's adaptive switch. Additionally, removal of either the agreement ratio or adaptive confidence constraint degrades selectivity. Hyperparameter sensitivity analysis shows highest efficacy for $\lambda=0.9$.

Figure 5: Ablation analysis of $\lambda$ on refusal ratio.

Figure 6: Refusal ratio evaluated by GPT-4.

Usability and Efficiency

AdaCD maintains or modestly improves general usability as scored by GPT-4 (helpfulness, engagement, factuality, etc.), and incurs minimal computational overhead relative to baseline greedy decoding (an increase of less than 5% in the average token generation time ratio, ATGR). Notably, AdaCD requires no parameter updates or vector pre-computation, and is directly applicable to black-box LLMs that expose next-token distributions.
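
To make the overhead figure concrete, the sketch below computes ATGR, assuming it denotes the average token generation time ratio used in prior decoding-defense work: per-token generation time with the method divided by per-token time without it. The `atgr` helper and the timing numbers are hypothetical and chosen only to illustrate the arithmetic.

```python
def atgr(time_with, tokens_with, time_without, tokens_without):
    """Ratio of average per-token generation time with vs. without the method."""
    per_token_with = time_with / tokens_with
    per_token_without = time_without / tokens_without
    return per_token_with / per_token_without


# Made-up timings: 200 tokens in 10.4 s with the method, 10.0 s without.
ratio = atgr(10.4, 200, 10.0, 200)
overhead_percent = (ratio - 1.0) * 100.0
print(round(ratio, 2), round(overhead_percent, 1))
```

With these made-up timings the ratio is about 1.04, which is a 4% overhead and would fall within the reported bound.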

Distributional Analysis

Visualization of the extracted refusal token distribution confirms AdaCD’s extreme-prompt-based variant yields maximal separation and selectivity (Figure 7).

Figure 7: Visualization of $\Delta P_1$ sorted by frequency.

Theoretical Implications and Future Directions

AdaCD operationalizes reference-point theory from cognitive science to optimize safety–helpfulness balance dynamically, sidestepping static thresholding and architecture-dependent techniques prevalent in prior art. Its training-free and model-agnostic nature makes it suitable for deployment across aligned models, even when internal parameter access is not feasible.

However, fixed hyperparameters and lack of human evaluations present open questions; future work may explore query-conditioned $\lambda$, application to multi-modal models, and robust human-centric assessment to complement large-scale LLM-based evaluation.

Further, the core insight (that harmful and harmless queries differing by context can be differentiated by adaptive, step-wise adjustment of contrastively extracted token distributions) invites more general application in OOD sensitivity mitigation and adaptive decoding for other forms of misalignment or failure modes in generative models.

Conclusion

AdaCD provides a formally motivated, empirically validated mechanism to address the chronic over-refusal problem in safety-aligned LLMs. Unlike previous static or architectural approaches, AdaCD uses an adaptive contrastive scheme that mitigates exaggerated refusals without sacrificing safety. Its plug-and-play, training-free design and strong quantitative gains across standard benchmarks suggest that it is practical for current and future LLM deployments.
