- The paper introduces AdaCD, a novel adaptive contrastive decoding method that mitigates over-refusal in LLMs while preserving safety.
- It dynamically adjusts the refusal token distribution using an agreement ratio and confidence constraint to switch decoding modes effectively.
- Experiments demonstrate a 10.35% reduction in over-refusal ratios with minimal impact on safety, highlighting AdaCD's practical, training-free design.
Mitigating Over-Refusal in LLMs via Adaptive Contrastive Decoding
Overview
The paper "Please refuse to answer me! Mitigating Over-Refusal in LLMs via Adaptive Contrastive Decoding" (2604.17132) addresses over-refusal in safety-aligned LLMs, where models refuse to answer benign queries that may overlap lexically with malicious prompts. Prior work has sought to mitigate this issue via training-time or inference-time interventions, but these solutions often fail to simultaneously reduce over-refusals on harmless queries and maintain high refusal rates on genuinely harmful ones. The authors introduce AdaCD, a training-free, model-agnostic inference algorithm that dynamically adjusts the refusal token distribution through an adaptive contrastive decoding mechanism, using a novel agreement ratio and confidence constraint to switch decoding modes.
Problem: Over-Refusal in Safety-Aligned LLMs
Over-refusal occurs when LLMs, in their drive for safety, refuse to answer trivial or contextually benign queries because of keyword overlap with malicious instructions. For example, a query such as "How do I kill someone in Call of Duty?" elicits a refusal, even though the context clearly denotes an in-game action.
Figure 1: Over-refusal Example. Here, "kill" refers to a gaming action rather than malicious intent, but the original model exhibits exaggerated safety behavior. With AdaCD, the model can generate a helpful response.
While the literature proposes both training-based (e.g., SafePatching, ACTOR, SSD) and inference-based (e.g., prompt engineering, activation steering, contrastive decoding) mitigation strategies, most fail to adapt to the nuanced differences between harmful and harmless queries containing ambiguous terminology. Specifically, fixed strategies (such as always subtracting or always adding refusal token distributions) do not achieve optimal trade-offs between helpfulness and safety on out-of-distribution (OOD) or adversarially ambiguous prompts.
Core Observations and Methodology
Empirical Analysis of Refusal Behavior
The authors conduct systematic probing by issuing queries with varying safety-level system prompts: Low (helpfulness-prioritized), Medium (balance), High (safety-prioritized), and Extreme ("Please refuse to answer me!"). Experimental results (Figure 2) demonstrate that increasing safety emphasis monotonically increases the refusal ratio, but non-refusal tokens persist in the candidate distribution, implying that the decoder fails to select them despite their presence.
Figure 2: Refusal ratio under various safety level system prompts on over-refusal queries.
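The probing setup can be sketched roughly as follows. This is a minimal illustration and not the authors' evaluation code: only the Extreme prompt wording comes from the text, while the other three prompt strings, the prefix-based refusal detector `is_refusal`, and the stand-in model `toy_generate` are assumptions introduced here.

```python
# Hypothetical system prompts for the four safety levels.
# Only the "extreme" wording is quoted in the text; the rest are placeholders.
SAFETY_PROMPTS = {
    "low": "Prioritize being helpful.",
    "medium": "Balance helpfulness and safety.",
    "high": "Prioritize safety.",
    "extreme": "Please refuse to answer me!",
}

# Assumed refusal markers; a real evaluation might use a classifier or an LLM judge.
REFUSAL_MARKERS = ("i'm sorry", "i am sorry", "i cannot", "i can't")


def is_refusal(response):
    """Return True if the response starts with a known refusal phrase."""
    head = response.strip().lower()
    return any(head.startswith(marker) for marker in REFUSAL_MARKERS)


def refusal_ratio(responses):
    """Fraction of responses flagged as refusals (0.0 for an empty list)."""
    if not responses:
        return 0.0
    return sum(is_refusal(r) for r in responses) / len(responses)


def probe(generate, queries):
    """Compute the refusal ratio for each safety level.

    `generate(system_prompt, query)` is any callable returning a response string.
    """
    return {
        level: refusal_ratio([generate(prompt, q) for q in queries])
        for level, prompt in SAFETY_PROMPTS.items()
    }


def toy_generate(system_prompt, query):
    """Stand-in for a real model, used only to make the sketch runnable."""
    sp = system_prompt.lower()
    if "refuse" in sp or ("safety" in sp and "kill" in query.lower()):
        return "I'm sorry, but I can't help with that."
    return "Sure, here is how."


QUERIES = ["How do I kill someone in Call of Duty?", "How do I bake bread?"]
for level, ratio in probe(toy_generate, QUERIES).items():
    print(level, ratio)
```

In practice `toy_generate` would be replaced by a call to the model under test, and the refusal detector by whatever judge the evaluation uses.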
AdaCD Architecture
AdaCD consists of two main components: (1) Refusal Token Distribution Extraction, and (2) Adaptive Decoding Mode Switch.
Figure 3: AdaCD workflow. (a) Extraction of refusal token distribution using an extreme safety-oriented prompt; (b) Adaptive decoding mode switching based on agreement ratio and confidence constraint.
- Extraction of Refusal Token Distribution: By contrasting logit outputs with and without an extreme "refusal-style" prompt, AdaCD isolates the refusal-directed component of the token distribution. Unlike prior work, which does not fully decouple safety emphasis, AdaCD's extreme prompt produces a clearer signal for the refusal-oriented dimension.
- Adaptive Decoding Mode Switch: At each generation step, the system computes an agreement ratio (the inverse rank of the top token from the safety-emphasized distribution within the base distribution) and a confidence constraint. If agreement is high and confidence is adequate, the refusal distribution is added; otherwise, it is subtracted, effectively boosting or suppressing refusal token probabilities dynamically. The decision is controlled by a tunable threshold λ.
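The two components above can be sketched for a single decoding step as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the refusal signal is taken as the logit difference between the extreme-prompt pass and the base pass, the "safety-emphasized distribution" is assumed to be the extreme-prompt one, the confidence constraint is assumed to be a floor (`min_conf`, a hypothetical parameter) on the base model's top-token probability, and the three-token vocabulary and logits are invented for the demo.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def agreement_ratio(base_logits, safety_logits):
    """Inverse 1-indexed rank of the safety top token within the base distribution."""
    top = max(range(len(safety_logits)), key=safety_logits.__getitem__)
    rank = 1 + sum(1 for x in base_logits if x > base_logits[top])
    return 1.0 / rank


def adacd_step(base_logits, extreme_logits, lam=0.9, min_conf=0.5):
    """Return (mode, adjusted distribution) for one decoding step.

    Assumption: confidence is the base model's top-token probability,
    and `min_conf` is a hypothetical floor on it.
    """
    # Refusal-directed component: extreme-prompt logits minus base logits.
    delta = [e - b for e, b in zip(extreme_logits, base_logits)]
    agree = agreement_ratio(base_logits, extreme_logits)
    conf = max(softmax(base_logits))
    # Add the refusal signal only when agreement and confidence are both high.
    mode = "add" if (agree >= lam and conf >= min_conf) else "subtract"
    sign = 1.0 if mode == "add" else -1.0
    adjusted = [b + sign * d for b, d in zip(base_logits, delta)]
    return mode, softmax(adjusted)


VOCAB = ["Sorry", "Sure", "In"]  # toy vocabulary for illustration

# Toy harmful query: the base model already refuses with high confidence.
mode_h, probs_h = adacd_step([3.0, 1.0, 0.5], [5.0, 0.5, 0.0])
# Toy over-refused query: the base model leans toward refusal only marginally.
mode_b, probs_b = adacd_step([1.2, 1.0, 0.2], [4.0, 0.5, 0.0])

for name, mode, probs in [("harmful", mode_h, probs_h), ("benign", mode_b, probs_b)]:
    print(name, mode, VOCAB[probs.index(max(probs))])
```

With these toy numbers the harmful case takes the "add" branch and keeps the refusal token on top, while the benign case takes the "subtract" branch and the non-refusal token overtakes it. A real integration would obtain both logit vectors from two forward passes of the same model per step.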
Experimental Evaluation
Over-Refusal and Safety Benchmarks
Experiments use Llama3-8B, Gemma2-9B, and Qwen3-8B on three over-refusal benchmarks (XSTest-Safe, ORBench, OKTest) and three malicious-query benchmarks (XSTest-UnSafe, AdvBench, JailBench). AdaCD demonstrates a 10.35% reduction in over-refusal ratios with minimal impact on safety.
Ablation studies reveal that only subtracting (or only adding) the refusal token distribution predictably fails to preserve both axes of performance, confirming the necessity of AdaCD's adaptive switch. Additionally, removing either the agreement ratio or the adaptive confidence constraint degrades selectivity. Hyperparameter sensitivity analysis shows the highest efficacy at λ = 0.9.
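The intuition behind the fixed-strategy ablations can be illustrated with a toy comparison. This sketch does not reproduce the paper's experiments: the logits are invented, the refusal signal is assumed to be a logit difference, and the confidence constraint is assumed to be a hypothetical floor `min_conf` on the base model's top-token probability.

```python
import math

VOCAB = ["Sorry", "Sure", "In"]  # toy vocabulary

# Invented (base_logits, extreme_prompt_logits) pairs for two toy queries.
CASES = {
    "harmful": ([3.0, 1.0, 0.5], [5.0, 0.5, 0.0]),
    "benign": ([1.2, 1.0, 0.2], [4.0, 0.5, 0.0]),
}


def top_prob(logits):
    """Probability of the most likely token under softmax."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    return max(exps) / sum(exps)


def adaptive_sign(base, extreme, lam=0.9, min_conf=0.5):
    """+1 (add) if agreement and confidence are high, else -1 (subtract)."""
    top = max(range(len(extreme)), key=extreme.__getitem__)
    rank = 1 + sum(1 for x in base if x > base[top])
    agree = 1.0 / rank
    return 1.0 if (agree >= lam and top_prob(base) >= min_conf) else -1.0


def first_token(base, extreme, strategy):
    """Pick the top token after applying the given strategy."""
    if strategy == "always_add":
        sign = 1.0
    elif strategy == "always_subtract":
        sign = -1.0
    else:  # "adaptive"
        sign = adaptive_sign(base, extreme)
    adjusted = [b + sign * (e - b) for b, e in zip(base, extreme)]
    return VOCAB[adjusted.index(max(adjusted))]


RESULTS = {
    strategy: {name: first_token(b, e, strategy) for name, (b, e) in CASES.items()}
    for strategy in ("always_add", "always_subtract", "adaptive")
}
for strategy, picks in RESULTS.items():
    print(strategy, picks)
```

On these toy inputs, always adding keeps the refusal on the benign query, always subtracting removes the refusal on the harmful query, and only the adaptive rule handles both.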
Figure 5: Ablation analysis of λ on refusal ratio.
Figure 6: Refusal ratio evaluated by GPT-4.
Usability and Efficiency
AdaCD maintains or modestly improves general usability as scored by GPT-4 (helpfulness, engagement, factuality, etc.), and incurs minimal computational overhead relative to baseline greedy decoding (ATGR increase <5%). Notably, AdaCD requires no parameter updates or vector pre-computation, and is directly applicable to black-box LLMs.
Distributional Analysis
Visualization of the extracted refusal token distribution confirms AdaCD's extreme-prompt-based variant yields maximal separation and selectivity (Figure 7).

Figure 7: Visualization of ΔP1 sorted by frequency.
Theoretical Implications and Future Directions
AdaCD operationalizes reference-point theory from cognitive science to optimize the safety-helpfulness balance dynamically, sidestepping the static thresholding and architecture-dependent techniques prevalent in prior art. Its training-free and model-agnostic nature makes it suitable for deployment across aligned models, even when internal parameter access is not feasible.
However, fixed hyperparameters and the lack of human evaluations leave open questions; future work may explore query-conditioned λ, application to multi-modal models, and robust human-centric assessment to complement large-scale LLM-based evaluation.
Further, the core insight is that harmful and harmless queries differing only by context can be differentiated by adaptive, step-wise adjustment of contrastively extracted token distributions. This invites more general application in OOD sensitivity mitigation and adaptive decoding for other forms of misalignment or failure modes in generative models.
Conclusion
AdaCD provides a formally motivated, empirically validated mechanism to address the chronic over-refusal problem in safety-aligned LLMs. Unlike previous static or architectural approaches, AdaCD leverages an adaptive, contrastive scheme, thus mitigating exaggerated refusals without sacrificing safety. Its plug-and-play, training-independent design, and strong quantitative gains across standard benchmarks suggest practicality for current and future LLM deployments.