
Proactive defense against LLM Jailbreak (2510.05052v1)

Published 6 Oct 2025 in cs.CR and cs.CL

Abstract: The proliferation of powerful LLMs has necessitated robust safety alignment, yet these models remain vulnerable to evolving adversarial attacks, including multi-turn jailbreaks that iteratively search for successful queries. Current defenses, primarily reactive and static, often fail to counter these search-based attacks. In this paper, we introduce ProAct, a novel proactive defense framework designed to disrupt and mislead autonomous jailbreaking processes. Our core idea is to intentionally provide adversaries with "spurious responses" that appear to be results of successful jailbreak attacks but contain no actual harmful content. These misleading responses provide false signals to the attacker's internal optimization loop, causing the adversarial search to terminate prematurely and effectively jailbreaking the jailbreak. By conducting extensive experiments across state-of-the-art LLMs, jailbreaking frameworks, and safety benchmarks, our method consistently and significantly reduces attack success rates by up to 92%. When combined with other defense frameworks, it further reduces the success rate of the latest attack strategies to 0%. ProAct represents an orthogonal defense strategy that can serve as an additional guardrail to enhance LLM safety against the most effective jailbreaking attacks.

Summary

  • The paper introduces ProAct, a framework that proactively disrupts iterative LLM jailbreak attacks by generating benign spurious responses.
  • It employs a multi-agent pipeline—including a User Intent Analyzer, ProAct Defender, and Surrogate Evaluator—to selectively intervene on malicious queries.
  • Empirical results show up to 92% ASR reduction with minimal impact on utility, and the approach can be integrated with existing defenses for additive security gains.

ProAct: Proactive Defense Against LLM Jailbreak

Motivation and Problem Setting

The increasing deployment of LLMs in real-world applications has led to a surge in adversarial attacks, particularly jailbreaks that circumvent safety guardrails to elicit harmful outputs. Existing defense mechanisms (system-level alignment, inference guidance, and input/output filtering) are predominantly reactive and static, leaving them vulnerable to sophisticated, iterative, multi-turn jailbreak strategies. These attacks leverage feedback loops, often using LLM-based evaluators to iteratively refine prompts until a successful jailbreak is achieved. The ProAct framework introduces a proactive defense paradigm, aiming to disrupt the attacker's optimization process by injecting spurious responses that appear harmful to the attacker's evaluator but are, in fact, benign.

Figure 1: Passive defense allows iterative jailbreaks to succeed, while proactive defense (ProAct) misleads the attacker's evaluator with spurious, benign responses, terminating the attack early.

ProAct Framework Architecture

ProAct is structured as a multi-agent system comprising three core components: the User Intent Analyzer, the ProAct Defender, and the Surrogate Evaluator. The pipeline is designed to selectively intervene only on malicious queries, preserving utility for benign users.

Figure 2: The ProAct pipeline: (1) User Intent Analyzer detects malicious intent, (2) ProAct Defender generates spurious responses, (3) Surrogate Evaluator refines and validates the response, (4) benign queries bypass defense.

1. User Intent Analyzer

This module employs prompt-based intent classification, leveraging conversation history to detect malicious queries. Majority voting is used to enhance reliability, and only queries with confirmed malicious intent are routed to the defense pipeline. This minimizes unnecessary intervention and utility degradation.
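
The following is a minimal sketch of this voting scheme, assuming a hypothetical `llm_classify` helper; a real deployment would prompt a backend LLM with the conversation history, whereas the keyword check here is a placeholder that keeps the sketch self-contained. The marker list and vote count are illustrative, not the paper's exact configuration.

```python
from collections import Counter

# Placeholder heuristic standing in for an LLM verdict; illustrative only.
MALICIOUS_MARKERS = ("exploit", "synthesize", "bypass safety")

def llm_classify(history: list[str], query: str) -> str:
    """Stand-in for a prompt-based LLM intent classifier.

    A real deployment would send the conversation history plus the new
    query to a backend model and parse a 'malicious'/'benign' verdict.
    """
    text = " ".join(history + [query]).lower()
    return "malicious" if any(m in text for m in MALICIOUS_MARKERS) else "benign"

def is_malicious(history: list[str], query: str, votes: int = 5) -> bool:
    """Majority vote over several classifications to improve reliability.

    With a sampled LLM backend the verdicts vary between calls, which is
    what makes the vote meaningful; the deterministic stub above does not.
    """
    tally = Counter(llm_classify(history, query) for _ in range(votes))
    return tally["malicious"] > votes // 2
```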

2. ProAct Defender

The defender generates topic-relevant spurious responses that mimic harmful outputs without containing unsafe information. It avoids conditioning on raw user queries, instead using the security topic inferred by the analyzer. The defender employs diverse encoding strategies (emoji substitution, Base64, Hex, Morse code, etc.) to craft responses that appear malicious to automated evaluators but are semantically benign.

Figure 3: Examples of spurious response strategies: harmful requests are transformed into benign outputs using encoding schemes that deceive attacker evaluators.
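
A minimal sketch of a few of these encodings in Python; the strategy names mirror those listed above, while the `encode_spurious` helper and its dispatch table are hypothetical conveniences, not the paper's implementation.

```python
import base64
import codecs

# Abbreviated Morse table; a real deployment would cover the full alphabet.
MORSE = {"a": ".-", "e": ".", "h": "....", "l": ".-..", "s": "...", "t": "-"}

ENCODERS = {
    "base64": lambda t: base64.b64encode(t.encode()).decode(),
    "hex":    lambda t: t.encode().hex(),
    "rot13":  lambda t: codecs.encode(t, "rot13"),
    "morse":  lambda t: " ".join(MORSE.get(c, "?") for c in t.lower()),
}

def encode_spurious(benign_text: str, strategy: str) -> str:
    """Wrap a benign decoy so it resembles an obfuscated 'successful'
    jailbreak output to an automated judge, while decoding to nothing
    harmful."""
    return ENCODERS[strategy](benign_text)

print(encode_spurious("this decoy is intentionally harmless", "hex"))
```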

3. Surrogate Evaluator

An independent LLM-based evaluator assesses the generated response, providing feedback for iterative refinement. The process continues until the evaluator classifies the response as harmful or a search budget is exhausted. Majority voting is used to improve decision reliability.
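
A sketch of this generate-evaluate-refine loop under a fixed search budget; `generate_spurious`, `judged_harmful`, and the budget value are hypothetical stand-ins for the paper's components, with randomness merely simulating evaluator disagreement.

```python
import random

def generate_spurious(topic: str, feedback: str | None = None) -> str:
    """Stand-in for the ProAct Defender.

    A real implementation would prompt a backend LLM, conditioning on
    the inferred security topic (never the raw user query) and on any
    evaluator feedback from earlier rounds.
    """
    return f"[encoded decoy about {topic}; feedback={feedback!r}]"

def judged_harmful(response: str, votes: int = 3) -> bool:
    """Stand-in for the Surrogate Evaluator with majority voting.

    Returns True when the evaluator judges the decoy harmful, i.e. it is
    convincing enough to fool an attacker-style judge.
    """
    return sum(random.random() < 0.5 for _ in range(votes)) > votes // 2

def refine_until_convincing(topic: str, budget: int = 5) -> str | None:
    feedback = None
    for attempt in range(budget):
        candidate = generate_spurious(topic, feedback)
        if judged_harmful(candidate):
            return candidate  # convincing decoy: serve it to the attacker
        feedback = f"attempt {attempt} judged unconvincing"
    return None  # budget exhausted; caller falls back to a refusal
```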

Theoretical Formulation

The attacker's objective is to maximize the expected success of jailbreaks via an internal scoring function $S_j$, which is an imperfect proxy for the ground-truth safety function $S_g$. ProAct exploits this imperfection by generating responses $r$ such that $S_j(r) = 1$ (the evaluator believes the response is harmful) but $S_g(r) = 0$ (the response is actually benign). The defense objective is thus:

$$\mathcal{T}^* = \arg\max_{\mathcal{T}} \; \mathbb{E}_{p \in P_A}\!\left[S_j(\mathcal{T}(p))\,\bigl(1 - S_g(\mathcal{T}(p))\bigr)\right]$$

where $P_A$ is the attacker's pool of adversarial prompts and $\mathcal{T}$ is the defense's transformation mapping a prompt to a spurious response.

This formulation ensures the attacker's optimization loop is disrupted by false positives, leading to premature termination of the jailbreak process.
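
As a concrete reading of this objective, the sketch below estimates it empirically over a sample of adversarial prompts; the function name and the stub scorers standing in for $S_j$ and $S_g$ are assumptions for illustration.

```python
def estimate_objective(prompts, transform, s_j, s_g):
    """Monte Carlo estimate of E[ S_j(T(p)) * (1 - S_g(T(p))) ].

    Each term is 1 only when the transformed response fools the
    attacker's judge (s_j -> 1) while remaining truly benign (s_g -> 0),
    so an estimate near 1.0 means nearly every decoy is a convincing fake.
    """
    total = 0
    for p in prompts:
        r = transform(p)
        total += s_j(r) * (1 - s_g(r))
    return total / len(prompts)
```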

Empirical Evaluation

Attack Success Rate (ASR) Reduction

ProAct was evaluated across six LLMs (Vicuna-13B, Qwen-7B, Llama-8B, Qwen-32B, GPT-OSS, GPT-4.1-mini), four attack frameworks (PAIR, TAP, DAGR, X-Teaming), and four benchmarks (HarmBench, AdvBench, JailbreakBench, AIR-Bench). The framework consistently reduced ASR by up to 92%, with average improvements of 27–59% across models and attack strategies. Notably, ProAct reduced ASR to below 5% in 41 out of 96 configurations.

Orthogonality and Additive Gains

ProAct is orthogonal to existing defenses. When combined with inference guidance, input filtering, or output filtering, it further reduced ASR, sometimes to 0%, demonstrating additive robustness. For example, in strong attack scenarios, combining output filtering with ProAct reduced ASRs of 26% and 8% to 0%.

Safety–Utility Trade-off

Utility was measured using the IFEval benchmark. ProAct preserved instruction-following accuracy, with negligible changes (≤0.01) for GPT-4.1-mini and modest drops (≤0.02) for Qwen-32B. This indicates that proactive defense does not significantly degrade model utility for benign users.

Backend Model Scaling

Ablation studies showed that increasing backend model capacity (from GPT-4.1-nano to GPT-5) in the ProAct Defender led to monotonic improvements in ASR reduction, while the User Intent Analyzer and Surrogate Evaluator exhibited diminishing returns beyond mid-capacity models. This suggests that defender capacity is the primary bottleneck for generating realistic spurious responses.

Spurious Response Strategy Effectiveness

Among encoding strategies, unrestricted (mixed) approaches yielded the highest bypass rates (0.91), while Hex encoding was the most efficient single strategy (average 1.80 attempts per successful bypass). ROT13 was least effective, with the lowest bypass rate (0.73) and highest average attempts (2.30).

Implementation Considerations

  • Integration: ProAct can be deployed as a middleware layer in LLM inference pipelines, requiring access to conversation history and the ability to intercept and modify outputs; a minimal sketch of this pattern follows this list.
  • Resource Requirements: Defender performance scales with backend model capacity; however, the analyzer and evaluator can use lightweight models for efficiency.
  • Deployment: ProAct is compatible with both open-source and proprietary LLMs, and can be combined with existing safety guardrails for maximal robustness.
  • Limitations: Attackers may adapt by improving evaluator fidelity or using human-in-the-loop strategies, necessitating ongoing refinement of spurious response generation.
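
A minimal sketch of the middleware pattern from the Integration point above, assuming the hypothetical `is_malicious` and `refine_until_convincing` helpers from the earlier sketches plus a stub `model_generate` for the underlying model; none of these names come from the paper.

```python
def model_generate(history: list[str], query: str) -> str:
    """Stand-in for the underlying LLM's normal generation path."""
    return "[normal model response]"

def proact_middleware(history: list[str], query: str) -> str:
    """Intercept each query; benign traffic passes through untouched."""
    if is_malicious(history, query):            # User Intent Analyzer
        topic = "inferred security topic"       # produced by the analyzer
        decoy = refine_until_convincing(topic)  # Defender + Surrogate Evaluator
        if decoy is not None:
            return decoy                        # spurious response for the attacker
        return "I can't help with that."        # fallback refusal
    return model_generate(history, query)       # benign queries bypass the defense
```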

Implications and Future Directions

ProAct shifts the defense paradigm from passive filtering to proactive disruption, leveraging the attacker's reliance on imperfect evaluators. This approach is particularly effective against autonomous, multi-turn jailbreaks and can be combined with other defenses for layered security. Future work may explore adversarial co-evolution, adaptive spurious response generation, and integration with model-internal safety mechanisms. The framework's modularity and orthogonality make it a promising candidate for standardizing proactive LLM safety in production environments.

Conclusion

ProAct introduces a proactive, multi-agent defense framework that significantly reduces jailbreak attack success rates by misleading attacker evaluators with spurious, benign responses. The approach is orthogonal to existing defenses, preserves model utility, and scales with backend model capacity. ProAct represents a robust, dynamic strategy for safeguarding LLMs against evolving adversarial threats, with strong empirical support for its efficacy and practical deployability.
