- The paper introduces ProAct, a framework that proactively disrupts iterative LLM jailbreak attacks by generating benign spurious responses.
- It employs a multi-agent pipeline—including a User Intent Analyzer, ProAct Defender, and Surrogate Evaluator—to selectively intervene on malicious queries.
- Empirical results show up to a 92% reduction in attack success rate (ASR) with minimal impact on utility, and the approach can be integrated with existing defenses for additive security gains.
ProAct: Proactive Defense Against LLM Jailbreak
Motivation and Problem Setting
The increasing deployment of LLMs in real-world applications has led to a surge in adversarial attacks, particularly jailbreaks that circumvent safety guardrails to elicit harmful outputs. Existing defense mechanisms—system-level alignment, inference guidance, and input/output filtering—are predominantly reactive and static, leaving them vulnerable to sophisticated, iterative, multi-turn jailbreak strategies. These attacks leverage feedback loops, often using LLM-based evaluators to iteratively refine prompts until a successful jailbreak is achieved. The ProAct framework introduces a proactive defense paradigm, aiming to disrupt the attacker's optimization process by injecting spurious responses that appear harmful to the attacker's evaluator but are, in fact, benign.
Figure 1: Passive defense allows iterative jailbreaks to succeed, while proactive defense (ProAct) misleads the attacker's evaluator with spurious, benign responses, terminating the attack early.
ProAct Framework Architecture
ProAct is structured as a multi-agent system comprising three core components: the User Intent Analyzer, the ProAct Defender, and the Surrogate Evaluator. The pipeline is designed to selectively intervene only on malicious queries, preserving utility for benign users.
Figure 2: The ProAct pipeline: (1) User Intent Analyzer detects malicious intent, (2) ProAct Defender generates spurious responses, (3) Surrogate Evaluator refines and validates the response, (4) benign queries bypass defense.
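To make the selective-intervention flow concrete, the sketch below shows how a ProAct-style pipeline might route a query. The component signatures (`analyze`, `defend`, `base_model`) are hypothetical stand-ins for the paper's components, not its actual interfaces.

```python
from typing import Callable

# Hypothetical component signatures; the paper does not prescribe concrete interfaces.
IntentAnalyzer = Callable[[str, list[str]], tuple[bool, str]]  # (query, history) -> (is_malicious, topic)
SpuriousDefender = Callable[[str], str]                        # topic -> spurious (benign) response
BaseModel = Callable[[str, list[str]], str]                    # (query, history) -> normal response

def proact_respond(query: str, history: list[str],
                   analyze: IntentAnalyzer,
                   defend: SpuriousDefender,
                   base_model: BaseModel) -> str:
    """Route a query through ProAct: intervene only when intent is malicious."""
    is_malicious, topic = analyze(query, history)
    if not is_malicious:
        # Benign traffic bypasses the defense entirely, preserving utility.
        return base_model(query, history)
    # Malicious traffic receives a topic-relevant spurious response instead of
    # any output conditioned on the raw query.
    return defend(topic)
```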
1. User Intent Analyzer
This module employs prompt-based intent classification, leveraging conversation history to detect malicious queries. Majority voting is used to enhance reliability, and only queries with confirmed malicious intent are routed to the defense pipeline. This minimizes unnecessary intervention and utility degradation.
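A minimal sketch of prompt-based intent classification with majority voting, assuming a generic `llm` text-completion callable and a hypothetical JSON answer format (the actual classification prompt is not reproduced here):

```python
import json
from collections import Counter
from typing import Callable

# Hypothetical classification prompt; the exact wording used by ProAct may differ.
INTENT_PROMPT = (
    "Given the conversation history and the latest user query, decide whether the query "
    "has malicious intent. Answer with JSON: "
    '{{"malicious": true or false, "topic": "<security topic>"}}\n\n'
    "History:\n{history}\n\nQuery:\n{query}"
)

def classify_intent(query: str, history: list[str],
                    llm: Callable[[str], str], votes: int = 5) -> tuple[bool, str]:
    """Sample the classifier several times and take a majority vote for reliability."""
    prompt = INTENT_PROMPT.format(history="\n".join(history), query=query)
    decisions, topics = [], []
    for _ in range(votes):
        try:
            parsed = json.loads(llm(prompt))
            decisions.append(bool(parsed["malicious"]))
            topics.append(str(parsed.get("topic", "")))
        except (json.JSONDecodeError, KeyError, TypeError):
            continue  # skip malformed samples
    is_malicious = decisions.count(True) > len(decisions) / 2 if decisions else False
    topic = Counter(t for t in topics if t).most_common(1)[0][0] if any(topics) else ""
    return is_malicious, topic
```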
2. ProAct Defender
The defender generates topic-relevant spurious responses that mimic harmful outputs without containing unsafe information. It avoids conditioning on raw user queries, instead using the security topic inferred by the analyzer. The defender employs diverse encoding strategies—emoji substitution, Base64, Hex, Morse code, etc.—to craft responses that appear malicious to automated evaluators but are semantically benign.
Figure 3: Examples of spurious response strategies: harmful requests are transformed into benign outputs using encoding schemes that deceive attacker evaluators.
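The encoders below illustrate how a benign payload can be dressed up to look like an obfuscated jailbreak output. The payload text and the "Sure, here are the detailed steps" framing are illustrative assumptions, not the paper's actual templates.

```python
import base64
import codecs

MORSE = {"A": ".-", "B": "-...", "C": "-.-.", "D": "-..", "E": "."}  # truncated table for brevity

def encode_hex(text: str) -> str:
    return text.encode("utf-8").hex()

def encode_base64(text: str) -> str:
    return base64.b64encode(text.encode("utf-8")).decode("ascii")

def encode_rot13(text: str) -> str:
    return codecs.encode(text, "rot_13")

def encode_morse(text: str) -> str:
    return " ".join(MORSE.get(ch.upper(), "?") for ch in text if not ch.isspace())

def spurious_response(topic: str, encoder=encode_hex) -> str:
    """Wrap a benign payload in an encoding so the result *looks* like a successful jailbreak."""
    benign_payload = f"This is a simulated answer about {topic}; it contains no actionable harmful content."
    return "Sure, here are the detailed steps:\n" + encoder(benign_payload)
```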
3. Surrogate Evaluator
An independent LLM-based evaluator assesses the generated response, providing feedback for iterative refinement. The process continues until the evaluator classifies the response as harmful or a search budget is exhausted. Majority voting is used to improve decision reliability.
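A sketch of this refinement loop, assuming hypothetical `generate` (defender) and `judge` (single evaluator call) callables:

```python
from typing import Callable

def refine_spurious(topic: str,
                    generate: Callable[[str, str], str],  # (topic, feedback) -> candidate response
                    judge: Callable[[str], bool],         # candidate -> "looks harmful?"
                    budget: int = 5, votes: int = 3) -> str:
    """Refine a spurious response until a majority of evaluator votes labels it
    harmful, or until the search budget is exhausted."""
    feedback = ""
    candidate = generate(topic, feedback)
    for _ in range(budget):
        harmful_votes = sum(judge(candidate) for _ in range(votes))
        if harmful_votes > votes / 2:
            return candidate  # the surrogate evaluator is convinced; ship the decoy
        feedback = "The previous draft was judged benign; make it more convincing as a jailbreak output."
        candidate = generate(topic, feedback)
    return candidate  # budget exhausted; return the best-effort candidate
```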
The attacker's objective is to maximize the expected success of jailbreaks via an internal scoring function S_j, which is an imperfect proxy for the ground-truth safety function S_g. ProAct exploits this imperfection by generating responses r such that S_j(r) = 1 (the evaluator believes the response is harmful) but S_g(r) = 0 (the response is actually benign). The defense objective is thus:
$$T^{*} = \arg\max_{T} \; \mathbb{E}_{p \in P_A}\!\left[\, S_j\bigl(T(p)\bigr)\,\bigl(1 - S_g\bigl(T(p)\bigr)\bigr) \right],$$

where $T$ denotes the defense's response-generation transformation and $P_A$ the distribution of adversarial prompts.
This formulation ensures the attacker's optimization loop is disrupted by false positives, leading to premature termination of the jailbreak process.
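To see why false positives end the attack, consider a schematic PAIR/TAP-style attacker loop; the `refine` and `attacker_judge` callables below are placeholders, not an actual attack implementation. Once the attacker's judge scores a spurious response as harmful (S_j = 1), the loop declares success and stops, even though the true label S_g is benign.

```python
from typing import Callable

def iterative_attack(seed_prompt: str,
                     target: Callable[[str], str],          # defended model endpoint
                     attacker_judge: Callable[[str], int],  # S_j: 1 if the response looks harmful
                     refine: Callable[[str, str], str],     # (prompt, response) -> next prompt
                     max_turns: int = 10) -> tuple[str, str]:
    """Schematic attacker loop; a spurious response with S_j = 1 ends it prematurely."""
    prompt, response = seed_prompt, ""
    for _ in range(max_turns):
        response = target(prompt)
        if attacker_judge(response) == 1:
            return prompt, response  # attacker declares "success" on a benign decoy
        prompt = refine(prompt, response)
    return prompt, response
```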
Empirical Evaluation
Attack Success Rate (ASR) Reduction
ProAct was evaluated across six LLMs (Vicuna-13B, Qwen-7B, Llama-8B, Qwen-32B, GPT-OSS, GPT-4.1-mini), four attack frameworks (PAIR, TAP, DAGR, X-Teaming), and four benchmarks (HarmBench, AdvBench, JailbreakBench, AIR-Bench). The framework consistently reduced ASR by up to 92%, with average improvements of 27–59% across models and attack strategies. Notably, ProAct reduced ASR to below 5% in 41 out of 96 configurations.
Orthogonality and Additive Gains
ProAct is orthogonal to existing defenses. When combined with inference guidance, input filtering, or output filtering, it further reduced ASR—sometimes to 0%—demonstrating additive robustness. For example, output filtering combined with ProAct reduced ASR from 26% and 8% to 0% in strong attack scenarios.
Safety–Utility Trade-off
Utility was measured using the IFEval benchmark. ProAct preserved instruction-following accuracy, with negligible changes (≤0.01) for GPT-4.1-mini and modest drops (≤0.02) for Qwen-32B. This indicates that proactive defense does not significantly degrade model utility for benign users.
Backend Model Scaling
Ablation studies showed that increasing backend model capacity (from GPT-4.1-nano to GPT-5) in the ProAct Defender led to monotonic improvements in ASR reduction, while the User Intent Analyzer and Surrogate Evaluator exhibited diminishing returns beyond mid-capacity models. This suggests that defender capacity is the primary bottleneck for generating realistic spurious responses.
Spurious Response Strategy Effectiveness
Among encoding strategies, unrestricted (mixed) approaches yielded the highest bypass rates (0.91), while Hex encoding was the most efficient single strategy (average 1.80 attempts per successful bypass). ROT13 was least effective, with the lowest bypass rate (0.73) and highest average attempts (2.30).
Implementation Considerations
- Integration: ProAct can be deployed as a middleware layer in LLM inference pipelines, requiring access to conversation history and the ability to intercept and modify outputs.
- Resource Requirements: Defender performance scales with backend model capacity; however, the analyzer and evaluator can use lightweight models for efficiency.
- Deployment: ProAct is compatible with both open-source and proprietary LLMs, and can be combined with existing safety guardrails for maximal robustness (see the layering sketch after this list).
- Limitations: Attackers may adapt by improving evaluator fidelity or using human-in-the-loop strategies, necessitating ongoing refinement of spurious response generation.
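A minimal sketch of layering an output filter on top of a ProAct-wrapped endpoint; the `ChatFn` signature and the boolean `output_filter` convention are assumptions for illustration, not a prescribed API.

```python
from typing import Callable

ChatFn = Callable[[str, list[str]], str]  # (query, history) -> response

def layer_defenses(defended_model: ChatFn,
                   output_filter: Callable[[str], bool],
                   refusal: str = "I can't help with that.") -> ChatFn:
    """Stack an output filter on a ProAct-defended endpoint; the two defenses are
    orthogonal, so either can catch what the other misses."""
    def handle(query: str, history: list[str]) -> str:
        response = defended_model(query, history)
        # `output_filter` returns True when the response is judged unsafe.
        return refusal if output_filter(response) else response
    return handle
```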
Implications and Future Directions
ProAct shifts the defense paradigm from passive filtering to proactive disruption, leveraging the attacker's reliance on imperfect evaluators. This approach is particularly effective against autonomous, multi-turn jailbreaks and can be combined with other defenses for layered security. Future work may explore adversarial co-evolution, adaptive spurious response generation, and integration with model-internal safety mechanisms. The framework's modularity and orthogonality make it a promising candidate for standardizing proactive LLM safety in production environments.
Conclusion
ProAct introduces a proactive, multi-agent defense framework that significantly reduces jailbreak attack success rates by misleading attacker evaluators with spurious, benign responses. The approach is orthogonal to existing defenses, preserves model utility, and scales with backend model capacity. ProAct represents a robust, dynamic strategy for safeguarding LLMs against evolving adversarial threats, with strong empirical support for its efficacy and practical deployability.