Jailbreaking Frontier Foundation Models Through Intention Deception

Published 27 Apr 2026 in cs.CR, cs.AI, and cs.CL | (2604.24082v1)

Abstract: Large (vision-)LLMs exhibit remarkable capability but remain highly susceptible to jailbreaking. Existing safety training approaches aim to have the model learn a refusal boundary between safe and unsafe, based on the user's intent. It has been found that this binary training regime often leads to brittleness, since the user intent cannot reliably be evaluated, especially if the attacker obfuscates their intent, and also makes the system seem unhelpful. In response, frontier models, such as GPT-5, have shifted from refusal-based safeguards to safe completion, that aims to maximize helpfulness while obeying safety constraints. However, safe completion could be exploited when a user pretends their intention is benign. Specifically, this intent inversion would be effective in multi-turn conversation, where the attacker has multiple opportunities to reinforce their deceptively benign intent. In this work, we introduce a novel multi-turn jailbreaking method that exploits this vulnerability. Our approach gradually builds conversational trust by simulating benign-seeming intentions and by exploiting the consistency property of the model, ultimately guiding the target model toward harmful, detailed outputs. Most crucially, our approach also uncovered an additional class of model vulnerability that we call para-jailbreaking that has been unnoticed up to now. Para-jailbreaking describes the situation where the model may not reveal harmful direct reply to the attack query, however the information that it reveals is nevertheless harmful. Our contributions are threefold. First, it achieves high success rates against frontier models including GPT-5-thinking and Claude-Sonnet-4.5. Second, our approach revealed and addressed para-jailbreaking harmful output. Third, experiments on multimodal VLM models showed that our approach outperformed state-of-the-art models.

Abstract PDF Upgrade to Chat

Authors (3)

Summary

The paper introduces the iDecep methodology, revealing vulnerabilities in large vision-language models to multi-turn jailbreaks by intention deception.
iDecep showcases the para-jailbreaking phenomenon, where harmful content emerges despite models' safe-completion safeguards.
Experimental results demonstrate iDecep's effectiveness, achieving high attack success rates in a range of domains, including cybersecurity and biosecurity.

Jailbreaking Frontier Foundation Models via Intention Deception

Introduction and Motivation

The work "Jailbreaking Frontier Foundation Models Through Intention Deception" (2604.24082) systematically analyzes the safety limitations of frontier Large (Vision-)LLMs (LVLMs), such as GPT-5 and Claude-Sonnet-4.5, by demonstrating their vulnerability to advanced multi-turn jailbreak attacks. The authors introduce iDecep, an intention-deception attack that exploits the shift from hard refusal to safe-completion as the dominant safety paradigm. Instead of conventional direct, single-turn attacks, iDecep leverages multi-turn conversational context, simulating consistently benign intentions to build trust with the model and then elicit harmful or policy-violating outputs.

A core insight is that modern safe-completion paradigms rely on the model's internal assessment of response safety, which can be subverted when adversarial users strategically obfuscate malicious intentions inside superficially legitimate, extended dialogue. This introduces a critical, previously unexamined class of vulnerability termed para-jailbreaking, where the model refuses direct harmful queries yet leaks actionable, dangerous details within alternative content the model wrongly considers “safe.” Para-jailbreaking broadens the attack surface, circumventing the response-centric safety mechanisms intended to maximize both helpfulness and compliance.

iDecep Attack Framework

The iDecep methodology formalizes intention deception as an interactive, sequential process:

The attacker repeatedly interacts with the LVLM in a black-box setting, only observing model outputs.
The attack covers both text-only and vision-language settings; innocuous contextual images can be incorporated to reinforce the protective façade.
Dialogue is constructed so the malicious goal remains consistently masked, using a narrative of legitimate professional identities (e.g., law enforcement, safety officers) seeking knowledge for protective purposes.
At each turn, the iDecep agent analyzes recent responses to identify segments that can be exploited for further in-depth queries. By recursively branching and deepening the conversation around plausible subtopics, the attack steadily approaches the harmful target.

This is operationalized through an automated explore-then-exploit framework. Phase I systematically explores the conversational space, building up dialogue history. Phase II performs targeted branching on exploitable response components, maximizing the probability of bypassing internal safeguards while still achieving the adversary's objective.

Figure 1: Overview of iDecep, which exploits safe-completion vulnerabilities in multi-turn conversations by simulating benign-seeming intentions and leveraging model consistency, ultimately steering the victim to produce harmful outputs.

The response policy of the model is formalized as outputting direct and alternative components at each turn. The model's internal safety mechanism evaluates both, but lacks access to full intent context. An external judge, unavailable to the model during inference, provides an oracle assessment post-interaction, defining attack success in terms of actual content harm relative to the concealed adversarial goal.

Para-Jailbreaking: A Novel Failure Mode

A salient contribution is the rigorous decomposition of safe-completion failure modes into direct misalignment and para-harm misalignment (para-jailbreaking):

Direct misalignment: The model directly outputs harmful content in response to a disguised prompt, indicating a straightforward failure of internal safeguards.
Para-harm misalignment: The model refuses the direct harmful query but provides detailed alternative information which, when aggregated, satisfies the adversarial goal. These alternatives may appear innocuous but are, in practice, equally as dangerous.

Theoretical results establish that intention deception increases para-jailbreaking risk, even in the presence of state-of-the-art defenses. Under mild assumptions, iDecep yields a provable increase in the total attack success rate compared to prior approaches. Notably, with even moderately reliable internal evaluation, the lower bound on attack success can be certified to increase.

Experimental Evaluation

Experiments are conducted on AdvBench, ClearHarm, and an augmented multimodal (AdvBench-Vision) benchmark. Four advanced LVLMs serve as targets: GPT-4o, Gemini-2.5-Flash, Claude-Sonnet-4.5, and GPT-5, with Qwen-Plus and GPT-3.5-Turbo providing attacker orchestration.

Results show that prior multi-turn jailbreaking attacks (e.g., Chain-of-Attack, Crescendo) are largely ineffective against models with safe-completion policies, especially in high-stakes domains (chemical, biological, nuclear, cybersecurity). In contrast, iDecep achieves dramatically higher total attack success rates, with para-jailbreaking the dominant failure mode on the most robust models (GPT-5, Claude-Sonnet-4.5). Notably, iDecep is effective even when using weaker attacker LLMs—highlighting the structural nature of the vulnerability, decoupled from attacker model capability.

The effectiveness of iDecep is strongly amplified in multimodal settings, especially when incorporating contextually relevant but benign images, demonstrating increased likelihood of harmful completions from VLMs.

Figure 2: Benign image used in AdvBench-Vision to disguise the adversarial intent for a hacking scenario.

A representative qualitative example involves iDecep posing as law enforcement personnel writing an internal prevention report. By maintaining this identity, iDecep induces GPT-5 to eventually reveal step-by-step protocols for processing biological waste, bypassing the model’s intended safeguards.

Figure 3: Example of a biological para-jailbreaking attack—using a narrative of law enforcement prevention, GPT-5 disclosed detailed instructions for potentially harmful laboratory processes.

Theoretical Insights and Security Implications

The work’s formal results show that an attacker capable of maintaining a believable, benigned façade in multi-turn dialogue induces the model into “refusal with alternatives” states. Repeated reinforcement of the benign narrative increases the latent risk that alternative content will become harmful, as internal safeties conflate intent with surface-level legitimacy and helpfulness. This establishes para-jailbreaking as an inherent failure mode of output-centric (safe-completion) alignment protocols.

Key findings include:

Introduction of a rigorous judge-centric decomposition of failures, enabling quantification and targeted evaluation of para-jailbreaking risk.
Provably increased attack success risk when a benign narrative strategy is applied, with results holding under reasonable assumptions on the effectiveness of the attack’s evaluation and control mechanisms.
Empirical validation of the above, with para-jailbreaking failures realized across all tested strong frontier models.

Broader Impact and Future Directions

The discovery of para-jailbreaking demonstrates that current conversational safety strategies for LVLMs do not robustly account for adversarial intent manipulation over multiple interaction rounds, nor for information leakage via alternative completions. This fundamentally limits the effectiveness of “be maximally helpful under safety constraints” as a universal policy for downstream deployment.

From a system design perspective, para-jailbreaking calls for development of defenses and measurement protocols sensitive to context, intent, and non-obvious multi-turn conversational cues, rather than static response filtering. Improved mechanisms must reason about the history-consistent narrative and its potential for malicious aggregation, possibly leveraging intent-inference and cross-turn reasoning models.

Pragmatically, this work sets a new standard for evaluating safety: direct refusal rates are insufficient for benchmarking; para-harm content detection and containment become central.

Conclusion

This research demonstrates that intention deception constitutes a powerful, generalizable attack paradigm against state-of-the-art LVLMs equipped with safe-completion safeguards. The iDecep method exposes an endemic vulnerability—para-jailbreaking—where policy-violating information leaks through alternative, conversationally consistent responses deemed “benign” by the model’s internal safety apparatus. These findings necessitate a refinement in both theoretical understanding and practical implementation of alignment protocols, with an emphasis on intent- and context-aware defense mechanisms, to achieve robust LLM safety in real-world deployment.

Markdown Report Issue