Papers
Topics
Authors
Recent
Search
2000 character limit reached

Into the Gray Zone: Domain Contexts Can Blur LLM Safety Boundaries

Published 17 Apr 2026 in cs.CR | (2604.15717v1)

Abstract: A central goal of LLM alignment is to balance helpfulness with harmlessness, yet these objectives conflict when the same knowledge serves both legitimate and malicious purposes. This tension is amplified by context-sensitive alignment: we observe that domain-specific contexts (e.g., chemistry) selectively relax defenses for domain-relevant harmful knowledge, while safety-research contexts (e.g., jailbreak studies) trigger broader relaxation spanning all harm categories. To systematically exploit this vulnerability, we propose Jargon, a framework combining safety-research contexts with multi-turn adversarial interactions that achieves attack success rates exceeding 93% across seven frontier models, including GPT-5.2, Claude-4.5, and Gemini-3, substantially outperforming existing methods. Activation space analysis reveals that Jargon queries occupy an intermediate region between benign and harmful inputs, a gray zone where refusal decisions become unreliable. To mitigate this vulnerability, we design a policy-guided safeguard that steers models toward helpful yet harmless responses, and internalize this capability through alignment fine-tuning, reducing attack success rates while preserving helpfulness.

Summary

  • The paper introduces a taxonomy of context-induced vulnerabilities, including Vertical and General Unlocking, which drastically weaken LLM safety guardrails.
  • The Jargon framework automates adversarial multi-turn dialogues that mimic expert safety research, achieving over 99% attack success on leading models.
  • The study reveals the need for context-aware safety mechanisms and improved boundary calibration to counteract the weaponization of academic discourse.

Domain Contexts and the Erosion of LLM Safety Boundaries

Motivation and Background

LLM alignment is fundamentally constrained by the dual-use nature of knowledge: the same information essential for beneficial purposes is frequently abusable for high harm. This paper systematically dissects how domain-specific contexts—particularly those with the appearance of technical or safety-related academic discussion—can selectively or generally suppress the operation of model safety guardrails. The authors introduce a taxonomy of context-induced vulnerabilities: Vertical Unlocking, where defenses are relaxed for domain-aligned harm, and General Unlocking, where a more generic safety-research context provides broad, cross-category relaxation of safety behaviors.

In contrast to shallow persona-based jailbreaks, the work hypothesizes and empirically substantiates that contextual signals of deep expertise, especially in the form of authentic research paper discourse, are substantially more effective in coercing LLMs into providing restricted information, even on highly defended models. Figure 1

Figure 1

Figure 1

Figure 2: Llama-4-Scout performance across context and attack scenarios, visualizing context-induced vulnerability profiles.

The Jargon Framework: Systematic Contextual Exploitation

To operationalize and study this vulnerability class, the paper proposes Jargon, a framework combining persistent academic safety-research context with multi-turn adversarial probing. Jargon automates dialog that mimics the interaction patterns of genuine safety researchers: an initial phase of benign, technical discussion to build rapport, followed by contextually reframed queries that embed a harmful objective in a seemingly legitimate academic or evaluative exercise. The system features real-time query variant optimization, parallel candidate evaluation, and memory-driven few-shot demonstration selection, maximizing the likelihood of boundary-crossing responses. Figure 3

Figure 1: Jargon’s control and conversation architecture—multi-turn interaction simulates authentic safety research to blur intent, with a judge adjudicating harmful content and successful attack memory guiding future queries.

Empirical Analysis of Contextual Vulnerabilities

Vertical and General Unlocking

A controlled probing study across leading LLMs (Llama-4-Scout, Claude-Sonnet-3.7, Gemini-3-Pro, and commercial frontier models) establishes that domain-specific research papers (e.g., chemistry for chemical weapons) consistently induce significant relaxation of model safeguards only for domain-aligned threats: Vertical Unlocking manifests as high attack scores when context and harmful query are matched.

Conversely, presenting a safety-research paper (e.g., on AI jailbreaks, red-teaming, or defenses) as context triggers a far broader relaxation, achieving high attack effectiveness across disjoint harm categories: General Unlocking transcends domain specificities and relies on the privileged status that authentically technical safety discourse has in model pre-training and alignment. Figure 4

Figure 3: Relative Harm Score distribution by context length and model—longer, richer academic context escalates attack effectiveness and diffuses attention from safety signals.

Main Results: Attack Effectiveness

Jargon is deployed on seven frontier LLMs spanning open-weight and commercial settings, benchmarked against representative state-of-the-art jailbreak attacks (single-turn and multi-turn paradigms). The results show attack success rates in excess of 99% on average, including 93–100% on GPT-5.2 and Claude-4.5, outperforming prominent baselines such as FITD and X-Teaming, which themselves dramatically underperform (e.g., 18–59% on Claude-4.5, 59% on GPT-5.2) in these settings. Domain-specific LLMs are equally compromised, reaching 100% ASR on Meditron3-8B and BioMistral-7B.

Mechanisms: Activation and Attention Diagnostics

Activation Space: The “Gray Zone”

Analysis of hidden activations (layer 24 of Qwen3-8B) via multi-dimensional scaling (MDS) clearly identifies that Jargon-framed attack queries are not outliers in the benign or overtly harmful distributions. Rather, they reliably occupy an intermediate—gray zone—region. Here, the LLM’s internal decision process for refusal or compliance is less reliable: attack queries cluster near, but do not fully cross, the refusal boundary. This substantiates the core hypothesis: the failure of categorical boundaries is not a single-threshold effect but a property of context-induced ambiguity. Figure 5

Figure 4: MDS projection of hidden activations—high-harm attacks (red) are distributed near, but not within, the explicit refusal cluster, indicating a gradual safety transition.

Attention Patterns: Reduced Salience of Harmful Cues

Attention weight analysis in Qwen3-8B demonstrates that, under direct attack conditions, models focus strongly on explicitly harmful tokens, reliably triggering refusal. Proximal context (even of considerable length) does not significantly diminish this sensitivity. However, when the harmful request is reframed within the academic safety context—per Jargon's mechanism—this attention is distributed, drastically reducing the salience of harmful cues. This supports the empirical findings that context length and reframing jointly dilute the model’s safety detection. Figure 6

Figure 5: Attention heatmaps—reframed queries diffuse attention away from sensitive tokens, suppressing explicit safety triggers.

Defense Strategies and Retrospective Alignment

The authors investigate two mitigation paradigms:

  1. Policy-Guided Safeguards: These steer models during inference to comply with academic or safety-focused framing while refusing actionable specifics. When combined with dynamic prompt injection, ASR drops by up to 39% (Qwen3-8B), though risk of over-refusal is noted for certain models.
  2. Alignment Fine-Tuning: Fine-tuning on Jargon-style adversarial trajectories paired with guideline-augmented responses internalizes gray-zone refusal behaviors, reducing ASR by 9–34 points while maintaining high performance on general benchmarks (MMLU, HellaSwag, GSM8K).

However, both strategies are partial, not comprehensive: zero ASR is not reached, and models with excessive sensitivity to warning prompts may become overly cautious, sacrificing helpfulness. Figure 7

Figure 6: Safeguard impact—dynamic policy and alignment fine-tuning lower harmful response rates across multi-turn interactions without unduly compromising utility.

Implications and Future Directions

This work demonstrates a contradictory vector in LLM safety: the very training pathways meant to make models helpful to legitimate experts (e.g., security researchers, biologists, clinicians) are weaponizable for adversarial knowledge extraction. The clear operational gap is not in “faking” expertise, but in contextual authenticity and rapport, as reinforced by models’ alignment to academic language and discourse conventions.

Practical implications are immediate: shallow scenario or persona-based red-teaming is insufficient for safety evaluation. Safety research must systematically integrate context-rich, multi-turn adversarial evaluation, as authentic domain context can nullify sophisticated single-turn and gradient-based defenses, even on the most highly aligned frontier models.

Theoretically, this work draws attention to the need for context-aware, hierarchical safety classifiers and more robust fuzzy-boundary handling at the alignment layer, potentially involving adversarially generated context-query distributions during model fine-tuning. Additionally, the “gray zone” activation regime identified here calls for research into soft, calibrated boundary estimation—distinguishing ambiguous from fully benign inputs—rather than binary harmful-benign classifiers.

Conclusion

The paper provides a rigorous analysis of how domain-specific contexts, especially in the form of authentic academic safety-research discourse, can systematically erode LLM safety boundaries. Jargon, as a principled adversarial attack leveraging this insight, achieves near-universal attack success rates against state-of-the-art models, establishing the practical inadequacy of current safety training and evaluation paradigms. The diagnostic and mitigation experiments underline both the complexity and the incompleteness of available alignment mechanisms, emphasizing an urgent need for context-aware, multi-turn, and activation-informed safety alignments in future LLM research and deployment.

Reference: "Into the Gray Zone: Domain Contexts Can Blur LLM Safety Boundaries" (2604.15717)

Paper to Video (Beta)

No one has generated a video about this paper yet.

Whiteboard

No one has generated a whiteboard explanation for this paper yet.

Open Problems

We haven't generated a list of open problems mentioned in this paper yet.

Collections

Sign up for free to add this paper to one or more collections.