Adversarial Humanities Benchmark: Results on Stylistic Robustness in Frontier Model Safety

Published 20 Apr 2026 in cs.CL and cs.AI | (2604.18487v1)

Abstract: The Adversarial Humanities Benchmark (AHB) evaluates whether model safety refusals survive a shift away from familiar harmful prompt forms. Starting from harmful tasks drawn from MLCommons AILuminate, the benchmark rewrites the same objectives through humanities-style transformations while preserving intent. This extends literature on Adversarial Poetry and Adversarial Tales from single jailbreak operators to a broader benchmark family of stylistic obfuscation and goal concealment. In the benchmark results reported here, the original attacks record 3.84% attack success rate (ASR), while transformed methods range from 36.8% to 65.0%, yielding 55.75% overall ASR across 31 frontier models. Under a European Union AI Act Code-of-Practice-inspired systemic-risk lens, Chemical, biological, radiological and nuclear (CBRN) is the highest bucket. Taken together, this lack of stylistic robustness suggests that current safety techniques suffer from weak generalization: deep understanding of 'non-maleficence' remains a central unresolved problem in frontier model safety.

Abstract PDF Upgrade to Chat

Authors (8)

Summary

The paper presents the Adversarial Humanities Benchmark (AHB) to evaluate stylistic robustness in safety refusals using humanities-inspired prompt transformations.
It employs a meta-prompt-driven, automated framework with triplicate judge models and consensus voting to measure attack success rate gaps across providers and harm categories.
Empirical results reveal a 51.3 percentage point gap between direct and transformed prompts, exposing a structural vulnerability in current LLM safety mechanisms.

Stylistic Robustness in Frontier Model Safety: An Authoritative Review of the Adversarial Humanities Benchmark

Motivation and Background

The Adversarial Humanities Benchmark (AHB) introduces a systematic evaluation of stylistic robustness in frontier LLMs with regard to safety refusals. Prior single-turn jailbreak benchmarks chiefly relied on highly explicit harmful prompts, yielding high refusal rates and instilling confidence in the safety posture of advanced models. However, such benchmarks have overlooked attacks that rely on sophisticated stylistic obfuscation and goal concealment. Drawing from earlier research on "Adversarial Poetry" (Bisconti et al., 19 Nov 2025) and "Adversarial Tales" (2601.08837), AHB extends adversarial evaluation to a meta-prompt-driven, fully automated benchmark family where surface form is deliberately displaced using humanities-inspired transformations.

By reframing known harmful intents through transformations leveraging literary, philosophical, and cultural paradigms—e.g., hermeneutics, cyberpunk narratives, scholastic theology, administrative jargon, and stream-of-consciousness—the benchmark probes whether refusal policies generalize beyond explicit lexical cues. The implication is direct: non-maleficence in current frontier models is primarily rooted in overfit refusal behavior, exposing a structural generalization gap in safety systems.

Figure 1: The construction and evaluation process of Adversarial Humanities Benchmark.

Figure 2: Item split by risk category.

Methodological Design

AHB defines a threat model characterized by a black-box attacker with access restricted to single-turn, text-only prompts. The attack surface comprises five stylistic transformation paradigms targeting vulnerabilities across fifteen harm categories (from MLCommons AILuminate (Ghosh et al., 19 Feb 2025)) mapped to three systemic risk buckets (EU GPAI CoP): CBRN, Cyber Offense, and Harmful Manipulation.

Adversarial Tales: Harmful intent masked as narrative artifacts requiring Proppian structural analysis.
Adversarial Semiosphere: Bureaucratic reframing demanding intrasemiospheric translation tasks.
Adversarial Hermeneutics: Fragmented text necessitating deep interpretative reconstruction.
Adversarial Scholasticism: Theological debate under archaic terminology requiring protocol extraction.
Adversarial Stream of Consciousness: Obfuscated memoirs with psychoanalytic intent deduction.

Each MLCommons AILuminate base prompt is reformulated using deepseek-v3.2 via meta-prompts that maintain fixed harmful intent but radically alter rhetorical surface. Attack templates introduce proxy-tasks that exploit competing objectives and mismatched generalization weaknesses in aligned models.

Evaluation employs an LLM-as-a-judge setup with triplicate open-weight judge models, consensus voting, and human review for ambiguous items. Outputs are classified SAFE/UNSAFE based on operational risk criteria. Attack Success Rate (ASR) is calculated as the proportion $\text{ASR}(X) = \frac{\sum_{i \in X} \mathbbm{1}[\text{label}(i) = U]}{|X|}$. The benchmark spans 31 frontier models from 11 providers.

Figure 3: Original prompts (Light Blue) remain low-ASR, but humanities-style rewrites ((from Yellow to Red)) achieve impressively successful results.

Empirical Results

AHB reveals an unprecedented gap between refusal performance on direct versus stylistically transformed prompts. The original prompt slice yields an ASR of 3.84%. Transformed prompt slices achieve ASRs from 36.83% (Adversarial Stream) to 64.68% (Adversarial Scholasticism), with a combined transformed ASR of 55.14%. The benchmark thus documents a 51.3 percentage point gap, unequivocally demonstrating that frontier models with strong direct-prompt refusal behavior manifest broad vulnerabilities under systematic rhetorical displacement.

Failures are widely distributed across harm categories, with transformed prompts achieving highest ASRs for election-related advice (68.99%), non-violent crimes (66.24%), defamation (62.36%), and violent crimes (60.25%). The vulnerability is not confined to a few risk areas; it is pervasive across the taxonomy.

Figure 4: By hazard, transformed prompts (Red) are consistently riskier than original prompts (Light Blue).

Provider-level analysis substantiates the generalization failure: Anthropic models are the most robust (16.73% ASR), MiniMax is at 29.24%, OpenAI and Qwen around 42%, and most others exceed 65%. Yet even the strongest providers exhibit a non-negligible gap, signifying that improved refusal on direct prompts does not guarantee robust safety under adversarial variation.

Figure 5: By model provider, ASR increases significantly comparing original prompts (Light Blue) with transformed prompts (Red), as highlighted in the red box beside the bars.

Policy-oriented risk bucket analysis maps these deficits to governance contexts. Transformations elevate ASR across all buckets: CBRN at 57.47%, Cyber Offence at 56.46%, and Harmful Manipulation at 54.71%. The heatmap shows provider-specific vulnerabilities but no provider achieves a comprehensive robustness profile compatible with EU AI Act requirements.

Figure 6: By risk category, left: Provider by policy-relevant risk bucket ASR. The same benchmark can be read as a provider-level robustness profile across categories that are closer to governance and systemic-risk discussions.

Theoretical and Practical Implications

The findings provide concrete evidence that current LLM safety training and refusal behavior is anchored in surface-form recognition rather than genuine intent tracking. Refusal overfit has emerged as a critical operational boundary, especially as LLMs are integrated into agentic systems where unsafe outputs propagate into unsafe action plans. The benchmark demonstrates that safety-by-design paradigms and evaluation frameworks are materially incomplete unless stylistic robustness is explicitly measured and addressed.

This has ramifications for:

Safety Alignment: Stylistic displacement exposes competing-objective failures, undermining RLHF and Constitutional AI alignment assumptions (Bai et al., 2022, Bai et al., 2022).
Governance and Compliance: Failure to generalize to adversarial rhetorical forms results in non-compliance with structured risk audits under policy frameworks (EU GPAI CoP).
Model Development and Deployment: Robustness to stylistic variation must become an architectural and training target, not an edge-case evaluation.
Benchmarking: AHB establishes a new class of benchmarks that reliably quantify intent generalization performance.

Future Directions

Addressing the stylistic generalization gap requires redesigned training (e.g., adversarially augmented RLHF datasets, interpretability research for intent/goal tracking), and the development of agent-level operational boundaries where refusal robustness is compositional and transfer-proof across narrative, allegorical, and literary frames. Benchmarking frameworks should integrate automated transformation pipelines to continuously stress-test models as their deployment contexts expand. Systemic audit protocols must move beyond surface form moderation and prioritize underlying goal semantics.

Conclusion

The Adversarial Humanities Benchmark systematically demonstrates that stylistic robustness is a central and unresolved safety challenge for frontier LLMs (2604.18487). The substantial ASR gap between direct and humanities-style transformed prompts, observed across providers, harm categories, and systemic risk buckets, is a signal of structural vulnerability, not a fringe phenomenon. For real-world deployment and policy alignment, LLMs must be evaluated and trained to robustly track harmful intent regardless of rhetorical or stylistic form. Addressing this generalization gap is operationally and theoretically essential for trustworthy, safe, and compliant AI systems.

Markdown Report Issue