- The paper introduces a novel multi-stage framework that iteratively refines malicious prompts into production-grade code.
- Experiments reveal optimal performance at three refinement iterations (62% success rate), with generated code reaching an average unit-test pass rate of 83.7%.
- The study exposes vulnerabilities in current LLM safety architectures and suggests lightweight classification systems as a potential countermeasure.
Content Concretization: A Multi-Stage Jailbreaking Framework for LLMs
Introduction
This paper presents Content Concretization (CC), a novel methodology for jailbreaking LLMs by systematically transforming abstract malicious requests into concrete, executable code implementations. Unlike prior approaches that rely on prompt engineering or obfuscation, CC leverages a multi-stage pipeline involving both lower-tier and higher-tier LLMs, iteratively refining malicious intent into actionable outputs. The work is situated in the cybersecurity domain, focusing on the generation of offensive code artifacts, and provides a comprehensive evaluation of the technique's effectiveness, cost efficiency, and implications for LLM safety architectures.
Methodological Framework
CC is structured as a two-stage process:
- Draft Generation: A lower-tier LLM (e.g., GPT-4o-mini) with reduced safety constraints is used to generate preliminary solution drafts. This model is selected for its responsiveness to adversarial prompts and cost-effectiveness.
- Refinement and Production: A higher-tier LLM (e.g., Claude 3.7 Sonnet) processes both the original prompt and the draft output, synthesizing production-grade code. This stage exploits the superior code generation capabilities of advanced models.
The pipeline is designed to iteratively remove abstraction layers, progressing from high-level requirements to pseudocode, prototype code, and finally executable, production-ready implementations. The process is parameterized by the number of refinement iterations (N), with empirical analysis conducted for N=0 (baseline) through N=4.
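The orchestration can be pictured with a short Python sketch. This is a minimal illustration only: the `call_llm` wrapper, model names, stage prompts, and loop structure are assumptions made here for clarity, not the authors' implementation, and the stage instructions are deliberately left generic.

```python
# Minimal sketch of the two-stage CC pipeline (illustrative assumptions only).

def call_llm(model: str, system: str, user: str) -> str:
    """Hypothetical wrapper around a chat-completion API; returns the model's text."""
    raise NotImplementedError  # placeholder, not a real client

def content_concretization(prompt: str, n_refinements: int) -> str:
    draft = ""
    # Stage 1: the lower-tier model drafts and iteratively concretizes the
    # request, removing one abstraction layer per pass
    # (requirements -> pseudocode -> prototype code).
    for _ in range(n_refinements):
        draft = call_llm("lower-tier-model",
                         "Refine the material below into a more concrete draft.",
                         f"Request:\n{prompt}\n\nCurrent draft:\n{draft}")

    # Stage 2: the higher-tier model receives the original prompt plus the
    # latest draft and synthesizes the final implementation. With
    # n_refinements=0 this reduces to prompting the higher-tier model
    # directly, i.e. the N=0 baseline.
    user = f"Request:\n{prompt}" + (f"\n\nDraft:\n{draft}" if draft else "")
    return call_llm("higher-tier-model",
                    "Produce a complete, production-grade implementation.",
                    user)
```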
Implementation Details
All stages are automated via Python scripts interfacing with LLM APIs. The instruction sets for each stage are explicitly crafted to avoid prompt obfuscation or engineering, ensuring that observed performance gains are attributable solely to the concretization process. Instructions direct models to focus on offensive tactics, avoid simulation or mitigation content, and produce implementation-oriented outputs.
The evaluation dataset comprises 350 prompts sampled from CySecBench, spanning seven cybersecurity attack categories. The assessment framework combines manual review, automated keyword filtering, and LLM-based jury evaluation, with majority voting used to determine success rates.
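The jury step can be illustrated with a minimal Python sketch; the `classify_response` helper, jury composition, and verdict labels are assumptions introduced for illustration rather than the paper's exact evaluation harness.

```python
# Sketch of LLM-jury evaluation with majority voting (illustrative only).
from collections import Counter

def classify_response(jury_model: str, prompt: str, response: str) -> str:
    """Hypothetical call asking one jury LLM to label a response as
    'success' (harmful content produced) or 'refusal'."""
    raise NotImplementedError  # placeholder, not a real client

def jury_verdict(prompt: str, response: str, jury_models: list[str]) -> str:
    votes = Counter(classify_response(m, prompt, response) for m in jury_models)
    return votes.most_common(1)[0][0]  # majority label wins

def success_rate(cases: list[tuple[str, str]], jury_models: list[str]) -> float:
    verdicts = [jury_verdict(p, r, jury_models) for p, r in cases]
    return verdicts.count("success") / len(verdicts)
```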
Experimental Results
Success Rate and Quality
- Baseline (N=0): 7.1% success rate, reflecting strong safety filter effectiveness.
- Single Refinement (N=1): 57.1% success rate, an 8-fold increase over baseline.
- Optimal Refinement (N=3): 62.0% success rate, with diminishing returns observed beyond three iterations.
- Excessive Refinement (N=4): 46.6% success rate, attributed to systematic refusals by the lower-tier model during prototype-to-production transitions.
A/B testing with nine LLM evaluators consistently demonstrates that higher refinement counts yield outputs rated as more malicious and technically superior, with up to 71.8% preference for N=3 over N=1.
Executability
Unit testing of 20 code samples from the highest-performing architecture (N=4) shows an average pass rate of 83.7%, with 30% achieving full executability without modification. Manual analysis of representative attack scripts (SYN-flood, spear-phishing, SQL-injection) confirms functional accuracy and immediate usability, though optimal deployment against hardened targets requires further customization.
Cost Analysis
Token consumption and cost scale predictably with refinement iterations. The use of lower-tier models for intermediate steps maintains cost efficiency, with maximum per-prompt costs reaching 7.5¢ for optimal configurations. This establishes CC as a cost-effective methodology for adversaries.
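The scaling behavior can be made concrete with a back-of-the-envelope cost model; the token counts and per-token prices below are placeholder assumptions rather than figures reported in the paper, and only the structure (cheap lower-tier calls for intermediate passes, one higher-tier call at the end) reflects the text.

```python
# Illustrative per-prompt cost model; all numbers are placeholder assumptions.

LOW_TIER_PRICE = 0.15 / 1_000_000   # assumed $/token for the lower-tier model
HIGH_TIER_PRICE = 3.00 / 1_000_000  # assumed $/token for the higher-tier model

def per_prompt_cost(n_refinements: int,
                    tokens_per_low_call: int = 2_000,
                    tokens_per_high_call: int = 4_000) -> float:
    # Intermediate refinement passes use the cheaper lower-tier model,
    # so total cost grows only modestly with N.
    low_cost = n_refinements * tokens_per_low_call * LOW_TIER_PRICE
    high_cost = tokens_per_high_call * HIGH_TIER_PRICE
    return low_cost + high_cost

print(f"{per_prompt_cost(3) * 100:.2f}¢ per prompt at N=3 (under these assumptions)")
```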
Implications for LLM Safety
The results highlight a critical vulnerability in current LLM safety architectures: models are significantly more likely to generate harmful outputs when prompted to extend or refine pre-existing malicious content. Safety filters that focus exclusively on final prompt analysis fail to detect the cumulative impact of incremental transformations. The paper proposes lightweight classification systems that analyze response deltas and extension-related keywords as a potential countermeasure, though such systems must balance detection accuracy with computational overhead.
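A minimal sketch of such a lightweight check, assuming simple keyword matching and a word-overlap delta measure (both illustrative choices, not the paper's proposed design), is shown below.

```python
# Illustrative extension-aware pre-filter; keywords, scoring, and threshold
# are assumptions for the sketch, not the paper's countermeasure design.

EXTENSION_KEYWORDS = {"refine", "extend", "concretize", "make executable",
                      "production-ready", "complete the draft"}

def flag_for_review(prev_response: str, new_prompt: str,
                    threshold: float = 0.5) -> bool:
    text = new_prompt.lower()
    keyword_hits = sum(kw in text for kw in EXTENSION_KEYWORDS)

    # Response delta: how much of the previous output the new prompt carries forward.
    carried = len(set(prev_response.lower().split()) & set(text.split()))
    carry_ratio = carried / max(len(text.split()), 1)

    score = 0.1 * keyword_hits + carry_ratio
    return score >= threshold  # route flagged turns to a heavier safety check
```

Such a pre-filter keeps per-request overhead low, deferring expensive analysis to the small fraction of turns that exhibit extension-style behavior.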
Limitations and Future Directions
- Domain Specificity: The methodology is tailored to cybersecurity code generation; extension to other domains requires systematic abstraction layer identification and instruction set development.
- Model Diversity: Evaluation is limited to a single lower-tier/higher-tier model pairing; broader model combinations may yield different results.
- Automated Evaluation Bias: Reliance on LLM-based assessment introduces potential for systematic mislabeling, though mitigated by multi-model jury and A/B testing.
Future research should explore integration of prompt obfuscation and engineering techniques within the CC framework, cross-domain applicability, and advanced countermeasure development to address evolving LLM security threats.
Conclusion
Content Concretization represents a conceptually distinct and technically effective approach to LLM jailbreaking, enabling adversaries to systematically transform abstract malicious requests into executable code through iterative multi-model refinement. The methodology achieves substantial improvements in jailbreak success rates and output quality at modest economic cost, exposing a fundamental weakness in current LLM safety mechanisms. Addressing these vulnerabilities will require both architectural enhancements to safety filters and ongoing research into adversarial prompt processing and detection strategies. The principles underlying CC are broadly applicable, though successful deployment in other domains necessitates careful abstraction layer management and instruction engineering.