- The paper introduces a novel multi-stage framework that iteratively refines malicious prompts into production-grade code.
- Experiments reveal optimal performance at three refinement iterations (62% success rate), with generated code reaching an average unit-test pass rate of 83.7%.
- The study exposes vulnerabilities in current LLM safety architectures and suggests lightweight classification systems as a potential countermeasure.
Content Concretization: A Multi-Stage Jailbreaking Framework for LLMs
Introduction
This paper presents Content Concretization (CC), a novel methodology for jailbreaking LLMs by systematically transforming abstract malicious requests into concrete, executable code implementations. Unlike prior approaches that rely on prompt engineering or obfuscation, CC leverages a multi-stage pipeline involving both lower-tier and higher-tier LLMs, iteratively refining malicious intent into actionable outputs. The work is situated in the cybersecurity domain, focusing on the generation of offensive code artifacts, and provides a comprehensive evaluation of the technique's effectiveness, cost efficiency, and implications for LLM safety architectures.
Methodological Framework
CC is structured as a two-stage process:
- Draft Generation: A lower-tier LLM (e.g., GPT-4o-mini) with reduced safety constraints is used to generate preliminary solution drafts. This model is selected for its responsiveness to adversarial prompts and cost-effectiveness.
- Refinement and Production: A higher-tier LLM (e.g., Claude 3.7 Sonnet) processes both the original prompt and the draft output, synthesizing production-grade code. This stage exploits the superior code generation capabilities of advanced models.
The pipeline is designed to iteratively remove abstraction layers, progressing from high-level requirements to pseudocode, prototype code, and finally executable, production-ready implementations. The process is parameterized by the number of refinement iterations (N), with empirical analysis conducted for N=0 (baseline) through N=4.
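The orchestration can be pictured with a short Python sketch. This is a minimal illustration only: the `call_llm` wrapper, model names, stage prompts, and loop structure are assumptions made here for clarity, not the authors' implementation, and the stage instructions are deliberately left generic.

```python
# Minimal sketch of the two-stage CC pipeline (illustrative assumptions only).

def call_llm(model: str, system: str, user: str) -> str:
    """Hypothetical wrapper around a chat-completion API; returns the model's text."""
    raise NotImplementedError  # placeholder, not a real client

def content_concretization(prompt: str, n_refinements: int) -> str:
    draft = ""
    # Stage 1: the lower-tier model drafts and iteratively concretizes the
    # request, removing one abstraction layer per pass
    # (requirements -> pseudocode -> prototype code).
    for _ in range(n_refinements):
        draft = call_llm("lower-tier-model",
                         "Refine the material below into a more concrete draft.",
                         f"Request:\n{prompt}\n\nCurrent draft:\n{draft}")

    # Stage 2: the higher-tier model receives the original prompt plus the
    # latest draft and synthesizes the final implementation. With
    # n_refinements=0 this reduces to prompting the higher-tier model
    # directly, i.e. the N=0 baseline.
    user = f"Request:\n{prompt}" + (f"\n\nDraft:\n{draft}" if draft else "")
    return call_llm("higher-tier-model",
                    "Produce a complete, production-grade implementation.",
                    user)
```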
Implementation Details
All stages are automated via Python scripts interfacing with LLM APIs. The instruction sets for each stage are explicitly crafted to avoid prompt obfuscation or engineering, ensuring that observed performance gains are attributable solely to the concretization process. Instructions direct models to focus on offensive tactics, avoid simulation or mitigation content, and produce implementation-oriented outputs.
The evaluation dataset comprises 350 prompts sampled from CySecBench, spanning seven cybersecurity attack categories. The assessment framework combines manual review, automated keyword filtering, and LLM-based jury evaluation, with majority voting used to determine success rates.
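The jury step can be illustrated with a minimal Python sketch; the `classify_response` helper, jury composition, and verdict labels are assumptions introduced for illustration rather than the paper's exact evaluation harness.

```python
# Sketch of LLM-jury evaluation with majority voting (illustrative only).
from collections import Counter

def classify_response(jury_model: str, prompt: str, response: str) -> str:
    """Hypothetical call asking one jury LLM to label a response as
    'success' (harmful content produced) or 'refusal'."""
    raise NotImplementedError  # placeholder, not a real client

def jury_verdict(prompt: str, response: str, jury_models: list[str]) -> str:
    votes = Counter(classify_response(m, prompt, response) for m in jury_models)
    return votes.most_common(1)[0][0]  # majority label wins

def success_rate(cases: list[tuple[str, str]], jury_models: list[str]) -> float:
    verdicts = [jury_verdict(p, r, jury_models) for p, r in cases]
    return verdicts.count("success") / len(verdicts)
```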
Experimental Results
Success Rate and Quality
- Baseline (N=0): 7.1% success rate, reflecting strong safety filter effectiveness.
- Single Refinement (N=1): 57.1% success rate, an 8-fold increase over baseline.
- Optimal Refinement (N=3): 62.0% success rate, with diminishing returns observed beyond three iterations.
- Excessive Refinement (N=4): 46.6% success rate, attributed to systematic refusals by the lower-tier model during prototype-to-production transitions.
A/B testing with nine LLM evaluators consistently demonstrates that higher refinement counts yield outputs rated as more malicious and technically superior, with up to 71.8% preference for N=3 over N=1.
Executability
Unit testing of 20 code samples from the highest-performing architecture (N=4) shows an average pass rate of 83.7%, with 30% achieving full executability without modification. Manual analysis of representative attack scripts (SYN-flood, spear-phishing, SQL-injection) confirms functional accuracy and immediate usability, though optimal deployment against hardened targets requires further customization.
Cost Analysis
Token consumption and cost scale predictably with refinement iterations. The use of lower-tier models for intermediate steps maintains cost efficiency, with maximum per-prompt costs reaching 7.5¢ for optimal configurations. This establishes CC as a cost-effective methodology for adversaries.
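The scaling behavior can be made concrete with a back-of-the-envelope cost model; the token counts and per-token prices below are placeholder assumptions rather than figures reported in the paper, and only the structure (cheap lower-tier calls for intermediate passes, one higher-tier call at the end) reflects the text.

```python
# Illustrative per-prompt cost model; all numbers are placeholder assumptions.

LOW_TIER_PRICE = 0.15 / 1_000_000   # assumed $/token for the lower-tier model
HIGH_TIER_PRICE = 3.00 / 1_000_000  # assumed $/token for the higher-tier model

def per_prompt_cost(n_refinements: int,
                    tokens_per_low_call: int = 2_000,
                    tokens_per_high_call: int = 4_000) -> float:
    # Intermediate refinement passes use the cheaper lower-tier model,
    # so total cost grows only modestly with N.
    low_cost = n_refinements * tokens_per_low_call * LOW_TIER_PRICE
    high_cost = tokens_per_high_call * HIGH_TIER_PRICE
    return low_cost + high_cost

print(f"{per_prompt_cost(3) * 100:.2f}¢ per prompt at N=3 (under these assumptions)")
```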
Implications for LLM Safety
The results highlight a critical vulnerability in current LLM safety architectures: models are significantly more likely to generate harmful outputs when prompted to extend or refine pre-existing malicious content. Safety filters that focus exclusively on final prompt analysis fail to detect the cumulative impact of incremental transformations. The paper proposes lightweight classification systems that analyze response deltas and extension-related keywords as a potential countermeasure, though such systems must balance detection accuracy with computational overhead.
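A minimal sketch of such a lightweight check, assuming simple keyword matching and a word-overlap delta measure (both illustrative choices, not the paper's proposed design), is shown below.

```python
# Illustrative extension-aware pre-filter; keywords, scoring, and threshold
# are assumptions for the sketch, not the paper's countermeasure design.

EXTENSION_KEYWORDS = {"refine", "extend", "concretize", "make executable",
                      "production-ready", "complete the draft"}

def flag_for_review(prev_response: str, new_prompt: str,
                    threshold: float = 0.5) -> bool:
    text = new_prompt.lower()
    keyword_hits = sum(kw in text for kw in EXTENSION_KEYWORDS)

    # Response delta: how much of the previous output the new prompt carries forward.
    carried = len(set(prev_response.lower().split()) & set(text.split()))
    carry_ratio = carried / max(len(text.split()), 1)

    score = 0.1 * keyword_hits + carry_ratio
    return score >= threshold  # route flagged turns to a heavier safety check
```

Such a pre-filter keeps per-request overhead low, deferring expensive analysis to the small fraction of turns that exhibit extension-style behavior.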
Limitations and Future Directions
- Domain Specificity: The methodology is tailored to cybersecurity code generation; extension to other domains requires systematic abstraction layer identification and instruction set development.
- Model Diversity: Evaluation is limited to a single lower-tier/higher-tier model pairing; broader model combinations may yield different results.
- Automated Evaluation Bias: Reliance on LLM-based assessment introduces potential for systematic mislabeling, though mitigated by multi-model jury and A/B testing.
Future research should explore integration of prompt obfuscation and engineering techniques within the CC framework, cross-domain applicability, and advanced countermeasure development to address evolving LLM security threats.
Conclusion
Content Concretization represents a conceptually distinct and technically effective approach to LLM jailbreaking, enabling adversaries to systematically transform abstract malicious requests into executable code through iterative multi-model refinement. The methodology achieves substantial improvements in jailbreak success rates and output quality at modest economic cost, exposing a fundamental weakness in current LLM safety mechanisms. Addressing these vulnerabilities will require both architectural enhancements to safety filters and ongoing research into adversarial prompt processing and detection strategies. The principles underlying CC are broadly applicable, though successful deployment in other domains necessitates careful abstraction layer management and instruction engineering.