SecureForge: Finding and Preventing Vulnerabilities in LLM-Generated Code via Prompt Optimization

Published 8 May 2026 in cs.CR, cs.CL, and cs.CY | (2605.08382v1)

Abstract: LLM coding agents now generate code at an unprecedented scale, yet LLM-generated code introduces cybersecurity vulnerabilities into codebases without human involvement. Even when frontier models are explicitly asked to write secure production code with relevant weaknesses to avoid in context, we find that they still produce verifiable vulnerabilities on average 23% of the time across a corpus of 250 benign coding prompts. We introduce SecureForge, an automated pipeline that both audits security risks of frontier models and produces auditing-informed secure system prompts that reduce output security vulnerabilities while maintaining unit test performance. SecureForge first identifies benign prompts that produce statically detectable vulnerabilities, and then amplifies them into a large synthetic prompt corpus of diverse scenarios using a Markovian sampling technique to jointly maintain error rates and prompt diversity. This corpus is then used to iteratively optimize the system prompts to reduce output security vulnerabilities. On frontier models, SecureForge yields a statistically significant Pareto improvement in both unit test success and output security, with output vulnerabilities reduced by up to 48%. The resulting system prompts transfer zero-shot to in-the-wild coding agent prompts, without any exposure to real user prompt distributions during optimization.


Authors (8)

Summary

The paper introduces an automated pipeline that combines benign prompt falsification, MCMC-based coverage amplification, and system prompt optimization to reduce vulnerabilities in LLM-generated code.
Evaluation across several LLMs shows vulnerability reductions of up to 48% together with improved joint security/correctness rates.
Because the hardening is applied at inference time through the system prompt, it requires only API access and preserves unit test performance.

SecureForge: Automated Detection and Prevention of Vulnerabilities in LLM-Generated Code via Prompt Optimization

Motivation and Problem Setting

LLMs have dramatically expanded the scale at which code is automatically generated and deployed, introducing a critical security problem: models produce code that is functionally correct yet contains exploitable vulnerabilities. Unlike overt adversarial attacks, these weaknesses are often triggered by benign user interactions and survive standard unit testing, so they enter production codebases undetected. Even when explicitly instructed to avoid known weaknesses such as those in the MITRE CWE taxonomy, state-of-the-art LLMs yield statically verifiable vulnerabilities in 23% of outputs on average, with 12.7% of outputs passing all unit tests but remaining insecure (2605.08382).

Practical hardening methods for these scenarios remain limited. Existing approaches, ranging from fine-tuning to adversarial red-team prompting, either require white-box model access or fail to cover the space of vulnerabilities induced by realistic, non-malicious user prompts. SecureForge addresses these gaps with a fully automated inference-time pipeline that both quantifies benign prompt-induced vulnerabilities and hardens an LLM against them, relying only on API access and static analysis.

Methodology

SecureForge is a three-stage automated pipeline for LLM code security:

1. Benign Prompt Falsification: Identify naturalistic prompts that consistently elicit vulnerabilities.
2. Coverage Amplification (MCMC): Apply Markov Chain Monte Carlo to paraphrase and diversify prompts, systematically populating the space of benign scenarios that yield insecure code.
3. System Prompt Optimization: Use GEPA (Genetic-Pareto), a reflective prompt optimizer, to optimize the LLM's system prompt, using static analysis feedback to jointly minimize vulnerabilities and preserve code correctness.
Figure 1: SecureForge's pipeline: from discovering vulnerability-triggering benign prompts, amplifying them via MCMC rephrasing, to optimizing the system prompt for security.
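The first stage can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the callables `generate` and `has_finding` stand in for an LLM API call and a static analyzer, and the rollout count and the 0.5 threshold for "consistently" are made-up defaults, not values from the paper.

```python
from typing import Callable, Dict, Iterable


def falsify(
    prompts: Iterable[str],
    generate: Callable[[str], str],      # hypothetical: prompt -> generated code
    has_finding: Callable[[str], bool],  # hypothetical: code -> analyzer flagged it
    samples: int = 8,                    # assumed number of rollouts per prompt
    min_rate: float = 0.5,               # assumed threshold for "consistently"
) -> Dict[str, float]:
    """Keep prompts whose estimated vulnerability rate is at least min_rate."""
    kept: Dict[str, float] = {}
    for prompt in prompts:
        hits = sum(has_finding(generate(prompt)) for _ in range(samples))
        rate = hits / samples
        if rate >= min_rate:
            kept[prompt] = rate
    return kept


# Toy stand-ins so the sketch runs without a model or an analyzer.
def toy_generate(prompt: str) -> str:
    return "result = eval(expr)" if "calculator" in prompt else "print(expr)"


def toy_has_finding(code: str) -> bool:
    return "eval(" in code


seeds = ["Build a calculator CLI", "Print a greeting"]
vulnerable_seeds = falsify(seeds, toy_generate, toy_has_finding)
```

With a real model, `generate` would be stochastic, which is why the rate is estimated over several rollouts rather than from a single sample.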

The pipeline details are summarized below:

The falsification step generates seed prompts from CWEs, strips any explicit mention of vulnerabilities or CWE identifiers, and keeps only those prompts that produce statically detectable weaknesses when sampled through the LLM.
MCMC amplification iteratively rephrases these prompts while maintaining their propensity to elicit vulnerabilities, efficiently sampling the high-dimensional benign failure space.
GEPA, a zeroth-order black-box prompt optimizer, is then run against the expanded corpus, using Semgrep as the static analyzer to provide per-prompt reward signals.
Figure 2: SecureForge's prompting-based pipeline, with static analysis guiding each stage.
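The amplification step can be sketched as a Metropolis-style chain over prompts. This is a sketch under stated assumptions, not the paper's sampler: the acceptance rule (target proportional to the estimated vulnerability rate, with a symmetric proposal) is an assumption, and `rephrase` and `vuln_rate` are hypothetical stand-ins for an LLM paraphraser and a rollout-plus-static-analysis estimate.

```python
import random
from typing import Callable, List, Optional


def amplify(
    seed_prompt: str,
    rephrase: Callable[[str, random.Random], str],  # hypothetical paraphraser
    vuln_rate: Callable[[str], float],              # hypothetical rate in [0, 1]
    steps: int = 50,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Collect the distinct prompts visited by a Metropolis chain."""
    rng = rng or random.Random(0)
    current, current_score = seed_prompt, vuln_rate(seed_prompt)
    corpus = [current]
    for _ in range(steps):
        candidate = rephrase(current, rng)
        cand_score = vuln_rate(candidate)
        # Assumed target: probability proportional to the vulnerability rate.
        if current_score == 0:
            accept_prob = 1.0
        else:
            accept_prob = min(1.0, cand_score / current_score)
        if rng.random() < accept_prob:
            current, current_score = candidate, cand_score
            if current not in corpus:
                corpus.append(current)
    return corpus


# Toy stand-ins: a uniform proposal over templates and a keyword-based rate.
def toy_rephrase(prompt: str, rng: random.Random) -> str:
    verb = rng.choice(["Write", "Build", "Create"])
    tool = rng.choice(["a shell wrapper", "a log viewer"])
    return f"{verb} {tool} for admins"


def toy_vuln_rate(prompt: str) -> float:
    return 0.8 if "shell" in prompt else 0.0


seed = "Write a shell wrapper for admins"
amplified = amplify(seed, toy_rephrase, toy_vuln_rate)
```

Because candidates with a zero estimated rate are never accepted from a nonzero state, the chain stays inside the failure-inducing region while still moving between paraphrases, which is the trade-off between error rate and diversity described above.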

Experimental Protocol and Evaluation

SecureForge is evaluated across both open and proprietary LLMs, including the GPT-5 scaling series, Claude Sonnet 4.6, and major open-source code models (Qwen2.5 Coder, CodeLlama, Kimi K2). The core evaluation metrics are:

Vulnerability Rate: Empirical probability that a rollout from a benign prompt contains a statically-verifiable CWE.
Test Passing Rate: Proportion of code rollouts passing auto-generated pytest suites for functional correctness.
Joint Security/Correctness Rate: Rate at which rollouts are both secure (CWE-free) and pass all tests.
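The three metrics above can be computed from per-rollout outcomes as follows. This is a minimal sketch: the `Rollout` record and its field names are hypothetical, and the four sample rollouts are made-up data, not results from the paper.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class Rollout:
    has_cwe: bool       # static analyzer reported at least one CWE finding
    passes_tests: bool  # every unit test in the generated suite passed


def rates(rollouts: Iterable[Rollout]) -> Dict[str, float]:
    """Return vulnerability, test passing, and joint secure-and-correct rates."""
    items = list(rollouts)
    if not items:
        raise ValueError("at least one rollout is required")
    n = len(items)
    return {
        "vulnerability_rate": sum(r.has_cwe for r in items) / n,
        "test_pass_rate": sum(r.passes_tests for r in items) / n,
        "joint_rate": sum((not r.has_cwe) and r.passes_tests for r in items) / n,
    }


# Made-up outcomes covering all four combinations.
sample = [
    Rollout(has_cwe=False, passes_tests=True),
    Rollout(has_cwe=True, passes_tests=True),
    Rollout(has_cwe=False, passes_tests=False),
    Rollout(has_cwe=True, passes_tests=False),
]
summary = rates(sample)
```

Note that the joint rate is not derivable from the other two rates alone, since it depends on how security and correctness co-occur within individual rollouts.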

Evaluation includes held-out, non-overlapping CWE scenarios and in-the-wild coding prompts from the SWE-chat dataset, simulating authentic developer-agent coding workflows.

Key Results

Baseline Vulnerability and SecureForge Intervention

Simply prompting models to write secure, CWE-aware code leaves vulnerability rates above 13%, even for test-passing completions (Figure 3). SecureForge's optimized system prompts achieve up to 48% reduction in vulnerability rates on held-out scenarios, with unit test pass rates preserved or even improved across all tested models.

Figure 3: Left: Weakness rates for GPT models before intervention, after various securitization prompts, and after SecureForge. Right: Joint security/correctness rates—higher is better.

Figure 4: Top: Vulnerability rate reduction before and after SecureForge. Middle: Unit test passage remains stable or improves. Bottom: Joint rate of safe and correct code increases post-intervention.
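The notion of a Pareto improvement over the two objectives can be made concrete with a small check. The sketch below assumes the two objectives are vulnerability rate (lower is better) and test passing rate (higher is better); the numbers are invented for illustration, and the check says nothing about statistical significance, which requires a separate test.

```python
from typing import Tuple

# (vulnerability_rate, test_pass_rate)
Point = Tuple[float, float]


def pareto_improves(before: Point, after: Point) -> bool:
    """True if `after` is no worse on both objectives and better on at least one."""
    vuln_before, pass_before = before
    vuln_after, pass_after = after
    no_worse = vuln_after <= vuln_before and pass_after >= pass_before
    strictly_better = vuln_after < vuln_before or pass_after > pass_before
    return no_worse and strictly_better


# Invented example values, not results from the paper.
baseline = (0.20, 0.80)
hardened = (0.15, 0.82)
traded_off = (0.15, 0.70)  # more secure, but fewer tests pass

improved = pareto_improves(baseline, hardened)
regressed = pareto_improves(baseline, traded_off)
```

A prompt that lowers the vulnerability rate by making the model refuse or produce broken code would fail this check, which is why both objectives are tracked together.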

Generalization to Real-World Coding Agents

The system prompts optimized via SecureForge transfer zero-shot to real-world SWE-chat prompts, improving joint security/correctness by up to 13.5% without exposure to such distributions during optimization (Figure 5). This suggests that the MCMC-amplified corpus covers a broad enough range of scenarios for static-analysis-guided prompt optimization to generalize beyond it.

Figure 5: SecureForge's effect on real in-the-wild SWE-chat coding tasks, showing up to 13.5% improvement in safe-passing code generation.

Comparison to Other Prompt Optimization Techniques

Against MIPRO and security-aware prompting, SecureForge's use of static analysis in the optimization objective consistently yields lower vulnerability rates and higher joint success on diverse coding scenarios (Figure 6).

Figure 6: SecureForge's GEPA-based optimizer outperforms security-aware prompting and MIPRO on both vulnerability rate and joint secure/correct output rate.

Ablation, Analysis, and Broader Implications

Coverage and Diversity: The MCMC amplification produces a highly diverse prompt corpus, with only marginal increases in code output homogeneity (as measured by Self-BLEU). Amplification discovers roughly twice as many unique vulnerable scenarios relative to simplistic rephrasing strategies.
Effect Uniformity: Vulnerability reduction is broadly uniform across CWEs, although code injection and authentication design weaknesses are reduced less, which leaves them as open challenges.
Severity Mitigation: Reduction in vulnerabilities is preserved across Semgrep's severity spectrum (Figure 7), and SecureForge also reduces vulnerabilities in rollouts that initially fail functional tests.
Practicality: System prompt optimization incurs only a modest one-time compute cost (∼$150 for mainstream LLMs), is reproducible via open-source tooling, and does not require model weight access or retraining.
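For the severity breakdown mentioned above, findings can be tallied from a static analyzer's JSON report. The sketch below assumes the report follows the layout of Semgrep's JSON output, with a top-level `results` list and a severity under each finding's `extra` field; that layout should be checked against the installed Semgrep version. The rule identifiers and the sample report are hypothetical.

```python
import json
from collections import Counter
from typing import Dict


def severity_counts(report_json: str) -> Dict[str, int]:
    """Count findings per severity label in a Semgrep-style JSON report."""
    report = json.loads(report_json)
    counts: Counter = Counter()
    for finding in report.get("results", []):
        # Findings without a severity are grouped under "UNKNOWN".
        severity = finding.get("extra", {}).get("severity", "UNKNOWN")
        counts[severity] += 1
    return dict(counts)


# Hypothetical report with made-up rule identifiers.
SAMPLE_REPORT = json.dumps({
    "results": [
        {"check_id": "example.rule-a", "extra": {"severity": "ERROR"}},
        {"check_id": "example.rule-b", "extra": {"severity": "WARNING"}},
        {"check_id": "example.rule-c", "extra": {"severity": "ERROR"}},
    ],
    "errors": [],
})

by_severity = severity_counts(SAMPLE_REPORT)
```

Comparing these tallies for rollouts generated with and without the optimized system prompt gives a per-severity view of the reduction.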

Implications and Future Directions

The results have implications for three audiences:

For Model Providers: SecureForge enables API-only, inference-time hardening of code models, aligning with secure-by-design mandates while avoiding disruptive distribution shifts or high-overhead fine-tuning cycles.
For Practitioners: The open-source pipeline provides a mechanism for ongoing auditing and system prompt alignment, allowing rapid adaptation to new model variants or evolving CWE taxonomies.
For Research: The empirical gap between security-aware prompting and statically-verified hardening highlights fundamental model limitations; future work could extend SecureForge to dynamic analysis, multi-turn dialogue synthesis, multilingual settings, or agent feedback loops.

Deployment at scale could reduce the real-world risk of vulnerable code entering production via LLM coding agents, and the success of black-box methods underscores the continued value of modular, test-driven security pipelines over end-to-end retraining.

Conclusion

SecureForge demonstrates that automated, static-analysis-guided prompt optimization can significantly reduce vulnerabilities in LLM-generated code induced by benign user interactions. The approach yields statistically significant gains, with vulnerability rates reduced by up to 48%, without lowering unit test pass rates, and it generalizes to real user environments. By requiring only API access and static analysis, SecureForge offers a practical, scalable way to improve code security in deployed LLMs, while establishing a methodological foundation for future secure agentic AI systems.