LLM-Generated Code Security

Updated 1 December 2025
  • Security of LLM-generated code is an evolving research area that identifies and investigates vulnerabilities and compliance gaps in code produced by advanced language models.
  • Evaluation frameworks like CWEval and SafeGenBench use metrics and static analysis tools to jointly assess functional accuracy and security, highlighting significant trade-offs.
  • Mitigation techniques such as secure fine-tuning, constrained decoding, and iterative prompting reduce vulnerabilities but often introduce functionality compromises and detection gaps.

LLMs have achieved remarkable proficiency in automated code generation across multiple programming languages and domains, leading to widespread adoption among developers and organizations. However, extensive empirical studies indicate that the security of LLM-generated code remains a critical and unsolved challenge. Vulnerability analyses, benchmarking studies, and methodological investigations consistently show that code produced by LLMs frequently exhibits exploitable software vulnerabilities, omits essential security controls, or fails to conform to industry best practices. This article surveys the state of research on the security of LLM-generated code, including the root causes of insecurity, evaluation methodologies, technical advances in mitigation, empirical performance gaps, and future research directions.

1. Prevalence and Nature of Vulnerabilities in LLM-Generated Code

LLM-generated code is frequently vulnerable due to both model and training-data limitations. Multiple benchmark studies report that a substantial fraction of generated code snippets (12–65%, depending on task, language, model, and prompting configuration) is non-compliant with basic secure coding standards or explicitly triggers Common Weakness Enumeration (CWE)-classified vulnerabilities (Kharma et al., 3 Feb 2025, Shahid et al., 24 Nov 2025, Peng et al., 14 Jan 2025, Chong et al., 27 Sep 2024, Goetz et al., 13 Aug 2024, Dora et al., 29 Apr 2025). Prevalent weakness classes include buffer overflows (CWE-120/787), unchecked return values (CWE-252/253), hard-coded credentials (CWE-259/798), SQL injection (CWE-89), OS command injection (CWE-78), cryptographic misuse (CWE-780), path traversal (CWE-22), and improper input validation (CWE-20).
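
For illustration, a typical instance of one of these weakness classes and its remediation, in the form static analyzers would flag it. This is a generic, hypothetical snippet, not an example drawn from any of the cited benchmarks:

```python
import sqlite3

# Vulnerable pattern (CWE-89): user input concatenated into the SQL string,
# a shape commonly flagged in LLM-generated database helpers.
def find_user_insecure(conn: sqlite3.Connection, username: str):
    query = "SELECT id, email FROM users WHERE name = '" + username + "'"
    return conn.execute(query).fetchall()  # attacker-controlled input reaches the SQL parser

# Remediated pattern: a parameterized query keeps data out of the SQL grammar.
def find_user_secure(conn: sqlite3.Connection, username: str):
    query = "SELECT id, email FROM users WHERE name = ?"
    return conn.execute(query, (username,)).fetchall()
```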

Table: Example CWE frequencies in LLM-CSEC (Shahid et al., 24 Nov 2025)

  CWE (weakness)                        Approx. occurrences (over 20 codebases)
  CWE-252/253 (unchecked return value)  350
  CWE-120 (buffer copy)                 75
  CWE-787 (out-of-bounds write)         65
  CWE-805 (incorrect length)            60
  CWE-190 (integer overflow)            25

LLMs also pose significant risk by hallucinating non-existent third-party packages or referencing outdated components, creating software supply chain compromise vectors (Li et al., 24 Sep 2025).

2. Evaluation Frameworks and Security Benchmarks

Recent work has focused on developing rigorous benchmarks and evaluation methodologies that jointly assess both security and functionality of LLM-generated code. Notable frameworks include:

  • CWEval: Defines an outcome-driven approach combining dynamic oracles for security and correctness, and reports func-sec@k metrics reflecting the frequency with which candidate generations are both functionally correct and secure (see the estimator sketch after this list). Empirically, a ~25–35% absolute gap separates functionally correct code from code that is simultaneously secure and functional across four major LLMs; e.g., GPT-4o yields func@10 = 90.7% but func-sec@10 = 65.3% (Peng et al., 14 Jan 2025).
  • SafeGenBench: Combines SAST (e.g., Semgrep) with LLM-based judging over 558 prompts in 12 languages; dual-judge consensus required for a code sample to be considered secure. Zero-shot accuracy averages 37.4%; with safety instructions and few-shot prompting, security accuracy can be raised by 20–25% (Li et al., 6 Jun 2025).
  • SALLM: Introduces two metrics: Vulnerability Injection Rate (VIR, the fraction of generations flagged as insecure) and Security Robustness Score (SRS), which balances correctness and VIR. For GPT-4, VIR = 43.9% and secure@1 = 56.0%, whereas StarCoder achieves VIR = 16.3% and secure@1 = 82% (Siddiq et al., 2023).
  • DualGauge: Implements agentic, fully automated joint functionality and security evaluation across 154 coding tasks and 10 leading LLMs. Secure-pass@1 rates remain <12% for all models tested, even when functional pass@1 exceeds 50% (Pathak et al., 24 Nov 2025).
  • LLM-CSEC: Uses three static analysis tools (CodeQL, Snyk, CodeShield) and reports that even with explicit “secure code generator” prompting, the median LLM generation contains multiple high-severity vulnerabilities (Shahid et al., 24 Nov 2025).
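
The benchmarks above differ in their scoring pipelines, but func-sec@k and secure-pass@k style numbers can be read as the standard pass@k estimator with the pass event restricted to samples that are simultaneously functional and secure. A minimal sketch of that estimator; the exact formulations in the cited papers may differ:

```python
from math import comb

def joint_pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k-style estimator where the pass event is redefined as
    'functionally correct AND judged secure' (c of n sampled generations qualify)."""
    if n - c < k:
        return 1.0  # every draw of k samples must contain a qualifying one
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 samples per task, 4 of which are both correct and secure.
print(joint_pass_at_k(n=10, c=4, k=1))  # 0.4
print(joint_pass_at_k(n=10, c=4, k=5))  # ~0.976
```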

3. Secure Code Generation Techniques and Their Trade-offs

Efforts to mitigate the insecurity of LLM-generated code span data-level, training-time, and inference-time interventions:

  • Secure Fine-tuning and Preference Learning: Approaches such as SVEN and SafeCoder employ white-box fine-tuning on curated datasets of secure/vulnerable code, while ProSec uses a proactive alignment mechanism by synthesizing CWE-triggering prompts and applying pairwise preference learning (SimPO). ProSec achieves up to a 35% reduction in vulnerability rate over prior instruction-tuning methods with negligible utility loss (Xu et al., 19 Nov 2024).
  • Constrained Decoding and Iterative Prompting: CodeGuard+ and PromSec employ gray-box or black-box approaches, using static analyzers and prompt refinement to filter vulnerable generations. However, such techniques often sacrifice functionality, stripping away vulnerable code constructs or emitting trivial no-op code, in order to satisfy security constraints. SAFE@k statistics reveal that joint secure-and-functional correctness is rarely achieved (>90% of outputs are either functional or secure, but not both) (Dai et al., 18 Mar 2025).
  • Structured Secure Reasoning: GRASP constrains LLM reasoning through a Directed Acyclic Graph over Secure Coding Practices, orchestrating stepwise code revisions. This raises Security Rate (SR) from ~0.6 to ≥0.8 across all tested LLMs and achieves up to 88% relative improvement for zero-day vulnerabilities over advanced prompting baselines (Patir et al., 8 Oct 2025).
  • Agentic Multi-Stage Generation and Automated Patching: AutoSafeCoder interleaves synthesis with static/dynamic (fuzzing) security agents, achieving a 13% reduction in vulnerability rate while minimizing correctness loss on SecurityEval (Nunez et al., 16 Sep 2024). Feedback-Driven Security Patching (FDSP) iteratively incorporates static analyzer feedback into LLM re-generation (a generic version of this loop is sketched after this list), reducing the residual vulnerability rate on PythonSecurityEval from 40.2% to 7.4% for GPT-4 (Alrashedy et al., 2023).
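
The analyzer-in-the-loop pattern behind FDSP, PromSec, and similar approaches can be sketched generically as follows; the `generate` and `analyze` callables are hypothetical stand-ins for an LLM API call and a SAST-tool wrapper (e.g., Bandit or Semgrep), not the implementations from the cited papers:

```python
from typing import Callable, List

Generator = Callable[[str], str]   # prompt -> candidate code (LLM call)
Analyzer = Callable[[str], List[str]]  # code -> list of human-readable findings

def feedback_driven_repair(task: str, generate: Generator, analyze: Analyzer,
                           max_rounds: int = 3) -> str:
    """Generic repair loop: re-prompt the model with the analyzer findings
    from the previous candidate until no findings remain or rounds run out."""
    prompt = task
    code = generate(prompt)
    for _ in range(max_rounds):
        findings = analyze(code)
        if not findings:
            break  # candidate passes the static checks
        prompt = (
            f"{task}\n\nYour previous solution triggered these findings:\n"
            + "\n".join(f"- {f}" for f in findings)
            + "\nRevise the code to remove them without changing its behavior."
        )
        code = generate(prompt)
    return code  # may still carry findings if max_rounds is exhausted
```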

4. Supply Chain and Backdoor Threats

LLM-generated code introduces emergent software supply chain threats beyond classical CWE classes. Large-scale testing with SSCGuard reveals LLMs' tendency to invent (“hallucinate”) non-existent packages (up to 44.7% of package references in some models), recommend deprecated or vulnerable versions, or emit CI/CD configurations that enable code execution or privilege escalation attacks (Li et al., 24 Sep 2025). CI plugin hallucination rates can exceed 90%. Defense mechanisms such as Chain-of-Confirmation prompting and middleware-based supply-chain scanners are shown to halve hallucination rates with negligible utility loss.
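
A middleware-style dependency check of the kind these defenses imply can be approximated with a registry lookup before installation; the sketch below uses PyPI's public JSON endpoint and is illustrative only, not SSCGuard's implementation:

```python
import urllib.request
from urllib.error import HTTPError

def package_exists_on_pypi(name: str) -> bool:
    """Return True if the package name resolves on the public PyPI index."""
    url = f"https://pypi.org/pypi/{name}/json"
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.status == 200
    except HTTPError as err:
        if err.code == 404:
            return False  # likely a hallucinated package name
        raise

def vet_llm_dependencies(requirements: list[str]) -> list[str]:
    """Flag LLM-proposed dependency names that do not exist upstream."""
    return [pkg for pkg in requirements if not package_exists_on_pypi(pkg)]

# Example: 'requests' exists; a made-up name would be flagged for human review.
print(vet_llm_dependencies(["requests", "definitely-not-a-real-pkg-1234"]))
```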

Backdoor attacks on LLM-based HDL (hardware description language) code generation are also feasible; RTL-Breaker demonstrates that poisoned models can insert stealthy hardware Trojans triggered by specific prompts/comments, with negligible functional accuracy loss for non-triggered prompts and attack success rates (ASR) near 100% (Mankali et al., 26 Nov 2024).

5. Prompting, Model Scale, and Language Effects

Prompt engineering, in-context learning (ICL), model scale, and language choice all substantially affect security outcomes:

  • Security-focused instructions and in-context examples, particularly combined with chain-of-thought (CoT) reasoning, can reduce vulnerability rates by up to 75% on some tasks and models (an illustrative prompt template follows this list) (Goetz et al., 13 Aug 2024, Mohsin et al., 18 Jun 2024, Li et al., 6 Jun 2025). However, improvements plateau, and critical vulnerabilities (such as SQL injection, path traversal, and cryptographic misuse) are rarely eliminated across all samples.
  • Model scale correlates with modest improvements in joint correctness-security metrics (6–10% gain in func-sec@k for larger variants), but does not close the security gap (Peng et al., 14 Jan 2025, Pathak et al., 24 Nov 2025). Open-source models achieve security parity with or outperform commercial models on certain tasks.
  • Security outcomes vary sharply by target language: Python generations typically show lower rates of memory-safety flaws, while C/C++ generations are systematically vulnerable to buffer overflows, unchecked return values, and integer overflows (Kharma et al., 3 Feb 2025, Shahid et al., 24 Nov 2025). Java and C++ outputs often rely on outdated or deprecated APIs and fail to leverage modern secure abstractions.
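
As a concrete, purely illustrative example of the security-focused prompting studied in these works, a template that prepends a secure-coding instruction and a contrastive in-context example to the task might look like the following; the cited benchmarks use their own prompt wordings:

```python
# Illustrative prompt scaffolding only; not the templates from the cited studies.
SECURITY_INSTRUCTION = (
    "You are generating production code. Follow secure-coding practice: "
    "validate all external input, use parameterized queries, avoid hard-coded "
    "secrets, check return values, and prefer memory-safe library calls."
)

CONTRASTIVE_EXAMPLE = (
    "# Insecure: cursor.execute(\"SELECT * FROM t WHERE id=\" + user_id)\n"
    "# Secure:   cursor.execute(\"SELECT * FROM t WHERE id=?\", (user_id,))"
)

def build_secure_prompt(task: str) -> str:
    """Prepend a security instruction and a contrastive example to the task."""
    return f"{SECURITY_INSTRUCTION}\n\n{CONTRASTIVE_EXAMPLE}\n\nTask: {task}"
```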

6. Limitations, Open Problems, and Directions for Future Research

Major limitations and open questions identified across the literature:

  • Functional-Security Trade-offs: Securing LLM outputs often results in degraded functionality, removal of code, or outputs that are non-compliant with task specifications. SAFE@k and secure-pass@k metrics reveal that current methods rarely achieve high security without regression in utility (Dai et al., 18 Mar 2025, Pathak et al., 24 Nov 2025).
  • Assessment Gaps: No single static analyzer reliably detects all vulnerability classes; dynamic analysis, fuzzing, and LLM-based judging are needed for broader coverage but remain resource-intensive or imperfect (Peng et al., 14 Jan 2025, Li et al., 6 Jun 2025).
  • Robustness against Adversarial Attacks: Poisoning and backdoor insertion into LLMs for both software and hardware code generation are practical and nearly undetectable by standard evaluation pipelines (Mankali et al., 26 Nov 2024).
  • Educational and Human Factors: Even experienced users rarely identify vulnerabilities in code suggested by LLMs, with >95% of students accepting insecure code in representative tasks; education frameworks such as Bifröst measurably increase skepticism but do not ensure secure adoption (Park et al., 25 Nov 2025).

Recommendations for advancing research and practice converge on treating LLM-generated code as untrusted by default: subjecting it to rigorous multi-stage vetting and to both automated and human-in-the-loop security review before deployment. Ensuring the security of code synthesized by LLMs remains a moving target, demanding continuous benchmarking, proactive defense, and architectural innovation.
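
A minimal sketch of the kind of release gate this recommendation implies, assuming findings have already been collected from a static-analysis pass; the `Finding` type and severity levels are illustrative, not tied to any particular tool:

```python
from dataclasses import dataclass

@dataclass
class Finding:
    rule_id: str   # e.g., a CWE identifier or analyzer rule name
    severity: str  # "low" | "medium" | "high" | "critical"

def release_gate(findings: list[Finding], human_approved: bool) -> bool:
    """Treat LLM output as untrusted: block on any high-severity finding and
    always require an explicit human sign-off before deployment."""
    blocking = [f for f in findings if f.severity in ("high", "critical")]
    return not blocking and human_approved
```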
