- The paper finds that high test pass rates do not guarantee robust code quality or security: the evaluated models produce 1.45–2.11 static analysis issues per passing task.
- It employs diverse Java benchmarks and SonarQube analysis to quantify bugs, vulnerabilities, and code smells across multiple LLMs.
- The study demonstrates the need for integrated static analysis and software composition analysis (SCA) in development workflows to mitigate inherent risks in AI-assisted coding.
Assessing the Quality and Security of AI-Generated Code: A Quantitative Analysis
Introduction
This paper presents a comprehensive quantitative evaluation of the code quality and security of Java code generated by five leading LLMs: Claude Sonnet 4, Claude 3.7 Sonnet, GPT-4o, Llama 3.2 90B, and OpenCoder 8B. The paper moves beyond functional correctness, which has been the primary focus of prior LLM code generation research, and instead systematically analyzes the prevalence and nature of software defects—bugs, security vulnerabilities, and code smells—using static analysis (SonarQube) across 4,442 Java programming tasks. The central findings challenge the assumption that high functional performance correlates with high code quality and security, and highlight systemic limitations in current LLM-based code generation.
Experimental Design and Methodology
The evaluation leverages a large and diverse benchmark suite, including MultiPL-E-mbpp-java, MultiPL-E-humaneval-java, and ComplexCodeEval, to ensure coverage of a broad spectrum of Java programming challenges. Each LLM was prompted with identical instructions (temperature = 0) to generate compilable Java solutions for all tasks. Functional correctness was measured via Pass@1 unit test rates on the MultiPL-E benchmarks, while static analysis was performed on all generated code using the full Sonar way Java ruleset (~550 rules).
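For reference, functional correctness metrics of this kind are usually computed with the standard pass@k estimator from the HumanEval line of work; the formula below is that general definition, not necessarily a verbatim reproduction of the paper's computation. With greedy decoding (temperature = 0) and a single sample per task, pass@1 reduces to the fraction of tasks whose generated solution passes all unit tests.

$$\text{pass@}k \;=\; \mathbb{E}_{\text{tasks}}\!\left[\,1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}\,\right]$$

where $n$ is the number of samples per task and $c$ the number of those samples that pass; for $n = k = 1$ this is simply the mean of the per-task pass indicator.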
The models under test represent a cross-section of proprietary (Claude, GPT-4o) and open-weight (Llama, OpenCoder) architectures, as well as a range of model sizes. This design enables the identification of both model-specific and systemic patterns in code quality and security.
Structural and Functional Characteristics of LLM-Generated Code
The paper reveals significant variance in the structural properties of generated code across models. For example, Claude Sonnet 4 produced the largest codebase (370,816 LOC) with the highest cyclomatic and cognitive complexity, while OpenCoder-8B generated the most concise code (120,288 LOC) with the lowest complexity. Comment density also varied widely, indicating differing tendencies in code documentation.
Functional performance, as measured by test pass rates, ranged from 60.43% (OpenCoder-8B) to 77.04% (Claude Sonnet 4). However, a critical finding is the lack of correlation between functional correctness and code quality: even code that passes all tests contains, on average, 1.45–2.11 static analysis issues per passing task, depending on the model. This decoupling of functional from non-functional quality attributes is a central result.
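To make the decoupling concrete, the snippet below is a hypothetical illustration (not drawn from the paper's dataset) of Java code that satisfies a straightforward unit test yet still carries typical static analysis findings, such as an unused local variable and a comparison against a boolean literal.

```java
// Hypothetical illustration: functionally correct, yet it would still
// accumulate routine static analysis findings.
public class DiscountCalculator {

    // A test asserting apply(100.0, 0.1) == 90.0 passes.
    public double apply(double price, double rate) {
        double original = price;       // unused local variable (code smell)
        boolean valid = rate >= 0.0;
        if (valid == true) {           // redundant comparison with a boolean literal
            return price - price * rate;
        }
        return price;
    }
}
```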
Defect Taxonomy: Bugs, Vulnerabilities, and Code Smells
Issue Density and Distribution
All models produced a substantial number of static analysis issues, with issue densities ranging from 19.48 to 32.45 per KLOC. The distribution of issue types was remarkably consistent: 90–93% code smells, 5–8% bugs, and ~2% security vulnerabilities. This uniformity across diverse architectures and training regimes suggests systemic limitations in current LLM code generation paradigms.
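For readers unfamiliar with the metric, issue density is simply the issue count normalized per thousand lines of code; the worked example below uses made-up numbers that fall inside the reported range.

$$\text{issue density} = \frac{\text{total issues}}{\text{LOC}} \times 1000, \qquad \text{e.g.}\ \frac{2{,}600}{100{,}000} \times 1000 = 26 \ \text{issues/KLOC}.$$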
Code Smells
The most prevalent code smell categories were dead/unused/redundant code, design/framework best-practice violations, assignment/scope visibility issues, and improper use of collections/generics. Notably, dead code was especially common in open-weight models (e.g., 42.74% of code smells in OpenCoder-8B), likely due to the lack of whole-project reference analysis in LLMs. High cognitive complexity and excessive conditional logic were also frequent, reflecting the inability of autoregressive models to optimize for global code structure.
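A minimal, invented example of the dead/unused-code pattern: a private helper that nothing in the file calls and a field that is written but never read. Analyzers flag both, while unit tests, which only exercise the public behavior, never do.

```java
// Invented illustration of the "dead/unused code" smell category.
public class OrderService {

    private int lastProcessedId;   // written below but never read: unused state

    public boolean process(int orderId) {
        lastProcessedId = orderId;
        return orderId > 0;
    }

    // Never called anywhere in the file: dead code that reference-tracking
    // analyzers report but functional tests ignore.
    private String formatOrderLabel(int orderId) {
        return "ORDER-" + orderId;
    }
}
```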
Bugs
Control-flow mistakes dominated the bug category, particularly in GPT-4o (48.15% of its bugs), indicating challenges in non-local path reasoning. API contract violations, exception handling errors, resource management lapses, and type-safety issues were also common. The prevalence of concurrency/threading bugs in larger models (e.g., Claude Sonnet 4) highlights the difficulty of capturing complex, underrepresented programming concepts in training data.
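The following invented snippet sketches two of the bug categories named above rather than reproducing findings from the study: an API-contract/correctness trap (reference comparison of strings, which passes naive tests because string literals are interned) and a resource-management lapse combined with a swallowed exception.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class SessionValidator {

    // Passes a test that calls isAdmin("admin") with a literal, because literals
    // are interned, but fails for runtime-constructed strings; analyzers report
    // this reference comparison as a reliability bug, not a style issue.
    public boolean isAdmin(String role) {
        return role == "admin";
    }

    // Resource-management and exception-handling lapse: the reader is never
    // closed, and the swallowed IOException hides failures from callers.
    public String readFirstLine(String path) {
        try {
            BufferedReader reader = new BufferedReader(new FileReader(path));
            return reader.readLine();
        } catch (IOException e) {
            return null;
        }
    }
}
```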
Security Vulnerabilities
The most critical vulnerabilities included path traversal/content injection, hard-coded credentials, cryptography misconfiguration, and XML External Entity (XXE) flaws. For instance, path traversal vulnerabilities accounted for over 30% of vulnerabilities in several models, and hard-coded credentials were especially frequent in open-weight models (up to 29.85% in OpenCoder-8B). These issues stem from the inability of LLMs to perform non-local taint analysis and to semantically distinguish security-sensitive constants. The presence of deprecated API usage further underscores the risk of training on outdated code corpora.
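A deliberately simplified, hypothetical example combining two of the vulnerability classes above: a user-supplied file name concatenated into a filesystem path (path traversal) and a hard-coded credential. Neither is visible to functional tests, but both are classic static analysis security findings; a safer variant would resolve and normalize the path, verify it stays under the intended base directory, and load the credential from configuration.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ReportStore {

    // Hard-coded credential: flagged as a vulnerability, invisible to unit tests.
    private static final String DB_PASSWORD = "changeme123";

    // Path traversal: a fileName such as "../../etc/passwd" escapes the intended
    // directory because the user-supplied value is concatenated without validation.
    public byte[] load(String fileName) throws Exception {
        Path path = Paths.get("/var/reports/" + fileName);
        return Files.readAllBytes(path);
    }
}
```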
Severity Analysis
A significant proportion of bugs and vulnerabilities were classified as BLOCKER or CRITICAL by SonarQube. For vulnerabilities, 60–70% were BLOCKER-level, indicating immediate and severe security risks. Notably, improvements in functional performance (e.g., from Claude 3.7 Sonnet to Claude Sonnet 4) were accompanied by an increase in the proportion of severe bugs and vulnerabilities, contradicting the expectation that newer or larger models inherently produce safer code.
Static Analysis as a Safeguard
The paper demonstrates that static analysis tools such as SonarQube are effective in systematically identifying latent defects in LLM-generated code, including those that are not apparent from functional testing. Automated detection of hardcoded credentials, resource leaks, and excessive complexity provides a critical safety net, especially as LLMs are increasingly integrated into CI/CD pipelines. However, static analysis alone is insufficient for managing risks associated with outdated dependencies; complementary Software Composition Analysis (SCA) is required to detect vulnerable third-party libraries.
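As a generic sketch of how such findings translate into fixes (not a description of the paper's tooling), the resource-leak pattern shown earlier is typically remediated with try-with-resources, after which the corresponding rule no longer fires and the error signal is preserved for callers:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class ConfigLoader {

    // try-with-resources closes the reader on every path, and propagating
    // IOException keeps the failure visible instead of silently returning null.
    public String readFirstLine(String path) throws IOException {
        try (BufferedReader reader = new BufferedReader(new FileReader(path))) {
            return reader.readLine();
        }
    }
}
```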
Implications and Future Directions
Practical Implications
- LLM-generated code is not production-ready by default: All evaluated models, regardless of size or provenance, introduce a non-trivial number of maintainability, reliability, and security defects.
- Functional correctness is not a proxy for code quality or security: High test pass rates do not guarantee the absence of critical bugs or vulnerabilities.
- Model selection should consider defect profiles, not just benchmark scores: Smaller or open-weight models may produce cleaner code for certain tasks, challenging the assumption that larger models are always preferable.
- Automated static analysis and SCA are essential: These tools must be integrated into development workflows to mitigate the systemic risks of LLM-generated code.
Theoretical Implications
The findings suggest that current LLMs, trained to replicate statistical patterns in code, lack the semantic reasoning required for robust software engineering. The persistence and evolution of defect profiles across model generations indicate that improvements in functional performance do not translate to improvements in non-functional attributes. This highlights the need for research into architectures and training methodologies that explicitly target code quality and security, potentially incorporating program analysis, formal verification, or hybrid neuro-symbolic approaches.
Future Research Directions
- Investigating prompt engineering and fine-tuning strategies to reduce defect rates.
- Longitudinal studies on the maintainability and technical debt of LLM-augmented codebases.
- Evaluating the ability of LLMs to autonomously refactor or repair their own code in response to static analysis feedback.
- Cross-lingual studies to determine whether observed weaknesses are language-agnostic or language-specific.
- Developing LLM architectures with explicit mechanisms for semantic reasoning and security awareness.
Conclusion
This paper provides a rigorous, quantitative assessment of the quality and security of LLM-generated Java code, revealing systemic limitations that persist across model architectures and sizes. The decoupling of functional correctness from code quality and security underscores the necessity of integrating automated analysis tools and expert review into AI-assisted software development workflows. The results call for a shift in both research and practice: from a focus on functional benchmarks to a holistic evaluation of code quality, security, and maintainability. Addressing these challenges will be critical for the responsible and effective adoption of LLMs in software engineering.