
Assessing the Quality and Security of AI-Generated Code: A Quantitative Analysis (2508.14727v1)

Published 20 Aug 2025 in cs.SE and cs.LG

Abstract: This study presents a quantitative evaluation of the code quality and security of five prominent LLMs: Claude Sonnet 4, Claude 3.7 Sonnet, GPT-4o, Llama 3.2 90B, and OpenCoder 8B. While prior research has assessed the functional performance of LLM-generated code, this research tested LLM output from 4,442 Java coding assignments through comprehensive static analysis using SonarQube. The findings suggest that although LLMs can generate functional code, they also introduce a range of software defects, including bugs, security vulnerabilities, and code smells. These defects do not appear to be isolated; rather, they may represent shared weaknesses stemming from systemic limitations within current LLM code generation methods. In particular, critically severe issues, such as hard-coded passwords and path traversal vulnerabilities, were observed across multiple models. These results indicate that LLM-generated code requires verification in order to be considered production-ready. This study found no direct correlation between a model's functional performance (measured by Pass@1 rate of unit tests) and the overall quality and security of its generated code, measured by the number of SonarQube issues in benchmark solutions that passed the functional tests. This suggests that functional benchmark performance score is not a good indicator of overall code quality and security. The goal of this study is not to rank LLM performance but to highlight that all evaluated models appear to share certain weaknesses. Consequently, these findings support the view that static analysis can be a valuable instrument for detecting latent defects and an important safeguard for organizations that deploy AI in software development.

Summary

  • The paper finds that high test pass rates do not guarantee robust code quality or security, with models producing 1.45–2.11 static analysis issues per passing task.
  • It employs diverse Java benchmarks and SonarQube analysis to quantify bugs, vulnerabilities, and code smells across multiple LLMs.
  • The study demonstrates the need to integrate static analysis and Software Composition Analysis (SCA) into development workflows to mitigate inherent risks in AI-assisted coding.

Assessing the Quality and Security of AI-Generated Code: A Quantitative Analysis

Introduction

This paper presents a comprehensive quantitative evaluation of the code quality and security of Java code generated by five leading LLMs: Claude Sonnet 4, Claude 3.7 Sonnet, GPT-4o, Llama 3.2 90B, and OpenCoder 8B. The paper moves beyond functional correctness, which has been the primary focus of prior LLM code generation research, and instead systematically analyzes the prevalence and nature of software defects—bugs, security vulnerabilities, and code smells—using static analysis (SonarQube) across 4,442 Java programming tasks. The central findings challenge the assumption that high functional performance correlates with high code quality and security, and highlight systemic limitations in current LLM-based code generation.

Experimental Design and Methodology

The evaluation leverages a large and diverse benchmark suite, including MultiPL-E-mbpp-java, MultiPL-E-humaneval-java, and ComplexCodeEval, to ensure coverage of a broad spectrum of Java programming challenges. Each LLM was prompted with identical instructions (temperature=0) to generate compilable Java solutions for all tasks. Functional correctness was measured via Pass@1 unit test rates on the MultiPL-E benchmarks, while static analysis was performed on all generated code using the full SonarWay Java ruleset (~550 rules).
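For reference, Pass@k is conventionally estimated as in Chen et al. (2021); the paper's exact sampling setup beyond temperature=0 is not restated here, so the single-sample reading below is an assumption:

\[
\text{pass@}k \;=\; \mathbb{E}_{\text{tasks}}\!\left[\, 1 - \binom{n-c}{k} \Big/ \binom{n}{k} \right],
\]

where \(n\) is the number of solutions sampled per task and \(c\) is the number that pass all unit tests. With greedy decoding at temperature 0 and a single solution per task, Pass@1 reduces to the fraction of tasks whose sole generated solution passes its unit tests.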

The models under test represent a cross-section of proprietary (Claude, GPT-4o) and open-weight (Llama, OpenCoder) architectures, as well as a range of model sizes. This design enables the identification of both model-specific and systemic patterns in code quality and security.

Structural and Functional Characteristics of LLM-Generated Code

The paper reveals significant variance in the structural properties of generated code across models. For example, Claude Sonnet 4 produced the largest codebase (370,816 LOC) with the highest cyclomatic and cognitive complexity, while OpenCoder-8B generated the most concise code (120,288 LOC) with the lowest complexity. Comment density also varied widely, indicating differing tendencies in code documentation.

Functional performance, as measured by test pass rates, ranged from 77.04% (Claude Sonnet 4) to 60.43% (OpenCoder-8B). However, a critical finding is the lack of correlation between functional correctness and code quality: even code that passes all tests contains, on average, 1.45–2.11 static analysis issues per passing task, depending on the model. This decoupling of functional and non-functional quality attributes is a central result.

Defect Taxonomy: Bugs, Vulnerabilities, and Code Smells

Issue Density and Distribution

All models produced a substantial number of static analysis issues, with issue densities ranging from 19.48 to 32.45 per KLOC. The distribution of issue types was remarkably consistent: 90–93% code smells, 5–8% bugs, and ~2% security vulnerabilities. This uniformity across diverse architectures and training regimes suggests systemic limitations in current LLM code generation paradigms.
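As a reading aid (an illustrative calculation, not a figure from the paper), issue density normalizes raw issue counts by code size:

\[
\text{issue density} = \frac{\#\,\text{SonarQube issues}}{\text{LOC} / 1000},
\]

so a hypothetical 300 KLOC codebase at 20 issues per KLOC would carry roughly 6,000 flagged issues.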

Code Smells

The most prevalent code smell categories were dead/unused/redundant code, design/framework best-practice violations, assignment/scope visibility issues, and improper use of collections/generics. Notably, dead code was especially common in open-weight models (e.g., 42.74% of code smells in OpenCoder-8B), likely due to the lack of whole-project reference analysis in LLMs. High cognitive complexity and excessive conditional logic were also frequent, reflecting the inability of autoregressive models to optimize for global code structure.
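The snippet below is a hypothetical illustration (class and method names invented, not an excerpt from any model's output) of two of the smell categories above: an unused local variable (dead code) and deeply nested conditional logic of the kind that SonarQube's default Java rules flag for excessive cognitive complexity.

```java
// Illustrative only: typical smells of the kinds discussed above,
// not taken from the paper's generated solutions.
public class DiscountCalculator {

    public double discountFor(int quantity, boolean isMember) {
        double unused = quantity * 0.01; // dead code: the value is never read

        // Deeply nested conditionals where early returns or a flat rule
        // table would keep cognitive complexity low.
        if (quantity > 0) {
            if (isMember) {
                if (quantity > 100) {
                    return 0.15;
                } else {
                    if (quantity > 10) {
                        return 0.10;
                    } else {
                        return 0.05;
                    }
                }
            } else {
                if (quantity > 100) {
                    return 0.05;
                }
            }
        }
        return 0.0;
    }
}
```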

Bugs

Control-flow mistakes dominated the bug category, particularly in GPT-4o (48.15% of its bugs), indicating challenges in non-local path reasoning. API contract violations, exception handling errors, resource management lapses, and type-safety issues were also common. The prevalence of concurrency/threading bugs in larger models (e.g., Claude Sonnet 4) highlights the difficulty of capturing complex, underrepresented programming concepts in training data.
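As a hypothetical illustration of these bug classes (again, not drawn from the paper's data), the sketch below shows a resource that leaks on the exception path and a null return that invites a NullPointerException in callers; both patterns fall under default SonarQube reliability rules.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

// Illustrative only: bug patterns of the kinds discussed above.
public class ConfigLoader {

    // Resource-management lapse: if readLine() throws, the reader is never
    // closed. A try-with-resources statement would prevent the leak.
    public String firstLine(String path) throws IOException {
        BufferedReader reader = new BufferedReader(new FileReader(path));
        String line = reader.readLine();
        reader.close();
        return line;
    }

    // Null-handling mistake: returns null for missing keys, inviting a
    // NullPointerException in callers that chain calls on the result.
    public String valueFor(String raw, String key) {
        for (String pair : raw.split(";")) {
            String[] kv = pair.split("=");
            if (kv.length == 2 && kv[0].trim().equals(key)) {
                return kv[1].trim();
            }
        }
        return null; // a default value or Optional.empty() would be safer
    }
}
```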

Security Vulnerabilities

The most critical vulnerabilities included path traversal/content injection, hard-coded credentials, cryptography misconfiguration, and XML External Entity (XXE) flaws. For instance, path traversal vulnerabilities accounted for over 30% of vulnerabilities in several models, and hard-coded credentials were especially frequent in open-weight models (up to 29.85% in OpenCoder-8B). These issues stem from the inability of LLMs to perform non-local taint analysis and to semantically distinguish security-sensitive constants. The presence of deprecated API usage further underscores the risk of training on outdated code corpora.
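The following hypothetical sketch (invented class, connection string, and credential, not taken from the paper's data) illustrates two of the vulnerability classes above: a hard-coded database password and a path traversal through unvalidated user input, with the usual mitigations noted in comments.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

// Illustrative only: vulnerability patterns of the kinds discussed above,
// with invented names and values.
public class ReportService {

    // Hard-coded credential: SonarQube flags literals used as passwords;
    // secrets belong in a vault or in environment configuration instead.
    private static final String DB_PASSWORD = "changeme123";

    public Connection connect() throws SQLException {
        return DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/reports", "reports_user", DB_PASSWORD);
    }

    // Path traversal: the file name comes straight from user input, so a
    // value such as "../../etc/passwd" escapes the intended directory.
    // Mitigation (not shown): normalize the resolved path and verify it
    // still starts with the base directory before reading.
    public byte[] readReport(String userSuppliedName) throws IOException {
        Path requested = Paths.get("/var/reports", userSuppliedName);
        return Files.readAllBytes(requested);
    }
}
```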

Severity Analysis

A significant proportion of bugs and vulnerabilities were classified as BLOCKER or CRITICAL by SonarQube. For vulnerabilities, 60–70% were BLOCKER-level, indicating immediate and severe security risks. Notably, improvements in functional performance (e.g., from Claude 3.7 Sonnet to Claude Sonnet 4) were accompanied by an increase in the proportion of severe bugs and vulnerabilities, contradicting the expectation that newer or larger models inherently produce safer code.

Static Analysis as a Safeguard

The paper demonstrates that static analysis tools such as SonarQube are effective in systematically identifying latent defects in LLM-generated code, including those that are not apparent from functional testing. Automated detection of hardcoded credentials, resource leaks, and excessive complexity provides a critical safety net, especially as LLMs are increasingly integrated into CI/CD pipelines. However, static analysis alone is insufficient for managing risks associated with outdated dependencies; complementary Software Composition Analysis (SCA) is required to detect vulnerable third-party libraries.

Implications and Future Directions

Practical Implications

  • LLM-generated code is not production-ready by default: All evaluated models, regardless of size or provenance, introduce a non-trivial number of maintainability, reliability, and security defects.
  • Functional correctness is not a proxy for code quality or security: High test pass rates do not guarantee the absence of critical bugs or vulnerabilities.
  • Model selection should consider defect profiles, not just benchmark scores: Smaller or open-weight models may produce cleaner code for certain tasks, challenging the assumption that larger models are always preferable.
  • Automated static analysis and SCA are essential: These tools must be integrated into development workflows to mitigate the systemic risks of LLM-generated code.

Theoretical Implications

The findings suggest that current LLMs, trained to replicate statistical patterns in code, lack the semantic reasoning required for robust software engineering. The persistence and evolution of defect profiles across model generations indicate that improvements in functional performance do not translate to improvements in non-functional attributes. This highlights the need for research into architectures and training methodologies that explicitly target code quality and security, potentially incorporating program analysis, formal verification, or hybrid neuro-symbolic approaches.

Future Research Directions

  • Investigating prompt engineering and fine-tuning strategies to reduce defect rates.
  • Longitudinal studies on the maintainability and technical debt of LLM-augmented codebases.
  • Evaluating the ability of LLMs to autonomously refactor or repair their own code in response to static analysis feedback.
  • Cross-lingual studies to determine whether observed weaknesses are language-agnostic or language-specific.
  • Developing LLM architectures with explicit mechanisms for semantic reasoning and security awareness.

Conclusion

This paper provides a rigorous, quantitative assessment of the quality and security of LLM-generated Java code, revealing systemic limitations that persist across model architectures and sizes. The decoupling of functional correctness from code quality and security underscores the necessity of integrating automated analysis tools and expert review into AI-assisted software development workflows. The results call for a shift in both research and practice: from a focus on functional benchmarks to a holistic evaluation of code quality, security, and maintainability. Addressing these challenges will be critical for the responsible and effective adoption of LLMs in software engineering.