- The paper presents a thorough quantitative analysis of LLM-generated Java code, revealing systemic defects in quality and security despite high functional performance.
- The paper employs rigorous static analysis via SonarQube on 4,442 coding assignments, providing detailed metrics on issue density, code smells, and critical vulnerabilities.
- The paper demonstrates that high functional correctness does not guarantee production readiness, advocating for combined static analysis, SCA, and manual review.
Assessing the Quality and Security of AI-Generated Code: A Quantitative Analysis
Introduction
This paper presents a comprehensive quantitative evaluation of code quality and security in Java code generated by five leading LLMs: Claude Sonnet 4, Claude 3.7 Sonnet, GPT-4o, Llama 3.2 90B, and OpenCoder-8B. The paper leverages static analysis via SonarQube on 4,442 Java coding assignments, focusing on the prevalence and nature of software defects, including bugs, security vulnerabilities, and code smells. The analysis is motivated by the increasing adoption of LLMs in software engineering and the need to understand the latent risks associated with their output, especially as functional correctness alone is insufficient for production-readiness.
Experimental Design and Methodology
The evaluation utilizes three benchmark datasets (MultiPL-E-mbpp-java, MultiPL-E-humaneval-java, ComplexCodeEval) to ensure coverage of diverse Java programming challenges. Each LLM is prompted identically (temperature=0) to generate compilable Java solutions for all tasks. Functional correctness is measured by Pass@1 unit test rates on MultiPL-E benchmarks, while SonarQube's default SonarWay Java ruleset (≈550 rules) is applied to all generated code for static analysis. This dual approach enables assessment of both functional and non-functional attributes.
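As a concrete illustration of this setup, the sketch below shows how Pass@1 reduces to a simple pass fraction when each task receives exactly one greedy (temperature=0) completion. The class, record, and task identifiers are hypothetical and are not part of the paper's actual harness.

```java
// Illustrative only: Pass@1 under greedy decoding, one generated solution per task.
import java.util.List;

public class Pass1Harness {

    /** A benchmark task paired with the outcome of its single greedy completion. */
    record TaskResult(String taskId, boolean compiled, boolean allTestsPassed) {}

    /** With one sample per task, Pass@1 is simply the fraction of tasks whose
     *  generated solution compiles and passes every unit test. */
    static double passAt1(List<TaskResult> results) {
        long passed = results.stream()
                .filter(r -> r.compiled() && r.allTestsPassed())
                .count();
        return (double) passed / results.size();
    }

    public static void main(String[] args) {
        List<TaskResult> demo = List.of(
                new TaskResult("mbpp-java/1", true, true),
                new TaskResult("mbpp-java/2", true, false),
                new TaskResult("humaneval-java/3", false, false));
        System.out.printf("Pass@1 = %.2f%n", passAt1(demo)); // prints 0.33
    }
}
```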
Structural and Functional Characteristics
The models exhibit significant variance in code volume, complexity, and comment density. For example, Claude Sonnet 4 produces the most code (370,816 LOC) and the highest cyclomatic complexity, while OpenCoder-8B generates the most concise code (120,288 LOC) with the lowest complexity. Comment density ranges from 4.4% (GPT-4o) to 16.4% (Claude 3.7 Sonnet). These differences have direct implications for maintainability and technical debt, independent of functional performance.
Functional pass rates vary (Claude Sonnet 4: 77.04%, OpenCoder-8B: 60.43%), but all models generate code with latent defects even when passing all tests. The normalized metric "issues per passing task" reveals that higher functional performance does not equate to higher code quality or security, contradicting common assumptions in LLM evaluation.
Defect Density and Distribution
Static analysis reveals that all models produce a mix of code smells (≈90–93%), bugs (5–8%), and security vulnerabilities (≈2%). Issue density per KLOC ranges from 19.48 (Claude Sonnet 4) to 32.45 (OpenCoder-8B). The distribution of defect types is consistent across models, indicating systemic weaknesses in current LLM code generation paradigms.
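Both normalizations used here reduce to simple ratios, as in the sketch below. The figures in the example are illustrative: the issue count is back-computed from the reported 19.48 issues/KLOC and 370,816 LOC, and the passing-task count is hypothetical.

```java
// Illustrative sketch of the two defect-density normalizations.
public class DefectDensity {

    /** Issues per thousand lines of code (KLOC). */
    static double issuesPerKloc(long totalIssues, long linesOfCode) {
        return totalIssues / (linesOfCode / 1000.0);
    }

    /** Issues per functionally passing task: relates latent defects to
     *  solutions that already pass all unit tests. */
    static double issuesPerPassingTask(long totalIssues, long passingTasks) {
        return (double) totalIssues / passingTasks;
    }

    public static void main(String[] args) {
        // ~7,224 issues over 370,816 LOC is back-computed from the reported
        // 19.48 issues/KLOC; the passing-task count below is hypothetical.
        System.out.printf("%.2f issues/KLOC%n", issuesPerKloc(7_224, 370_816));
        System.out.printf("%.2f issues per passing task%n", issuesPerPassingTask(7_224, 3_000));
    }
}
```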
Severity analysis shows that a substantial fraction of bugs and vulnerabilities are classified as BLOCKER or CRITICAL. For vulnerabilities, 60–70% are BLOCKER-level, underscoring the risk of deploying LLM-generated code without rigorous review.
Analysis of Code Smells
The most prevalent code smell categories are dead/unused/redundant code (up to 42.74% in OpenCoder-8B), design/framework best-practices, assignment/field/scope visibility, and collection/generics/type issues. These reflect LLM limitations in non-local reference analysis, context-specific design conventions, and semantic API understanding. The frequent use of deprecated APIs and generic exception handling further highlights the need for complementary Software Composition Analysis (SCA) and project-specific review.
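The fragment below condenses several of these smell categories into one hand-written example; it is not output from any of the evaluated models, and whether an analyzer flags each line depends on its configuration.

```java
import java.util.ArrayList;
import java.util.Date;
import java.util.List;

public class SmellExamples {

    private int unusedCounter;              // dead/unused code: field is never read

    List collect(List<String> input) {      // collection/generics issue: raw types discard type safety
        List result = new ArrayList();
        for (String s : input) {
            String trimmed = s.trim();      // dead code: computed but never used
            result.add(s);
        }
        return result;
    }

    long timestamp() {
        Date d = new Date(2024, 1, 1);      // deprecated API: this constructor has been deprecated since JDK 1.1
        return d.getTime();
    }
}
```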
Analysis of Bugs
Control-flow mistakes dominate bug categories in GPT-4o (48.15%) and are significant in other models. API contract violations, exception handling errors, resource management lapses, and type-safety issues are common, reflecting challenges in deep path reasoning, error propagation, and lifecycle management. Concurrency bugs are more frequent in Claude Sonnet 4, likely due to underrepresentation in training corpora. Null/data-value issues and performance/structure bugs are persistent across models.
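The following hand-written snippet illustrates two of these categories, a control-flow mistake and a resource-management lapse; it mirrors typical analyzer findings rather than reproducing actual model output.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class BugExamples {

    /** Control-flow mistake: the else-if condition can never be true, because
     *  balance >= 0 already covers zero, so "empty" is never returned. */
    static String classify(int balance) {
        if (balance >= 0) {
            return "ok";
        } else if (balance == 0) {   // always false on this path
            return "empty";
        }
        return "overdrawn";
    }

    /** Resource-management lapse: the reader is never closed if readLine() throws,
     *  leaking a file handle; try-with-resources is the idiomatic fix. */
    static String firstLine(String path) throws IOException {
        BufferedReader reader = new BufferedReader(new FileReader(path));
        String line = reader.readLine();
        reader.close();              // skipped entirely when an exception propagates
        return line;
    }
}
```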
Analysis of Security Vulnerabilities
Path traversal and content injection vulnerabilities are the most common (≈30–34%), followed by hard-coded credentials (up to 29.85% in OpenCoder-8B) and cryptography misconfiguration. These vulnerabilities require non-local taint analysis and semantic understanding of sensitive data, which current LLMs lack. The presence of XML External Entity (XXE) flaws, inadequate I/O error handling, and certificate-validation omissions further demonstrates the inability of LLMs to consistently generate secure code, especially when training data includes insecure patterns.
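The sketch below shows the path-traversal pattern in its vulnerable and mitigated forms, making explicit the non-local, data-flow-dependent check a model would need to reason about; the base directory and method names are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class FileService {

    private static final Path BASE_DIR = Path.of("/srv/app/uploads");

    /** Vulnerable: the user-supplied name is joined to the base directory without
     *  validation, so an input like "../../etc/passwd" escapes the intended root. */
    static byte[] readUnchecked(String userSuppliedName) throws IOException {
        return Files.readAllBytes(BASE_DIR.resolve(userSuppliedName));
    }

    /** Mitigated: normalize the resolved path and verify it remains inside BASE_DIR
     *  before touching the file system. */
    static byte[] readChecked(String userSuppliedName) throws IOException {
        Path resolved = BASE_DIR.resolve(userSuppliedName).normalize();
        if (!resolved.startsWith(BASE_DIR)) {
            throw new IllegalArgumentException("Path escapes upload directory");
        }
        return Files.readAllBytes(resolved);
    }
}
```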
SonarQube Issue Examples
Concrete examples include:
- Generic exception usage (java:S112): Indicates lack of specificity in error handling.
- Resource management lapses (java:S2095): Failure to close streams/resources.
- Hardcoded credentials (java:S6437): Direct embedding of sensitive data.
- High cognitive complexity (java:S3776): Methods exceeding maintainability thresholds.
- Redundant code structures (java:S2094): Empty classes/methods.
These issues are systemic and not isolated to specific models, reflecting limitations in context awareness and semantic reasoning.
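For concreteness, the snippet below condenses triggers for most of the rules listed above (the cognitive-complexity rule java:S3776 is omitted, as it requires a long method to demonstrate). The rule IDs are those cited in the paper; whether a particular SonarQube configuration flags each line may vary.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

class AuditPlaceholder {}                                    // java:S2094 — empty class adds no behavior

public class SonarIssueExamples {

    static Connection connect() throws SQLException {
        // java:S6437 — credentials embedded directly in source
        return DriverManager.getConnection("jdbc:postgresql://db/app", "admin", "admin123");
    }

    static int firstByte(String path) {
        try {
            FileInputStream in = new FileInputStream(path);  // java:S2095 — stream is never closed
            return in.read();
        } catch (IOException e) {
            throw new RuntimeException(e);                   // java:S112 — generic exception instead of a dedicated one
        }
    }
}
```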
Model Evolution and Defect Trends
Comparative analysis of Claude 3.7 Sonnet and Claude Sonnet 4 shows that improved functional performance can coincide with increased severity of bugs and vulnerabilities. The newer model has a higher proportion of BLOCKER bugs and vulnerabilities, indicating that advances in benchmark scores do not guarantee improvements in non-functional attributes. The defect profile evolves rather than resolves, necessitating ongoing vigilance.
Static Analysis as a Safeguard
Static analysis tools like SonarQube are effective in detecting critical flaws in LLM-generated code, providing automated, consistent, and scalable quality assurance. Their integration into CI/CD pipelines is essential for responsible AI adoption in software engineering. However, static analysis must be complemented by SCA to address risks from outdated dependencies and known CVEs, which are frequently reproduced by LLMs due to training data limitations.
Implications and Future Directions
The findings have several practical and theoretical implications:
- Functional correctness is not a proxy for code quality or security. Model selection should consider defect profiles, not just benchmark scores.
- Systemic weaknesses are inherent to current LLM methodologies. Improvements in architecture or scale do not eliminate latent risks.
- Automated static analysis and SCA are mandatory for production-readiness. Manual review alone is insufficient given the volume and subtlety of defects.
- Prompt engineering and fine-tuning may mitigate but not eliminate risks. Research into architectures that incorporate semantic analysis and security awareness is needed.
Future work should explore prompt and training strategies for defect mitigation, longitudinal studies on technical debt in LLM-assisted projects, autonomous LLM-driven refactoring, cross-language analyses, and architectural innovations targeting robust code generation.
Conclusion
This paper demonstrates that LLM-generated code, while functionally capable, is consistently affected by maintainability, reliability, and security defects. These issues are systemic, not model-specific, and are best understood as features of current generative paradigms. The lack of correlation between functional performance and code quality underscores the necessity of rigorous static analysis and SCA in any workflow involving LLMs. As AI becomes integral to software engineering, informed vigilance and automated safeguards are essential to uphold standards of quality, security, and maintainability.