- The paper presents a thorough quantitative analysis of LLM-generated Java code, revealing systemic defects in quality and security despite high functional performance.
- The paper employs rigorous static analysis via SonarQube on 4,442 coding assignments, providing detailed metrics on issue density, code smells, and critical vulnerabilities.
- The paper demonstrates that high functional correctness does not guarantee production readiness, advocating for combined static analysis, SCA, and manual review.
Assessing the Quality and Security of AI-Generated Code: A Quantitative Analysis
Introduction
This paper presents a comprehensive quantitative evaluation of code quality and security in Java code generated by five leading LLMs: Claude Sonnet 4, Claude 3.7 Sonnet, GPT-4o, Llama 3.2 90B, and OpenCoder-8B. The paper leverages static analysis via SonarQube on 4,442 Java coding assignments, focusing on the prevalence and nature of software defects, including bugs, security vulnerabilities, and code smells. The analysis is motivated by the increasing adoption of LLMs in software engineering and the need to understand the latent risks associated with their output, especially as functional correctness alone is insufficient for production-readiness.
Experimental Design and Methodology
The evaluation utilizes three benchmark datasets (MultiPL-E-mbpp-java, MultiPL-E-humaneval-java, ComplexCodeEval) to ensure coverage of diverse Java programming challenges. Each LLM is prompted identically (temperature=0) to generate compilable Java solutions for all tasks. Functional correctness is measured by Pass@1 unit test rates on MultiPL-E benchmarks, while SonarQube's default SonarWay Java ruleset (≈550 rules) is applied to all generated code for static analysis. This dual approach enables assessment of both functional and non-functional attributes.
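As a concrete illustration of this setup, the sketch below shows how Pass@1 reduces to a simple pass fraction when each task receives exactly one greedy (temperature=0) completion. The class, record, and task identifiers are hypothetical and are not part of the paper's actual harness.

```java
// Illustrative only: Pass@1 under greedy decoding, one generated solution per task.
import java.util.List;

public class Pass1Harness {

    /** A benchmark task paired with the outcome of its single greedy completion. */
    record TaskResult(String taskId, boolean compiled, boolean allTestsPassed) {}

    /** With one sample per task, Pass@1 is simply the fraction of tasks whose
     *  generated solution compiles and passes every unit test. */
    static double passAt1(List<TaskResult> results) {
        long passed = results.stream()
                .filter(r -> r.compiled() && r.allTestsPassed())
                .count();
        return (double) passed / results.size();
    }

    public static void main(String[] args) {
        List<TaskResult> demo = List.of(
                new TaskResult("mbpp-java/1", true, true),
                new TaskResult("mbpp-java/2", true, false),
                new TaskResult("humaneval-java/3", false, false));
        System.out.printf("Pass@1 = %.2f%n", passAt1(demo)); // prints 0.33
    }
}
```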
Structural and Functional Characteristics
The models exhibit significant variance in code volume, complexity, and comment density. For example, Claude Sonnet 4 produces the most code (370,816 LOC) and the highest cyclomatic complexity, while OpenCoder-8B generates the most concise code (120,288 LOC) with the lowest complexity. Comment density ranges from 4.4% (GPT-4o) to 16.4% (Claude 3.7 Sonnet). These differences have direct implications for maintainability and technical debt, independent of functional performance.
Functional pass rates vary (Claude Sonnet 4: 77.04%, OpenCoder-8B: 60.43%), but all models generate code with latent defects even when passing all tests. The normalized metric "issues per passing task" reveals that higher functional performance does not equate to higher code quality or security, contradicting common assumptions in LLM evaluation.
Defect Density and Distribution
Static analysis reveals that all models produce a mix of code smells (≈90–93%), bugs (5–8%), and security vulnerabilities (≈2%). Issue density per KLOC ranges from 19.48 (Claude Sonnet 4) to 32.45 (OpenCoder-8B). The distribution of defect types is consistent across models, indicating systemic weaknesses in current LLM code generation paradigms.
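Both normalizations used here reduce to simple ratios, as in the sketch below. The figures in the example are illustrative: the issue count is back-computed from the reported 19.48 issues/KLOC and 370,816 LOC, and the passing-task count is hypothetical.

```java
// Illustrative sketch of the two defect-density normalizations.
public class DefectDensity {

    /** Issues per thousand lines of code (KLOC). */
    static double issuesPerKloc(long totalIssues, long linesOfCode) {
        return totalIssues / (linesOfCode / 1000.0);
    }

    /** Issues per functionally passing task: relates latent defects to
     *  solutions that already pass all unit tests. */
    static double issuesPerPassingTask(long totalIssues, long passingTasks) {
        return (double) totalIssues / passingTasks;
    }

    public static void main(String[] args) {
        // ~7,224 issues over 370,816 LOC is back-computed from the reported
        // 19.48 issues/KLOC; the passing-task count below is hypothetical.
        System.out.printf("%.2f issues/KLOC%n", issuesPerKloc(7_224, 370_816));
        System.out.printf("%.2f issues per passing task%n", issuesPerPassingTask(7_224, 3_000));
    }
}
```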
Severity analysis shows that a substantial fraction of bugs and vulnerabilities are classified as BLOCKER or CRITICAL. For vulnerabilities, 60–70% are BLOCKER-level, underscoring the risk of deploying LLM-generated code without rigorous review.
Analysis of Code Smells
The most prevalent code smell categories are dead/unused/redundant code (up to 42.74% in OpenCoder-8B), design/framework best-practices, assignment/field/scope visibility, and collection/generics/type issues. These reflect LLM limitations in non-local reference analysis, context-specific design conventions, and semantic API understanding. The frequent use of deprecated APIs and generic exception handling further highlights the need for complementary Software Composition Analysis (SCA) and project-specific review.
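The fragment below condenses several of these smell categories into one hand-written example; it is not output from any of the evaluated models, and whether an analyzer flags each line depends on its configuration.

```java
import java.util.ArrayList;
import java.util.Date;
import java.util.List;

public class SmellExamples {

    private int unusedCounter;              // dead/unused code: field is never read

    List collect(List<String> input) {      // collection/generics issue: raw types discard type safety
        List result = new ArrayList();
        for (String s : input) {
            String trimmed = s.trim();      // dead code: computed but never used
            result.add(s);
        }
        return result;
    }

    long timestamp() {
        Date d = new Date(2024, 1, 1);      // deprecated API: this constructor has been deprecated since JDK 1.1
        return d.getTime();
    }
}
```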
Analysis of Bugs
Control-flow mistakes dominate bug categories in GPT-4o (48.15%) and are significant in other models. API contract violations, exception handling errors, resource management lapses, and type-safety issues are common, reflecting challenges in deep path reasoning, error propagation, and lifecycle management. Concurrency bugs are more frequent in Claude Sonnet 4, likely due to underrepresentation in training corpora. Null/data-value issues and performance/structure bugs are persistent across models.
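The following hand-written snippet illustrates two of these categories, a control-flow mistake and a resource-management lapse; it mirrors typical analyzer findings rather than reproducing actual model output.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class BugExamples {

    /** Control-flow mistake: the else-if condition can never be true, because
     *  balance >= 0 already covers zero, so "empty" is never returned. */
    static String classify(int balance) {
        if (balance >= 0) {
            return "ok";
        } else if (balance == 0) {   // always false on this path
            return "empty";
        }
        return "overdrawn";
    }

    /** Resource-management lapse: the reader is never closed if readLine() throws,
     *  leaking a file handle; try-with-resources is the idiomatic fix. */
    static String firstLine(String path) throws IOException {
        BufferedReader reader = new BufferedReader(new FileReader(path));
        String line = reader.readLine();
        reader.close();              // skipped entirely when an exception propagates
        return line;
    }
}
```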
Analysis of Security Vulnerabilities
Path traversal and content injection vulnerabilities are the most common (≈30–34%), followed by hard-coded credentials (up to 29.85% in OpenCoder-8B) and cryptography misconfiguration. These vulnerabilities require non-local taint analysis and semantic understanding of sensitive data, which current LLMs lack. The presence of XML External Entity (XXE) flaws, inadequate I/O error handling, and certificate-validation omissions further demonstrates the inability of LLMs to consistently generate secure code, especially when training data includes insecure patterns.
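The sketch below shows the path-traversal pattern in its vulnerable and mitigated forms, making explicit the non-local, data-flow-dependent check a model would need to reason about; the base directory and method names are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class FileService {

    private static final Path BASE_DIR = Path.of("/srv/app/uploads");

    /** Vulnerable: the user-supplied name is joined to the base directory without
     *  validation, so an input like "../../etc/passwd" escapes the intended root. */
    static byte[] readUnchecked(String userSuppliedName) throws IOException {
        return Files.readAllBytes(BASE_DIR.resolve(userSuppliedName));
    }

    /** Mitigated: normalize the resolved path and verify it remains inside BASE_DIR
     *  before touching the file system. */
    static byte[] readChecked(String userSuppliedName) throws IOException {
        Path resolved = BASE_DIR.resolve(userSuppliedName).normalize();
        if (!resolved.startsWith(BASE_DIR)) {
            throw new IllegalArgumentException("Path escapes upload directory");
        }
        return Files.readAllBytes(resolved);
    }
}
```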
SonarQube Issue Examples
Concrete examples include:
- Generic exception usage (java:S112): Indicates lack of specificity in error handling.
- Resource management lapses (java:S2095): Failure to close streams/resources.
- Hardcoded credentials (java:S6437): Direct embedding of sensitive data.
- High cognitive complexity (java:S3776): Methods exceeding maintainability thresholds.
- Redundant code structures (java:S2094): Empty classes/methods.
These issues are systemic and not isolated to specific models, reflecting limitations in context awareness and semantic reasoning.
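For concreteness, the snippet below condenses triggers for most of the rules listed above (the cognitive-complexity rule java:S3776 is omitted, as it requires a long method to demonstrate). The rule IDs are those cited in the paper; whether a particular SonarQube configuration flags each line may vary.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

class AuditPlaceholder {}                                    // java:S2094 — empty class adds no behavior

public class SonarIssueExamples {

    static Connection connect() throws SQLException {
        // java:S6437 — credentials embedded directly in source
        return DriverManager.getConnection("jdbc:postgresql://db/app", "admin", "admin123");
    }

    static int firstByte(String path) {
        try {
            FileInputStream in = new FileInputStream(path);  // java:S2095 — stream is never closed
            return in.read();
        } catch (IOException e) {
            throw new RuntimeException(e);                   // java:S112 — generic exception instead of a dedicated one
        }
    }
}
```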
Model Evolution and Defect Trends
Comparative analysis of Claude 3.7 Sonnet and Claude Sonnet 4 shows that improved functional performance can coincide with increased severity of bugs and vulnerabilities. The newer model has a higher proportion of BLOCKER bugs and vulnerabilities, indicating that advances in benchmark scores do not guarantee improvements in non-functional attributes. The defect profile evolves rather than resolves, necessitating ongoing vigilance.
Static Analysis as a Safeguard
Static analysis tools like SonarQube are effective in detecting critical flaws in LLM-generated code, providing automated, consistent, and scalable quality assurance. Their integration into CI/CD pipelines is essential for responsible AI adoption in software engineering. However, static analysis must be complemented by SCA to address risks from outdated dependencies and known CVEs, which are frequently reproduced by LLMs due to training data limitations.
Implications and Future Directions
The findings have several practical and theoretical implications:
- Functional correctness is not a proxy for code quality or security. Model selection should consider defect profiles, not just benchmark scores.
- Systemic weaknesses are inherent to current LLM methodologies. Improvements in architecture or scale do not eliminate latent risks.
- Automated static analysis and SCA are mandatory for production-readiness. Manual review alone is insufficient given the volume and subtlety of defects.
- Prompt engineering and fine-tuning may mitigate but not eliminate risks. Research into architectures that incorporate semantic analysis and security awareness is needed.
Future work should explore prompt and training strategies for defect mitigation, longitudinal studies on technical debt in LLM-assisted projects, autonomous LLM-driven refactoring, cross-language analyses, and architectural innovations targeting robust code generation.
Conclusion
This paper demonstrates that LLM-generated code, while functionally capable, is consistently affected by maintainability, reliability, and security defects. These issues are systemic, not model-specific, and are best understood as features of current generative paradigms. The lack of correlation between functional performance and code quality underscores the necessity of rigorous static analysis and SCA in any workflow involving LLMs. As AI becomes integral to software engineering, informed vigilance and automated safeguards are essential to uphold standards of quality, security, and maintainability.