
Bridging Generation and Training: A Systematic Review of Quality Issues in LLMs for Code

Published 6 May 2026 in cs.SE and cs.AI | (2605.05267v1)

Abstract: LLMs frequently generate defective outputs in code generation tasks, ranging from logical bugs to security vulnerabilities. While these generation failures are often treated as model-level limitations, empirical evidence increasingly traces their root causes to imperfections within the training corpora. Yet, the specific mechanisms linking training data quality issues to generated code quality issues remain largely unmapped. This paper presents a systematic literature review of 114 primary studies to investigate how training data quality issues propagate into code generation. We establish a unified taxonomy that categorizes generated code quality issues across nine dimensions and training data quality issues into code and non-code attributes. Based on this taxonomy, we formalize a causal framework detailing 18 typical propagation mapping mechanisms. Furthermore, we synthesize state-of-the-art detection and mitigation techniques across the data, model, and generation lifecycles. The reviewed literature reveals a clear methodological shift: quality assurance is transitioning from reactive, heuristic-based post-generation filtering toward proactive, data-centric governance and closed-loop repair. Finally, we identify open challenges and outline research directions for developing reliable LLMs for code through integrated data curation and continuous evaluation. Our repository is available at https://github.com/SYSUSELab/From-Data-to-Code.

Summary

  • The paper provides a nine-dimensional taxonomy for categorizing code defects, including errors in correctness, security, and maintainability.
  • It introduces a causal mapping framework with 18 propagation mechanisms that detail how training data flaws translate directly and indirectly into code issues.
  • State-of-the-art detection and mitigation strategies are analyzed, emphasizing proactive data governance and end-to-end quality pipelines for robust code generation.

Introduction and Motivation

LLMs have become integral to code generation, driving advances in automated code completion, repair, and documentation. However, LLM-generated code still suffers from a broad spectrum of quality defects, including logical errors, maintainability issues, and critical security vulnerabilities. The systematic review "Bridging Generation and Training: A Systematic Review of Quality Issues in LLMs for Code" provides a comprehensive synthesis of how training data quality fundamentally shapes, and often undermines, the reliability of model outputs. The work integrates cross-disciplinary perspectives and formalizes the propagation mechanisms by which data-level technical debt is amplified in LLM-generated code.

Taxonomies of Code and Data Quality Issues

Generated Code Quality Issues

Through synthesis of 114 primary studies, the paper proposes a nine-dimensional taxonomy for classifying LLM-generated code defects:

  • Correctness: Logical errors, syntactic defects, deprecated API use.
  • Security: Vulnerabilities such as SQL injection or unsafe serialization.
  • Compliance: Legal, ethical, or privacy violations.
  • Robustness: Lack of boundary condition handling, poor fault tolerance.
  • Maintainability: Complex, monolithic, or poorly modularized code.
  • Understandability: Obfuscated logic, poor naming, missing documentation.
  • Efficiency: Redundant operations, non-optimal data structures/algorithms.
  • Parsimony of Output: Overly verbose, repetitive code blocks.
  • Miscellaneous: Misalignment with prompt intent, hallucinated APIs.

These dimensions are not isolated but often co-occur and reinforce one another, resulting in compounded risk when deploying LLM-generated code (Figure 1).

Figure 1: LLM-generated code with multiple quality defects.

Training Data Quality Issues

The review groups training data issues into two meta-categories (Figure 2): code attribute defects (e.g., buggy code, unsafe idioms, obsolete APIs) and non-code attribute defects (distribution imbalance, duplication, benchmark contamination, textual noise).

Figure 2: Training data quality issues propagated to generated code.

Non-code defects, such as severe class/language/task imbalance and excessive redundancy, distort underlying distributions and bias model behavior in ways that are often difficult to detect post hoc.
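
As a rough illustration of how two of these non-code defects might be measured, the sketch below computes a duplicate rate and a balance score for a small corpus. The function names, the whitespace-only normalization, and the use of normalized Shannon entropy as a balance measure are all assumptions made for this example; they are not taken from the review, and real pipelines usually rely on stronger near-duplicate detection.

```python
import hashlib
import math
from collections import Counter


def normalize(code: str) -> str:
    """Collapse whitespace so trivially reformatted copies hash identically."""
    return " ".join(code.split())


def duplicate_rate(samples: list[str]) -> float:
    """Fraction of samples that repeat an earlier sample after normalization."""
    if not samples:
        return 0.0
    seen: set[str] = set()
    duplicates = 0
    for code in samples:
        digest = hashlib.sha256(normalize(code).encode("utf-8")).hexdigest()
        if digest in seen:
            duplicates += 1
        else:
            seen.add(digest)
    return duplicates / len(samples)


def balance_score(labels: list[str]) -> float:
    """Normalized Shannon entropy of label counts.

    1.0 means the labels are uniformly distributed; 0.0 means a single
    label (or no labels at all).
    """
    counts = Counter(labels)
    if len(counts) < 2:
        return 0.0
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(len(counts))
```

Here `labels` could hold the programming language or task type of each sample. A low balance score combined with a high duplicate rate would suggest the kind of distorted distribution described above, although neither number says anything by itself about downstream model behavior.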

Propagation Mechanisms: Mapping Data Problems to Generated Defects

The core technical contribution is the causal mapping framework, which details 18 propagation mechanisms between training data flaws and eventual code artifacts (Figure 3). These are divided into:

  • Direct mappings: Explicit flaws (such as insecure idioms or deprecated APIs) are memorized and reproduced.
  • Indirect mappings: Statistical biases (e.g., duplication, imbalance) induce representational or optimization bias, reducing model robustness or diversity.

Figure 3: A Sankey diagram illustrating the mappings from training data quality issues (left) to generated code quality issues (right). The intermediate layer details the specific propagation mechanisms.

Direct mappings yield clear "garbage in, garbage out" patterns. For example, if deprecated API patterns are overrepresented in the pretraining corpus, LLMs tend to generate code that is neither forward-compatible nor secure. Indirect mappings are less immediately visible: duplicated content and imbalanced distributions can cause overfitting, mode collapse, or memorization, which limits the diversity and adaptability of the final model. Data contamination (overlap between training and evaluation sets) artificially inflates benchmark accuracy without guaranteeing generalization.
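
To make the contamination point concrete, the following sketch estimates how much of a benchmark item already appears in a training corpus by comparing token n-grams. It is a minimal illustration under stated assumptions: the whitespace tokenizer, the default window of five tokens, and the names `token_ngrams` and `contamination_score` are choices made for this example, not a method described in the review.

```python
def token_ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Return the set of n-token windows in a whitespace-tokenized text."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_score(benchmark_item: str, corpus: list[str], n: int = 5) -> float:
    """Share of the benchmark item's n-grams that also occur in the corpus.

    Returns 0.0 when the item is shorter than n tokens.
    """
    target = token_ngrams(benchmark_item, n)
    if not target:
        return 0.0
    corpus_grams: set[tuple[str, ...]] = set()
    for doc in corpus:
        corpus_grams |= token_ngrams(doc, n)
    return len(target & corpus_grams) / len(target)
```

A score near 1.0 suggests the item was seen more or less verbatim during training, in which case a correct answer on it says little about generalization. Exact n-gram matching will miss paraphrased or renamed copies, so a low score does not rule out contamination.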

Evolution of Detection and Mitigation Strategies

State-of-the-art quality assurance has shifted from the reactive post-hoc filtering of outputs towards proactive, data-centric governance, leveraging integrated pipelines for defect detection and mitigation across the model lifecycle.

Detection now combines static rule-based analysis, dynamic execution-based verification, and semantic/model-based evaluation for both code outputs and training data. LLMs are increasingly being utilized to judge code quality ("LLM-as-judge"), complementing human review and lightweight ML classifiers. Static provenance tracking, performance drift monitoring, and membership inference attacks are advancing the ability to trace dataset contamination or memorization.
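
The static, rule-based end of this spectrum can be sketched with Python's standard `ast` module, which parses source code without executing it. The rule set `RISKY_CALLS` below is an illustrative assumption, not a list drawn from the review, and the matcher only recognizes calls written under exactly these dotted names.

```python
import ast
from typing import Optional

# Illustrative rule set (an assumption for this sketch, not from the review).
RISKY_CALLS = {"eval", "exec", "pickle.loads", "os.system"}


def dotted_name(node: ast.AST) -> Optional[str]:
    """Return 'a.b.c' for Name/Attribute chains, or None for anything else."""
    if isinstance(node, ast.Name):
        return node.id
    if isinstance(node, ast.Attribute):
        base = dotted_name(node.value)
        return f"{base}.{node.attr}" if base else None
    return None


def find_risky_calls(source: str) -> list[tuple[int, str]]:
    """Return (line number, call name) for each call that matches RISKY_CALLS."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            name = dotted_name(node.func)
            if name in RISKY_CALLS:
                findings.append((node.lineno, name))
    return sorted(findings)
```

The same check can be applied to training samples and to generated outputs. Because it matches names syntactically, it misses aliased imports such as `from pickle import loads`, which is one reason the text pairs static rules with execution-based and model-based checks.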

Mitigation strategies form a hierarchy:

  • Data-level: Cleaning, filtering, balancing, and augmentation.
  • Model-level: Instruction-tuning, reward-based optimization, and regularization.
  • Generation-level: Prompt engineering, retrieval-augmented generation, iterative self-refinement, and automated post-processing.
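
As one small example of a data-level step, the sketch below drops Python samples that do not parse and reports the share that was rejected. It is a simplified illustration: `is_parseable` and `filter_parseable` are hypothetical helper names, the check covers syntax only, and a sample that parses may still fail at run time.

```python
import ast


def is_parseable(code: str) -> bool:
    """True if the sample is syntactically valid Python."""
    try:
        ast.parse(code)
        return True
    except (SyntaxError, ValueError):
        # ValueError covers inputs such as null bytes on older interpreters.
        return False


def filter_parseable(samples: list[str]) -> tuple[list[str], float]:
    """Return the samples that parse and the share of samples rejected."""
    kept = [s for s in samples if is_parseable(s)]
    rejected_share = 1 - len(kept) / len(samples) if samples else 0.0
    return kept, rejected_share
```

Tracking the rejected share over time gives a cheap signal of how noisy a raw corpus is before more expensive execution-based or model-based filters are applied.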

The review documents that successful interventions are typically holistic, spanning data, model, and generation stages, and that isolated, stage-specific fixes provide only partial or transient benefits (Figure 4).

Figure 4: Conceptual framework of quality issues and mitigation in the LLM lifecycle.

The review reports two quantitative trends:

  • Functional correctness remains the most frequently studied code quality dimension, but recent years have seen a surge in security, efficiency, and maintainability research.
  • A pronounced methodological shift is observed post-2023, with industrial LLM deployments driving up both the number and sophistication of quality studies.

On the data side, redundancy, imbalance, and contamination issues are especially prevalent and persistent. Systematic data cleaning and deduplication consistently improve downstream metrics; for example, filtering can show that up to 40% of raw open-source data is functionally invalid or unexecutable. However, current benchmarks may overstate LLM generalization ability due to hidden contamination.

Figure 5: Distribution of Included Studies by Quality Score.

Practical and Theoretical Implications

This review underscores that generation failures are, in practice, more often symptoms of upstream training data deficiencies than shortcomings of generative inference or architectural limitations. Data-level technical debt is thus the primary driver of code-level defects and risks. The findings highlight that traditional correctness-centric evaluation is insufficient: security, maintainability, and compliance defects often evade static or functional benchmarks but have catastrophic implications in industrial codebases.

For practitioners, the results argue for continuous, end-to-end quality pipelines, incorporating dynamic, provenance-aware benchmarks and automated unlearning or influence tracking to enable traceable, robust interventions. For researchers, implications include the need for formal causal attribution between data and predictions, and for frameworks that can represent and evaluate higher-order properties (such as robustness, compositionality, and social risk) at scale.

Prospects for Future AI Systems

Future LLMs for code will likely be developed within closed-loop, data/model co-training infrastructures, where provenance tracking, quality-based data shaping, and reward modeling become first-class primitives. Advances in traceability, causal influence analysis, and automated dynamic benchmarking will be integral to both trustworthy code generation and the principled evaluation of progress. Integrated human-in-the-loop validation is expected to persist in high-risk domains, given the current limitations of automated semantic and legal compliance checks.

Conclusion

The systematic synthesis provided in this review establishes that the path to robust, secure, and maintainable LLM-generated code begins with rigorous, holistic data governance. Generation failures observed in outputs are overwhelmingly downstream effects of upstream training data flaws, both in code and non-code attributes. A foundational paradigm shift, away from reactive, output-centric filtering and toward proactive, data-centered lifecycle control, is essential for building trustworthy, high-quality code generation systems. This review offers a formal, actionable blueprint for future research and applied engineering aiming to close the gap between data curation practices and rigorous software quality standards.
