
On Iterative Evaluation and Enhancement of Code Quality Using GPT-4o (2502.07399v1)

Published 11 Feb 2025 in cs.SE and cs.AI

Abstract: This paper introduces CodeQUEST, a novel framework leveraging LLMs to iteratively evaluate and enhance code quality across multiple dimensions, including readability, maintainability, efficiency, and security. The framework is divided into two main components: an Evaluator that assesses code quality across ten dimensions, providing both quantitative scores and qualitative summaries, and an Optimizer that iteratively improves the code based on the Evaluator's feedback. Our study demonstrates that CodeQUEST can effectively and robustly evaluate code quality, with its assessments aligning closely with established code quality metrics. Through a series of experiments using a curated dataset of Python and JavaScript examples, CodeQUEST demonstrated significant improvements in code quality, achieving a mean relative percentage improvement of 52.6%. The framework's evaluations were validated against a set of proxy metrics comprising of Pylint Score, Radon Maintainability Index, and Bandit output logs, showing a meaningful correlation. This highlights the potential of LLMs in automating code quality evaluation and improvement processes, presenting a significant advancement toward enhancing software development practices. The code implementation of the framework is available at: https://github.com/jpmorganchase/CodeQuest.

Summary

  • The paper presents CodeQUEST, a framework using GPT-4o to automate and iteratively improve code quality across 10 dimensions.
  • It employs a zero-shot chain-of-thought evaluator that provides both quantitative scores and qualitative feedback for systematic code assessment.
  • Experimental results demonstrate a 52.6% mean improvement and stronger correlation with established metrics compared to baseline evaluation methods.

This paper introduces CodeQUEST (Code Quality Understanding and Enhancement System Toolkit), a framework utilizing the GPT-4o LLM to automatically evaluate and improve code quality. The framework addresses the challenges of traditional code quality assessment, which is often subjective, manual, time-consuming, and reliant on language-specific tools.

CodeQUEST consists of two main components:

  1. Evaluator: This component assesses code quality across ten dimensions: Readability, Maintainability, Testability, Efficiency, Robustness, Security, Documentation, Modularity, Scalability, and Portability.

    • For each dimension, it uses a set of five predefined questions/statements designed to be language-agnostic and non-overlapping.
    • Using a zero-shot Chain-of-Thought (CoT) prompt, the LLM answers each statement as True (+1), False (-1), or Not Applicable (0).
    • These answers are summed to give a quantitative score (-5 to +5) for each dimension, and the dimension scores are averaged to produce an overall code score.
    • The Evaluator also produces a qualitative text summary for each dimension and an overall code summary, justifying the scores.
    • It supports optional self-consistency reasoning (multiple runs) to reduce score variance due to LLM stochasticity.

    Evaluator Prompt Template:

```
--- System ---
You are a helpful and harmless AI software engineer.
You must provide an answer to the following request.
Be brief and precise.

--- Human ---
### CODE:
{code}

### STATEMENTS:
{dimension_statements}

### TASK:
Think step by step to assess the veracity of each STATEMENT
in light of the CODE provided.
Your answer to each statement must come from one of the following:
* -1 if the statement is false,
* 1 if the statement is true,
* 0 if the statement is not applicable or there is not enough
  evidence in the CODE to address it.

You must also provide a short summary about the quality of the
code from a {quality_dimension} perspective, justifying your
answers across the various statements.

### OUTPUT:
Return your answer in valid JSON as shown below:
{
  "insight": <code quality summary:str>,
  "scores": [<score_to_statement1:int>, ...]
}
```
    
  2. Optimizer: This component iteratively enhances the code based on the Evaluator's feedback. Each iteration involves the following steps (a minimal sketch of this loop is given after the prompt template below):

    • Code Quality Improvement: GPT-4o receives the current code and the Evaluator's qualitative feedback and is prompted to address the highlighted improvement areas (prioritizing dimensions with lower scores), generating a revised code version along with explanations.
    • Code Validation: The revised code is checked for compilation errors. Optionally, predefined test cases can be run to ensure functionality remains intact. Failure leads to rejection of the revision for that iteration.
    • Evaluator Assessment: If validation passes, the Evaluator re-assesses the revised code. The revision is accepted only if the overall quality score increases compared to the previous best version.
    
    Optimizer Prompt Template:

```
--- Human ---
Code:
{code}

Quality Dimensions Feedback:
{quality_insight}

TASK:
You are provided with a code script and detailed feedback for each
quality dimension. For each quality dimension, you are provided with:
* A score from -5 to 5. The higher the score, the better the quality.
* Dimension insights, highlighting potential areas of improvement.

Think step by step to complete the following:
1) For each dimension, reflect on the score and insights.
2) Condense a list of improvement points, so that the code would be
   evaluated at a higher score for each dimension.
3) Improve the code script according to the improvement points,
   prioritizing dimensions with lower scores.
4) Return:
   * the improvement points identified
   * the improved version of the code script
   * explanations for each of the changes you've made
Note:
* ALL improvement points MUST be addressed via meaningful changes to the code.

OUTPUT:
Your final output contains two parts.
Return your answer in a valid JSON as shown below:
{
  "improvement_points": List[str],
  "explanation_report": List[str]
}
Then quote your code in the following section:

improved_code
{improved_code_here}
```

The process terminates when a target quality score is reached or a maximum number of iterations is completed.
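To make the evaluate-optimize loop concrete, the following is a minimal Python sketch of the iteration logic described above, not the repository's implementation. The helper names `llm_evaluate_dimension` and `llm_improve_code`, the compile-only `validate` stub, and the default target score are assumptions standing in for the prompt calls and the validation step; the aggregation (five answers in {-1, 0, +1} summed per dimension, dimension scores averaged) and the accept-only-if-the-score-improves rule follow the paper's description.

```python
from statistics import mean

DIMENSIONS = [
    "Readability", "Maintainability", "Testability", "Efficiency", "Robustness",
    "Security", "Documentation", "Modularity", "Scalability", "Portability",
]

# Hypothetical stand-ins for the GPT-4o prompt calls (not the CodeQUEST API):
def llm_evaluate_dimension(code: str, dimension: str) -> tuple[list[int], str]:
    """Return five answers in {-1, 0, +1} and a textual insight for one dimension."""
    raise NotImplementedError("call GPT-4o with the Evaluator prompt")

def llm_improve_code(code: str, insights: dict[str, str]) -> str:
    """Return a revised version of the code addressing the insights."""
    raise NotImplementedError("call GPT-4o with the Optimizer prompt")

def validate(code: str) -> bool:
    """Minimal validation: the revision must at least compile (tests could be added)."""
    try:
        compile(code, "<candidate>", "exec")
        return True
    except SyntaxError:
        return False

def evaluate(code: str) -> tuple[float, dict[str, str]]:
    """Aggregate per-statement answers into dimension scores and an overall score."""
    scores, insights = {}, {}
    for dim in DIMENSIONS:
        answers, insight = llm_evaluate_dimension(code, dim)
        scores[dim] = sum(answers)            # dimension score in [-5, +5]
        insights[dim] = insight
    return mean(scores.values()), insights    # overall score = mean over dimensions

def optimize(code: str, target: float = 4.0, max_iters: int = 5) -> str:
    """Accept a revision only if it validates and raises the overall score."""
    best_code = code
    best_score, insights = evaluate(best_code)
    for _ in range(max_iters):
        if best_score >= target:
            break
        candidate = llm_improve_code(best_code, insights)
        if not validate(candidate):
            continue                           # reject non-compiling revisions
        score, new_insights = evaluate(candidate)
        if score > best_score:
            best_code, best_score, insights = candidate, score, new_insights
    return best_code
```

The target score of 4.0 is an illustrative choice; the paper's experiments ran for up to five improvement iterations.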

Experiments and Results:

  • Dataset: A curated set of 42 Python (28) and JavaScript (14) code examples from open-source libraries (Scipy, AWS-CDK-Examples, MBPP, etc.), some manually modified to introduce quality issues.
  • Baseline Comparison: CodeQUEST's Evaluator was compared to a simpler baseline prompt asking GPT-4o for a direct quality score and summary. CodeQUEST provided more detailed, comprehensive, and arguably more accurate assessments, while the baseline tended to overestimate quality.
  • Improvement Effectiveness: The Optimizer significantly improved code quality over 5 iterations, achieving a mean Relative Percentage Improvement (RPI) of 52.6%. Most improvements occurred in the first iteration, diminishing rapidly after the third.
  • Validation with Proxies: For Python code, CodeQUEST's score changes per iteration showed a stronger correlation (Pearson r_p = 0.53, Spearman r_s = 0.23) with score changes from proxy metrics (Pylint, Radon MI, Bandit) than the baseline's score changes did (Pearson r_p = 0.27, Spearman r_s = 0.21). This suggests CodeQUEST's evaluations better reflect underlying quality variations captured by established tools, although the relationship is noisy due to differences in evaluation scope (e.g., CodeQUEST identified security issues missed by Bandit).
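As an illustration of how such a proxy-metric check could be computed (not the paper's exact procedure), the sketch below correlates per-iteration changes in CodeQUEST scores with per-iteration changes in a proxy score using SciPy; all numbers are made up.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Made-up per-iteration overall scores for a single code example (illustrative only).
codequest_scores = np.array([1.2, 2.8, 3.4, 3.6, 3.6, 3.7])  # CodeQUEST overall score
proxy_scores     = np.array([5.1, 7.0, 7.9, 8.2, 8.1, 8.3])  # e.g. a proxy such as the Pylint score

# Correlate per-iteration score *changes* rather than absolute values.
delta_codequest = np.diff(codequest_scores)
delta_proxy = np.diff(proxy_scores)

r_p, _ = pearsonr(delta_codequest, delta_proxy)
r_s, _ = spearmanr(delta_codequest, delta_proxy)
print(f"Pearson r_p = {r_p:.2f}, Spearman r_s = {r_s:.2f}")
```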

Practical Implications:

  • CodeQUEST demonstrates a practical method for automating code quality assessment and improvement using LLMs.
  • The framework is configurable (dimensions, questions, iterations, target score) and applicable across different programming languages (tested on Python and JavaScript).
  • The diminishing returns after a few iterations suggest cost-effective application by limiting the number of improvement cycles.
  • The explicit structure of the Evaluator (10 dimensions, specific questions) provides more reliable and actionable feedback compared to simple prompting.
  • Code validation (compilation, testing) is crucial to ensure the LLM's modifications remain functional (a minimal validation sketch follows this list).
  • The approach can potentially reduce the burden of manual code reviews and standardize quality assessment.
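Since validation is what keeps hallucinated or broken revisions out of the loop, here is a hedged sketch of what a minimal Python-side check might look like: a compile step plus optional execution of predefined test cases supplied as assert statements. The function name and the test format are assumptions for illustration, not the CodeQUEST implementation.

```python
def validate_candidate(source: str, tests: str | None = None) -> bool:
    """Reject a revised script that fails to compile or breaks predefined test cases.
    `tests` is assumed to be a string of assert statements exercising the code."""
    try:
        code_obj = compile(source, "<candidate>", "exec")
    except SyntaxError:
        return False
    if tests is None:
        return True
    namespace: dict = {}
    try:
        exec(code_obj, namespace)   # load the candidate's definitions
        exec(tests, namespace)      # run the predefined test cases against them
    except Exception:
        return False
    return True

# Example usage with a trivially broken revision:
broken = "def add(a, b):\n    return a - b\n"
print(validate_candidate(broken, "assert add(2, 3) == 5"))   # -> False
```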

Limitations:

  • Code quality remains inherently subjective.
  • LLM stochasticity can affect results (though self-consistency helps).
  • Potential for LLM hallucinations requires robust validation (e.g., test cases).
  • Performance might vary on highly specialized or proprietary codebases.
  • Validation relied on a limited set of proxy metrics primarily for Python.

In conclusion, the paper presents CodeQUEST as a robust and effective LLM-based framework for iteratively evaluating and enhancing code quality, offering a significant step towards automating and improving software development workflows. The implementation is available on GitHub.
