On the Effectiveness of LLM-as-a-judge for Code Generation and Summarization (2507.16587v1)

Published 22 Jul 2025 in cs.SE

Abstract: LLMs have been recently exploited as judges for complex natural language processing tasks, such as Q&A. The basic idea is to delegate to an LLM the assessment of the "quality" of the output provided by an automated technique for tasks for which: (i) quantitative metrics would only tell part of the story, and (ii) a large-scale human-based evaluation would be too expensive. LLMs-as-a-judge, if proven effective for a specific task, can also unlock new possibilities for automation, with several LLMs proposing a solution for a given instance of the task and others judging and deciding what is the best output to show the user. We study the effectiveness of LLMs-as-a-judge for two code-related tasks, namely code generation and code summarization. The rationale for choosing these tasks is two-fold. First, quantitative metrics are usually not enough for the assessment of code summarizers/generators. For example, it is well documented that metrics such as BLEU are quite weak proxies for the quality of the generated summaries. Second, even state-of-the-art techniques still struggle with handling complex instances of these tasks, making them good candidates for benefiting from more advanced solutions envisioning collaboration among LLMs. For code generation, we check whether eight LLMs are able to judge the correctness of 1,405 Java methods and 1,281 Python functions generated by the same LLMs or implemented by humans. For code summarization, we compare the judgments of five LLMs to those provided by nine humans for ~1.2k summaries, related to both Java and Python functions. Our findings show that GPT-4-turbo is the best LLM in terms of judging capabilities for both tasks, with "smaller" LLMs featuring tens of billions of parameters not being able to cope with judging tasks. However, even the best-performing LLM frequently misjudges the correctness of the code and summary quality.

Summary

  • The paper empirically demonstrates that only large LLMs (e.g., GPT-4-turbo) consistently deliver valid judgments for automated code evaluation tasks.
  • The paper finds that even the best performing models achieve only fair agreement (e.g., Cohen’s Kappa of 0.21 for Java), with significant bias and robustness issues in code generation.
  • The paper shows that GPT-4-turbo aligns moderately with human ratings in code summarization, offering a promising alternative to traditional, less reliable metrics.

Evaluating LLM-as-a-Judge for Code Generation and Summarization

This paper presents a comprehensive empirical study of the effectiveness of LLMs as automated judges for two core software engineering tasks: code generation and code summarization. The paper systematically benchmarks eight LLMs, including both open-source (DeepSeek Coder, CodeLlama) and proprietary (GPT-3.5-turbo, GPT-4-turbo) models, across Java and Python, using rigorous evaluation protocols and large-scale datasets. The analysis covers prompt engineering, model size, agreement with human and test-based oracles, and bias phenomena.

Experimental Design and Datasets

The evaluation is structured around two tasks:

  • Code Generation: LLMs are prompted to judge the correctness of 1,405 Java and 1,281 Python functions, generated either by LLMs or by humans, using the CoderEval benchmark. The ground truth is established via test suite execution (a minimal oracle sketch follows this list), with extensive pre-filtering to remove unreliable test cases.
  • Code Summarization: Five LLMs assess 1,163 code summaries (Java and Python), each previously rated by three human experts on content adequacy, conciseness, and fluency/understandability. The dataset is constructed to ensure high inter-annotator agreement and includes both human- and LLM-generated summaries.
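
For context on the test-based oracle, the following is a minimal sketch of how correctness labels can be derived from test execution, assuming pytest-style test files and a candidate implementation already written to disk; the function name and layout are illustrative and not taken from the paper's CoderEval setup.

```python
import subprocess

def test_based_oracle(test_file: str, timeout_s: int = 60) -> bool:
    """Label a candidate implementation as correct iff its test suite passes.

    Assumes the candidate function has already been written into the module
    that `test_file` imports; failures, errors, and timeouts count as incorrect.
    """
    try:
        result = subprocess.run(
            ["pytest", "-q", test_file],
            capture_output=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False  # non-terminating code is treated as incorrect
    return result.returncode == 0  # pytest exits with 0 only when all tests pass
```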

Multiple prompt variants are tested for both tasks, including zero-shot, chain-of-thought (CoT), and instruction-augmented prompts. The best-performing prompt for each task is selected based on inter-rater agreement with the oracle.
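
As an illustration of the judging protocol, below is a minimal zero-shot judging sketch in Python; the prompt wording, the `ask_llm` callable, and the verdict parsing are assumptions made for exposition and do not reproduce the paper's exact prompts.

```python
from typing import Callable

# Illustrative zero-shot judge prompt; not the paper's exact wording.
ZERO_SHOT_JUDGE_PROMPT = """You are reviewing code for correctness.
Given the requirement and the implementation below, answer with a single word:
"correct" if the implementation satisfies the requirement, otherwise "wrong".

Requirement:
{requirement}

Implementation:
{code}
"""

def judge_correctness(requirement: str, code: str,
                      ask_llm: Callable[[str], str]) -> bool:
    """Query an LLM (via a caller-supplied `ask_llm(prompt) -> str` function)
    for a binary correctness verdict and parse its first word."""
    prompt = ZERO_SHOT_JUDGE_PROMPT.format(requirement=requirement, code=code)
    answer = ask_llm(prompt).strip().lower()
    return answer.startswith("correct")
```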

Key Findings

Code Generation Judging

  • Model Size and Performance: Only the largest models (GPT-4-turbo, GPT-3.5-turbo, CodeLlama 34B, DeepSeek Coder 33B) consistently output valid judgments. Smaller models (≤13B parameters) frequently fail to produce meaningful outputs or exhibit near-random agreement with the ground truth.
  • Agreement with Ground Truth: The highest Cohen's Kappa achieved is 0.21 (GPT-4-turbo, Java), indicating only fair agreement. For Python, the best Kappa is 0.10, at the threshold of weak agreement (a Kappa computation sketch follows this list). Most models, including GPT-3.5-turbo, perform substantially worse.
  • Error Analysis: GPT-4-turbo misjudges 50% of incorrect Java implementations as correct (false positives), and 35% of incorrect Python implementations. Manual analysis attributes errors to uncaught bugs, lack of context, ambiguous requirements, and hallucinations.
  • Bias: LLMs systematically overestimate the correctness of LLM-generated code compared to human-written code (negative bias towards human code, large effect size). Self-bias (overrating own generations) is minimal for GPT-4-turbo but pronounced in smaller models.
  • Robustness: When presented with semantically equivalent code transformations, GPT-4-turbo changes its judgment in 21–36% of cases, indicating limited robustness to superficial code changes. However, it correctly identifies injected bugs in 88–97% of cases.
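
To put the agreement figures above in context, Cohen's Kappa corrects the raw agreement between LLM verdicts and the test-based oracle for chance agreement. A minimal computation sketch using scikit-learn follows; the verdict arrays are toy data, not the study's results.

```python
from sklearn.metrics import cohen_kappa_score

# 1 = correct, 0 = incorrect (toy data for illustration only)
oracle_labels = [1, 0, 1, 1, 0, 0, 1, 0]  # from test suite execution
llm_verdicts  = [1, 1, 1, 1, 0, 1, 1, 0]  # from the LLM judge

kappa = cohen_kappa_score(oracle_labels, llm_verdicts)
# Common reading (Landis & Koch): <=0.20 slight, 0.21-0.40 fair, 0.41-0.60 moderate.
print(f"Cohen's kappa: {kappa:.2f}")
```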

Code Summarization Judging

  • Model Size and Performance: Only GPT-4-turbo achieves moderate agreement with human ratings (Krippendorff's α up to 0.63 for content adequacy; a computation sketch follows this list). All CodeLlama variants and smaller models exhibit near-zero or negative agreement.
  • Prompting: Zero-shot prompts suffice for GPT-4-turbo to reach peak performance; more complex prompts do not yield significant improvements.
  • Bias: No significant self-bias is observed for GPT-4-turbo. All LLMs tend to overrate the fluency/understandability of human-written summaries.
  • Ranking Consistency: GPT-4-turbo's ranking of summary generators aligns closely with human rankings, except for a tendency to rate LLM-generated summaries higher than human-written ones.
  • Implications for Metrics: Given the poor correlation between human judgment and standard metrics (BLEU, ROUGE, METEOR), LLM-as-a-judge (specifically GPT-4-turbo) offers a more reliable automated evaluation for code summarization.
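
Agreement with the human raters is reported as Krippendorff's α; the following is a minimal sketch of how such a coefficient can be computed with the `krippendorff` PyPI package on a toy rating matrix (the scores are invented for illustration and are not the study's data).

```python
import numpy as np
import krippendorff  # pip install krippendorff

# Rows = raters (e.g., three humans plus one LLM judge), columns = summaries.
# Ordinal scores, e.g., content adequacy on a 1-5 scale; toy data only.
ratings = np.array([
    [4, 3, 5, 2, 4, 1],  # human rater 1
    [4, 4, 5, 2, 3, 1],  # human rater 2
    [5, 3, 4, 2, 4, 2],  # human rater 3
    [4, 3, 5, 3, 4, 1],  # LLM judge
], dtype=float)

alpha = krippendorff.alpha(reliability_data=ratings,
                           level_of_measurement="ordinal")
print(f"Krippendorff's alpha: {alpha:.2f}")
```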

Practical and Theoretical Implications

  • Code Generation: The current generation of LLMs, including GPT-4-turbo, is not reliable enough to replace test-based or human evaluation for code correctness. The high false positive rate and lack of robustness to code transformations limit their utility in automated code review, bug fixing, or large-scale code generation pipelines.
  • Code Summarization: GPT-4-turbo demonstrates sufficient alignment with human judgment to be considered a viable automated judge for code summarization quality, particularly for content adequacy. This opens the door to scalable, cost-effective evaluation of code summarization systems, especially as standard metrics are shown to be inadequate proxies for human assessment.
  • Model Size and Cost: The effectiveness of LLM-as-a-judge is strongly dependent on model size. Smaller, more cost-effective models are not currently suitable for judging tasks, which has implications for deployment at scale.
  • Prompt Engineering: While prompt design has some impact, model capacity is the dominant factor. Zero-shot and CoT prompts are generally sufficient for high-capacity models.
  • Bias and Fairness: The observed negative bias towards human-written code and summaries suggests that LLMs may not generalize well to out-of-distribution or less "natural" code, raising concerns for fairness and generalizability in real-world settings.

Limitations and Future Directions

  • Task and Language Scope: The paper is limited to code generation and summarization in Java and Python. Generalization to other tasks (e.g., bug fixing, code review) and languages remains to be established.
  • Fine-Tuning: The potential of fine-tuned, task-specific judge models is not explored here. Prior work suggests that such models may improve in-domain performance but at the cost of generalizability and fairness.
  • Human-Like Reasoning: The limited robustness to code transformations and the tendency to overrate LLM-generated outputs indicate that LLMs do not yet exhibit human-like reasoning or critical assessment capabilities in code evaluation.

Speculation on Future Developments

  • Specialized Judge Models: Fine-tuning or instruction-tuning LLMs specifically for code judgment tasks may yield improvements, but care must be taken to avoid overfitting and loss of generalizability.
  • Hybrid Evaluation Pipelines: Combining LLM-as-a-judge with traditional metrics and test-based evaluation may provide more robust and scalable assessment frameworks (one possible arrangement is sketched after this list).
  • Bias Mitigation: Addressing the systematic biases observed in LLM judgments will be critical for fair and reliable deployment in software engineering workflows.
  • Scaling and Cost: As LLMs continue to grow in size and capability, cost-effective deployment strategies (e.g., distillation, retrieval-augmented evaluation) will be necessary for practical adoption.
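
As one concrete reading of the hybrid-pipeline idea above, a pipeline could trust executable tests whenever a suite is available and fall back to an LLM judge only otherwise; this is a speculative sketch, and the function names are hypothetical.

```python
from typing import Callable, Optional

def hybrid_verdict(requirement: str, code: str,
                   run_tests: Optional[Callable[[str], bool]],
                   llm_judge: Callable[[str, str], bool]) -> bool:
    """Speculative hybrid evaluation: prefer the deterministic test-based
    oracle when a test suite exists, otherwise fall back to the LLM judge."""
    if run_tests is not None:
        return run_tests(code)           # deterministic, preferred signal
    return llm_judge(requirement, code)  # noisier fallback for untestable cases
```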

In summary, the paper provides strong empirical evidence that, while LLMs—particularly GPT-4-turbo—are promising as automated judges for code summarization, their application to code correctness evaluation remains limited by accuracy, robustness, and bias issues. These findings have direct implications for the design of automated evaluation pipelines in software engineering and highlight key challenges for future research in LLM-based assessment.
