Code Summarization Judging
- Code summarization judging is the systematic assessment of natural language summaries for source code based on content adequacy, conciseness, fluency, and semantic alignment.
- Metrics like BLEU-DC, BERTScore, and human ratings are used to quantify how well summaries capture developer intent and code functionality.
- Recent trends leverage large language models as automated judges to scale evaluations, enhance calibration, and mitigate inherent biases.
Code summarization judging refers to the systematic assessment of the quality, accuracy, and human-likeness of natural language summaries generated for source code by automated systems. This judging process is central to the development, evaluation, and deployment of automated code summarization models, as it determines the extent to which these models align with human developers’ expectations and practical documentation needs. The research landscape spans classic metric-based approaches, sophisticated human evaluation studies, the use of LLMs as automated judges, and recent experiments on calibration, reliability, and robustness.
1. Judging Criteria and Quality Dimensions
Code summarization outputs are evaluated along multiple dimensions, each reflecting aspects of developer intent, readability, and usability:
- Content Adequacy: Whether the summary accurately captures the functionality, purpose, and key operations of the code. For instance, does the summary identify “action words” that are central to code behavior (Haque et al., 2021)?
- Conciseness: The brevity and succinctness of the generated summary without unnecessary verbosity or over-generalization (Crupi et al., 22 Jul 2025).
- Fluency/Understandability: The degree to which the summary is grammatical, well-formed, and comprehensible to a human reader (Shrivastava, 2021).
- Coverage of Dependencies and Context: Advanced evaluation frameworks recognize the importance of code context, including method dependencies, class relationships, and hierarchical organization (e.g., intra-class/inter-class context in CoCoSum (Wang et al., 2021) or higher-level units (Sun et al., 13 Mar 2025)).
- Semantic Alignment: Whether the summary semantically matches code intent, even if the wording does not exactly match a reference summary (Mastropaolo et al., 2023, Haque et al., 2022).
These dimensions are reflected in both automated metrics and human assessment protocols.
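A minimal sketch of how such a multi-dimensional rubric might be encoded inside an assessment protocol is shown below; the 1–5 scale, the field names, and the example values are illustrative assumptions rather than the rubric of any specific study.

```python
from dataclasses import dataclass


# Hypothetical rubric: the dimension names follow the list above;
# the 1-5 scale and the inline descriptions are illustrative assumptions.
@dataclass
class SummaryRating:
    content_adequacy: int    # 1 = misses functionality, 5 = fully captures purpose and key operations
    conciseness: int         # 1 = verbose or over-general, 5 = succinct with no filler
    fluency: int             # 1 = ungrammatical, 5 = well-formed and easy to read
    semantic_alignment: int  # 1 = contradicts code intent, 5 = matches intent regardless of exact wording

    def validate(self) -> None:
        # Reject ratings outside the assumed 1-5 scale.
        for name, value in vars(self).items():
            if not 1 <= value <= 5:
                raise ValueError(f"{name} must be on a 1-5 scale, got {value}")


# Example: one annotator's judgment of a single code-summary pair.
rating = SummaryRating(content_adequacy=4, conciseness=5, fluency=5, semantic_alignment=4)
rating.validate()
```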
2. Metrics, Human Judgment, and Their Limitations
A variety of metrics have been proposed and adopted for code summarization judging:
- Reference-based Metrics: BLEU, ROUGE, METEOR, BLEU-DC (sentence-level BLEU with smoothing 4), and variants, which measure n-gram overlap between generated and reference summaries. However, these are known to be only weak proxies for human assessment, since they focus on superficial overlap rather than semantic adequacy (Shi et al., 2021, Crupi et al., 22 Jul 2025). A short computation sketch covering these and the semantic metrics below follows this list.
- Semantic Similarity Metrics: Embedding-based measures such as BERTScore, SentenceBERT, and the SIDE metric (“Summary alIgnment to coDe sEmantics”). SIDE, for example, aligns the generated summary and the code directly in an embedding space via contrastive learning, avoiding dependence on sometimes low-quality reference summaries (Mastropaolo et al., 2023). Studies indicate that such metrics correlate more strongly with human judges than traditional word-overlap metrics (Virk et al., 30 Apr 2024, Haque et al., 2022).
- Human Ratings: Direct assessment by expert or crowd annotators remains the gold standard, especially for dimensions such as content adequacy and naturalness. Inter-rater agreement is measured with statistics such as Krippendorff’s α; high values (e.g., α ≈ 0.8) indicate reliable consensus (Crupi et al., 22 Jul 2025).
- Qualitative Error Analysis: Manual coding and error taxonomy construction reveal common types of model errors (e.g., omission of information, identifier misrecognition, repetition), which quantitative metrics may not distinguish (Mahmud et al., 2021).
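As a concrete illustration, the sketch below computes a BLEU-DC-style score (sentence-level BLEU with NLTK's smoothing method 4), a generic embedding-based similarity in the spirit of BERTScore/SentenceBERT (using an off-the-shelf SentenceTransformer rather than the SIDE model), and Krippendorff's α over hypothetical ratings. The example sentences, the rating matrix, and the `all-MiniLM-L6-v2` checkpoint are assumptions for illustration.

```python
import numpy as np
import krippendorff
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from sentence_transformers import SentenceTransformer, util

reference = "returns the maximum value in the given list"
candidate = "finds the largest element of a list"

# BLEU-DC-style score: sentence-level BLEU with smoothing method 4.
bleu_dc = sentence_bleu(
    [reference.split()], candidate.split(),
    smoothing_function=SmoothingFunction().method4,
)

# Embedding-based similarity (generic SentenceTransformer, not the SIDE model):
# cosine similarity between candidate and reference in embedding space.
model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed off-the-shelf checkpoint
emb = model.encode([reference, candidate], convert_to_tensor=True)
semantic_sim = util.cos_sim(emb[0], emb[1]).item()

# Inter-rater agreement: Krippendorff's alpha over ordinal 1-5 ratings from
# three hypothetical annotators on four summaries (np.nan marks a missing rating).
ratings = np.array([
    [4, 5, 3, np.nan],
    [4, 4, 3, 2],
    [5, 4, 2, 2],
], dtype=float)
alpha = krippendorff.alpha(reliability_data=ratings, level_of_measurement="ordinal")

print(f"BLEU-DC ~ {bleu_dc:.3f}, embedding similarity ~ {semantic_sim:.3f}, alpha ~ {alpha:.2f}")
```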
A key finding is that summary “goodness” is often inadequately captured by word-overlap metrics alone, especially when reference summaries are noisy or outdated (Mastropaolo et al., 2023). As a result, there is a trend toward using a combination of metrics, including those that reflect direct code-summary semantic alignment.
3. LLMs as Automated Judges: Reliability and Challenges
Recent studies have systematically examined the use of LLMs as automatic judges of code summary quality. The core findings are:
- Capability: LLMs such as GPT-4 and GPT-4-turbo provide judgments that moderately to strongly agree with aggregated human ratings on adequacy, conciseness, and fluency (Crupi et al., 22 Jul 2025, Sun et al., 13 Mar 2025). For example, GPT-4 reaches Krippendorff’s α ≈ 0.58–0.63 for content adequacy, approaching the agreement between individual humans.
- Assessment Setup: LLMs are typically prompted with code and candidate summaries and asked to provide scalar ratings or categorize the summary according to predefined criteria (Crupi et al., 22 Jul 2025). Multiple prompts (zero-shot, chain-of-thought, slow-thinking) have been tested; results are prompt-dependent but model ranking remains stable. A minimal prompt sketch follows this list.
- Limitations and Bias: Smaller LLMs (<34B parameters) underperform and may inconsistently rate summaries, sometimes misunderstanding rating tasks. Even state-of-the-art LLMs exhibit “self-enhancement” bias (assigning higher ratings to their own outputs), though this is generally minimal for the best models (Crupi et al., 22 Jul 2025). Further, agreement for aspects like conciseness and fluency remains weaker than for adequacy.
- Automated Judging for Large-Scale Evaluation: The use of LLMs as judges is supported in settings where human evaluation would be cost-prohibitive. For higher-level code units such as files and modules, LLM judges (e.g., GPT-4) can replicate the ranking of strategies observed in human evaluations, facilitating scalable research (Sun et al., 13 Mar 2025).
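The following is a minimal sketch of such a judging setup, assuming the OpenAI Python client and a GPT-4-class model; the model name, prompt wording, and JSON rating scale are illustrative assumptions, not the exact protocols of the cited studies.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are judging a natural-language summary of source code.
Rate the summary on a 1-5 scale for each criterion and reply as JSON:
{{"content_adequacy": ..., "conciseness": ..., "fluency": ...}}

Code:
{code}

Candidate summary:
{summary}
"""


def judge_summary(code: str, summary: str, model: str = "gpt-4o") -> str:
    """Ask an LLM judge for scalar ratings of one code-summary pair."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # keep judgments as deterministic as possible
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(code=code, summary=summary)}],
    )
    return response.choices[0].message.content


print(judge_summary("def add(a, b):\n    return a + b",
                    "Adds two numbers and returns the result."))
```

In practice, several prompt variants (zero-shot, chain-of-thought, slow-thinking) would be run and their ratings compared, since the studies above report prompt-dependent absolute scores but stable model rankings.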
4. Calibration, Confidence Measures, and Human-Likeness
Judging not only involves assigning a score but also understanding how confident one can be in such a judgment, especially with LLM-generated summaries:
- Calibration Problem: Judging frameworks increasingly focus on the calibration of confidence scores—whether probability or likelihood measures produced by a model accurately reflect the chance that a summary is “good enough” (Virk et al., 30 Apr 2024).
- Confidence Measures: Approaches include using the average probability assigned to output tokens, log-probability of auto-critique statements (“Is this summary correct?”), or model-generated self-reflective ratings. However, raw token probabilities correlate only weakly with true summary adequacy (Spearman ρ < 0.5), and self-reflective ratings (where the LLM rates its own outputs) may be poorly calibrated (Virk et al., 30 Apr 2024).
- Rescaling and Calibration Improvement: Platt scaling and related logistic regression adjustments can significantly improve alignment between model confidence and actual correctness, reducing Brier scores (mean squared difference between confidence and correctness) to near 0.03–0.09 in practical settings (Virk et al., 30 Apr 2024).
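A small sketch of this rescaling step, assuming raw confidence scores and binary "good enough" labels are already available for a set of generated summaries; the arrays below are synthetic placeholders.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss

# Synthetic placeholders: raw model confidences (e.g., mean token probability)
# and binary human judgments of whether each summary is "good enough".
raw_confidence = np.array([0.92, 0.88, 0.95, 0.60, 0.85, 0.91, 0.70, 0.97])
is_good = np.array([1, 0, 1, 0, 1, 1, 0, 1])

print("Brier score (raw):", brier_score_loss(is_good, raw_confidence))

# Platt scaling: fit a logistic regression that maps raw confidence to a
# calibrated probability of correctness. In practice this is fit on a
# held-out calibration split, not on the evaluation data itself.
platt = LogisticRegression().fit(raw_confidence.reshape(-1, 1), is_good)
calibrated = platt.predict_proba(raw_confidence.reshape(-1, 1))[:, 1]

print("Brier score (Platt-scaled):", brier_score_loss(is_good, calibrated))
```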
A central challenge remains—judging whether a summary would resemble a high-quality human annotation, given the inherent subjectivity and diversity of human-written summaries.
5. Integration of Judging into Model Development and Selection
Judging frameworks contribute directly to research and deployment by:
- Model Comparison: Precise and transparent reporting of metric variants, code pre-processing, and dataset characteristics is essential for reproducibility and valid comparison among systems (Shi et al., 2021). Among word-overlap variants, BLEU-DC (sentence-level, smoothing 4) is recommended as the most correlated with human judgment.
- Meta-Learning and Ensemble Selection: Meta-models that select the best summary among outputs from competing models can outperform the best single model, improving BLEU by up to 2.1 points in optimized selection settings (Rauf et al., 2022). This ensemble judging is itself trained using standard metrics as supervision; a minimal selection sketch follows this list.
- Industrial Deployment: In practical contexts, such as commercial software development at Ericsson, statistically robust and efficient judging is crucial to selecting reliable summarization methods for integration into development pipelines (Sridhara et al., 19 Aug 2024).
- Evaluation Toolkits and Benchmarks: Community toolboxes have been released to support rigorous, standardized evaluation, incorporating metric variants, pre-processing options, and a range of datasets (Shi et al., 2021).
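A minimal, hypothetical sketch of the selection idea: given candidate summaries from several models, a scoring function (here a stand-in `judge_score`, which could be a trained meta-model, a semantic-alignment metric, or a calibrated LLM judge) picks the candidate to deploy. The function name, candidate dictionary, and placeholder scorer are assumptions for illustration.

```python
from typing import Callable, Dict


def select_best_summary(
    code: str,
    candidates: Dict[str, str],
    judge_score: Callable[[str, str], float],
) -> str:
    """Return the name of the model whose summary the judge scores highest.

    judge_score(code, summary) is a stand-in for any judging function:
    a trained meta-model, a semantic-alignment metric, or an LLM judge.
    """
    return max(candidates, key=lambda name: judge_score(code, candidates[name]))


# Hypothetical usage: a trivial length-penalizing scorer as a placeholder judge.
candidates = {
    "model_a": "Returns the maximum value found in the input list.",
    "model_b": "This function, given a list, iterates over it and returns the max.",
}
best = select_best_summary("def f(xs): return max(xs)", candidates,
                           judge_score=lambda code, s: -len(s))
print("Selected:", best)
```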
6. Open Challenges and Future Directions
- Semantic Equivalence Beyond Reference Summaries: There is increasing consensus that the field must move beyond word overlap and embrace metrics that capture direct code-summary semantic alignment, independent of potentially flawed references (e.g., via SIDE) (Mastropaolo et al., 2023, Haque et al., 2022).
- Higher-Level Code Summarization: Judging strategies must adapt as research moves from method-level to file- and module-level summarization. Optimal strategies differ: complete code input works for files, hierarchical summarization excels at module level, and judging methods must account for trade-offs between quality, cost, and LLM context limits (Sun et al., 13 Mar 2025).
- Robustness to Code Variability: Studies show that summarization models are often overly reliant on lexical cues like function names, leading to brittleness under realistic code transformations. Judging frameworks should assess semantic robustness in the face of identifier renaming, dead code, and other perturbations (Mondal et al., 2023, Sridhara et al., 19 Aug 2024); a perturbation sketch follows this list.
- Bias, Prompt Sensitivity, and Automation Pitfalls: LLM-based judgment, although promising for scaling evaluation, requires careful attention to prompt selection, model bias, and agreement with diverse human judgments, particularly for tasks with inherent subjectivity (Crupi et al., 22 Jul 2025).
- Combining Human and Automated Judging: Hybrid evaluation frameworks leveraging both human ratings and calibrated LLM judges are increasingly emphasized for both research and industrial practice, ensuring both reliability and scalability.
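As one example of such a robustness check, the sketch below applies a semantics-preserving identifier-renaming perturbation using Python's `ast` module; a judging pipeline would then compare the summaries generated for the original and perturbed code. The renaming map, example snippet, and follow-up comparison are assumptions for illustration.

```python
import ast


class RenameIdentifiers(ast.NodeTransformer):
    """Rename selected function, parameter, and variable names without changing behavior."""

    def __init__(self, mapping):
        self.mapping = mapping

    def visit_FunctionDef(self, node):
        node.name = self.mapping.get(node.name, node.name)
        self.generic_visit(node)
        return node

    def visit_arg(self, node):
        node.arg = self.mapping.get(node.arg, node.arg)
        return node

    def visit_Name(self, node):
        node.id = self.mapping.get(node.id, node.id)
        return node


original = """
def find_max(values):
    best = values[0]
    for v in values:
        if v > best:
            best = v
    return best
"""

# Semantics-preserving perturbation: obscure the lexical cues a brittle model may rely on.
tree = RenameIdentifiers({"find_max": "f1", "values": "a", "best": "b"}).visit(ast.parse(original))
perturbed = ast.unparse(ast.fix_missing_locations(tree))
print(perturbed)
# A robustness-aware judge would check that summaries of `original` and
# `perturbed` still describe the same behavior (finding the maximum value).
```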
7. Summary Table: Representative Metrics and their Correlation with Human Judgment
| Metric/Approach | Nature | Correlation with Human Assessment | Notes |
|---|---|---|---|
| BLEU-DC | N-gram overlap (sentence-level, smoothing 4) | Weak to moderate | Recommended BLEU variant, but still a surface-level proxy (Shi et al., 2021) |
| ROUGE, METEOR | N-gram/sequence overlap | Weak | Sensitive to reference quality; miss semantic equivalence |
| BERTScore, SentenceBERT | Embedding-based semantic similarity | Moderate to strong | Correlate better with human judges than overlap metrics (Virk et al., 30 Apr 2024, Haque et al., 2022) |
| SIDE | Contrastive code-summary alignment | Strong | Reference-free; aligns the summary directly with code semantics (Mastropaolo et al., 2023) |
| Human, LLM-as-judge | Direct rating/scoring | Gold standard; GPT-4 agreement moderate to strong | GPT-4 reaches α ≈ 0.58–0.63 on content adequacy (Crupi et al., 22 Jul 2025) |
| Token confidence | Model token log-probs, self-evaluation by LLM | Weak (Spearman ρ < 0.5) unless rescaled | Platt scaling improves calibration (Virk et al., 30 Apr 2024) |
Conclusion
Code summarization judging is a rapidly advancing research domain, moving from basic metric-based comparison toward context-aware, semantically grounded, and now LLM-automated assessment frameworks. Modern judging protocols recognize the limitations of overlap metrics, the necessity of robust semantic assessment (especially at higher code abstraction levels), the trade-offs between comprehensiveness and resource efficiency, and the practical possibilities and remaining caveats of large-scale automated evaluation. Ongoing developments in metric design, model calibration, and multi-level automated judgment will continue to shape the standards and practices of code summarization research and deployment.