LLM-as-a-Judge for Human-AI Co-Creation: A Reliability-Aware Evaluation Framework for Coding

Published 30 Apr 2026 in cs.SE | (2604.27727v1)

Abstract: LLMs are increasingly employed both as judges for evaluating open-ended outputs and as co-creation partners in AI-assisted programming; yet rigorous evaluation in human-AI co-creation settings remains underdeveloped as judgments must be reliable, comparable across models, and interpretable over multi-turn interaction. To address this gap, a rubric-driven LLM-as-a-Judge framework is presented for contest-style human-AI co-creation in coding and software engineering (SE). The framework is built around schema-constrained judge outputs, validation and repair mechanisms, grouped and split by user and problem to prevent trajectory leakage, and participant-level NONBLIND context. Multiple LLM judges are assessed through a multi-metric protocol covering discrimination (ROC-AUC, PR-AUC), thresholded decision quality (MCC), probabilistic reliability (LogLoss, Brier score, ECE), and inter-judge agreement (Cohen's and Fleiss' k). Human-AI co-creation is further examined through trajectory-level signals, including turn-wise confidence, Success-at-Turn, time-to-success, revision churn, and CodeBLEU. Co-creation success is found to concentrate early, with Success-at-Turn rising to 0.8533 at the first observed turn and stabilizing at 0.8641 by turn 6. Revision behavior, however, remains heterogeneous, suggesting that productive progress can emerge through either incremental refinement or broader restructuring. On the judging side, the best held-out scores reach 0.5937 for ROC-AUC, 0.6904 for PR-AUC, and 0.5000 for MCC test, while inter-judge consistency remains modest overall (mean pairwise Cohen's k = 0.1592, Fleiss' k = 0.0696). Taken together, this work offers an auditable and reproducible evaluation methodology that links reliability-aware LLM judging with trajectory-based analysis of human-AI co-creation, providing a practical evaluation template for future AI-assisted coding and SE.


Authors (6)

Summary

The paper presents an LLM-as-a-Judge framework for evaluating iterative human-AI co-creation in coding, built on schema enforcement and reliability metrics.
It uses multi-metric analysis (ROC-AUC, MCC, and calibration scores) to assess judge performance, and NED and CodeBLEU to characterize code revision dynamics.
Results reveal significant inter-judge variability and highlight the importance of detailed, trajectory-level logging for effective evaluation in programming contests.

LLM-as-a-Judge for Human-AI Co-Creation in Coding: A Reliability-Aware Evaluation Framework

Introduction and Motivation

Using LLMs as judges in human-AI co-creation settings raises substantial methodological challenges, particularly in programming and software engineering (SE). Traditional code evaluation, focused narrowly on correctness, underrepresents the complexity of iterative, co-creative problem solving, where human partners interact with AI agents across multiple attempts and under contest constraints. The reviewed work formulates a schema-constrained, reliability-aware LLM-as-a-Judge paradigm that targets the gaps in auditability, replicability, and interpretability common to existing LLM-based evaluation pipelines, particularly for iterative coding workflows.

Figure 1: Overview of the LLM-as-a-Judge and human-AI co-creation workflow.

Framework and Methodology

Task Setting and Data

Empirical evaluation is anchored in a competitive real-world environment: a coding contest with parallel tracks—one conventional, the other permitting unrestricted AI assistance. This yields a rich dataset comprising chronological submission logs, participant-level prompt histories, source code revisions, and external verdicts for each attempt. Each $(\text{user},\text{problem})$ pair forms a trajectory, allowing for analysis of both artifact quality and multi-turn co-creation dynamics.
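The trajectory construction described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the record fields (`user`, `problem`, `timestamp`, `verdict`), the hash-based assignment, and the 15%/15% validation/test fractions are all assumptions made for the example. The point it shows is that assigning a split per $(\text{user},\text{problem})$ key keeps every attempt of a trajectory on the same side of the split.

```python
import hashlib
from collections import defaultdict

def build_trajectories(attempts):
    """Group attempt records into chronologically ordered (user, problem) trajectories."""
    trajectories = defaultdict(list)
    for attempt in attempts:
        trajectories[(attempt["user"], attempt["problem"])].append(attempt)
    for key in trajectories:
        trajectories[key].sort(key=lambda a: a["timestamp"])
    return dict(trajectories)

def assign_split(user, problem, val_frac=0.15, test_frac=0.15):
    """Map a whole trajectory to train/val/test using a stable hash of its key."""
    digest = hashlib.sha256(f"{user}|{problem}".encode("utf-8")).hexdigest()
    u = int(digest[:8], 16) / 0x100000000  # pseudo-uniform value in [0, 1)
    if u < test_frac:
        return "test"
    if u < test_frac + val_frac:
        return "val"
    return "train"

# Hypothetical attempt log; field names and verdict labels are illustrative only.
attempts = [
    {"user": "u1", "problem": "A", "timestamp": 2, "verdict": "AC"},
    {"user": "u1", "problem": "A", "timestamp": 1, "verdict": "WA"},
    {"user": "u2", "problem": "A", "timestamp": 1, "verdict": "AC"},
]
trajs = build_trajectories(attempts)
splits = {key: assign_split(*key) for key in trajs}
```

Because the split depends only on the trajectory key, rerunning the pipeline or adding new attempts to an existing trajectory never moves it between splits.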

Rubric-Driven Judging Pipeline

The framework operationalizes LLM-as-a-Judge by enforcing a schema-constrained output: each judge must return rubric items comprising a probability of acceptance ($p \in [0,1]$), ordinal sub-scores on $[1,5]$ for algorithmic adequacy and robustness/constraint handling, and a rationale. Outputs are verified and, where necessary, repaired; inference runs through checkpointed API calls with bounded retries and canonical attempt IDs. Splits are grouped at the trajectory level, so attempts from one trajectory never appear on both sides of a split, which supports reproducibility and prevents data leakage.
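A validate-then-repair loop of the kind described above might look like the sketch below. The JSON field names (`p_accept`, `algorithmic_adequacy`, `robustness`, `rationale`), the repair prompt wording, and the `call_judge` callable are hypothetical; the source specifies only the value ranges and that retries are bounded.

```python
import json

SUB_SCORES = ("algorithmic_adequacy", "robustness")  # hypothetical field names

def validate(record):
    """Return a list of schema violations; an empty list means the record is valid."""
    errors = []
    p = record.get("p_accept")
    if isinstance(p, bool) or not isinstance(p, (int, float)) or not 0.0 <= p <= 1.0:
        errors.append("p_accept must be a number in [0, 1]")
    for name in SUB_SCORES:
        s = record.get(name)
        if isinstance(s, bool) or not isinstance(s, int) or not 1 <= s <= 5:
            errors.append(f"{name} must be an integer in 1..5")
    rationale = record.get("rationale")
    if not isinstance(rationale, str) or not rationale.strip():
        errors.append("rationale must be a non-empty string")
    return errors

def judge_with_repair(call_judge, prompt, max_retries=2):
    """Query the judge, re-prompting with the violations until valid or retries run out."""
    message = prompt
    for _ in range(max_retries + 1):
        raw = call_judge(message)
        try:
            record = json.loads(raw)
        except json.JSONDecodeError:
            errors = ["output must be valid JSON"]
        else:
            if isinstance(record, dict):
                errors = validate(record)
            else:
                errors = ["output must be a JSON object"]
        if not errors:
            return record
        message = (f"{prompt}\n\nYour previous output was invalid: "
                   f"{'; '.join(errors)}. Return only valid JSON.")
    return None  # the caller can log this attempt as unrepairable

# Stub judge standing in for a real API call: first reply is malformed, second is valid.
replies = iter([
    "not json",
    '{"p_accept": 0.8, "algorithmic_adequacy": 4, "robustness": 3, '
    '"rationale": "Handles edge cases."}',
])
result = judge_with_repair(lambda msg: next(replies), "Score this submission.")
```

Returning `None` after the retry budget is exhausted, rather than raising, lets a checkpointed batch run continue and leaves the failure visible in the logs.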

Multi-Metric Judge Evaluation

Evaluation incorporates an array of complementary metrics:

Discrimination: ROC-AUC, PR-AUC quantify threshold-free ranking.
Thresholded Quality: MCC (with thresholds set on validation, frozen for test).
Calibration and Reliability: LogLoss, Brier Score, ECE.
Inter-Judge Agreement: Pairwise Cohen’s $\kappa$, Fleiss’ $\kappa$.
Revision Dynamics: NED (normalized edit distance) and CodeBLEU for quantifying code changes.

This suite enables a granular diagnosis of judge behavior beyond binary accuracy.
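To make the discrimination and calibration metrics concrete, the sketch below implements ROC-AUC, Brier score, LogLoss, and ECE in plain Python using their standard textbook definitions. The equal-width, 10-bin ECE is an assumption, since the binning scheme is not stated here, and PR-AUC is omitted for brevity. In practice a library such as scikit-learn would typically be used for the first three.

```python
import math

def roc_auc(y_true, p):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, p) if y == 1]
    neg = [s for y, s in zip(y_true, p) if y == 0]
    wins = sum((a > b) + 0.5 * (a == b) for a in pos for b in neg)
    return wins / (len(pos) * len(neg))

def brier(y_true, p):
    """Mean squared error between predicted probability and binary outcome."""
    return sum((s - y) ** 2 for y, s in zip(y_true, p)) / len(y_true)

def log_loss(y_true, p, eps=1e-15):
    """Mean negative log-likelihood, with probabilities clipped away from 0 and 1."""
    total = 0.0
    for y, s in zip(y_true, p):
        s = min(max(s, eps), 1 - eps)
        total -= y * math.log(s) + (1 - y) * math.log(1 - s)
    return total / len(y_true)

def ece(y_true, p, n_bins=10):
    """Expected calibration error over equal-width probability bins."""
    bins = [[] for _ in range(n_bins)]
    for y, s in zip(y_true, p):
        bins[min(int(s * n_bins), n_bins - 1)].append((y, s))
    total = 0.0
    for b in bins:
        if b:
            accuracy = sum(y for y, _ in b) / len(b)
            confidence = sum(s for _, s in b) / len(b)
            total += len(b) / len(y_true) * abs(accuracy - confidence)
    return total

# Toy labels (1 = accepted by the external verdict) and judge probabilities.
y = [1, 0, 1, 1, 0]
p = [0.9, 0.2, 0.6, 0.4, 0.5]
report = {
    "roc_auc": roc_auc(y, p),
    "brier": brier(y, p),
    "log_loss": log_loss(y, p),
    "ece": ece(y, p),
}
```

ROC-AUC depends only on the ranking of the scores, while Brier, LogLoss, and ECE depend on their actual values, which is why a judge can rank reasonably yet be poorly calibrated.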

Experimental Results

Judge Performance: Discrimination, Calibration, Agreement

Among the judges assessed (OpenAI/gpt-5.2, DeepSeek, Gemini, Claude), performance is non-uniform across metrics:

Figure 2: Judge-level curves for (a) ROC-AUC, (b) PR-AUC.

DeepSeek attains the highest PR-AUC (0.6904) and ties for the best ROC-AUC (0.5937), ahead of OpenAI and Gemini on both discrimination and calibration, although a ROC-AUC this close to 0.5 indicates only modest discrimination above chance. Thresholded metrics diverge much more sharply: DeepSeek’s thresholded decisions ($\text{MCC}_\text{test} = 0.5000$) are markedly better than Claude’s (0.3371) and OpenAI’s (0.0755), illustrating the need for explicit threshold selection and validation.
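The protocol of choosing a threshold on validation data and freezing it for test can be illustrated as below. This is a sketch under assumptions: the candidate grid (the distinct validation scores) and the tie-breaking rule (the lowest best-scoring threshold wins) are choices made for the example, not details taken from the paper.

```python
import math

def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels; 0.0 when undefined."""
    tp = sum(1 for y, d in zip(y_true, y_pred) if y == 1 and d == 1)
    tn = sum(1 for y, d in zip(y_true, y_pred) if y == 0 and d == 0)
    fp = sum(1 for y, d in zip(y_true, y_pred) if y == 0 and d == 1)
    fn = sum(1 for y, d in zip(y_true, y_pred) if y == 1 and d == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def select_threshold(y_val, p_val):
    """Pick the validation score that maximizes MCC when used as the accept cutoff."""
    best_t, best_score = 0.5, -2.0
    for t in sorted(set(p_val)):
        score = mcc(y_val, [int(s >= t) for s in p_val])
        if score > best_score:
            best_t, best_score = t, score
    return best_t

# Toy validation and test sets; the threshold is never re-tuned on test.
y_val, p_val = [0, 0, 1, 1], [0.2, 0.4, 0.6, 0.9]
y_test, p_test = [1, 0, 1, 0], [0.7, 0.3, 0.5, 0.65]

threshold = select_threshold(y_val, p_val)
mcc_test = mcc(y_test, [int(s >= threshold) for s in p_test])
```

In this toy case the threshold separates the validation set perfectly but yields chance-level MCC on test, a reminder that a validation-selected threshold can transfer poorly when judge scores are noisy.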

Figure 3: Judge-level curves for the reliability (calibration), showing ECE behavior across bins.

Inter-judge agreement is modest: mean pairwise Cohen's $\kappa$ is 0.1592, and Fleiss’ $\kappa$ across all four judges is only 0.0696, indicating that LLMs diverge materially in their accept/reject judgments even when schema and labeling policies are controlled.
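For reference, the two agreement statistics can be computed as in the following sketch, which uses the standard definitions of Cohen's and Fleiss' $\kappa$. The judge names and decision vectors are toy data, not results from the paper.

```python
from collections import Counter
from itertools import combinations

def cohen_kappa(a, b):
    """Chance-corrected agreement between two raters over the same items."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum(count_a[c] * count_b[c] for c in set(a) | set(b)) / (n * n)
    return (p_o - p_e) / (1 - p_e) if p_e < 1 else 1.0

def fleiss_kappa(ratings):
    """ratings[i][j] is the label judge j gave item i; every item has the same judge count."""
    n_items, n_raters = len(ratings), len(ratings[0])
    totals = Counter()
    p_bar = 0.0
    for row in ratings:
        counts = Counter(row)
        totals.update(counts)
        agreeing_pairs = sum(c * c for c in counts.values()) - n_raters
        p_bar += agreeing_pairs / (n_raters * (n_raters - 1))
    p_bar /= n_items
    p_e = sum((t / (n_items * n_raters)) ** 2 for t in totals.values())
    return (p_bar - p_e) / (1 - p_e) if p_e < 1 else 1.0

# Toy accept (1) / reject (0) decisions from three hypothetical judges on four attempts.
decisions = {
    "judge_a": [1, 1, 0, 0],
    "judge_b": [1, 0, 0, 0],
    "judge_c": [1, 1, 1, 0],
}
pairwise = {
    (j1, j2): cohen_kappa(decisions[j1], decisions[j2])
    for j1, j2 in combinations(decisions, 2)
}
mean_pairwise_kappa = sum(pairwise.values()) / len(pairwise)
fleiss = fleiss_kappa([list(row) for row in zip(*decisions.values())])
```

Both statistics subtract the agreement expected by chance, so values near zero, like those reported above, mean the judges agree little more often than their marginal accept rates alone would predict.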

Figure 4: Pairwise Cohen's kappa heatmap on TEST.

Participant-Side Co-Creation Dynamics

Progress Distribution and Timing

Human-AI dyads exhibit highly front-loaded success distributions: Success@Turn is 0.8533 on the first recorded attempt, climbing only marginally to 0.8641 by turn 6.

Figure 5: Success@Turn curve showing the cumulative fraction of trajectories resolved by or before each observed turn.
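Following the caption's definition (the cumulative fraction of trajectories resolved by or before each turn), Success@Turn can be computed as in this sketch. The `"AC"` verdict label and the list-of-verdicts input format are assumptions made for illustration.

```python
def first_success_turn(verdicts, accepted="AC"):
    """Return the 1-based turn of the first accepted attempt, or None if never accepted."""
    for turn, verdict in enumerate(verdicts, start=1):
        if verdict == accepted:
            return turn
    return None

def success_at_turn(first_turns, max_turn):
    """Cumulative fraction of trajectories solved by or before each turn 1..max_turn."""
    n = len(first_turns)
    return [
        sum(1 for t in first_turns if t is not None and t <= k) / n
        for k in range(1, max_turn + 1)
    ]

# Toy trajectories: each list holds the external verdicts in chronological order.
trajectories = [
    ["AC"],
    ["AC"],
    ["WA", "WA", "AC"],
    ["WA", "TLE"],
]
first_turns = [first_success_turn(v) for v in trajectories]
curve = success_at_turn(first_turns, max_turn=3)
```

The curve is non-decreasing by construction, and trajectories that never succeed stay in the denominator, which is why it can plateau below 1.0.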

Survival analysis confirms that time-to-success is clustered at the start of most trajectories, with diminishing success rates on subsequent turns.

Figure 6: Kaplan-Meier survival curve estimating time-to-success across participant-problem trajectories.
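A Kaplan-Meier estimate of this kind can be sketched in a few lines. The sketch assumes time is measured in turns and that trajectories ending without an accepted attempt are treated as right-censored at their last observed turn; the paper's exact time axis and censoring rule are not given here.

```python
def kaplan_meier(durations, events):
    """Return [(t, S(t))] at each time with at least one event.

    durations: turn at which the trajectory succeeded or was last observed.
    events: True if success occurred at that turn, False if censored.
    """
    curve, survival = [], 1.0
    event_times = sorted({d for d, e in zip(durations, events) if e})
    for t in event_times:
        at_risk = sum(1 for d in durations if d >= t)
        n_events = sum(1 for d, e in zip(durations, events) if d == t and e)
        survival *= 1 - n_events / at_risk
        curve.append((t, survival))
    return curve

# Toy data: five trajectories, two of which end without success (censored).
durations = [1, 1, 2, 3, 3]
events = [True, True, False, True, False]
km_curve = kaplan_meier(durations, events)
```

Here $S(t)$ is the estimated probability that a trajectory is still unsolved after turn $t$, so a steep early drop corresponds to the front-loaded success pattern described above.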

Revision and Code Churn

Code revision behavior is heterogeneous: both large and small code changes (measured via NED and CodeBLEU) occur alongside positive as well as negative progress increments. Attempts that converge on an accepted solution tend to show high structural similarity to the first accepted code, as captured by CodeBLEU.
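As an illustration of the surface-level churn measure, the sketch below computes a normalized edit distance between consecutive attempts. It assumes NED denotes Levenshtein distance divided by the length of the longer input, computed over characters; the paper's tokenization and normalization may differ. CodeBLEU is not reproduced here because it relies on language-specific parsing.

```python
def levenshtein(a, b):
    """Edit distance between two sequences using a rolling one-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        curr = [i]
        for j, item_b in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                       # delete item_a
                curr[j - 1] + 1,                   # insert item_b
                prev[j - 1] + (item_a != item_b),  # substitute or match
            ))
        prev = curr
    return prev[-1]

def ned(a, b):
    """Edit distance scaled to [0, 1] by the longer input; 0.0 for two empty inputs."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0

# Churn between two consecutive attempts in a hypothetical trajectory.
attempt_1 = "for i in range(n): total += a[i]"
attempt_2 = "for x in a: total += x"
churn = ned(attempt_1, attempt_2)
```

The same functions accept lists, so passing `code.splitlines()` instead of raw strings gives a line-level variant that is less sensitive to small in-line edits.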

Figure 7: Relationship between code churn and turn-wise improvement across consecutive observed attempts.

Figure 8: Relationship between CodeBLEU-based churn and turn-wise progress across consecutive observed turns ($k \geq 2$).

Figure 9: Distribution of CodeBLEU similarity to the first accepted code across attempt-level observations.

Participant prompt space, visualized via TF-IDF and t-SNE, does not reveal clear groupings predictive of outcome, further reinforcing the conclusion that multiple revision and prompt strategies are viable.

Figure 10: Participant-level prompt-space map using TF-IDF and t-SNE.

Revision–Outcome Relationship

The mapping from code churn (both surface-level and code-aware) to outcome is diffuse: neither aggressive rewrites nor small edits guarantee improved solution quality. This is consistent with a multi-path co-creation model in which both incremental refinement and structural redesign can be productive.

Implications

Practical Implications

Multi-Metric Reporting is Mandatory: No single metric sufficiently captures judge quality in this setting; combined discrimination, calibration, thresholded accuracy, and agreement analyses are required for actionable evaluation.
Auditability and Repair are Essential: Strict schema, verification, and repair pipelines are necessary to produce outputs fit for downstream research or operations.
Intermediate Trajectories Matter: Analyses relying only on final submissions are insufficient; only detailed attempt-level logging enables analysis of trajectory structure and learning/revision dynamics.

Theoretical and Methodological Implications

Persistently low inter-judge agreement suggests that LLM judges encode model-specific biases, so judges cannot be assumed to be interchangeable.
Heterogeneity in co-creation progress and revision strategies underscores the importance of moving beyond deterministic evaluation pipelines in programming education and SE.
The integration of trajectory-based and judge-side reliability analysis forms a template for further methodological development, particularly in educational/contest or collaborative SE settings.

Future Work

Directions for future research include scaling the framework to larger, more diverse participant cohorts and problem sets, integrating human raters for comparison, and exploring continual calibration protocols to mitigate cross-model divergence.

Conclusion

The study establishes a robust, reproducible framework for evaluating both human-AI co-creation trajectories and LLM-based code artifact judging. Strong success clustering at early turns, alongside widely varying revision dynamics, highlights the complexity inherent in mixed-initiative coding workflows. On the judge side, substantive variation in discrimination, calibration, and agreement reinforces the necessity of multi-dimensional, reliability-aware reporting. The rubric-driven, schema-constrained pipeline advanced here addresses key measurement gaps and supplies a blueprint for future empirical work in LLM-based evaluation of coding and SE workflows.
