- The paper presents a novel LLM-as-a-Judge framework that operationalizes evaluation of iterative human-AI co-creation in coding through schema enforcement and reliability metrics.
- It employs multi-metric analysis including ROC-AUC, MCC, and calibration scores to assess judge performance and code revision dynamics.
- Results reveal significant inter-judge variability and highlight the importance of detailed, trajectory-level logging for effective evaluation in programming contests.
LLM-as-a-Judge for Human-AI Co-Creation in Coding: A Reliability-Aware Evaluation Framework
Introduction and Motivation
The intersection of LLM evaluation and human-AI co-creation introduces substantial methodological challenges, particularly in programming and software engineering (SE) domains. Traditional code evaluation, focused narrowly on correctness, underrepresents the complexity of iterative, co-creative problem solving, in which human partners interact with AI agents across multiple attempts and under contest constraints. The reviewed work formulates a schema-constrained, reliability-aware LLM-as-a-Judge paradigm designed to address the auditability, replicability, and interpretability deficits common in existing LLM-based evaluation pipelines, particularly for iterative coding workflows.
Figure 1: Overview of the LLM-as-a-Judge and human-AI co-creation workflow.
Framework and Methodology
Task Setting and Data
Empirical evaluation is anchored in a competitive real-world environment: a coding contest with two parallel tracks, one conventional and the other permitting unrestricted AI assistance. This yields a rich dataset comprising chronological submission logs, participant-level prompt histories, source code revisions, and external verdicts for each attempt. Each (user, problem) pair forms a trajectory, allowing for analysis of both artifact quality and multi-turn co-creation dynamics.
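A minimal sketch of how attempt logs might be grouped into (user, problem) trajectories, and how a whole trajectory can be assigned to a single data split. The `Attempt` field names, the verdict strings, and the hash-based `split_of` helper are illustrative assumptions, not the paper's actual schema or split procedure.

```python
from collections import defaultdict
from dataclasses import dataclass
import hashlib


@dataclass(frozen=True)
class Attempt:
    # Hypothetical field names; the real log schema is not given in the source.
    user: str
    problem: str
    timestamp: float
    verdict: str  # e.g. "AC" for accepted, anything else for rejected


def build_trajectories(attempts):
    """Group attempts by (user, problem) and order each group chronologically."""
    groups = defaultdict(list)
    for a in attempts:
        groups[(a.user, a.problem)].append(a)
    return {key: sorted(g, key=lambda a: a.timestamp) for key, g in groups.items()}


def split_of(key, test_fraction=0.2):
    """Assign a whole trajectory to one split via a stable hash of its key,
    so attempts from the same trajectory never straddle two splits."""
    digest = hashlib.sha256("|".join(key).encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF
    return "test" if bucket < test_fraction else "train"


attempts = [
    Attempt("u1", "A", 30.0, "WA"),
    Attempt("u1", "A", 10.0, "WA"),
    Attempt("u1", "A", 45.0, "AC"),
    Attempt("u2", "A", 12.0, "AC"),
]
trajectories = build_trajectories(attempts)
```

Hashing the trajectory key rather than sampling individual attempts is one simple way to keep a split assignment stable across reruns.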
Rubric-Driven Judging Pipeline
The framework operationalizes LLM-as-a-Judge by enforcing a schema-constrained output: judges must return rubric items comprising a probabilistic acceptance score (p∈[0,1]), ordinal sub-scores on [1,5] for algorithmic adequacy and for robustness/constraint handling, and a rationale. Valid outputs are ensured via systematic verification and repair, supported by checkpointed API inference, bounded retries, and canonical attempt IDs. The design supports reproducibility and prevents data leakage through grouped splits at the trajectory level.
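The verification-and-repair loop described above could look roughly like the following sketch. The JSON key names (`p_accept`, `algorithmic_adequacy`, `robustness`, `rationale`), the `call_judge` callback, and the retry count are assumptions made for illustration; the source does not specify them, and checkpointing is omitted for brevity.

```python
import json

SUB_SCORES = ("algorithmic_adequacy", "robustness")  # hypothetical key names


def validate(raw):
    """Return the parsed record if raw satisfies the assumed schema, else None."""
    try:
        rec = json.loads(raw)
    except (TypeError, ValueError):
        return None
    if not isinstance(rec, dict):
        return None
    p = rec.get("p_accept")
    if isinstance(p, bool) or not isinstance(p, (int, float)) or not 0.0 <= p <= 1.0:
        return None
    for name in SUB_SCORES:
        s = rec.get(name)
        if isinstance(s, bool) or not isinstance(s, int) or not 1 <= s <= 5:
            return None
    rationale = rec.get("rationale")
    if not isinstance(rationale, str) or not rationale.strip():
        return None
    return rec


def judge_with_repair(call_judge, attempt_id, max_retries=2):
    """Query the judge up to 1 + max_retries times; return the first valid record."""
    for try_index in range(max_retries + 1):
        rec = validate(call_judge(attempt_id, try_index))
        if rec is not None:
            return {"attempt_id": attempt_id, **rec}
    return None  # the caller records the failure instead of inventing a score


def fake_judge(attempt_id, try_index):
    # Stand-in for an API call: the first reply is out of range, the second is valid.
    if try_index == 0:
        return '{"p_accept": 1.7}'
    return json.dumps({"p_accept": 0.8, "algorithmic_adequacy": 4,
                       "robustness": 3, "rationale": "Handles edge cases."})


result = judge_with_repair(fake_judge, "u1/A/3")
```

Returning `None` after the retry budget is exhausted keeps invalid outputs out of the scored set, which is one way to make failures auditable.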
Multi-Metric Judge Evaluation
Evaluation incorporates an array of complementary metrics:
- Discrimination: ROC-AUC, PR-AUC quantify threshold-free ranking.
- Thresholded Quality: MCC (with thresholds set on validation, frozen for test).
- Calibration and Reliability: LogLoss, Brier Score, ECE.
- Inter-Judge Agreement: Pairwise Cohen’s κ, Fleiss’ κ.
- Revision Dynamics: NED (normalized edit distance) and CodeBLEU for quantifying code changes.
This suite enables a granular diagnosis of judge behavior beyond binary accuracy.
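As an illustration of the discrimination and thresholded-quality items in the list above, the sketch below computes ROC-AUC from pairwise rankings and selects an MCC-maximizing threshold on validation data before freezing it for test. The threshold grid and the toy labels are assumptions; the source does not state how thresholds were searched.

```python
import math


def roc_auc(y_true, scores):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels; 0.0 if undefined."""
    tp = sum(1 for y, z in zip(y_true, y_pred) if y == 1 and z == 1)
    tn = sum(1 for y, z in zip(y_true, y_pred) if y == 0 and z == 0)
    fp = sum(1 for y, z in zip(y_true, y_pred) if y == 0 and z == 1)
    fn = sum(1 for y, z in zip(y_true, y_pred) if y == 1 and z == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def pick_threshold(y_val, p_val, grid=None):
    """Choose the grid threshold with the highest validation MCC."""
    grid = grid or [i / 20 for i in range(1, 20)]
    return max(grid, key=lambda t: mcc(y_val, [int(p >= t) for p in p_val]))


# Toy data: the threshold is tuned on validation only, then reused unchanged on test.
y_val, p_val = [1, 0, 1, 0], [0.9, 0.4, 0.7, 0.6]
y_test, p_test = [1, 0, 0, 1], [0.8, 0.3, 0.66, 0.5]
threshold = pick_threshold(y_val, p_val)
val_mcc = mcc(y_val, [int(p >= threshold) for p in p_val])
test_mcc = mcc(y_test, [int(p >= threshold) for p in p_test])
test_auc = roc_auc(y_test, p_test)
```

In this toy case the frozen threshold separates the validation set perfectly but not the test set, which is the kind of gap that tuning on test data would hide.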
Experimental Results
Among the judges assessed (OpenAI/gpt-5.2, DeepSeek, Gemini, Claude), performance is non-uniform across metrics:

Figure 2: Judge-level curves for (a) ROC-AUC, (b) PR-AUC.
DeepSeek attains the highest PR-AUC (0.6904) and ties for the best ROC-AUC (0.5937), outperforming OpenAI and Gemini in both discrimination and calibration. Calibration and thresholded metrics nevertheless diverge strongly across judges: DeepSeek's thresholded decisions (test MCC 0.5000) are markedly better than those of OpenAI (0.0755) and Claude (0.3371), which illustrates the need for explicit threshold selection and validation.
Figure 3: Judge-level reliability (calibration) curves, showing ECE behavior across bins.
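The calibration metrics behind such reliability curves can be sketched as follows. Equal-width binning with ten bins is an assumption here; the source does not report the binning scheme used for ECE.

```python
import math


def brier(y, p):
    """Mean squared difference between predicted probability and outcome."""
    return sum((pi - yi) ** 2 for yi, pi in zip(y, p)) / len(y)


def log_loss(y, p, eps=1e-12):
    """Mean negative log-likelihood, with probabilities clipped away from 0 and 1."""
    total = 0.0
    for yi, pi in zip(y, p):
        pi = min(max(pi, eps), 1 - eps)
        total -= yi * math.log(pi) + (1 - yi) * math.log(1 - pi)
    return total / len(y)


def ece(y, p, n_bins=10):
    """Expected calibration error over equal-width probability bins."""
    bins = [[] for _ in range(n_bins)]
    for yi, pi in zip(y, p):
        idx = min(int(pi * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((yi, pi))
    err = 0.0
    for b in bins:
        if b:
            observed = sum(yi for yi, _ in b) / len(b)   # empirical acceptance rate
            predicted = sum(pi for _, pi in b) / len(b)  # mean predicted probability
            err += len(b) / len(y) * abs(observed - predicted)
    return err


# Toy labels (1 = accepted) and judge probabilities.
y = [1, 0, 1, 1]
p = [0.9, 0.1, 0.8, 0.3]
scores = {"brier": brier(y, p), "logloss": log_loss(y, p), "ece": ece(y, p)}
```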
Inter-judge agreement is low: mean pairwise Cohen's κ is 0.1592, and Fleiss' κ across all four judges is only 0.0696, indicating that LLMs diverge materially in their accept/reject judgments even when schema and labeling policies are controlled.
Figure 4: Pairwise Cohen's kappa heatmap on TEST.
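For reference, the two agreement statistics reported above can be computed as in this sketch, using the standard definitions of Cohen's κ and Fleiss' κ. The judge names and binary decisions are toy values, not data from the paper.

```python
from itertools import combinations


def cohen_kappa(a, b):
    """Chance-corrected agreement between two raters over the same items."""
    n = len(a)
    labels = set(a) | set(b)
    p_o = sum(x == y for x, y in zip(a, b)) / n
    p_e = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    return 1.0 if p_e == 1 else (p_o - p_e) / (1 - p_e)


def fleiss_kappa(ratings):
    """ratings[i][j] is the label judge j gave item i; all judges rate every item."""
    n_items, n_raters = len(ratings), len(ratings[0])
    labels = sorted({l for row in ratings for l in row})
    counts = [[list(row).count(l) for l in labels] for row in ratings]
    p_j = [sum(c[k] for c in counts) / (n_items * n_raters) for k in range(len(labels))]
    p_i = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
           for row in counts]
    p_bar = sum(p_i) / n_items
    p_e = sum(q * q for q in p_j)
    return 1.0 if p_e == 1 else (p_bar - p_e) / (1 - p_e)


# Toy accept (1) / reject (0) decisions from three hypothetical judges on six attempts.
judges = {
    "J1": [1, 1, 0, 0, 1, 0],
    "J2": [1, 0, 0, 0, 1, 1],
    "J3": [1, 1, 1, 0, 0, 0],
}
pairwise = {(a, b): cohen_kappa(judges[a], judges[b]) for a, b in combinations(judges, 2)}
mean_pairwise = sum(pairwise.values()) / len(pairwise)
fleiss = fleiss_kappa(list(zip(*judges.values())))
```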
Participant-Side Co-Creation Dynamics
Progress Distribution and Timing
Human-AI dyads exhibit highly front-loaded success distributions: Success@Turn is 0.8533 on the first recorded attempt, climbing only marginally to 0.8641 by turn 6.
Figure 5: Success@Turn curve showing the cumulative fraction of trajectories resolved by or before each observed turn.
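A minimal sketch of the Success@Turn computation as defined in the caption: the cumulative fraction of trajectories resolved by or before each turn. The input encoding (first accepted turn per trajectory, `None` if never accepted) is an assumption for illustration.

```python
def success_at_turn(first_success_turns, max_turn):
    """first_success_turns holds, per trajectory, the 1-based turn of the first
    accepted attempt, or None if the trajectory never reached acceptance."""
    n = len(first_success_turns)
    return [
        sum(1 for t in first_success_turns if t is not None and t <= k) / n
        for k in range(1, max_turn + 1)
    ]


# Five toy trajectories: three solved on turn 1, one on turn 3, one never solved.
curve = success_at_turn([1, 1, 1, 3, None], max_turn=4)
```

Because unsolved trajectories stay in the denominator, the curve is non-decreasing and plateaus below 1 when some trajectories never succeed.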
Survival analysis confirms that time-to-success is clustered at the start of most trajectories, with diminishing success rates on subsequent turns.
Figure 6: Kaplan-Meier survival curve estimating time-to-success across participant-problem trajectories.
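The Kaplan-Meier estimate behind such a curve can be sketched with the standard product-limit formula. Here "survival" means the trajectory has not yet been solved; treating unsolved trajectories as censored at their last observed turn is an assumption about how the data would be encoded.

```python
def kaplan_meier(durations, observed):
    """Return [(t, S(t))] at each turn t where at least one success occurred.

    durations[i]: turn of first success, or last observed turn if censored.
    observed[i]:  True if success was observed, False if the trajectory is censored.
    """
    survival = 1.0
    curve = []
    for t in sorted(set(durations)):
        at_risk = sum(1 for d in durations if d >= t)
        events = sum(1 for d, o in zip(durations, observed) if d == t and o)
        if events:
            survival *= 1 - events / at_risk
            curve.append((t, survival))
    return curve


# Toy data: three successes at turn 1, one at turn 3, one trajectory censored at turn 4.
km = kaplan_meier([1, 1, 1, 3, 4], [True, True, True, True, False])
```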
Revision and Code Churn
Code revision behavior is heterogeneous: both high and low-magnitude code changes (measured via NED and CodeBLEU) are observed across positive and negative progress increments, and successful convergence to accepted code is characterized by high structural similarity as captured by CodeBLEU.
Figure 7: Relationship between code churn and turn-wise improvement across consecutive observed attempts.
Figure 8: Relationship between CodeBLEU-based churn and turn-wise progress across consecutive observed turns (k≥2).
Figure 9: Distribution of CodeBLEU similarity to the first accepted code across attempt-level observations.
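Surface-level churn between consecutive attempts can be illustrated with a normalized edit distance. This sketch assumes NED is the character-level Levenshtein distance divided by the length of the longer string, which is one common convention; the source does not state the exact normalization, and CodeBLEU is not reproduced here.

```python
def levenshtein(a, b):
    """Character-level edit distance using a rolling dynamic-programming row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


def ned(a, b):
    """Edit distance scaled to [0, 1] by the longer input (assumed normalization)."""
    longest = max(len(a), len(b))
    return 0.0 if longest == 0 else levenshtein(a, b) / longest


def churn_series(revisions):
    """NED between each pair of consecutive revisions in one trajectory."""
    return [ned(x, y) for x, y in zip(revisions, revisions[1:])]


revisions = ["print(a+b)", "print(a + b)", "print(sum([a, b]))"]
churn = churn_series(revisions)
```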
Participant prompt space, visualized via TF-IDF and t-SNE, does not reveal clear groupings predictive of outcome, further reinforcing the conclusion that multiple revision and prompt strategies are viable.
Figure 10: Participant-level prompt-space map using TF-IDF and t-SNE.
Revision–Outcome Relationship
The mapping from code churn (both surface-level and code-aware) to outcome is diffuse: neither aggressive rewrites nor small edits guarantee improved solution quality, consistent with a multi-path co-creation model in which both incremental refinement and structural redesign are valuable.
Implications
Practical Implications
- Multi-Metric Reporting is Mandatory: No single metric sufficiently captures judge quality in this setting; combined discrimination, calibration, thresholded accuracy, and agreement analyses are required for actionable evaluation.
- Auditability and Repair are Essential: Strict schema, verification, and repair pipelines are necessary to produce outputs fit for downstream research or operations.
- Intermediate Trajectories Matter: Analyses that rely on final submissions alone are insufficient; detailed attempt-level logging is needed to analyze trajectory structure and learning/revision dynamics.
Theoretical and Methodological Implications
- Persistently low inter-judge agreement suggests that LLM judges encode model-specific biases, so judges cannot be assumed interchangeable.
- Heterogeneity in co-creation progress and revision strategies underscores the importance of moving beyond deterministic evaluation pipelines in programming education and SE.
- The integration of trajectory-based and judge-side reliability analysis forms a template for further methodological development, particularly in educational/contest or collaborative SE settings.
Future Work
Directions for future research include scaling the framework to larger, more diverse participant cohorts and problem sets, integrating human raters for comparison, and exploring continual calibration protocols to mitigate cross-model divergence.
Conclusion
The study establishes a robust, reproducible framework for evaluating both human-AI co-creation trajectories and LLM-based code artifact judging. Strong success clustering at early turns, alongside widely varying revision dynamics, highlights the complexity inherent in mixed-initiative coding workflows. On the judge side, substantive variation in discrimination, calibration, and agreement reinforces the necessity of multi-dimensional, reliability-aware reporting. The rubric-driven, schema-constrained pipeline advanced here addresses key measurement gaps and supplies a blueprint for future empirical work in LLM-based evaluation of coding and SE workflows.