ACES: Who Tests the Tests? Leave-One-Out AUC Consistency for Code Generation

Published 5 Apr 2026 in cs.LG | (2604.03922v1)

Abstract: Selecting LLM-generated code candidates using LLM-generated tests is challenging because the tests themselves may be incorrect. Existing methods either treat all tests equally or rely on ad-hoc heuristics to filter unreliable tests. Yet determining test correctness requires knowing which codes are correct, creating a circular dependency. Our key insight is that we need not determine test correctness at all: test votes should rank, not merely count. What matters is not how many codes pass a test, but whether the test can distinguish correct from incorrect code. We break the circular dependency via leave-one-out evaluation: hold out one test, rank codes by their aggregate scores on all remaining tests, and measure whether the held-out test's pass/fail pattern agrees with this ranking. We formalize this agreement as the leave-one-out AUC (LOO-AUC) and prove that the expected LOO-AUC is proportional to each test's ability to separate correct code from incorrect code. Building on this, we propose ACES (AUC ConsistEncy Scoring) with two complementary variants: ACES-C provides closed-form weights that provably approximate the oracle in expectation under a mild assumption on average test quality; ACES-O drops this assumption and iteratively optimizes a differentiable LOO-AUC objective. Both operate solely on the binary pass matrix with negligible overhead, and achieve state-of-the-art Pass@$k$ on multiple code generation benchmarks.


Authors (7)

Summary

The paper introduces ACES, which uses a leave-one-out strategy to compute AUC scores that distinguish informative tests from misleading ones in code generation evaluation.
It presents two algorithms—ACES-C with a closed-form solution and ACES-O using gradient ascent—to weight tests based on their discriminative power.
Empirical results demonstrate that ACES achieves high Pass@1 scores on benchmarks like HumanEval, HumanEval+, and MBPP, effectively enhancing code selection performance.

ACES: Leave-One-Out AUC Consistency for Test Quality Assessment in Code Generation

Problem Formulation and Circularity in Code/Test Evaluation

The ACES framework is motivated by the challenge of post-hoc code selection for LLM-generated code using LLM-generated tests, where both code and test correctness are unknown and highly variable. The canonical approach aggregates test results in a binary pass matrix $B \in \{0,1\}^{n \times m}$ with $n$ candidate codes and $m$ tests. The central technical obstacle is the circular dependency: reliable tests are needed to determine code correctness, but reliable codes are necessary to appraise test quality.

Existing approaches, such as majority voting and consensus-set scoring (e.g., CodeT), treat all tests equally or apply heuristic filtering; other methods (e.g., MBR-exec, SRank) require expensive pairwise output analysis. No prior execution-only post-hoc method provides formal guarantees for reliable test identification from the pass matrix alone.
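The equal-weight baseline can be made concrete with a small sketch. It assumes that "majority voting" means scoring each code by the number of tests it passes; the toy matrix and the helper names `majority_vote_scores` and `select_best` are illustrative and not taken from the paper.

```python
# Toy pass matrix B: rows are candidate codes, columns are tests.
# B[i][j] == 1 means code i passes test j. Values are illustrative only.
B = [
    [1, 1, 1, 0],  # code 0
    [1, 1, 0, 0],  # code 1
    [0, 1, 0, 1],  # code 2
]


def majority_vote_scores(B):
    """Score each code by how many tests it passes (every test weighted equally)."""
    return [sum(row) for row in B]


def select_best(scores):
    """Return the index of the highest-scoring code (the first index wins ties)."""
    return max(range(len(scores)), key=lambda i: scores[i])


scores = majority_vote_scores(B)
best = select_best(scores)
print(scores, best)
```

Every column contributes the same amount to the score here, so an incorrect test counts exactly as much as a correct one.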

LOO-AUC: Theory and Discriminative Power

The principal insight of ACES is that, for the purpose of ranking, a test's value lies not in its ground-truth correctness but in its discriminative power: the ability to separate correct codes from incorrect codes. Each test is characterized by its class-conditional pass rates, $\alpha_j$ on correct codes and $\beta_j$ on incorrect codes, and by its discriminative power $\delta_j = \alpha_j - \beta_j$. Figure 1 depicts the resulting taxonomy of tests.

Figure 1: For ranking, only $\delta_j$ matters. Taxonomy of tests by correctness $\alpha_j$ and discriminative power $\delta_j$.
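To illustrate the quantities $\alpha_j$, $\beta_j$, and $\delta_j$, the sketch below computes their empirical counterparts when ground-truth code labels are available. In the selection setting those labels are unknown, so this is an oracle view for intuition only; the function name `discriminative_power` and the toy data are assumptions.

```python
def discriminative_power(B, is_correct):
    """Return empirical (alpha, beta, delta) per test.

    alpha[j]: fraction of correct codes that pass test j
    beta[j]:  fraction of incorrect codes that pass test j
    delta[j]: alpha[j] - beta[j]
    """
    correct = [row for row, ok in zip(B, is_correct) if ok]
    incorrect = [row for row, ok in zip(B, is_correct) if not ok]
    if not correct or not incorrect:
        raise ValueError("need at least one correct and one incorrect code")
    m = len(B[0])
    alpha = [sum(r[j] for r in correct) / len(correct) for j in range(m)]
    beta = [sum(r[j] for r in incorrect) / len(incorrect) for j in range(m)]
    delta = [a - b for a, b in zip(alpha, beta)]
    return alpha, beta, delta


# Illustrative data: two correct codes followed by two incorrect codes.
B = [
    [1, 1, 0],
    [1, 1, 0],
    [0, 1, 1],
    [0, 1, 1],
]
is_correct = [True, True, False, False]
alpha, beta, delta = discriminative_power(B, is_correct)
print(delta)  # [1.0, 0.0, -1.0]
```

The three columns show the three cases that matter for ranking: a test that separates correct from incorrect codes (positive $\delta_j$), a test passed by everything (zero $\delta_j$), and a test that favors incorrect codes (negative $\delta_j$).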

Rather than attempting to estimate the absolute correctness of each test, ACES breaks the evaluation circularity via a leave-one-out (LOO) approach: for each test $t_j$, codes are ranked using the votes from the remaining tests; the degree to which $t_j$'s outcomes agree with this ranking is measured via the AUC between the held-out test's column and the aggregate code ranking ("LOO-AUC").

Formally, define the LOO-AUC for test $t_j$ as $\mathrm{AUC}(B_{:,j}, s^{(-j)})$, where $s^{(-j)}$ is the leave-$j$-out code score vector. Theoretical analysis (Theorem 1 in the paper) shows the expected LOO-AUC is proportional to the unknown ground-truth discriminative power of $t_j$ ($\delta_j$), where the coefficient is a function of marginal pass rates and task difficulty. This result, the LOO-AUC Identity, is the first provable criterion for identifying both informative and misleading tests directly from the binary pass matrix.
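The leave-one-out procedure can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the remaining tests are summed with equal weight, ties in the AUC count one half, and a held-out column with only passes or only fails is assigned 0.5. The helper names `auc` and `loo_auc` are hypothetical.

```python
def auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 1/2.

    Returns 0.5 when one of the two classes is empty (assumed convention).
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        return 0.5
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def loo_auc(B, j):
    """AUC of test j's pass/fail column against code scores from all other tests."""
    m = len(B[0])
    loo_scores = [sum(row[k] for k in range(m) if k != j) for row in B]
    held_out = [row[j] for row in B]
    return auc(held_out, loo_scores)


# Illustrative pass matrix: the last test disagrees with the other three.
B = [
    [1, 1, 1, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
]
print([loo_auc(B, j) for j in range(4)])
```

In this toy matrix the fourth test passes exactly the codes that the other tests rank lowest, so its LOO-AUC is 0.0, while the first test obtains 0.75.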

ACES Algorithms: Closed-form and Optimized Weighting

ACES introduces two algorithms operating exclusively on the binary execution matrix:

ACES-C (Closed-form): Test weights are proportional to the pass-rate-corrected LOO-AUC excess ( $n$ 7). This weighting is provably near-oracle optimal under weak conditions on the average test quality, assigning zero weight to misleading tests. ACES-C involves no hyperparameters and negligible additional cost over the majority voting baseline.
ACES-O (Optimized): Instead of relying on the average-quality assumption, ACES-O directly maximizes a differentiable surrogate of the sum of weighted LOO-AUCs via gradient ascent. This iterative method is robust to the presence of a large fraction of misleading tests and adapts the weights jointly, recovering informative tests that ACES-C may underweight in adversarial regimes.
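A closed-form weighting in the spirit of ACES-C might look like the sketch below. It assumes weights proportional to the clipped excess of LOO-AUC over the chance level 0.5; the pass-rate correction mentioned above is omitted because its exact form is not given here, so this is a simplified stand-in rather than the paper's formula. The names `aces_c_like_weights` and `weighted_scores`, the uniform fallback, and the toy matrix are all assumptions.

```python
def auc(labels, scores):
    """Pairwise AUC with ties counted 1/2; 0.5 if a class is empty (assumed)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        return 0.5
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def loo_auc(B, j):
    """AUC of column j against equal-weight scores from the remaining tests."""
    m = len(B[0])
    loo_scores = [sum(row[k] for k in range(m) if k != j) for row in B]
    return auc([row[j] for row in B], loo_scores)


def aces_c_like_weights(B):
    """Weights proportional to max(0, LOO-AUC - 0.5), normalized to sum to 1.

    Falls back to uniform weights if no test exceeds chance (assumed convention).
    """
    m = len(B[0])
    raw = [max(0.0, loo_auc(B, j) - 0.5) for j in range(m)]
    total = sum(raw)
    return [r / total for r in raw] if total > 0 else [1.0 / m] * m


def weighted_scores(B, w):
    """Score each code by the weighted sum of the tests it passes."""
    return [sum(wj * b for wj, b in zip(w, row)) for row in B]


B = [
    [1, 1, 1, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
]
w = aces_c_like_weights(B)
scores = weighted_scores(B, w)
print(w, scores)
```

Clipping at zero reflects the statement that misleading tests receive zero weight: the fourth test, whose LOO-AUC falls below 0.5, no longer contributes to any code's score.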

The complementarity of the two methods arises from their operating regimes: ACES-C is highly effective when the average test discriminative power is bounded away from zero (the typical case in practical code generation test sets), while ACES-O excels when this assumption is violated.
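For the optimized variant, the following sketch shows one way a differentiable surrogate of the weighted LOO-AUC sum could be maximized. Everything specific here is an assumption made for illustration: the sigmoid smoothing with temperature `temp`, the softmax parametrization of the weights, the central-difference gradient, and the step size are not taken from the paper, and the function names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(theta):
    """Map unconstrained parameters to positive weights that sum to 1."""
    mx = max(theta)
    exps = [math.exp(t - mx) for t in theta]
    total = sum(exps)
    return [e / total for e in exps]


def soft_loo_auc(B, w, j, temp=0.25):
    """Sigmoid-smoothed AUC of column j against weighted scores from other tests."""
    m = len(B[0])
    s = [sum(w[k] * row[k] for k in range(m) if k != j) for row in B]
    pos = [s[i] for i, row in enumerate(B) if row[j] == 1]
    neg = [s[i] for i, row in enumerate(B) if row[j] == 0]
    if not pos or not neg:
        return 0.5  # assumed convention for a degenerate column
    total = sum(sigmoid((p - q) / temp) for p in pos for q in neg)
    return total / (len(pos) * len(neg))


def objective(B, theta):
    """Sum over tests of weight times smoothed LOO-AUC."""
    w = softmax(theta)
    return sum(w[j] * soft_loo_auc(B, w, j) for j in range(len(w)))


def optimize(B, steps=200, lr=0.5, eps=1e-5):
    """Gradient ascent with numerical gradients; returns the best iterate seen."""
    m = len(B[0])
    theta = [0.0] * m  # start from uniform weights
    best_theta, best_val = list(theta), objective(B, theta)
    for _ in range(steps):
        grad = []
        for j in range(m):
            up, down = list(theta), list(theta)
            up[j] += eps
            down[j] -= eps
            grad.append((objective(B, up) - objective(B, down)) / (2 * eps))
        theta = [t + lr * g for t, g in zip(theta, grad)]
        val = objective(B, theta)
        if val > best_val:
            best_theta, best_val = list(theta), val
    return softmax(best_theta), best_val


B = [
    [1, 1, 1, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
]
start_val = objective(B, [0.0] * 4)
weights, final_val = optimize(B)
print(weights, start_val, final_val)
```

Because each test's weight also changes the scores used to evaluate every other test, the weights are adapted jointly, which is the property the text attributes to ACES-O. A practical implementation would likely use automatic differentiation instead of finite differences.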

Empirical Evaluation

ACES is evaluated on HumanEval, HumanEval+, and MBPP, using GPT-3.5-Turbo-generated codes and tests, and is compared against established reranking methods and recent execution-based and static-analysis baselines. On all benchmarks, ACES-O achieves the highest Pass@k among execution-only methods and is competitive with or superior to methods exploiting static analysis or LLM-based "verifiers".

Notably, ACES achieves:

HumanEval: 84.15% Pass@1 (ACES-O), +15.8 over GPT-3.5-Turbo zero-shot
HumanEval+: 74.39% Pass@1 (ACES-O), outperforming execution-based and most hybrid methods
MBPP: 72.37% Pass@1 (ACES-O), consistent with HumanEval, though static analysis still provides an orthogonal boost

When combined in an ensemble with static-analysis-based signals (e.g., DS^3), ACES further improves absolute Pass@1 on all datasets, confirming the orthogonality and practical salience of its execution-derived test quality weights.

Figure 3: Pass@k versus $k$ on all three benchmarks; ACES, combined with static analysis pre-filtering, yields the best results, especially at low $k$ where ranking quality is paramount.
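For readers unfamiliar with the metric in a reranking setting, the sketch below computes Pass@k under one common reading: a task counts as solved if any of the $k$ highest-scoring codes is correct. The tie-breaking rule (lower index first) and the function names are assumptions; the paper's exact evaluation protocol may differ.

```python
def pass_at_k(scores, is_correct, k):
    """Return 1 if any of the k highest-scoring codes is correct, else 0.

    sorted() is stable, so codes with equal scores keep their index order.
    """
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    return int(any(is_correct[i] for i in order[:k]))


def mean_pass_at_k(tasks, k):
    """Average pass_at_k over (scores, is_correct) pairs, one pair per task."""
    return sum(pass_at_k(s, c, k) for s, c in tasks) / len(tasks)


# Two illustrative tasks with three candidate codes each.
tasks = [
    ([3, 2, 2], [True, False, False]),  # the top-ranked code is correct
    ([1, 4, 2], [False, False, True]),  # the correct code is ranked second
]
print(mean_pass_at_k(tasks, 1), mean_pass_at_k(tasks, 2))  # 0.5 1.0
```

At $k = 1$ only the top-ranked code matters, which is why ranking quality has the largest effect at low $k$.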

Robustness, Test Quality Detection, and Analysis

Detailed ablations reveal:

Assumption Analysis: Empirically, the average test quality assumption is satisfied for 70–83% of non-trivial tasks (i.e., tasks with both correct and incorrect codes present). For these cases, ACES-C achieves nearly optimal weighting; for the remainder, ACES-O brings further gains.
Sensitivity to Test/Candidate Pool: ACES-O improves with larger candidate/test pools due to richer pairwise signal; ACES-C is robust even with small samples due to its closed form.
Test Quality Identification: The sign of the ACES-C weight ( $m$ 2) identifies informative (versus misleading) tests with $m$ 3 accuracy across all benchmarks. False positives/negatives are concentrated among tests with weak discriminative power (near $m$ 4), limiting their practical impact.
Figure 5: Test quality detection: ACES-C attribute ( $m$ 5 vs. ground-truth discriminative power $m$ 6, aggregated over three benchmarks; saturated colors indicate misclassifications.
Robustness to Misleading Tests: ACES-O is substantially less sensitive to misleading-test removal than MV, confirming misleading tests are already heavily down-weighted in ACES's learned weights.
Computational Cost: Both ACES variants scale efficiently: ACES-C requires $m$ 7 operations (negligible vs. code execution); ACES-O converges in $m$ 81s per task in practice for standard problem sizes.

Implications and Future Directions

The ACES framework demonstrates that leave-one-out (internal) consistency is a theoretically sound and practically effective criterion for distilling discriminative signal from a pool of noisy, machine-generated tests, without access to ground-truth labels or expensive external verifier models. This principle should readily transfer to other contexts featuring noisy evaluators, such as crowdsourced labeling or LLM-as-judge ensembles, and suggests new directions for stable online test set curation, active test generation, and LLM evaluation design.

Moreover, the explicit decomposition of test quality into discriminative power offers a foundation for future adaptive test generation, dynamic test selection, and more robust model-based evaluation pipelines, particularly as code LLMs generate increasingly diverse and non-i.i.d. outputs.

Conclusion

ACES (AUC Consistency Scoring) establishes the first provable, execution-only framework for identifying both informative and misleading tests for code generation solely from binary evaluation outcomes. The LOO-AUC identity enables robust, theoretically justified test weighting that translates to strong practical code selection performance, scalable computation, and seamless integration with orthogonal selection signals. The ACES methodology opens a class of principled meta-evaluation approaches for noisy-response verification and selection in program synthesis and beyond.
