
Evaluation Completion Rate (ECR@1)

Updated 8 December 2025
  • ECR@1 is a reliability metric defined as the proportion of first-attempt completions that pass rigorous execution or schema tests.
  • It applies both to repository-level code completion and to LLM-based evaluation agents, providing a practical measure of model dependability.
  • ECR@1 addresses the limitations of static matching by quantifying operational correctness through execution and test-based validations.

Evaluation Completion Rate at Top-1 (ECR@1) is a reliability and functional correctness metric for model outputs, measuring the fraction of first-attempt completions that satisfy a benchmark’s strong, scenario-specific validity criterion. ECR@1 has become especially prominent in two distinct application domains: executable code completion in repository-level benchmarks, and structured output validation for automated evaluation agents. While its core mathematical form is consistent—a normalized count of “successful” top-1 responses—the operationalization of “success” varies by context, capturing, for example, program pass rates under rigorous unit tests or syntactic/schema-conformance for evaluation agents. The metric directly addresses the gap inherent in purely static string-matching protocols by quantifying the operational dependability of model outputs under practical constraints.

1. Formal Definition and Variants

ECR@1 is formally defined as the proportion of test cases for which a model’s first-choice output satisfies all required postconditions. In repository-level code completion, this entails compilation, integration, and passage of all unit tests; in LLM-based evaluation agents, it reduces to the generation of a valid, schema-conformant output on the first try. The general form is

$$\mathrm{ECR@1} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}\left(\text{top-1 output for } d_i \text{ is valid}\right)$$

where $N$ is the total number of evaluation problems and $\mathbf{1}$ is the indicator function; a minimal computation sketch follows the list below. In (Huang et al., 1 Dec 2025), the metric is

$$\mathrm{ECR@1} = \frac{\#\{\text{evaluations with valid JSON on first API call}\}}{N} \times 100\%$$

while in repository-level code completion (Yang et al., 16 Dec 2024), the success condition is that the completion:

  • parses/compiles,
  • integrates into the benchmarked repository,
  • passes all associated unit tests.
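As a minimal illustration, the general formula above can be computed directly from per-problem first-attempt outcomes. The helper below is a sketch with hypothetical names, not code taken from either benchmark's released tooling.

```python
from typing import Iterable


def ecr_at_1(first_attempt_valid: Iterable[bool]) -> float:
    """Fraction of problems whose top-1 output satisfies the benchmark's
    validity criterion (all unit tests pass, or the JSON parses and conforms)."""
    outcomes = list(first_attempt_valid)
    if not outcomes:
        raise ValueError("need at least one evaluation problem")
    return sum(outcomes) / len(outcomes)


# Hypothetical example: 3 of 4 first attempts are valid -> ECR@1 = 0.75
print(ecr_at_1([True, True, False, True]))
```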

2. Execution-based Measurement Protocols

In repository-level code completion benchmarks such as ExecRepoBench (Yang et al., 16 Dec 2024), ECR@1 is grounded in an execution-first philosophy. Each sample comprises a masked snippet (function, statement, or expression), context files from the same repository, and a suite of pytest-based unit tests specific to the masked region. For each sample:

  1. The top-1 model completion is inserted into the masked region with all context files assembled.
  2. The modified repository is run in a clean execution environment, typically invoking pytest under a strict, per-sample time limit.
  3. A “pass” is reported only if the completion compiles, executes without import or syntax errors, and all unit tests pass without uncaught exceptions.

This protocol enforces a strict operational contract for “success,” ensuring real-world viability of model completions; a minimal harness sketch follows.
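The sketch below assumes a per-sample layout in which the masked file and a pytest suite live in a working copy of the repository. The function and path names (`run_sample`, `tests/test_sample.py`, the `<MASK>` placeholder) are illustrative assumptions, not ExecRepoBench's actual tooling.

```python
import subprocess
from pathlib import Path


def run_sample(repo_dir: Path, target_file: Path, completion: str,
               placeholder: str = "<MASK>", timeout_s: int = 60) -> bool:
    """Insert the top-1 completion at the masked site and run the sample's
    pytest suite inside the repository; return True only on a clean pass."""
    source = target_file.read_text()
    target_file.write_text(source.replace(placeholder, completion, 1))
    try:
        result = subprocess.run(
            ["pytest", "-q", "tests/test_sample.py"],  # illustrative test path
            cwd=repo_dir, capture_output=True, timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False  # exceeding the per-sample time limit counts as failure
    finally:
        target_file.write_text(source)  # restore the repository state
    return result.returncode == 0  # exit code 0: all unit tests passed
```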

By contrast, for schema-conformant automated evaluation (e.g., LLM-as-a-Judge (Huang et al., 1 Dec 2025)), the process focuses on first-pass output parsing:

  • Given N evaluation prompts, each LLM response is checked for syntactic JSON validity and the presence of all mandatory fields.
  • Any failure to produce a parseable, schema-valid response necessitates a retry.
  • ECR@1 reports the fraction of samples that succeed on the first attempt, reflecting practical reliability and cost efficiency; a minimal validity check is sketched below.
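A first-pass validity check along these lines might look as follows. The required field names are hypothetical, and a real harness may use a full JSON Schema validator instead of a field-set check.

```python
import json

REQUIRED_FIELDS = {"verdict", "score", "rationale"}  # hypothetical schema fields


def first_attempt_valid(response_text: str) -> bool:
    """True if the first API response parses as JSON and contains every
    mandatory field; anything else forces a retry and counts against ECR@1."""
    try:
        payload = json.loads(response_text)
    except json.JSONDecodeError:
        return False
    return isinstance(payload, dict) and REQUIRED_FIELDS <= payload.keys()


responses = ['{"verdict": "pass", "score": 5, "rationale": "ok"}', "not json at all"]
flags = [first_attempt_valid(r) for r in responses]
print(sum(flags) / len(flags))  # ECR@1 over these two responses = 0.5
```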

3. Representative Datasets and Problem Settings

ExecRepoBench (Yang et al., 16 Dec 2024) is a repository-level code completion benchmark with 1,200 completion problems from 50 actively maintained open-source Python repositories. Problems are stratified by masking strategy:

  • Random spans (single/multiline)
  • AST-based: expressions (407), statements (266), functions (377)

Each evaluation problem is equipped with hand-engineered pytest suites targeting the completion site and allows up to roughly 32,000 code tokens of context to capture realistic dependencies; a hypothetical sample layout is sketched below. In addition to ExecRepoBench, ECR@1 is computed on the MultiPL-E multilingual code generation suite, using similar execution-based test harnesses across eight supported languages.
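For concreteness, one evaluation record might be laid out roughly as follows; the field names and values are hypothetical and do not reproduce ExecRepoBench's released schema.

```python
# Hypothetical shape of one repository-level completion problem; field names
# are illustrative, not ExecRepoBench's actual data format.
sample = {
    "repo": "example-org/example-repo",           # one of the 50 source repositories
    "mask_type": "ast_function",                  # random_span | ast_expression | ast_statement | ast_function
    "target_file": "pkg/module.py",               # file containing the masked region
    "masked_code": "def normalize(values):\n    <MASK>\n",
    "context_files": ["pkg/__init__.py", "pkg/utils.py"],  # up to ~32k tokens of repository context
    "tests": ["tests/test_normalize.py"],         # hand-engineered pytest suite for the completion site
}
```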

In structured evaluation agent benchmarks (Huang et al., 1 Dec 2025), the data comprises expert-annotated test scripts and prompts requiring LLMs to output valid structured data per a strict schema. Evaluations are repeated across multiple model families (GPT-4, GPT-5, open-weight GPT-OSS) and reasoning-effort settings to assess robustness and cost.

4. Empirical Results and Model Comparisons

Table 1: Example ECR@1 Results for Selected Models

| Model | ECR@1: ExecRepoBench (%) | ECR@1: MultiPL-E (%) | ECR@1: LAJ (Gherkin bench) (%) |
|---|---|---|---|
| Qwen2.5-Coder-Instruct-C (7B) | 44.2 | 76.4 | – |
| StarCoder (7B) | 33.8 | – | – |
| GPT-4o Mini | – | – | 96.6 |
| GPT-4o | – | – | 100.0 |
| GPT-OSS 20B (high reasoning) | – | – | 85.4 |

Qwen2.5-Coder-Instruct-C, tuned with multi-level AST masking and instruction data, achieves substantial ECR@1 improvements over its base model and prior baselines, more than doubling the repository-level pass rate (44.2% vs. 19.8%). On the LAJ benchmark (Huang et al., 1 Dec 2025), system reliability varies widely: the best open-weight model at high reasoning achieves only 85.4% ECR@1, while GPT-4 and GPT-5 configurations with low-to-moderate reasoning reliably exceed 96%, up to 100%. This diversity illustrates the direct operational impact of model and prompt configuration on real-world completion reliability.

5. Contrasts With Static String Metrics

ECR@1 is categorically distinct from static metrics such as Exact Match, Edit Similarity, or CodeBLEU (Liu et al., 28 Oct 2024, Yang et al., 16 Dec 2024). Static string metrics evaluate surface-level token overlap or structural similarity but cannot verify that output is executable, semantically correct, or robust to unseen test cases. ECR@1, by contrast, directly assesses “functional correctness” under authentic deployment constraints. For example, Granite-Coder-8B attains only 2% Edit Similarity (ES) but 29% ECR@1 on ExecRepoBench—demonstrating that outputs can be structurally divergent from reference solutions yet correct and executable. A plausible implication is that overreliance on surface metrics risks misestimating model deployment readiness.

6. Operational Trade-offs and Practical Significance

High ECR@1 translates directly into predictable cost, latency, and deployment stability. In LLM-based agent settings, retries inflate both wall-clock time and cost per evaluation; for example, an 85.4% ECR@1 inflates adjusted cost by 18.1% relative to nominal pricing (Huang et al., 1 Dec 2025). Perfect ECR@1 indicates no retries and no operational penalty, while lower values signal “non-deterministic” deployment with greater headroom for failure and cost overruns. Further, model size and reasoning configuration exert complex, family-dependent effects: increasing reasoning effort improves accuracy in some families (GPT-5) but reduces both accuracy and ECR@1 in others (open-weight models), underscoring the need for targeted benchmarking.
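The 18.1% figure comes from the paper's own adjusted-cost accounting. As a rough sanity check, a simple geometric retry model, which is an assumption of this sketch rather than of the source, yields a comparable overhead:

```latex
% Assumption: each attempt succeeds independently with probability p = ECR@1,
% and retries continue until a valid response is produced.
\mathbb{E}[\text{API calls per evaluation}]
  = \sum_{k=1}^{\infty} k\,(1-p)^{k-1} p
  = \frac{1}{p}
  \approx \frac{1}{0.854}
  \approx 1.17
```

That is, roughly 17% more API calls than a retry-free run, in the same ballpark as the reported 18.1% adjusted-cost inflation.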

7. Concrete Computation Examples

ECR@1 computation in practice:

  • For code completion: the model’s top-1 function body is inserted into the corresponding masked site, and the repository test suite is invoked; if all tests pass, that sample is scored 1, otherwise 0.
  • For LLM evaluation agents: for each test script, if the first API call yields a valid, schema-conforming JSON object, the sample is marked a success (1); otherwise it is discarded and retried until a valid response is produced, but ECR@1 counts only the first attempt (see the retry sketch below).
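The retry bookkeeping in the second bullet can be sketched as follows; `call_model` and `is_valid` are hypothetical callables standing in for the API client and the schema check.

```python
def evaluate_with_retries(prompts, call_model, is_valid, max_retries=3):
    """Score ECR@1 strictly on first attempts while still retrying failed
    evaluations so that every prompt eventually yields a usable response."""
    first_attempt_successes = 0
    responses = []
    for prompt in prompts:
        response = call_model(prompt)            # first attempt: the only one ECR@1 sees
        if is_valid(response):
            first_attempt_successes += 1
        else:
            for _ in range(max_retries):         # retries recover output, not the metric
                response = call_model(prompt)
                if is_valid(response):
                    break
        responses.append(response)
    return first_attempt_successes / len(prompts), responses
```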

Concrete sample scenarios clarify the metric’s granularity—success hinges not merely on plausible outputs but on strict adherence to the operational “contract” (compilation success, test passage, schema validation) in its intended use case (Yang et al., 16 Dec 2024, Huang et al., 1 Dec 2025).
