CoderEval: Context-Aware Code Generation Benchmark

Updated 6 September 2025
  • CoderEval is a benchmark for evaluating generative models on synthesizing real-world, context-dependent code from large-scale projects.
  • It features a multi-level taxonomy of context dependencies and employs an automated, project-level testing platform for both Python and Java tasks.
  • Empirical findings highlight significant performance drops as context complexity increases, emphasizing the need for context-integrated model design.

CoderEval is a code generation benchmark specifically constructed to evaluate the capability of generative pre-trained models in synthesizing pragmatic, context-dependent code as it appears in real-world projects. Distinct from benchmarks that focus on standalone functions, CoderEval is designed to capture the challenges intrinsic to practical software development, where code frequently relies on external types, APIs, variables, and project-scale dependencies. It features a rigorously curated set of 460 tasks (230 Python, 230 Java) drawn from genuine open-source repositories and introduces a novel execution-and-context-aware evaluation framework. This benchmark is central in the landscape of code evaluation due to its systematic treatment of contextual dependency and rigorous, project-level automated testing.

1. Motivation and Benchmark Scope

CoderEval addresses the mismatch between prior code generation benchmarks and the actual distribution of function dependencies in real-world codebases. Earlier evaluations—exemplified by HumanEval—test models only on standalone functions restricted to built-ins and standard library usage. However, empirical analysis shows that over 70% of functions in real projects have dependencies on external libraries, local class members, or cross-file/project constructs. Thus, performance on standalone benchmarks overestimates a model’s practical competence.

CoderEval’s 460 tasks are sourced directly from popular open-source projects in both Python and Java, ensuring domain diversity and authenticity. In addition to function signatures and ground truth implementations, each problem includes two versions of the task description—(1) the original project docstring, and (2) a human-labeled docstring—for controlled prompt variation and to assess the effects of data leakage.
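
Concretely, each benchmark entry can be thought of as a structured record combining the signature, both docstrings, the reference implementation, and the required context. The sketch below is illustrative only: the field names are assumptions about what such a record might contain, not the official CoderEval schema.

```python
from dataclasses import dataclass, field

@dataclass
class CoderEvalTask:
    """Illustrative shape of a single CoderEval task; field names are hypothetical."""
    task_id: str                 # unique identifier within the benchmark
    language: str                # "python" or "java"
    project: str                 # source open-source repository
    signature: str               # target function/method signature
    original_docstring: str      # docstring as found in the project
    human_docstring: str         # human-relabeled description (for leakage control)
    ground_truth: str            # reference implementation from the project
    runnable_level: str          # one of the six context-dependency levels
    oracle_context: list[str] = field(default_factory=list)  # context tokens a correct solution must use
```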

2. Granular Context Dependency Taxonomy

A core innovation in CoderEval is its multi-level taxonomy of context dependencies, whereby each generation task is annotated according to the external elements required for correct synthesis and execution. The taxonomy is structured into six mutually exclusive levels:

| Level | Description | Example Context Required |
|---|---|---|
| self-contained | Uses only built-ins | None |
| slib-runnable | Needs standard libraries | math, os modules |
| plib-runnable | Needs public third-party libraries | numpy, requests, guava |
| class-runnable | Needs class-level context | instance variables/methods |
| file-runnable | Needs file-local context | helpers in same file |
| project-runnable | Needs cross-file/project context | utilities in other files |

Any function is uniquely classified into one level based on which external elements it references. The first two levels correspond to typical standalone benchmarks; the latter four reflect realistic, context-heavy scenarios. Consequently, CoderEval systematically targets the underexplored challenges posed by class, file, and project-scoped dependencies.
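
For intuition, the classification can be approximated with static analysis of a function's references and the imports in its enclosing file. The sketch below is a simplified heuristic for Python, not the paper's exact procedure; it assumes Python 3.10+ for `sys.stdlib_module_names`, and the `project_modules` set would come from analyzing the repository.

```python
import ast
import sys
import textwrap

def classify_runnable_level(func_source: str, file_source: str, project_modules: set[str]) -> str:
    """Heuristically assign one of CoderEval's six context levels to a Python function (sketch)."""
    func_tree = ast.parse(textwrap.dedent(func_source))
    file_tree = ast.parse(file_source)

    # Bare names referenced inside the function body.
    used_names = {node.id for node in ast.walk(func_tree) if isinstance(node, ast.Name)}

    # Does the function touch instance state (self.<member>)?
    uses_self_members = any(
        isinstance(node, ast.Attribute)
        and isinstance(node.value, ast.Name)
        and node.value.id == "self"
        for node in ast.walk(func_tree)
    )

    # Top-level helpers/classes defined in the same file.
    file_level_defs = {
        node.name for node in file_tree.body
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))
    }

    # Modules imported anywhere in the file.
    imported = set()
    for node in ast.walk(file_tree):
        if isinstance(node, ast.Import):
            imported |= {alias.name.split(".")[0] for alias in node.names}
        elif isinstance(node, ast.ImportFrom) and node.module:
            imported.add(node.module.split(".")[0])

    used_imports = imported & used_names
    if used_imports & project_modules:
        return "project-runnable"
    if used_names & file_level_defs:
        return "file-runnable"
    if uses_self_members:
        return "class-runnable"
    if used_imports - set(sys.stdlib_module_names):
        return "plib-runnable"
    if used_imports:
        return "slib-runnable"
    return "self-contained"
```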

3. Evaluation Methodology and Platform

CoderEval goes beyond static comparison or lexical similarity. Each generated candidate solution is injected into a full project replica—cloned, set up, and instrumented in a Dockerized environment using pyenv/venv for Python or canonical Java build commands. The original target function is replaced by the model’s output. The evaluation system then executes all relevant test cases, which are automatically collated based on a function call graph analysis and converted into a unified testing schema (“NewTestFile”).
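
The overall flow for a Python task can be approximated as follows. This is a hedged sketch of the described pipeline rather than the actual CoderEval harness: the Docker image name, the `new_test_file.py` filename, and the naive string-replacement injection are assumptions, and the real platform performs call-graph-based test collation and pyenv/venv setup that are omitted here.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path

def evaluate_candidate(project_dir: Path, target_file: str, ground_truth: str,
                       candidate: str, docker_image: str = "codereval-python:latest") -> bool:
    """Inject a generated function into a project replica and run its tests in Docker (sketch)."""
    with tempfile.TemporaryDirectory() as workdir:
        replica = Path(workdir) / "replica"
        shutil.copytree(project_dir, replica)            # work on a throwaway copy of the project

        source_path = replica / target_file
        source = source_path.read_text()
        # Replace the original target function with the model's output (naive textual swap).
        source_path.write_text(source.replace(ground_truth, candidate, 1))

        # Execute the collated tests inside an isolated container to limit side effects.
        result = subprocess.run(
            ["docker", "run", "--rm",
             "-v", f"{replica}:/workspace", "-w", "/workspace",
             docker_image, "python", "-m", "pytest", "new_test_file.py", "-x", "-q"],
            capture_output=True, timeout=600,
        )
        return result.returncode == 0
```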

Evaluation utilizes two central metrics:

  • Pass@K: For each task, n samples are generated (typically n = 10); Pass@k is the proportion of problems for which at least one of k chosen samples passes all automated test cases. The unbiased estimator (commonly reported for k = 1) is:

$$\text{Pass}@k = 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}$$

where $n$ is the number of generated samples and $c$ is the number of samples that pass all tests.

  • Acc@K: Assesses whether the required “oracle_context” tokens (i.e., context-specific types, APIs, or variable/constant names) are correctly included in the generated code among the top $k$ samples. This metric is designed to evaluate not just functional correctness, but whether the model has successfully utilized all context dependencies. Minimal sketches of both metrics follow this list.
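
The Pass@k estimator can be computed numerically as below; this mirrors the standard unbiased formulation from the surrounding definitions and is not tied to any specific CoderEval release.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k: probability that at least one of k drawn samples
    (out of n generated, c of which pass all tests) is correct."""
    if n - c < k:          # any draw of k samples necessarily contains a passing one
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 samples per task, 3 of which pass the project-level tests.
print(pass_at_k(n=10, c=3, k=1))  # 0.3
```

Acc@k can be sketched similarly for a single task. The substring-based matching here is an assumption; the benchmark may instead match oracle_context tokens at the lexer level.

```python
def acc_at_k(samples: list[str], oracle_context: list[str], k: int) -> bool:
    """True if any of the top-k samples uses every required context token (sketch)."""
    return any(
        all(token in sample for token in oracle_context)
        for sample in samples[:k]
    )

# Hypothetical usage: the oracle context names an instance member and a method call.
candidates = ["return self.cache.get(key, default)", "return dict().get(key)"]
print(acc_at_k(candidates, oracle_context=["self.cache", ".get("], k=2))  # True
```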

For both Python and Java, model outputs are validated through these rigorous, project-level test executions, which are sandboxed to limit side effects on the host environment.

4. Empirical Findings and Analysis

The benchmark paper reports extensive experimentation with three advanced systems: CodeGen (both mono- and multi-lingual), PanGu-Coder (language-specific), and ChatGPT (gpt-3.5-turbo). Key observations include:

  • All models perform substantially better on standalone (self-contained, slib-runnable) problems than on context-heavy (plib-, class-, file-, project-runnable) ones.
  • The performance gap is particularly pronounced given that non-standalone functions constitute the majority in production repositories.
  • ChatGPT outperforms both CodeGen and PanGu-Coder across both HumanEval and CoderEval, but its performance also declines sharply with increased context demands.

Token-level analysis divides context references into three types:

  • TypeReference (external types/classes)
  • APIInvocation (API or method calls)
  • VarReference (non-local variable usage)

In Python, models are stronger at generating correct TypeReferences; in Java, they excel at APIInvocation. However, consistent use of all required context remains a challenge. Prompt format sensitivity also differs: monolingual models are less affected by docstring rephrasing than multilingual ones.
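
For Python outputs, a rough approximation of this three-way split can be obtained with the standard `ast` module, as sketched below. The categorization rules are simplified assumptions rather than the paper's exact definitions, and the `known_types` and `local_names` sets would have to be supplied from project analysis.

```python
import ast

def count_reference_types(code: str, known_types: set[str], local_names: set[str]) -> dict[str, int]:
    """Roughly bucket identifiers in generated Python code into the three reference types (sketch)."""
    counts = {"TypeReference": 0, "APIInvocation": 0, "VarReference": 0}
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Call):
            counts["APIInvocation"] += 1          # every call site counts as an API/method invocation
        elif isinstance(node, ast.Name):
            if node.id in known_types:
                counts["TypeReference"] += 1      # reference to an external type or class
            elif node.id not in local_names:
                counts["VarReference"] += 1       # use of a variable not defined locally
    return counts

# Hypothetical usage on a one-line snippet.
print(count_reference_types(
    "result = HttpClient(config.timeout).get(url)",
    known_types={"HttpClient"},
    local_names={"result"},
))  # {'TypeReference': 1, 'APIInvocation': 2, 'VarReference': 2}
```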

5. Implications for Model Design and Evaluation

CoderEval’s systematic taxonomy and dual-metric approach expose the need to explicitly incorporate context retrieval and integration in model design and prompt engineering. The shortcomings highlighted by Acc@K—where models often omit critical context elements—underscore the limitations of current pretrained transformers when exposed to realistic development scenarios.

Notable forward-looking strategies include:

  • Engineering prompts to include all accessible context (“all_context”) as explicit input, potentially improving both Pass@K and Acc@K.
  • Hybrid modeling or ensemble strategies: analysis indicates that different models exhibit complementary strengths, opening the possibility for ensemble approaches.
  • Innovation in training objectives: for instance, including a “context penalty” term such as

$$L_{\text{context}} = 1 - \frac{|\mathcal{C} \cap S|}{|\mathcal{C}|}$$

where $\mathcal{C}$ is the set of required context tokens and $S$ is the set of tokens in the generated code, so that training directly rewards correct context utilization (a minimal sketch of this penalty follows this list).

  • Extension of the corpus beyond Python and Java and iterative refinement of evaluation methodologies.
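
Below is a minimal sketch of such a context penalty over token sets. It is framework-agnostic for clarity; in practice it would be combined with the usual language-modeling loss and made differentiable (for example via soft matching over token probabilities), which is not shown here.

```python
def context_penalty(required_context: set[str], generated_tokens: list[str]) -> float:
    """L_context = 1 - |C ∩ S| / |C|: the fraction of required context tokens
    that the generated code fails to use (sketch; non-differentiable as written)."""
    if not required_context:
        return 0.0
    covered = required_context & set(generated_tokens)
    return 1.0 - len(covered) / len(required_context)

# Example: two of the three required context tokens appear in the output.
print(context_penalty({"self.cache", "Config", "load_defaults"},
                      ["def", "f", "(", "self", ")", "self.cache", "Config"]))  # ≈ 0.333
```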

6. Context within the Benchmarking Ecosystem

CoderEval occupies a transitional position between purely standalone, toy benchmarks and more recent repository- or project-scale evaluations. It bridges the gap by systematically stratifying tasks according to context demand and by providing a robust, automated project-level execution platform. This design aligns closely with current criticisms in the field regarding overestimation of model capability on narrowly defined, context-free benchmarks and provides a template for the next generation of context-aware, execution-driven evaluation frameworks.

Subsequent benchmarks such as ComplexCodeEval and EvoCodeBench have since expanded on these foundations, with greater task variety and explicit measures to avoid data leakage (Feng et al., 16 Sep 2024; Li et al., 30 Oct 2024). Nevertheless, CoderEval remains a foundational reference for studies of context-intensive code generation, particularly due to its principled context taxonomy and dual-metric evaluation design.

7. Directions for Future Research

Ongoing challenges identified by CoderEval inform several research directions:

  • Developing decentralized or retrieval-augmented modeling architectures to handle extensive, multi-level context input.
  • Advancing automatic context summarization so that prompts remain tractable even as project scale increases.
  • Refining metrics beyond Pass@K—potentially adopting ideas from pass depth, structural correctness, and long-range dependency tracking.
  • Regularly updating evaluation datasets to minimize contamination and follow evolving code practices.

By isolating the impact of various context levels, CoderEval provides empirical grounding for these research threads and establishes clear benchmarks for future progress in pragmatic code generation.