Controllable Code Completion Benchmark

Updated 4 July 2026

The paper introduces C3-Bench, which formalizes controllable code completion by requiring completions to pass unit tests and adhere to specific natural language instructions.
It categorizes tasks into Implementation-Control Completion (ICC) and Scale-Control Completion (SCC), assessing both detailed implementation requirements and completion granularity.
The benchmark employs metrics like Pass@1, Instruction-Following Rate (IF), and Edit Similarity (ES) to provide actionable insights on model performance under explicit behavioral constraints.

Searching arXiv for the benchmark paper and closely related repository-level code completion benchmarks to ground the article in the primary literature. Controllable Code Completion Benchmark (C3-Bench) is an instruction-guided code completion benchmark designed to evaluate whether a model can complete code not only correctly but also in the specific manner requested by a natural-language instruction. It formalizes a shift from conventional completion, which evaluates only functional correctness under code context, to Controllable Code Completion (CCC), where the missing middle code must satisfy both executable requirements and instruction adherence. In its initial release, C3-Bench comprises 2,195 Python instances and is presented as the first benchmark explicitly targeting this capability in infilling-style code completion (Zhang et al., 22 Jan 2026).

1. Conceptual definition and task formulation

The benchmark defines a conventional completion instance as a tuple $(P, S, G, T)$ , where $P$ is the prefix code, $S$ the suffix code, $G$ the ground-truth middle code, and $T$ the unit tests. C3-Bench extends this to $(P, S, G, T, I)$ , adding an instruction $I$ that constrains how the completion should be written. The learning objective is stated as

$M(P_i, S_i, I_i) \rightarrow G_i$

for a model $M$ over a dataset of controllable completion instances (Zhang et al., 22 Jan 2026).

This formulation separates two dimensions that conventional completion benchmarks typically conflate. The first is functional correctness, evaluated by whether the completed program satisfies tests. The second is instruction adherence, evaluated by whether the implementation follows the requested method, structure, or scope. The benchmark is motivated by the observation that many implementations can satisfy the same tests while differing substantially in implementation strategy or completion granularity. This makes correctness alone insufficient when code completion is used in interactive systems such as Copilot Chat or Cursor, where developers often request that the missing code be implemented in a particular way (Zhang et al., 22 Jan 2026).

A central implication is that C3-Bench evaluates infilling under explicit behavioral constraints rather than generic hole filling. This distinguishes it from repository-level executable benchmarks that improve realism through multi-file context and execution, but do not evaluate whether the model obeys user instructions during completion (Yang et al., 2024).

2. Benchmark composition and task taxonomy

C3-Bench contains 2,195 high-quality Python CCC instances. It is divided into two task families: Implementation-Control Completion (ICC) with 1,286 instances, and Scale-Control Completion (SCC) with 909 instances. ICC evaluates whether a model can implement the missing code according to a specified implementation requirement. SCC evaluates whether a model can generate code of the requested scope or granularity; because SCC targets structural scope rather than full functional behavior, it does not use unit tests (Zhang et al., 22 Jan 2026).

Task family	Category	Instances
ICC	Structural Specification Requirements	111
ICC	Algorithmic Implementation Requirements	502
ICC	Control Flow Requirements	547
ICC	Critical Parameter Requirements	126
SCC	Line Span Completion	97
SCC	Multi-line Completion	467
SCC	Statement Block Completion	345

The ICC categories cover four implementation requirement types. Structural Specification Requirements include data structure definitions, composite type design, class or interface structure, and data model design. Algorithmic Implementation Requirements cover specific algorithmic approaches, computational logic, transformation procedures, and optimization strategies. Control Flow Requirements concern execution flow, branch logic, loop structure, and exception handling. Critical Parameter Requirements concern variable definitions, parameter passing, state variable management, and configuration settings (Zhang et al., 22 Jan 2026).

The SCC categories cover three scope-control modes. Line Span Completion targets partial code-line completion. Multi-line Completion requires generation of a specified number of complete lines. Statement Block Completion requires a specific control structure block, explicitly including IF STATEMENT BLOCK, FOR STATEMENT BLOCK, and WHILE STATEMENT BLOCK (Zhang et al., 22 Jan 2026).

Each instance consists of prefix code $P$ , suffix code $P$ 0, ground-truth middle code $P$ 1, instruction $P$ 2, and unit tests $P$ 3 for ICC only. Token statistics reported in the benchmark indicate that the tasks are not limited to trivial one-line infills. For ICC overall, instruction, prefix, middle, and suffix tokens are reported as 4 / 27 / 10, 27 / 1447 / 413, 5 / 709 / 73, and 1 / 1455 / 57 in min/max/mean form; for SCC overall, the corresponding values are 6 / 11 / 8, 133 / 2717 / 655, 2 / 1083 / 66, and 3 / 1825 / 77 (Zhang et al., 22 Jan 2026).

3. Construction methodology

C3-Bench is constructed from HumanEval and SAFIM, which are used as executable source datasets for deriving controllable infilling tasks. The construction process begins with AST-based extraction using tree-sitter-languages to create more substantial middle-code spans than those present in the original datasets. The extraction traverses and manipulates ASTs to mask nodes at multiple levels and extract “logically complete code blocks that maintain semantic coherence.” For 30% of instances, the pipeline additionally masks 3–5 consecutive lines, specifically to support SCC tasks (Zhang et al., 22 Jan 2026).

For ICC, the benchmark then generates functionally equivalent but implementation-diverse versions of the extracted middle code. The models used for this stage are GPT-4o-2024-11-20, Claude3.5-Sonnet-20241022, DeepSeek-V3, and Qwen2.5-Coder-32B-Instruct. Multiple CCC instances may therefore share the same prefix and suffix while differing in both instruction and target implementation. Generated implementations are retained only if they pass all unit tests, and the paper reports that over 50% of ICC cases contain three distinct implementations for the same code context (Zhang et al., 22 Jan 2026).

Further filtering imposes PEP8-consistent readability, a length constraint requiring the middle code to be at most 30% of total context, significant implementation diversity, and algorithmic efficiency. Instruction creation differs by task family: ICC instructions are manually crafted implementation specifications, whereas SCC instructions are generated by Claude3.5-Sonnet. Instruction quality is then validated through expert review by five senior Python developers and automated consistency checking with Claude3.5-Sonnet (Zhang et al., 22 Jan 2026).

The benchmark examples make the intended form of controllability concrete. One example uses the same SPFA code context with alternative instructions requiring either Small Label First (SLF) optimization using a deque or Large Label Last (LLL) optimization using a queue and queue statistics. Another appendix example uses the same labyrinth path-finding context with alternative instructions for iterative DFS with a stack, recursive DFS with parent pointers, or BFS with a queue. An SCC example requires generating only a single for statement block and nothing else. These examples are intended to show that multiple semantically plausible completions may exist for the same code context, while only some satisfy the requested control (Zhang et al., 22 Jan 2026).

4. Evaluation protocol and metrics

C3-Bench evaluates models under two prompting modes: FIM special tokens for models that natively support fill-in-the-middle completion, and a ChatML-style prompt format for other, mostly chat-oriented models. For open-source models, inference uses vLLM, greedy sampling, and an output limit of 1024 tokens. The study evaluates 40+ models, including families such as Qwen2.5-Coder, DeepSeek-Coder, DeepSeek-V3, StarCoder2, CodeLlama, Yi-Coder, Codestral, OpenCoder, GPT-4/GPT-4o, Claude 3.5, Gemini, and the o1 series (Zhang et al., 22 Jan 2026).

Three metrics are used: Pass@1, Instruction-Following Rate (IF), and Edit Similarity (ES). Pass@1 is used for ICC only and measures whether the generated completion passes the associated unit tests. IF is the benchmark’s central new metric. For ICC, IF is evaluated only among functionally correct cases, so a completion must first pass tests and then be judged for compliance with the instruction. For SCC, IF is measured by fully automatic structural checks: AST-based node type matching for statement or block requirements, and length-based verification for line-count requirements. ES is used as a supplementary static-analysis metric (Zhang et al., 22 Jan 2026).

For ICC instruction adherence, the benchmark uses LLM-as-judge semantic validation. The primary judge is Claude3.5-Sonnet and an alternative cheaper judge is Qwen2.5-32B-Instruct. The judgment prompt requests a binary assessment based on instruction adherence and alignment with ground-truth implementation aspects such as function definitions, data structures, algorithm steps, and control flow. The paper reports that the judge achieved 98% agreement with senior Python developers across 10 independent assessment rounds (Zhang et al., 22 Jan 2026).

Methodologically, this evaluation protocol departs from prior completion benchmarks that typically report only exact match, edit similarity, or unit-test pass rate. The key claim is that controllable completion requires an explicit measure of whether the requested implementation or scope was followed. This is also where C3-Bench differs most sharply from adjacent realism-focused benchmarks such as R²C²-Bench, which varies context perturbation and retrieval conditions but does not benchmark user-specified constraints or instruction-following behavior (Deng et al., 2024).

5. Empirical findings and the Qwen2.5-Coder-C3 model

The main empirical finding is that strong performance on conventional completion benchmarks does not imply strong performance on controllable completion. The paper reports that several open-source code LLMs remain competitive or superior on traditional benchmarks such as CrossCodeEval, RepoEval, CrossCodeLongEval, and SAFIM, yet perform poorly on C3-Bench, especially on SCC. For example, Qwen2.5-Coder-32B achieves ICC Pass@1 58.1, ICC IF 38.7, but SCC IF 5.2; DeepSeek-Coder-33B-Base achieves ICC Pass@1 48.1, ICC IF 32.0, and SCC IF 5.2 (Zhang et al., 22 Jan 2026).

The benchmark’s strongest open model is Qwen2.5-Coder-32B-C $P$ 4, which is fine-tuned specifically for controllable completion. Its reported scores are ICC ES 49.3, ICC Pass@1 62.0, ICC IF 52.5, SCC ES 44.2, SCC IF 80.7, and Average IF 66.6. Among the reported proprietary systems, the highest average IF values are lower: o1-2024-12-17 reaches 61.3, Claude3.5-Sonnet-20241022 reaches 55.8, and o1-preview reaches 53.0 (Zhang et al., 22 Jan 2026).

To obtain this model, the authors construct a synthetic supervised fine-tuning pipeline on Python GitHub code from The Stack v2 / StarCoder2 data source. Phase 1 uses Claude3.5-Sonnet to generate 1,000 high-quality instruction-completion pairs as seed examples. Phase 2 uses Qwen2.5-Coder-32B-Instruct with those seeds as few-shot demonstrations to produce a large synthetic corpus. The final training set contains 200,000 synthetic instruction-completion pairs. Training uses 64 NVIDIA A100-80GB GPUs, Adam, learning rate $P$ 5, 50 warmup steps, global batch size 1024, tensor parallel size 2, and 4K token truncation, with 10-gram decontamination against C3-Bench (Zhang et al., 22 Jan 2026).

The gains from this specialization are large. For Qwen2.5-Coder-32B-Instruct versus Qwen2.5-Coder-32B-C $P$ 6, ICC Pass@1 improves from 49.8 to 62.0, ICC IF from 28.8 to 52.5, SCC IF from 16.9 to 80.7, and Average IF from 22.8 to 66.6. For Qwen2.5-Coder-1.5B-Instruct versus Qwen2.5-Coder-1.5B-C $P$ 7, ICC Pass@1 improves from 22.9 to 39.7, ICC IF from 0.7 to 29.6, SCC IF from 8.0 to 66.8, and Average IF from 4.3 to 48.2 (Zhang et al., 22 Jan 2026).

The paper also reports measurable trade-offs on conventional benchmarks. For Qwen2.5-Coder-32B-C $P$ 8 relative to the base Qwen2.5-Coder-32B, average EM/ES on CrossCodeEval changes from 57.1 / 86.8 to 55.9 / 86.2; on RepoEval, average EM changes from 51.6 to 51.8 while ES drops from 78.5 to 77.0; on CrossCodeLongEval, average EM/ES drops from 36.9 / 66.4 to 29.1 / 60.5; and on SAFIM, average pass drops from 71.2 to 67.5. This suggests that training for controllable completion is not equivalent to optimizing general completion performance, and may redistribute capability across evaluation regimes (Zhang et al., 22 Jan 2026).

6. Position within the benchmark landscape and stated limitations

C3-Bench occupies a distinct position relative to neighboring completion benchmarks. Conventional completion benchmarks such as HumanEval, CrossCodeEval, RepoEval, and SAFIM primarily evaluate whether code is functionally correct given context. Repository-level benchmarks such as ExecRepoBench strengthen realism by using multi-file context, AST-aligned masking granularity, and repository-level executable checking, but they do not evaluate whether the model can follow explicit user instructions during completion (Yang et al., 2024). Developer-centric benchmarks such as Codev-Bench move toward realistic cursor-centric and boundary-sensitive completion behavior, yet still operationalize intent largely through scenario structure rather than explicit instruction-conditioned control (Pan et al., 2024). C3-Bench’s distinctive contribution is therefore not repository realism alone, but the addition of explicit instruction adherence as a first-class evaluation target (Zhang et al., 22 Jan 2026).

The paper states several limitations. First, C3-Bench is currently Python-only. Second, it is presently in-file rather than repository-level. Third, broader context scenarios remain future work, including multi-language and repository-level controllable completion. The paper also notes that ICC performance remains constrained by the base model’s coding ability, so instruction tuning cannot fully compensate for weak functional competence (Zhang et al., 22 Jan 2026).

Other limitations are partly explicit and partly methodological. ICC instruction adherence depends on an LLM judge, even though the reported agreement with senior Python developers is high. SCC is evaluated through structural checks rather than semantic tests, so it measures granularity compliance rather than executable correctness. Benchmark construction also relies partly on synthesized implementations and instructions, though these are filtered by unit tests, expert review, and automated consistency checks (Zhang et al., 22 Jan 2026).

In benchmark-design terms, C3-Bench proposes an evaluation philosophy rather than only a dataset. The benchmark argues that future code-completion evaluation should measure at least functional correctness, instruction adherence, and structural or granularity control when appropriate. A plausible implication is that controllable completion benchmarking will eventually need to combine C3-Bench’s explicit instruction layer with the repository-level execution realism developed in adjacent work, yielding benchmarks that test whether a model can complete code correctly, in context, and under specified developer controls (Zhang et al., 22 Jan 2026).