CRUXEval: Code Reasoning Benchmark
- CRUXEval is a benchmark that evaluates fine-grained code execution reasoning in LLMs by requiring simulation of both forward output prediction and inverse input inference.
- It employs a formal evaluation protocol using deterministic Python functions with accuracy metrics and pass@k to assess model performance.
- The benchmark drives advances in process supervision, multilingual evaluation, and diagnostic tools to enhance transparent code simulation and debugging.
CRUXEval is a code reasoning benchmark and evaluation protocol designed to assess the fine-grained code execution and reasoning capabilities of LLMs. Its primary focus is to move beyond superficial code generation and probe models for their ability to mentally simulate program execution, covering both forward (“what does this function return?”) and inverse (“what input yields this output?”) reasoning. By foregrounding execution correctness over natural-language specification or partial trace prediction, CRUXEval serves as a challenging testbed for models and a foundation for program analysis frameworks.
1. Formal Evaluation Protocol and Task Structure
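CRUXEval consists of a dataset $\mathcal{D} = \{(f_i, x_i, y_i)\}_{i=1}^{N}$, where each $f_i$ is a Python function, $x_i$ is an input tuple, and $y_i = f_i(x_i)$ is its exact output. It defines two execution-reasoning tasks:
- Output Prediction (OP): Given $f_i$ and $x_i$, predict the exact output $y_i$; the model's prediction is written $\hat{y}_i$.
- Input Prediction (IP): Given $f_i$ and $y_i$, predict an input $\hat{x}_i$ such that $f_i(\hat{x}_i) = y_i$.
Accuracy is evaluated via exact match after canonicalizing Python literals, ignoring trivial formatting variations:
$$\mathrm{correct}_{\mathrm{OP}}(i) = \mathbb{1}\left[\hat{y}_i = y_i\right]$$
$$\mathrm{correct}_{\mathrm{IP}}(i) = \mathbb{1}\left[f_i(\hat{x}_i) = y_i\right]$$
Overall accuracy for a task $t \in \{\mathrm{OP}, \mathrm{IP}\}$ is:
$$\mathrm{Acc}_t = \frac{1}{N} \sum_{i=1}^{N} \mathrm{correct}_t(i)$$
CRUXEval uses accuracy as the primary metric, since every instance can be scored deterministically: an output prediction is compared with the single ground-truth output, and an input prediction is checked against the target output. For generative models, pass@k is also reported, especially for stochastic sampling (Gu et al., 2024, Liu et al., 30 Jan 2025).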
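The scoring rules above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the benchmark's actual harness: the helper names (`canonicalize`, `score_output`, `score_input`) and the sample function `f` are hypothetical, canonicalization is assumed to mean parsing with `ast.literal_eval`, and `pass_at_k` is the commonly used unbiased estimator, which may differ in detail from what a given evaluation reports.

```python
import ast
from math import comb


def canonicalize(literal_src: str):
    """Parse a Python literal so that formatting differences disappear."""
    return ast.literal_eval(literal_src.strip())


def score_output(predicted_src: str, expected) -> bool:
    """OP: the predicted literal must equal the recorded output."""
    try:
        return canonicalize(predicted_src) == expected
    except (ValueError, SyntaxError, TypeError):
        return False  # unparseable predictions count as wrong


def score_input(func, predicted_args_src: str, expected) -> bool:
    """IP: any argument list that reproduces the output is accepted."""
    try:
        # Wrap the text as a tuple so "'ab', 2" and "5" both become argument tuples.
        args = ast.literal_eval(f"({predicted_args_src.strip()},)")
        return func(*args) == expected
    except Exception:
        return False  # bad literals and runtime errors count as wrong


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which are correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def f(s, n):  # hypothetical benchmark-style function, not from the dataset
    return s[::-1] * n


print(score_output(" [1,  2,3] ", [1, 2, 3]))   # True: whitespace is ignored
print(score_input(f, "'ab', 2", "baba"))        # True
print(score_input(f, "'abab', 1", "baba"))      # True: a different valid input
print(pass_at_k(n=5, c=5, k=1))                 # 1.0
```

Note that Python's `==` treats `1`, `1.0`, and `True` as equal, so a stricter harness would also compare types; the sketch leaves that out for brevity.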
2. Dataset Composition, Complexity, and Coverage
The original CRUXEval benchmark comprises 800 deterministic Python functions, each paired with one designated test input–output pair. Function features include:
- Lines of code: 5–20 LOC (mean ≈10).
- Cyclomatic complexity (CC): mean ≈3, measured as $E - V + 2P$ on the control-flow graph (with $E$ edges, $V$ nodes, and $P$ connected components).
- Construct categories (post-hoc grouping):
  - Basic arithmetic (e.g., `return 3*x+5`)
  - Conditionals (`if`/`else`)
  - Loops (`for` or `while`)
  - Recursion (tail recursion)
  - Nested constructs (e.g., `for` within `if`)
Dataset construction combined LLM-driven synthesis, coverage bootstrapping, and manual curation to ensure diversity in control flow, data types, and function complexity. All functions are deterministic, side-effect free, and easy to evaluate (Gu et al., 2024, Liu et al., 30 Jan 2025).
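To make the construct categories and the complexity measure concrete, the sketch below defines three short functions in the style described above and estimates their cyclomatic complexity. The functions are hypothetical illustrations written for this section, not items from the dataset, and `approx_cyclomatic_complexity` is an assumed helper that uses the common "decision points + 1" approximation rather than building a full control-flow graph.

```python
import ast

# Hypothetical benchmark-style functions, kept as source text so they can be
# both executed and analysed.
ARITH_SRC = '''
def f(x):
    return 3 * x + 5
'''

BRANCH_SRC = '''
def f(n):
    if n < 0:
        return "neg"
    elif n == 0:
        return "zero"
    return "pos"
'''

NESTED_SRC = '''
def f(words):
    total = 0
    for w in words:
        if len(w) > 3:
            for ch in w:
                if ch in "aeiou":
                    total += 1
    return total
'''

DECISION_NODES = (ast.If, ast.For, ast.While, ast.IfExp,
                  ast.ExceptHandler, ast.comprehension)


def approx_cyclomatic_complexity(source: str) -> int:
    """Count decision points in the source and add one."""
    decisions = 0
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, DECISION_NODES):
            decisions += 1
        elif isinstance(node, ast.BoolOp):
            decisions += len(node.values) - 1  # each extra and/or operand branches
    return decisions + 1


def run(source: str, *args):
    """Execute the source and call the function named f that it defines."""
    namespace = {}
    exec(source, namespace)
    return namespace["f"](*args)


print(run(ARITH_SRC, 4), approx_cyclomatic_complexity(ARITH_SRC))      # 17 1
print(run(BRANCH_SRC, 0), approx_cyclomatic_complexity(BRANCH_SRC))    # zero 3
print(run(NESTED_SRC, ["tree", "at", "queue"]),
      approx_cyclomatic_complexity(NESTED_SRC))                        # 6 5
```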
3. Design Objectives and Rationale
CRUXEval was devised to systematically test whether LLMs can “simulate” program execution, rather than rely on memorization or surface-level matching:
- Forward and Inverse Reasoning: Output prediction probes symbolic (mental) execution. Input prediction requires models to “invert” a program (solve for an $x$ such that $f(x) = y$).
- Focus on Code Reasoning: By abstracting away from natural-language specification or code synthesis, CRUXEval isolates the execution reasoning sub-problem, in contrast to HumanEval and similar tasks.
- Coverage of Constructs: An LLM-driven function generator ensured broad coverage of Python’s basic arithmetic, branching, looping, and nesting, while avoiding degenerate or non-terminating cases.
- Generalization: CRUXEval emphasizes behaviors and error patterns not captured in prior I/O-only or trace-comparison benchmarks, facilitating fine-grained diagnosis and curriculum design (Gu et al., 2024, Liu et al., 30 Jan 2025).
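The asymmetry between forward and inverse reasoning can be illustrated with a small search. In this sketch (all names and the sample function are hypothetical, and brute-force enumeration is an assumption chosen for clarity, not how any model or the benchmark solves the task), the forward direction is a single call, while the inverse direction must explore a candidate space and may find several valid answers.

```python
from itertools import product


def f(nums, k):  # hypothetical benchmark-style function, not from the dataset
    out = []
    for n in nums:
        if n % k == 0:
            out.append(n // k)
    return out


def invert_by_search(func, target, candidates):
    """Return every candidate argument tuple whose result equals target."""
    return [args for args in candidates if func(*args) == target]


# Forward reasoning: one deterministic evaluation.
print(f([2, 4, 5], 2))  # [1, 2]

# Inverse reasoning: enumerate lists of length 0-2 over 1..4 and k in 1..3.
lists = [list(p) for r in range(3) for p in product(range(1, 5), repeat=r)]
candidates = [(nums, k) for nums in lists for k in range(1, 4)]
solutions = invert_by_search(f, [1, 2], candidates)
print(solutions)  # [([1, 2], 1), ([2, 4], 2)]
```

Even in this tiny space two distinct inputs produce the same output, which is why input prediction is scored by executing the candidate rather than by comparing it with one reference input.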
CRUXEval-X (Xu et al., 2024) generalizes this paradigm to 19 languages and introduces a fully automated translation and repair pipeline, enabling robust multilingual evaluation of reasoning, input/output inference, and code generation.
4. Empirical Results, Model Performance, and Failure Modes
Across multiple evaluations, CRUXEval reveals a substantial performance gap between leading closed-source and open-source models:
| Model | Output Acc (%) |
|---|---|
| CodeLlama-Instruct-34B | 41.5 |
| DeepSeekCoder-33B | 58.9 |
| SemCoder-S-6.7B | 52.0 |
| StarCoder-15B | 53.0 |
| Gemini-1.5-Pro | 74.4 |
| GPT-4-Turbo | 86.6 |
- Trends: Instruction-tuned, closed models (e.g., GPT-4, Gemini) outperform the open-source models in the table by roughly 15–45 percentage points (74.4 vs. 58.9 at the narrow end, 86.6 vs. 41.5 at the wide end).
- Complexity Sensitivity: Accuracy degrades strongly with cyclomatic complexity.
- Common Failures: Nested loops and deep recursion defeat most models. Off-by-one and edge-case errors are frequent on branches.
- Input Inversion: Even state-of-the-art models rarely exceed 40% on input prediction.
- Fine-tuning and CoT: Trace-based supervision and chain-of-thought (CoT) steps (especially when grounded in actual execution traces) yield up to 30-point gains in output prediction (Thakur et al., 28 Nov 2025). However, despite process supervision (e.g., on execution traces or scratchpads), the benchmark remains unsolved, indicating LLMs’ persistent weaknesses in transparent program simulation.
5. Extensions and Analytical Tools
ExeRScope
ExeRScope (Liu et al., 30 Jan 2025) enables in-depth analysis of CRUXEval outcomes:
- Construct Tagging: Assigns each function static/dynamic tags (Basic, If, For, While, Nested Loops, etc.) and aggregates per-construct accuracy.
- Complexity Measures: Reports CC, loop depth, recursion depth, lines of code; establishes correlations (e.g., average accuracy on nested loops is 15 points lower than on flat functions).
- Type Sensitivity: Differentiates between “Type-Match” and “Value-Match,” revealing that list-valued outputs yield lower accuracy than simple integers.
- Visualization: Offers statistical plots and significance testing to rapidly test hypotheses (e.g., does nesting more than one `if` clause reduce accuracy by more than 10 percentage points? Yes, at 95% confidence).
- Generalization Analysis: Enables extrapolation of observed weaknesses to unseen code, facilitating targeted model improvement strategies.
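A rough sketch of construct tagging and per-construct aggregation is given below. It is not ExeRScope's implementation: the function names, the tag set, and the static-only analysis with Python's `ast` module are assumptions made for illustration, and the correctness flags in `records` are invented sample data rather than benchmark results.

```python
import ast
from collections import defaultdict


def loop_depth(node, depth=0):
    """Maximum nesting depth of for/while loops at or below node."""
    deepest = depth
    for child in ast.iter_child_nodes(node):
        is_loop = isinstance(child, (ast.For, ast.While))
        deepest = max(deepest, loop_depth(child, depth + 1 if is_loop else depth))
    return deepest


def tag_constructs(source: str) -> set:
    """Assign static construct tags to a function given as source text."""
    tree = ast.parse(source)
    tags = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.If):
            tags.add("If")
        elif isinstance(node, ast.For):
            tags.add("For")
        elif isinstance(node, ast.While):
            tags.add("While")
    if loop_depth(tree) >= 2:
        tags.add("Nested Loops")
    return tags or {"Basic"}


def accuracy_by_tag(records):
    """records: iterable of (source, is_correct) pairs."""
    hits, totals = defaultdict(int), defaultdict(int)
    for source, is_correct in records:
        for tag in tag_constructs(source):
            totals[tag] += 1
            hits[tag] += int(is_correct)
    return {tag: hits[tag] / totals[tag] for tag in totals}


FLAT = "def f(x):\n    return x + 1\n"
BRANCH = "def f(x):\n    if x > 0:\n        return x\n    return -x\n"
NESTED = ("def f(m):\n    t = 0\n    for row in m:\n"
          "        for v in row:\n            t += v\n    return t\n")

# Invented outcomes, used only to show the aggregation.
records = [(FLAT, True), (FLAT, True), (BRANCH, True), (BRANCH, False), (NESTED, False)]
print(accuracy_by_tag(records))
```

With per-tag counts like these in hand, a hypothesis such as the nested-`if` question above reduces to comparing two proportions.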
Multilingual and Cross-Task Extensions
CRUXEval-X (Xu et al., 2024) builds on CRUXEval to provide a fully automatic, test-guided, and multilingual variant encompassing 19 programming languages. The pipeline consists of signature and test translation, type-mapping (including dynamic-to-static conversions), iterative LLM-based code generation and repair, and cross-validation. Each language’s subject pool is filtered for canonical test success, culminating in a 19,000-task, 19-language, three-task-per-problem (generation, input reasoning, output reasoning) suite.
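The dynamic-to-static type-mapping step can be illustrated with a small inference routine. This is a hypothetical sketch, not the CRUXEval-X pipeline: the function names, the Java-style type names, and the rule that untypeable problems are filtered out are all assumptions chosen to show the idea of deriving a static signature from Python test values.

```python
def _single_type(values) -> str:
    """Return the one static type shared by all values, or raise TypeError."""
    names = {static_type(v) for v in values}
    if len(names) != 1:
        raise TypeError("empty or mixed-type container has no single static type")
    return names.pop()


def static_type(value) -> str:
    """Infer a Java-style static type name for a Python test value."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "Boolean"
    if isinstance(value, int):
        return "Long"
    if isinstance(value, float):
        return "Double"
    if isinstance(value, str):
        return "String"
    if isinstance(value, list):
        return f"List<{_single_type(value)}>"
    if isinstance(value, dict):
        return f"Map<{_single_type(value.keys())}, {_single_type(value.values())}>"
    raise TypeError(f"unsupported value type: {type(value).__name__}")


def signature(name: str, args: tuple, result) -> str:
    """Build a static signature from one test input-output pair."""
    params = ", ".join(f"{static_type(a)} arg{i}" for i, a in enumerate(args))
    return f"{static_type(result)} {name}({params})"


def is_translatable(args: tuple, result) -> bool:
    """Keep only problems whose test values all map to a static type."""
    try:
        signature("f", args, result)
        return True
    except TypeError:
        return False


print(signature("f", (["a", "b"], 2), "ba"))   # String f(List<String> arg0, Long arg1)
print(is_translatable(([1, "a"],), 0))         # False: mixed-type list
```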
6. Role in LLM Training, Diagnosis, and Future Directions
- Benchmark for Model Diagnosis: CRUXEval isolates execution reasoning as a core model capability orthogonal to code generation. It exposes qualitative differences masked by traditional benchmarks, revealing where process supervision and data augmentation improve “understanding” vs. memorization.
- Training with Trace and CoT: Verified CoT rationales grounded in actual execution traces—using pipelines that instrument code and narrate line-by-line behavior—substantially reduce model hallucination and improve both accuracy and rationale consistency (Thakur et al., 28 Nov 2025). Bi-directional (forward+inverse) training amplifies these gains.
- Neural Debugging and Simulation: The benchmark underlies recent work on neural debuggers: LLMs trained to simulate step-into/step-over/step-out debugging sessions, supporting conditional, partial, and jump execution, as well as input inference (inverse execution) (Beck et al., 10 Mar 2026).
- Scaling and Multilingualism: CRUXEval-X demonstrates persistent gaps in cross-language generalization. Even models trained only on Python achieve nontrivial reasoning scores on other languages, but their performance remains limited (Xu et al., 2024).
- Unresolved Challenges: Accurate input inversion, robustness to increased control-flow complexity, and avoidance of subtle off-by-one or data structure corner-case errors remain open problems. No currently available LLM approaches 100% on CRUXEval, and even the best CoT-augmented GPT-4 variants plateau well below this ceiling (Gu et al., 2024, Liu et al., 30 Jan 2025).
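The trace-grounded rationales mentioned in the list above rely on instrumenting code and recording what each executed line sees. The sketch below shows one way to capture such a trace with Python's standard `sys.settrace` hook. It is an illustrative assumption, not the pipeline used in the cited work: `trace_execution`, `narrate`, and the sample function are hypothetical names, and each snapshot is a shallow copy of the locals taken just before the line runs.

```python
import sys


def trace_execution(func, *args):
    """Run func(*args), recording (line offset, locals snapshot) per executed line."""
    steps = []
    code = func.__code__

    def tracer(frame, event, arg):
        if frame.f_code is not code:
            return None  # do not trace any other function
        if event == "line":
            offset = frame.f_lineno - code.co_firstlineno
            steps.append((offset, dict(frame.f_locals)))
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(previous)  # always restore the earlier trace hook
    return result, steps


def narrate(steps):
    """Turn recorded steps into one plain-text line each."""
    return [f"line +{offset}: locals before execution = {snapshot}"
            for offset, snapshot in steps]


def running_total(nums):  # hypothetical function under test
    total = 0
    for n in nums:
        total += n
    return total


result, steps = trace_execution(running_total, [1, 2])
print(result)  # 3
for sentence in narrate(steps):
    print(sentence)
```

A rationale generator could then be checked against these recorded states, so that any claimed intermediate value that never occurred in the trace is rejected.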
7. Broader Significance and Ongoing Evolution
CRUXEval functions as a diagnostic instrument for “mental execution” in code LLMs, complementing established I/O or code generation benchmarks. It has catalyzed the development of process supervision techniques (execution traces, dynamic scratchpads, CoT grounded in real execution), advanced program analysis tooling (ExeRScope), and informed the multilingual benchmarking landscape (CRUXEval-X). Ongoing work proposes further directions, such as richer, multi-reference evaluation for inverse tasks, tracing with bytecode-level granularity, curriculum learning over structural program properties, and expansion to broader code domains and languages (Gu et al., 2024, Thakur et al., 28 Nov 2025, Armengol-Estapé et al., 10 Feb 2025, Beck et al., 10 Mar 2026).
CRUXEval and its extensions thus represent the state of the art for direct, discriminative assessment of code execution reasoning—a prerequisite for robust, interpretable, and agentic code LLMs.