SX-Bench: Code Intelligence Evaluation Benchmark

Updated 9 August 2025
  • SX-Bench is a code intelligence benchmark designed to evaluate LLMs’ multi-function comprehension and dynamic execution reasoning.
  • It composes atomic functions into composite programs using sequential, conditional, and loop-based paradigms to simulate realistic code behavior.
  • The benchmark employs automated generation, symbolic execution, and LLM-aided validation to rigorously assess complex control and data flow challenges.

STEPWISE-CODEX-Bench (SX-Bench) is a code intelligence benchmark purpose-built to evaluate large language models' (LLMs) capabilities in complex multi-function comprehension and fine-grained execution reasoning. Unlike prior benchmarks focused primarily on single-function correctness via input/output (I/O) matching, SX-Bench shifts the evaluation paradigm toward detailed understanding of composite program behavior, control/data flow intricacies, and the simulation of dynamic execution—essential for assessing next-generation code reasoning systems.

1. Motivation and Conceptual Framework

SX-Bench addresses the limitations of mainstream code generation benchmarks such as HumanEval and MBPP, where advanced LLMs now attain near-saturated scores (>95%), offering diminishing power for discriminating model reasoning depth. Benchmarks like CRUXEVAL, while oriented toward reasoning, are confined to single low-complexity functions. SX-Bench was designed to fill this gap by testing models on tasks with multiple sub-functions, comprehensive control and data flow dependencies, and explicit requirements to simulate execution steps. The explicit definition of "computation steps" as minimal execution units and the requirement that models predict total step counts, rather than mere program output, raise the standard for demonstrating genuine comprehension of code dynamics (Yan et al., 7 Aug 2025).

2. Task Design and Composition Paradigms

SX-Bench tasks are generated by composing atomic functions—selected from operations on integers, logic, and strings—into higher-order composite programs that reflect realistic computational scenarios. Three principal paradigms govern composition:

  • Sequential composition: Functions are chained in application ($f \circ g \circ h$), requiring models to simulate intermediate state transitions.
  • Selective (conditional) execution: Control flow branches (e.g., $f(x)$ if $P(x)$ else $g(x)$), demanding path-sensitive reasoning.
  • Loop-based composition: Iterative application over collections (e.g., for $x$ in $X$ execute $f(x)$), increasing both control-flow and data-flow complexity.

Each composite program integrates a global counter (commonly named run_steps) that is incremented atomically for every core computation event (arithmetic operation, conditional, or loop iteration). This mechanism translates dynamic execution into a quantitative measure: for a sequence of atomic operations $\{\text{Op}_i\}$, the total number of computation steps is

$$\text{Total run\_steps} = \sum_{i} \text{Op}_i$$

forcing models to simulate execution rather than rely on static I/O heuristics.
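
To make this concrete, the sketch below shows how a composite task with a run_steps counter might look, combining all three composition paradigms. It is illustrative only: the atomic functions, variable names, and the exact step-accounting conventions are assumptions, not the benchmark's actual task format.

```python
# Illustrative sketch of an SX-Bench-style composite program.
# The atomic functions and step-accounting conventions are assumptions.

run_steps = 0  # global counter incremented on every core computation event

def double(x):
    global run_steps
    run_steps += 1              # one arithmetic operation
    return 2 * x

def is_even(x):
    global run_steps
    run_steps += 1              # one conditional check
    return x % 2 == 0

def composite(xs):
    """Loop over the input, branch on parity, then chain atomic calls."""
    global run_steps
    total = 0
    for x in xs:                # loop-based composition
        run_steps += 1          # count the loop iteration itself
        if is_even(x):          # selective (conditional) execution
            total += double(double(x))   # sequential composition: double ∘ double
        else:
            total += double(x)
    return total

result = composite([1, 2, 3])
print(result, run_steps)        # prints "16 10" for this input: the model must
                                # predict run_steps, not just the final output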

3. Evaluation Methodology and Criteria

SX-Bench stratifies its evaluation across three canonical subsets:

  • Predict: Assessing whether models can determine the validity of input–output pairs for a composite function.
  • Easy-Reasoning: Models predict computation step counts for functions with simpler constructs (single branches/loops).
  • Hard-Reasoning: Encompasses highly nested loops, complex conditionals, and deeply layered function collaborations; here, step counts can grow substantially (up to fourfold vs. Easy).

Crucially, evaluation focuses not just on final correctness but on the model’s trace of execution: can it accurately reconstruct the cumulative computation path, as recorded in run_steps? Hard-Reasoning tasks, with complex control/data dependencies, are especially revealing of a model’s limits in long-range logical consistency and dynamic state tracking.
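
A minimal sketch of how such checks might be scored is given below; the record fields and scoring logic are assumptions for illustration, not SX-Bench's actual harness.

```python
# Illustrative scoring sketch for the three SX-Bench subsets.
# Field names ("subset", "gold", "prediction") are assumed, not from the benchmark.

def score_sample(sample: dict) -> bool:
    """Return True if the model's answer matches the ground truth."""
    if sample["subset"] == "predict":
        # Predict: binary judgment on whether an input/output pair is valid.
        return sample["prediction"] == sample["gold"]
    # Easy-/Hard-Reasoning: exact match on the total run_steps count.
    return int(sample["prediction"]) == int(sample["gold"])

samples = [
    {"subset": "predict", "gold": True, "prediction": True},
    {"subset": "hard_reasoning", "gold": 128, "prediction": "127"},
]
accuracy = sum(score_sample(s) for s in samples) / len(samples)
print(f"accuracy = {accuracy:.2%}")   # 50.00% for this toy example
```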

4. Automated Benchmark Generation and Quality Control

SX-Bench relies on an integrated automated pipeline to ensure scale, diversity, and high verification standards:

  • Program synthesis: A library of atomic operations is constructed and systematically composed into program templates based on the three paradigms.
  • Symbolic execution: Composite functions are analyzed symbolically to establish all valid execution paths and embed correct step-count annotations.
  • LLM-aided validation: LLMs generate diverse test cases, and further validation removes ill-posed samples (e.g., those with infinite loops or otherwise non-terminating execution, filtered with thresholds such as a 3-second wall-clock limit).
  • Filtering: Sandbox execution verifies correctness across nontrivial random inputs; output overflow checks (e.g., 64-bit limits) and the injection of corrupted I/O pairs (≈50%) keep the validity-classification task balanced and robust.

Only those samples that meet stringent correctness and stability criteria are retained, guaranteeing that SX-Bench tasks represent meaningful, executable, and non-trivial reasoning challenges.
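
The filtering stage could be approximated as in the sketch below. The 3-second budget and 64-bit bound mirror the criteria described above, but the function names and process-based sandboxing approach are assumptions rather than the benchmark's implementation.

```python
# Illustrative filter: reject candidate samples that do not terminate within a
# 3-second wall-clock budget or whose outputs exceed 64-bit integer bounds.
# The multiprocessing-based "sandbox" is an assumption for illustration.
import multiprocessing as mp

INT64_MAX = 2**63 - 1

def _run(func, arg, out_queue):
    try:
        out_queue.put(("ok", func(arg)))
    except Exception as exc:
        out_queue.put(("error", repr(exc)))

def passes_filter(func, arg, timeout=3.0):
    """Return True if func(arg) terminates in time and stays within 64-bit range."""
    out_queue = mp.Queue()
    proc = mp.Process(target=_run, args=(func, arg, out_queue))
    proc.start()
    proc.join(timeout)
    if proc.is_alive():               # non-terminating or too slow: discard sample
        proc.terminate()
        proc.join()
        return False
    try:
        status, value = out_queue.get(timeout=1.0)
    except Exception:                 # nothing produced (e.g., the process crashed)
        return False
    if status != "ok":
        return False
    return isinstance(value, int) and abs(value) <= INT64_MAX
```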

5. Performance Analysis and Model Discrimination

SX-Bench achieves high discriminatory power across contemporary LLMs. In a comparative evaluation of over 20 models—including both non-reasoning baselines and 14 models explicitly fine-tuned for reasoning—the SOTA model (openai-o3) achieved 86.92% global accuracy. On the Hard-Reasoning subset, accuracy fell to 78.37%, exposing substantial room for improvement in detailed program comprehension and long-step logical tracking. Models manifest notable performance degradation as step count and control-flow complexity increase. Empirical analysis reveals (Loop > Selective > Sequential) as a hierarchy of challenge: loop constructs, especially those with nested or data-dependent structure, are particularly error-prone for current systems.

6. Methodological Influences and Comparative Benchmarks

SX-Bench incorporates and extends methodological innovations from recent benchmark efforts. For instance, frameworks such as CodeBenchGen (Xie et al., 31 Mar 2024), which automate dataset creation through LLM-driven sandboxing, test synthesis, and iterative debugging, inform scalable curation of complex code tasks and diversity coverage. Execution-based evaluation—actual code running against tests—is central in both SX-Bench and CodeBenchGen, emphasizing validation of true functional correctness. Insights from VCR-Bench (Qi et al., 10 Apr 2025), which decomposes reasoning into stepwise Chain-of-Thought (CoT) elements and quantifies reasoning quality through fine-grained metrics (precision, recall, and F1/CoT score), suggest potential extensions for SX-Bench: namely, explicit annotation or tagging of each reasoning step (input interpretation, algorithmic step, error checking) to further diagnose model weaknesses. The current focus of SX-Bench is on step-count prediction and execution simulation, but extensions may include integrating stepwise rationale matching as in VCR-Bench.
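
As a sketch of how such stepwise rationale matching could be scored, the snippet below computes precision, recall, and F1 over matched reasoning steps in the spirit of VCR-Bench-style CoT metrics; the step granularity and exact matching criterion are assumptions, not the implementation of either benchmark.

```python
# Illustrative precision/recall/F1 over matched reasoning steps.
# The matching criterion (exact set overlap of step labels) is an assumption.

def cot_f1(predicted_steps, reference_steps):
    """Score a predicted reasoning trace against annotated reference steps."""
    matched = len(set(predicted_steps) & set(reference_steps))
    precision = matched / len(predicted_steps) if predicted_steps else 0.0
    recall = matched / len(reference_steps) if reference_steps else 0.0
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Toy example with string-identified reasoning steps.
pred = ["parse_input", "simulate_loop", "count_steps"]
ref = ["parse_input", "simulate_loop", "check_branch", "count_steps"]
print(cot_f1(pred, ref))   # (1.0, 0.75, ≈0.857)
```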

| Benchmark | Key Focus | Distinctive SX-Bench Features |
|---|---|---|
| HumanEval, MBPP | Single-function I/O correctness | Multi-function dynamic reasoning, stepwise tracing |
| CRUXEVAL | Single-function logical reasoning (limited complexity) | Large, composite, multi-paradigm code evaluation |
| VCR-Bench | Video stepwise CoT, recall/precision breakdowns | Potential for stepwise annotation of code reasoning |

7. Implications and Prospects for Code Intelligence Research

SX-Bench establishes a new evaluation axis for LLMs in code, shifting the field beyond static output matching to in-depth simulation and understanding of dynamic control/data flows. This exposes weaknesses not previously measurable and provides a rigorous foundation for model development focused on execution fidelity, structured code representations (e.g., ASTs, control/data-flow graphs), and reasoning strategies tailored to long dependency chains. Future research may adopt approaches such as "structure recognition – block reasoning" to reduce error accumulation in nested constructs. A plausible implication is that high discriminative power and detailed error taxonomy in SX-Bench will accelerate advances in code intelligence architectures and training protocols addressing executional and contextual code reasoning.

SX-Bench thereby acts as a foundational benchmark for evaluating, comparing, and driving improvement in next-generation code reasoning systems, enabling progress beyond what was possible with previous single-function or I/O-centric evaluations (Yan et al., 7 Aug 2025).