Code Logic Bench: Evaluating Program Reasoning
- code-logic-bench is a suite of benchmarks that systematically assesses deep program logic through controlled semantic and structural challenges.
- It employs atomic “logic bombs” to test specific symbolic-execution reasoning capabilities, revealing performance cliffs in current analysis tools.
- The benchmark also leverages neurosymbolic methods with formal region decomposition to accurately model and verify diverse program behaviors.
code-logic-bench denotes a family of benchmarks, distinct but convergent in their aims, that systematically evaluate program logic capabilities across multiple domains: symbolic execution, neurosymbolic reasoning, LLM code understanding, and digital hardware optimization. These benchmarks are unified by their design to probe not only surface syntactic correctness but also deep semantic, logical, and structural properties of code, via task-specific frameworks that operationalize precise, multifactorial metrics.
1. Conceptual Foundation
The term code-logic-bench originally refers to two high-impact benchmarks, both aimed at measuring fine-grained logical and semantic software understanding. The first, introduced in the context of symbolic execution benchmarking, employs the concept of "logic bombs": minimal code fragments, each engineered to require an atomic logic reasoning capability from the analysis tool. The second, introduced as part of a neurosymbolic reasoning system (Imandra CodeLogician), targets the middle ground between automated theorem proving and traditional software-engineering tests: rather than merely generating correct code or passing unit tests, LLM-based agents must perform rigorous, region-based semantic reasoning over program state spaces, control flow, and edge conditions, explicitly enumerating and explaining logical program behaviors (Xu et al., 2017; Lin et al., 17 Jan 2026).
In both variants, code-logic-bench eschews evaluation by code-surface criteria in favor of task designs that demand explicit logic extraction, branch/path condition analysis, and comprehensive reasoning—objectives not met by standard code synthesis or bug-fixing benchmarks.
2. Symbolic Execution Logic Bombs: Scope and Methodology
The classic code-logic-bench for symbolic execution is architected around a suite of atomic “logic bombs”, each encapsulating a specific challenge for symbolic analyzers. Each bomb is a code fragment that executes a marked action only if a constraint—modeling one of twelve canonical logic pathologies—is satisfied. The challenges are divided into:
- Symbolic-Reasoning Challenges: e.g., symbolic variable declaration, covert propagation, buffer and arithmetic overflows, symbolic memory/jumps, floating-point reasoning, contextual symbolic values, parallel execution.
- Path-Explosion Challenges: e.g., external function calls, arbitrary loop unrolling, cryptographic functions.
Each bomb is stand-alone, tagged for category, and constructed such that, if an analysis engine (e.g., KLEE, Angr, Triton) can successfully synthesize an input to trigger the bomb, it is certified to handle the underlying reasoning challenge.
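The bombs themselves are not reproduced here, but a minimal sketch in the same style illustrates the idea; the function name, constants, and guard below are ours, not taken from the actual dataset. The guarded action detonates only if the analyzer can reason through the int-to-float conversion and a narrow floating-point window, an instance of the floating-point reasoning challenge:

```c
/* Hypothetical logic bomb in the style of the suite (illustrative,
 * not from the dataset).  The guarded action fires only when the
 * engine models the int-to-float conversion and the float
 * comparisons precisely. */
int fp_logic_bomb(int symvar) {
    float f = symvar / 100.0f;      /* symbolic int becomes a float */
    if (f >= 0.1f && f < 0.11f) {   /* requires floating-point theory */
        return 1;                   /* bomb triggered */
    }
    return 0;                       /* benign path */
}
```

An engine passes such a case only if it can synthesize a concrete trigger (here, symvar = 10); engines that approximate floats as reals or concretize the conversion typically miss the window.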
The benchmarking workflow is fully automated:
- Preprocessing to produce self-contained tests,
- Compilation to appropriate IR or binary,
- Batch symbolic execution with per-test timeouts,
- Automated verification by replaying found inputs to confirm bomb detonation.
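The final replay step can be sketched as follows; the helper names and the stand-in bomb are ours, for illustration only. A candidate input reported by a symbolic engine is re-executed against the bomb, and only an actual detonation counts, which filters false positives before metrics are computed:

```c
/* Sketch of the automated verification step (names are ours).
 * Replaying the engine's proposed input against the bomb confirms
 * detonation and filters false positives. */
typedef int (*bomb_fn)(int);

/* Stand-in bomb: detonates (returns 1) only for symvar == 13. */
int example_bomb(int symvar) {
    return symvar * 7 == 91;
}

/* Replay the candidate and report whether the bomb actually fired. */
int verify_candidate(bomb_fn bomb, int candidate) {
    return bomb(candidate) ? 1 : 0;
}
```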
Metrics such as category coverage, aggregate success rate, and average trigger time yield a multidimensional performance profile (Xu et al., 2017).
Empirical findings reveal severe capability cliffs for even mature engines. For instance, Angr solves 21/62 bombs (0.339), KLEE 9/62 (0.145), and Triton 3/62 (0.048) at generous timeouts. Pathological cases such as symbolic memory indirection, floating-point analysis, and covert propagation consistently evade most analyzers, highlighting key research gaps.
3. Formal Reasoning Benchmark: Region Decomposition and Semantic Analysis
A distinct but thematically aligned approach to code-logic-bench is exemplified by its deployment in the Imandra CodeLogician system (Lin et al., 17 Jan 2026). This benchmark targets the explicit induction of formal models from source code to answer logic-intensive semantic queries—spanning state-space enumeration, path condition identification, region coverage, and edge-case detection.
Each benchmark task derives from real-world state-machine programs, with ground truth established via symbolic region decomposition:
- Programs are modeled as labeled transition systems.
- Control-flow graphs are constructed, and root-to-leaf paths annotated with conjunctions of branch predicates.
- Region decomposition computes a partition of the input space into regions, each described by a path condition and an associated output.
Evaluation tasks require the agent to:
- Enumerate the true number of semantic regions (“distinct behaviors”).
- Identify necessary and sufficient input constraints, expressed as Boolean path conditions, for each behavior.
- Prove/disprove properties and explicitly characterize edge/counterexample conditions.
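A toy example makes the decomposition concrete; the program and its regions below are ours, not drawn from the benchmark. The function has three root-to-leaf paths in its control-flow graph, so its input space partitions into exactly three semantic regions, each a path condition paired with one output:

```c
/* Toy illustration of region decomposition (example is ours).
 * clamp() has three root-to-leaf paths, hence three regions:
 *   R1: x < 0             -> output -1
 *   R2: 0 <= x && x < 10  -> output  x
 *   R3: x >= 10           -> output 10 (saturation)
 * An agent answering the enumeration task must report 3 regions
 * and these exact boundary constraints. */
int clamp(int x) {
    if (x < 0)  return -1;
    if (x < 10) return x;
    return 10;
}

/* Region index for an input, mirroring the path conditions above. */
int region_of(int x) {
    if (x < 0) return 1;
    return (x < 10) ? 2 : 3;
}
```

Undercounting typically arises at boundaries such as x = 10, which belongs to R3 and not R2; this is precisely the kind of edge condition the benchmark's edge-case tasks probe.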
The metric family includes state-space estimation accuracy, coverage completeness, outcome precision, direction accuracy (for yes/no queries), control-flow understanding, edge-case detection, and decision-boundary clarity.
Empirically, even the best pure-LLM configurations fall short on mean aggregate scores, whereas formal augmentation (LLM+CodeLogician) achieves perfect accuracy across metrics, with the largest gaps in state-space estimation and coverage (Lin et al., 17 Jan 2026).
4. Dataset Organization, Metrics, and Operationalization
Symbolic Execution Logic Bombs
The dataset consists of 62 atomic logic bombs across twelve categories, each stored as a minimal C (or C++) file annotated for challenge type and difficulty. Directory structure partitions by challenge; naming schemes encode challenge, subcase, and difficulty level (e.g., symbolic_memory_l2.c). Each input function contains a single if-guard with a path condition engineered to require the associated logic reasoning capability, e.g., buffer overflow, indirect jump, or specific floating-point comparison.
Automated benchmarking scripts instrument the standard symbolic-execution tools (KLEE, Angr, Triton) with tool-specific initialization logic, execute the batch, and postprocess recorded solutions to verify correctness and filter false positives.
Metrics:
- Global success rate: fraction of bombs solved (within timeout).
- Category coverage: fraction per-challenge.
- Average trigger time (on successes).
- Propagation consistency: "should/may" DAG tracks implied challenge dependencies to identify internal method inconsistencies (Xu et al., 2017).
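The aggregate metrics above are simple to operationalize; the struct and data in this sketch are ours. With 21 of 62 bombs solved, for example, the global success rate is 21/62 ≈ 0.339, the figure quoted for Angr:

```c
/* Illustrative computation of the aggregate metrics (struct and
 * sample data are ours, not from the benchmark scripts). */
typedef struct {
    int solved;          /* 1 if the engine triggered this bomb */
    double trigger_secs; /* time to first trigger, if solved */
} bomb_result;

/* Fraction of bombs solved within the timeout. */
double global_success_rate(const bomb_result *r, int n) {
    int solved = 0;
    for (int i = 0; i < n; i++) solved += r[i].solved;
    return n ? (double)solved / n : 0.0;
}

/* Average trigger time, computed over successes only. */
double avg_trigger_time(const bomb_result *r, int n) {
    int k = 0;
    double t = 0.0;
    for (int i = 0; i < n; i++)
        if (r[i].solved) { k++; t += r[i].trigger_secs; }
    return k ? t / k : 0.0;
}
```

Per-category coverage is the same fraction restricted to one challenge category, giving the per-challenge profile reported alongside the global rate.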
Formal Reasoning Benchmark
The benchmark comprises 50 real-world state-machine models, each yielding three tasks (region enumeration, condition extraction, property verification), for 150 evaluation items in total. All ground-truth logical decompositions are obtained via ImandraX’s formal symbolic execution engine and are accompanied by explicit coverage witnesses.
Each LLM (or neurosymbolic agent) response is graded on seven axes, with granular scores reflecting both the total behaviors captured and fine structure (e.g., identification of precise boundaries or subtle edge cases).
5. Comparative Analysis and Key Findings
The code-logic-bench methodology exposes limitations in both symbolic and data-driven program analysis tools. Key empirical findings include:
- Symbolic engines are highly sensitive to minute program structural changes; binary engines (e.g., Angr) gain flexibility on raw memory layout but lose source-level semantic cues, whereas source-based tools (KLEE) exploit rich solver theories but miss dynamic behaviors (e.g., covert propagation, external APIs).
- Timeout scaling yields only modest benefits, indicating that the primary barrier is logical—not resource-based—intractability; successful analysis of deeper logic bombs demands fundamentally more expressive symbolic reasoning or compositional techniques (Xu et al., 2017).
- Neurosymbolic systems (LLM+formal engine) markedly outperform LLMs alone in state-space and edge-case reasoning by operationalizing explicit symbolic region enumeration and automated theorem-proving artifacts (Lin et al., 17 Jan 2026).
- LLMs excel at shallow control-flow reasoning but systematically undercount combinatorial regions, miss subtle equivalence-boundary conditions, and leave edge-case analyses incomplete.
6. Significance, Limitations, and Extension Proposals
code-logic-bench sets a rigorous reference point for evaluating progress in code reasoning and analysis. Its atomic challenge design, with explicit logical focus, avoids confounding effects of larger software systems and clarifies true capabilities and fundamental limitations of tools. A critical limitation is that isolated logic bombs do not capture multi-challenge interactions in real programs; future proposals advocate compounded logic bombs, support for additional platforms (ARM, Windows), expansion to higher-level languages, and integration with dynamic propagation frameworks (Xu et al., 2017). In the neurosymbolic context, benchmarks can further be extended to require automated learning of formal system models or to combine region decomposition with arbitrary first-order temporal queries (Lin et al., 17 Jan 2026).
By capturing explicit metrics and facilitating reproducible, interpretable analyses, code-logic-bench remains a vital standard for tracking and advancing state-of-the-art in code logic reasoning research.