TF-Bench: Program Semantics Reasoning
- TF-Bench is a formal benchmark for program semantics reasoning using type inference in System F to isolate deductive logic from natural language cues.
- It employs a three-stage construction pipeline, consisting of task extraction from Haskell’s Prelude, dependency closure, and alpha-rewriting, to create a purely deductive setting.
- Empirical results show significant accuracy drops when natural language cues are removed, underscoring the need for enhanced reasoning in LLMs.
TF-Bench is a benchmark framework for evaluating LLMs on program semantics reasoning tasks, specifically via type inference in System F. Motivated by the limitations of existing code benchmarks—which often focus on token-level generation or leverage superficial natural language cues—TF-Bench provides a formal, deductive setting that assesses a model’s capacity to reason about the intrinsic logic and structure of code, independent of cognitive semantics or linguistic artifacts.
1. Formal Motivation and Benchmark Definition
TF-Bench is grounded in the principle that accurate code reasoning requires understanding program semantics rather than mere recognition or association of natural language tokens. The evaluation task centers on type inference in System F, a polymorphic lambda calculus that underpins formal type theories and functional programming languages such as Haskell. By constructing tasks around type deduction, TF-Bench isolates the logical properties of code—such as compositionality, abstraction, and polymorphism—from the semantic noise of naming conventions, comments, and other natural language features. This program-centric deductive framework distinguishes TF-Bench from prior benchmarks that often conflate code understanding with language-modeling performance.
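To make the task concrete, here is a minimal Haskell sketch (our illustration, not an item from the benchmark) of the kind of deduction a TF-Bench query demands, with the universal quantification that System F makes explicit:

```haskell
{-# LANGUAGE ExplicitForAll #-}
module ComposeSketch where

-- A TF-Bench-style query supplies only the definition and asks the model
-- for the principal type of the binding.
compose f g x = f (g x)

-- Expected answer, written with System F's universal quantifier:
compose :: forall b c a. (b -> c) -> (a -> b) -> a -> c
```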
2. Benchmark Construction Methodology
The creation of TF-Bench follows a three-stage pipeline (a worked Haskell sketch follows the list):
- Task Extraction: Functions are systematically gathered from Haskell’s Prelude, which offers a rigorously typed corpus based on the Hindley–Milner type system (a decidable restriction of System F that is used in practice). Tasks are selected across three categories:
  - Monomorphic functions: Single concrete type signature.
  - Parametric polymorphism: Type signatures with universally quantified type variables.
  - Ad hoc polymorphism: Functions employing type classes or overloading.
- Dependency Closure: Each benchmark item is rendered self-contained by resolving all invoked helper functions’ type signatures. Dependency retrieval leverages automated tools (e.g., Hoogle), ensuring every logical premise required for type inference is explicit and present.
- Alpha-Rewrite to TF-Bench_pure: To completely remove semantically irrelevant natural language from code, three verified rewrite operators are systematically applied. These operators replace natural language content in type names, type variables, and bindings, producing TF-Bench_pure—a variant retaining only semantics-critical structure. The rewriting process is formally justified: operators are commutative and associative under Kleisli composition, guaranteeing semantic preservation and a purely deductive reasoning setting.
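As a hedged sketch of what the pipeline produces (the function names and the exact renaming scheme below are hypothetical, not taken from the released benchmark), a self-contained item and its alpha-rewritten counterpart might look as follows:

```haskell
module TFBenchItemSketch where

-- Original task (TF-Bench): the dependency closure supplies the signature
-- of every helper the target uses, so the item is self-contained.
map' :: (a -> b) -> [a] -> [b]
map' = map

-- Question posed to the model: what is the principal type of pairUp?
pairUp f xs = map' (\x -> (x, f x)) xs
-- Expected answer (parametric polymorphism):
--   pairUp :: (a -> b) -> [a] -> [(a, b)]

-- Alpha-rewritten counterpart (TF-Bench_pure): the verified rewrite
-- operators strip natural-language content from bindings and type
-- variables, leaving only the structure needed for deduction.
-- (The concrete renaming shown here is illustrative.)
v1 :: (t1 -> t2) -> [t1] -> [t2]
v1 = map

v2 g ys = v1 (\y -> (y, g y)) ys
-- Expected answer: v2 :: (t1 -> t2) -> [t1] -> [(t1, t2)]
```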
3. Evaluation Metrics: Robustness and Reasoning Scores
TF-Bench introduces two novel quantitative metrics tailored to program semantics reasoning:
- Semantic Robustness (RS): This metric captures a model’s dependence on natural language cues. For a given model $M$, the robustness score is defined as

  $$R_S(M) = \frac{\mathrm{Acc}_{\mathrm{pure}}(M)}{\mathrm{Acc}_{\mathrm{orig}}(M)},$$

  where $\mathrm{Acc}_{\mathrm{orig}}(M)$ is the accuracy on the original TF-Bench and $\mathrm{Acc}_{\mathrm{pure}}(M)$ is the accuracy on the NL-free TF-Bench_pure. A high $R_S$ indicates lower reliance on NL artifacts and better intrinsic semantics reasoning.
- Test-Time Compute Reasoning Effectiveness (RE): To measure the impact of additional test-time compute (TTC), this metric compares the accuracy improvements that TTC yields on the original and NL-free benchmarks:

  $$R_E(M) = \frac{\Delta\mathrm{Acc}_{\mathrm{pure}}(M)}{\Delta\mathrm{Acc}_{\mathrm{orig}}(M)},$$

  where $\Delta\mathrm{Acc}$ denotes the accuracy gain obtained by enabling TTC. This ratio quantifies the proportion of the improvement attributable to genuine deductive computation versus spurious NL exploitation.
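As a minimal sketch, assuming the ratio forms given above, both scores can be computed directly from measured accuracies (the helper names and the numbers in main are illustrative, not values reported by the paper):

```haskell
-- RS: fraction of original-benchmark accuracy that survives NL removal.
robustness :: Double -> Double -> Double
robustness accOrig accPure = accPure / accOrig

-- RE: share of the TTC-induced accuracy gain that persists on TF-Bench_pure.
reasoningEffect :: (Double, Double) -> (Double, Double) -> Double
reasoningEffect (origBase, origTTC) (pureBase, pureTTC) =
  (pureTTC - pureBase) / (origTTC - origBase)

main :: IO ()
main = do
  -- Hypothetical accuracies, not reported results:
  print (robustness 0.90 0.56)                       -- ~0.62
  print (reasoningEffect (0.80, 0.90) (0.50, 0.56))  -- 0.6
```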
4. Empirical Analysis and Key Results
State-of-the-art LLMs, such as OpenAI’s o3 and Anthropic’s Claude-3.7-sonnet, achieve strong performance on TF-Bench’s original tasks (accuracies approaching 90%). Notably, performance drops considerably—Claude-3.7-sonnet attains only 55.85% accuracy—once natural language cues are removed (TF-Bench_pure). This empirically substantiates the hypothesis that current models often leverage linguistic artifacts rather than performing robust deductive reasoning.
The robustness score (RS) quantifies this finding further; models with high RS retain relatively more of their semantic reasoning ability in the absence of NL, while others overfit to language patterns. Fine-tuning strategies dramatically influence results: LLMs trained on structured math data show better resilience than code-specialized models, which tend to memorize language-based patterns rather than abstract semantics. A plausible implication is that training on abstract symbolic data improves deductive inference capabilities more than training on language-heavy data.
5. Theoretical Foundations: Type Inference and Curry–Howard Correspondence
At the heart of TF-Bench is the use of type inference as a proxy for deductive reasoning, grounded in the Curry–Howard Isomorphism. In this paradigm, logical implication corresponds to the function type (->) in Haskell and System F. Benchmark tasks are thus interpreted as constructing formal proofs about programs, with type inference serving as the deduction mechanism. This theoretical linkage ensures that TF-Bench evaluates not superficial code prediction but the rigorous logic that compilers guarantee.
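A small, standard illustration of the correspondence (our example, not a benchmark item): inferring a term's type amounts to deriving the proposition that the type encodes.

```haskell
module CurryHowardSketch where

-- Under Curry-Howard, the function arrow is logical implication, so the
-- inferred type of (flip const),
--   flip const :: b -> a -> a,
-- is a proof of the tautology B -> (A -> A).  Asking a model to infer
-- this type is asking it to carry out the corresponding deduction.
weaken :: b -> a -> a
weaken = flip const
```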
6. Research Implications and Future Directions
The TF-Bench framework elucidates critical limitations in current LLM architectures regarding semantics-driven reasoning. Recommendations for future work include:
- Designing LLMs and fine-tuning protocols that prioritize deduction over pattern recognition, e.g., by augmenting training datasets with formal or mathematical code.
- Extending evaluation to other program analyses demanding soundness, such as formal verification or symbolic execution.
- Refining metrics (RS and RE) to provide finer granularity and diagnostic power for model improvements.
This suggests that advancements in LLM reasoning for code will require attention both to the structure of benchmarks and the nature of model pretraining and fine-tuning corpora.
7. Summary and Significance
TF-Bench establishes a rigorous benchmark for program semantics reasoning via type inference in System F, revealing the intrinsic limitations of contemporary LLMs once superficial language cues are eliminated. By employing verified transformations, self-contained deductive chains, and domain-specific metrics (RS, RE), it offers an authoritative evaluation platform for the logical capabilities of software reasoning models and motivates a renewed focus on fundamental deductive reasoning in future AI systems (He et al., 28 Sep 2025).