TF-Bench_pure: A Benchmark for Deductive Type Inference
- TF-Bench_pure is a specialized benchmark evaluating LLMs' deductive type inference skills, isolating pure program semantics from natural language cues.
- It employs verified rewrite operators such as type canonicalization and normalization to remove extraneous natural language, ensuring purely formal evaluations.
- Empirical results show significant performance drops in LLMs, highlighting the limitations in current models' semantic reasoning and the need for improved deductive methods.
TF-Bench_pure is a specialized benchmark designed to evaluate the ability of LLMs to reason about program semantics through type inference in System F, with all semantically irrelevant natural language cues removed. Unlike traditional code reasoning benchmarks, TF-Bench_pure constrains the evaluation to deductive, formal semantics, aiming to distinguish genuine logic-based reasoning from superficial pattern recognition driven by natural language elements embedded in code. Its construction, methodology, and metrics address a critical gap in the assessment of LLMs’ program understanding, revealing substantive limitations in current state-of-the-art systems and providing an empirical foundation for future research in software intelligence.
1. Conceptual Basis
TF-Bench_pure is derived from TF-Bench, which builds function-level type inference tasks from Haskell's Prelude, the language's standard library, whose types are grounded in System F and Hindley–Milner theory. The benchmark's primary objective is to assess program semantics reasoning by presenting LLMs with tasks that require inferring type signatures from program logic alone rather than from natural language associations. Notably, the benchmark enforces a purely deductive framework: the ground-truth ("oracle") type signatures are computed and verified in a formal system, guaranteeing the integrity of the evaluation.
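For concreteness, a task in this style can be pictured as a Prelude-like binding stripped of its signature, with the model asked to produce the signature; the prompt layout below is illustrative rather than the benchmark's actual format:

```haskell
-- Illustrative task (format assumed): infer the most general type of map'.
map' _ []     = []
map' f (x:xs) = f x : map' f xs

-- Oracle answer, verifiable with the compiler:
-- map' :: (a -> b) -> [a] -> [b]
```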
This design addresses a core deficiency in earlier benchmarks, whereby model performance can be inflated by exploiting the natural language properties of code (e.g., function names, variable names, comments), which are superficial and extraneous to semantic reasoning. TF-Bench_pure thus establishes a standard for evaluating LLMs on their ability to reason about code structure in a linguistically neutral setting (He et al., 28 Sep 2025).
2. Task Construction and Verified Transformations
The benchmark is constructed by systematically eliminating semantically irrelevant natural language from the type inference tasks. The transformation process involves three verified rewrite operators:
- Type Name Canonicalization: Natural language type names (e.g., `Int`, `Bool`, and type-class identifiers such as `Eq`) are replaced by canonical forms (e.g., `T1`, `T2`, ...).
- Variable Normalization: Type variables, conventionally written `a`, `b`, `c`, are normalized to `t1`, `t2`, ..., ensuring uniformity without semantic leakage.
- Binding Name Normalization: Function and operator names are replaced uniformly (`f1`, `f2`, ...), with syntactic context preserved (distinguishing infix from prefix usage).
These rewrite operators are implemented functionally as

```haskell
op :: Task -> Either String Task
```

and formally verified to be commutative and associative under Kleisli composition. This guarantees that the operational and denotational semantics are invariant under transformation, with the sole effect of obscuring any natural language cues that might interfere with the purity of the deductive reasoning task (He et al., 28 Sep 2025). The resulting TF-Bench_pure tasks retain only the logical dependencies and type structure necessary for purely semantic type inference.
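To make this operator shape concrete, the following is a minimal, hypothetical sketch of two such rewrites and their Kleisli composition; the `Task` representation, the token-based renaming, and the hard-coded name lists are illustrative assumptions rather than the benchmark's actual implementation:

```haskell
module RewriteSketch where

import Control.Monad ((>=>))
import Data.List (nub)
import qualified Data.Map.Strict as M
import Data.Maybe (fromMaybe)

-- Hypothetical task representation: a binding name plus the tokens of its source text.
data Task = Task { bindingName :: String, tokens :: [String] }
  deriving Show

-- Rename every token found in the given list to a freshly numbered canonical form.
renameBy :: [String] -> String -> Task -> Either String Task
renameBy names prefix (Task n ts) =
  let found    = nub [t | t <- ts, t `elem` names]
      mapping  = M.fromList (zip found [prefix ++ show i | i <- [1 :: Int ..]])
      rename t = fromMaybe t (M.lookup t mapping)
  in Right (Task n (map rename ts))

-- Type name canonicalization: Int, Bool, Eq, ... become T1, T2, ...
canonicalizeTypes :: Task -> Either String Task
canonicalizeTypes = renameBy ["Int", "Bool", "Char", "Eq", "Ord"] "T"

-- Variable normalization: a, b, c, ... become t1, t2, ...
normalizeVars :: Task -> Either String Task
normalizeVars = renameBy ["a", "b", "c", "d"] "t"

-- Operators share the shape Task -> Either String Task, so they compose
-- in the Kleisli category of Either String.
purify :: Task -> Either String Task
purify = canonicalizeTypes >=> normalizeVars
```

Under such a transformation, the `map'`-style example from Section 1 would be presented to the model as something like `f1 :: (t1 -> t2) -> [t1] -> [t2]` once binding names are also normalized: the type structure is preserved while every natural language handle is removed.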
3. Evaluation Metrics
To assess LLMs’ semantics reasoning in the NL-neutral environment, the paper introduces two rigorous metrics designed for program-centric evaluation:
Semantic Robustness
This metric quantifies the sensitivity of a model's performance to the removal of NL cues. Formally, let

- $\mathrm{Acc}_{\mathrm{TF}}$ = accuracy on TF-Bench (with NL cues)
- $\mathrm{Acc}_{\mathrm{pure}}$ = accuracy on TF-Bench_pure

The Robustness Score (RS) is the share of performance retained once NL cues are removed:

$$\mathrm{RS} = \frac{\mathrm{Acc}_{\mathrm{pure}}}{\mathrm{Acc}_{\mathrm{TF}}} \times 100$$
A higher RS indicates the model relies less on NL cues, performing true semantic reasoning rather than pattern matching on code tokens.
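As a purely illustrative calculation, take Claude-3.7-sonnet's 55.85% TF-Bench_pure accuracy from Section 4 together with an assumed TF-Bench accuracy of roughly 90% (the latter figure is assumed for illustration, not reported here):

$$\mathrm{RS} \approx \frac{55.85}{90} \times 100 \approx 62,$$

which falls within the 60–64 range reported for advanced models.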
Reasoning Effectiveness
This metric isolates the contribution of test-time compute (TTC) reasoning enhancements, discounting improvements due to NL overfitting. If $\Delta_{\mathrm{pure}}$ is the accuracy increase from TTC on TF-Bench_pure and $\Delta_{\mathrm{TF}}$ is the corresponding increase on TF-Bench (with NL cues), then

$$\mathrm{RE} = \frac{\Delta_{\mathrm{pure}}}{\Delta_{\mathrm{TF}}}$$
A higher RE signifies genuine semantic reasoning improvement due to TTC methods; RE < 1 suggests overfitting to NL patterns rather than logic-based enhancements.
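As a hypothetical illustration with invented numbers: if a TTC method raises a model's TF-Bench accuracy by 8 percentage points but raises its TF-Bench_pure accuracy by only 3 points, then

$$\mathrm{RE} = \frac{3}{8} \approx 0.38 < 1,$$

suggesting that most of the apparent gain comes from better exploitation of NL patterns rather than from stronger deductive reasoning.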
| Metric | Formula | Interpretation |
|---|---|---|
| Robustness Score (RS) | $\mathrm{RS} = \mathrm{Acc}_{\mathrm{pure}} / \mathrm{Acc}_{\mathrm{TF}} \times 100$ | Sensitivity to removal of NL cues |
| Reasoning Effectiveness (RE) | $\mathrm{RE} = \Delta_{\mathrm{pure}} / \Delta_{\mathrm{TF}}$ | Deductive impact of TTC |
These metrics operationalize the benchmark’s aim of distinguishing semantic reasoning ability from superficial pattern recognition.
4. Model Performance and Empirical Findings
Empirical results indicate that removing natural language cues causes a significant reduction in LLM performance across all evaluated architectures. For example, the best-performing LLM, Claude-3.7-sonnet, attains only 55.85% accuracy on TF-Bench_pure (He et al., 28 Sep 2025). Other competitive LLMs (including GPT and Gemini variants) exhibit similar declines, confirming that high performance on conventional benchmarks often reflects NL pattern exploitation rather than genuine program semantics reasoning.
Calculated robustness scores for advanced models fall in the 60–64 range, quantifying how sharply performance drops when NL cues are removed and underscoring the benchmark's role as a semantics "stress test." Further analysis reveals that LLMs fine-tuned on mathematical corpora demonstrate a greater propensity for deductive reasoning, whereas those trained primarily on code tend toward inductive pattern-matching, leveraging the NL cues embedded in codebases.
5. Methodological Implications for LLM Research
Findings from TF-Bench_pure underscore important limitations in contemporary LLM design and training methodologies:
- The reduced accuracy signifies substantive gaps in semantic reasoning, even among best-in-class LLMs.
- The dependence on NL cues suggests that programming intelligence in modern LLMs is intertwined with natural language understanding rather than strictly grounded in program logic.
- Fine-tuning strategies (especially using code corpora rich in NL cues) may reinforce inductive rather than deductive skills.
- TTC methods, unless carefully calibrated, may optimize for patterns present in pretraining data rather than enhancing formal reasoning capability.
The benchmark’s metrics provide effective diagnostics to evaluate the impact of training and fine-tuning procedures, affording a mechanism to guide model development toward deductive reasoning enhancements.
6. Future Directions and Broader Significance
The introduction of TF-Bench_pure opens avenues for further research in multiple domains:
- Model Training: There is a pronounced need for specialized training regimens—such as leveraging mathematical or formal semantic corpora—in order to cultivate deductive skills in LLMs.
- Benchmark Extension: Expanding the framework beyond type inference to assess other formal reasoning tasks (e.g., program verification, logical consequence, symbolic manipulation) may deepen insights into software intelligence.
- Theory and Practice Balance: Addressing the dichotomy between inductive pattern matching (favored by NL-rich datasets) and deductive semantics reasoning is a central challenge for both LLM research and application in software engineering.
- Diagnostic Utility: The semantic robustness and reasoning effectiveness metrics may be employed to evaluate and calibrate future architectures and training protocols, signaling progress toward robust program-centric reasoning.
A plausible implication is that improving model architecture or training to privilege formal deductive knowledge over natural language pattern recognition could lead to substantial advancements in LLM program understanding. Continued benchmarking with TF-Bench_pure can thus serve as a central empirical tool in charting progress toward this goal.
7. Relationship to Formal Methods and Related Research
TF-Bench_pure situates itself within a lineage of formal-methods-driven benchmarks. By focusing on System F type inference and leveraging Haskell's well-defined type system, the benchmark adheres to the rigorous logical foundations of the programming semantics and type theory literature. Its methodology is informed by principles of task closure (ensuring all dependencies are available), preservation of operational semantics under transformation, and evaluation against oracle type signatures produced in formal systems.
A loose analogy can be drawn to themes in equivariant homotopy theory and arithmetic geometry (cf. Sulyma, 2022), where the distinction between structural and superficial properties is likewise foundational; comparable challenges arise in computing graded invariants (such as those for perfectoid rings) and in separating module-theoretic consequences from external, language-induced artifacts.
TF-Bench_pure thus both reflects and contributes to ongoing efforts to rigorously assess and enhance semantic reasoning in computational models, standing as an important reference point in the evaluation and development of next-generation software intelligence systems.