TypyBench: LLM Type Inference Benchmark
- TypyBench is a benchmark that evaluates LLMs' ability to accurately infer types in real-world Python repositories using novel metrics.
- It leverages TypeSim for local semantic similarity and TypeCheck for repository-level consistency, quantifying performance across varying type complexities.
- Findings reveal that high local accuracy may not guarantee global type consistency, emphasizing the need for context-aware inference strategies.
TypyBench is a benchmark designed to rigorously evaluate LLMs on type inference for entire, real-world Python repositories. It introduces two novel quantitative metrics—TypeSim and TypeCheck—to assess, respectively, the semantic similarity between LLM-inferred and human-annotated types and the repository-level consistency of inferred types when subject to static type checking. TypyBench’s methodology, dataset, and evaluation pipeline are intended to reveal both the strengths and persistent challenges of LLM-based type inference in dynamic programming environments.
1. Motivation and Scope
Type inference in dynamic languages such as Python is a longstanding and unresolved challenge. While LLMs have demonstrated strong performance on code understanding tasks, their precise capabilities in type inference—especially at the scale of entire projects—remain underexplored. TypyBench was developed to address this gap by moving beyond evaluations on small code snippets or isolated functions. TypyBench’s benchmark is grounded in a curated set of 50 high-quality, real-world Python repositories, each stripped of type annotations. LLMs are tasked with recovering the missing annotations, simulating practical scenarios such as retrofitting legacy codebases with type information or verifying the safety and maintainability of large projects.
2. Metrics: TypeSim and TypeCheck
TypyBench introduces two principal metrics:
TypeSim (Type Similarity)
TypeSim measures the semantic proximity between an LLM’s inferred type and the ground truth annotation, providing a continuous similarity value instead of a strict binary match. TypeSim is calculated as follows:
- For non-generic (base) types, similarity is computed as the Jaccard index over the sets of methods or operations each type supports: s(T, T′) = |M(T) ∩ M(T′)| / |M(T) ∪ M(T′)|, where M(·) denotes a type's supported-method set (illustrated by the sketch below).
- For generic types (e.g., List[int] or Dict[str, int]), similarity is computed recursively as the mean of the similarity of the “root” type (e.g., List) and the similarity of its type arguments, the latter via a ListCompare procedure.
- For union types, a SetCompare procedure is used to optimally match and compare unordered collections of type constituents.
TypeSim captures both structural and functional congruence, allowing partial credit for “close” but non-exact matches, such as inferring List where the ground-truth annotation is Sequence.
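As a concrete illustration of the base-type case, the following minimal sketch approximates each type's supported-operation set with Python's built-in dir() and computes the Jaccard index; it is meant to convey the idea rather than reproduce TypyBench's implementation.

```python
def method_set(t: type) -> set[str]:
    """Approximate the operations a type supports: public methods plus a few
    common protocol dunders."""
    keep_dunders = {"__getitem__", "__len__", "__iter__", "__contains__"}
    return {name for name in dir(t) if not name.startswith("_") or name in keep_dunders}


def base_type_similarity(t1: type, t2: type) -> float:
    """Jaccard index over the approximate supported-method sets of two types."""
    m1, m2 = method_set(t1), method_set(t2)
    if not m1 and not m2:
        return 1.0
    return len(m1 & m2) / len(m1 | m2)


# Sequence-like types overlap heavily, so list vs. tuple scores higher than
# list vs. dict under this approximation.
print(round(base_type_similarity(list, tuple), 2))  # ≈ 0.4
print(round(base_type_similarity(list, dict), 2))   # ≈ 0.3
```

Under such a measure, a prediction like List for a ground truth of Sequence earns substantial partial credit because the two share most of their operations.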
TypeCheck (Type Consistency)
TypeCheck quantifies the repository-scale consistency of the predicted types by counting the errors produced when the predicted type stubs are subjected to a static type checker (e.g., Mypy). After LLMs generate type stubs (.pyi files), these are checked for type consistency across modules and function signatures, with the number of reported errors serving as a direct proxy for usable type coherence.
| Metric | Input | Output |
|---|---|---|
| TypeSim | Predicted vs. ground-truth types | Similarity score in [0, 1] |
| TypeCheck | Set of predicted stubs | Integer error count (lower is better) |
TypeSim emphasizes local, semantic fidelity, whereas TypeCheck assesses global, integration-level correctness.
3. Dataset and Evaluation Protocol
TypyBench’s dataset comprises 50 open-source Python repositories selected for both quality and diversity of type usage, covering data science, web development, and systems programming. All original type annotations are stripped, producing a “blind” type inference scenario. LLMs generate type predictions for functions, methods, classes, and module-level variables, producing a set of stub files. To prevent information leakage, only public code is included, and the repositories are curated for manageable size and type complexity.
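As an illustration, annotation stripping of this kind can be sketched with the standard library's ast module (the actual tooling used to build the TypyBench dataset is not specified here):

```python
import ast


class StripAnnotations(ast.NodeTransformer):
    """Remove return, parameter, and variable type annotations from a module."""

    def visit_FunctionDef(self, node):
        node.returns = None  # drop the return annotation
        for arg in node.args.posonlyargs + node.args.args + node.args.kwonlyargs:
            arg.annotation = None  # drop parameter annotations
        for arg in (node.args.vararg, node.args.kwarg):
            if arg is not None:
                arg.annotation = None
        return self.generic_visit(node)

    visit_AsyncFunctionDef = visit_FunctionDef  # treat async defs the same way

    def visit_AnnAssign(self, node):
        # `x: int = 1` becomes `x = 1`; a bare declaration `x: int` is removed.
        if node.value is None:
            return None
        return ast.copy_location(
            ast.Assign(targets=[node.target], value=node.value), node
        )


def strip_annotations(source: str) -> str:
    tree = StripAnnotations().visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)  # ast.unparse requires Python 3.9+


print(strip_annotations("def add(x: int, y: int) -> int:\n    total: int = x + y\n    return total"))
```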
The following pipeline summarizes the TypyBench evaluation flow:
- Annotation removal from all repository files.
- Prompting the LLM to infer types, possibly with project context.
- Stub generation with inferred types for all code elements.
- TypeSim calculation for each code element, comparing its predicted type with the corresponding ground-truth annotation.
- TypeCheck computation by running a static type checker across the stubs and aggregating the number of type errors encountered.
Empirical results indicate that state-of-the-art LLMs (e.g., GPT-4, Claude) achieve TypeSim scores around 0.80 under favorable conditions, but suffer a marked drop-off for nested or complex generic types. TypeCheck reveals that even subtle local type mismatches propagate into substantial repository-level inconsistencies.
4. Empirical Findings and Type Complexity Analysis
TypyBench’s results demonstrate that while LLMs are competent at inferring simple types, their accuracy deteriorates for complex, deeply nested, or generic types. For example, performance as measured by exact match drops sharply for types of depth greater than 2. TypeSim, while more forgiving, still registers a decline for increasing complexity, reflecting a persistent challenge in capturing the full semantics of nested structures or union types.
Another key finding is the disjunction between local and global correctness: repositories with relatively high TypeSim scores can still accumulate large numbers of TypeCheck errors, especially when the predicted types are locally plausible but globally incompatible. This is evidenced by instances where inferred return types conflict with downstream expectations or variable uses.
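As a hypothetical illustration (module and function names are invented), a stub that looks reasonable in isolation can still break downstream call sites, which is precisely the kind of mismatch TypeCheck surfaces:

```python
# records.pyi -- inferred stub; a list return type is locally plausible
def load_record(path: str) -> list[str]: ...

# consumer.py -- downstream code treats the result as a mapping, so a static
# checker such as Mypy flags the string index against a list[str] value
from records import load_record

def record_id(path: str) -> str:
    record = load_record(path)
    return record["id"]  # type error: a list cannot be indexed with a str key
```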
Notably, some models achieve better TypeCheck results (i.e., fewer errors) even when their TypeSim scores are merely comparable to those of other models. This suggests that strategies optimizing for repository-level coherence, rather than per-function accuracy, are effective in lowering static checking failures.
5. Methodology: TypeSim and TypeCheck Algorithms
Central components of TypyBench are the algorithms behind TypeSim and TypeCheck. A representative outline of the TypeSim procedure is as follows:
```
Algorithm TypeSim(T, T′):
    If T or T′ is a Union:
        return SetCompare(as_set(T), as_set(T′))
    score ← s(T.root, T′.root)
    If both T and T′ have type arguments:
        score ← ½ · (score + ListCompare(T.args, T′.args))
    Else if exactly one of T, T′ has type arguments:
        score ← score / 2
    Return score
```
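A minimal Python sketch of this recursion is given below; the ParsedType representation, the base-type similarity s(·, ·) (reduced here to exact root matching rather than the Jaccard-based measure above), and the greedy matching in set_compare are illustrative simplifications, not TypyBench's implementation.

```python
from dataclasses import dataclass, field


@dataclass
class ParsedType:
    """Illustrative parse of an annotation, e.g. List[Dict[str, int]] or Union[int, str]."""
    root: str
    args: list["ParsedType"] = field(default_factory=list)


def is_union(t: ParsedType) -> bool:
    return t.root == "Union"


def as_set(t: ParsedType) -> list[ParsedType]:
    return t.args if is_union(t) else [t]


def s(root1: str, root2: str) -> float:
    """Base-type similarity; a full implementation would use the Jaccard index
    over supported methods, reduced here to a crude exact match."""
    return 1.0 if root1 == root2 else 0.0


def list_compare(args1, args2) -> float:
    """Mean similarity of positionally aligned type arguments."""
    n = max(len(args1), len(args2))
    if n == 0:
        return 1.0
    return sum(type_sim(a, b) for a, b in zip(args1, args2)) / n


def set_compare(ts1, ts2) -> float:
    """Greedy matching of unordered union members (a simplification of an
    optimal assignment)."""
    if not ts1 or not ts2:
        return 0.0
    scores, remaining = [], list(ts2)
    for t in ts1:
        best = max(remaining, key=lambda u: type_sim(t, u))
        scores.append(type_sim(t, best))
        remaining.remove(best)
        if not remaining:
            break
    return sum(scores) / max(len(ts1), len(ts2))


def type_sim(t1: ParsedType, t2: ParsedType) -> float:
    if is_union(t1) or is_union(t2):
        return set_compare(as_set(t1), as_set(t2))
    score = s(t1.root, t2.root)
    if t1.args and t2.args:
        score = 0.5 * (score + list_compare(t1.args, t2.args))
    elif t1.args or t2.args:
        score = score / 2
    return score


# List[int] vs. Sequence[int]: the roots differ but the argument matches.
pred = ParsedType("List", [ParsedType("int")])
gold = ParsedType("Sequence", [ParsedType("int")])
print(type_sim(pred, gold))  # 0.5 under this crude s(·, ·): 0.5 * (0.0 + 1.0)
```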
TypeCheck, by contrast, simply involves running a tool like Mypy on the generated stub files and counting the resulting errors as an integer-valued measure of repository-wide type integration.
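A minimal sketch of how such a count could be obtained, assuming Mypy is installed and the generated stubs live in a single directory (the flag choice and directory name are illustrative):

```python
import subprocess


def count_type_errors(stub_dir: str) -> int:
    """Run Mypy over a directory of generated stubs and count reported errors."""
    result = subprocess.run(
        ["mypy", "--ignore-missing-imports", stub_dir],
        capture_output=True,
        text=True,
    )
    # Mypy prints one line per diagnostic, e.g. "pkg/mod.pyi:12: error: ...".
    return sum(1 for line in result.stdout.splitlines() if ": error:" in line)


print(count_type_errors("generated_stubs/"))
```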
6. Implications for Model and Tool Development
TypyBench reveals that improvements in local, semantically driven type prediction (boosting TypeSim) are insufficient for guaranteeing usable, repository-scale annotation (as gauged by TypeCheck). A plausible implication is that model architectures or prompting strategies must explicitly incorporate longer context, cross-module dependencies, and a global view of code relationships.
Preliminary experiments indicate that incorporating repository-wide context into the LLM inference process reduces TypeCheck errors, but doing so raises practical challenges around prompt length, memory consumption, and inference latency. Future work is thus directed toward modeling and evaluation techniques capable of handling such long-range dependencies.
Additionally, these findings argue for integrating reasoning mechanisms that consider type dependencies and their usage throughout the codebase, as well as the need for more sophisticated metrics that jointly capture local semantic accuracy and global static soundness.
7. Future Directions and Impact
TypyBench sets a new comprehensive standard for evaluating automated type inference systems, particularly those based on LLMs. It enables nuanced, multi-level analysis of type inference quality and exposes the limitations of relying solely on local context or exact-matching metrics.
This suggests that future research should emphasize repository-level consistency and context-aware prediction as central objectives. TypyBench’s dataset and evaluation code (available at https://github.com/typybench/typybench) provide an extensible platform for benchmarking progress, comparing new modeling strategies, and identifying strengths and weaknesses of LLM-based type inference.
In summary, TypyBench exposes the challenges LLMs face in attaining both semantic similarity and practical consistency when inferring types in dynamic languages at scale. It provides robust metrics and an empirical foundation for the continued advancement of automated type inference methodologies.