TimeMachine-bench: Evaluating Code Migration
- TimeMachine-bench is a benchmark suite that assesses LLM-driven repository migrations by simulating real-world dependency evolution in Python projects.
- It employs an automated, multi-stage pipeline to curate, filter, and verify projects, ensuring realistic testing scenarios with resource-constrained evaluations.
- The benchmark reveals high efficacy on routine 'Easy' tasks while highlighting challenges in semantic correctness and minimal editing on complex migration cases.
TimeMachine-bench refers to a benchmark suite designed for evaluating model capabilities on repository-level migration tasks, specifically focusing on automating software migration in the context of real-world Python projects. Its construction, scope, and evaluation protocol target the challenges that arise when codebases must adapt to evolving third-party dependencies, as is common in practical software engineering. Unlike conventional code-generation benchmarks that operate over a fixed environment, TimeMachine-bench simulates the actual dynamics of software evolution, providing a platform for rigorously measuring the effectiveness of LLM-driven agents in realistic migration scenarios (Fujii et al., 30 Jan 2026).
1. Benchmark Motivation and Scope
TimeMachine-bench addresses the need to study repository-level migration—a task where Python projects with passing tests in an “old” dependency environment start to fail under updated dependencies, necessitating adaptation in response to API changes, function deprecations, behavioral changes, or language version upgrades. The benchmark encompasses the entire PyPI ecosystem, not limited to a small set of library-specific migrations. It systematically captures breakage patterns that include broken imports, missing or renamed APIs, and switching APIs from synchronous to asynchronous interfaces.
The distinguishing factor from prior benchmarks is the simulation of real-world evolution, with the “old” environment pegged to a historical commit date and the “new” environment standardized to a recent snapshot (July 31, 2025). This ensures the tasks reflect the actual problems encountered by practitioners in open-source maintenance.
2. Dataset Construction Process
The dataset-generation pipeline is designed for automation and extensibility, enabling continuous updates as new failures emerge. It proceeds in five main stages:
- Pre-Execution Filtering: Seeds are drawn from The Stack v2 (≈200k Python repositories). Repositories must provide a reproducible dependency specification (requirements.txt, pyproject.toml, or setup.py), import pytest or unittest, have a permissive license, and ≥1 GitHub star, reducing the pool to ≈45k.
- Runtime Environment Preparation: “Time travel” is achieved via the pypi-timemachine mechanism, which restricts pip to packages published before a specified cutoff date. Each environment is pinned to a Python version and project setup instructions inferred from the project metadata using a Claude-based workflow.
- Execution-Based Candidate Extraction: Docker-based sandboxing is used to build and test each repository under both the “old” and “new” environments. Projects must install successfully and pass all tests in the old environment while failing at least one test in the new environment, excluding installation failures (see the sketch after this list).
- Post-Execution Filtering: Excludes repositories whose failures stem from test timeouts or from stack traces rooted in external dependencies rather than user code.
- Human Verification: A random sample from the post-filtered corpus undergoes manual analysis. Minimal code edits (user code only) required to restore all tests in the “new” environment are annotated, excluding fixes that trivially downgrade dependencies or alter tests. Difficulty levels are assigned based on repair effort.
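To make the selection rule concrete, the following Python sketch mirrors the candidate-extraction filter under stated assumptions: the Docker image, the pypi-timemachine proxy URL, the install-success check, and the helper structure are illustrative placeholders, not the benchmark's actual pipeline code.

```python
"""Minimal sketch of the execution-based candidate-extraction filter.

Assumptions: a generic python:3.11 Docker image, a pypi-timemachine proxy
already running on the host at OLD_INDEX for date-pinned installs, and a
crude string check for install success. repo_path must be an absolute path.
"""
import subprocess
from dataclasses import dataclass

OLD_INDEX = "http://localhost:8080/"    # pypi-timemachine proxy pinned to the historical cutoff (assumed URL)
NEW_INDEX = "https://pypi.org/simple/"  # live index representing the 2025-07-31 snapshot

@dataclass
class RunResult:
    installed: bool
    tests_passed: bool

def run_in_sandbox(repo_path: str, index_url: str, timeout: int = 1800) -> RunResult:
    """Install the project and run its tests inside a throwaway container."""
    shell_cmd = f"pip install --index-url {index_url} -e . && python -m pytest"
    try:
        proc = subprocess.run(
            ["docker", "run", "--rm", "--network", "host",   # host network so the local proxy is reachable
             "-v", f"{repo_path}:/repo", "-w", "/repo",
             "python:3.11", "bash", "-lc", shell_cmd],
            capture_output=True, text=True, timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return RunResult(installed=False, tests_passed=False)  # timeouts are filtered out later anyway
    installed = "Successfully installed" in proc.stdout        # crude install check, for the sketch only
    return RunResult(installed=installed, tests_passed=proc.returncode == 0)

def is_migration_candidate(repo_path: str) -> bool:
    """Keep repos that are green in the old environment but break in the new one."""
    old = run_in_sandbox(repo_path, OLD_INDEX)
    if not (old.installed and old.tests_passed):
        return False                                # must install and pass all tests under the old snapshot
    new = run_in_sandbox(repo_path, NEW_INDEX)
    return new.installed and not new.tests_passed   # failing tests, but not an installation failure
```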
This produces two main datasets:
- TimeMachine-bench-Full: 1,145 repositories where migration is required.
- TimeMachine-bench-Verified: 100 repositories (median 2 lines to patch, max 54, majority “Easy”) with human-curated minimal solutions and difficulty annotation.
3. Task Definition and Evaluation Protocol
Each migration instance consists of a codebase snapshot, the target dependency and Python versions, and the pre-existing test suite. LLM-based agents perform iterative code editing, restricted to implementation files (tests may not be modified), under per-instance budgets on LLM calls and test executions.
Evaluation is bifurcated into:
- Sufficiency: Extends pass@k to pass@1(n, m), counting an instance as solved only if all tests pass within a budget of n LLM calls and m test executions.
- Necessity (Edit Precision): Quantifies the minimality of edits with respect to the human-verified gold annotation, i.e., the fraction of agent-edited lines that also appear in the gold patch: prec = |E_agent ∩ E_gold| / |E_agent|.
Edits to test files are disallowed (“test-masking”) to prevent degenerate solutions.
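For concreteness, the two metrics can be sketched as below; the Attempt record, the (file, line) edit representation, and the empty-edit convention are illustrative assumptions rather than the paper's exact formulation.

```python
"""Sketch of the two evaluation metrics under assumed definitions."""
from dataclasses import dataclass

@dataclass
class Attempt:
    all_tests_pass: bool   # did the full pre-existing suite pass at the end?
    llm_calls: int         # LLM calls consumed
    test_executions: int   # test-suite runs consumed

def pass_at_1(attempt: Attempt, n: int = 100, m: int = 10) -> bool:
    """pass@1(n, m): solved only if all tests pass within both resource budgets."""
    return attempt.all_tests_pass and attempt.llm_calls <= n and attempt.test_executions <= m

def edit_precision(agent_edits: set[tuple[str, int]],
                   gold_edits: set[tuple[str, int]]) -> float:
    """prec@1: fraction of agent-edited (file, line) pairs that appear in the gold patch."""
    if not agent_edits:
        return 1.0  # no edits at all: vacuously precise (convention for this sketch)
    return len(agent_edits & gold_edits) / len(agent_edits)
```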
4. Baselines and Experimental Setup
TimeMachine-bench supports agent-based evaluation with interoperable LLM tool-use frameworks. The baseline evaluation covers 11 LLMs:
Proprietary:
- Claude Sonnet 4
- Claude 3.5 Sonnet v2
- GPT-5
- GPT-4o
Open-weight:
- Qwen3-Coder-480B
- Qwen3-235B
- Qwen3-32B
- Llama-4-Maverick
- Llama-3.3
- DeepSeek-V3.1
- gpt-oss-120b (low)
Each model is used as part of a ReAct agent with access to 10 tools (e.g., list_dir, search_dir, edit_file, execute_tests). Observation history is truncated: only the 5 most recent agent turns and the output of the most recent test run are retained. All LLMs are run deterministically (temperature = 0) with a maximum response length of 512 tokens.
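A minimal ReAct-style loop consistent with this setup might look like the sketch below; the llm callable, the action object, and the execute_tests return value are placeholders, and the tool implementations are assumed rather than taken from the benchmark's harness.

```python
"""Minimal ReAct-style agent loop mirroring the described budgets and truncation."""
from collections import deque

MAX_TURNS_KEPT = 5    # only the 5 most recent agent turns are retained
MAX_LLM_CALLS = 100   # n: per-instance LLM-call budget
MAX_TEST_RUNS = 10    # m: per-instance test-execution budget

def run_agent(llm, tools: dict, task_prompt: str) -> bool:
    history = deque(maxlen=MAX_TURNS_KEPT)  # truncated observation history
    last_test_output = ""                   # the most recent test output is always kept
    test_runs = 0
    for _ in range(MAX_LLM_CALLS):
        # Deterministic decoding (T = 0) with a 512-token response cap, as in the setup.
        action = llm(task_prompt, list(history), last_test_output,
                     temperature=0.0, max_tokens=512)
        if action.name == "execute_tests":
            if test_runs >= MAX_TEST_RUNS:
                break                                         # test budget exhausted
            test_runs += 1
            passed, last_test_output = tools["execute_tests"]()  # assumed to return (bool, str)
            if passed:
                return True
            observation = last_test_output
        else:
            observation = tools[action.name](**action.args)   # e.g. list_dir, search_dir, edit_file
        history.append((action, observation))
    return False  # budget exhausted without a green test suite
```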
5. Results and Observations
Performance on TimeMachine-bench-Verified under budget constraints (n = 100 LLM calls, m = 10 test executions) is summarized in the table:
| Model | pass@1(100,10) | prec@1(100,10) | Easy solved (of 64) | Medium solved (of 30) | Hard solved (of 6) |
|---|---|---|---|---|---|
| Claude Sonnet 4 | 99.0% | 78.0% | 64 | 30 | 5 |
| Claude 3.5 Sonnet v2 | 91.0% | 66.8% | 61 | 25 | 5 |
| GPT-5 | 91.0% | 54.2% | 62 | 27 | 2 |
| GPT-4o | 76.0% | 61.4% | 57 | 19 | 0 |
| Qwen3-Coder-480B | 90.0% | 70.1% | 62 | 26 | 2 |
| Qwen3-235B | 87.0% | 69.1% | 62 | 24 | 1 |
| Qwen3-32B | 53.0% | 44.1% | 40 | 13 | 0 |
| Llama-4-Maverick | 76.0% | 63.2% | 56 | 20 | 0 |
| Llama-3.3 | 52.0% | 44.0% | 40 | 12 | 0 |
| DeepSeek-V3.1 | 75.0% | 61.4% | 52 | 21 | 2 |
| gpt-oss-120b (low) | 55.0% | 33.8% | 36 | 19 | 0 |
Key findings:
- Sufficiency: State-of-the-art LLM-agents can almost always adapt “Easy” and many “Medium” projects, but the majority fail on “Hard” cases (e.g., requiring deep refactoring or subtle API semantics).
- Necessity: Many agents (e.g., GPT-5, prec@1 = 54.2%) introduce excessive or superfluous modifications relative to the minimal human patch; only Claude Sonnet 4 consistently combines high sufficiency with high precision.
- Reliability Pathologies: Agents sometimes introduce spurious fixes exploiting low test coverage (e.g., adding dummy constants), rarely utilize revert_last to undo ineffective edits, and commonly over-execute redundant viewing steps (notably, GPT-5 spends 53% of actions on view_file and delays initial execute_tests, leading to slow convergence).
- Budgeted Robustness: Imposing n = 100 LLM calls and m = 10 test executions as hard resource caps bounds the practical cost of each agent run. Most models converge to a solution in far fewer turns on “Easy” instances.
6. Analysis and Implications
TimeMachine-bench reveals that recent LLM-based code agents can automate a significant portion of routine real-world repository migrations but remain brittle regarding edit precision, semantic faithfulness, and introspective repair. Critical limitations are:
- Semantic Correctness: Passing the original tests does not necessarily guarantee full semantic equivalence post-migration, especially in the presence of weak or incomplete test suites.
- Edit Minimality: Non-negligible code review overhead is introduced when agents propose unnecessarily broad or unrelated changes.
- Agent Tool-Use Suboptimality: Current tool-use strategies, even when equipped with ReAct-like frameworks, do not leverage historical context or effective error localization for optimal convergence.
- Spurious Passing: Potential for models to exploit inadequate coverage with trivial or hacky edits remains high.
A plausible implication is that future work should move toward richer feedback, automated test-suite expansion, integration of version-control history signals, and model architectures equipped for more robust codebase introspection. The fully automated and reproducible pipeline for updating TimeMachine-bench makes it extensible to other language ecosystems and evolving dependency landscapes.
7. Future Prospects and Extensions
The TimeMachine-bench protocol, through its continuous and automated data generation as well as a rigorous dual-metric evaluation design, provides a basis for further research on reliability and safety in software evolution. Recommended future augmentations include:
- Cross-language Expansion: Adapting the date-filtered dependency control and migration protocol to ecosystems like npm (JavaScript), Maven (Java), or RubyGems (Ruby).
- Automated Oracle Generation: Leveraging automated test generation approaches to augment the gold oracle and reduce false acceptances due to test-incompleteness.
- Version-History Integrations: Incorporating fine-grained VCS history and “migration exemplars” into agent toolkits.
- Improved Sandboxing: Bolstering isolation and reproducibility for complex build and runtime environments.
TimeMachine-bench is positioned to catalyze the development and assessment of evolution-aware code agents that can safely scale with the real-world pace and diversity of software maintenance challenges (Fujii et al., 30 Jan 2026).