
Terminal-Bench v1.0 & v2.0

Updated 13 February 2026
  • Terminal-Bench comprises v1.0 and v2.0 suites that evaluate LLM agents on real-world CLI tasks using Dockerized environments with automated, outcome-centric verification.
  • The benchmarks leverage a large-scale trajectory dataset from TerminalTraj to measure key metrics such as resolution rate and pass@k, driving significant fine-tuning improvements.
  • Experimental results show up to a 606% relative improvement on TB v1.0, underscoring the critical role of domain diversity and rigorous task curation.

Terminal-Bench (TB) v1.0 and v2.0 constitute a critical benchmarking suite for evaluating agentic models and LLM agents on hard, outcome-driven tasks in realistic command-line interface (CLI) environments. These benchmarks, together with the large-scale trajectory dataset generated by the TerminalTraj pipeline, establish new standards for measuring long-horizon, tool-centric task completion, scaling data curation, and rigorous evaluation in containerized Linux settings (Wu et al., 1 Feb 2026, Merrill et al., 17 Jan 2026).

1. Overview and Motivation

Terminal-Bench emerged to address limitations in prior evaluations of LLM agentic competencies for real-world terminal use. Existing benchmarks were either insufficiently representative of authentic, “hard” CLI workflows or failed to challenge frontier models, leaving gaps in the measurement of agent capabilities relevant to software engineering, scientific computation, system administration, security, and ML engineering. Both TB v1.0 and v2.0 leverage Dockerized environments and natural-language instructions, requiring agents to autonomously issue shell commands to transform an environment’s state, validated through automated testing (Merrill et al., 17 Jan 2026).

2. Task Coverage and Structure

TB v1.0 focuses on eight general terminal workflows: file-system manipulation (ls, cd, grep, find), process inspection, package installation, and simple scripting. All tasks are grounded in isolated Docker containers, ensuring reproducibility and isolation (Wu et al., 1 Feb 2026).

TB v2.0 expands coverage substantially, comprising 89 tasks (selected from 229 community proposals) across diverse domains. In addition to general workflows, v2.0 introduces specialized categories: Web Service, SQL, QEMU, Security, Multimodal, Data Processing, Model Training & Evaluation, and Environment Interaction, each rooted in domain-specific tools (e.g., nginx, sqlite3, qemu, hashcat, PIL) (Wu et al., 1 Feb 2026, Merrill et al., 17 Jan 2026). Every task includes:

  • A Dockerfile defining the environment
  • Natural-language instructions
  • A human-written oracle solution
  • Automated, outcome-centric verification via pytest or custom scripts
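A hypothetical verifier in this outcome-centric style might look as follows; the task (writing a config file), the paths, and the expected values are illustrative assumptions, not drawn from the actual suite:

```python
import json
import pathlib


def check_task_outcome(root: pathlib.Path) -> bool:
    """Outcome-centric check: inspect only the environment's final state,
    never the agent's transcript. Task and file layout are hypothetical."""
    cfg = root / "output" / "config.json"
    if not cfg.exists():
        return False  # the agent never produced the artifact
    data = json.loads(cfg.read_text())
    # Success is defined purely by the resulting environment state.
    return data.get("port") == 8080 and data.get("debug") is False
```

In the real harness, a check of this kind would run via pytest inside the task's Docker container after the agent finishes.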

Table: Representative Domains and Example Tasks (TB v2.0)

| Domain | Example Task (Difficulty) | Key Tool(s) |
|----------------------|------------------------------------|-------------|
| Software Engineering | cobol-modernization (Easy) | Python |
| Security | break-filter-js-from-html (Medium) | Python, HTML |
| Scientific Computing | adaptive-rejection-sampler (Medium) | R |
| System Administration | configure-git-webserver (Hard) | Git, HTTP |
| ML Engineering | train-fasttext (Hard) | fastText |

The TB v2.0 suite applies rigorous human- and LLM-backed triage during curation, covers a wide range of real workflow complexities, and enforces single-pass, outcome-driven verification that prevents agents from gaming results by inspecting the tests (Merrill et al., 17 Jan 2026).

3. Evaluation Metrics and Protocols

The primary metric for both TB v1.0 and v2.0 is resolution rate (success rate), defined as:

$$\text{Score} = \frac{\#\text{successful tasks}}{\#\text{total tasks}} \times 100\%$$

(Merrill et al., 17 Jan 2026)

Additional metrics include:

  • Exact-match accuracy ("Acc") in a single-pass setting
  • pass@k: the probability that at least one of $k$ sampled rollouts is successful, with $k \in \{1, 2, 4, 8, 16\}$ (Wu et al., 1 Feb 2026)
  • Average number of agent turns, input/output tokens, and timeout rate per task
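These metrics can be made concrete with a short sketch. The pass@k formula below is the standard unbiased estimator (computed from $n$ total rollouts with $c$ successes); treating it as the papers' exact estimator is an assumption:

```python
from math import comb


def resolution_rate(successful: int, total: int) -> float:
    """Score = (# successful tasks / # total tasks) * 100%."""
    return 100.0 * successful / total


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k rollouts,
    drawn without replacement from n rollouts (c successful), succeeds."""
    if n - c < k:
        return 1.0  # too few failures to fill a k-sample with failures only
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, `pass_at_k(16, 4, 1)` recovers the per-rollout success rate of 0.25, while larger $k$ approaches 1.0 as long as at least one rollout succeeded.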

Evaluation is conducted in a fresh Docker sandbox per task via the Terminus-2 harness, integrating a robust checking suite and isolation protocols. All reported TB results use mean accuracy and 95% bootstrap confidence intervals over at least four independent seeds for open-weight models (Wu et al., 1 Feb 2026).
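A percentile bootstrap over per-task pass/fail outcomes is one plausible way to obtain such intervals; the resampling details below are an assumption, not the papers' exact procedure:

```python
import random


def bootstrap_ci(outcomes: list, n_boot: int = 10_000,
                 alpha: float = 0.05, seed: int = 0):
    """Percentile bootstrap CI for mean accuracy over 0/1 task outcomes.
    Resamples the task set with replacement n_boot times."""
    rng = random.Random(seed)
    n = len(outcomes)
    means = sorted(sum(rng.choices(outcomes, k=n)) / n for _ in range(n_boot))
    lo = means[int(n_boot * alpha / 2)]
    hi = means[int(n_boot * (1 - alpha / 2)) - 1]
    return lo, hi
```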

4. Data Generation Pipeline: TerminalTraj

TerminalTraj enables scalable, executable, and verifiable agentic trajectory collection for terminal tasks. Its pipeline consists of:

  1. Data Sources Collection:
    • Crawling 899,741 GitHub repositories across eight programming languages and 20,000 external assets.
    • Extraction of domain relevance signals (e.g., Markdown/Shell files).
  2. Docker Image Curation:
    • Application of a lightweight ScoreModel (Qwen2.5-Coder-0.5B backbone) assigning each repository a quality score $Q_i$; repositories with $Q_i \geq 0.2$ on average are selected.
    • Construction of 32,325 Docker images (17% success rate from 196,051 high-quality repositories).
  3. Instance Generation and Verification:
    • Synthesis of 1,030,695 task instances.
    • Task queries and executable pytest validation code are generated by Qwen3-Coder-480B.
    • Agents are rolled out under Terminus-2, sampling $k = 4$ attempts per query.
    • Only trajectories passing all validation checks are retained, yielding 50,733 verified examples (4.92% retention) (Wu et al., 1 Feb 2026).
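The funnel statistics above can be sanity-checked with simple arithmetic (all figures taken from the pipeline description):

```python
# Figures from the TerminalTraj pipeline description above.
repos_crawled = 899_741
repos_high_quality = 196_051
docker_images = 32_325
task_instances = 1_030_695
verified_trajectories = 50_733

# Docker image build success rate (reported as ~17% in the text).
image_build_rate = docker_images / repos_high_quality

# End-to-end trajectory retention after code verification (~4.92%).
retention_rate = verified_trajectories / task_instances
```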

The retention of only code-verified, instance-specific successful trajectories demonstrably improves downstream agent training efficacy.

5. Model Training and Experimental Results

Backbones: Qwen2.5-Coder-Instruct (7B, 14B, 32B) with fine-tuned TerminalTraj-7B, 14B, and 32B variants via multi-turn supervised fine-tuning (Megatron-LM).

Hyperparameters: Sequence length 4096, batch size 2048, learning rate $1 \times 10^{-5}$ (cosine decay to a minimum of $1 \times 10^{-6}$, 3,000 warmup steps), weight decay 0.01, gradient clipping 1.0, BF16 precision (Wu et al., 1 Feb 2026).
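Under the stated hyperparameters, the learning-rate schedule can be sketched as below; the total step count is an assumption (not given in the text), and Megatron-LM's exact warmup/decay shapes may differ:

```python
import math


def learning_rate(step: int, peak: float = 1e-5, floor: float = 1e-6,
                  warmup_steps: int = 3_000, total_steps: int = 30_000) -> float:
    """Linear warmup to `peak`, then cosine decay to `floor`.
    `total_steps` is an assumed training horizon, not from the source."""
    if step < warmup_steps:
        return peak * step / warmup_steps
    progress = min(1.0, (step - warmup_steps) / (total_steps - warmup_steps))
    return floor + 0.5 * (peak - floor) * (1.0 + math.cos(math.pi * progress))
```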

Performance:

  • Qwen2.5-Coder backbones perform at near-random levels ($\approx 5\%$) on TB v1.0 and v2.0.
  • Fine-tuning on TerminalTraj gives large absolute and relative improvements:

    • TerminalTraj-32B: 35.30% (TB v1.0), 22.00% (TB v2.0)
    • Absolute gains: 30.3 points (TB v1.0), 17.5 points (TB v2.0)
    • Relative improvement (TB v1.0): $\frac{35.30 - 5.00}{5.00} \times 100\% = 606\%$
    • Relative improvement (TB v2.0): $\frac{22.00 - 4.49}{4.49} \times 100\% \approx 390\%$

  • Compared to Qwen3-32B-Nex-N1 (28.75%/16.70%), TerminalTraj-32B is +6.55/+5.30 points higher, despite using less data and a smaller backbone (Wu et al., 1 Feb 2026).

Frontier closed-source model combinations (e.g., GPT-5.2 + Codex CLI) currently top the TB v2.0 leaderboard (up to 62.9% ± 3.0%), but open-weight models such as Kimi K2 Thinking + Terminus 2 reach 35.7% ± 2.8% (Merrill et al., 17 Jan 2026).

6. Scaling Behavior, Ablations, and Analysis

Test-Time Scaling: TerminalTraj-32B displays strong scaling with increased pass@k; at pass@16, it achieves 63% on TB v1.0, outperforming Qwen3-Coder-480B for $k > 4$ (Qwen2.5-Coder-7B remains under 15% at pass@16), demonstrating enhanced sample efficiency at inference (Wu et al., 1 Feb 2026).

Domain Ablation: Removal of any single specialized domain from training significantly degrades performance (e.g., –9.7%/–8.7% for QEMU; –8.9%/–7.4% for Web Service data on TB v1.0/2.0). This emphasizes the necessity of domain diversity for broad-agent generalization.

Trajectory Verification Ablation: Models trained on code-verified data consistently outperform those using LLM-verified data, especially for modest data scales (e.g., +10–15 points improvement with 1K–4K samples), establishing the importance of strict, executable, instance-specific filtering (Wu et al., 1 Feb 2026).

Failure Modes: Analyses reveal three major classes:

  • Execution (e.g., disobeying specifications, step repetition)
  • Coherence (context loss, reasoning-action mismatch)
  • Verification (premature termination, insufficient verification)

Most failures in closed-source models (∼50%) are execution errors, while open-weight models show a more uniform distribution. At the command level, 24.1% of failures are invocation errors, with significant fractions from dependency and filesystem issues.

7. Context, Ecosystem, and Future Directions

TB v2.0 represents a substantial scale-up from the initial TB framework (~20 tasks), adding mathematically intensive, orchestrated multi-tool, and long-horizon assembly tasks. The auditing pipeline—combining CI checks, LLM-backed static analysis, and multi-party manual triage—surpasses prior standards in rigor.

Integration with evaluation infrastructure (Harbor) and adapter support for ingesting 26 external benchmarks substantially expands evaluation diversity and reproducibility (Merrill et al., 17 Jan 2026). Agents remain below 65% resolution on TB v2.0, with persistent gaps in long-horizon reasoning, verification, command reliability, and cross-tool competency.

A plausible implication is that progress on TB v2.0 will require both improved model architectures and enhanced environment introspection and planning modules, as well as more sample-efficient, diverse datasets such as that produced by TerminalTraj.

