Terminal-Bench v1.0 & v2.0
- Terminal-Bench comprises v1.0 and v2.0 suites that evaluate LLM agents on real-world CLI tasks using Dockerized environments with automated, outcome-centric verification.
- The benchmarks leverage a large-scale trajectory dataset from TerminalTraj to measure key metrics such as resolution rate and pass@k, driving significant fine-tuning improvements.
- Experimental results show up to a 606% relative improvement on TB v1.0, underscoring the critical role of domain diversity and rigorous task curation.
Terminal-Bench (TB) v1.0 and v2.0 constitute a critical benchmarking suite for evaluating agentic models and LLM agents on hard, outcome-driven tasks in realistic command-line interface (CLI) environments. These benchmarks, together with the large-scale trajectory dataset generated by the TerminalTraj pipeline, establish new standards for measuring long-horizon, tool-centric task completion, scaling data curation, and rigorous evaluation in containerized Linux settings (Wu et al., 1 Feb 2026, Merrill et al., 17 Jan 2026).
1. Overview and Motivation
Terminal-Bench emerged to address limitations in prior evaluations of LLM agentic competencies for real-world terminal use. Existing benchmarks were either insufficiently representative of authentic, “hard” CLI workflows or failed to challenge frontier models, leaving gaps in the measurement of agent capabilities relevant to software engineering, scientific computation, system administration, security, and ML engineering. Both TB v1.0 and v2.0 leverage Dockerized environments and natural-language instructions, requiring agents to autonomously issue shell commands to transform an environment’s state, validated through automated testing (Merrill et al., 17 Jan 2026).
2. Task Coverage and Structure
TB v1.0 focuses on eight general terminal workflow categories, including file-system manipulation (ls, cd, grep, find), process inspection, package installation, and simple scripting. All tasks are grounded in isolated Docker containers, ensuring reproducibility and isolation (Wu et al., 1 Feb 2026).
TB v2.0 expands drastically, comprising 89 tasks (from 229 community proposals) across domains. In addition to general workflows, v2.0 introduces specialized categories: Web Service, SQL, QEMU, Security, Multimodal, Data Processing, Model Training & Evaluation, and Environment Interaction. Each is rooted in domain-specific tools (e.g., nginx, sqlite3, qemu, hashcat, PIL) (Wu et al., 1 Feb 2026, Merrill et al., 17 Jan 2026). Every task features:
- A Dockerfile defining the environment
- Natural-language instructions
- A human-written oracle solution
- Automated, outcome-centric verification via pytest or custom scripts
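Concretely, outcome-centric verification reduces to boolean checks on the final state the agent leaves behind in the container. The sketch below illustrates the pattern with an invented "archive the logs" task (the task, paths, and file names are hypothetical, not drawn from the benchmark):

```python
# Sketch of an outcome-centric verifier in the style TB tasks use:
# checks run against the final container state, not the agent's steps.
# The task ("create logs.tar.gz containing the .log files") is invented.
import pathlib
import tarfile
import tempfile

def verify_archive(base: pathlib.Path) -> bool:
    """Pass iff logs.tar.gz exists under base and contains a .log file."""
    archive = base / "logs.tar.gz"
    if not archive.is_file():
        return False
    with tarfile.open(archive, "r:gz") as tar:
        return any(name.endswith(".log") for name in tar.getnames())

# Simulate an agent having solved the task, then verify the outcome.
base = pathlib.Path(tempfile.mkdtemp())
(base / "logs").mkdir()
(base / "logs" / "app.log").write_text("ok\n")
with tarfile.open(base / "logs.tar.gz", "w:gz") as tar:
    tar.add(base / "logs" / "app.log", arcname="app.log")

print(verify_archive(base))  # True for this solved state
```

In the benchmark itself such checks are written as pytest tests or custom scripts; the point is that only the resulting environment state matters, not the command sequence that produced it.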
Table: Representative Domains and Example Tasks (TB v2.0)

| Domain | Example Task (Difficulty) | Key Tool(s) |
|----------------------|------------------------------------|-------------|
| Software Engineering | cobol-modernization (Easy) | Python |
| Security | break-filter-js-from-html (Medium) | Python, HTML |
| Scientific Computing | adaptive-rejection-sampler (Medium) | R |
| System Administration | configure-git-webserver (Hard) | Git, HTTP |
| ML Engineering | train-fasttext (Hard) | fastText |
The TB v2.0 suite applies rigorous human and LLM-backed triage for curation, covers a significant range of real workflow complexities, and enforces a single-pass outcome-driven verification that precludes cheating through agent introspection (Merrill et al., 17 Jan 2026).
3. Evaluation Metrics and Protocols
The primary metric for both TB v1.0 and v2.0 is resolution rate (success rate), defined as the fraction of tasks whose final environment state passes all automated verification checks:

$$\text{Resolution Rate} = \frac{\#\{\text{tasks passing all verification checks}\}}{\#\{\text{total tasks}\}}$$
Additional metrics include:
- Exact-match accuracy ("Acc") in a single-pass setting
- pass@k: the probability that at least one of k sampled rollouts succeeds, computed over multiple rollouts per task (Wu et al., 1 Feb 2026)
- Average number of agent turns, input/output tokens, and timeout rate per task
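The pass@k values can be computed with the standard unbiased estimator (an assumption here; the paper does not reproduce the formula in this summary): given n rollouts of which c succeed, pass@k = 1 − C(n−c, k)/C(n, k).

```python
# Standard unbiased pass@k estimator: with n rollouts per task, c of
# which succeed, the chance that a random subset of k contains at
# least one success is 1 - C(n-c, k)/C(n, k).
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples (from n rollouts) succeeds."""
    if n - c < k:  # fewer failures than draws -> success is guaranteed
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(16, 4, 1))  # = c/n = 0.25
print(pass_at_k(16, 4, 8))  # rises sharply with k
```

At k = 1 the estimator reduces to the plain success rate c/n, which is why pass@1 and resolution rate coincide in single-rollout evaluation.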
Evaluation is conducted in a fresh Docker sandbox per task via the Terminus-2 harness, integrating a robust checking suite and isolation protocols. All reported TB results use mean accuracy and 95% bootstrap confidence intervals over at least four independent seeds for open-weight models (Wu et al., 1 Feb 2026).
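The reporting protocol (mean accuracy with a 95% bootstrap confidence interval over independent seeds) can be sketched as follows; the per-seed accuracies below are invented for illustration:

```python
# Percentile-bootstrap 95% CI over per-seed accuracies, matching the
# reported "mean +/- 95% bootstrap CI over >= 4 seeds" protocol.
import random

def bootstrap_ci(values, n_resamples=10_000, alpha=0.05, seed=0):
    rng = random.Random(seed)
    means = sorted(
        sum(rng.choices(values, k=len(values))) / len(values)
        for _ in range(n_resamples)
    )
    lo = means[int(alpha / 2 * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

per_seed_acc = [0.34, 0.36, 0.35, 0.37]  # four seeds (hypothetical values)
mean = sum(per_seed_acc) / len(per_seed_acc)
lo, hi = bootstrap_ci(per_seed_acc)
print(f"{mean:.3f} (95% CI [{lo:.3f}, {hi:.3f}])")
```

With only four seeds the interval is coarse, which is why the protocol fixes a minimum seed count rather than a single run per model.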
4. Data Generation Pipeline: TerminalTraj
TerminalTraj enables scalable, executable, and verifiable agentic trajectory collection for terminal tasks. Its pipeline consists of:
- Data Sources Collection:
- Crawling 899,741 GitHub repositories across eight programming languages and 20,000 external assets.
- Extraction of domain relevance signals (e.g., Markdown/Shell files).
- Docker Image Curation:
- Application of a lightweight ScoreModel (Qwen2.5-Coder-0.5B backbone) to score and select high-quality repositories.
- Construction of 32,325 Docker images (17% success rate from 196,051 high-quality repositories).
- Instance Generation and Verification:
- Synthesis of 1,030,695 task instances.
- Task queries and executable pytest validation code are generated by Qwen3-Coder-480B.
- Agents are rolled out under Terminus-2, with multiple sampled attempts per query.
- Only trajectories passing all validation checks are retained, yielding 50,733 verified examples (4.92% retention) (Wu et al., 1 Feb 2026).
The retention of only code-verified, instance-specific successful trajectories demonstrably improves downstream agent training efficacy.
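The retain-only-verified rule and the resulting funnel can be sketched as below; the check functions are stand-ins for the pipeline's executable, instance-specific validation, while the counts come from the figures above:

```python
# A trajectory survives only if every validation check passes.
# The two checks here are illustrative stand-ins, not the pipeline's own.
def keep(trajectory, checks):
    return all(check(trajectory) for check in checks)

checks = [lambda t: t["exit_code"] == 0, lambda t: t["tests_passed"]]
rollouts = [
    {"exit_code": 0, "tests_passed": True},
    {"exit_code": 0, "tests_passed": False},
    {"exit_code": 1, "tests_passed": True},
]
verified = [t for t in rollouts if keep(t, checks)]
print(len(verified))  # 1

# Funnel arithmetic from the reported pipeline counts:
print(f"{100 * 50_733 / 1_030_695:.2f}% retained")  # 4.92% retained
```

The strictness of this filter is the point: a 4.92% retention rate trades raw volume for trajectories whose success is certified by executable checks.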
5. Model Training and Experimental Results
Backbones: Qwen2.5-Coder-Instruct (7B, 14B, 32B) with fine-tuned TerminalTraj-7B, 14B, and 32B variants via multi-turn supervised fine-tuning (Megatron-LM).
Hyperparameters: sequence length 4096, batch size 2048, cosine-decay learning-rate schedule with 3,000 warmup steps, weight decay 0.01, gradient clipping 1.0, BF16 precision (Wu et al., 1 Feb 2026).
Performance:
- Qwen2.5-Coder backbones perform near random on TB v1.0 and v2.0 (roughly 5.0% and 4.5% resolution, as implied by the final scores minus the absolute gains below).
- Fine-tuning on TerminalTraj gives large absolute and relative improvements:
- TerminalTraj-32B: 35.30% (TB v1.0), 22.00% (TB v2.0)
- Absolute gains: 30.3 points (TB v1.0), 17.5 points (TB v2.0)
- Relative improvements: 606% (TB v1.0), ≈389% (TB v2.0)
- Compared to Qwen3-32B-Nex-N1 (28.75%/16.70%), TerminalTraj-32B is +6.55/+5.30 points higher, despite using less data and a smaller backbone (Wu et al., 1 Feb 2026).
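The relative-improvement figures follow from simple arithmetic; the baseline scores used here (5.00% and 4.50%) are inferred from the reported final scores minus the absolute gains:

```python
# Worked arithmetic behind the reported gains: absolute points gained
# and relative improvement over the pre-fine-tuning baseline.
def gains(baseline: float, finetuned: float):
    absolute = finetuned - baseline
    relative = 100 * absolute / baseline
    return absolute, relative

abs_v1, rel_v1 = gains(5.00, 35.30)  # TB v1.0: baseline inferred as 5.00%
abs_v2, rel_v2 = gains(4.50, 22.00)  # TB v2.0: baseline inferred as 4.50%
print(f"v1.0: +{abs_v1:.1f} pts, {rel_v1:.0f}% relative")  # +30.3 pts, 606%
print(f"v2.0: +{abs_v2:.1f} pts, {rel_v2:.0f}% relative")  # +17.5 pts, 389%
```

The very large relative numbers reflect how close the untuned backbones sit to zero; the absolute gains (30.3 and 17.5 points) are the more conservative reading.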
Frontier closed-source model combinations (e.g., GPT-5.2 + Codex CLI) currently top the TB v2.0 leaderboard (up to 62.9% ± 3.0%), but open-weight models such as Kimi K2 Thinking + Terminus 2 reach 35.7% ± 2.8% (Merrill et al., 17 Jan 2026).
6. Scaling Behavior, Ablations, and Analysis
Test-Time Scaling: TerminalTraj-32B scales strongly as k increases; at pass@16 it achieves 63% on TB v1.0, surpassing Qwen3-Coder-480B at larger k, while Qwen2.5-Coder-7B remains under 15% at pass@16, demonstrating enhanced sample efficiency at inference (Wu et al., 1 Feb 2026).
Domain Ablation: Removal of any single specialized domain from training significantly degrades performance (e.g., –9.7%/–8.7% for QEMU; –8.9%/–7.4% for Web Service data on TB v1.0/2.0). This emphasizes the necessity of domain diversity for broad-agent generalization.
Trajectory Verification Ablation: Models trained on code-verified data consistently outperform those using LLM-verified data, especially for modest data scales (e.g., +10–15 points improvement with 1K–4K samples), establishing the importance of strict, executable, instance-specific filtering (Wu et al., 1 Feb 2026).
Failure Modes: Analyses reveal three major classes:
- Execution (e.g., disobeying specifications, step repetition)
- Coherence (context loss, reasoning-action mismatch)
- Verification (premature termination, insufficient verification)
Most failures in closed-source models (∼50%) are execution errors, while open-weight models show a more uniform distribution. At the command level, 24.1% of failures are invocation errors, with significant fractions from dependency and filesystem issues.
7. Context, Ecosystem, and Future Directions
TB v2.0 represents a substantial scale-up from the initial TB framework (~20 tasks), adding mathematically intensive, orchestrated multi-tool, and long-horizon assembly tasks. The auditing pipeline—combining CI checks, LLM-backed static analysis, and multi-party manual triage—surpasses prior standards in rigor.
Integration with evaluation infrastructure (Harbor) and adapter support for ingesting 26 external benchmarks substantially expands evaluation diversity and reproducibility (Merrill et al., 17 Jan 2026). Agents remain below 65% resolution on TB v2.0, with persistent gaps in long-horizon reasoning, verification, command reliability, and cross-tool competency.
A plausible implication is that progress on TB v2.0 will require both improved model architectures and enhanced environment introspection and planning modules, as well as more sample-efficient, diverse datasets such as that produced by TerminalTraj.
References
- "Large-Scale Terminal Agentic Trajectory Generation from Dockerized Environments" (Wu et al., 1 Feb 2026)
- "Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces" (Merrill et al., 17 Jan 2026)