TerminalBench 2.0 Benchmark
- TerminalBench 2.0 is a benchmark suite for evaluating AI agents on long-horizon, real command-line tasks, reflecting professional workflows in engineering and science.
- It comprises 89 containerized tasks with human-written canonical solutions and outcome-driven verification to support both quantitative and qualitative error analysis.
- The platform ensures reproducibility and extensibility via fixed Docker environments, rigorous QA measures, and integration with 26 external benchmarks.
TerminalBench 2.0 is a benchmark suite designed for the rigorous evaluation of AI agents on hard, long-horizon tasks in real command-line (CLI) operating environments. It is motivated by the need to measure agent competency on workflows that are representative of professional software engineering, scientific computing, and related technical domains, bridging a gap left by previous benchmarks focused on short-form or synthetic tasks. TerminalBench 2.0 comprises 89 diverse tasks, each encapsulated in a reproducible, containerized environment, paired with a human-written canonical solution and a comprehensive verification suite. The benchmark infrastructure is purpose-built for extensibility, stringent quality assurance, and repeatability, providing a platform for both quantitative and qualitative error analysis of frontier AI models and agents (Merrill et al., 17 Jan 2026).
1. Motivation, Design Principles, and Scope
TerminalBench 2.0 was developed to address limitations in prior benchmarks, which typically fall short in task realism or fail to present sufficient difficulty to challenge state-of-the-art AI models. The core design principles include the curation of tasks based on real expert workflows, outcome-driven assessment standards, and containerized execution for environmental control.
Distinctive aspects of TerminalBench 2.0 include:
- Real-world focus: Tasks are formulated to reflect authentic engineering and scientific workflows, such as compiling complex software stacks, configuring services, reproducing scientific results, and conducting security audits.
- Frontier difficulty: The suite contains tasks that state-of-the-art models—e.g., GPT-5.2, Claude Opus 4.5, Gemini 3 Pro—resolve at rates below 65%, supporting meaningful measurement of agent progress at the research frontier.
- Outcome-driven evaluation: Each task is defined by a terminal state-validation check, not by intermediate output parsing, thereby allowing heterogeneous, creative agent solutions.
- Extensible architecture: Adapter support enables integration of 26 external benchmarks (e.g., SWE-Bench, QuixBugs, AppWorld), broadening task diversity and supporting benchmark evolution.
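The outcome-driven principle above can be sketched as a state-based check. The following is a minimal illustration (file names and expected contents are hypothetical, not taken from the benchmark): the verifier inspects only the final filesystem state, so any agent strategy that produces that state passes.

```python
# Minimal sketch of outcome-driven verification: the check reads the
# final state, never the agent's intermediate commands or output.
import pathlib
import tempfile


def verify_final_state(root: pathlib.Path) -> bool:
    """Pass iff the expected artifact exists with the expected content."""
    artifact = root / "build" / "output.txt"
    return artifact.exists() and artifact.read_text().strip() == "42"


# Demo: any sequence of actions that reaches this state would pass.
with tempfile.TemporaryDirectory() as d:
    root = pathlib.Path(d)
    (root / "build").mkdir()
    (root / "build" / "output.txt").write_text("42\n")
    print(verify_final_state(root))  # True
```

Because the check is agnostic to how the state was reached, heterogeneous and creative solutions score identically to the canonical one.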
2. Task Composition and Categories
The 89 tasks in TerminalBench 2.0 span eleven high-level domains to capture the multifaceted nature of command-line work encountered in professional settings:
| Category | Approx. Number of Tasks |
|---|---|
| Software Engineering | 30 |
| Data Processing / Data Science | 10 |
| Scientific Computing | 8 |
| System Administration | 8 |
| Security / Cybersecurity | 8 |
| Machine Learning & Model Training | 6 |
| Mathematics / Cryptanalysis | 5 |
| Debugging | 3 |
| Games | 2 |
| Personal Assistant | 1 |
| Video Processing | 1 |
Representative workflows include building projects from source (e.g., CompCert, pMARS), configuring networked services (Git over HTTP, SSH-forwarded VMs), reverse-engineering binaries, executing security attacks and defenses (XSS filter bypass, 7z password cracking), ML pipeline engineering, cross-language source-code challenges, and cryptanalysis (e.g., attacks on the FEAL cipher). Each task includes human-estimated completion times for both junior and expert engineers: approximately 48% are estimated to take an expert less than one hour, and a majority are accessible to a junior engineer in under a day.

3. Environment and Oracle Solution Structure
Each task is containerized using Docker, ensuring full specification of operating system version, shell configuration, language runtimes (Python, R, C toolchains, etc.), system dependencies, and initial filesystem state. This minimizes drift and supports experiment reproducibility. When a container is instantiated, the agent sees a filesystem stripped of all test artifacts; all dependencies are version-locked or vendored.
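Version-locking can be enforced with a drift check at container startup. The sketch below is an assumption for illustration (the benchmark's actual pinning mechanism is Docker image layers, not a runtime check): it compares installed package versions against a pins dictionary and reports any mismatch.

```python
# Hypothetical drift check: verify installed Python packages match a set
# of pinned versions; an empty result means the environment is intact.
import importlib.metadata


def check_pins(pins: dict[str, str]) -> list[str]:
    """Return a list of drift messages; empty means the env matches."""
    drift = []
    for name, wanted in pins.items():
        try:
            have = importlib.metadata.version(name)
        except importlib.metadata.PackageNotFoundError:
            drift.append(f"{name}: missing")
            continue
        if have != wanted:
            drift.append(f"{name}: pinned {wanted}, installed {have}")
    return drift


# A package that is certainly absent is reported as missing.
print(check_pins({"tb2-nonexistent-package": "1.0"}))
```

A check like this can run in CI against freshly built images to catch upstream dependency changes before they affect experiments.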
Internet access is permitted in a controlled fashion to support package download and external lookup, with canary strings systematically inserted throughout task files to detect information leakage or trivial completion attempts. The benchmark prohibits trivial solution paths by cleansing shell and version control histories.
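The canary mechanism can be sketched as follows; the marker format here is an assumption, not the benchmark's actual string. A unique token is embedded in task files, and its appearance where it should not be (e.g., in a model's output before the file was read) signals contamination or leakage.

```python
# Sketch of canary-based leakage detection with an illustrative marker
# format; the real benchmark's canary strings are not reproduced here.
import secrets

CANARY = f"TB2-CANARY-{secrets.token_hex(8)}"


def embed_canary(text: str) -> str:
    # Hide the marker in a comment so it does not affect task semantics.
    return text + f"\n# {CANARY}\n"


def leaked(agent_output: str) -> bool:
    return CANARY in agent_output


doc = embed_canary("def solve():\n    return 1")
print(leaked("unrelated transcript"))  # False
print(leaked(doc))                     # True
```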
A human-written oracle solution script is provided for each task. The script implements the canonical workflow for the desired outcome, is verified to pass the task's pytest-based test suite, and serves both as proof of solvability and as a reference trajectory. Rigorous QA practices, including CI validation, adversarial checks, and LLM-based specification reviews, are employed to verify correctness and enforce robust task specifications.
4. Benchmarking, Evaluation, and Error Analysis
Benchmark evaluation is conducted using the open-source Harbor harness, providing infrastructure for large-scale, containerized, parallel experimentation via Daytona-based sandboxing.
- Each (model, agent) pair is run on a minimum of five independent trials per task, yielding over 32,000 total trials across 16 models and 6 agents.
- The primary metric is the per-task resolution rate: the fraction of independent trials in which the agent's final environment state passes the task's verification suite. Performance is reported with 95% confidence intervals computed over repeated runs.
- Success is determined exclusively by passing the full pytest validation suite for the post-execution container state.
Aggregate performance highlights are as follows:
| Model + Agent | Resolution Rate (mean ± 95% CI) |
|---|---|
| GPT-5.2 + Codex CLI | 62.9% ± 3.0% |
| Claude Opus 4.5 + Terminus 2 | 57.8% ± 2.5% |
| Gemini 3 Pro + Terminus 2 | 56.9% ± 2.5% |
| GPT-5.2 + Terminus 2 | 54.0% ± 2.9% |
| Claude Opus 4.5 + Claude Code | 52.1% ± 2.5% |
Closed-source models dominate the top ranks; the leading open-weight systems achieve substantially lower resolution rates, and small-scale models resolve fewer than 15% of tasks.
Error analysis employs a two-level approach:
- Trajectory-level: Failures are categorized (taxonomy derived from MAST) as execution errors (~55%), coherence errors (~25%; e.g., context loss), or verification errors (~20%). Closed-source models display a predominance of execution-level failures; open-weight models exhibit a more balanced error profile.
- Command-level: A scoring mechanism using LLM judgment attributes observed failures to: "command not found" (24.1%), execution failures in binaries (9.6%), filesystem/path errors, package/version mismatches, and network issues. Overall CLI command error rates range from 9.2% (Grok 4) to 26.7% (GPT-OSS-120B). This suggests improvements in environment introspection and error recovery may yield performance gains.
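The command-level categories above map naturally onto POSIX shell exit-code conventions, which a rule-based classifier can exploit. The paper uses LLM judgment; the sketch below is an assumed rule-based fallback for illustration, where 127 means the command was not found and 126 means it was found but not executable.

```python
# Hypothetical rule-based attribution of CLI failures from exit codes,
# following POSIX shell conventions (not the paper's LLM-judge method).
def classify_exit(code: int) -> str:
    if code == 0:
        return "ok"
    if code == 127:
        return "command not found"
    if code == 126:
        return "not executable"
    if code > 128:
        # Shells report fatal signals as 128 + signal number.
        return f"terminated by signal {code - 128}"
    return "execution failure"


print(classify_exit(127))  # command not found
print(classify_exit(1))    # execution failure
```

Such a classifier cannot distinguish path errors from version mismatches (both often surface as generic nonzero exits), which is one reason an LLM judge over full transcripts is used instead.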
5. Infrastructure and Quality-Control Mechanisms
TerminalBench 2.0 is underpinned by quality control processes and experiment management strategies intended to ensure reliability, repeatability, and diagnostic clarity:
- Reproducibility: Pinned Docker environments and fixed dependencies minimize external drift and support repeated experimental runs.
- Randomness estimation: Multiple trials per task/model/agent facilitate variance estimation, with error bars reflecting 95% confidence intervals.
- Data leakage prevention: Canary strings serve as triggers for data contamination checks, verifying that models or agents do not succeed via prior exposure.
- Automated and manual QA: Multi-stage checks—including GitHub Actions, specification validation by LLMs, and adversarial agents—ensure each task remains well-posed and robust to exploitation.
6. Limitations and Future Development Trajectories
Notable limitations and directions for further work include:
- Benchmark saturation: Performance on the most difficult tasks is improving rapidly, mandating continual addition of harder tasks.
- Data contamination: Public distribution of tasks introduces leakage risk into pretraining sets; the adoption of private or held-out splits is under consideration.
- Nondeterminism: Select tasks rely on external APIs or involve long-running builds, reducing experimental determinism. Moves toward fully offline, hermetic environments are planned.
- Test suite coverage: Current test validity relies partly on manual inspection; incorporation of code-coverage analysis, mutation testing, or similar white-box criteria is envisioned.
- Modality and interaction: The team is exploring expansion beyond CLI to GUI automation, integration with IDEs, and support for multimodal tool use.
- Human–AI collaboration: Future variants may benchmark joint human–agent performance, extending from autonomous benchmarks to assistive AI scenarios.
7. Public Availability and Impact
TerminalBench 2.0 is publicly accessible (https://www.tbench.ai/) alongside all Docker images, task definitions, adapters, and experimental configurations. This comprehensive release enables full reproducibility, extension, and benchmarking, supporting ongoing research on robust, real-world agent design and the systematic analysis of agent capabilities (Merrill et al., 17 Jan 2026).