
TerminalBench-2: Meta-Harness Evaluation

Updated 31 March 2026
  • TerminalBench-2 is a benchmark suite that evaluates LLM agentic coding and tool-chaining across 89 dependency-heavy command-line tasks.
  • It requires models to handle multi-step processes with robust error-handling, persistent memory, and precise orchestration of CLI tools.
  • Empirical results demonstrate that Meta-Harness-optimized systems consistently outperform manual harnesses in task pass rates and generalization.

TerminalBench-2 is a public benchmark suite designed to evaluate the agentic coding and tool-chaining capabilities of LLM-based systems in the context of extended, dependency-rich terminal tasks. It has emerged as a critical substrate for the empirical assessment of “meta-harness” frameworks, which aim to automate or externalize the harness logic surrounding LLMs. The benchmark is referenced as the primary agentic coding testbed in the evaluation of cutting-edge harness optimization systems, notably Meta-Harness, where it serves both as a performance leaderboard and as a test case for generalization, search-based harness improvement, and ablation studies (Lee et al., 30 Mar 2026).

1. Benchmark Scope and Structure

TerminalBench-2 consists of 89 long-horizon, dependency-heavy command-line tasks that require agents to autonomously interact with and manipulate Unix-like environments. Tasks are characterized by the necessity to chain multiple CLI tools, manage temporally-extended state, and resolve intermediate sub-dependencies under noisy or dynamic system configurations. Typical instances may include multi-stage build pipelines, file system manipulations, software installation or debugging, and invocation of heterogeneous shell utilities in concert.

The design places particular emphasis on cross-step dependencies, with tasks crafted such that partial or naive completions yield low utility, forcing the agent harness to develop persistent memory, robust error-handling logic, and tool utilization skills.

2. Harness and Agent Interface Model

Evaluation on TerminalBench-2 mandates that each agent be controlled by a “harness”: the code and surrounding logic that determines the policy for information storage, observation, context retrieval, input assembly, and action selection. The harness exposes a programmatic interface to the LLM, orchestrates the underlying shell or virtual environment, and records full trajectory traces of prompts, tool calls, outputs, and internal state transitions.
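The harness responsibilities listed above can be sketched as a small Python skeleton. This is an illustrative sketch only, not the published Terminus or Meta-Harness code; the class and method names (`Harness`, `TraceStep`, `assemble_input`, `step`) are hypothetical:

```python
from dataclasses import dataclass, field

@dataclass
class TraceStep:
    """One recorded trajectory step: prompt sent, action chosen, observation seen."""
    prompt: str
    action: str
    observation: str

@dataclass
class Harness:
    """Minimal harness skeleton: stores state, retrieves context, assembles
    the LLM input, selects actions, and records the full trajectory trace."""
    memory: list = field(default_factory=list)   # persistent cross-step state
    trace: list = field(default_factory=list)    # full trajectory record

    def assemble_input(self, observation: str) -> str:
        # Context retrieval + input assembly: recent memory plus new observation.
        context = "\n".join(self.memory[-5:])
        return f"{context}\n{observation}" if context else observation

    def step(self, observation: str, llm) -> str:
        prompt = self.assemble_input(observation)
        action = llm(prompt)                     # action selection by the fixed LLM core
        self.memory.append(f"{observation} -> {action}")
        self.trace.append(TraceStep(prompt, action, observation))
        return action
```

In this framing, only the code above varies between competing harnesses, while `llm` (the model core) is held fixed.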

In the benchmarking context, only the harness code is varied (with the LLM “core” held fixed), isolating the contribution of higher-level control strategies. Leading harness baselines include Terminus 2 (the software agent from the original authors) and Terminus-KIRA, both providing strong hand-engineered orchestration paradigms (Lee et al., 30 Mar 2026).

3. Evaluation Protocol and Metrics

TerminalBench-2 adopts binary pass/fail evaluation per task, where a “pass” is recorded if the agent's trajectory successfully completes all required subtasks and produces the expected terminal output or system state. For each run, the harnessed LLM is presented with the same 89 tasks used in search and validation phases; overfitting is assessed by manual audit. The aggregate reporting metric is the pass rate, computed as the fraction of tasks solved.

Results are directly comparable with the public leaderboard, which tracks more than 20 strong baselines, and pass rates are reported consistently across major harness studies for reproducibility and transfer validation.

| Model | Harness | Pass Rate (%) | Leaderboard Rank |
|---|---|---|---|
| Claude Opus 4.6 | Discovered (MH) | 76.4 | #2 |
| Claude Opus 4.6 | Terminus-KIRA | 74.7 | #3 |
| Claude Haiku 4.5 | Discovered (MH) | 37.6 | #1 |
| Claude Haiku 4.5 | Terminus-KIRA | 35.5 | #2 |

MH: Meta-Harness optimized harness (Lee et al., 30 Mar 2026)

4. Meta-Harness Optimization on TerminalBench-2

TerminalBench-2 is the domain in which Meta-Harness demonstrated agentic coding improvements over state-of-the-art hand-engineered baselines. The search protocol utilized an outer-loop, agentic code-proposing system (Anthropic's Claude Code running Opus 4.6), which iteratively generated harness variants, validated interface correctness, and evaluated candidates over the full TerminalBench-2 suite.

Key empirical findings include:

  • On Opus 4.6, the discovered harness achieved 76.4% pass rate, exceeding the best baseline (Terminus-KIRA: 74.7%).
  • On Haiku 4.5, the discovered harness achieved 37.6% pass rate, again surpassing baselines (Terminus-KIRA: 35.5%).
  • During search, harness variants were selected from a Pareto frontier based on pass rate and auxiliary costs (such as context token usage).
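The Pareto-frontier selection in the last bullet can be sketched as follows. This is a generic non-dominated-filter sketch over (pass rate, token cost) pairs, not the paper's actual selection code; the candidate tuples are illustrative:

```python
def pareto_frontier(candidates):
    """Keep harness variants not dominated on (pass_rate higher, token_cost lower).

    candidates: list of (name, pass_rate, token_cost) tuples.
    A candidate is dominated if another is at least as good on both axes
    and strictly better on at least one.
    """
    frontier = []
    for name, pr, cost in candidates:
        dominated = any(
            (pr2 >= pr and c2 <= cost) and (pr2 > pr or c2 < cost)
            for _, pr2, c2 in candidates
        )
        if not dominated:
            frontier.append((name, pr, cost))
    return frontier
```

A cheap low-accuracy variant and an expensive high-accuracy variant can both survive on the frontier, leaving the final pick to the auxiliary-cost trade-off.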

Qualitative log analysis indicated that early prompt-template edits regressed task-level reliability; subsequent search iterations shifted toward an "environment bootstrap" approach, in which the harness proactively compiles a snapshot of installed tools and files, eliminating wasted discovery actions in the agent loop.
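The "environment bootstrap" idea can be illustrated with a short sketch: before the agent loop starts, compile a one-shot summary of available tools and files. This is our own minimal rendering of the idea, not the discovered harness; the tool list and JSON layout are assumptions:

```python
import json
import os
import shutil

def bootstrap_snapshot(tools=("git", "make", "python3"), root="."):
    """Proactively compile a snapshot of available tools and files so the
    agent loop does not spend actions on redundant discovery commands
    (e.g. repeated `which`, `ls`). Illustrative only."""
    snapshot = {
        "tools": {t: shutil.which(t) is not None for t in tools},
        "files": sorted(os.listdir(root))[:50],  # cap listing to keep context small
        "cwd": os.getcwd(),
    }
    return json.dumps(snapshot, indent=2)
```

Prepending such a snapshot to the agent's initial context trades a few hundred tokens for fewer wasted tool calls per episode.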

5. Significance for Automated Harness Engineering

TerminalBench-2 functions as a hard testbed for meta-harness research because the tasks’ temporal length, tool interdependencies, and state-rich execution amplify the impact of harness logic on end-to-end agent performance. Improvements on this benchmark provide strong evidence for the benefits of automated, data-driven harness search architectures over manual prompt or template engineering.

Empirical results show that richer access to harness provenance—full execution traces, code histories, and reward logs—enables a coding agent to form causal failure hypotheses, design targeted code modifications, and iteratively compose higher-quality agent policies. TerminalBench-2 is thus pivotal for both benchmarking and driving forward the state-of-the-art in automated harness synthesis (Lee et al., 30 Mar 2026).

6. Role in the Broader Meta-Harness Ecosystem

TerminalBench-2 anchors the evaluation axis for LLM agentic coding and is referenced as the critical source of evidence for the practical viability of meta-harness systems. Its tasks are utilized not only by Meta-Harness, but also as a reference benchmark for controller externalization schemes (e.g., natural-language agent harnesses (Pan et al., 26 Mar 2026)), agentic workflow optimization protocols (Nie et al., 7 Apr 2025), and harness designs leveraging reinforcement or meta-learning pipelines.

Because the challenge comprises real-world terminal tasks, improvements on TerminalBench-2 are not just theoretical: outperforming hand-engineered harnesses on this suite demonstrates meaningful advances in generalized, executable LLM orchestration. This positions TerminalBench-2 as the canonical standard for empirical progress on meta-harness approaches in agentic LLM research.
