General AgentBench: Multi-Domain LLM Evaluation

Updated 4 July 2026
  • General AgentBench is a benchmark that evaluates general-purpose LLM agents by unifying diverse domains like Search, Coding, Tool-use, and Reason into one persistent interface.
  • It employs the Model Context Protocol to integrate heterogeneous tools and simulate realistic deployment scenarios requiring dynamic task inference and precise tool selection.
  • Empirical results reveal significant performance drops from specialized to general settings, highlighting challenges in long-context management and reliable solution verification.

General AgentBench is a benchmark for evaluating general-purpose LLM agents in a unified, tool-rich environment that is intended to be closer to real deployment than earlier domain-specific agent benchmarks. Rather than testing an agent inside a pre-labeled sandbox such as coding-only, browsing-only, or tool-use-only settings, it presents a single persistent interface over heterogeneous tools and requires the agent to infer task type, select tools, and solve the task end to end. In that sense, it extends the broader AgentBench agenda of evaluating LLMs as interactive agents rather than static predictors, while shifting the emphasis from multi-environment breadth to unified multi-domain realism (Li et al., 22 Feb 2026, Liu et al., 2023).

1. Historical position and motivating problem

The original AgentBench framed LLM evaluation as agent evaluation, formalizing interactive tasks as multi-turn environments and assembling eight distinct environments across code-grounded, game-grounded, and web-grounded settings (Liu et al., 2023). A later review characterized AgentBench as an “evolving, multi-dimensional benchmark” designed to test reasoning and decision-making in a multi-turn, open-ended generation setting, and placed it alongside WebArena and ToolLLM as part of the shift away from static NLP-style evaluation (Barua, 2024).

General AgentBench targets a different remaining gap. Its central claim is that most existing agent benchmarks are still domain-specific: software engineering agents are evaluated inside coding-only environments with coding-only tools; web agents browse websites with browsing-specific interfaces; tool-use agents call a curated API set for that task alone. In deployment, however, a request does not arrive with a domain label. The agent must infer intent, choose from a broad tool pool, and operate under one persistent interface. The benchmark is therefore designed to measure the capability gap between specialized evaluation and this more general setting, and argues that domain-specific evaluation can substantially overestimate robustness because the benchmark itself gives away task type and narrows tool choice (Li et al., 22 Feb 2026).

This motivation also reframes what “general” means in agent evaluation. In General AgentBench, generality is not only breadth across task families; it is the requirement that the same agent operate in a single multi-domain environment where irrelevant but executable tools remain available. This suggests that agent competence depends not only on solving a task once the domain is known, but also on domain inference and tool selection under ambiguity (Li et al., 22 Feb 2026).

2. Benchmark construction and unified environment

General AgentBench unifies four domains—Search, Coding, Tool-use, and Reason—inside one interaction framework. It samples from seven existing datasets, yielding 496 sampled tasks total (Li et al., 22 Feb 2026).

| Domain | Source datasets | Sampled tasks |
|---|---|---|
| Search | BrowseComp, WebVoyager | 189 |
| Coding | SWE-Bench Verified, Terminal-Bench | 130 |
| Reason | MathHay | 75 |
| Tool-use | Tau2-Bench, MCP-Bench | 102 |

The dataset-level breakdown is explicit: BrowseComp contributes 124 sampled tasks and WebVoyager 65 for search; SWE-Bench Verified contributes 50 and Terminal-Bench 80 for coding; MathHay contributes 75 for long-context reasoning; and Tau2-Bench plus MCP-Bench contribute 50 and 52 respectively for tool use (Li et al., 22 Feb 2026).
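As a quick consistency check, the per-dataset counts quoted above can be summed to confirm the per-domain figures and the 496-task total. The snippet is only an arithmetic sketch; the dictionary layout and the function name `domain_totals` are illustrative choices, not part of the benchmark's code.

```python
# Sampled-task counts as quoted in the dataset-level breakdown.
SAMPLED_TASKS = {
    "Search": {"BrowseComp": 124, "WebVoyager": 65},
    "Coding": {"SWE-Bench Verified": 50, "Terminal-Bench": 80},
    "Reason": {"MathHay": 75},
    "Tool-use": {"Tau2-Bench": 50, "MCP-Bench": 52},
}


def domain_totals(counts):
    """Sum per-dataset counts into per-domain totals."""
    return {domain: sum(datasets.values()) for domain, datasets in counts.items()}


totals = domain_totals(SAMPLED_TASKS)
grand_total = sum(totals.values())
print(totals)       # per-domain totals
print(grand_total)  # 496
```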

The benchmark uses the Model Context Protocol (MCP) as its unifying substrate. Each original benchmark environment is wrapped as an MCP server, and a central Host exposes all tools through a global registry. The agent therefore sees a single tool space and a single interface, while task-specific environment details remain hidden behind the Host. Importantly, all servers are live for every task, even when irrelevant, so a search task is presented alongside coding tools, service APIs, scientific calculators, and Hugging Face tools. This makes wrong but executable tool calls possible by design (Li et al., 22 Feb 2026).
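The idea of a Host that flattens many servers into one global tool space can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the benchmark's implementation and not the MCP SDK: the classes `ToolServer` and `Host`, the two example tools, and the `server__tool` naming scheme (chosen to resemble tool names such as `Hugging_Face__search-models` that appear in the paper's case study) are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ToolServer:
    """Stand-in for one wrapped benchmark environment (hypothetical)."""
    name: str
    tools: Dict[str, Callable[..., str]]


@dataclass
class Host:
    """Exposes every server's tools through one flat, global registry."""
    registry: Dict[str, Callable[..., str]] = field(default_factory=dict)

    def register(self, server: ToolServer) -> None:
        for tool_name, fn in server.tools.items():
            qualified = f"{server.name}__{tool_name}"
            if qualified in self.registry:
                raise ValueError(f"duplicate tool: {qualified}")
            self.registry[qualified] = fn

    def list_tools(self) -> List[str]:
        return sorted(self.registry)

    def call(self, qualified_name: str, **kwargs) -> str:
        return self.registry[qualified_name](**kwargs)


# Two made-up servers with one made-up tool each.
search = ToolServer("Search", {"web_search": lambda query: f"results for {query!r}"})
coding = ToolServer("Coding", {"run_shell": lambda command: f"ran {command!r}"})

host = Host()
for server in (search, coding):
    host.register(server)

# Every tool stays visible and callable for every task, so a coding tool
# can be invoked during a search task: wrong, but executable.
print(host.list_tools())
print(host.call("Coding__run_shell", command="ls"))
```

Because the registry is never filtered per task, nothing in the interface stops an agent from picking an irrelevant tool, which is the property the unified setting is meant to expose.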

The context regime is correspondingly large and heterogeneous. Tool descriptions alone can occupy tens of thousands of tokens; the appendix reports a full Host toolset of 301 tools across 35 servers, with the unified toolset alone approaching 64K tokens, while interaction history plus user query often pushes contexts toward 128K tokens. The benchmark is therefore also a long-context, multi-turn agent benchmark, but not in the static sense of conventional long-context QA: the context is an evolving mixture of tool documentation, user request, prior reasoning, prior actions, and environment feedback (Li et al., 22 Feb 2026).

Several engineering details are relevant to that design. The Host follows an MCP Host-Client-Server architecture with tool schemas in OpenAI function-calling format. To control context size, the implementation includes tool-description compression: --compress-tools yields an 18.6% token reduction, and the --minimal-tools textual format used for self-choice yields 90.1% token reduction, saving about 70K tokens. For coding tasks, Docker-backed environments run in bridge mode with persistent host MCP servers and isolated task containers; Tau2-Bench is adapted through a simulated-user interface that preserves multi-turn conversational state (Li et al., 22 Feb 2026).
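To make the compression idea concrete, the sketch below shows one tool in OpenAI function-calling format and a much shorter one-line rendering of it. The tool definition, the `minimal_line` format, and the character-count proxy for tokens are all assumptions made for illustration; the actual output of `--compress-tools` and `--minimal-tools` is not described here, and the percentages quoted above come from the paper, not from this code.

```python
import json

# One tool in OpenAI function-calling format; the tool itself is made up.
tool = {
    "type": "function",
    "function": {
        "name": "Search__web_search",
        "description": "Search the web and return the top results for a query.",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Search query text."},
                "top_k": {"type": "integer", "description": "Number of results."},
            },
            "required": ["query"],
        },
    },
}


def minimal_line(schema):
    """Render a schema as 'name(arg, ...): first sentence' (hypothetical format)."""
    fn = schema["function"]
    args = ", ".join(fn["parameters"]["properties"])
    summary = fn["description"].split(".")[0]
    return f"{fn['name']}({args}): {summary}"


def reduction(full, compact):
    """Fractional size reduction, using character count as a crude token proxy."""
    return 1 - len(compact) / len(full)


full_text = json.dumps(tool)
compact_text = minimal_line(tool)
print(compact_text)
print(f"{reduction(full_text, compact_text):.1%}")
```

A real measurement would use the target model's tokenizer rather than character counts, and a textual format like this trades away argument types and descriptions that the agent may need when it actually issues a call.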

3. Evaluation protocol and agent setting

General AgentBench delegates scoring to the native scoring rules of the original benchmarks rather than reimplementing them. The agent interacts with the environment over multiple turns until it produces a final answer, which is then sent to the original evaluator. The appendix states that “All evaluators produce binary rewards (0/1) except MCPBench (continuous 0-1).” Concretely, SWE-Bench Verified uses automated test passes, Terminal-Bench checks whether the expected final terminal state is reached, BrowseComp compares the final answer against expert references, WebVoyager uses its original evaluation scripts, Tau2-Bench computes reward from environment state matching, action-sequence validation, and communication checks, and MCP-Bench yields a continuous score in [0,1] (Li et al., 22 Feb 2026).

The benchmark evaluates agents in two configurations. The default is the general-agent setting, in which each model acts as a single general agent over the unified interface. The comparison condition is the Baseline (B) specialized-agent setting, which corresponds to the original domain-specific configuration and is contrasted with the General (G) setting. Inference uses temperature 0.7, and the authors ensure that each model’s native context length exceeds the benchmark’s required maximum context length. The explicit test-time scaling budgets are also part of the protocol: for parallel scaling, each query is sampled at most 4 times; for sequential scaling, context is scaled up to 196K tokens (Li et al., 22 Feb 2026).

The “general LLM agents” in the benchmark are not bespoke agent frameworks. They are frontier LLMs placed under a common universal agent policy and shared tool interface, then evaluated end to end. The ten models are GPT-OSS-120B, Qwen3-235B-A22B, Qwen3-Next, DeepSeek-V3.2, DeepSeek-R1, Gemini 2.5-Flash, Gemini 2.5-Pro, Claude Haiku 4.5, Claude Sonnet 4.5, and GPT-5. The system prompt is a universal tool-using agent prompt that instructs careful tool selection, avoidance of redundant calls, and combined tool use plus reasoning; Tau2 tasks additionally append benchmark-specific policy documents (Li et al., 22 Feb 2026).

The release places strong emphasis on reproducibility. Code is public, and the appendix provides implementation details for tool wrapping and evaluation, API pricing tables, and estimated evaluation costs. The reported totals are around \$7,768 for the general-setting table, \$24,392 for the sequential-scaling experiments, and \$29,576 for the parallel-scaling experiments summarized in the paper, which partly explains the cap of four parallel samples (Li et al., 22 Feb 2026).

4. Empirical performance and the specialized-to-general gap

The central empirical result is that performance drops materially when models move from specialized evaluation to the unified general-agent setting. In the general setting alone, the best average score is Claude Sonnet 4.5: 48.0, followed by GPT-5: 45.9 and Claude Haiku 4.5: 42.0. Among open models, DeepSeek-V3.2 is best at 39.0; the weakest overall model in the table is GPT-OSS-120B at 25.4 (Li et al., 22 Feb 2026).

Per-domain leaders differ. GPT-5 is best on BrowseComp (27.4), WebVoyager (61.5), and MathHay (64.0); Claude Sonnet 4.5 is best on Terminal-Bench (45.0) and MCP-Bench (72.9); Claude Haiku 4.5 is best on SWE-Bench (56.0); and DeepSeek-V3.2 is best on Tau2-Bench (54.0) (Li et al., 22 Feb 2026).

The more consequential comparison is B versus G. Most models degrade by roughly 10%–30% on average. Reported examples include GPT-OSS-120B -28.7%, Qwen3-235B -25.5%, Gemini 2.5-Flash -31.2%, Gemini 2.5-Pro -27.2%, and GPT-5 -22.7%. The starkest single-domain collapse is Gemini 2.5-Pro in Reason, dropping 60.8% from 61.3 to 24.0. By contrast, Claude Sonnet 4.5 is unusually robust, with just -0.2% average degradation overall (Li et al., 22 Feb 2026).
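The degradation figures read as relative changes from the specialized score to the general score, which the Gemini 2.5-Pro Reason example makes checkable. The helper below is a small worked calculation; the function name `relative_change` is illustrative, and whether the per-model averages are computed in exactly the same way is an assumption.

```python
def relative_change(baseline: float, general: float) -> float:
    """Percent change from the specialized score to the general score."""
    if baseline == 0:
        raise ValueError("baseline score must be non-zero")
    return 100.0 * (general - baseline) / baseline


# Gemini 2.5-Pro on Reason, using the two scores quoted above.
drop = relative_change(61.3, 24.0)
print(f"{drop:.1f}%")  # -60.8%
```

Note that this is a relative drop: the absolute difference is 37.3 points, which is 60.8% of the specialized score of 61.3.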

A common misconception is that the unified benchmark only makes tasks harder by adding irrelevant tools. The paper reports a more specific effect. Some models improve in Search under the general setting, apparently because broader tool availability permits effective cross-domain API use. In a trace analysis of 189 Claude Sonnet 4.5 search tasks, 26% (50/189) used specialized non-search tools. The most frequent were Google Maps APIs (78 calls), paper search APIs across arXiv/PubMed/Google Scholar (60), and Hugging Face model APIs (36). In a case study on identifying the most recent Hugging Face model suitable for text classification, the plain search baseline took 6 turns and produced a shallow answer, whereas the general agent used Hugging_Face__search-models and Hugging_Face__get-model-info, solved the task in 3 turns, and produced richer evidence (Li et al., 22 Feb 2026).

This supports a more precise interpretation: General AgentBench is not only a harder benchmark. It is also a tool-rich setting in which stronger agents can sometimes exploit cross-domain resources that domain-specific benchmarks would hide (Li et al., 22 Feb 2026).

5. Test-time scaling: sequential scaling, parallel scaling, and their limits

A major contribution of General AgentBench is its use as a testbed for test-time scaling in agents. The paper formalizes two forms. Parallel scaling independently samples K trajectories per query. Sequential scaling extends the interaction horizon by injecting another round of feedback when the agent attempts to terminate, encouraging further reflection and increasing total context length (Li et al., 22 Feb 2026).
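Sequential scaling as described above can be sketched as a loop that refuses the agent's first few termination attempts and appends a feedback message instead. This is a minimal illustration under assumptions: the `agent_step` callback signature, the feedback wording, and the `toy_agent` are hypothetical, and a real loop would also append tool results to the history between steps.

```python
from typing import Callable, List, Tuple

Message = Tuple[str, str]  # (role, text)


def sequential_scaling(
    agent_step: Callable[[List[Message]], Tuple[str, bool]],
    query: str,
    extra_rounds: int,
    feedback: str = "Please re-check your work before giving a final answer.",
    max_steps: int = 50,
) -> Tuple[str, List[Message]]:
    """Run an agent, pushing back on its first `extra_rounds` termination attempts."""
    history: List[Message] = [("user", query)]
    rounds_left = extra_rounds
    answer = ""
    for _ in range(max_steps):
        text, wants_to_stop = agent_step(history)
        history.append(("assistant", text))
        if not wants_to_stop:
            continue  # a real loop would append tool output here
        answer = text
        if rounds_left == 0:
            break
        # Inject another feedback round instead of accepting termination.
        history.append(("user", feedback))
        rounds_left -= 1
    return answer, history


def toy_agent(history: List[Message]) -> Tuple[str, bool]:
    """Hypothetical agent that always tries to stop, numbering its attempts."""
    attempts = sum(1 for role, _ in history if role == "assistant")
    return f"answer v{attempts + 1}", True


answer, history = sequential_scaling(toy_agent, "What is 2 + 2?", extra_rounds=2)
print(answer, len(history))  # answer v3 6
```

Each injected round lengthens the history, which is why this form of scaling is tied directly to total context length.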

For parallel scaling, the paper distinguishes the oracle upper bound pass@K from practical selection quality. It defines two self-choice strategies: point-wise choice, which evaluates each trajectory independently and emits a binary judgment, and pair-wise choice, which compares trajectories two at a time in a bubble-sort-style tournament until one final trajectory remains. The key question is not only whether extra samples contain a correct solution, but whether the model can identify that solution among its own candidates (Li et al., 22 Feb 2026).
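The oracle bound and the two self-choice strategies can be sketched as three small functions. The judges are passed in as callbacks and are stand-ins for model calls; returning the first accepted trajectory in the point-wise case and running a single bubble pass in the pair-wise case are assumptions about details the description leaves open, and the toy trajectories are made up.

```python
from typing import Callable, Optional, Sequence


def pass_at_k(correct: Sequence[bool]) -> bool:
    """Oracle upper bound: solved if any of the K trajectories is correct."""
    return any(correct)


def pointwise_choice(
    trajs: Sequence[str], accept: Callable[[str], bool]
) -> Optional[int]:
    """Judge each trajectory on its own; return the first accepted index."""
    for i, traj in enumerate(trajs):
        if accept(traj):
            return i
    return None


def pairwise_choice(
    trajs: Sequence[str], prefer_second: Callable[[str, str], bool]
) -> int:
    """Single bubble pass: the current winner meets each remaining candidate."""
    winner = 0
    for i in range(1, len(trajs)):
        if prefer_second(trajs[winner], trajs[i]):
            winner = i
    return winner


# Four made-up trajectories for one query (K = 4).
trajs = ["guess: 5", "guess: 4", "guess: 22", "guess: 4 (checked)"]
correct = [t.startswith("guess: 4") for t in trajs]


# Hypothetical judges; in practice these would be model calls.
def accept(traj: str) -> bool:
    return "checked" in traj


def prefer_second(first: str, second: str) -> bool:
    return len(second) > len(first)


print(pass_at_k(correct))                     # True
print(pointwise_choice(trajs, accept))        # 3
print(pairwise_choice(trajs, prefer_second))  # 3
```

The pair-wise pass needs K - 1 comparisons, while the point-wise strategy needs at most K independent judgments; in both cases the quality of the result depends entirely on the judge, not on how many correct trajectories were sampled.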

The findings on sequential scaling are largely negative. The paper identifies two regimes. In stagnant fluctuation, performance oscillates in a narrow range as context grows, especially in reasoning tasks. In saturation and degradation, common in coding, extra turns help at first but then performance falls and does not recover. Instance-level analysis shows that many tasks remain unsolved despite more turns, or flip between correct and incorrect states across steps. This motivates the first named limitation: the context ceiling, defined as the maximum effective context length under sequential scaling beyond which additional interaction history yields diminishing, zero, or negative returns. Concrete turning points are reported at around 112K tokens for Qwen3-235B and 96K tokens for Gemini 2.5-Flash on search tasks (Li et al., 22 Feb 2026).
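One simple way to operationalize a context ceiling from a score-versus-context curve is to take the smallest context budget whose score is within a tolerance of the best observed score. This is only one possible reading of the definition above, not the paper's procedure; the function name, the tolerance parameter, and the toy curve are all assumptions.

```python
from typing import List, Tuple


def context_ceiling(curve: List[Tuple[int, float]], tolerance: float = 0.0) -> int:
    """Smallest context budget whose score is within `tolerance` of the best score."""
    if not curve:
        raise ValueError("curve must contain at least one point")
    ordered = sorted(curve)
    best = max(score for _, score in ordered)
    for tokens, score in ordered:
        if score >= best - tolerance:
            return tokens
    raise AssertionError("unreachable: the best point always qualifies")


# Made-up scores at increasing context budgets (tokens, accuracy in percent).
toy_curve = [(16_000, 30.0), (32_000, 36.0), (48_000, 41.0), (64_000, 39.5)]
print(context_ceiling(toy_curve))  # 48000
```

With a non-zero tolerance the same function also captures the stagnant-fluctuation regime, where later points oscillate near the best score without meaningfully exceeding an earlier one.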

The findings on parallel scaling are more nuanced. Oracle pass@K rises monotonically with K, yielding roughly 50% average improvement, and DeepSeek-V3.2 shows the largest gains, notably in coding and reasoning. Yet practical gains are much smaller, because selection remains difficult. This leads to the second named limitation: the verification gap, defined as the gap between a model’s ability to generate a correct trajectory and its ability to verify/select that trajectory from among candidates. Across all four domains, self-choice accuracy trails the pass@K upper bound for both point-wise and pair-wise selection, and in some cases worsens as K increases (Li et al., 22 Feb 2026).
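The verification gap can be illustrated as the difference between the oracle solve rate and the accuracy of the trajectories a selector actually picks, measured over the same queries. The function below is a sketch under that reading; its name, its input layout, and the toy data are assumptions rather than the paper's evaluation code.

```python
from typing import Optional, Sequence


def verification_gap(
    correctness: Sequence[Sequence[bool]],
    chosen: Sequence[Optional[int]],
) -> float:
    """Oracle solve rate minus self-choice accuracy over the same queries.

    correctness[q][i] says whether trajectory i for query q is correct;
    chosen[q] is the selected trajectory index, or None if nothing was picked.
    """
    if not correctness or len(correctness) != len(chosen):
        raise ValueError("need one selection per query and at least one query")
    n = len(correctness)
    oracle = sum(any(row) for row in correctness) / n
    selected = sum(
        idx is not None and row[idx] for row, idx in zip(correctness, chosen)
    ) / n
    return oracle - selected


# Made-up outcomes for four queries with four sampled trajectories each.
toy_correctness = [
    [False, True, False, False],
    [False, False, False, False],
    [True, False, True, False],
    [False, False, False, True],
]
toy_chosen = [1, 0, 1, None]

gap = verification_gap(toy_correctness, toy_chosen)
print(gap)  # 0.5
```

In the toy data three of four queries contain a correct trajectory, but the selector lands on a correct one only once, so the gap is 0.75 - 0.25 = 0.5.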

The paper also tests whether an external stronger evaluator can close this gap by replacing self-judgment with GPT-5 as verifier. Reportedly, GPT-5 generally underperforms the model’s own self-judgment and can mislabel correct trajectories even at BB4. The proposed explanation is solution familiarity: a model may interpret its own generations better than an external verifier can interpret unfamiliar traces. This suggests that verification is not merely a weak-model problem, but a separate bottleneck in agent scaling (Li et al., 22 Feb 2026).

6. Interpretation, limitations, and relation to adjacent benchmark research

General AgentBench’s broader significance lies in what it implies about the state of “general-purpose” LLM agents. The benchmark suggests that current agents remain far from robust generalists once domain priors are removed and the tool space is widened. The paper identifies three especially underdeveloped capabilities: domain inference and tool selection under ambiguity, context management over long interactive traces, and reliable verification/ranking of sampled solutions. It further reports weak transfer from static long-context benchmarks: LongBench, HELMET, and MRCR correlate poorly with General AgentBench overall, with only MRCR showing moderate correlation with the reasoning domain. This suggests that agentic competence depends on more than long-context recall, requiring dynamic planning, tool selection, state tracking, and stable self-conditioning across turns (Li et al., 22 Feb 2026).

The paper also includes a suggestive architectural analysis. Comparing Qwen3-235B with Qwen3-Next, it attributes weaker sequential scaling in Qwen3-Next partly to its attention mechanism. The reported analysis suggests that full-attention models maintain larger mean attention distance and stronger head/layer specialization, whereas linear or hybrid attention exhibits weaker functional differentiation and reduced long-range context use. The authors present this as suggestive rather than fully causal, but it reinforces the broader theme that nominal long-context support does not guarantee robust long-horizon agent behavior (Li et al., 22 Feb 2026).

Subsequent work places these findings in a wider systems context. Bayesian-Agent argues that performance in AgentBench-like settings is determined not only by the frozen base model but by the surrounding agent harness—prompt, retrieved context, tools, memory, SOPs, runtime constraints, and verifier feedback—formalized as BB5. On Lifelong AgentBench, it reports that posterior-guided incremental repair can move GenericAgent + deepseek-v4-flash from 90% to 100% accuracy, while full online adaptation can regress to 85%, underscoring that harness optimization and uncertainty management can matter as much as model capability (Wu et al., 6 Jun 2026). Complementarily, AgentProcessBench argues that outcome-only agent benchmarks miss step-level process quality, and introduces a ternary human-labeled process benchmark for tool-using agents with 1,000 trajectories and 8,509 annotated steps, emphasizing that long-horizon reliability also depends on whether intermediate actions are effective, neutral, or harmful (Fan et al., 15 Mar 2026).

Taken together, these developments suggest that General AgentBench marks a transition in agent evaluation. It does not simply ask whether a model can solve a benchmark inside a pre-labeled environment; it asks whether a single agent can behave coherently in a unified, noisy, multi-domain tool ecosystem. The answer, as measured in the benchmark, is only partial. The benchmark therefore functions both as an evaluation suite and as a diagnosis of current bottlenecks in general-agent behavior (Li et al., 22 Feb 2026).
