AgencyBench: Autonomous Agents Benchmark
- AgencyBench is a comprehensive framework that evaluates autonomous agents by probing multi-dimensional agentic intelligence in production-scale workflows.
- It employs diverse, realistic scenarios—from software engineering to research workflows—using metrics like SuccessRate and Pass@k for rigorous evaluation.
- The benchmark advances agentic intelligence by emphasizing autonomous task decomposition, self-correction, and scaffold synergy to inform future research.
AgencyBench is a comprehensive benchmark framework designed to rigorously evaluate the frontier capabilities of autonomous agents built on large language models (LLMs) in real-world, long-horizon contexts. Unlike prior agentic evaluations that target narrow or single-skill settings, AgencyBench introduces multidimensional metrics—spanning task complexity, tool use, self-correction, efficiency, and scaffold synergy—to probe agentic intelligence over production-scale workflows. "Agency" in this context is conceived as the emergent capacity of an AI system to autonomously discover problems, formulate hypotheses, and execute solutions in diverse, tool-rich environments, integrating autonomous execution, reasoning, orchestration, and collaborative engagement. The benchmark is representative of critical industry tasks, including but not limited to software engineering, research workflows, and complex multi-stage synthesis (Li et al., 16 Jan 2026, Xiao et al., 22 Sep 2025, Liu et al., 2023).
1. Foundations of AgencyBench and Agency Evaluation
The conceptual foundation of AgencyBench is rooted in the operationalization of "agency" for AI systems. Agency is defined as the emergent capacity for autonomous agents to discover, hypothesize, and solve problems through self-directed engagement with environments and tools—marked not just by reasoning or output generation, but by productive, autonomous task execution and system-level orchestration (Xiao et al., 22 Sep 2025). This entails:
- Autonomous task decomposition and multi-step planning.
- Dynamic tool invocation and manipulation (files, APIs, web resources, environment controls).
- Robust state management across extended interaction sequences and modalities.
- Adaptive, multi-turn correction in response to feedback, failure, or system state.
- Collaborative communication, where agent reasoning and interface protocols enable alignment with evolving user or system goals.
These elements distinguish agentic intelligence from static language modeling and simple decision automation.
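As a concrete illustration, the Python sketch below mocks up one such agentic loop under stated assumptions: the `llm.decompose`, `llm.propose_action`, and `llm.critique` interfaces and the `tools` registry are hypothetical stand-ins for illustration, not part of AgencyBench or any specific agent framework.

```python
# Minimal sketch of a single agentic rollout: decompose a task, invoke tools,
# track state, and self-correct on feedback. All interfaces here (llm.*, tools)
# are illustrative assumptions, not the AgencyBench implementation.
from dataclasses import dataclass, field


@dataclass
class AgentState:
    """Running state carried across turns of one rollout."""
    goal: str
    plan: list[str] = field(default_factory=list)      # decomposed sub-tasks
    history: list[dict] = field(default_factory=list)  # reasoning/tool/feedback events


def run_rollout(goal: str, llm, tools: dict, max_rounds: int = 5) -> AgentState:
    state = AgentState(goal=goal)
    state.plan = llm.decompose(goal)                    # autonomous task decomposition
    for step in state.plan:
        for _ in range(max_rounds):                     # multi-turn correction loop
            action = llm.propose_action(step, state.history)
            result = tools[action["tool"]](**action["args"])  # dynamic tool invocation
            state.history.append({"step": step, "action": action, "result": result})
            feedback = llm.critique(step, result)       # self-assessment / system feedback
            if feedback["ok"]:
                break                                   # sub-task satisfied, move on
            state.history.append({"step": step, "feedback": feedback["message"]})
    return state
```

The essential point is the interleaving of planning, tool calls, state accumulation, and feedback-driven retries, rather than a single forward pass over a prompt.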
2. Benchmark Composition and Scenario Design
AgencyBench comprises a diversified set of scenarios and tasks, each designed to stress specific agentic capabilities under realistic constraints. The 2026 version includes 32 scenarios encompassing 138 tasks across six domains:
- Game development: complex multi-stage games requiring stateful interaction, e.g., Gomoku with incremental implementation, debugging, and stress testing.
- Front-end development: comprehensive UI composition, visual layout fidelity, behavioral correctness.
- Back-end development: server logic, API orchestration, database management.
- Code generation: modular algorithm construction, refactoring, and system integration.
- Research: advanced synthesis of web-sourced and structured data with iterative reporting.
- MCP tool use: agents interacting with structured data and external tools exposed via Model Context Protocol (MCP) servers (Li et al., 16 Jan 2026).
Each scenario demands million-token-scale context retention (mean ≈1M tokens), extensive tool use (∼90 invocations per scenario), and execution spanning several hours, mimicking production cognitive load rather than "toy" tasks with shallow horizons.
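A minimal sketch of how such a scenario might be represented is shown below, assuming a simple record layout; the field names and defaults are illustrative assumptions that mirror the query/deliverable/rubric structure described in the next section, not AgencyBench's actual data format.

```python
# Hypothetical schema for an AgencyBench-style scenario record; all field names
# and defaults are assumptions for illustration only.
from dataclasses import dataclass


@dataclass
class RubricItem:
    criterion: str           # e.g., "board renders as a 15x15 grid"
    weight: float            # contribution to the rubric-satisfaction score


@dataclass
class Scenario:
    domain: str              # "game_dev", "frontend", "backend", "codegen", "research", "mcp"
    query: str               # natural-language task statement
    deliverables: list[str]  # expected artifacts: source files, APIs, logs, UI assets
    rubric: list[RubricItem]
    context_budget_tokens: int = 1_000_000  # mean ~1M tokens per scenario
    expected_tool_calls: int = 90           # ~90 tool invocations per scenario
```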
3. Evaluation Protocols and Metrics
AgencyBench employs a robust, automated evaluation suite incorporating:
- Query/Deliverable/Rubric triad: Each task specifies a textual query, explicit deliverables (source code, APIs, UI artifacts, logs), and fine-grained rubrics (pixel tolerances, API contract fidelity, resource efficiency).
- Rollout formalization: Multi-turn trajectories consist of alternating model reasoning, tool actions, and system/user feedback—capturing contextual dependencies over long horizons.
- Success metrics:
- SuccessRate: the fraction of tasks whose rubric satisfaction meets the threshold, $\mathrm{SR} = \frac{1}{N}\sum_{i=1}^{N}\mathbf{1}[s_i \ge \tau]$, with the threshold $\tau$ typically set at 60% rubric satisfaction.
- Pass@k: the proportion of tasks completed within $k$ feedback rounds.
- Efficiency metrics: success per feedback round ($\mathrm{SR}/\bar{R}$) and per token ($\mathrm{SR}/\bar{T}$), quantifying outcome relative to feedback rounds and tokens consumed (a computation sketch follows this list).
- Resource profile: tokens used, tools invoked, wall-clock time per scenario.
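The sketch below computes the metrics just listed from per-task rollout records, under the assumption that each record stores a rubric-satisfaction score per feedback round and a total token count; the record layout is an illustrative assumption, not the benchmark's API.

```python
# Minimal sketch of the AgencyBench-style metric computations described above.
# Each task record is assumed to look like: {"scores": [0.3, 0.7], "tokens": 850_000}.

def success_rate(tasks, threshold=0.6):
    """Fraction of tasks whose final rubric satisfaction meets the threshold (60% by default)."""
    return sum(t["scores"][-1] >= threshold for t in tasks) / len(tasks)

def pass_at_k(tasks, k, threshold=0.6):
    """Proportion of tasks solved within the first k feedback rounds."""
    return sum(any(s >= threshold for s in t["scores"][:k]) for t in tasks) / len(tasks)

def efficiency_per_round(tasks, threshold=0.6):
    """Success rate normalized by the mean number of feedback rounds consumed."""
    mean_rounds = sum(len(t["scores"]) for t in tasks) / len(tasks)
    return success_rate(tasks, threshold) / mean_rounds

def efficiency_per_token(tasks, threshold=0.6):
    """Success rate normalized by the mean number of tokens consumed."""
    mean_tokens = sum(t["tokens"] for t in tasks) / len(tasks)
    return success_rate(tasks, threshold) / mean_tokens
```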
A deterministic user-simulation agent (e.g., Claude-4-Sonnet at zero temperature) replaces human-in-the-loop feedback, delivering actionable failure reports for self-correction over up to $k$ feedback rounds. A Docker-based sandbox executes deliverables for objective verification, while rule-based and LLM-based judges score outputs with high agreement against human annotators (Li et al., 16 Jan 2026).
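A hedged sketch of this evaluation loop follows; the `agent`, `simulated_user`, `sandbox`, and `judge` objects are assumed interfaces standing in for the components described above, not the benchmark's actual harness.

```python
# Sketch of the automated evaluation loop: the agent attempts the task, a sandbox
# executes the deliverables, judges score them against the rubric, and a
# deterministic user simulator turns failures into actionable feedback.
# All interfaces below are illustrative assumptions.

def evaluate_task(task, agent, simulated_user, sandbox, judge, max_rounds=3):
    scores, feedback = [], None
    for _ in range(max_rounds):
        deliverables = agent.attempt(task["query"], feedback=feedback)  # produce or revise artifacts
        run_log = sandbox.execute(deliverables)                         # isolated, objective execution
        score = judge.score(task["rubric"], deliverables, run_log)      # rule-based + LLM-based scoring in [0, 1]
        scores.append(score)
        if score >= 0.6:                                                # success threshold reached
            break
        # Deterministic simulator (e.g., a fixed model at temperature 0) converts the
        # failure into actionable feedback for the next self-correction round.
        feedback = simulated_user.diagnose(task, deliverables, run_log)
    return {"scores": scores, "rounds_used": len(scores)}
```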
4. Core Agentic Capabilities Assessed
AgencyBench's scenarios jointly assess six foundational agentic capabilities:
| Capability Domain | Task Archetype | Example Assessment |
|---|---|---|
| Game Development | Multi-stage, stateful logic | Fault-tolerant Gomoku pipeline |
| Front-End Dev | Visual/UI synthesis | 15×15 grid rendering, layout conformance |
| Back-End Dev | Orchestration/database | API, server, and DB integration |
| Code Generation | Modular synthesis/refactoring | Algorithm, library, and test suite composition |
| Research Workflows | Iterative info synthesis | Web data aggregation, summarization |
| MCP Tool Use | Protocol-driven interaction | Structured ingest/emit tasks |
Each domain incorporates tool-use, iterative reasoning, deliverable validation, resource management, and collaborative response. Long-horizon dependencies and failure recovery are integral, exposing weaknesses in strategic planning, error diagnosis, and self-correction.
5. Quantitative Analysis and Model Comparisons
Experiments demonstrate substantial gaps between closed-source and open-source LLM agents:
- Closed-source agents (e.g., GPT-5.2, Claude-4.5-Opus) achieve higher Success Rates (48.4% closed vs 32.1% open-source).
- GPT-5.2 leads with 56.5%, outperforming state-of-the-art open-source (GLM-4.6, 38.6%) (Li et al., 16 Jan 2026).
- Efficiency varies by agentic style: "executors" (shell-command bias), "navigators" (high inspection), and memory-centric agents (Gemini-3-Pro, leveraging memory bank tools).
- Native scaffold synergy is critical: Claude-4.5-Opus with Claude-Agent-SDK gains +20.5% compared to generic scaffolds, while open-source models display distinct performance peaks contingent on execution frameworks.
- Self-correction metrics (Pass@1 vs Pass@2) reveal dramatic feedback-driven improvement for some models (e.g., Kimi-K2-Thinking +300%, though from a low baseline), while brittle agents show negligible recovery.
Results from LIMI (Xiao et al., 22 Sep 2025) fundamentally challenge the data-scaling paradigm: LIMI achieves 73.5% avg. on AgencyBench with only 78 curated demonstrations, outperforming models fine-tuned on 10k samples by 53.7% relative improvement. This establishes the Agency Efficiency Principle—machine autonomy’s development hinges on strategic, high-quality agentic demonstrations rather than sheer data volume.
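Read as a worked definition, the relative-improvement figures above follow

$$\Delta_{\mathrm{rel}} = \frac{m_{\mathrm{after}} - m_{\mathrm{before}}}{m_{\mathrm{before}}} \times 100\%,$$

so, as a purely hypothetical illustration, a Pass@1 of 5% rising to a Pass@2 of 20% would register as +300% while remaining low in absolute terms; the same relative-improvement reading applies to the 53.7% figure cited for LIMI.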
6. Implications for Agentic Intelligence and Future Research
AgencyBench serves as a rigorous yardstick for autonomous agent progress, enabling:
- Controlled benchmarking across multidimensional, production-relevant settings.
- Diagnosis of model–scaffold synergies and efficiency bottlenecks.
- Empirical scrutiny of the tradeoff between demonstration quality and scaling laws.
- Prototyping of automated self-correction loops, parametric vs. external memory integration, and feedback-driven adaptability.
- Structured comparison across architectures, scaffolds, and training regimes.
Extending AgencyBench to embodied, multimodal, and decentralized meta-agent domains is a key frontier. The benchmark has catalyzed paradigm shifts: notably, the recognition that raw data volume alone is insufficient to guarantee agentic performance; strategic curation and trajectory logging are paramount. Models must be grounded in environment-driven, interactive supervision, moving beyond brute-force data synthesis.
7. Related Benchmarks and Theoretical Context
AgencyBench is situated in a lineage of agentic benchmarks:
- AgentBench (Liu et al., 2023): established the multidimensional evaluation of LLMs as agents via eight interactive, partially observable Markov decision processes (code, games, web).
- HumanAgencyBench (Sturgeon et al., 10 Sep 2025): formalizes agency as user autonomy support across six dimensions (clarification, value manipulation avoidance, misinformation correction, decision deferral, learning encouragement, social boundary maintenance), with rigorous rubrics and LLM-vs-human inter-annotator statistics.
- LIMI (Xiao et al., 22 Sep 2025): demonstrates agency’s emergence from efficient, curated demonstration, shaping training philosophies and supervision protocols at scale.
AgencyBench uniquely emphasizes scale, tool orchestration, environment fidelity, and automated rollout verification, foregrounding the next generation of autonomous, efficient, and scaffold-agnostic agentic intelligence.