AgencyBench: Autonomous Agents Benchmark
- AgencyBench is a comprehensive framework that evaluates autonomous agents by probing multi-dimensional agentic intelligence in production-scale workflows.
- It employs diverse, realistic scenarios—from software engineering to research workflows—using metrics like SuccessRate and Pass@k for rigorous evaluation.
- The benchmark advances agentic intelligence by emphasizing autonomous task decomposition, self-correction, and scaffold synergy to inform future research.
AgencyBench is a comprehensive benchmark framework designed to rigorously evaluate the frontier capabilities of autonomous agents built on large language models (LLMs) in real-world, long-horizon contexts. Unlike prior agentic evaluations that target narrow or single-skill settings, AgencyBench introduces multidimensional metrics—spanning task complexity, tool use, self-correction, efficiency, and scaffold synergy—to probe agentic intelligence over production-scale workflows. "Agency" in this context is conceived as the emergent capacity of an AI system to autonomously discover problems, formulate hypotheses, and execute solutions in diverse, tool-rich environments, integrating autonomous execution, reasoning, orchestration, and collaborative engagement. The benchmark is representative of critical industry tasks, including but not limited to software engineering, research workflows, and complex multi-stage synthesis (Li et al., 16 Jan 2026, Xiao et al., 22 Sep 2025, Liu et al., 2023).
1. Foundations of AgencyBench and Agency Evaluation
The conceptual foundation of AgencyBench is rooted in the operationalization of "agency" for AI systems. Agency is defined as the emergent capacity for autonomous agents to discover, hypothesize, and solve problems through self-directed engagement with environments and tools—marked not just by reasoning or output generation, but by productive, autonomous task execution and system-level orchestration (Xiao et al., 22 Sep 2025). This entails:
- Autonomous task decomposition and multi-step planning.
- Dynamic tool invocation and manipulation (files, APIs, web resources, environment controls).
- Robust state management across extended interaction sequences and modalities.
- Adaptive, multi-turn correction in response to feedback, failure, or system state.
- Collaborative communication, where agent reasoning and interface protocols enable alignment with evolving user or system goals.
These elements distinguish agentic intelligence from static language modeling and simple decision automation.
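As a concrete illustration, the Python sketch below mocks up one such agentic loop under stated assumptions: the `llm.decompose`, `llm.propose_action`, and `llm.critique` interfaces and the `tools` registry are hypothetical stand-ins for illustration, not part of AgencyBench or any specific agent framework.

```python
# Minimal sketch of a single agentic rollout: decompose a task, invoke tools,
# track state, and self-correct on feedback. All interfaces here (llm.*, tools)
# are illustrative assumptions, not the AgencyBench implementation.
from dataclasses import dataclass, field


@dataclass
class AgentState:
    """Running state carried across turns of one rollout."""
    goal: str
    plan: list[str] = field(default_factory=list)      # decomposed sub-tasks
    history: list[dict] = field(default_factory=list)  # reasoning/tool/feedback events


def run_rollout(goal: str, llm, tools: dict, max_rounds: int = 5) -> AgentState:
    state = AgentState(goal=goal)
    state.plan = llm.decompose(goal)                    # autonomous task decomposition
    for step in state.plan:
        for _ in range(max_rounds):                     # multi-turn correction loop
            action = llm.propose_action(step, state.history)
            result = tools[action["tool"]](**action["args"])  # dynamic tool invocation
            state.history.append({"step": step, "action": action, "result": result})
            feedback = llm.critique(step, result)       # self-assessment / system feedback
            if feedback["ok"]:
                break                                   # sub-task satisfied, move on
            state.history.append({"step": step, "feedback": feedback["message"]})
    return state
```

The essential point is the interleaving of planning, tool calls, state accumulation, and feedback-driven retries, rather than a single forward pass over a prompt.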
2. Benchmark Composition and Scenario Design
AgencyBench comprises a diversified set of scenarios and tasks, each designed to stress specific agentic capabilities under realistic constraints. The 2026 version includes 32 scenarios encompassing 138 tasks across six domains:
- Game development: complex multi-stage games requiring stateful interaction, e.g., Gomoku with incremental implementation, debugging, and stress testing.
- Front-end development: comprehensive UI composition, visual layout fidelity, behavioral correctness.
- Back-end development: server logic, API orchestration, database management.
- Code generation: modular algorithm construction, refactoring, and system integration.
- Research: advanced synthesis of web-sourced and structured data with iterative reporting.
- MCP tool use: agents interacting with structured data and external tools exposed via Model Context Protocol (MCP) servers (Li et al., 16 Jan 2026).
Each scenario demands million-token-scale context retention (mean ≈1M tokens), extensive tool use (∼90 invocations per scenario), and execution spanning several hours, mimicking production cognitive load rather than "toy" tasks with shallow horizons.
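A minimal sketch of how such a scenario might be represented is shown below, assuming a simple record layout; the field names and defaults are illustrative assumptions that mirror the query/deliverable/rubric structure described in the next section, not AgencyBench's actual data format.

```python
# Hypothetical schema for an AgencyBench-style scenario record; all field names
# and defaults are assumptions for illustration only.
from dataclasses import dataclass


@dataclass
class RubricItem:
    criterion: str           # e.g., "board renders as a 15x15 grid"
    weight: float            # contribution to the rubric-satisfaction score


@dataclass
class Scenario:
    domain: str              # "game_dev", "frontend", "backend", "codegen", "research", "mcp"
    query: str               # natural-language task statement
    deliverables: list[str]  # expected artifacts: source files, APIs, logs, UI assets
    rubric: list[RubricItem]
    context_budget_tokens: int = 1_000_000  # mean ~1M tokens per scenario
    expected_tool_calls: int = 90           # ~90 tool invocations per scenario
```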
3. Evaluation Protocols and Metrics
AgencyBench employs a robust, automated evaluation suite incorporating:
- Query/Deliverable/Rubric triad: Each task specifies a textual query, explicit deliverables (source code, APIs, UI artifacts, logs), and fine-grained rubrics (pixel tolerances, API contract fidelity, resource efficiency).
- Rollout formalization: Multi-turn trajectories consist of alternating model reasoning, tool actions, and system/user feedback—capturing contextual dependencies over long horizons.
- Success metrics:
- SuccessRate: the fraction of tasks whose rubric satisfaction meets the threshold, $\mathrm{SR} = \frac{1}{N}\sum_{i=1}^{N}\mathbf{1}[s_i \ge \tau]$, with the threshold $\tau$ typically set at 60% rubric satisfaction.
- Pass@k: the proportion of tasks completed within $k$ feedback rounds.
- Efficiency metrics: success per feedback round ($\mathrm{SR}/\bar{R}$) and per token ($\mathrm{SR}/\bar{T}$), quantifying outcome relative to feedback rounds and tokens consumed (a computation sketch follows this list).
- Resource profile: tokens used, tools invoked, wall-clock time per scenario.
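The sketch below computes the metrics just listed from per-task rollout records, under the assumption that each record stores a rubric-satisfaction score per feedback round and a total token count; the record layout is an illustrative assumption, not the benchmark's API.

```python
# Minimal sketch of the AgencyBench-style metric computations described above.
# Each task record is assumed to look like: {"scores": [0.3, 0.7], "tokens": 850_000}.

def success_rate(tasks, threshold=0.6):
    """Fraction of tasks whose final rubric satisfaction meets the threshold (60% by default)."""
    return sum(t["scores"][-1] >= threshold for t in tasks) / len(tasks)

def pass_at_k(tasks, k, threshold=0.6):
    """Proportion of tasks solved within the first k feedback rounds."""
    return sum(any(s >= threshold for s in t["scores"][:k]) for t in tasks) / len(tasks)

def efficiency_per_round(tasks, threshold=0.6):
    """Success rate normalized by the mean number of feedback rounds consumed."""
    mean_rounds = sum(len(t["scores"]) for t in tasks) / len(tasks)
    return success_rate(tasks, threshold) / mean_rounds

def efficiency_per_token(tasks, threshold=0.6):
    """Success rate normalized by the mean number of tokens consumed."""
    mean_tokens = sum(t["tokens"] for t in tasks) / len(tasks)
    return success_rate(tasks, threshold) / mean_tokens
```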
A deterministic user-simulation agent (e.g., Claude-4-Sonnet at zero temperature) replaces human-in-the-loop feedback, delivering actionable failure reports for self-correction over up to $k$ feedback rounds. A Docker-based sandbox executes deliverables for objective verification, while rule-based and LLM-based judges score outputs with high agreement against human annotators (Li et al., 16 Jan 2026).
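A hedged sketch of this evaluation loop follows; the `agent`, `simulated_user`, `sandbox`, and `judge` objects are assumed interfaces standing in for the components described above, not the benchmark's actual harness.

```python
# Sketch of the automated evaluation loop: the agent attempts the task, a sandbox
# executes the deliverables, judges score them against the rubric, and a
# deterministic user simulator turns failures into actionable feedback.
# All interfaces below are illustrative assumptions.

def evaluate_task(task, agent, simulated_user, sandbox, judge, max_rounds=3):
    scores, feedback = [], None
    for _ in range(max_rounds):
        deliverables = agent.attempt(task["query"], feedback=feedback)  # produce or revise artifacts
        run_log = sandbox.execute(deliverables)                         # isolated, objective execution
        score = judge.score(task["rubric"], deliverables, run_log)      # rule-based + LLM-based scoring in [0, 1]
        scores.append(score)
        if score >= 0.6:                                                # success threshold reached
            break
        # Deterministic simulator (e.g., a fixed model at temperature 0) converts the
        # failure into actionable feedback for the next self-correction round.
        feedback = simulated_user.diagnose(task, deliverables, run_log)
    return {"scores": scores, "rounds_used": len(scores)}
```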
4. Core Agentic Capabilities Assessed
AgencyBench's scenarios jointly assess six foundational agentic capabilities:
| Capability Domain | Task Archetype | Example Assessment |
|---|---|---|
| Game Development | Multi-stage, stateful logic | Fault-tolerant Gomoku pipeline |
| Front-End Dev | Visual/UI synthesis | 15×15 grid rendering, layout conformance |
| Back-End Dev | Orchestration/database | API, server, and DB integration |
| Code Generation | Modular synthesis/refactoring | Algorithm, library, and test suite composition |
| Research Workflows | Iterative info synthesis | Web data aggregation, summarization |
| MCP Tool Use | Protocol-driven interaction | Structured ingest/emit tasks |
Each domain incorporates tool-use, iterative reasoning, deliverable validation, resource management, and collaborative response. Long-horizon dependencies and failure recovery are integral, exposing weaknesses in strategic planning, error diagnosis, and self-correction.
5. Quantitative Analysis and Model Comparisons
Experiments demonstrate substantial gaps between closed-source and open-source LLM agents:
- Closed-source agents (e.g., GPT-5.2, Claude-4.5-Opus) achieve higher Success Rates (48.4% closed vs 32.1% open-source).
- GPT-5.2 leads with 56.5%, outperforming state-of-the-art open-source (GLM-4.6, 38.6%) (Li et al., 16 Jan 2026).
- Efficiency varies by agentic style: "executors" (shell-command bias), "navigators" (high inspection), and memory-centric agents (Gemini-3-Pro, leveraging memory bank tools).
- Native scaffold synergy is critical: Claude-4.5-Opus with Claude-Agent-SDK gains +20.5% compared to generic scaffolds, while open-source models display distinct performance peaks contingent on execution frameworks.
- Self-correction metrics (Pass@1 vs Pass@2) reveal dramatic feedback-driven improvement for some models (e.g., Kimi-K2-Thinking +300%, though from a low baseline), while brittle agents show negligible recovery.
Results from LIMI (Xiao et al., 22 Sep 2025) fundamentally challenge the data-scaling paradigm: LIMI achieves 73.5% avg. on AgencyBench with only 78 curated demonstrations, outperforming models fine-tuned on 10k samples by 53.7% relative improvement. This establishes the Agency Efficiency Principle—machine autonomy’s development hinges on strategic, high-quality agentic demonstrations rather than sheer data volume.
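Read as a worked definition, the relative-improvement figures above follow

$$\Delta_{\mathrm{rel}} = \frac{m_{\mathrm{after}} - m_{\mathrm{before}}}{m_{\mathrm{before}}} \times 100\%,$$

so, as a purely hypothetical illustration, a Pass@1 of 5% rising to a Pass@2 of 20% would register as +300% while remaining low in absolute terms; the same relative-improvement reading applies to the 53.7% figure cited for LIMI.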
6. Implications for Agentic Intelligence and Future Research
AgencyBench serves as a rigorous yardstick for autonomous agent progress, enabling:
- Controlled benchmarking across multidimensional, production-relevant settings.
- Diagnosis of model–scaffold synergies and efficiency bottlenecks.
- Empirical scrutiny of the tradeoff between demonstration quality and scaling laws.
- Prototyping of automated self-correction loops, parametric vs. external memory integration, and feedback-driven adaptability.
- Structured comparison across architectures, scaffolds, and training regimes.
Extending AgencyBench to embodied, multimodal, and decentralized meta-agent domains is a key frontier. The benchmark has catalyzed paradigm shifts: notably, the recognition that raw data volume alone is insufficient to guarantee agentic performance; strategic curation and trajectory logging are paramount. Models must be grounded in environment-driven, interactive supervision, moving beyond brute-force data synthesis.
7. Related Benchmarks and Theoretical Context
AgencyBench is situated in a lineage of agentic benchmarks:
- AgentBench (Liu et al., 2023): established the multidimensional evaluation of LLMs as agents via eight interactive, partially observable Markov decision processes (code, games, web).
- HumanAgencyBench (Sturgeon et al., 10 Sep 2025): formalizes agency as user autonomy support across six dimensions (clarification, value manipulation avoidance, misinformation correction, decision deferral, learning encouragement, social boundary maintenance), with rigorous rubrics and LLM-vs-human inter-annotator statistics.
- LIMI (Xiao et al., 22 Sep 2025): demonstrates agency’s emergence from efficient, curated demonstration, shaping training philosophies and supervision protocols at scale.
AgencyBench uniquely emphasizes scale, tool orchestration, environment fidelity, and automated rollout verification, foregrounding the next generation of autonomous, efficient, and scaffold-agnostic agentic intelligence.