APEX-Agents Benchmark: Evaluating AI Agents

Updated 23 January 2026
  • APEX-Agents Benchmark is a systematic testbed designed to evaluate AI agents' ability to autonomously execute complex, long-horizon tasks in banking, consulting, and law.
  • It features 480 realistic tasks across 33 distinct worlds with unified API-based tool orchestration, detailed rubric criteria, and reproducible evaluation.
  • Performance metrics like Pass@1 and Pass@8 highlight current agent limitations and inform future enhancements in professional task automation.

The APEX-Agents Benchmark, formally known as the AI Productivity Index for Agents, is a systematic testbed for evaluating whether state-of-the-art AI agents can autonomously execute long-horizon, cross-application tasks emblematic of skilled knowledge work in investment banking, management consulting, and corporate law. Authored and vetted by domain professionals, the benchmark establishes a task suite derived from realistic, data-rich client engagements and supports rigorous, reproducible evaluation. The benchmark is accompanied by open infrastructure—Archipelago—and comprehensive datasets, serving both agent benchmarking and in-depth diagnosis of present agent shortcomings (Vidgen et al., 20 Jan 2026).

1. Motivation, Scope, and Target Domains

The central objective of APEX-Agents is to answer: Can today’s AI agents reliably execute the day-to-day work performed by professionals in banking, consulting, and law, encompassing multi-step, multi-application workflows? The benchmark’s 480 tasks were generated by investment-banking analysts (O*NET 13-2051), management consultants (O*NET 13-1111), and corporate lawyers (O*NET 23-1011). These experts role-played multi-day client engagements to produce realistic “worlds” and deliverables, ensuring tasks reflect both domain complexity and authentic information environments (Vidgen et al., 20 Jan 2026).

2. Task Suite and Environment Construction

APEX-Agents comprises 33 distinct “worlds” (10 banking, 11 consulting, 12 legal), simulating 5–10-day engagements and providing an average of 166 files per world. Each world exposes API access to a suite of applications (Calendar, Chat, Code, Documents, File System, Mail, PDF, Spreadsheets, Presentations). Two banking worlds supplement this with EDGAR SEC and Fixed Income Market tools for specialized financial analysis. Tasks (n=480) consist of 8–20 per world (mean 14.5), all single-turn prompts such as “Update the DCF model with Q4 2025 estimates, recompute IRR, then draft a two-slide investor summary.” Output types include console messages (422 tasks), new documents/spreadsheets/presentations (58), or direct file edits (18). Real-world completion times average 1.8 hours per task (verified mean: 1.37h) (Vidgen et al., 20 Jan 2026).

Each task is paired with 1–10 binary rubric criteria (mean 4.06), gold-standard outputs, and detailed metadata encompassing application workflow and expert-estimated completion times.
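The pairing of tasks with binary rubric criteria, gold outputs, and metadata can be pictured as a record like the following. This is an illustrative sketch only: the field names and values are hypothetical, not the released APEX-Agents schema.

```python
# Hypothetical task record; field names are illustrative, not the released schema.
task = {
    "task_id": "banking-world-03-task-07",
    "world": "banking-world-03",
    "prompt": ("Update the DCF model with Q4 2025 estimates, "
               "recompute IRR, then draft a two-slide investor summary."),
    "output_type": "new_file",   # one of: console, new_file, file_edit
    "rubric": [                  # 1-10 binary criteria per task (mean 4.06)
        "DCF model reflects Q4 2025 estimates",
        "IRR is recomputed from the updated model",
        "Summary presentation contains exactly two slides",
        "Slide figures match the recomputed IRR",
    ],
    "gold_output": "gold/investor_summary.pptx",
    "expert_est_hours": 1.8,
}

def task_passed(criterion_results):
    """A run passes the task only if every binary criterion is satisfied."""
    return all(criterion_results)
```

This all-or-nothing pass rule is what makes the Pass@k metrics in the next section strict; the Mean-criteria metric relaxes it by crediting each satisfied criterion individually.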

3. Evaluation Metrics and Statistical Protocols

APEX-Agents evaluation spans four primary metrics across eight independent agent runs per task:

  • Pass@1: Mean probability that a single run meets all rubric criteria, calculated as the task-uniform mean of per-task pass rates.
  • Pass@8: Probability of at least one successful outcome in eight runs on a given task.
  • Pass^8: Consistency metric, denoting the fraction of tasks passed in all eight runs.
  • Mean-criteria: Proportion of rubric criteria satisfied, supplying partial credit.

For a task $t$, let $r_{t,i} \in \{0,1\}$ be the pass indicator for run $i$ ($i \in \{1,\dots,8\}$), let $p_t = \frac{1}{8}\sum_{i} r_{t,i}$, and let $T = 480$:

$$\text{Pass@1} = \frac{1}{T}\sum_{t=1}^{T} p_t \qquad \text{Pass@8} = \frac{1}{T}\sum_{t=1}^{T}\left[1-\prod_{i=1}^{8}\left(1-r_{t,i}\right)\right] \qquad \text{Pass}^8 = \frac{1}{T}\sum_{t=1}^{T}\prod_{i=1}^{8} r_{t,i}$$
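Given per-run pass indicators, the three Pass metrics reduce to simple aggregations over a T×8 boolean array; a minimal sketch (Mean-criteria is omitted because it requires per-criterion grades, not just per-run pass/fail):

```python
def compute_metrics(runs):
    """runs: list of T tasks, each a list of 8 pass indicators (0/1 or bool)."""
    T = len(runs)
    pass_at_1 = sum(sum(r) / len(r) for r in runs) / T   # task-uniform mean pass rate
    pass_at_8 = sum(any(r) for r in runs) / T            # at least 1 of 8 runs passes
    pass_pow_8 = sum(all(r) for r in runs) / T           # all 8 runs pass
    return pass_at_1, pass_at_8, pass_pow_8

# Toy example: 3 tasks x 8 runs
runs = [
    [1, 1, 1, 1, 1, 1, 1, 1],   # always passes
    [1, 0, 0, 0, 0, 0, 0, 0],   # passes once
    [0, 0, 0, 0, 0, 0, 0, 0],   # never passes
]
p1, p8, pp8 = compute_metrics(runs)   # 0.375, 2/3, 1/3
```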

Confidence intervals are computed by bootstrap resampling over the 480 tasks (10,000 resamples). Pairwise comparisons of Pass@1 use McNemar’s exact test with Benjamini–Hochberg correction (Vidgen et al., 20 Jan 2026).
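Both statistical tools above have short standard-library implementations; the following sketch shows the exact McNemar test on discordant task counts and Benjamini–Hochberg step-up adjustment (standard formulations, not code from the benchmark release):

```python
from math import comb

def mcnemar_exact(b, c):
    """Exact (binomial) McNemar p-value from discordant counts:
    b = tasks only model A passed, c = tasks only model B passed."""
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    # two-sided exact binomial test with p = 0.5 on the discordant pairs
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values (step-up over m hypotheses)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running = 1.0
    for rank in range(m, 0, -1):          # walk from largest p to smallest
        i = order[rank - 1]
        running = min(running, pvals[i] * m / rank)
        adjusted[i] = running
    return adjusted
```

For example, if model A uniquely passes 8 tasks and model B uniquely passes 2, `mcnemar_exact(8, 2)` gives p = 0.109375, i.e. not significant at the 5% level despite the apparent gap.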

4. Agent Leaderboard and Performance Analysis

The inaugural APEX-Agents leaderboard encompassed eight agents, with the following results:

Model                   Pass@1   Pass@8   Pass^8   Mean-criteria
Gemini 3 Flash (High)    24.0%    36.7%    13.4%    39.5%
GPT-5.2 (High)           23.0%    40.0%    11.0%    38.7%
Claude Opus 4.5 (High)   18.4%    34.0%     8.8%    34.8%
Gemini 3 Pro (High)      18.4%    37.3%     6.5%    34.1%
GPT-5 (High)             18.3%    31.0%     7.7%    32.9%
Grok (default)           15.2%    32.9%     4.7%    30.3%
GPT-OSS-120B (High)       4.7%    11.5%     1.2%    14.5%
Kimi K2 (Thinking)        4.0%    14.4%     0.3%    11.5%

Top-performing models are closed-source, "High"-thinking LMs; open-source models remain below 5% Pass@1 (Vidgen et al., 20 Jan 2026). Segment-level maxima: banking 27.3% (GPT-5/5.2), consulting 22.7% (GPT-5.2), and law 25.9% (Gemini 3 Flash). Zero-score rates (all criteria failed) span 40–62%; timeouts (>250 steps) reach 30% for Kimi K2; partial-credit outcomes account for 19–35% of runs. Per-run resource usage: Gemini 3 Flash ~5M tokens and ~540 tool calls, GPT-5.2 ~2M tokens and ~380 tool calls, with typical final outputs of 200–500 tokens.
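A quick consistency check on these numbers: if each of the eight runs succeeded independently with the same probability as Gemini 3 Flash's Pass@1 (24.0%), Pass@8 would be near 89%, far above the observed 36.7%. The shortfall indicates that success probabilities are highly heterogeneous across tasks: hard tasks tend to fail in every run rather than at random.

```python
# Pass@8 predicted under the (false) assumption of homogeneous,
# independent runs at the observed Pass@1 rate.
p1 = 0.240                                   # observed Pass@1, Gemini 3 Flash
pass8_if_independent = 1 - (1 - p1) ** 8     # ~0.889 vs. observed 0.367
```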

5. Archipelago: Execution and Grading Infrastructure

The APEX-Agents benchmark suite is tightly integrated with the Archipelago infrastructure (open-sourced at https://github.com/Mercor-Intelligence/archipelago), which operates as follows:

  • Environment Container: Sandboxed context hosting the complete “data room” with APIs for Calendar, Chat, Code Execution, Documents, File System, Mail, PDF, Spreadsheets, etc., via a unified Model Context Protocol.
  • Agents Runner: Supports multiple agent frameworks (here, ReAct), manages dynamic context (triggering summarization at 70% fullness, retaining the last 10 messages).
  • Grading System: Snapshot-based verifier to compare before/after states; a judge LLM (Gemini 3 Flash at low-thinking) grades per binary criterion with explanatory feedback.
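The snapshot-based verifier can be pictured as a before/after diff of the environment's file state, with the judge LLM then grading each rubric criterion against the diff. The following is a minimal illustrative sketch of the diff step only, not the Archipelago implementation:

```python
import hashlib

def snapshot(files):
    """Map each path to a content hash, representing one environment state.
    files: dict of path -> bytes."""
    return {path: hashlib.sha256(data).hexdigest() for path, data in files.items()}

def diff_snapshots(before, after):
    """Classify file-level changes between two snapshots for the grader."""
    created = sorted(set(after) - set(before))
    deleted = sorted(set(before) - set(after))
    edited = sorted(p for p in before.keys() & after.keys() if before[p] != after[p])
    return {"created": created, "deleted": deleted, "edited": edited}

before = snapshot({"model.xlsx": b"v1", "memo.docx": b"draft"})
after = snapshot({"model.xlsx": b"v2", "memo.docx": b"draft",
                  "summary.pptx": b"deck"})
changes = diff_snapshots(before, after)
# changes: model.xlsx edited, summary.pptx created, nothing deleted
```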

All components communicate over HTTP APIs and are compatible with major orchestration frameworks (Kubernetes, Modal). The full dataset, agent code, grading protocol, and instructions are released under CC-BY at https://huggingface.co/datasets/mercor/apex-agents (Vidgen et al., 20 Jan 2026).

6. Significance, Limitations, and Future Directions

APEX-Agents presents the first systematic, fully open benchmark targeting the complete execution of professional, long-horizon, cross-application workflows by LLM-based agents, incorporating end-to-end file and tool management, domain-driven rubric evaluation, and unified infrastructure. By exposing realistic job complexity, it highlights marked gaps in state-of-the-art agent autonomy: leading models achieve only 24% Pass@1, with performance falling sharply on open-source agents. Failure modes are dominated by zero-score completions, timeouts, and workflows where only a minority of sub-tasks earn partial credit.

A salient implication is that reliable automation of high-skill knowledge work remains well outside current agent capabilities. The modular design of APEX-Agents, including granular rubrics and open infrastructure, supports both benchmarking and systematic diagnosis of agent and system-level deficiencies. Ongoing research will likely focus on boosting system-level consistency, robust error recovery, improved planning, and finer integration of domain-specific knowledge (Vidgen et al., 20 Jan 2026).

7. Relationship to Other Agent Benchmarks

Unlike paper-to-poster, slide-editing, or omics-analysis agent benchmarks, APEX-Agents is uniquely situated at the intersection of file/tool orchestration and genuine professional reasoning. While frameworks such as the APEX-Bench for poster editing (Shi et al., 8 Jan 2026) and single-cell omics agent benchmarks (Liu et al., 16 Aug 2025) demonstrate domain specificity and multi-metric evaluation, APEX-Agents subsumes a broader, real-world application landscape, operationalizing high-density professional task execution and enforcing stringent, multi-run pass criteria with gold-standard outputs. This suggests its findings and methodology are extensible to, yet distinct from, scientific and vertical-agent benchmarks in both design and impact.
