APEX-Agents: LLM Agent Benchmark Suite

Updated 6 March 2026
  • APEX-Agents is a benchmark suite evaluating LLM agents’ performance in executing complex, multi-tool professional workflows mimicking real-world tasks.
  • It features 33 distinct ‘worlds’ with 480 tasks drawn from investment banking, consulting, and law, simulating detailed 5–10 day projects.
  • The system employs rigorous evaluation metrics, automated grading, and a containerized infrastructure to ensure reproducible and scalable agent assessments.

APEX-Agents is a benchmark and infrastructure suite designed for rigorous evaluation of LLM agents executing complex, long-horizon, cross-application workflows derived from the daily work of professional investment banking analysts, management consultants, and corporate lawyers. It provides a realistic, multi-application environment (“worlds”) grounded in real and synthetic enterprise data, with tasks requiring the orchestration of tools such as spreadsheets, presentation editors, code execution, and file system management. APEX-Agents enables systematic assessment of agent capabilities, consistency, and tool-use within knowledge-intensive professional service contexts (Vidgen et al., 20 Jan 2026).

1. Motivation, Scope, and Benchmark Definition

APEX-Agents (AI Productivity Index for Agents) was developed to address deficiencies in existing agentic benchmark suites, notably a pronounced sim-to-real gap. Prior agent benchmarks were either too narrow or lacked authentic complexity, failing to capture the multi-day, multi-tool, high-stakes workflows characteristic of real-world analyst, consultant, and lawyer work. Tasks in APEX-Agents are directly sourced from domain experts in investment banking (O*NET 13-2051), management consulting (O*NET 13-1111), and corporate law (O*NET 23-1011). Each world simulates a 5–10 day project, with data-rich environments requiring navigation of roughly 166 document and data files per world. Tasks, on average, require 1–2 hours for completion by experienced human professionals, ensuring realistic scope and difficulty (Vidgen et al., 20 Jan 2026).

A distinguishing feature is its commitment to cross-application workflows; agents must interleave commands and data across spreadsheets, slide decks, documents, email, PDF research, and code interpreters—all in a sandboxed, controlled environment that prohibits web search. This design is intended to reflect the true complexity of professional knowledge work.

2. Dataset and Environment Construction

Each of the 33 “worlds” in APEX-Agents constitutes a project scenario—10 banking, 11 consulting, and 12 legal—populated with an average of 166 files, including company financials, SEC filings, contracts, and slide templates. The set of available applications encompasses nine core productivity tools (calendar, chat, code execution, documents, file system, mail, PDFs, spreadsheets, presentations), with some banking worlds adding domain-specific data analysis endpoints (e.g., SEC EDGAR, fixed income data with 187 endpoints) (Vidgen et al., 20 Jan 2026).

The dataset includes 480 total tasks (evenly split by job type), with granularity reflecting real professional expectations:

Attribute | Value
Number of tasks | 480
Number of worlds | 33 (8–20 tasks per world)
Output types | Console (422), create/edit files (58)
Avg. criteria per task | 4.06 (range 1–10)
Avg. estimated human time per task | 1.81 hours

Each task specifies a rubric of must-have criteria and artifact(s) for evaluation (e.g., specific output within a spreadsheet or a slide deck).
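To make this structure concrete, the sketch below shows one way a world, its tasks, and their rubric criteria could be represented in code. The class and field names are illustrative assumptions for exposition, not the released data schema.

```python
from dataclasses import dataclass, field

# Hypothetical, simplified representation of an APEX-Agents world and task;
# names are illustrative and do not reflect the released schema.

@dataclass
class Criterion:
    description: str          # binary, must-have requirement written by a domain expert
    met: bool = False         # filled in by the automated judge at grading time

@dataclass
class Task:
    prompt: str               # the analyst/consultant/lawyer-style instruction
    output_type: str          # "console" or "file" (create/edit an artifact)
    target_artifact: str      # e.g., path to the spreadsheet or slide deck to grade
    criteria: list[Criterion] = field(default_factory=list)   # avg. ~4 per task

@dataclass
class World:
    name: str                 # project scenario, e.g., a banking deal or legal matter
    domain: str               # "banking" | "consulting" | "legal"
    files: list[str]          # ~166 documents/data files available to the agent
    applications: list[str]   # subset of the nine core productivity tools
    tasks: list[Task] = field(default_factory=list)           # 8-20 tasks per world
```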

3. Evaluation Protocol and Metrics

APEX-Agents implements a rigorous, multi-level evaluation protocol. For each task $t$, human experts define a set of binary criteria $C_t = \{c_{t,1}, \ldots, c_{t,k}\}$ and a grading target artifact. Ground-truth outputs are created and checked for each task.

Outcomes are measured via the following metrics (a computational sketch follows the list):

  • Pass@1 (single-run success): $\text{Pass@1} = \frac{1}{T}\sum_{t=1}^T s_{t,1}$, where $s_{t,1} = 1$ if the first agent run meets all criteria on task $t$.
  • Pass@$k$ (at least one success in $k$ runs): $\text{Pass@}k = \frac{1}{T}\sum_{t=1}^T \mathbf{1}\!\left(\max_{1 \leq r \leq k} s_{t,r} = 1\right)$.
  • Pass$^k$ (all $k$ runs succeed): $\widehat{\text{Pass}^{\,k}} = \frac{1}{T}\sum_{t=1}^T \mathbf{1}\!\left(\sum_{r=1}^k s_{t,r} = k\right)$.
  • Mean Criteria Score: $\frac{1}{T}\sum_{t=1}^T \frac{1}{|C_t|}\sum_{c \in C_t} \mathbf{1}(\text{criterion } c \text{ met})$.
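As a concrete illustration, the sketch below computes these four metrics from a per-task, per-run success matrix. The function and variable names are assumptions for exposition, not the benchmark's released grading code.

```python
import numpy as np

def apex_metrics(success: np.ndarray, criteria_scores: np.ndarray) -> dict:
    """Compute the APEX-Agents outcome metrics from per-run results.

    success: (T, k) binary matrix, success[t, r] = 1 if run r meets ALL
             criteria on task t (i.e. s_{t,r} in the definitions above).
    criteria_scores: (T, k) matrix of per-run fractions of criteria met.
    """
    T, k = success.shape
    pass_at_1 = success[:, 0].mean()                 # Pass@1: first run succeeds
    pass_at_k = (success.max(axis=1) == 1).mean()    # Pass@k: any of the k runs succeeds
    pass_hat_k = (success.sum(axis=1) == k).mean()   # Pass^k: all k runs succeed
    mean_criteria = criteria_scores[:, 0].mean()     # partial credit, taken here on the first run
    return {
        "Pass@1": pass_at_1,
        f"Pass@{k}": pass_at_k,
        f"Pass^{k}": pass_hat_k,
        "MeanCriteriaScore": mean_criteria,
    }

# Example: 3 tasks, 2 runs each
runs = np.array([[1, 0],
                 [0, 0],
                 [1, 1]])
scores = np.array([[1.0, 0.5],
                   [0.25, 0.0],
                   [1.0, 1.0]])
print(apex_metrics(runs, scores))  # Pass@1 = 2/3, Pass@2 = 2/3, Pass^2 = 1/3
```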

Assessment is performed by a dedicated automated judge model (Gemini 3 Flash, “thinking=low”), which was calibrated to 98.5% criterion-level accuracy against a human-labeled baseline of 747 judgments (Vidgen et al., 20 Jan 2026).
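Criterion-level accuracy of this kind is simply the agreement rate between the judge's labels and the human labels, as in the following sketch (a hypothetical helper, not the actual calibration code):

```python
def criterion_accuracy(judge_labels: list[bool], human_labels: list[bool]) -> float:
    """Fraction of criterion-level judgments where the automated judge
    agrees with the human-expert label (e.g., 98.5% over 747 judgments)."""
    assert len(judge_labels) == len(human_labels)
    agree = sum(j == h for j, h in zip(judge_labels, human_labels))
    return agree / len(human_labels)
```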

4. Agent Leaderboard and Empirical Findings

APEX-Agents evaluated eight agent LMs, using the above metrics across the test suite. The leaderboard (Pass@1) for top commercial and open-source agents is:

Model (config) | Pass@1 (mean, %)
Gemini 3 Flash (high) | 24.0 ± 2.7
GPT-5.2 (high) | 23.0 ± 2.6
Claude Opus 4.5 (high) | 18.4 ± 2.9
Gemini 3 Pro (high) | 18.4 ± 2.8
GPT-5 | 18.3
Grok | 15.2
GPT-OSS-120B | 4.7
Kimi K2 | 4.0

On Pass@8 (at least one success over eight runs), closed agents generally scored 36–40%, highlighting inconsistency; the stricter $\widehat{\text{Pass}^{\,8}}$ (all eight runs succeed) drops roughly 10–12 percentage points below Pass@1. The mean criteria score (partial credit) for top agents is ~39.5%. Open-source models, in contrast, rarely exceeded 8% Pass@1 and showed especially poor performance on the banking and consulting domains.

Job-segmented performance reveals that GPT-5.2 leads on management consulting and investment banking, while Gemini 3 Flash leads on corporate law. Partial progress (meeting some criteria) was common (~30–35%). Timeouts (>250 agent steps) were most prevalent in open-source agents (Vidgen et al., 20 Jan 2026).

5. Failure Modes and Analysis

Analysis of trajectories identified several key failure and bottleneck patterns:

  • High zero-score rates: Even top models failed all criteria in ≥40% of single-run cases.
  • Inconsistency: Substantial variance between runs led to a large Pass@8 – Pass@1 gap.
  • Partial credit prevalence: 30–35% of runs delivered only a subset of required work.
  • Tool usage patterns: Successful runs used marginally fewer environment actions but were slightly more likely to invoke code execution tools. Kimi K2 (open-source) showed 29.8% timeout frequency.
  • Job segmentation: Open-source models performed relatively better on legal tasks (7.8–8.0% Pass@1) compared to banking or consulting, while top commercial models were more robust but still exhibited domain-dependent variability.

6. Archipelago: Infrastructure for Sandboxed Evaluation

APEX-Agents is supported by the Archipelago infrastructure, a publicly released, containerized system for agent evaluation (Vidgen et al., 20 Jan 2026). Its components include:

  1. Environment Container: Exposes all relevant “world” APIs (file, calendar, code, chat, etc.) in a unified protocol.
  2. Agent Runner: Manages LM reasoning and action cycles, context summarization, and stepwise logging using a ReAct toolbelt approach (see the sketch after this list).
  3. Grading System: Orchestrates state snapshots, invokes the judge model, and aggregates results.
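The sketch below illustrates the kind of ReAct-style reasoning-and-action loop the Agent Runner implements; all names (call_llm, tools, MAX_STEPS) are hypothetical placeholders rather than Archipelago's actual API.

```python
# Illustrative ReAct-style loop in the spirit of the Agent Runner; the
# interfaces shown here are assumptions, not Archipelago's real API.

MAX_STEPS = 250  # runs exceeding this step budget are recorded as timeouts

def run_agent(task_prompt: str, tools: dict, call_llm) -> list[dict]:
    """Alternate LM reasoning with tool actions, logging every step."""
    trajectory = []                        # step-wise log used for grading and analysis
    context = [{"role": "user", "content": task_prompt}]
    for step in range(MAX_STEPS):
        reply = call_llm(context)          # model proposes a thought plus an action
        trajectory.append({"step": step, "reply": reply})
        if reply.get("final_answer"):      # agent declares the task complete
            break
        tool = tools[reply["tool"]]        # e.g., spreadsheet, file system, code execution
        observation = tool(**reply["arguments"])
        context.append({"role": "assistant", "content": str(reply)})
        context.append({"role": "tool", "content": str(observation)})
        # (context summarization for long horizons would be applied here)
    return trajectory
```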

Archipelago ensures reproducibility, scalability for parallel experiment execution, and fine-grained trajectory capture. All benchmark assets—tasks, worlds, rubrics, gold outputs, and source code—are released under CC-BY on HuggingFace and GitHub.

7. Conclusions and Prospective Research Directions

APEX-Agents demonstrates that state-of-the-art commercial agentic LMs can attain non-trivial productivity (Pass@1 ≈ 24%) on complex, end-to-end professional tasks, but still fall well short of full reliability, with ≥75% failure rates in single runs. A plausible implication is that while AI agent deployment in hybrid professional settings is feasible for partial task support, robust full automation remains challenging. Large consistency gaps and failures to meet all criteria suggest that further advancements in loop planning, context management, and self-refinement are needed (Vidgen et al., 20 Jan 2026).

Proposed future research directions include:

  • Scaling the benchmark to cover longer horizons (multi-week projects)
  • Increasing environmental complexity and application diversity
  • Designing more challenging high-value tasks targeting deep domain knowledge and multi-step reasoning
  • Developing better agent orchestration strategies to improve result consistency
  • Promoting open-source community contributions for new worlds, tasks, and evaluation strategies (including potential human-in-the-loop grading).

APEX-Agents, through its data realism, rigorous evaluation, and open infrastructure, establishes a new standard for the measurement of agentic LLM performance in enterprise-mimetic knowledge work.

References
1. Vidgen et al. (2026). APEX-Agents.
