Agents' Last Exam: AI Workflow Benchmark
- Agents' Last Exam (ALE) is a benchmark of real-world, multi-step computer-use tasks designed to test AI agents on sustained, economically valuable digital workflows.
- It employs a rigorous five-gate task pipeline and formal scoring metrics, including full pass rate (FPR) and mean score (MS), to objectively assess performance.
- The benchmark highlights challenges in domain expertise, long-horizon planning, and tool-chain integration: on the hardest tier, frontier configurations average an FPR of only about 2.6%.
Agents' Last Exam (ALE) is a benchmark for evaluating AI agents on long-horizon, end-to-end digital workflows drawn from professional practice and selected for verifiable outcomes (Sun et al., 3 Jun 2026). It was developed in collaboration with over 250 industry experts, is anchored in O*NET / SOC 2018, and organizes more than 1,000 task instances into 55 subdomains within 13 industry clusters (Sun et al., 3 Jun 2026). ALE is intended to measure whether a Generalist Computer-Use Agent can carry out sustained, economically meaningful work rather than merely answer short questions, while remaining a living benchmark through continuous task-pool growth and rotating public/private splits (Sun et al., 3 Jun 2026).
1. Definition, scope, and design objective
ALE is defined as a benchmark of real-world, multi-step computer-use tasks drawn from professional practice (Sun et al., 3 Jun 2026). Its primary aim is explicitly two-fold. First, it serves as a competence threshold: an agent that “passes” ALE demonstrates the ability to carry out sustained, economically meaningful workflows. Second, it serves as a difficulty frontier: by embedding authentic software tools, GUI and CLI interactions, and long-horizon planning, it is positioned at the boundary of what current systems can accomplish (Sun et al., 3 Jun 2026).
The benchmark evaluates a Generalist Computer-Use Agent (GCUA), defined as an AI system integrating five functional layers: Brain (LLM reasoning), Eyes (GUI perception), Body (control flow), Hands (tool invocation), and Feet (code execution / file work) (Sun et al., 3 Jun 2026). The target workflows are professionally sourced and end in deliverables such as CAD files, workbooks, rendered scenes, code repositories, and reports, with outcomes assessed by deterministic or narrowly scoped rubric-based verification (Sun et al., 3 Jun 2026).
This design choice places ALE in a different evaluative regime from short-form QA or stylized tool-use tests. A plausible implication is that ALE treats workflow completion, rather than isolated inference quality, as the core unit of capability.
2. Taxonomy, coverage, and task construction
ALE’s coverage is grounded in the U.S. federal occupational taxonomy, specifically O*NET and SOC 2018 (Sun et al., 3 Jun 2026). The construction process screened all 1,016 O*NET occupations for in-scope digitally mediated workflows, distilled 117 candidate SOC base codes, and then, after expert grouping, LLM-assisted clustering, and a small “frontier” supplement for emerging digital fields, produced the released taxonomy (Sun et al., 3 Jun 2026).
The resulting structure is hierarchical.
| Level | Scale | Examples from the benchmark |
|---|---|---|
| Top level | 13 industry clusters | Manufacturing & Industrial Operations; Biomolecular Design; Visual Media Arts; Autonomous Systems |
| Second level | 55 subdomains | CAM programming, protein design, 3D compositing, robot simulation |
| Tasks | over 1,000 unique task instances | Deliverables such as CAD files, workbooks, rendered scenes, code repositories, reports |
Tasks enter ALE through a five-gate pipeline: expert sourcing via dedicated advisory committees; submission and AI-assisted specification refinement; first-pass conference-style review; engineering implementation through containerization, VM provisioning, and evaluator scripts; and final peer QC by domain experts (Sun et al., 3 Jun 2026). Only approximately 10% of instances are public at any time, with the remainder kept private to avoid contamination; public/private splits rotate over time to maintain an uncontaminated “fresh” evaluation surface (Sun et al., 3 Jun 2026).
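The five gates described above act as a strictly ordered admission sequence. A minimal sketch of that ordering follows; the enum and function names (`Gate`, `next_gate`, `is_admitted`) are hypothetical illustrations and are not taken from the ALE tooling.

```python
from enum import IntEnum
from typing import Optional, Set


class Gate(IntEnum):
    """The five gates in the order the text lists them (names are illustrative)."""
    EXPERT_SOURCING = 1
    SPEC_REFINEMENT = 2
    FIRST_PASS_REVIEW = 3
    ENGINEERING_IMPLEMENTATION = 4
    PEER_QC = 5


def next_gate(cleared: Set[Gate]) -> Optional[Gate]:
    """Return the earliest gate not yet cleared, or None if all five are cleared.

    Assumption: gates must be cleared in order, so a later gate does not
    count until every earlier gate has been cleared.
    """
    for gate in Gate:  # IntEnum iterates in definition order
        if gate not in cleared:
            return gate
    return None


def is_admitted(cleared: Set[Gate]) -> bool:
    """A task enters the pool only after clearing every gate."""
    return next_gate(cleared) is None


if __name__ == "__main__":
    progress = {Gate.EXPERT_SOURCING, Gate.SPEC_REFINEMENT}
    print(next_gate(progress).name)
    print(is_admitted(set(Gate)))
```

Modeling the gates as an ordered enum keeps the "no skipping" rule in one place: a task that somehow has a later gate recorded but an earlier one missing is still held at the earlier gate.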
The benchmark’s industry collaboration is central rather than ornamental. Experts contributed actual day-to-day projects drawn from their own professional practice, which makes the taxonomy operationally tied to digitally mediated labor rather than to synthetic benchmark authoring (Sun et al., 3 Jun 2026).
3. Formal scoring and verification framework
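Each ALE task instance $i$ produces a normalized score $s_i \in [0, 1]$ via a deterministic evaluator (Sun et al., 3 Jun 2026). The benchmark defines the full-pass indicator as

$$p_i = \mathbb{1}[s_i = 1]$$

and defines Full Pass Rate (FPR) over a set $T$ of tasks as

$$\mathrm{FPR}(T) = \frac{1}{|T|} \sum_{i \in T} p_i.$$

Mean Score (MS) is defined as

$$\mathrm{MS}(T) = \frac{1}{|T|} \sum_{i \in T} s_i.$$

ALE organizes tasks into three nested difficulty tiers (Near-Term, Full-Spectrum, and Last-Exam) so that evaluation cost can be traded against remaining headroom (Sun et al., 3 Jun 2026). For a tier $k$ with task set $T_k$ containing $N_k$ tasks, the tier-specific full pass rate is

$$\mathrm{FPR}_k = \frac{1}{N_k} \sum_{i \in T_k} p_i.$$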
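The two headline metrics can be sketched in a few lines of Python. This is an illustrative sketch, not the benchmark's evaluator: it assumes scores are normalized to [0, 1], that a full pass means a score of exactly 1.0, and that each task carries a single tier label (the source describes the tiers as nested, which this simplification ignores). The `TaskResult` type and all function names are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass(frozen=True)
class TaskResult:
    task_id: str
    tier: str      # e.g. "Near-Term", "Full-Spectrum", "Last-Exam"
    score: float   # normalized score, assumed to lie in [0, 1]


def is_full_pass(score: float) -> bool:
    """Assumed full-pass rule: only a perfect normalized score counts."""
    return score >= 1.0


def full_pass_rate(results: Sequence[TaskResult]) -> float:
    """Fraction of tasks that are full passes."""
    if not results:
        raise ValueError("cannot compute FPR over an empty task set")
    return sum(is_full_pass(r.score) for r in results) / len(results)


def mean_score(results: Sequence[TaskResult]) -> float:
    """Average normalized score, which credits partial completion."""
    if not results:
        raise ValueError("cannot compute MS over an empty task set")
    return sum(r.score for r in results) / len(results)


def per_tier_fpr(results: Sequence[TaskResult]) -> Dict[str, float]:
    """FPR computed separately for each tier label."""
    by_tier: Dict[str, List[TaskResult]] = defaultdict(list)
    for r in results:
        by_tier[r.tier].append(r)
    return {tier: full_pass_rate(rs) for tier, rs in by_tier.items()}


if __name__ == "__main__":
    # Made-up scores for illustration only; not data from the paper.
    demo = [
        TaskResult("a", "Near-Term", 1.0),
        TaskResult("b", "Near-Term", 1.0),
        TaskResult("c", "Full-Spectrum", 0.5),
        TaskResult("d", "Last-Exam", 0.0),
    ]
    print(full_pass_rate(demo), mean_score(demo), per_tier_fpr(demo))
```

In the made-up demo, task `c` raises the mean score but not the full pass rate, which is the distinction between partial and end-to-end completion that the two metrics are meant to separate.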
Verification is implemented through one or more of six artifact modes: exact/hash match, numeric/tolerance tables, 3D-geometry distance, behavioral state replay, free-text rubrics, or “vision-LLM” gates (Sun et al., 3 Jun 2026). These can be composed through “gate-and-score,” weighted rubrics, checklist averaging, or per-file means (Sun et al., 3 Jun 2026). By default, judges are code-based; only approximately 7% of tasks use narrowly scoped LLM-as-a-judge probes (Sun et al., 3 Jun 2026).
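The four composition schemes named above can be illustrated with small scoring helpers. The paper's exact definitions are not reproduced here, so each function is one plausible reading, labeled as an assumption: "gate-and-score" is taken to mean that any failed gate zeroes the score, and weighted rubrics are taken to be weight-normalized averages. All function names are hypothetical.

```python
from __future__ import annotations

import math
from typing import Mapping, Sequence, Tuple


def within_tolerance(actual: float, expected: float, rel_tol: float = 1e-3) -> bool:
    """Numeric/tolerance check for a single value (the tolerance is an assumed default)."""
    return math.isclose(actual, expected, rel_tol=rel_tol)


def gate_and_score(gates_passed: Sequence[bool], graded_score: float) -> float:
    """Assumed reading: every gate must pass, otherwise the task scores 0."""
    return graded_score if all(gates_passed) else 0.0


def weighted_rubric(items: Sequence[Tuple[float, float]]) -> float:
    """Weight-normalized average over (weight, score) pairs."""
    total_weight = sum(weight for weight, _ in items)
    if total_weight <= 0:
        raise ValueError("rubric weights must sum to a positive value")
    return sum(weight * score for weight, score in items) / total_weight


def checklist_average(checks: Sequence[bool]) -> float:
    """Fraction of checklist items satisfied."""
    if not checks:
        raise ValueError("checklist must not be empty")
    return sum(checks) / len(checks)


def per_file_mean(file_scores: Mapping[str, float]) -> float:
    """Mean of per-file scores for multi-file deliverables."""
    if not file_scores:
        raise ValueError("at least one file score is required")
    return sum(file_scores.values()) / len(file_scores)


if __name__ == "__main__":
    # Hypothetical task: the output file must exist and open (gates),
    # then a numeric table is graded with a weighted rubric.
    graded = weighted_rubric([(2.0, 1.0), (1.0, 0.0)])
    print(gate_and_score([True, True], graded))
    print(gate_and_score([True, False], graded))
```

Under this reading, only a task that clears every gate and earns the maximum graded score would register as a full pass, while the other schemes yield partial credit that shows up in the mean score.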
This scoring design formalizes a strict distinction between partial completion and end-to-end completion. In ALE, the headline statistic is not merely whether an agent made progress, but whether it fully completed the professional deliverable under the benchmark’s evaluator.
4. Experimental setup and baseline performance
The reported experiments evaluate mainstream harness and backbone configurations in GCUA mode, each extended with a GUI-as-Tool bridge exposing 14 desktop actions, including mouse, keyboard, and screenshot operations (Sun et al., 3 Jun 2026). The harnesses listed are OpenClaw, ALE-Claw, Claude Code, Cursor, and Droid; the backbones listed are GPT-5.5, GPT-5.4, Opus 4.7, Sonnet 4.6, Gemini 3.1 Pro, DeepSeek V4, and GLM 5.1, among others (Sun et al., 3 Jun 2026).
The public evaluation split contains 150 tasks: 59 Near-Term, 55 Full-Spectrum, and 36 Last-Exam (Sun et al., 3 Jun 2026). Reported headline results are as follows.
| Tier | Public task count | Reported result |
|---|---|---|
| Near-Term | 59 | top FPR ≈42% |
| Full-Spectrum | 55 | top FPR ≈20% |
| Last-Exam | 36 | average FPR ≈2.6% across frontier configs |
The Near-Term top full pass rate is reported as approximately 42% for Codex + GPT-5.5, while Full-Spectrum reaches approximately 20% and Last-Exam remains at an average full pass rate of approximately 2.6% across frontier configurations (Sun et al., 3 Jun 2026). The paper further notes that, despite some models reaching approximately 80% on prior CLI-only or GUI-only benchmarks, their Last-Exam FPR on ALE remains in the single-digit range (Sun et al., 3 Jun 2026).
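As a rough sense of scale, the reported rates can be converted into implied task counts on the public split. The rates are approximate in the source, so the counts below are estimates derived here rather than figures reported by the paper.

$$
\begin{aligned}
\text{Near-Term:}\quad & 0.42 \times 59 \approx 24.8 && (\text{about 25 of 59 tasks}) \\
\text{Full-Spectrum:}\quad & 0.20 \times 55 = 11 && (\text{about 11 of 55 tasks}) \\
\text{Last-Exam:}\quad & 0.026 \times 36 \approx 0.94 && (\text{under 1 of 36 tasks on average})
\end{aligned}
$$

With 36 public Last-Exam tasks, a single full pass moves the tier FPR by $1/36 \approx 2.8$ percentage points, so an average of about 2.6% corresponds to slightly less than one fully passed task per configuration.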
These results position ALE as a headroom-rich benchmark. The hardest tier is not merely unsolved; it is far from saturated under the reported configurations.
5. Error sources, unsaturation, and research implications
A detailed failure-cause analysis for Claude Code + Opus 4.7 attributes approximately 75% of failures to high-level errors rather than to mechanical execution bugs (Sun et al., 3 Jun 2026). The two largest categories, which together account for that share, are flawed strategy or premature abandonment, at approximately 44%, and gaps in domain knowledge, at approximately 31% (Sun et al., 3 Jun 2026).
The paper interprets these findings as evidence that Last-Exam tasks require deep professional expertise, specialized tool chains, or non-trivial domain judgment, and that current agents lack the integrated domain understanding and long-horizon planning needed to close such workflows end-to-end (Sun et al., 3 Jun 2026). This suggests that the dominant failure mode in ALE is not simply brittle GUI control or action-level unreliability. A plausible implication is that progress on the benchmark will depend at least as much on domain grounding, workflow decomposition, and tool-chain semantics as on low-level computer-use dexterity.
The future directions explicitly highlighted are richer domain-knowledge retrieval, hybrid neuro-symbolic planning over professional APIs, and tighter tool-chain integration (Sun et al., 3 Jun 2026). These directions follow directly from the benchmark’s diagnosis that unsaturation persists primarily at the level of expertise and strategy.
6. Living-benchmark properties, economic orientation, and terminological disambiguation
ALE is designed as a living benchmark (Sun et al., 3 Jun 2026). Its task pool grows as new workflows and subdomains are onboarded through the same five-gate expert pipeline; public tasks are periodically retired to private status and replaced by fresh instances drawn from the private pool; and the taxonomy can evolve through extension of the frontier supplement as new digital professions emerge (Sun et al., 3 Jun 2026). This structure is intended to preserve an uncontaminated challenge for each new model generation.
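The rotation step can be sketched as a swap between two disjoint task pools. The source does not specify how tasks are selected or how many rotate at a time, so the uniform random choice and the `n_rotate` parameter below are assumptions, and `rotate_public_split` is a hypothetical name.

```python
import random
from typing import Set, Tuple


def rotate_public_split(
    public: Set[str],
    private: Set[str],
    n_rotate: int,
    rng: random.Random,
) -> Tuple[Set[str], Set[str]]:
    """Retire n_rotate public tasks to private status and promote as many private ones.

    Selection is uniform at random here, which is an assumption; the split
    sizes are preserved so the public fraction stays constant.
    """
    if not public.isdisjoint(private):
        raise ValueError("public and private pools must be disjoint")
    if n_rotate < 0 or n_rotate > len(public) or n_rotate > len(private):
        raise ValueError("n_rotate exceeds the size of one of the pools")

    # Sort before sampling so results are reproducible for a seeded rng.
    retiring = set(rng.sample(sorted(public), n_rotate))
    promoted = set(rng.sample(sorted(private), n_rotate))

    new_public = (public - retiring) | promoted
    new_private = (private - promoted) | retiring
    return new_public, new_private


if __name__ == "__main__":
    rng = random.Random(0)
    public = {f"task-{i:03d}" for i in range(10)}
    private = {f"task-{i:03d}" for i in range(10, 100)}
    new_public, new_private = rotate_public_split(public, private, 3, rng)
    print(len(new_public), len(new_private))
```

Because every promoted task comes from the previously private pool, each rotation exposes only instances that were not public in the preceding split.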
The benchmark’s broader purpose is to close what the paper describes as the gap between benchmark success and GDP-relevant impact (Sun et al., 3 Jun 2026). Its workflows are grounded in productive labor, including legal document filings, engineering simulations, and creative-media production, and the stated premise is that saturation of ALE would signal readiness for substantive professional responsibilities rather than performance on stylized exams alone (Sun et al., 3 Jun 2026).
The acronym “ALE” has prior and parallel uses in the literature. In reinforcement learning, ALE commonly denotes the Arcade Learning Environment, introduced as a unified interface to Atari 2600 games and described as a “final exam” for general-purpose reinforcement learning and planning agents (Bellemare et al., 2012). Later work on that benchmark emphasized evaluation protocols such as sticky actions, multi-mode support, reporting at fixed frame milestones, and human-normalized metrics (Machado et al., 2017). Separately, agentified assessment for logical reasoning agents uses an assessor agent to issue tasks, enforce execution budgets, parse outputs, and record structured failure types under a standardized agent-to-agent interface; that framework benchmarks an auto-formalization agent on a cleaned split of FOLIO and is methodologically distinct from the industry-workflow benchmark introduced as Agents' Last Exam (Ni et al., 3 Mar 2026).
For this reason, “ALE” is not a unique label across AI evaluation. In current usage, Agents' Last Exam refers specifically to the 2026 benchmark of long-horizon, economically valuable, real-world digital workflows with verifiable outcomes (Sun et al., 3 Jun 2026).