
AlphaEval: Evaluating Agents in Production

Published 14 Apr 2026 in cs.CL | (2604.12162v1)

Abstract: The rapid deployment of AI agents in commercial settings has outpaced the development of evaluation methodologies that reflect production realities. Existing benchmarks measure agent capabilities through retrospectively curated tasks with well-specified requirements and deterministic metrics -- conditions that diverge fundamentally from production environments where requirements contain implicit constraints, inputs are heterogeneous multi-modal documents with information fragmented across sources, tasks demand undeclared domain expertise, outputs are long-horizon professional deliverables, and success is judged by domain experts whose standards evolve over time. We present AlphaEval, a production-grounded benchmark of 94 tasks sourced from seven companies deploying AI agents in their core business, spanning six O*NET (Occupational Information Network) domains. Unlike model-centric benchmarks, AlphaEval evaluates complete agent products -- Claude Code, Codex, etc. -- as commercial systems, capturing performance variations invisible to model-level evaluation. Our evaluation framework covers multiple paradigms (LLM-as-a-Judge, reference-driven metrics, formal verification, rubric-based assessment, automated UI testing, etc.), with individual domains composing multiple paradigms. Beyond the benchmark itself, we contribute a requirement-to-benchmark construction framework -- a systematic methodology that transforms authentic production requirements into executable evaluation tasks in minimal time. This framework standardizes the entire pipeline from requirement to evaluation, providing a reproducible, modular process that any organization can adopt to construct production-grounded benchmarks for their own domains.

Summary

  • The paper presents AlphaEval as a benchmark that systematically transforms authentic production requirements into executable evaluation tasks.
  • It introduces a detailed requirement-to-benchmark framework incorporating multi-paradigm evaluation methods across six high-value domains.
  • Empirical results reveal significant performance gaps, scaffold-dependence, and domain variance, informing economic value-based agent selection.

AlphaEval: Authoritative Evaluation of AI Agents in Authentic Production Workflows

Introduction

"AlphaEval: Evaluating Agents in Production" (2604.12162) directly addresses the persistent disconnect between agent capability benchmarks and their actual performance in deployed, commercial environments. This work introduces AlphaEval, a benchmark suite grounded in production tasks sourced from organizations actively deploying AI agents. AlphaEval complements its benchmark with a formalized requirement-to-benchmark construction framework, enabling systematic transformation of production requirements into highly authentic, executable evaluation pipelines. This construction framework, taxonomy coverage, and empirically validated results establish AlphaEval as an essential infrastructure for measuring production readiness of AI agents.

Motivation and Research Context

Typical research benchmarks are constructed from curated, retrospective tasks with well-defined objectives and deterministic evaluation metrics. This lab-centric paradigm abstracts away the core complexities of production: ambiguous and under-specified business requirements; fragmented, multi-modal inputs; dynamic and evolving success criteria defined by practitioners; and the need for domain expertise and long-horizon deliverables. Recent studies have quantified this disconnect—over 80% of agent systems in deployment or pilot production phases are unsupported by representative benchmarks, and 63% of surveyed practitioners lack confidence that model iterations translate to product improvements.

AlphaEval is designed to explicitly target these deficiencies. The study sources 94 tasks from seven companies across six O*NET domains, capturing authentic production constraints, input heterogeneity, and evolving evaluation rubrics. The explicit aim is to enable rigorous, reproducible, and rapid construction of production-grounded evaluation tasks that are extensible across new domains and organizations.

Construction Framework and Benchmark Design

The AlphaEval requirement-to-benchmark construction framework consists of four stages:

  1. Partner engagement: Selection of organizations with revenue-critical workflows involving agent systems and access to authentic task requirements.
  2. Requirement elicitation: Iterative dialogue to extract, clarify, and refine complex workflow segments into tasks suitable for mechanized evaluation, ensuring preservation of production ambiguity and implicit constraints.
  3. Task formalization: Standardized packaging of each task, including source documents (PDF, Excel, code, images), domain-aligned evaluation specifications, and ground-truth construction (a schematic task-package sketch follows this list). Over 40% of tasks require complex business documents as input.
  4. Iterative validation: Multiple refinement cycles with domain practitioners to ensure task and rubric quality remain synchronized with evolving business standards.
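
To make stage 3 concrete, here is a minimal sketch of what a formalized task package could look like in code. The `TaskPackage` structure and its field names are illustrative assumptions, not the paper's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class TaskPackage:
    """Hypothetical container for one formalized AlphaEval-style task."""
    task_id: str
    domain: str                          # e.g. an O*NET domain label
    requirement: str                     # elicited requirement text, ambiguity preserved
    source_documents: list[str] = field(default_factory=list)  # PDF/Excel/code/image paths
    evaluation_spec: dict = field(default_factory=dict)        # rubric, references, checks
    ground_truth: dict = field(default_factory=dict)           # expert-built reference, if any
    human_time_hours: float = 0.0        # annotation used for labor-value weighting
    hourly_wage_usd: float = 0.0

    @property
    def labor_value_usd(self) -> float:
        """Estimated economic value of the task when performed by a human."""
        return self.human_time_hours * self.hourly_wage_usd
```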

AlphaEval’s taxonomy coverage is broader than that of prior benchmarks. Tasks incorporate multiple evaluation paradigms—reference and semantic answer verification, rubric-based and formal logic verification, execution-based UI and functional testing, and LLM-as-a-Judge semantic assessment—mirroring the multi-dimensional nature of production quality.
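
One plausible realization of this paradigm composition is a per-task set of scorers whose outputs are aggregated into a single 0-100 score. The scorer functions and weights below are toy assumptions for illustration, not the paper's implementation; an LLM-as-a-Judge or UI-test scorer would plug in as just another entry:

```python
from typing import Callable

# Each scorer maps (agent_output, task_spec) -> a score in [0, 1].
Scorer = Callable[[str, dict], float]

def rubric_scorer(output: str, spec: dict) -> float:
    """Toy rubric check: fraction of required criteria mentioned in the output."""
    criteria = spec.get("rubric_criteria", [])
    if not criteria:
        return 0.0
    return sum(c.lower() in output.lower() for c in criteria) / len(criteria)

def reference_scorer(output: str, spec: dict) -> float:
    """Toy reference check: exact match against a ground-truth answer."""
    return 1.0 if output.strip() == str(spec.get("reference", "")).strip() else 0.0

def composite_score(output: str, spec: dict,
                    scorers: dict[str, Scorer], weights: dict[str, float]) -> float:
    """Weighted combination of paradigm-specific scores, scaled to 0-100."""
    raw = sum(weights[name] * fn(output, spec) for name, fn in scorers.items())
    return 100.0 * raw / sum(weights.values())

score = composite_score(
    "Recommendation: vendor B minimizes total cost under the delivery constraint.",
    {"rubric_criteria": ["vendor", "cost", "constraint"], "reference": ""},
    scorers={"rubric": rubric_scorer, "reference": reference_scorer},
    weights={"rubric": 0.7, "reference": 0.3},
)
print(score)  # 70.0 for this toy input
```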

Task and Domain Coverage

AlphaEval’s 94 tasks span six O*NET domains, mapping directly to production labor categories with high economic value:

  • Human Resources: Resume screening with compound subjective/objective criteria, requiring PDF/JPEG input processing and cross-candidate calibration. Scored by F1 alignment with human interview selection (a worked F1 sketch appears after this list).
  • Finance & Investment: Generation of investment segment reports, multi-layered business plan analysis, and unstructured pitch critique with professional template adherence and LLM-based structural/semantic assessment.
  • Procurement & Operations: Constrained optimization on large-scale structured data inputs (up to 2,000-row Excel catalogs), evaluated by strict constraint satisfaction and cost optimality.
  • Software Engineering: Development of full-stack applications from extensive product requirement documentation; evaluated with automated end-to-end UI testing and codebase structure analysis.
  • Healthcare & Life Sciences: Clinical eCRF protocol execution, regulatory policy analysis, and multi-step arithmetic reasoning within domain-specific rule systems.
  • Technology Research: Dynamic market and startup landscape investigations requiring information retrieval, multi-source synthesis, and evidence-backed report generation.

Each task is explicitly annotated with estimated human time and wage value, and per-task scoring is standardized and cross-domain comparable.
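
As a worked illustration of the Human Resources scoring described above, the agent's shortlist can be compared against the human interviewers' selections with F1. The candidate IDs below are invented for the example:

```python
def f1_alignment(agent_selected: set[str], human_selected: set[str]) -> float:
    """F1 between the agent's shortlist and the human interview selection."""
    if not agent_selected or not human_selected:
        return 0.0
    true_positives = len(agent_selected & human_selected)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(agent_selected)
    recall = true_positives / len(human_selected)
    return 2 * precision * recall / (precision + recall)

# Invented screening run: the agent shortlists 4 candidates, humans interviewed 5.
agent_picks = {"c01", "c04", "c07", "c09"}
human_picks = {"c01", "c04", "c05", "c09", "c12"}
print(round(f1_alignment(agent_picks, human_picks), 3))  # 3 overlap -> P=0.75, R=0.60, F1=0.667
```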

Experimental Protocol and Results

Evaluation Setup

  • Models: Six leading models (Claude Opus 4.6, GPT-5.2, Gemini 3 Pro Preview, Kimi K2.5, GLM-5, MiniMax M2.5) spanning proprietary and open-source lines.
  • Agent products ("scaffolds"): Four commercial deployment platforms (Claude Code, Codex, GitHub Copilot, and Cursor). Agents are tested in Docker-isolated, version-pinned environments (a minimal containerized-run sketch follows this list).
  • Configurations: Fourteen model-scaffold pairings, selected for deployment prevalence and practical evaluation tractability.
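
A Docker-isolated, version-pinned run could be reproduced with a thin harness such as the following. The image tag, mount layout, and agent CLI invocation are placeholders, not the paper's actual configuration:

```python
import subprocess

def run_task_in_container(image: str, task_dir: str, agent_cmd: list[str]) -> str:
    """Run one agent configuration against one task inside an isolated container."""
    result = subprocess.run(
        ["docker", "run", "--rm",
         "--network", "none",                # cut outside access unless the task needs web search
         "-v", f"{task_dir}:/workspace:ro",  # task inputs mounted read-only
         image, *agent_cmd],
        capture_output=True, text=True, timeout=3600,
    )
    return result.stdout

# Hypothetical call; the image tag pins the scaffold version, the CLI is a placeholder.
# output = run_task_in_container("agent-scaffold:1.2.3", "/data/tasks/hr_001",
#                                ["run-agent", "--task", "/workspace/task.json"])
```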

Main Results

The empirical findings are definitive:

  • Absolute performance gap: The best configuration (Claude Code + Opus 4.6) achieves only 64.41/100 across all tasks, indicating that SOTA agents are far from automating complex commercial workflows.
  • Scaffold-dependence: The same base model exhibits up to a 15-point score differential across scaffolds (e.g., Opus 4.6: 64.41 via Claude Code vs. 53.45 via Codex), demonstrating that real-world performance cannot be inferred from model-level benchmarks alone.
  • Dramatic inter-domain variance: Top scores reach 88.09 in Procurement & Operations but only 38.91 in Human Resources, with models’ domain ranking non-monotonic and not reliably predicted by aggregate scores.
  • Score ranking vs. economic value ranking: Domain-weighted labor value shows that aggregate score ranking and delivered economic value are not aligned; for example, Gemini 3 Pro configurations can outperform higher-scoring ones in value-centric domains, directly shaping agent selection strategy in production (a simplified value calculation is sketched below).
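
A simplified sketch of the domain-weighted value computation behind this finding. The per-domain scores and labor values are invented numbers, chosen only to show how a configuration with a lower mean score can deliver more economic value:

```python
# Per-domain scores (0-100) for two hypothetical configurations, and the total
# human labor value (USD) of the benchmark tasks in each domain. All numbers invented.
scores_a = {"procurement": 90.0, "software_eng": 75.0, "human_resources": 40.0}
scores_b = {"procurement": 65.0, "software_eng": 60.0, "human_resources": 72.0}
labor_value_usd = {"procurement": 5_000, "software_eng": 8_000, "human_resources": 20_000}

def mean_score(scores: dict[str, float]) -> float:
    return sum(scores.values()) / len(scores)

def delivered_value(scores: dict[str, float], value: dict[str, float]) -> float:
    # Treat a domain score as the fraction of that domain's labor value the agent captures.
    return sum(scores[d] / 100.0 * value[d] for d in scores)

print(mean_score(scores_a), mean_score(scores_b))    # ~68.3 vs ~65.7: A ranks higher by score
print(delivered_value(scores_a, labor_value_usd),
      delivered_value(scores_b, labor_value_usd))    # 18500.0 vs 22450.0: B delivers more value
```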

Evaluation Robustness

  • Statistical reliability: Three-run repeat evaluations yield tight 95% confidence intervals (overall ±1.83), so rankings and score differentials are robust to sampling variance.
  • Meta-evaluation: Automated LLM-as-a-Judge decisions achieve up to 89.7% agreement with human experts (Cohen's κ = 0.72, substantial agreement), establishing the reliability of the automated judging (a minimal κ computation is sketched below).
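
For context, Cohen's κ measures agreement beyond chance between the automated judge and the human expert. A minimal sketch of the binary-label computation on fabricated pass/fail decisions (not the paper's data):

```python
def cohens_kappa(pairs: list[tuple[str, str]]) -> float:
    """Cohen's kappa between two raters labelling the same items."""
    n = len(pairs)
    labels = {label for pair in pairs for label in pair}
    p_obs = sum(a == b for a, b in pairs) / n
    p_exp = sum(
        (sum(a == label for a, _ in pairs) / n) * (sum(b == label for _, b in pairs) / n)
        for label in labels
    )
    return (p_obs - p_exp) / (1 - p_exp)

# Fabricated example: LLM judge (first) vs. human expert (second) on 10 pass/fail calls.
decisions = [("pass", "pass")] * 5 + [("fail", "fail")] * 3 + [("pass", "fail"), ("fail", "pass")]
print(round(cohens_kappa(decisions), 2))  # 0.58 on this toy data; the paper reports kappa = 0.72
```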

Failure Modes and Cognitive Analysis

Six failure modes are systematically surfaced through large-scale qualitative and quantitative analysis:

  1. Cascade dependency failure: Single-point errors propagate through dependent calculations (e.g., anchoring errors in clinical protocols lead to compound downstream miscalculations).
  2. Subjective judgment collapse: Agents extract factually correct data but fail on holistic, non-quantitative judgment (e.g., inability to infer cultural fit or soft skills from resumes).
  3. Information retrieval pathology: Five sub-modes including factual hallucination (making up plausible but incorrect details based on past data), rigid or incomplete search, and attribution confusion.
  4. Cross-section logical inconsistency: Internal contradictions within long-form outputs, particularly in multi-section analytic deliverables.
  5. Constraint misinterpretation: Implicit or domain-specific constraints are routinely violated, with agents exhibiting “synergy blindness” in optimization problems and strong feasibility bias.
  6. Format compliance failure: Outputs structurally valid but incompatible with business or system consumption requirements.

These failure modes are unique to the production context and are invisible to conventional code-oriented or task-driven research benchmarks.

Practical and Theoretical Implications

The transition from model-centric evaluation to system-level, domain-weighted, and value-grounded assessment underpins several implications:

  • Production readiness must be measured at the product/agent-system level, not at the raw model level; scaffold, domain, prompt, and execution context are as influential as the choice of underlying model.
  • Value-based agent selection: Organizations can perform domain-weighted economic optimization over agent solutions, systematically routing task classes to configurations that yield highest labor value replacement rather than maximizing aggregate mean scores.
  • Benchmark extensibility: The open-source requirement-to-benchmark construction methodology enables other organizations to rapidly instantiate their own production-grounded benchmarks, adapt to task drift, and objectively monitor progress.

Limitations and Future Directions

While AlphaEval substantially bridges the research-production evaluation gap, there are open limitations:

  • Domain coverage: Current tasks cover six high-value O*NET domains; legal, education, creative, and other major segments are unrepresented.
  • Temporal dynamics: Quality standards and agent capabilities evolve rapidly in production; mechanisms for continuous longitudinal evaluation are necessary.
  • Extensible frameworks: Only a subset of prominent agent systems and scaffolds are covered; enterprise in-house platforms and emergent open-source solutions warrant inclusion.
  • Economic estimation assumptions: Labor value calibration depends on domain-expert adjustment and wage model robustness.

Future development should prioritize expanded occupation coverage, benchmarking of memory and learning-enhanced agents, metric disentanglement (e.g., safety, efficiency, compliance), and longitudinal post-deployment tracking.

Conclusion

AlphaEval establishes a new standard for agent evaluation realism by embedding evaluation in authentic economic, domain, and workflow contexts. The results provide compelling evidence that even leading agent systems remain far from production-level competence on complex, ambiguous, and multi-modal tasks, and that scaffold and context are as critical as model selection. The distinction between performance scores and true production value has direct operational consequences for system architecture, agent procurement, and organizational ROI optimization.

By open-sourcing both benchmark and construction pipeline, AlphaEval explicitly invites and enables community-driven evolution toward genuinely useful and reliable AI deployment in high-value occupational domains.


Reference:

"AlphaEval: Evaluating Agents in Production" (2604.12162)
