Beyond Accuracy: A Multi-Dimensional Framework for Evaluating Enterprise Agentic AI Systems
Abstract: Current agentic AI benchmarks predominantly evaluate task-completion accuracy while overlooking critical enterprise requirements such as cost-efficiency, reliability, and operational stability. Through systematic analysis of 12 prominent benchmarks and empirical evaluation of state-of-the-art agents, we identify three fundamental limitations: (1) absence of cost-controlled evaluation, leading to 50x cost variation for comparable accuracy; (2) inadequate reliability assessment, where agent performance drops from 60% (single run) to 25% (8-run consistency); and (3) missing multidimensional metrics for security, latency, and policy compliance. We propose CLEAR (Cost, Latency, Efficacy, Assurance, Reliability), a holistic evaluation framework designed specifically for enterprise deployment. Evaluation of six leading agents on 300 enterprise tasks demonstrates that optimizing for accuracy alone yields agents 4.4-10.8x more expensive than cost-aware alternatives with comparable performance. Expert evaluation (N=15) confirms that CLEAR better predicts production success (correlation ρ = 0.83) than accuracy-only evaluation (ρ = 0.41).
Knowledge Gaps: Limitations and Open Questions
The following list identifies what remains missing, uncertain, or unexplored in the paper, framed as concrete and actionable directions for future research:
- Precise metric formalization and reproducibility: The definitions and normalization for CNA, CPS, SCR, PAS, and pass@k need exact mathematical specifications (units, denominators, scaling, edge-case handling) and standard procedures for computing confidence intervals and statistical significance across runs.
- Reliability estimation robustness: pass@k is computed on 10 repeats of 60 tasks; evaluate sensitivity to sample size, task selection, and non-determinism sources (model temperature, environment variability), and report CIs, bootstrapped intervals, and test-retest reliability over time.
- Threshold justification and mapping to business SLOs: The chosen reliability target (e.g., pass@8 ≥ 80%) and SLA thresholds (e.g., 3 s for support, 30 s for code) are not theoretically or empirically justified; derive thresholds from empirical user outcomes, business loss models, and industry SLOs, and perform sensitivity analyses of CLEAR under threshold changes.
- Composite score weighting validity: The min-max normalization and equal weights (w_i = 0.2) may distort cross-domain comparability; develop principled, learnable weights aligned to business objectives (e.g., via multi-objective optimization, preference elicitation, or utility modeling) and test robustness of rankings to weighting schemes.
- Cost model completeness and stability: Token-based API costs exclude infrastructure, engineering overhead, guardrail/tooling costs, caching, batching, concurrency scaling, and incident costs; construct a comprehensive cost-of-ownership model and quantify the impact of pricing drift and vendor changes over time on CLEAR.
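The paper's exact pass@k estimator and interval procedure are not reproduced here; one reproducible option, sketched below under the assumption that each task is repeated n times with binary outcomes, is the unbiased all-k-runs-succeed estimator C(c, k)/C(n, k) combined with a percentile bootstrap over tasks. Function names and defaults are illustrative, not the paper's specification.

```python
import random
from math import comb

def pass_hat_k(c: int, n: int, k: int) -> float:
    """Unbiased estimate that k runs drawn without replacement from n
    recorded runs (c of them successful) ALL succeed: C(c,k) / C(n,k)."""
    if c < k:
        return 0.0
    return comb(c, k) / comb(n, k)

def suite_pass_hat_k(run_matrix: list[list[bool]], k: int) -> float:
    """Average the per-task estimate; run_matrix[t] holds the binary
    outcomes of every repeat of task t."""
    return sum(pass_hat_k(sum(r), len(r), k) for r in run_matrix) / len(run_matrix)

def bootstrap_ci(run_matrix, k, n_boot=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI obtained by resampling tasks with replacement."""
    rng = random.Random(seed)
    stats = sorted(
        suite_pass_hat_k(
            [run_matrix[rng.randrange(len(run_matrix))] for _ in run_matrix], k
        )
        for _ in range(n_boot)
    )
    return stats[int(alpha / 2 * n_boot)], stats[int((1 - alpha / 2) * n_boot) - 1]
```

With 10 repeats per task (n = 10) and k = 8 this matches the paper's reported setting, and the bootstrap makes the sensitivity to the 60-task sample size directly visible.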
- Tail latency and load realism: Report latency distributions (p95/p99), cold-start effects, concurrency/load tests, queueing delays, and streaming response behavior; current averages obscure production-relevant tails and throughput constraints.
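The tail-versus-mean point is easy to make concrete: a nearest-rank percentile over per-request latency samples suffices. A minimal sketch; the sample data and millisecond units are illustrative.

```python
import math
from statistics import mean

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile, p in (0, 100]."""
    s = sorted(samples)
    return s[max(0, math.ceil(p / 100 * len(s)) - 1)]

def latency_report(samples_ms: list[float]) -> dict:
    """Mean plus the tail figures (p95/p99) that averages obscure."""
    return {
        "mean": mean(samples_ms),
        "p50": percentile(samples_ms, 50),
        "p95": percentile(samples_ms, 95),
        "p99": percentile(samples_ms, 99),
    }
```

Nine fast requests and one 5-second outlier yield a mean that looks healthy while p99 does not, which is exactly why the bullet asks for distributions rather than averages.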
- Security coverage and severity modeling: PAS treats all violations uniformly; incorporate severity-weighted scores, near-miss events, exploitability, and blast radius; expand adversarial coverage beyond prompt injection (e.g., tool misuse, data exfiltration, jailbreaking, supply-chain vulnerabilities, RBAC bypass).
- Hallucination and error taxonomy: Define and measure hallucination rates and error categories per domain (e.g., incorrect legal interpretations, unsafe code patterns), including severity and detectability; evaluate guardrail efficacy and its cost/latency tradeoffs.
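One way to move PAS beyond uniform violation counting is a severity-weighted penalty, sketched below. The severity labels, weights, and per-task cap are illustrative assumptions, not the paper's definition.

```python
# Illustrative severity weights -- an assumption, not taken from the paper.
SEVERITY_WEIGHTS = {"low": 1.0, "medium": 3.0, "high": 10.0, "critical": 30.0}

def severity_weighted_pas(violations: list[str], n_tasks: int,
                          weights=SEVERITY_WEIGHTS,
                          cap_per_task: float = 30.0) -> float:
    """Policy adherence as 1 minus a normalized, severity-weighted penalty.
    violations: severity label of each violation observed across n_tasks;
    cap_per_task bounds the worst-case penalty so the score stays in [0, 1]."""
    penalty = sum(weights[v] for v in violations)
    return max(0.0, 1.0 - penalty / (cap_per_task * n_tasks))
```

A single critical violation zeroes a one-task score, while two minor violations barely dent a ten-task suite; that asymmetry is the point of severity weighting.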
- Policy evaluation rigor: Document policy sources, annotation guidelines, and adjudication processes; measure false positives/negatives in policy violation detection and inter-annotator agreement for compliance judgments across domains.
- Generalization across organizations and domains: Validate the Enterprise Task Suite across multiple companies, industries (healthcare, finance, manufacturing), and regulatory regimes; include out-of-distribution tasks and holdout sets to test generalization claims.
- Multilingual and multimodal coverage: Extend tasks beyond English to multilingual settings with locale-specific policies, and include multimodal inputs/outputs (voice, documents, screenshots, spreadsheets) common in enterprise workflows.
- Long-horizon and stateful scenarios: Incorporate tasks exceeding 15 steps, persistent memory/state, session continuity, and cross-session dependencies; evaluate how agents manage context accumulation, forgetting, and recovery.
- Dynamic environments and concept drift: Test agent stability under knowledge base updates, API changes, policy revisions, and adversary adaptation; quantify performance decay and recovery in online, non-stationary settings.
- Multi-agent coordination evaluation: Although cited, multi-agent orchestration is not evaluated; design protocols and metrics for coordination efficiency, communication overhead, conflict resolution, and emergent failure modes.
- Tooling and integration realism: Evaluate end-to-end workflows with heterogeneous tools (databases, ticketing, CI/CD, ERP/CRM), offline/failed tool calls, rate limits, and permission constraints; measure recovery, rollback, and audit trail fidelity.
- Human-in-the-loop (HITL) impact: Quantify how human oversight, triage, and escalation pathways affect CLEAR dimensions, costs, and reliability; develop metrics for HITL efficiency, agreement, and error catching.
- Fairness, bias, and ethics: Add fairness metrics across user cohorts and task types, quantify disparate impact, and integrate ethical risk scoring (e.g., PII exposures, sensitive attribute handling) into Assurance.
- Mapping CLEAR to compliance frameworks: Operationalize how CLEAR dimensions satisfy standards (GDPR, SOC 2, ISO 27001), define audit artifacts, and verify traceability from agent actions to compliance controls.
- Automated cost-aware architecture search: Formalize and evaluate automated methods to select agent architectures/hyperparameters under CLEAR constraints (e.g., Pareto optimization, Bayesian optimization, constrained RL), including dynamic routing across models/tools.
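The cost-aware search the bullet describes ultimately reduces to keeping only non-dominated configurations; a minimal Pareto filter over (accuracy, cost) pairs is sketched below. Candidate names and numbers are hypothetical.

```python
def pareto_frontier(candidates: list[tuple[str, float, float]]) -> list[tuple[str, float, float]]:
    """Keep configurations no other candidate beats on both axes, i.e. at
    least as accurate AND at most as costly, with strict improvement on one.
    candidates: (name, accuracy, cost_usd_per_task)."""
    front = []
    for name, acc, cost in candidates:
        dominated = any(
            a2 >= acc and c2 <= cost and (a2 > acc or c2 < cost)
            for _, a2, c2 in candidates
        )
        if not dominated:
            front.append((name, acc, cost))
    return front
```

Given hypothetical agents A (0.60 accuracy, $2.00/task), B (0.58, $0.20), and C (0.55, $0.50), C is dominated by B and drops out; a Bayesian or constrained optimizer would then explore along the surviving frontier.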
- Robustness to reflection loops: Analyze causal effects of reflection/planning iterations on efficacy vs. cost, latency, and reliability; identify optimal stopping criteria and safeguards against error amplification.
- User-centered outcomes: Beyond expert readiness, collect user satisfaction, task utility, error harm, and rework rates; model the relationship between CLEAR and real user outcomes with prospective studies and A/B tests.
- Reproducibility and release details: Provide versioned endpoints, prompts, seeds, evaluation harness, and licensing for the dataset/code; address model drift and endpoint updates that threaten reproducibility and longitudinal comparability.
- Uncertainty quantification and calibration: Measure output confidence, calibrate uncertainty estimates, and develop decision policies (e.g., abstain/escalate) that improve Assurance and Reliability under uncertainty.
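Expected calibration error (ECE) and a confidence-gated abstain/escalate rule are standard starting points for this bullet; a minimal sketch follows, with bin count and thresholds chosen for illustration only.

```python
def expected_calibration_error(confidences, correct, n_bins: int = 10) -> float:
    """Bin predictions by stated confidence and sum |avg confidence - accuracy|
    weighted by bin size; 0.0 means perfectly calibrated."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        bins[min(int(conf * n_bins), n_bins - 1)].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for b in bins:
        if b:
            avg_conf = sum(c for c, _ in b) / len(b)
            acc = sum(1 for _, ok in b if ok) / len(b)
            ece += len(b) / n * abs(avg_conf - acc)
    return ece

def decide(confidence: float, answer, abstain_below=0.55, escalate_below=0.80):
    """Route by calibrated confidence: abstain, escalate to a human, or answer."""
    if confidence < abstain_below:
        return ("abstain", None)
    if confidence < escalate_below:
        return ("escalate", answer)
    return ("answer", answer)
```

Calibration matters here because the routing thresholds are only meaningful if a stated 0.8 actually corresponds to roughly 80% accuracy.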
- Cross-model comparability: Standardize tokenization, context-window effects, and tool-call accounting across models to ensure fair cost and latency comparisons; address differences between closed vs. open models.
- Failure mode root-cause analysis: Create a structured taxonomy and diagnostic pipeline to attribute failures to planning, tool use, retrieval, generation, or compliance layers; use this to guide targeted improvements and reporting.
- CLEAR portability across domains: Investigate whether CLEAR scores are comparable across domains or require domain-specific calibrations; develop domain-adjusted normalization to avoid misleading cross-domain rankings.
- Severity-aware Reliability: Incorporate partial credit and severity-weighted reliability (e.g., benign vs. catastrophic failures) rather than binary success, and study its economic implications via CPS.
- Economic sensitivity analysis: Quantify how small accuracy gains vs. large cost increases affect total cost of ownership at scale (e.g., 10k–1M tasks), and define decision boundaries where expensive architectures are economically justified.
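The decision boundary this bullet asks for has a closed form in a simple model where every failed task incurs a fixed remediation cost F: the pricier agent pays off once F exceeds Δcost / Δaccuracy. A sketch under that assumption; all dollar figures below are hypothetical.

```python
def total_cost(n_tasks: int, cost_per_task: float, accuracy: float,
               failure_cost: float) -> float:
    """Total cost of ownership in a simple model: per-task spend plus a
    fixed remediation cost for every failed task."""
    return n_tasks * (cost_per_task + (1.0 - accuracy) * failure_cost)

def breakeven_failure_cost(c_hi: float, acc_hi: float,
                           c_lo: float, acc_lo: float) -> float:
    """Remediation cost F above which the pricier, more accurate agent is
    cheaper overall; setting the two total costs equal gives
    F* = (c_hi - c_lo) / (acc_hi - acc_lo). Assumes acc_hi > acc_lo."""
    return (c_hi - c_lo) / (acc_hi - acc_lo)
```

For hypothetical agents at $1.50/task with 62% accuracy versus $0.15/task with 58%, F* = $33.75: unless a failed task costs more than that to remediate, the 10x-cheaper agent wins at any volume, which is the kind of boundary the analysis should report.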
- Streaming and incremental interaction: Evaluate agents that stream partial answers, refine iteratively, and negotiate with users; measure effects on latency, satisfaction, and error recovery compared to single-shot responses.
- Policy conflict resolution in multi-stakeholder tasks: Design benchmarks that explicitly encode conflicting policies/priorities and measure how agents negotiate, seek approvals, and maintain compliance without deadlock.
- Lifecycle monitoring and drift detection: Propose metrics and infrastructure for continuous monitoring of CLEAR dimensions, alerting, retraining triggers, and post-incident analyses in production.
These items aim to guide researchers toward high-impact extensions that make CLEAR-based evaluations more rigorous, representative, and predictive of real-world enterprise deployment success.