- The paper introduces EnterpriseClawBench, a novel benchmark that converts real enterprise AI session logs into 852 reproducible tasks with a detailed role and skill taxonomy.
- It employs a multi-dimensional evaluation framework that assesses artifact delivery, outcome quality, runtime, and cost across diverse modalities in enterprise settings.
- Results reveal significant model-harness interactions, cost-score trade-offs, and challenges in skill transfer and artifact calibration, emphasizing the need for robust enterprise agent architectures.
EnterpriseClawBench: Advancing the Evaluation of Enterprise AI Agents with Real-World Session-Derived Tasks
Motivation and Benchmark Construction
EnterpriseClawBench addresses the deficit of high-fidelity, reproducible benchmarks for evaluating enterprise agents operating in authentic workplace environments. Unlike previous benchmarks that often rely on simulated or human-authored tasks, EnterpriseClawBench utilizes a comprehensive archive of proprietary agent sessions from an internal deployment at an AI startup. The benchmark construction pipeline automates the conversion of session logs into 852 reproducible tasks, each accompanied by fixtures, prompts, a detailed taxonomy (seven role classes and 45 skill subclasses), hard delivery rules, and semantic rubrics.
The construction pipeline enforces stringent gates: length and fixture recovery, redaction restoration, network-dependency filtering, and a final sandbox preflight to ensure each task is fully self-contained and reproducible. Because tasks come from real sessions rather than authored scenarios, both the input distribution and the deliverable requirements stay realistic across a broad range of enterprise functions: product/project, engineering/IT, HR/admin, executive, sales/customer, marketing, and finance/operations.
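The gating step described above can be sketched as an ordered list of predicates, where a candidate task is kept only if it passes every gate. This is a minimal illustration, not the authors' pipeline: the `CandidateTask` fields, the `MIN_PROMPT_CHARS` threshold, and the reading of "length" and "fixture recovery" as two separate gates are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class CandidateTask:
    """Hypothetical record for one task candidate extracted from a session log."""
    task_id: str
    prompt: str
    fixtures_recovered: bool   # all referenced input files were recovered
    redactions_restored: bool  # redacted spans were replaced with usable stand-ins
    needs_network: bool        # task depends on live external services
    preflight_passed: bool     # dry run in a clean sandbox succeeded


MIN_PROMPT_CHARS = 40  # assumed threshold, not taken from the paper

# Each gate is (name, predicate). Order matters: cheap checks run first.
Gate = Tuple[str, Callable[[CandidateTask], bool]]
GATES: List[Gate] = [
    ("length", lambda t: len(t.prompt) >= MIN_PROMPT_CHARS),
    ("fixture_recovery", lambda t: t.fixtures_recovered),
    ("redaction_restoration", lambda t: t.redactions_restored),
    ("network_dependency", lambda t: not t.needs_network),
    ("sandbox_preflight", lambda t: t.preflight_passed),
]


def first_failed_gate(task: CandidateTask) -> Optional[str]:
    """Return the name of the first gate the task fails, or None if it passes all."""
    for name, passes in GATES:
        if not passes(task):
            return name
    return None


def filter_tasks(
    candidates: List[CandidateTask],
) -> Tuple[List[CandidateTask], Dict[str, str]]:
    """Split candidates into kept tasks and a task_id -> failed-gate map."""
    kept: List[CandidateTask] = []
    rejected: Dict[str, str] = {}
    for task in candidates:
        failed = first_failed_gate(task)
        if failed is None:
            kept.append(task)
        else:
            rejected[task.task_id] = failed
    return kept, rejected
```

Recording the first failed gate per rejected candidate makes it possible to see which gate removes the most tasks, which is useful when tuning a pipeline of this shape.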
Benchmark Scope and Task Taxonomy
EnterpriseClawBench's taxonomy exposes both role-level and skill-level structure, allowing rich analysis at multiple abstraction levels. Tasks are not artificially balanced, preserving the natural distribution observed in practice: product/project and engineering/IT dominate, while HR, executive, sales, marketing, and finance form a specialized long tail. Inputs and expected outputs span diverse modalities (text, spreadsheets, code, HTML, presentations, images), reflecting modern enterprise workflows.
The taxonomy enables skill-oriented experiments. For each task, roles and skill subclasses allow identification of repeatable operational patterns and facilitate explicit skill-transfer studies at the class level, a feature not systematically covered in prior benchmarks.
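A minimal sketch of how such a two-level taxonomy could be represented and queried is shown here. The seven role values are the role classes named in this summary; the `TaskLabel` record and the helper names are hypothetical, and the skill subclass is kept as a free-form string because the full list of 45 subclasses is not given here.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List


class Role(Enum):
    PRODUCT_PROJECT = "product/project"
    ENGINEERING_IT = "engineering/IT"
    HR_ADMIN = "HR/admin"
    EXECUTIVE = "executive"
    SALES_CUSTOMER = "sales/customer"
    MARKETING = "marketing"
    FINANCE_OPERATIONS = "finance/operations"


@dataclass(frozen=True)
class TaskLabel:
    """Hypothetical taxonomy label attached to one task."""
    task_id: str
    role: Role
    skill: str  # name of one skill subclass


def role_distribution(labels: List[TaskLabel]) -> Dict[Role, float]:
    """Share of tasks per role, without any rebalancing."""
    counts = Counter(label.role for label in labels)
    total = sum(counts.values())
    return {role: count / total for role, count in counts.items()}


def tasks_for_skill(labels: List[TaskLabel], skill: str) -> List[str]:
    """Task ids in one skill subclass, e.g. to build a skill-transfer split."""
    return [label.task_id for label in labels if label.skill == skill]
```

Keeping the role as an enum and the skill as a label on each task lets the same records drive both the role-level distribution analysis and the class-level selection of tasks for skill-transfer experiments.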
Multi-Dimensional Evaluation Protocol
A central contribution of EnterpriseClawBench is the multi-dimensional evaluation framework. Performance must be jointly reported across harness-model combinations, with dimensions capturing artifact delivery success (type, presence, format, openability), outcome quality (multi-criteria semantic judging for both text and visual modalities), runtime, and cost (token usage, compute/time, monetary). Judging is routed by artifact modality: text outputs are LLM-evaluated using a five-dimension semantic rubric (grounded accuracy, relevance, substantive depth, practical utility, communication quality); non-text outputs (HTML, slides, spreadsheets, images) are rendered and visually scored by LLMs.
Unlike benchmarks that collapse results into a single score, EnterpriseClawBench exposes important interactions. Systemic harness-model effects, cost-score trade-offs, role/skill-level difficulty, and expected artifact types are all disaggregated, enabling diagnosis of failure and variance sources.
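The routing and reporting logic described in this section can be sketched as follows. The five rubric dimension names come from the text; everything else is an assumption for illustration, including the file-extension mapping, the equal-weight mean used to combine rubric dimensions, and the `RunReport` field names.

```python
from dataclasses import dataclass
from pathlib import PurePath
from statistics import mean
from typing import Dict

# The five text-rubric dimensions named in the summary.
TEXT_RUBRIC = (
    "grounded_accuracy",
    "relevance",
    "substantive_depth",
    "practical_utility",
    "communication_quality",
)

# Assumed mapping: artifacts with these extensions are rendered and judged visually.
VISUAL_EXTENSIONS = {".html", ".pptx", ".xlsx", ".png", ".jpg"}


def route_judge(artifact_path: str) -> str:
    """Pick the judge type from the artifact's file extension."""
    suffix = PurePath(artifact_path).suffix.lower()
    return "visual" if suffix in VISUAL_EXTENSIONS else "text"


def text_quality(scores: Dict[str, float]) -> float:
    """Combine rubric scores with an equal-weight mean (an assumed aggregation)."""
    missing = [dim for dim in TEXT_RUBRIC if dim not in scores]
    if missing:
        raise ValueError(f"missing rubric dimensions: {missing}")
    return mean(scores[dim] for dim in TEXT_RUBRIC)


@dataclass
class RunReport:
    """One harness-model run, with dimensions kept separate rather than collapsed."""
    harness: str
    model: str
    delivered: bool   # artifact present, right type and format, and openable
    quality: float    # judged outcome quality
    runtime_s: float
    cost_usd: float
```

Keeping delivery, quality, runtime, and cost as separate fields mirrors the point made above: a run that fails delivery and a run that delivers a poor artifact are different failures and should remain distinguishable in the report.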
Evaluation and Empirical Results
The main study runs 32 harness-model combinations on an audited subset of 120 tasks. The harnesses are Claude Code, Codex, DeepAgents, Hermes, and OpenClaw; the models include GPT-5.5, Sonnet 4.6, Opus 4.6, Haiku 4.5, Kimi K2.6, MiniMax-M3, Qwen3-235B-A22B, and DeepSeek V4 Pro. All evaluation is sandboxed per task for reproducibility.
Key findings:
- The best configuration (Codex/GPT-5.5) achieves only 0.663 on the main metric, indicating unsaturated performance and substantial unsolved enterprise-agent challenges.
- Performance depends crucially on harness-model pairing: Sonnet 4.6 is stable across Claude Code, DeepAgents, and OpenClaw, but drops markedly with Hermes due to agent-environment compatibility issues that block file delivery or truncate artifact generation.
- Cost and score exhibit a log-like relationship: incremental cost yields diminishing returns beyond the mid-tier, with some harnesses (notably Hermes with Claude-family models) suffering high cost with no score gain.
- Task performance varies by role class (marketing and finance/ops are most challenging, likely due to domain complexity and lack of open training data) and by expected artifact (strong text judging calibration, but inflated/less calibrated scores for spreadsheet and visual outputs).
- Semantic analysis reveals that most systems are better at communication and relevance than at grounded accuracy, often missing key evidence from large, heterogeneous enterprise inputs.
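The "log-like" cost-score relationship in the findings above can be made concrete with a small least-squares fit of score = a + b * ln(cost). This is an illustrative sketch only: the functional form is one simple reading of "log-like", the function names are hypothetical, and no numbers from the benchmark are used.

```python
import math
from typing import Sequence, Tuple


def fit_log_curve(costs: Sequence[float], scores: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least squares for score = a + b * ln(cost). Costs must be positive."""
    if len(costs) != len(scores) or len(costs) < 2:
        raise ValueError("need at least two (cost, score) pairs of equal length")
    xs = [math.log(c) for c in costs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(scores) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all costs are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def gain_from_scaling_cost(b: float, factor: float = 2.0) -> float:
    """Score gained by multiplying cost by `factor`; constant under the log model."""
    return b * math.log(factor)


def gain_per_extra_dollar(b: float, cost: float) -> float:
    """Derivative of the fitted curve: marginal score per unit cost at `cost`."""
    return b / cost
```

Under this model, doubling the spend always buys the same score increment `b * ln 2`, so the gain per additional dollar falls as `b / cost`. That is the diminishing-returns pattern the finding describes; configurations that sit well below the fitted curve are the ones paying high cost with no score gain.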
A scalability check using four DeepAgents harness-model pairs on all 852 tasks confirms that the broad performance ranking observed on the audited subset persists, supporting the pipeline's robustness.
Skill Transfer and Generalization
EnterpriseClawBench enables systematic evaluation of skill generalization. In the frontend page generation subclass, agent-specific skills are distilled by running consumer agents on in-domain tasks, followed by feedback aggregation and skill injection. Performance delta on held-out tasks quantifies skill transfer.
Notable outcomes:
- Skill transfer is high variance and strongly creator-dependent: GPT-5.5 as a skill creator consistently improves consumer agents (+0.0681), whereas Haiku 4.5 as a creator can degrade performance (-0.0941).
- Skill creation and consumption capabilities are not always aligned: a weak skill creator can yield positive transfer for strong consumers, and strong consumers can regress if paired with suboptimal skills.
- Reporting average skill scores is insufficient; a matrix disaggregation (by skill creator, consumer, and baseline performance) is necessary to understand transfer dynamics.
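The matrix disaggregation recommended above can be sketched as a table of score deltas indexed by (creator, consumer), where each delta is the held-out score with the injected skill minus the same consumer's baseline. The data layout and function names are assumptions for illustration, and the usage note below uses placeholder agent names rather than benchmark results.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Pair = Tuple[str, str]  # (skill creator, skill consumer)


def transfer_matrix(
    baseline: Dict[str, float],
    with_skill: Dict[Pair, float],
) -> Dict[Pair, float]:
    """Held-out score with an injected skill minus the consumer's no-skill baseline."""
    return {
        (creator, consumer): score - baseline[consumer]
        for (creator, consumer), score in with_skill.items()
    }


def mean_delta_by_creator(deltas: Dict[Pair, float]) -> Dict[str, float]:
    """Average transfer per creator; hides consumer-level variation by design."""
    grouped: Dict[str, List[float]] = defaultdict(list)
    for (creator, _consumer), delta in deltas.items():
        grouped[creator].append(delta)
    return {creator: sum(vals) / len(vals) for creator, vals in grouped.items()}


def regressions(deltas: Dict[Pair, float]) -> List[Pair]:
    """Creator-consumer pairs where the injected skill lowered the score."""
    return sorted(pair for pair, delta in deltas.items() if delta < 0)
```

For example, with placeholder agents, `transfer_matrix({"x": 0.5}, {("a", "x"): 0.6, ("b", "x"): 0.4})` yields a positive delta for creator `a` and a negative one for creator `b` on the same consumer. Averaging those two cells would report roughly zero transfer, which is why the per-cell view matters.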
Judge Reliability and Calibration
EnterpriseClawBench audits judge reliability both inter-LLM and LLM-vs-human. While Sonnet 4.6 and GPT-5.4 have high rank correlation for text (ρ=0.918), visual-judge calibration with humans remains problematic (MAE 0.303; negative rank correlation), especially for complex artifacts (presentations, spreadsheets). This exposes an important limitation in the current automated evaluation stack for enterprise use.
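The two agreement statistics mentioned here, Spearman rank correlation (ρ) and mean absolute error (MAE), can be computed as in the sketch below. It uses the standard definitions (Pearson correlation of average ranks, with ties sharing the mean rank); it is not the authors' evaluation code, and the function names are chosen for this example.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation; raises if either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def spearman_rho(a: Sequence[float], b: Sequence[float]) -> float:
    """Spearman rank correlation between two judges' scores on the same items."""
    if len(a) != len(b):
        raise ValueError("score lists must have equal length")
    return pearson(average_ranks(a), average_ranks(b))


def mean_absolute_error(judge: Sequence[float], human: Sequence[float]) -> float:
    """Average absolute gap between judge scores and human scores."""
    if len(judge) != len(human):
        raise ValueError("score lists must have equal length")
    return sum(abs(j - h) for j, h in zip(judge, human)) / len(judge)
```

The two statistics answer different questions: ρ checks whether two judges order the artifacts the same way, while MAE checks whether their scores sit on the same scale. A judge can rank well and still be miscalibrated, and a negative ρ against humans, as reported for visual artifacts, means even the ordering is unreliable.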
Implications for Enterprise Agent Research
EnterpriseClawBench demonstrates that enterprise agent capability is not explainable by base LLM performance alone but must account for dynamic harness interaction, delivery pathway compatibility, and multidimensional output quality. Task-class skill generalization and robust artifact evaluation, especially in multimodal settings, remain critical research targets. The interdependence of model, harness, role, and skill class underscores the operational challenges in deploying large-agent systems in production enterprise environments.
The findings suggest that further advances in:
- Robustness to diverse real-world input modalities
- Harness and toolchain compatibility
- Calibrated, trustworthy multimodal LLM-based evaluation
- Generalizable, transferable agent skills extracted from workplace interaction traces
are preconditions for closing the automation gap in enterprise artifact delivery and task support.
Conclusion
EnterpriseClawBench establishes an end-to-end protocol for benchmarking enterprise agents under real session-derived constraints, emphasizing reproducibility, multidimensionality, and skill-level analysis. The benchmark reveals unsolved challenges and nontrivial system-level interactions that are invisible in narrower, simulation-driven benchmarks. The protocol and evaluation methodology, rather than dataset release, define the reusable contribution, setting a new standard for future agent benchmarks in enterprise settings.
Limitations include its single-organization scope, non-release of task data due to privacy, and residual immaturity of LLM judges for visual artifacts. These highlight avenues for methodological innovation and cross-organization collaborative benchmark construction. The benchmark sets the stage for further research on robust agent architectures, tool integration, and evaluation paradigms for production-grade enterprise automation.