EnterpriseClawBench: Dual Benchmarking

Updated 4 July 2026

EnterpriseClawBench is defined by two complementary benchmarks that target enterprise-agent workflows: one simulating persistent, multiday tasks with deterministic rule-based checkers, and another converting real-world session archives into reproducible tasks.
The ClawMark component operates in a dynamic, evolving sandbox with exogenous loud events and silent mutations, emphasizing long-horizon collaboration across multimodal office workflows.
The real-session benchmark integrates hard-rule validations with LLM-based semantic rubrics to evaluate artifact quality, harness sensitivity, and cost–latency metrics, reflecting practical enterprise constraints.

EnterpriseClawBench denotes two 2026 enterprise-agent benchmarking efforts that share a workplace focus but differ substantially in construction, environment design, and evaluation protocol. In one usage, it is the enterprise marketing name for ClawMark, a benchmark for “multi-turn, multi-day, multimodal coworker agents” operating in an evolving stateful sandbox with deterministic rule-based scoring (Meng et al., 26 Apr 2026). In another, it is the title of a benchmark “constructed from proprietary, real-world agent sessions,” yielding 852 reproducible tasks together with a reusable construction and evaluation protocol, while withholding raw benchmark data because the sessions contain internal enterprise content (Zhong et al., 22 Jun 2026). The term therefore refers not to a single canonical corpus, but to two complementary benchmark programs for enterprise agents: one centered on living-world persistence and exogenous state change, the other on recovered workplace sessions, artifact delivery, and harness-sensitive multidimensional evaluation.

1. Naming, scope, and disambiguation

The naming overlap is operationally important because the two benchmarks target related but distinct failure modes. The first benchmark is officially named ClawMark, with “EnterpriseClawBench” used only in enterprise marketing; its core object is a multiday coworker workflow executed inside five stateful services and evaluated by deterministic Python checkers (Meng et al., 26 Apr 2026). The separately titled EnterpriseClawBench begins from archived internal agent sessions and converts them into reproducible tasks with rewritten prompts, role classes, skill subclasses, hard rules, and semantic rubrics (Zhong et al., 22 Jun 2026).

Aspect	ClawMark / “EnterpriseClawBench”	EnterpriseClawBench
Primary source	Authored benchmark tasks	Proprietary real-world agent sessions
Scale	100 tasks	852 tasks
Evaluation style	Deterministic Python checkers; no LLM-as-judge	Hard rules plus LLM-based semantic rubrics

This distinction also prevents a common interpretive error: the two benchmarks are not alternative names for the same dataset. They share enterprise-agent scope, multimodal artifacts, and tool use, but they encode different assumptions about persistence, reproducibility, and what counts as successful task completion.

2. ClawMark as a living-world enterprise benchmark

ClawMark is designed for persistent coworker agents whose environment changes independently across multiple working days. The released corpus contains 100 realistic office-workflow tasks spanning 2–6 in-universe working days, with a mean of 3.6 turns, across 13 professional scenarios and 87 distinct in-task roles; the scenarios include clinical assistant, insurance claims, investment analysis, project management, legal, and EDA (Meng et al., 26 Apr 2026). The benchmark includes 1,072 raw multimodal artifacts, specifically audio clips, videos, scanned PDFs, images, and spreadsheets.

Each task runs in an isolated Docker sandbox that provides five stateful services: a filesystem, email via GreenMail SMTP/IMAP, calendar via Radicale CalDAV, a Notion-compatible knowledge-base API, and a Google-Sheets-compatible spreadsheet API. Between turns, the framework injects two forms of exogenous change. “Loud events” are explicitly announced in the wake-up prompt, while “silent mutations” are unannounced updates that must be discovered, such as new log files, overwritten spreadsheet rows, or calendar adjustments. This architecture operationalizes a setting in which the agent cannot assume that the post-turn world state is merely the consequence of its own prior actions.
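As an illustration of what discovering a silent mutation can involve, the following is a minimal sketch of snapshot diffing over the filesystem service only. It is not ClawMark's own mechanism; the function names `snapshot` and `diff_snapshots` are hypothetical, and the other four services (email, calendar, knowledge base, spreadsheet) would need analogous state queries.

```python
import hashlib
from pathlib import Path


def snapshot(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to a SHA-256 digest of its bytes."""
    return {
        str(p.relative_to(root)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def diff_snapshots(before: dict[str, str], after: dict[str, str]) -> dict[str, list[str]]:
    """Classify paths as added, removed, or modified between two snapshots."""
    return {
        "added": sorted(set(after) - set(before)),
        "removed": sorted(set(before) - set(after)),
        "modified": sorted(
            path for path in before.keys() & after.keys() if before[path] != after[path]
        ),
    }
```

An agent following this pattern would take a snapshot at the end of each turn and diff it against a fresh snapshot at wake-up, so that changes not mentioned in the wake-up prompt still surface.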

The task design emphasizes composite workflows rather than isolated API calls. In insurance_task5, a 6-turn property-claim adjudication task, the agent must acknowledge receipt of claim materials, create a calendar reminder, defer approval or rejection until a fire-department report arrives, compare later repair quotes, update the knowledge base, schedule meetings, and generate a final approval memo. In content_operation_task7, a 3-turn DevSummit operations task, the agent must reconcile a voice memo, a silently added walkthrough video whose frames may be extracted with ffmpeg, and later PDF and Excel budget materials, then update a Notion event record. These examples show that the benchmark is not merely multimodal in input format; it is multimodal in workflow topology, requiring cross-artifact and cross-service integration.

3. ClawMark verification, metrics, and empirical findings

ClawMark uses fully rule-based evaluation. Every task ships with 6–29 Python checker functions, with a mean of 15.4, and the release totals 1,537 deterministic Python checkers over post-execution service state; 55 checkers are “red-line” constraints for compliance-critical should-not-do conditions (Meng et al., 26 Apr 2026). The checker categories include filesystem or artifact inspections, external-backend state queries across email, calendar, knowledge base, and spreadsheet services, numeric tolerances, and semantic equivalence. Red-line checkers carry high fixed weights and cover forbidden actions such as premature approvals, compliance bypass, data exfiltration, and irreversible writes.
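To make the checker categories concrete, here is a minimal sketch of one numeric-tolerance checker and one red-line checker. Everything in it is an assumption made for illustration: the `State` dictionary layout, the `Checker` dataclass, the expected payout of 12,500, the 0.5% tolerance, and the weights are hypothetical and are not taken from the ClawMark release.

```python
import math
from dataclasses import dataclass
from typing import Callable

State = dict  # hypothetical snapshot of post-execution service state


@dataclass(frozen=True)
class Checker:
    name: str
    weight: float
    check: Callable[[State], bool]
    red_line: bool = False


def payout_within_tolerance(state: State) -> bool:
    """Numeric-tolerance check: recorded payout within 0.5% of an assumed expected value."""
    return math.isclose(state["sheet"]["payout"], 12_500.0, rel_tol=0.005)


def no_premature_approval(state: State) -> bool:
    """Red-line check: no approval email was sent before the fire report arrived."""
    report_turn = state["fire_report_turn"]
    return not any(
        mail["kind"] == "approval" and mail["turn"] < report_turn
        for mail in state["sent_mail"]
    )


CHECKERS = [
    Checker("payout_within_tolerance", 1.0, payout_within_tolerance),
    # Red-line checkers get a high fixed weight (the value 5.0 is illustrative).
    Checker("no_premature_approval", 5.0, no_premature_approval, red_line=True),
]


def run_checkers(state: State) -> dict[str, bool]:
    """Evaluate every checker against the final state; the result is deterministic."""
    return {c.name: c.check(state) for c in CHECKERS}
```

Because each checker is a pure function of the final service state, re-running it on the same state always returns the same verdict, which is the property that rule-based evaluation relies on.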

With $\mathcal{T}$ denoting the set of tasks, $\mathcal{C}(\tau)$ the checker set for task $\tau$, $w_c$ the checker weight, and $\mathrm{pass}_c(m,\tau)\in\{0,1\}$ the deterministic outcome of checker $c$ for model $m$, the benchmark defines Weighted Score and Strict Task Success as

$\mathrm{WS}(m)=100\times \frac{1}{|\mathcal{T}|}\sum_{\tau\in\mathcal{T}} \frac{\sum_{c\in\mathcal{C}(\tau)} w_c\,\mathrm{pass}_c(m,\tau)} {\sum_{c\in\mathcal{C}(\tau)} w_c}$

and

$\mathrm{TS}(m)=100\times \frac{\left|\left\{\tau\in\mathcal{T}:\forall c\in\mathcal{C}(\tau),\ \mathrm{pass}_c(m,\tau)=1\right\}\right|} {|\mathcal{T}|}.$
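The two definitions translate directly into code. The sketch below assumes a hypothetical input layout in which each task maps to a list of `(weight, passed)` pairs, one per checker; the function names are illustrative.

```python
def weighted_score(results: dict[str, list[tuple[float, bool]]]) -> float:
    """WS: mean over tasks of the weight-normalized fraction of passed checkers, times 100."""
    per_task = [
        sum(w for w, ok in checks if ok) / sum(w for w, _ in checks)
        for checks in results.values()
    ]
    return 100.0 * sum(per_task) / len(per_task)


def strict_task_success(results: dict[str, list[tuple[float, bool]]]) -> float:
    """TS: percentage of tasks in which every checker passes."""
    solved = sum(all(ok for _, ok in checks) for checks in results.values())
    return 100.0 * solved / len(results)


# Two toy tasks: the first passes 1 of 4 weight units, the second passes everything.
toy = {
    "t1": [(1.0, True), (3.0, False)],
    "t2": [(2.0, True), (2.0, True)],
}
print(weighted_score(toy))       # 62.5
print(strict_task_success(toy))  # 50.0
```

The toy example shows how the two metrics diverge: a single failed checker removes a task from TS entirely while still contributing partial credit to WS.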

The benchmarked systems exhibit a marked gap between partial progress and complete workflow completion. Claude Sonnet 4.6 achieves $\mathrm{WS}=75.8$, while Claude Opus 4.6 attains the highest Strict Task Success. Weighted Score and Strict Task Success are also reported for GPT-5.4 (high-effort), Gemini 3.1 Pro Preview, Qwen 3.6 Plus, Kimi K2.6 (open), and Kimi K2.5 (open) (Meng et al., 26 Apr 2026). The strongest weighted performance therefore coexists with low all-or-nothing completion, indicating that many agents solve fragments of enterprise workflows without satisfying every required state transition and deliverable condition.

Turn-level analysis on 3-turn tasks localizes a major failure mode. Day 1, before exogenous change, is the highest-performing stage. Day 2, the first external update, induces a sharp drop for 6 of 7 models, with Qwen 3.6 Plus the lone exception. Day 3 partially recovers but usually remains below Day 1, and the performance gap between Sonnet 4.6 and GPT-5.4 narrows from Day 1 to Day 3. The benchmark explicitly identifies several open challenges: failure to detect unannounced file or spreadsheet changes, with a reported 56.5% failure rate; backend writeback reliability, where agents reason correctly but fail to commit results, with a 53.6% failure rate; multiday memory and context drift; and brittle end-to-end audio-to-vision-to-data workflows.

4. EnterpriseClawBench from real workplace sessions

The separately titled EnterpriseClawBench is built from a three-month archive, from March to May 2026, of internal agent sessions at a 100-person AI startup conducted through private and group chats on an enterprise collaboration platform with a Linux /workspace and mounted /inputs and /outputs (Zhong et al., 22 Jun 2026). The raw archive contains 5,291 TaskInstances extracted from multi-turn dialogues, uploaded files, tool traces, generated artifacts, and persistent state.

Construction begins with mechanical gates. These include a length filter that drops messages with fewer than 10 effective characters, a fixture lookup that verifies that each declared attachment path resolves to a reproducible file, a restricted redaction recovery that restores a redacted URL or path only when it is context-unique or host-unique with at least high confidence, and rejection of tasks that depend on unreachable external links. After these gates, a manual self-containment decision determines whether the remaining turn or turns define a concrete, single-turn prompt; ambiguous cases are rejected. Prompt rewriting then normalizes multi-turn or group-chat noise into a standalone user prompt that specifies a precise deliverable list and the fixture paths under /inputs.
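The mechanical gates can be sketched as simple predicates. This is a minimal illustration under stated assumptions, not the benchmark's pipeline: "effective characters" is read here as non-whitespace characters, link reachability is passed in as a precomputed flag, redaction recovery is omitted, and all function names are hypothetical.

```python
from pathlib import Path

MIN_EFFECTIVE_CHARS = 10  # threshold stated in the text


def effective_length(message: str) -> int:
    """Count non-whitespace characters (an assumed reading of 'effective')."""
    return sum(1 for ch in message if not ch.isspace())


def passes_length_filter(message: str) -> bool:
    return effective_length(message) >= MIN_EFFECTIVE_CHARS


def fixtures_resolve(attachment_paths: list[str], fixture_root: Path) -> bool:
    """Every declared attachment must resolve to an existing file under fixture_root."""
    root = fixture_root.resolve()
    for rel in attachment_paths:
        target = (root / rel).resolve()
        if not target.is_relative_to(root) or not target.is_file():
            return False
    return True


def passes_mechanical_gates(
    message: str,
    attachment_paths: list[str],
    fixture_root: Path,
    has_unreachable_links: bool,
) -> bool:
    """Combine the length filter, fixture lookup, and external-link gate."""
    return (
        passes_length_filter(message)
        and fixtures_resolve(attachment_paths, fixture_root)
        and not has_unreachable_links
    )
```

Candidates that pass these predicates would still go through the manual self-containment decision and prompt rewriting, which are not mechanical and are not modeled here.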

Further metadata assignment adds a role-class, a skill-subclass, expected deliverables, a hard-rule set for deliverable validity, and semantic rubric templates per modality. A sandbox preflight then tests the task in a stateless container by uploading inputs, invoking the harness, downloading outputs, verifying no failures, and “locking in” the task. From 5,291 candidates, 852 tasks survive this pipeline, which the benchmark summarizes as an approximately 16% pass rate.

The resulting tasks cover multiple input modalities. Text inputs include DOCX, PDF, TXT, HTML, JSON, and CSV; tables include XLSX and CSV; images include PNG and JPG; and code inputs include Python scripts and shell. Tool traces show the use of python-pptx, pandas/openpyxl, Playwright for HTML rendering, shell commands such as grep and awk, and custom enterprise “skills” such as an “automatic-evaluation” skill. Example deliverables include spreadsheets for financial reconciliation and calibration, HTML pages for product documentation and dashboards, PPT or PDF decks for marketing and event summaries, code and JSON outputs such as API configuration snippets, and images such as annotated roadshow screenshots.

The taxonomy comprises seven role classes inspired by O*NET and APQC: product_project_delivery, engineering_IT, HR_admin, executive, sales_customer, marketing, and finance_ops. These expand into 45 skill subclasses, including product__artifact_presentation_delivery, engineering__API_design_debugging, finance__spreadsheet_reconciliation, and marketing__event_presentation_design. The formal hierarchy is presented as BigClass → SubclassID → SubclassName → SubclassDefinition and RoleClass → RoleDefinition → SkillSubclass → SkillRouteSummary.

5. EnterpriseClawBench evaluation protocol and baseline results

EnterpriseClawBench uses a two-layer evaluation protocol rather than a purely deterministic checker system. The hard-rule layer verifies the correct number and type of output files, non-emptiness and openability, and the absence of placeholders such as "{…}" and Python tracebacks in text outputs (Zhong et al., 22 Jun 2026). The semantic-rubric layer scores five weighted dimensions: grounded accuracy, task relevance, substantive depth, practical utility, and communication quality.

For each task, the five dimension scores are combined into a weighted composite semantic score. The benchmark averages this composite over tasks to obtain a semantic average and reports an overall performance figure that also accounts for the hard-rule pass rate. The judges are LLM-based: Sonnet 4.6 is used for both text and visual routes on Lite, and visual artifacts are rendered and scored by a visual-route LLM. The protocol also reports cost in CNY, average runtime per task, tool-call counts, and evidence-warning flags. A central methodological claim of the benchmark is that enterprise-agent evaluation should report harness–model combinations, artifact delivery, visual quality, cost, runtime, and skill-transfer behavior rather than collapse performance into a single score.
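The hard-rule layer of this protocol lends itself to a deterministic sketch. The version below is an assumption-laden illustration rather than the benchmark's implementation: the function name, the set of suffixes treated as text, and the violation messages are hypothetical, and the openability check for binary formats such as XLSX or PPTX is omitted.

```python
import re
from pathlib import Path

PLACEHOLDER = re.compile(r"\{…\}|\{\.\.\.\}")  # unresolved "{…}" or "{...}"
TRACEBACK = "Traceback (most recent call last):"
TEXT_SUFFIXES = {".txt", ".html", ".json", ".csv", ".md"}  # assumed text types


def hard_rule_violations(output_dir: Path, expected_suffixes: list[str]) -> list[str]:
    """Return human-readable violations; an empty list means all rules pass."""
    violations = []
    files = sorted(p for p in output_dir.iterdir() if p.is_file())
    # Rule 1: the number and types of deliverables must match expectations.
    if sorted(p.suffix.lower() for p in files) != sorted(expected_suffixes):
        violations.append("wrong number or type of output files")
    for p in files:
        # Rule 2: deliverables must be non-empty.
        if p.stat().st_size == 0:
            violations.append(f"{p.name}: empty file")
            continue
        # Rule 3: text outputs must not contain placeholders or tracebacks.
        if p.suffix.lower() in TEXT_SUFFIXES:
            text = p.read_text(encoding="utf-8", errors="replace")
            if PLACEHOLDER.search(text):
                violations.append(f"{p.name}: unresolved placeholder")
            if TRACEBACK in text:
                violations.append(f"{p.name}: Python traceback in output")
    return violations
```

A task's hard-rule verdict would then be `not hard_rule_violations(...)`, and a pass rate over tasks follows by averaging those verdicts.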

The benchmark evaluates 32 harnesses (“claws”) paired with models on a manually audited 120-task Lite subset. The harnesses include Claude Code, Codex, DeepAgents, Hermes, and OpenClaw, and the models include GPT-5.5, Sonnet 4.6, Opus 4.6, Haiku 4.5, Kimi K2.6, MiniMax-M3, GPT-4.1-mini, Qwen3-235B-A22B, and DeepSeek V4 Pro. The best Lite result is Codex with GPT-5.5, at a cost of approximately 32 CNY per task and a runtime of approximately 45 seconds. A second tier spans roughly 0.62–0.64 for Sonnet 4.6 under Claude Code, DeepAgents, and OpenClaw. The same Sonnet 4.6 under Hermes drops to 0.458 because of harness blockage, and the lowest overall results are Hermes paired with Claude or GPT-5.5 due to blocked file writes. This makes harness effects first-order rather than incidental.

On the full 852 tasks, using the DeepAgents harness only and without manual audit, the benchmark reports the following results.

Model	Overall score	Hard-rule pass rate
GPT-5.5	0.766	0.959
Sonnet 4.6	0.749	0.957
Haiku 4.5	0.632	0.963
GPT-4.1-mini	0.336	0.817

The corresponding text and visual subscores are 0.813 and 0.642 for GPT-5.5, 0.793 and 0.634 for Sonnet 4.6, 0.666 and 0.542 for Haiku 4.5, and 0.383 and 0.213 for GPT-4.1-mini. Role-class analysis reports lower scores for finance/ops and marketing tasks, around 0.55–0.60, than for engineering/IT or product tasks, around 0.65–0.70. Spreadsheet and presentation artifacts are described as visually judged with inflated scores relative to text, while grounded accuracy is often the lowest semantic dimension, with examples including missed “cash-flow analysis” and mishandled “Err:504” cells.

The benchmark also reports judge calibration and transfer experiments. LLM–LLM judge agreement is measured separately for text and for visual evaluation. Against human judgment, Sonnet 4.6 is better calibrated on text than on visual routes. Skill-transfer experiments define a transfer metric computed after distilling a skill from 10 in-domain tasks and evaluating on 5 held-out tasks. The reported matrix shows high variance, with some creator–consumer pairs gaining and others degrading.

6. Comparative significance, limitations, and reporting implications

Taken together, the two EnterpriseClawBench lines articulate different but compatible views of enterprise-agent evaluation. ClawMark foregrounds persistent, stateful, multiday collaboration with exogenous loud events and silent mutations, and it enforces reproducibility through deterministic checkers and a release-gate guarantee of bit-identical verdicts on independent re-runs (Meng et al., 26 Apr 2026). The real-session EnterpriseClawBench foregrounds recovery of authentic workplace tasks, artifact-centric outputs, harness–model interactions, skill-transfer experiments, and the need to report cost and latency alongside quality (Zhong et al., 22 Jun 2026).

A notable methodological contrast concerns scoring philosophy. ClawMark explicitly avoids LLM-as-judge and inspects post-execution service state directly. EnterpriseClawBench instead combines hard-rule validation with LLM-based semantic and visual judging, then documents the calibration gap between text and visual routes. This suggests two different notions of benchmark fidelity: one prioritizes deterministic verification of world-state transitions, while the other prioritizes ecological validity of recovered business artifacts and the practical conditions under which they are produced.

The limitations are also distinct. For the real-session benchmark, the paper states three principal constraints: single-company deployment, under-calibrated visual judges, and the inability to open-source fixtures and ground truth because of data privacy. For ClawMark, the principal limitations appear indirectly through the empirical bottlenecks it exposes: difficulty adapting to silent mutations, backend writeback failures, context drift across days, and brittle raw multimodal pipelines. A plausible implication is that enterprise evaluation cannot be reduced to a single axis of model capability. One benchmark shows a large separation between weighted partial progress and strict end-to-end completion; the other explicitly argues that evaluation must report harness–model combinations, artifact delivery, visual quality, cost, runtime, and skill-transfer behavior rather than rely on a single aggregate number.

In practical research use, the two benchmarks therefore serve different experimental agendas. ClawMark is suited to studying long-horizon adaptation in evolving service environments. EnterpriseClawBench is suited to studying recovered workplace tasks, output artifact validity, harness sensitivity, and multidimensional reporting under privacy constraints. Read together, they define EnterpriseClawBench not as one fixed benchmark identity, but as a 2026 benchmark family for enterprise agents operating over tools, files, persistent state, and business deliverables.

References

Meng et al. ClawMark: A Living-World Benchmark for Multi-Turn, Multi-Day, Multimodal Coworker Agents (26 Apr 2026)

Zhong et al. EnterpriseClawBench: Benchmarking Agents from Real Workplace Sessions (22 Jun 2026)
