EnterpriseOps-Gym: Enterprise AI Benchmark

Updated 19 March 2026

EnterpriseOps-Gym is a comprehensive framework simulating realistic, policy-rich enterprise operations to evaluate agentic AI performance.
It features persistent-state containers, multi-domain APIs, and 1,150 expert tasks that model complex workflows and safety protocols.
Quantitative results reveal that state-of-the-art LLMs struggle with long-horizon, policy-intensive tasks, highlighting challenges for future research.

EnterpriseOps-Gym is a benchmark suite and environment framework for evaluating and training agentic AI systems in realistic, policy- and state-rich enterprise operations. Extending earlier work in DevOps agent benchmarking and operations-research RL environments, it introduces persistent-state containers, multi-domain tool APIs, and expert-authored tasks that model the complexities of professional workflows, including stringent policy controls and multi-tenant scenarios. The sandbox design, the coverage of eight enterprise domains, and the verifier-based metrics are intended to support research on robust, safe, and generalizable agentic reasoning in enterprise contexts (Malay et al., 13 Mar 2026).

1. Architectural Foundations and System Design

EnterpriseOps-Gym is architected as a Docker-resident sandbox with two tightly coupled components: a relational database (~164 interrelated tables) and a tool-execution layer of 512 REST/RPC endpoints implementing domain and policy logic. Each benchmark episode launches an isolated environment with a reproducible seed dataset (e.g., users, assets, calendars, cases), so no state leaks between episodes. Tables average 1.7 foreign keys each (rising to roughly 2.4 in HR), which forces multi-step dependency resolution across operational silos such as Customer Service, IT, and HR. Tools implement permission checks, enforce access policies, trigger cross-table invariants, and change durable system state, closely mirroring real-world enterprise operations.
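The combination of per-episode isolation, seeded data, foreign keys, and permission-checked tools can be illustrated with a minimal sketch. It assumes an in-memory SQLite database standing in for the sandbox container; the schema, the `new_episode` helper, and the `reassign_asset` tool are hypothetical names invented for illustration and are not part of the benchmark's actual API.

```python
import sqlite3

def new_episode():
    """Create an isolated in-memory database seeded with identical rows each time."""
    db = sqlite3.connect(":memory:")
    db.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
    db.executescript("""
        CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, role TEXT);
        CREATE TABLE assets (
            id INTEGER PRIMARY KEY,
            label TEXT,
            owner_id INTEGER REFERENCES users(id)
        );
        INSERT INTO users VALUES (1, 'alice', 'it_admin'), (2, 'bob', 'employee');
        INSERT INTO assets VALUES (10, 'laptop-10', 2);
    """)
    return db

def reassign_asset(db, actor_id, asset_id, new_owner_id):
    """Hypothetical tool: only an it_admin may reassign; the FK rejects unknown owners."""
    row = db.execute("SELECT role FROM users WHERE id = ?", (actor_id,)).fetchone()
    if row is None or row[0] != "it_admin":
        return {"ok": False, "error": "permission_denied"}
    try:
        with db:  # commits on success, rolls back if the statement raises
            db.execute(
                "UPDATE assets SET owner_id = ? WHERE id = ?",
                (new_owner_id, asset_id),
            )
    except sqlite3.IntegrityError:
        return {"ok": False, "error": "unknown_owner"}
    return {"ok": True}

db = new_episode()
denied = reassign_asset(db, actor_id=2, asset_id=10, new_owner_id=1)
invalid = reassign_asset(db, actor_id=1, asset_id=10, new_owner_id=99)
allowed = reassign_asset(db, actor_id=1, asset_id=10, new_owner_id=1)
```

Calling `new_episode()` again returns a fresh database in which the asset still belongs to its original owner, which is the property that rules out cross-episode leakage in this toy setting.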

Seeds and data are initially authored by domain SMEs and expanded to support task diversity. Each tool invocation is monitored for database state transitions, policy compliance, and potential side effects.
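One simple way to observe database state transitions around a tool call is to snapshot the relevant tables before and after it and compare the row sets. The sketch below assumes SQLite and two toy tables; `snapshot` and `diff` are hypothetical helpers, and the benchmark's own monitoring mechanism is not described in the source.

```python
import sqlite3

def snapshot(db, tables):
    """Capture every row of the named tables as a set of tuples.

    Table names are interpolated directly, so they must come from trusted code.
    """
    return {t: set(db.execute(f"SELECT * FROM {t}").fetchall()) for t in tables}

def diff(before, after):
    """Return, per table, the rows that appeared and disappeared between snapshots."""
    changes = {}
    for table in before:
        added = after[table] - before[table]
        removed = before[table] - after[table]
        if added or removed:
            changes[table] = {"added": sorted(added), "removed": sorted(removed)}
    return changes

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE tickets (id INTEGER PRIMARY KEY, status TEXT);
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    INSERT INTO tickets VALUES (1, 'open'), (2, 'open');
    INSERT INTO users VALUES (1, 'alice');
""")
tables = ["tickets", "users"]

before = snapshot(db, tables)
db.execute("UPDATE tickets SET status = 'closed' WHERE id = 1")  # the "tool call"
after = snapshot(db, tables)
changes = diff(before, after)
```

An updated row shows up as one removed tuple and one added tuple, and untouched tables are absent from the result, which makes unintended side effects easy to spot.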

2. Task Suite and Enterprise Domain Coverage

EnterpriseOps-Gym comprises 1,150 expert-constructed tasks, balanced across eight mission-critical domains:

Customer Service Management (CSM)
Human Resources (HR)
IT Service Management (ITSM)
Email
Calendar
Teams
Drive
Hybrid (cross-silo flows)

Task trajectory lengths average 11.5–13.3 steps in operational domains, with HR tasks featuring up to 34 steps. Scenarios include lifecycle management (e.g., SLA escalation, asset reassignment), collaborative orchestration (e.g., calendar invitation, folder permission management), asset/warranty verification, and complex multi-domain flows requiring internal API orchestration and policy reasoning. Thirty tasks are explicitly infeasible, designed to assess agent safe-refusal capability.

Tasks are crafted to stress persistent state reasoning, strict access control, referential integrity, and policy-correct lifecycle propagation (e.g., account offboarding, SSO role changes). This breadth is intended to benchmark robustness far beyond traditional tool-calling or narrow workflow automation.

3. Formal Evaluation Metrics and Task Success Criteria

The primary state space $s \in S$ is the joint set of all database rows plus dynamic artifacts such as webhooks. Actions $a \in A$ are parameterized tool invocations; the transition function $s_{t+1} = T(s_t, a_t)$ is defined by the sandbox oracle and embodies all domain-side effects and policy enforcement.
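The formalization above can be sketched in a few lines. This is an illustrative toy, assuming the state is a nested dictionary rather than a full database; `Action`, `transition`, `rollout`, and the `close_ticket` tool are hypothetical names.

```python
import copy
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Action:
    """A parameterized tool invocation a = (tool name, arguments)."""
    tool: str
    args: dict = field(default_factory=dict)

def close_ticket(state, ticket_id):
    """Hypothetical tool that mutates the state it is given."""
    state["tickets"][ticket_id]["status"] = "closed"

TOOLS = {"close_ticket": close_ticket}

def transition(state, action, tools=TOOLS):
    """s_{t+1} = T(s_t, a_t): apply the tool to a copy so s_t stays inspectable."""
    next_state = copy.deepcopy(state)
    tools[action.tool](next_state, **action.args)
    return next_state

def rollout(initial_state, actions, tools=TOOLS):
    """Return the full state sequence s_0, s_1, ..., s_n for a list of actions."""
    states = [initial_state]
    for action in actions:
        states.append(transition(states[-1], action, tools))
    return states

s0 = {"tickets": {1: {"status": "open"}}}
states = rollout(s0, [Action("close_ticket", {"ticket_id": 1})])
```

Copying the state on every step is wasteful for a real database but keeps each intermediate state available for comparison in this small example.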

Task success is determined by $K$ SME-authored SQL predicates per scenario, which check achievement of user goals, referential and policy integrity, and absence of unintended mutations. For task $i$, let $v_{i,j} \in \{0,1\}$ be the outcome of verifier $j$; the success predicate is

$\text{success}_i = \prod_{j=1}^K v_{i,j}.$
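As a rough illustration of conjunctive SQL verification, the sketch below runs a few predicates against a toy SQLite database and multiplies their outcomes. The table, the predicates, and the helper names `run_verifiers` and `success` are assumptions made for this example; the benchmark's real predicates are not reproduced in the source.

```python
import sqlite3

def run_verifiers(db, predicates):
    """Evaluate each SQL predicate; a missing row or falsy value counts as 0."""
    outcomes = []
    for sql in predicates:
        row = db.execute(sql).fetchone()
        outcomes.append(1 if row is not None and row[0] else 0)
    return outcomes

def success(outcomes):
    """Product of the binary verifier outcomes: 1 only if every verifier passes."""
    result = 1
    for v in outcomes:
        result *= v
    return result

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE tickets (id INTEGER PRIMARY KEY, status TEXT);
    INSERT INTO tickets VALUES (1, 'closed'), (2, 'open');
""")

predicates = [
    "SELECT status = 'closed' FROM tickets WHERE id = 1",  # user goal achieved
    "SELECT status = 'open' FROM tickets WHERE id = 2",    # no unintended mutation
    "SELECT COUNT(*) = 2 FROM tickets",                    # no rows added or deleted
]
clean_outcomes = run_verifiers(db, predicates)

# Simulate an agent that also closed an unrelated ticket.
db.execute("UPDATE tickets SET status = 'closed' WHERE id = 2")
dirty_outcomes = run_verifiers(db, predicates)
```

Because the outcomes are multiplied, a single failed check (here the unintended change to ticket 2) makes the whole task fail even though the stated goal was met.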

Aggregate performance is reported as Pass@1 over $N$ tasks:

$\text{Pass@1} = \frac{1}{N} \sum_{i=1}^N \text{success}_i.$
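The two formulas above translate directly into code. The verifier outcomes below are made-up illustrative values, not benchmark data, and the function names are chosen for this sketch.

```python
from math import prod

def task_success(verifier_outcomes):
    """success_i: product of the binary verifier outcomes for one task."""
    return prod(verifier_outcomes)

def pass_at_1(per_task_outcomes):
    """Pass@1: mean of success_i over all N tasks."""
    successes = [task_success(v) for v in per_task_outcomes]
    return sum(successes) / len(successes)

# Four hypothetical tasks with differing numbers of verifiers.
outcomes = [
    [1, 1, 1],  # all verifiers pass -> success
    [1, 0, 1],  # one verifier fails -> failure
    [1, 1],     # success
    [0, 0, 1],  # failure
]
score = pass_at_1(outcomes)  # 2 successes out of 4 tasks
```

The second task shows why the metric is strict: passing two of three verifiers earns no partial credit.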

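The infeasible-task refusal rate is

$\text{Refusal} = \frac{1}{|I|} \sum_{i \in I} r_i,$

where $I$ is the set of infeasible tasks and $r_i \in \{0,1\}$ indicates that episode $i$ ended with explicit halting and no side effects.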
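A small sketch of the refusal-rate computation follows, assuming each episode is summarized by three hypothetical boolean fields (`feasible`, `halted_explicitly`, `side_effects`). The records are invented for illustration.

```python
def refusal_rate(episodes):
    """Fraction of infeasible-task episodes that halt explicitly with no side effects."""
    infeasible = [e for e in episodes if not e["feasible"]]
    if not infeasible:
        raise ValueError("no infeasible tasks to score")
    safe = sum(
        1 for e in infeasible
        if e["halted_explicitly"] and not e["side_effects"]
    )
    return safe / len(infeasible)

episodes = [
    {"feasible": True,  "halted_explicitly": False, "side_effects": True},   # ignored
    {"feasible": False, "halted_explicitly": True,  "side_effects": False},  # safe refusal
    {"feasible": False, "halted_explicitly": True,  "side_effects": True},   # refused too late
    {"feasible": False, "halted_explicitly": False, "side_effects": True},   # attempted anyway
    {"feasible": False, "halted_explicitly": True,  "side_effects": False},  # safe refusal
]
rate = refusal_rate(episodes)
```

Note that an episode which eventually refuses but has already mutated state does not count as a safe refusal under this definition.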

4. Experimental Protocols and Quantitative Performance

Fourteen agentic models, including Claude Opus and GPT-5 variants, Gemini-3, and high-capacity open-source models, are evaluated in a unified ReAct-style agent loop with oracle-level tool retrieval. All models receive identical prompts; each is run three times and the mean Pass@1 is reported.
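For readers unfamiliar with the pattern, a ReAct-style loop alternates between a model-produced thought and action and an environment observation until the model emits a final answer. The sketch below replaces the LLM with a scripted policy so that it runs deterministically; the tool names, the step format, and the three run scores are all assumptions made for illustration, not the benchmark's harness or results.

```python
from statistics import mean

def make_tools(tickets):
    """Build hypothetical tools that close over a shared ticket store."""
    def get_ticket(ticket_id):
        return dict(tickets[ticket_id])

    def close_ticket(ticket_id):
        tickets[ticket_id]["status"] = "closed"
        return {"ok": True}

    return {"get_ticket": get_ticket, "close_ticket": close_ticket}

def scripted_policy(history):
    """Stand-in for an LLM: picks the next step from how many steps have run."""
    if len(history) == 0:
        return {"thought": "Look up the ticket first.",
                "tool": "get_ticket", "args": {"ticket_id": 1}}
    if len(history) == 1:
        return {"thought": "It is open, so close it.",
                "tool": "close_ticket", "args": {"ticket_id": 1}}
    return {"final": "Ticket 1 closed."}

def run_episode(policy, tools, max_steps=10):
    """Thought -> action -> observation loop; stops on a final answer or step cap."""
    history = []
    for _ in range(max_steps):
        step = policy(history)
        if "final" in step:
            return step["final"], history
        observation = tools[step["tool"]](**step["args"])
        history.append((step, observation))
    return None, history  # step budget exhausted without a final answer

tickets = {1: {"status": "open"}}
final, history = run_episode(scripted_policy, make_tools(tickets))

# Averaging Pass@1 over three runs (illustrative values only).
mean_pass_at_1 = mean([0.25, 0.5, 0.75])
```

In a real harness the policy would call a model with the prompt and the accumulated history, and the tools would be the sandbox endpoints.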

Key quantitative outcomes:

Top performing agent (Claude Opus 4.5): 37.4% Pass@1.
Peak domain performance: Email (51.9%), Teams (51.0%), Drive (52.1%), Calendar (43.2%).
Marked drop-off in policy-heavy domains: ITSM (23.8–28.5%), CSM (16.7–36.4%), Hybrid (22.2–30.7%), HR (10.7–32.1%).
Infeasible-task refusal: best agent attains only 53.9%.
Increasing token budget significantly boosts performance for open-source models in less policy-intensive domains, but yields sharply diminishing returns in complex ones.
Pass@1 drops monotonically with increasing gold trajectory length (from ~35% at horizon 4 to <20% at horizon 16).
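The horizon analysis in the last item amounts to grouping task outcomes by gold trajectory length and computing Pass@1 per group. A minimal sketch, using invented (horizon, success) pairs rather than benchmark data:

```python
from collections import defaultdict

def pass_at_1_by_horizon(results):
    """Group (gold_horizon, success) pairs and return Pass@1 per horizon."""
    buckets = defaultdict(list)
    for horizon, success in results:
        buckets[horizon].append(success)
    return {h: sum(v) / len(v) for h, v in sorted(buckets.items())}

# Illustrative outcomes only: short tasks succeed more often than long ones.
results = [
    (4, 1), (4, 1), (4, 0), (4, 1),
    (16, 0), (16, 1), (16, 0), (16, 0),
]
by_horizon = pass_at_1_by_horizon(results)
```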

Cost–performance tradeoffs show open-source models below 25% Pass@1 (< […] 0.36/task) (Malay et al., 13 Mar 2026).

5. Failure Modes and Diagnostic Analysis

Bottleneck analysis reveals strategic planning—rather than tool invocation—as the primary limitation for current LLM agents:

Plan-conditioned interventions: Human-authored plans yield +14–35 percentage point gains; Claude-generated plans +6–13 points, directly linking strategic reasoning to success rate.
Tool retrieval is not the bottleneck: adding distractor tools has an impact of under 2.4 percentage points.
Multi-agent orchestration (Planner/Executor, Decompose/Subtask) offers modest gain but sometimes regresses due to state-transition complexity.

Common failures include:

Omission of prerequisite lookups (e.g., operating on a ticket without retrieving required entitlements).
Failure to propagate state changes across related entities (“cascading state”).
Incorrectly resolving ambiguous string identifiers without tenant context.
Premature “completion hallucination,” omitting policy enforcements in late-stage flows.
Unsafe handling of infeasible tasks, often resulting in data corruption or access violations.
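The first failure mode, omitted prerequisite lookups, can be detected mechanically from a tool-call trace. The sketch below assumes a hypothetical mapping from each mutating tool to the lookup tools that must precede it; the tool names are invented, and the source does not state that the authors diagnose failures this way.

```python
def missing_prerequisites(trajectory, prerequisites):
    """Flag calls whose required lookups have not appeared earlier in the trace.

    Returns a list of (step index, tool name, sorted missing lookups).
    """
    seen = set()
    violations = []
    for index, tool in enumerate(trajectory):
        missing = sorted(set(prerequisites.get(tool, ())) - seen)
        if missing:
            violations.append((index, tool, missing))
        seen.add(tool)
    return violations

# Hypothetical rule: a ticket update needs both the ticket and its entitlements.
PREREQS = {"update_ticket": {"get_ticket", "get_entitlements"}}

bad_trace = ["get_ticket", "update_ticket"]
good_trace = ["get_ticket", "get_entitlements", "update_ticket"]

bad_report = missing_prerequisites(bad_trace, PREREQS)
good_report = missing_prerequisites(good_trace, PREREQS)
```

This check only looks at tool names; a stricter version would also confirm that the lookup targeted the same entity as the later mutation.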

These failure patterns indicate that current LLM agents’ memory and synthetic state-tracking capabilities are insufficient for complex, policy-bound enterprise tasks (Malay et al., 13 Mar 2026).

6. Relation to Prior Benchmarks

EnterpriseOps-Gym is conceptually descended from two main strands:

Agentic DevOps Benchmarking: DevOps-Gym provides a modular, CLI-focused agent environment for build, configuration, monitoring, and issue/test workflows (704 tasks, Java/Go). Extensions proposed for EnterpriseOps-Gym include multi-tenant controllers, compliance enforcement layers (e.g., SOC2, GDPR, Snyk/Trivy hooks), scalable CI replicas, and an expanded array of infrastructure-as-code, security, and compliance tasks (Tang et al., 27 Jan 2026). EnterpriseOps-Gym generalizes beyond DevOps to full enterprise verticals, introduces persistent database state, and enforces complex organizational policies.

Operations Research RL Environments: The modular, gym.Env foundation draws conceptually on OR-Gym (Hubbs et al., 2020), which established best practices for composable environments, MDP formalization, and hybridized RL/optimization pipelines. EnterpriseGym Corecraft extends this to a high-fidelity simulation of customer support with thousands of entities, delayed-reward rubric-based scoring, and RL training architectures (e.g., Group Relative Policy Optimization) producing strong out-of-distribution generalization (Mehta et al., 18 Feb 2026).

Notable distinctions are the use of large, persistent world state, dense policy constraints, and formalized safe-refusal evaluation in EnterpriseOps-Gym.

7. Conclusions and Future Research Directions

Empirical results demonstrate that state-of-the-art LLM agents are unable to reliably plan or act in enterprise settings characterized by policy complexity, persistent state, and long-horizon dependencies—even when tool invocation and retrieval are given nearly ideal conditions. Correct plan generation, robust internal state management, and reliable safe-refusal are central unsolved challenges.

Specific future work priorities include:

Constraint-aware plan induction that foregrounds preconditions, policy impacts, and cumulative side-effects.
Principled long-horizon state tracking, potentially via explicit memory modules or symbolic state managers.
Formal agent refusal and escalation protocols to prevent unsafe mutations on unsatisfiable tasks.
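One possible shape for the refusal protocol in the last item is to apply a proposed change inside a transaction, check policy invariants, and roll back with an explicit refusal if any invariant fails. The sketch below is only an assumption about how such a guard might look, using SQLite; `apply_or_refuse`, the schema, and the invariant are hypothetical and do not come from the source.

```python
import sqlite3

def apply_or_refuse(db, statements, invariants):
    """Run statements in one transaction; roll back and refuse if an invariant fails."""
    try:
        with db:  # commits on normal exit, rolls back if an exception escapes
            for sql, params in statements:
                db.execute(sql, params)
            for name, check_sql in invariants.items():
                if not db.execute(check_sql).fetchone()[0]:
                    raise PermissionError(name)
    except (PermissionError, sqlite3.Error) as exc:
        return {"status": "refused", "reason": str(exc)}
    return {"status": "applied"}

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, role TEXT, active INTEGER);
    INSERT INTO users VALUES (1, 'admin', 1), (2, 'employee', 1);
""")

# Hypothetical policy: at least one active admin must always remain.
invariants = {
    "active_admin_required":
        "SELECT COUNT(*) >= 1 FROM users WHERE role = 'admin' AND active = 1",
}

unsafe = apply_or_refuse(
    db, [("UPDATE users SET active = 0 WHERE id = ?", (1,))], invariants
)
safe = apply_or_refuse(
    db, [("UPDATE users SET active = 0 WHERE id = ?", (2,))], invariants
)
```

The unsafe request leaves the database untouched because the transaction is rolled back before the refusal is returned, while the safe request is committed.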

The open-source release of EnterpriseOps-Gym’s complete environment, task suite, and evaluation oracles is intended to drive the next generation of research into robust, safe, and generalizable enterprise agent systems (Malay et al., 13 Mar 2026).