SOP-Maze Benchmark for LLM SOP Evaluation
- SOP-Maze is a benchmark developed to evaluate LLM capabilities across 397 tasks drawn from 23 real-world business scenarios.
- It distinguishes between deep sequential tasks (Heart Root Systems) and broad decision-making tasks (Lateral Root Systems) for comprehensive assessment.
- Evaluation metrics reveal LLM weaknesses in route tracking, dialogue handling, and arithmetic, prompting new avenues for model improvement.
SOP-Maze is a benchmark designed to systematically evaluate the capabilities of LLMs in following and reasoning through complex business Standard Operating Procedures (SOPs). Developed from 397 tasks derived across 23 diverse real-world business scenarios, SOP-Maze targets both breadth (wide decision branching) and depth (long sequential logic chains) in procedural reasoning, and also incorporates realistic conversational noise. The benchmark rigorously tests LLM robustness in decision tracking, handling ambiguous or noisy dialogues, and executing embedded calculations such as time manipulation and arithmetic, areas where existing LLMs have demonstrated persistent limitations (Wang et al., 10 Oct 2025).
1. Dataset Construction and Scope
SOP-Maze was constructed from 300,000 anonymized API invocation records sourced with user consent from internal business units. After rule-based filtering and semantic clustering, the 23 most frequent business SOP scenarios were selected, each representing typical yet complex operational workflows. The resulting tasks encapsulate authentic logical inconsistencies and real dialogue pain points, curated by professional annotators for both realism and evaluability. Every SOP task is associated with a unique ground-truth solution to ensure unambiguous model assessment. Model outputs from Claude-4-Sonnet, GPT-4.1, and DeepSeek-R1 were cross-validated to ensure scoring reliability.
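The selection step described above amounts to a frequency-based pick over clustered, rule-filtered records; a minimal sketch in Python (the records, rule filter, and cluster labels here are hypothetical stand-ins for the paper's rule-based filtering and semantic clustering):

```python
from collections import Counter

# Hypothetical stand-ins: each record is (scenario_label, text); in the real
# pipeline the labels come from semantic clustering of 300,000 API records.
records = [
    ("customer_service_rate", "user asks about reply rating"),
    ("bulk_order_clarification", "user orders 40 boxes"),
    ("customer_service_rate", "agent replied after 3 minutes"),
    ("risky_content_detection", "flag this message"),
    ("customer_service_rate", ""),  # truncated record, will be filtered
]

def passes_rules(text: str) -> bool:
    # Placeholder rule-based filter (e.g., drop empty or truncated records).
    return text != ""

filtered = [label for label, text in records if passes_rules(text)]

# Keep the most frequent scenarios (the paper keeps the top 23).
TOP_K = 23
top_scenarios = [label for label, _ in Counter(filtered).most_common(TOP_K)]
print(top_scenarios)
```

The real pipeline also involved annotator curation and cross-model validation, which this sketch omits.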
The benchmark is partitioned into ten Heart Root System (HRS) scenarios, characterized by deep, sequential logical dependencies (average depth ≈ 5, up to 10 branches), and thirteen Lateral Root System (LRS) scenarios, which prioritize broad, shallow decision trees (depth ≤ 3, but with around 58 leaf nodes stemming from an average of 5 parent nodes).
2. Task Taxonomy and Exemplars
SOP-Maze’s dual structure distinguishes task classes as follows:
- Lateral Root System (LRS): Tasks involve selection among many alternatives within shallow trees, emphasizing precise branch choice under strict output format constraints. A canonical example is the Customer Service Rate scenario (L4), where the model must compute reply latency and assign ratings by selecting exactly one of numerous applicable rules.
- Heart Root System (HRS): Tasks feature long, sequential logical chains that necessitate the faithful traversal of multiple conditional nodes to reach the correct solution. For example, in Bulk Order Clarification (H10), the model must identify items, check stock, handle delivery constraints, resolve exceptions, and finally compose an appropriate customer response.
Key characteristics of both classes are illustrated below:
| SOP Class | Breadth (avg. branching factor) | Depth (max path length) | Example Scenario |
|---|---|---|---|
| HRS | ≈3 | 5–7 (up to 10) | Bulk Order Clarification (H10) |
| LRS | ≈11 | ≤3 | Customer Service Rate (L4) |
This structural dichotomy is central to exposing distinct model weaknesses: LRS evaluates highly parallel decision selection, while HRS probes deep procedural tracking.
3. Complexity Measures and Evaluation Metrics
SOP-Maze introduces formal complexity measures based on the SOP task decision graph $G$ rooted at node $r$:

- Branching Factor:

$$B(G) = \frac{1}{|P|} \sum_{v \in P} \deg^+(v)$$

where $P$ is the set of parent (non-leaf) nodes and $\deg^+(v)$ indicates the out-degree of node $v$.

- Decision Depth:

$$D(G) = \max_{l \in \mathcal{L}} \operatorname{len}\big(\operatorname{path}(r, l)\big)$$

the length of the longest root-to-leaf path, where $\mathcal{L}$ is the set of leaf nodes. Observed statistics show LRS tasks with average $B \approx 11$, $D \le 3$; HRS tasks with $B \approx 3$, $D = 5$–7.

- Scoring: All model outputs are evaluated under a reference-based schema:

$$s_i = \begin{cases} 1 & \text{format and answer both correct} \\ 0.5 & \text{format correct, answer incorrect} \\ 0 & \text{otherwise} \end{cases}$$

The overall accuracy is:

$$\text{Accuracy} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}[s_i = 1]$$

where $N$ is the total number of tasks.
For tasks with categorical or multi-label structure, standard Precision, Recall, and F1 scores are reported, supplementing the main accuracy metric.
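The complexity measures and the accuracy rule are straightforward to compute over an explicit decision graph; a minimal sketch (the toy graph below is illustrative, not a real SOP-Maze task):

```python
from statistics import mean

# Toy SOP decision graph as adjacency lists: node -> list of children.
graph = {
    "root": ["a", "b", "c"],
    "a": ["a1", "a2"],
    "b": ["b1"],
    "c": [],
    "a1": [], "a2": [], "b1": [],
}

def branching_factor(g):
    """Average out-degree over parent (non-leaf) nodes."""
    parents = [n for n, children in g.items() if children]
    return mean(len(g[p]) for p in parents)

def decision_depth(g, node="root"):
    """Length (in edges) of the longest root-to-leaf path."""
    if not g[node]:
        return 0
    return 1 + max(decision_depth(g, c) for c in g[node])

def accuracy(scores):
    """Fraction of tasks receiving the full score s_i == 1."""
    return sum(s == 1 for s in scores) / len(scores)

print(branching_factor(graph))   # 2.0 (parents have out-degrees 3, 2, 1)
print(decision_depth(graph))     # 2
print(accuracy([1, 0.5, 0, 1]))  # 0.5
```

Partial scores (0.5) thus affect the per-score task counts reported later but do not contribute to the headline accuracy.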
4. Experimental Design and Model Benchmarks
Eighteen LLMs—covering API-based commercial and open-source models—were evaluated using either direct zero-shot or chain-of-thought (CoT, “Thinking” mode) prompts. Each task prompt consists of: (1) Objective, (2) SOP (up to 5,040 tokens), (3) User Input (dialogue), and (4) Output Requirements (JSON schema, fields: “answer”, “index”). Full SOPs are always provided, and no few-shot demonstration is included. The CoT prompt includes explicit cues for stepwise reasoning: “Please think step by step before answering.”
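The four-part prompt and the reference-based format check can be sketched as a simple template plus a JSON validator; the field names follow the paper's output schema, while the helper functions themselves are assumptions for illustration:

```python
import json

def build_prompt(objective, sop, user_input, cot=False):
    # Four-part prompt: Objective, SOP, User Input, Output Requirements.
    parts = [
        f"Objective: {objective}",
        f"SOP:\n{sop}",
        f"User Input:\n{user_input}",
        'Output Requirements: respond with JSON {"answer": ..., "index": ...}',
    ]
    if cot:  # "Thinking" mode appends the explicit stepwise cue.
        parts.append("Please think step by step before answering.")
    return "\n\n".join(parts)

def parse_output(raw):
    """Return the parsed object, or None if the format is invalid
    (which would score s = 0 under the reference-based schema)."""
    try:
        obj = json.loads(raw)
        if "answer" in obj and "index" in obj:
            return obj
    except json.JSONDecodeError:
        pass
    return None

prompt = build_prompt("Rate the reply latency",
                      "1. Compute latency ...",
                      "Agent replied at 10:05.",
                      cot=True)
print(prompt)
```

A zero-shot run uses `cot=False`; no few-shot demonstrations are added in either setting.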
Results indicate consistently low top-line performance:
| Model | Tasks at $s=1$ | Tasks at $s=0.5$ | Tasks at $s=0$ | Accuracy |
|---|---|---|---|---|
| DeepSeek-V3.1-Thinking | 132 | 268 | 7 | 33% |
| Claude-Opus-4-Thinking | — | — | — | 33% |
| Doubao-Seed-1.6-ThinkingOn | — | — | — | 32% |
The best HRS scenario (Food Delivery Customer Service, H5) achieved 96% accuracy, whereas the most challenging (Customer Evaluation Follow-Up, H4) was as low as 18–28%. LRS showed similar spreads, with Complaint Analysis (L11) being easiest (up to 76%) and Risky Content Detection (L2) the hardest (3–20%).
Reasoning (CoT) models outperform direct-prompted counterparts, notably in LRS, but no model or setting solves more than one-third of the benchmark.
5. Error Analysis
Three principal failure categories were identified:
- Route Blindness: LLMs deviate from the SOP, skipping steps or branches. Example: In “Schedule In-Person Appointment” (HRS), the model output skips mandatory precondition checks.
- Conversational Fragility: Models misinterpret user dialogue, especially when the input exhibits naturalistic ambiguity, sarcasm, or shifts such as threat reversals. Example: “I’ll let it slide” is ignored in Intention Recognition Analysis (LRS), causing incorrect selection.
- Calculation Errors: Models fail with basic arithmetic or time reasoning embedded within broader SOPs. Example: In “Customer Service Rate” (LRS), reply latency calculations are incorrect despite explicit timestamps.
For DeepSeek-V3.1-Thinking, among the 268 tasks scored at $s = 0.5$ (format correct, answer incorrect): 177 were Route Blindness, 166 Conversational Fragility, and 60 Calculation Errors (a task can exhibit multiple failure modes, so counts exceed 268); similar distributions appear across other leading models.
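The arithmetic behind the Calculation Errors category is trivial in isolation: the Customer Service Rate latency check reduces to a timestamp subtraction followed by a threshold lookup. A minimal reference implementation (the threshold values and rating labels are hypothetical, not taken from the actual SOP):

```python
from datetime import datetime

# Hypothetical rating thresholds in minutes -- illustrative only.
THRESHOLDS = [(1, "excellent"), (5, "good"), (30, "acceptable")]

def rate_latency(user_ts: str, agent_ts: str) -> str:
    """Compute reply latency from explicit timestamps and map it to a rating."""
    fmt = "%Y-%m-%d %H:%M:%S"
    latency_min = (datetime.strptime(agent_ts, fmt)
                   - datetime.strptime(user_ts, fmt)).total_seconds() / 60
    for limit, rating in THRESHOLDS:
        if latency_min <= limit:
            return rating
    return "poor"

print(rate_latency("2025-10-10 10:00:00", "2025-10-10 10:03:00"))  # good
```

That models fail this step inside a full SOP, despite explicit timestamps, is what distinguishes this failure mode from a plain arithmetic deficit.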
6. Analysis of Model Capabilities and Deficits
SOP-Maze reveals multiple, orthogonal deficits in current LLMs: inability to trace deep or wide procedural flows, fragility to realistic conversational input, and decreased calculation reliability within full-context SOPs. Chain-of-thought prompts partially mitigate, though do not eliminate, route and dialogue errors. Arithmetic and time manipulations that are handled correctly in isolation often fail when assessed in the richer SOP context. This suggests that SOP-Maze probes integrative weaknesses not revealed by isolated or synthetic reasoning tasks.
7. Implications and Future Research Directions
The systematic gaps revealed by SOP-Maze motivate several directions for improving LLM robustness in business process automation:
- Modular SOP Tracking: Implementing explicit state trackers to manage node and path traversal within the SOP graph structure.
- Dialogue Robustification: Preprocessing conversational input with modules for intent and sarcasm to attenuate naturalistic input noise.
- External Tool Use: Offloading arithmetic and date processing to specialized calculators or parsers.
- Domain-Specific Fine-Tuning: Instruction tuning with SOP-like chains to specialize models for explicit step-wise reasoning.
- Memory Augmentation: Introducing working memory buffers to store intermediate SOP reasoning states.
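The first and last directions can be combined in an explicit tracker that walks the SOP graph and buffers intermediate states; a minimal sketch (the graph, condition predicates, and node names are illustrative assumptions, not part of the benchmark):

```python
class SOPTracker:
    """Explicit state tracker for SOP graph traversal with a memory buffer."""

    def __init__(self, graph, root):
        self.graph = graph   # node -> list of (condition_fn, child) edges
        self.node = root
        self.memory = []     # working-memory buffer of visited transitions

    def step(self, context):
        """Follow the first edge whose condition holds; record the transition."""
        for condition, child in self.graph.get(self.node, []):
            if condition(context):
                self.memory.append((self.node, child))
                self.node = child
                return child
        return None  # leaf reached or no condition satisfied

# Illustrative two-step SOP: check stock, then choose delivery handling.
graph = {
    "start": [(lambda c: c["in_stock"], "check_delivery"),
              (lambda c: True, "apologize")],
    "check_delivery": [(lambda c: c["deliverable"], "confirm_order"),
                       (lambda c: True, "offer_pickup")],
}

tracker = SOPTracker(graph, "start")
ctx = {"in_stock": True, "deliverable": False}
while tracker.step(ctx):
    pass
print(tracker.node, tracker.memory)
```

Because every transition is recorded, a deviation from the SOP (Route Blindness) becomes a checkable property of the memory buffer rather than an implicit failure of the model's generation.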
By exposing intertwined deficits related to procedural breadth, depth, and dialogue handling, SOP-Maze offers a principled foundation for benchmarking and advancing LLM capabilities in real-world operational SOP settings (Wang et al., 10 Oct 2025).