
SOP-Maze: Benchmark for LLM SOP Navigation

Updated 5 February 2026
  • SOP-Maze is a benchmark that evaluates LLM capabilities in executing complex business SOPs by testing multi-branch decisions and multi-hop logic.
  • It utilizes a dataset of 397 tasks across 23 SOP scenarios, capturing authentic challenges from sectors like customer service, finance, and logistics.
  • Evaluation metrics require correct responses and structured JSON output, revealing both the strengths and failure modes of tested models.

SOP-Maze is a comprehensive benchmark designed to evaluate the capabilities of LLMs in executing and reasoning over complex business Standard Operating Procedures (SOPs). SOP-Maze addresses a gap in existing benchmarks by structuring real-world SOP-driven tasks into a collection that rigorously tests both the breadth (multi-branch decision spaces) and the depth (multi-hop logical chains) required in business process automation. The benchmark provides a structured, instance-level representation and diagnosis of LLMs' capabilities and shortcomings when navigating authentic business logic (Wang et al., 10 Oct 2025).

1. Benchmark Construction and Dataset Characteristics

SOP-Maze comprises 397 tasks derived from 23 major business SOP scenarios. The dataset was constructed from approximately 300,000 internal API-router logs (collected with user consent) reflecting real business SOP invocations. Construction entailed rule-based filtering, semantic clustering, and manual selection of the largest scenario clusters. SOPs in the raw logs often contained logical inconsistencies; these were deliberately preserved to probe model robustness, though self-contradictions were resolved to guarantee a unique ground truth for each task. Subject-matter experts (SMEs) validated task representativeness, and all tasks underwent cross-validation by five professional annotators (each with at least three years' experience), requiring at least three-way agreement for inclusion. Model outputs were independently cross-scored for consistency.

Key dataset statistics include:

  • 397 tasks spanning 23 scenarios
    • Heart Root System (HRS): 10 scenarios
    • Lateral Root System (LRS): 13 scenarios
  • SOP average length: 5,040 tokens
  • LRS: average ~5 internal nodes, ~58 leaf nodes
  • HRS: logic chain depths exceeding 10 steps in some cases
  • Scenarios cover domains such as customer service, finance, and logistics

Formally, each SOP-Maze task instance is structured as a tuple

T = ⟨Objective, SOP, UserInput, OutputSpec⟩

  • Objective: specifies background, role, and goal
  • SOP: natural language procedural description, often with embedded decision logic
  • UserInput: real user dialogue or request
  • OutputSpec: a strict JSON Schema requiring
    • a "response" (free-form natural language answer)
    • a "step_index" (integer pointer to the specific SOP node/step)
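A conforming model output is therefore a JSON object with exactly these two fields. The minimal validator below is an illustrative sketch under that assumption; the benchmark's actual JSON Schema may impose additional constraints:

```python
import json

# Illustrative check for the two required OutputSpec fields described above
# ("response" and "step_index"); not the paper's actual schema.
def validate_output(raw: str) -> bool:
    """Return True iff `raw` is valid JSON carrying both required fields."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return (
        isinstance(obj, dict)
        and isinstance(obj.get("response"), str)
        and isinstance(obj.get("step_index"), int)
        and not isinstance(obj.get("step_index"), bool)  # bool subclasses int
    )

good = '{"response": "Your refund is approved.", "step_index": 7}'
bad = '{"response": "Your refund is approved."}'  # missing step_index
print(validate_output(good))  # True
print(validate_output(bad))   # False
```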

The scoring function is

  • S = 1.0 if both the response and the step index are correct
  • S = 0.2 if the output is valid per the JSON Schema but the answer or index is wrong
  • S = 0.0 if the output format is invalid
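In code, this three-level rule is a short function (the boolean arguments are illustrative stand-ins for the benchmark's correctness checks):

```python
# Sketch of SOP-Maze's three-level scoring rule.
def score(valid_format: bool, response_correct: bool, index_correct: bool) -> float:
    if not valid_format:
        return 0.0                      # output does not conform to the JSON Schema
    if response_correct and index_correct:
        return 1.0                      # fully correct response and step index
    return 0.2                          # well-formed but wrong answer or index

print(score(True, True, True))    # 1.0
print(score(True, True, False))   # 0.2
print(score(False, True, True))   # 0.0
```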

2. Task Taxonomy: Lateral Root System and Heart Root System

SOP-Maze categorizes tasks into two distinct structural classes:

  • Lateral Root System (LRS):
    • Defined as wide-option SOPs with shallow (≤3-level) but extensively branched decision graphs
    • Representative domains: information extraction, named entity classification, and bulk order comparisons
    • Characterized by requiring the model to discriminate among tens of parallel leaf decisions, stressing branch-selection capabilities
  • Heart Root System (HRS):
    • Defined as SOPs demanding multi-stage, deeply sequenced reasoning (logic chains of roughly 5 to 10+ steps)
    • Typical scenarios involve stepwise clarification, multi-stage scheduling, and layered conditional checks
    • These tasks concentrate on multi-hop consistency and the reliable execution of chained conditionals

No formal pseudocode is used; task logic is presented in natural language, mimicking business SOP documentation (Wang et al., 10 Oct 2025).
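For intuition, the two structural regimes can be sketched as decision trees; this is purely illustrative (as noted, the benchmark itself presents SOPs in natural language), using nested dicts and the two measures that define the taxonomy, depth versus leaf fan-out:

```python
# Hypothetical sketch: an SOP decision graph as nested dicts; None marks a leaf.
def depth(node) -> int:
    """Number of levels from this node down to its deepest leaf."""
    if not isinstance(node, dict) or not node:
        return 1
    return 1 + max(depth(child) for child in node.values())

def leaves(node) -> int:
    """Number of terminal decisions reachable from this node."""
    if not isinstance(node, dict) or not node:
        return 1
    return sum(leaves(child) for child in node.values())

# LRS-like SOP: shallow (<=3 levels) but widely branched (~58 leaf options).
lrs_sop = {"classify intent": {f"category_{i}": None for i in range(58)}}

# HRS-like SOP: one long chain of sequential conditional checks.
hrs_sop = {}
node = hrs_sop
for step in range(10):
    node[f"check_{step}"] = {}
    node = node[f"check_{step}"]

print(depth(lrs_sop), leaves(lrs_sop))  # 3 58  (shallow, many leaves)
print(depth(hrs_sop))                   # 11    (deep, narrow chain)
```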

3. Model Evaluation Protocol and Metrics

SOP-Maze evaluates a diverse range of models:

  • 11 API-based LLMs (e.g., Claude-Opus-4, GPT-4.1, DeepSeek-V3.1)
  • 7 open-source LLMs (e.g., Kimi-K2-Instruct, Qwen3, Doubao-Seed-1.6)

Default hyperparameters and temperature settings are enforced. Prompts embed the entire SOP and require output conforming strictly to the JSON Schema.

Evaluation metrics are:

  • Per-instance score S ∈ {1.0, 0.2, 0.0}
  • Overall accuracy: mean of instance scores across the dataset
  • Segregated accuracies OA_HRS and OA_LRS for the HRS and LRS subsets
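The aggregation is a plain mean over instance scores, computed overall and per subset; the tuple layout below is an illustrative assumption:

```python
# Sketch of the aggregation: results as (subset_label, score) pairs,
# with score drawn from the {1.0, 0.2, 0.0} scale defined earlier.
def overall_accuracy(results):
    scores = [s for _, s in results]
    return sum(scores) / len(scores)

def subset_accuracy(results, label):
    scores = [s for lbl, s in results if lbl == label]
    return sum(scores) / len(scores)

results = [("HRS", 1.0), ("HRS", 0.2), ("LRS", 0.0), ("LRS", 1.0)]
print(overall_accuracy(results))        # 0.55
print(subset_accuracy(results, "HRS"))  # 0.6
```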

Models with and without explicit Chain-of-Thought prompting ("Thinking" variants) are compared, quantifying the impact of explicit reasoning strategies on SOP navigation.

4. Empirical Results and Performance Analysis

Experimental results demonstrate a high level of difficulty inherent in SOP-Maze:

  • Top model (DeepSeek-V3.1-Thinking): 132/397 fully correct (score 1.0), 268 partially correct (0.2), 7 invalid; overall average score ≈ 0.36
  • Best HRS accuracy: up to 64% (DeepSeek-V3.1-Thinking)
  • Best LRS accuracy: 32% (DeepSeek-V3.1-Thinking and Claude-Opus-4-Thinking)
  • GPT-4.1: 51% on HRS versus 26% on LRS
  • Most models: <50% on HRS, <30% on LRS

Chain-of-Thought ("Thinking") variants consistently outperform non-reasoning variants by 5–10 points, particularly on LRS tasks.

| Model                  | HRS OA (%) | LRS OA (%) | Overall Score |
|------------------------|------------|------------|---------------|
| DeepSeek-V3.1-Thinking | 64         | 32         | ≈0.36         |
| Claude-Opus-4-Thinking | 60         | 32         | —             |
| GPT-4.1                | 51         | 26         | —             |

Table: Summary of best HRS and LRS performance (Wang et al., 10 Oct 2025)

5. Error Diagnosis and Ablation

A taxonomy of error modes, based on annotator analysis (Table 4 in (Wang et al., 10 Oct 2025)), reveals three dominant failure categories:

  • Route Blindness: Inability to reliably traverse the SOP decision structure (LRS: selecting the correct option among ≈58 parallel leaves; HRS: skipping required intermediate steps or failing hierarchical overrides)
  • Conversational Fragility: Misinterpretation of user intentions under dialogic ambiguity, e.g., over-anchoring on initial context, misunderstanding repairs or sarcasm, mislabeling short utterances in highly contextual exchanges
  • Calculation Errors: Breakdown on embedded arithmetic or time computations, e.g., timestamp-based metrics or aggregation operations

Ablation studies provide targeted insights:

  • Removing unrelated SOP branches in "Bulk Order Clarification" raises accuracy by up to 40 points at Stage 1, supporting the assertion that combinatorial complexity is the core obstacle in LRS tasks
  • Cleaning disfluent dialogue context in "Food Combo Dev" raises performance only modestly, plateauing at ~70%, indicating a deeper comprehension gap
  • Simplifying queries in calculation-heavy tasks (e.g., direct timestamp questions) enables models to achieve ≈90% accuracy by Stage 3, implying arithmetic and date/time reasoning is particularly fragile in complex contexts

6. Implications, Limitations, and Research Directions

Performance of even state-of-the-art LLMs is limited: most models score below 50% on HRS (deep logic) and below 35% on LRS (broad selection) tasks. Failure is typically not attributable to a single capability but to complex, interacting requirements involving multi-branch navigation, robust conversational grounding, and calculation.

Potential research directions include:

  • Incorporation of graph-oriented planning modules to mitigate route blindness
  • Training on more diverse, authentic dialogue for enhanced conversational nuance
  • Augmenting with dedicated arithmetic and temporal reasoning plugins/tool calls
  • Expanding SOP datasets to include richer, step-level annotation for more granular supervision

Open questions persist regarding modular task decomposition, the possible efficacy of neural-symbolic hybrids, and the development of metrics that capture partial compliance in SOP reasoning beyond the discretized 0/0.2/1.0 schema (Wang et al., 10 Oct 2025).
