On-the-Fly SOP Synthesis

Updated 27 May 2026

On-the-fly SOP synthesis is a dynamic method that creates executable procedural templates for LLM-driven software agents in real time.
It employs retrieval-augmented generation, structured templates, and layered constraint enforcement to ensure domain compliance and actionable outputs.
The approach supports multi-agent collaboration, industrial automation, and adaptive feedback loops through repository distillation and runtime adaptation.

On-the-fly Standard Operating Procedure (SOP) synthesis refers to the dynamic construction of executable procedural templates for software agents, particularly those powered by LLMs, in response to new, previously unseen tasks or situations. Unlike static SOP libraries or manual process definitions, on-the-fly synthesis enables systems to generate, tailor, and validate SOPs at runtime, supporting robust adaptation, multi-agent orchestration, and tool integration across diverse domains. Recent architectural advances combine retrieval-augmented generation, LLM prompting, repository-driven reflection, and hard-constraint enforcement to deliver SOPs that are both domain-compliant and immediately actionable, with key applications in multi-agent collaboration, root cause analysis, and industrial automation (Liu et al., 14 Feb 2026, Pei et al., 12 Feb 2025, Nandi et al., 9 Jun 2025).

1. Formal Representations of SOPs

Across modern frameworks, SOPs are formalized as structured templates that encode agent responsibilities, interaction protocols, and tool bindings. In MASFly, each SOP consists of agent specifications—tuples $\{A_i = (E_i, R_i, I_i, T_i)\}$ with $E_i$ (agent name), $R_i$ (coarse responsibility), $I_i$ (prompt instructions), and $T_i$ (external tools)—plus a communication graph $G = (V, E)$ describing message flows (e.g., User $\to$ Planner $\to$ WebSearcher $\to$ Summarizer), all stored as JSON-like objects (Liu et al., 14 Feb 2026).
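The tuple-plus-graph representation above can be sketched as a small data structure. This is a minimal illustration, not MASFly's actual schema: the class names `AgentSpec` and `SOP`, the `validate` helper, and the choice to treat `User` as an implicit node are all assumptions made for the example.

```python
from dataclasses import dataclass, field, asdict
import json


@dataclass
class AgentSpec:
    name: str                 # E_i: agent name
    responsibility: str       # R_i: coarse responsibility
    instructions: str         # I_i: prompt instructions
    tools: list[str] = field(default_factory=list)  # T_i: external tools


@dataclass
class SOP:
    agents: list[AgentSpec]
    edges: list[tuple[str, str]]   # directed message flows (sender, receiver)

    def validate(self) -> list[str]:
        """Return a list of problems; an empty list means every edge is well-formed."""
        # Assumption: "User" is an implicit node that needs no AgentSpec.
        known = {a.name for a in self.agents} | {"User"}
        return [f"unknown node in edge {s}->{r}"
                for s, r in self.edges
                if s not in known or r not in known]

    def to_json(self) -> str:
        """Serialize to a JSON object (tuples become JSON arrays)."""
        return json.dumps(asdict(self), indent=2)


sop = SOP(
    agents=[
        AgentSpec("Planner", "decompose the task", "Break the request into sub-tasks."),
        AgentSpec("WebSearcher", "gather evidence", "Search the web.", ["web_search"]),
        AgentSpec("Summarizer", "write the answer", "Summarize the findings."),
    ],
    edges=[("User", "Planner"), ("Planner", "WebSearcher"), ("WebSearcher", "Summarizer")],
)
print(sop.validate())   # empty list when all edge endpoints are known
```

Keeping the graph as an explicit edge list makes it cheap to check that every message flow refers to a declared agent before the SOP is handed to an orchestrator.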

In SOP-Bench, SOP documents adhere to a domain-specific DSL-like schema, ensuring the presence of defined sections (Purpose, Scope, Definitions, Input, Main Procedure, Output), numbered logic steps, and explicit tool calls. Structural regularity is enforced by parsing and validation rules, and each SOP aligns with a generated data schema and tool registry (Nandi et al., 9 Jun 2025).

Flow-of-Action instantiates SOPs as hierarchical, step-wise procedures with enforced ordering—system-level, application-level, then service-specific diagnostics—output as JSON with a name and steps array (Pei et al., 12 Feb 2025). This structural formalism underpins effective mapping between SOPs, agent orchestration, and automated tool invocation.
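As a rough illustration of the hierarchical ordering, the sketch below parses a JSON SOP with a `name` and a `steps` array and checks that steps never move back up the hierarchy. Only `name` and `steps` come from the description above; the per-step `level` and `action` fields, the level labels, and the sample incident are hypothetical.

```python
import json

# Ordering described in the text: system, then application, then service.
LEVEL_ORDER = ["system", "application", "service"]


def levels_in_order(sop: dict) -> bool:
    """True if step levels never move back up the hierarchy."""
    ranks = [LEVEL_ORDER.index(step["level"]) for step in sop["steps"]]
    return all(a <= b for a, b in zip(ranks, ranks[1:]))


sop_text = """
{
  "name": "high_latency_diagnosis",
  "steps": [
    {"level": "system", "action": "check node CPU and memory"},
    {"level": "application", "action": "inspect gateway error rate"},
    {"level": "service", "action": "trace slow calls in the checkout service"}
  ]
}
"""
sop = json.loads(sop_text)
print(levels_in_order(sop))
```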

2. Retrieval- and Prompt-Augmented SOP Generation

The foundational mechanism for on-the-fly SOP synthesis is retrieval-augmented generation, in which historical cases guide LLM-driven instantiation:

Repository Construction: MASFly maintains a growing repository of triplets $\{(Q_i, N_i, S_i)\}$, where $Q_i$ is a query, $N_i$ is a need analysis, and $S_i$ is a successful SOP. Sentence embedding techniques are used to precompute vector representations for both $Q_i$ and $N_i$ (Liu et al., 14 Feb 2026).
Hybrid Retrieval Scoring: For a new query $Q$ (with need analysis $N$), each stored entry is scored by a weighted combination of $\mathrm{Sim}(Q, Q_i)$ and $\mathrm{Sim}(N, N_i)$, where Sim is cosine similarity and the weight tunes need vs. surface alignment. The top $K$ matches are passed to an LLM for SOP synthesis, typically through a prompt that combines $Q$, $N$, and the retrieved templates (Liu et al., 14 Feb 2026).
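The hybrid scoring step can be sketched in a few lines. This example assumes a convex combination with a weight `lam` on the need-analysis similarity; that exact form, the function names, and the two-dimensional toy vectors (standing in for real sentence embeddings) are assumptions, not details taken from MASFly.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k(query_vec, need_vec, repo, lam=0.5, k=2):
    """Score every repository entry and return the k best (score, sop) pairs.

    repo: list of dicts with precomputed 'q_vec' and 'n_vec' plus the stored 'sop'.
    lam:  hypothetical weight on need similarity; (1 - lam) goes to query similarity.
    """
    scored = [
        (lam * cosine(need_vec, e["n_vec"]) + (1 - lam) * cosine(query_vec, e["q_vec"]),
         e["sop"])
        for e in repo
    ]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


repo = [
    {"q_vec": [1.0, 0.0], "n_vec": [1.0, 0.0], "sop": "travel_planning"},
    {"q_vec": [0.0, 1.0], "n_vec": [0.0, 1.0], "sop": "code_review"},
    {"q_vec": [1.0, 1.0], "n_vec": [1.0, 0.0], "sop": "budget_research"},
]
hits = top_k([1.0, 0.0], [1.0, 0.0], repo)
print([name for _, name in hits])
```

Raising `lam` favors entries whose underlying need matches even when the surface wording of the query differs, which is the trade-off the weight is meant to control.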

In Flow-of-Action, on-the-fly synthesis is triggered when match scores against the incident context fall below a preset threshold; the system constructs an LLM prompt comprising a header, several few-shot SOP examples, and a structured output request. The LLM then emits a stepwise SOP (Pei et al., 12 Feb 2025).
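A minimal sketch of this reuse-or-synthesize decision follows. The function names, the default threshold of 0.6, and the prompt wording are illustrative assumptions; the actual Flow-of-Action prompt is not reproduced here, and no LLM call is made.

```python
def build_synthesis_prompt(incident: str, examples: list[dict]) -> str:
    """Assemble a header, few-shot SOP examples, and a structured output request."""
    header = "You are an SRE assistant. Write an SOP for the incident below.\n"
    shots = "\n".join(
        f"Example SOP: {ex['name']}\n"
        + "\n".join(f"  {i}. {step}" for i, step in enumerate(ex["steps"], 1))
        for ex in examples
    )
    request = 'Return JSON with keys "name" and "steps".'
    return f"{header}\n{shots}\n\nIncident: {incident}\n{request}"


def select_or_synthesize(incident, scored_sops, examples, threshold=0.6):
    """Reuse the best-matching SOP, or fall back to a synthesis prompt.

    scored_sops: list of (score, sop) pairs from a matcher (hypothetical input).
    """
    best = max(scored_sops, key=lambda pair: pair[0], default=(0.0, None))
    if best[0] >= threshold:
        return ("reuse", best[1])
    return ("synthesize", build_synthesis_prompt(incident, examples))


examples = [{"name": "disk_full", "steps": ["check df output", "rotate logs"]}]
decision, payload = select_or_synthesize(
    "p99 latency spike on checkout", [(0.3, {"name": "cpu_saturation"})], examples
)
print(decision)
```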

SOP-Bench pipelines condition the document generator on domain parameters and schema outputs, with grammar-based prompts ensuring procedural structure. The probability of a step sequence is modeled autoregressively, $P(s_1, \ldots, s_T \mid c) = \prod_{t=1}^{T} P(s_t \mid s_{<t}, c)$, where $c$ denotes the conditioning prompt, leveraging one-shot or few-shot prompting and explicit section templates (Nandi et al., 9 Jun 2025).

3. Constraint Enforcement and Structural Validation

To ensure domain, structural, and safety compliance, all reviewed frameworks incorporate layered constraint mechanisms:

Level	Technique	Example
Structural	Grammar/regex and section enforcement	Required SOP sections, step numbering
Data Model	JSON Schema validation, type checking	Ensuring all tool inputs and outputs are valid
Procedural	Prompted hierarchy/rules, hardcoded verification	No omission of system-level checks
Human-in-the-loop	Expert review, smoke testing	Manual artifact validation in SOP-Bench

SOP-Bench employs Python validators for schema conformance and grammar matching, running end-to-end smoke tests by simulating agent runs on synthetic data. This hard-constraint layer is always enforced before releasing an SOP to an agent (Nandi et al., 9 Jun 2025). Flow-of-Action enforces constraint adherence by embedding “must-follow” rules into prompts, guiding LLMs toward best practices and sequential logic (Pei et al., 12 Feb 2025).
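The structural and data-model layers can be illustrated with a short validator sketch. It is not SOP-Bench's validator code: the function names are hypothetical, the sample SOP is invented, and a hand-rolled type check stands in for full JSON Schema validation so the example needs only the standard library.

```python
import re

# Section names listed for SOP-Bench documents in the text above.
REQUIRED_SECTIONS = ["Purpose", "Scope", "Definitions", "Input", "Main Procedure", "Output"]


def check_structure(sop_text: str) -> list[str]:
    """Structural layer: required section headings and consecutive step numbers."""
    errors = [
        f"missing section: {s}"
        for s in REQUIRED_SECTIONS
        if not re.search(rf"^{re.escape(s)}\s*$", sop_text, flags=re.MULTILINE)
    ]
    numbers = [int(n) for n in re.findall(r"^(\d+)\.\s", sop_text, flags=re.MULTILINE)]
    if numbers != list(range(1, len(numbers) + 1)):
        errors.append(f"step numbering not consecutive: {numbers}")
    return errors


def check_tool_call(args: dict, spec: dict) -> list[str]:
    """Data-model layer: each declared parameter is present and has the declared type."""
    errors = []
    for name, expected in spec.items():
        if name not in args:
            errors.append(f"missing argument: {name}")
        elif not isinstance(args[name], expected):
            errors.append(f"{name}: expected {expected.__name__}")
    return errors


doc = """Purpose
Verify a shipment.
Scope
Inbound freight.
Definitions
None.
Input
shipment_id
Main Procedure
1. Call lookup_shipment.
2. Compare weights.
Output
status
"""
print(check_structure(doc))
print(check_tool_call({"shipment_id": 7}, {"shipment_id": str}))
```

Returning a list of errors rather than raising on the first one lets a pipeline report every violation in a single pass before deciding whether to regenerate the SOP.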

4. Test-Time Adaptation and Closed-Loop Supervision

On-the-fly synthesis is complemented by mechanisms for runtime execution adaptation. MASFly introduces a Watcher agent, supported by a Personalized Experience Pool (PEP) containing records of prior failures and remedial hints at the agent level. During execution, the Watcher periodically retrieves top-matching prior failures using query embedding, compares live traces, and triggers interventions by issuing hints or respawning agents (Liu et al., 14 Feb 2026).

The formal loop in MASFly:

Agents execute the Operation Procedure (OP), passing messages and performing tool calls.
At a configurable interval (a set number of messages or interactions), the Watcher consults the PEP.
Detected anomalies are matched to remedy patterns: on a match, the Watcher gives feedback or replaces the faulty agent.
Successes result in distilled SOPs being added to the repository (reflective distillation), while novel failures extend the PEP with diagnostics and remedies.

This closed-loop approach enables online correction of drift, tool misuse, or novel error patterns, providing robust adaptation without further model fine-tuning (Liu et al., 14 Feb 2026).
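The supervision loop can be sketched as a periodic scan of the message trace. This is a simplified illustration under stated assumptions: `FailureRecord`, `watch`, and the `interval` parameter are hypothetical names, and plain substring matching stands in for the embedding-based retrieval that MASFly's Watcher uses.

```python
from dataclasses import dataclass


@dataclass
class FailureRecord:
    agent: str      # agent the record applies to
    pattern: str    # lowercase substring that signals the known failure
    hint: str       # remedial hint to send back to the agent


def watch(trace, pep, interval=2):
    """Scan the trace every `interval` messages.

    trace: list of (agent_name, message) pairs in execution order.
    pep:   list of FailureRecord entries (the experience pool).
    Returns (checkpoint_index, agent, hint) tuples for each matched failure.
    Messages after the last full checkpoint are left unchecked in this sketch.
    """
    interventions = []
    checked = 0
    for i in range(1, len(trace) + 1):
        if i % interval:
            continue                      # not a checkpoint yet
        for agent, message in trace[checked:i]:
            for rec in pep:
                if rec.agent == agent and rec.pattern in message.lower():
                    interventions.append((i, agent, rec.hint))
        checked = i
    return interventions


trace = [
    ("Planner", "Split task into flights and hotels"),
    ("FlightAgent", "Cheapest option is over budget by 200 EUR"),
    ("HotelAgent", "Found 3 hotels"),
    ("Summarizer", "Drafting the itinerary"),
]
pep = [FailureRecord("FlightAgent", "over budget",
                     "Apply the budget constraint before proposing flights.")]
print(watch(trace, pep))
```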

5. On-the-Fly SOP Synthesis in Industrial Benchmarks

SOP-Bench provides a systematized methodology for generating, validating, and benchmarking SOPs in industrial settings:

Six-Stage Generation Pipeline: Domain Parameter Module (task/context extraction), Dataset-Schema Generator, SOP Document Generator, Synthetic Dataset Generator, API/ToolSpec Generator, Tool-Code Generator. Each module feeds its output forward to the next, which keeps the artifacts logically consistent and reduces hallucinations.
Complexity Injection: After artifact generation, SOPs are procedurally modified by adding redundant logic, distractor tools, or error branches to simulate real-world complexity.
Live Deployment Workflow: Generated tool specs are registered in the LLM’s function-calling registry, SOPs and code are passed to the agent, and the agent executes the workflow following procedural constraints, with results validated against smoke tests.
Probabilistic Formulation: The main procedure is sampled as a chain of steps $s_1 \to s_2 \to \cdots \to s_T$, each conditioned on the steps before it, and prompts are parameterized to ensure section presence and ordering.
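The complexity-injection step listed above, in its distractor-tool form, can be sketched as a registry merge. The function name, the registry layout (tool name mapped to a description), and the sample tools are assumptions for illustration and do not reflect SOP-Bench's actual implementation.

```python
import random


def inject_distractors(tools: dict, distractors: dict, n: int, seed: int = 0) -> dict:
    """Return a new registry with up to n distractor tools mixed in.

    The original registry is left untouched; a fixed seed keeps runs reproducible.
    """
    rng = random.Random(seed)
    names = rng.sample(sorted(distractors), k=min(n, len(distractors)))
    merged = dict(tools)
    for name in names:
        if name in merged:
            raise ValueError(f"distractor collides with real tool: {name}")
        merged[name] = distractors[name]
    return merged


registry = {"lookup_shipment": "Fetch a shipment record by id."}
distractors = {
    "lookup_shipment_v2": "Deprecated lookup that returns partial data.",
    "estimate_weight": "Guess a weight from package dimensions.",
}
augmented = inject_distractors(registry, distractors, n=1)
print(sorted(augmented))
```

Rejecting name collisions matters here: a distractor that silently overwrote a real tool would change the task rather than just make it harder.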

A key finding is that, on SOP-Bench’s benchmark of 1,800+ tasks, common agentic architectures such as ReAct and Function-Calling perform substantially below required industrial baselines (27% to 48% success rates), often misusing tools when distractors are present (Nandi et al., 9 Jun 2025). This highlights the necessity of domain-specific benchmarking and constraint-aware SOP instantiation for practical deployment.

6. Illustrative Example: Multi-Agent Itinerary Planning

Consider the MASFly pipeline instantiated for “Plan a cost-effective 3-day Paris itinerary including flights, hotels, and sightseeing” (Liu et al., 14 Feb 2026).

Need Analysis: LLM produces: “Flight booking info, Hotel search, Daily sightseeing schedule, Budget check.”
Top-K Retrieval: System selects travel and planner SOPs containing relevant agent roles.
SOP Instantiation: LLM integrates retrieved SOPs, yielding a team (Planner, FlightAgent, HotelAgent, ItineraryAgent, Summarizer) and a communication structure reflecting role assignments and tool calls.
Execution + Adaptation: The Watcher agent detects that the FlightAgent proposes over-budget options; using the PEP, it prompts for application of budget constraints, which corrects the output.
Distillation and Reflection: After successful execution, the derived SOP is generalized and added to the repository; new failures update the PEP.

This example demonstrates the iterative nature of on-the-fly SOP synthesis, runtime adaptation, and reflection, enabling systems to acquire reusable procedural knowledge over time.

7. Applications, Metrics, and Limitations

On-the-fly SOP synthesis underpins robust automation in multi-agent LLM systems, microservice incident response, and synthetic industrial benchmarking. Performance evaluation often focuses on end-to-end system metrics rather than direct SOP quality, e.g., location/type accuracy in root cause analysis (Pei et al., 12 Feb 2025) or task completion rates in planning benchmarks (Liu et al., 14 Feb 2026, Nandi et al., 9 Jun 2025). The accuracy gains reported for SOP-centric architectures over conventional “thought-action-observation” paradigms (e.g., ReAct) underscore the importance of dynamically composable, constraint-guided procedural scaffolding.

A plausible implication is that future research should prioritize richer failure-mode capture, online repository distillation, and tighter type/structure integration between SOPs and agent tool registries. This suggests that while LLMs excel at flexible generation, robust industrial integration demands hybrid pipelines combining structured prompting, hard validators, and feedback-driven adaptation.