RunAgent: Interpreting Natural-Language Plans with Constraint-Guided Execution

Published 1 May 2026 in cs.LG, cs.CL, and cs.MA | (2605.00798v1)

Abstract: Humans solve problems by executing targeted plans, yet LLMs remain unreliable for structured workflow execution. We propose RunAgent, a multi-agent plan execution platform that interprets natural-language plans while enforcing stepwise execution through constraints and rubrics. RunAgent bridges the expressiveness of natural language with the determinism of programming via an agentic language with explicit control constructs (e.g., IF, GOTO, FORALL). Beyond syntactic and semantic verification of the step output, which is performed based on the specific instruction of each step, RunAgent autonomously derives and validates constraints based on the description of the task and its instance at each step. RunAgent also dynamically selects among LLM-based reasoning, tool usage, and code generation and execution (e.g., in Python), and incorporates error correction mechanisms to ensure correctness. Finally, RunAgent filters the context history by retaining only relevant information during the execution of each step. Evaluations on the Natural-plan and SciBench datasets demonstrate that RunAgent outperforms baseline LLMs and state-of-the-art PlanGEN methods.

Authors (5)

Summary

The paper presents a novel hybrid agentic language that uses control constructs like IF, GOTO, and FORALL to enable deterministic execution of natural language plans.
It details a structured three-module architecture—Initialization, Compiler, and Executor—that dynamically selects between LLM reasoning, tool invocation, and code generation.
The framework demonstrates effective constraint extraction and enforcement, achieving an 81.1% exact match accuracy on Calendar Scheduling tasks.

Introduction

RunAgent addresses a persistent challenge in plan execution by LLM-driven agents: bridging the expressive flexibility of natural-language plans with the stepwise, deterministic, and verifiable execution typical of programming frameworks. Most tool-augmented LLM agents struggle to follow multi-step workflows reliably unless tightly constrained, and often fail to enforce constraints consistently or to adapt to dynamic exceptions within a plan. RunAgent proposes an agentic language with explicit control constructs (IF, GOTO, FORALL), a suite of verification and constraint generation methods, and a multi-module execution architecture that chooses dynamically among LLM reasoning, tool invocation, and code generation.

Figure 1: An overview of RunAgent, highlighting its three main modules.

Agentic Language and Control Constructs

A central innovation in RunAgent is its hybrid agentic language, designed to preserve the adaptability of natural language instructions while enabling deterministic execution through a limited set of reserved, compositional keywords. The IF, GOTO, and FORALL constructs support branching, jumps, and comprehensive iteration—addressing LLM limitations in reliably enumerating elements in a set or following conditional workflows. Each plan step can be explicitly prefixed with execution modality modifiers (LLM, PYTHON, TOOL), granting both users and the system granular control over execution pathways. Notably, RunAgent interprets these constructs directly, extracting and expanding steps and sub-steps as needed, rather than relying on the LLM to parse or enforce control flow.
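To make the idea of interpreting control flow outside the LLM concrete, here is a minimal sketch of a plan walker that keeps its own program counter. It is not RunAgent's implementation: the `Step` fields, the `run` function, and the `execute` and `holds` callbacks (which stand in for step execution and condition evaluation) are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class Step:
    text: str                         # natural-language instruction
    goto: Optional[int] = None        # unconditional GOTO target (1-based)
    cond: Optional[str] = None        # condition text of an IF step
    then_goto: Optional[int] = None   # jump target when the condition holds

def run(steps: List[Step],
        execute: Callable[[str], None],
        holds: Callable[[str], bool],
        max_iters: int = 100) -> List[int]:
    """Walk the plan with an explicit program counter and return the
    sequence of visited step numbers. The interpreter, not the LLM,
    decides which step runs next."""
    trace: List[int] = []
    pc = 1
    for _ in range(max_iters):        # guard against endless GOTO loops
        if pc > len(steps):
            break
        step = steps[pc - 1]
        trace.append(pc)
        execute(step.text)
        if step.cond is not None and step.then_goto is not None and holds(step.cond):
            pc = step.then_goto       # IF branch taken
        elif step.goto is not None:
            pc = step.goto            # unconditional jump
        else:
            pc += 1                   # fall through to the next step
    return trace

# Hypothetical three-step plan: retry step 1 while the slot conflicts.
plan = [
    Step("Propose a meeting slot"),
    Step("Check the slot", cond="the slot conflicts", then_goto=1),
    Step("Report the slot"),
]
answers = iter([True, False])         # stubbed condition results
executed: List[str] = []
trace = run(plan, execute=executed.append, holds=lambda cond: next(answers))
```

In this sketch the condition check is the only place where a model would need to be consulted; the jump itself is ordinary deterministic code.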

Architecture and Workflow

RunAgent’s execution pipeline is structured around three core modules: Initialization and Staging, Compiler, and Executor.

Initialization and Staging

This module registers tools—encapsulated as Python functions—and encodes their metadata for dynamic retrieval. Critically, it also aggregates and derives atomic constraints from both user input and the plan’s context using specialized LLM prompts. These constraints include both explicit requirements and implicit assumptions, each decomposed to ensure unambiguous satisfaction conditions.

Figure 2: The Initialization and Staging module responsible for tool registry setup and constraint generation.
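As a rough illustration of registering Python functions as tools together with retrievable metadata, the sketch below uses a decorator and the standard `inspect` module. The names `TOOL_REGISTRY`, `register_tool`, `find_tools`, and the example tool `overlap` are assumptions made for this example, and the keyword lookup is only a stand-in for whatever retrieval mechanism the paper actually uses.

```python
import inspect
from typing import Callable, Dict, List

# Hypothetical in-memory registry: tool name -> metadata.
TOOL_REGISTRY: Dict[str, dict] = {}

def register_tool(func: Callable) -> Callable:
    """Store the function with metadata derived from its signature and docstring."""
    TOOL_REGISTRY[func.__name__] = {
        "callable": func,
        "description": inspect.getdoc(func) or "",
        "parameters": list(inspect.signature(func).parameters),
    }
    return func

@register_tool
def overlap(a_start: int, a_end: int, b_start: int, b_end: int) -> bool:
    """Return True if two half-open time intervals [start, end) overlap."""
    return a_start < b_end and b_start < a_end

def find_tools(keyword: str) -> List[str]:
    """Naive retrieval: match a keyword against tool descriptions."""
    key = keyword.lower()
    return [name for name, meta in TOOL_REGISTRY.items()
            if key in meta["description"].lower()]

matches = find_tools("overlap")
```

A step that mentions overlapping intervals could then look up `overlap` by description and call it through `TOOL_REGISTRY["overlap"]["callable"]`.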

Compiler

The Compiler parses the natural language plan into a structured intermediate representation, detecting agentic keywords and generating the requisite internal control flow and sub-steps, particularly for FORALL and conditional IF branches. Each plan step is transformed into a machine-readable format with annotated reasoning hints for downstream execution.

Figure 3: The Compiler module parses plans, identifies keywords, and structures sub-steps and control flow.
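The following sketch shows one way a plan could be turned into a structured intermediate representation, with modality prefixes detected and a FORALL step expanded into sub-steps. The concrete surface syntax (`PYTHON:` prefixes, `FORALL x IN xs:`, the `<x>` placeholder) and the names `CompiledStep`, `compile_step`, and `compile_plan` are assumptions for illustration; the summary does not specify RunAgent's actual syntax or IR.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List, Optional

KEYWORDS = ("IF", "GOTO", "FORALL")

@dataclass
class CompiledStep:
    number: str                 # e.g. "1" or "1.2" for a sub-step
    modality: Optional[str]     # "LLM", "PYTHON", "TOOL", or None
    keywords: List[str]         # agentic keywords found in the step
    text: str
    substeps: List["CompiledStep"] = field(default_factory=list)

def compile_step(number: str, line: str, collections: Dict[str, list]) -> CompiledStep:
    modality = None
    prefix = re.match(r"^(LLM|PYTHON|TOOL):\s*(.*)$", line)
    if prefix:
        modality, line = prefix.group(1), prefix.group(2)
    keywords = [k for k in KEYWORDS if re.search(rf"\b{k}\b", line)]
    step = CompiledStep(number, modality, keywords, line)
    forall = re.match(r"^FORALL (\w+) IN (\w+):\s*(.*)$", line)
    if forall:
        var, coll, body = forall.groups()
        # Expand one sub-step per element so no element can be skipped.
        for i, item in enumerate(collections.get(coll, []), start=1):
            text = body.replace(f"<{var}>", str(item))
            step.substeps.append(CompiledStep(f"{number}.{i}", modality, [], text))
    return step

def compile_plan(plan: str, collections: Dict[str, list]) -> List[CompiledStep]:
    lines = [ln.strip() for ln in plan.strip().splitlines() if ln.strip()]
    return [compile_step(str(i), ln, collections) for i, ln in enumerate(lines, start=1)]

plan_text = """
PYTHON: FORALL person IN attendees: list the busy intervals of <person>
IF no common free slot exists, GOTO 1
LLM: report the earliest common free slot
"""
compiled = compile_plan(plan_text, {"attendees": ["Ana", "Bo"]})
```

Expanding FORALL at compile time means the enumeration is done by ordinary code, which is the point the text makes about LLMs being unreliable at iterating over every element of a set.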

Executor

The Executor orchestrates the plan execution, handling each step’s interpretation, modality selection, logging, context state maintenance, and constraint/rubric validation. The Interpret Step submodule determines appropriate execution pathways, leveraging the tool registry, dynamic code generation, or LLM direct calls as appropriate. Sanity checks, constraint enforcement, and rubric-based grading are performed at each step, with fallback mechanisms ensuring robustness if errors or constraint violations persist after a threshold number of retries.

Figure 4: The Executor module implements plan steps, context management, and constraint checking.
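The retry-then-fallback behaviour described for the Executor can be sketched as a small loop. This is an illustrative reading of the text, not the paper's code: `execute_with_retries`, `attempt`, `check`, and `fallback` are hypothetical names, and the stubs below replace what would be model or tool calls.

```python
import re
from typing import Callable, List, Tuple

def execute_with_retries(step: str,
                         attempt: Callable[[str, List[str]], str],
                         check: Callable[[str], List[str]],
                         fallback: Callable[[str, List[str]], str],
                         max_retries: int = 3) -> Tuple[str, List[str]]:
    """Run a step, feeding violations from failed attempts into the next try.
    After max_retries failed attempts, hand the step to the fallback."""
    feedback: List[str] = []
    for _ in range(max_retries):
        output = attempt(step, feedback)
        violations = check(output)
        if not violations:
            return output, feedback
        feedback.extend(violations)   # error-aware retry context
    return fallback(step, feedback), feedback

# Stubs standing in for model calls and validators.
def attempt(step: str, feedback: List[str]) -> str:
    return "25:00" if not feedback else "10:00"

def check(output: str) -> List[str]:
    m = re.fullmatch(r"(\d{2}):(\d{2})", output)
    if not m:
        return [f"bad format: {output}"]
    if int(m.group(1)) > 23 or int(m.group(2)) > 59:
        return [f"time out of range: {output}"]
    return []

def fallback(step: str, feedback: List[str]) -> str:
    return "UNRESOLVED"

result, log = execute_with_retries("Propose a start time", attempt, check, fallback)
stuck, stuck_log = execute_with_retries("Propose a start time",
                                        lambda s, f: "noon", check, fallback,
                                        max_retries=2)
```

Here the first attempt fails the range check, its violation message is passed to the second attempt, and the second attempt succeeds; the `stuck` call shows the fallback path when every attempt fails.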

Step Interpretation and Implementation

The Interpret Step module evaluates whether a step requires LLM response, code execution, or tool invocation. For code generation steps, a loop of code synthesis, validation by LLM-based code checking, and print-statement injection ensures generated code meets the specification before execution. Error states and constraint violations from each attempt are logged and injected into the prompt history for error-aware retries.

Figure 5: Interpret Step module decides execution modality and manages branching and iterative constructs.

The Implement Step module uses the provided call prompt to execute the step, verifies the output's format and semantics, performs constraint and rubric checking, and manages error-driven retries or fallback.

Figure 6: Implement Step executes the determined action, handling validity and retry logic.

Code Generation Flow

Code generation is iterative: the generated code is instrumented, analyzed by an LLM for syntax and other errors, and regenerated until it passes both structural and contextual checks.

Figure 7: Code generation algorithm iteratively synthesizes and validates code modules for execution steps.
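A simplified version of such a generate, validate, instrument, and execute loop might look as follows. Several things here are assumptions rather than details from the paper: `ast.parse` is swapped in for the LLM-based code checker, the `generate` callback is a stub for the code-synthesis model, and the instrumentation is reduced to appending one `print` of a variable named `result`.

```python
import ast
import contextlib
import io
from typing import Callable, List, Optional, Tuple

def generate_and_run(generate: Callable[[List[str]], str],
                     result_var: str = "result",
                     max_attempts: int = 3) -> Tuple[Optional[str], List[str]]:
    """Ask for code, reject it if it does not parse, append a print
    statement, run it, and return the captured output."""
    errors: List[str] = []
    for _ in range(max_attempts):
        source = generate(errors)            # errors so far guide the next attempt
        try:
            ast.parse(source)                # structural check (stand-in for an LLM checker)
        except SyntaxError as exc:
            errors.append(f"SyntaxError: {exc.msg}")
            continue
        instrumented = source + f"\nprint({result_var})\n"
        buffer = io.StringIO()
        try:
            with contextlib.redirect_stdout(buffer):
                # exec is unsafe for untrusted code; a real system would sandbox this.
                exec(instrumented, {})
        except Exception as exc:
            errors.append(f"{type(exc).__name__}: {exc}")
            continue
        return buffer.getvalue().strip(), errors
    return None, errors

# Stub generator: the first attempt is malformed, the second is valid.
def fake_generator(errors: List[str]) -> str:
    return "result = (1 + 2" if not errors else "result = sum(range(5))"

output, seen_errors = generate_and_run(fake_generator)
```

The captured standard output is what a later step would read, which is one plausible reason for injecting print statements in the first place.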

Constraint and Error Handling

Each step output is subject to atomic constraint checking using two-stage LLM validation, separating reasoning and final judgment to avoid ambiguous acceptance or spurious violations. Error correction amends the execution context with error rationales and constraint violation messages, steering further attempts to avoid repeated failures. If errors persist, the system reverts to LLM-based step completion to maintain plan progress.
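The two-stage check can be pictured as two separate model calls, one that produces reasoning and one that turns the reasoning into a one-word verdict. The sketch below is only a schematic: the prompt wording, the `check_constraint` function, and the `fake_llm` stub are invented for illustration and do not come from the paper.

```python
from typing import Callable, Tuple

def check_constraint(constraint: str, output: str,
                     llm: Callable[[str], str]) -> Tuple[bool, str]:
    """Stage 1 asks for free-form reasoning; stage 2 asks for a single
    verdict token based only on that reasoning."""
    reasoning = llm(
        f"Constraint: {constraint}\nOutput: {output}\n"
        "Explain step by step whether the output satisfies the constraint."
    )
    verdict = llm(
        f"Reasoning: {reasoning}\n"
        "Answer with exactly one word: SATISFIED or VIOLATED."
    )
    return verdict.strip().upper() == "SATISFIED", reasoning

# Deterministic stub so the example runs without a model.
def fake_llm(prompt: str) -> str:
    if prompt.startswith("Reasoning:"):
        return "VIOLATED" if "conflict" in prompt else "SATISFIED"
    return "conflict found" if "18:00" in prompt else "no issue"

ok, why_ok = check_constraint("Meeting ends by 17:00", "Meeting 15:00-16:00", fake_llm)
bad, why_bad = check_constraint("Meeting ends by 17:00", "Meeting 17:30-18:00", fake_llm)
```

Restricting the second call to a fixed vocabulary makes the verdict easy to parse, and the returned reasoning can be reused as the violation message for an error-aware retry.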

Empirical Evaluation

RunAgent was evaluated on the Natural-plan Calendar Scheduling and Trip Planning datasets, as well as the SciBench Stat, Calc, and Diff datasets. Baselines included GPT-4o and state-of-the-art PlanGEN methods. On Calendar Scheduling, RunAgent's stepwise execution with constraint validation yields the highest exact match (EM) accuracy (81.1%), exceeding PlanGEN BoN (best-of-N, 68.9%) and substantially surpassing all GPT-4o and Gemini compositional or plan-driven baselines. An ablation that removes constraint checking reduces RunAgent's EM to 75.4%, highlighting the material role of automatic constraint enforcement.

RunAgent consistently outperformed both direct LLM solution attempts and LLM-driven plan executions not backed by RunAgent’s verification and stepwise enforcement, especially on complex, multi-modal reasoning tasks. Performance degrades only gradually as problem complexity increases, with error analysis showing a further upward correction in accuracy after accounting for multiple viable solutions missed by strict exact matching.

Figure 8: Accuracy for different problem complexities in Calendar Scheduling, demonstrating robustness across difficulty.

Statistical analysis using bootstrapping and McNemar’s test confirmed a significant superiority for RunAgent vs. GPT-4o plan implementation (p-value $5.29 \times 10^{-56}$).

Figure 9: 95% confidence intervals for Calendar Scheduling, validating the significance of RunAgent’s performance gain.
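For readers unfamiliar with McNemar's test, it compares two systems on the same instances using only the discordant pairs: cases where exactly one system is correct. The sketch below computes the exact two-sided p-value from those two counts with the standard library. The counts used in the example are made up to show the mechanics; they are not the paper's data, and the summary does not say which variant of the test the authors used.

```python
from math import comb

def mcnemar_exact(b: int, c: int) -> float:
    """Exact two-sided McNemar p-value.
    b: instances only system A solved; c: instances only system B solved.
    Under the null hypothesis each discordant pair is a fair coin flip."""
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical discordant counts, for illustration only.
p_balanced = mcnemar_exact(5, 5)     # no evidence of a difference
p_one_sided = mcnemar_exact(10, 0)   # all discordant pairs favour one system
```

Because only discordant pairs enter the statistic, a very small p-value such as the one reported above indicates that one system wins far more of the disagreements than the other.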

Implications and Future Directions

The RunAgent framework demonstrates that agentic languages with minimal but robust control constructs, coupled with LLM-based dynamic constraint extraction and enforcement, yield substantial improvements in reliable plan execution. The practical implications are immediate for workflow automation, scientific inquiry, scheduling, and domains requiring mixed reasoning modalities. The generalized design, which separates plan generation from execution, uses modular LLM components, and supports human-in-the-loop (HITL) involvement, facilitates adaptation to domains demanding both flexibility and formal guarantees.

Theoretically, RunAgent exposes the limits of current LLMs in reliably following stepwise plans absent explicit architecture-level execution enforcement. This highlights the necessity of hybridized frameworks where agentic control constructs, dynamic tool registries, and constraint-derived verification complement LLM capabilities.

Future developments may target (1) extending the agentic language with richer control flow, (2) adaptive learning from user feedback and facts during plan execution, (3) tighter integration with external knowledge bases, and (4) end-to-end differentiable execution-planning architectures. Robust error propagation, runtime optimization across modalities, and more advanced context filtering for efficiency and privacy also remain open research directions.

Conclusion

RunAgent establishes a principled blueprint for constraint-guided, verifiable plan execution in LLM agentic systems. By integrating explicit control language, atomic constraint synthesis, and dynamic execution selection, it achieves verifiable reliability and superior empirical performance on benchmarked reasoning tasks. This delineates a clear pathway for future agent design: explicit structure, robust mid-execution verification, and adaptability anchored in a hybrid of natural and formal specification paradigms.
