RP-ReAct: Decoupled Reasoning & Execution

Updated 30 June 2026

RP-ReAct is a multi-agent framework that separates high-level planning (RPA) from micro-level tool execution (PEA), improving context management and trajectory stability.
The system utilizes iterative reasoning cycles and reactive execution with supervised fine-tuning and reinforcement learning to optimize plan accuracy and tool selection.
Benchmark evaluations demonstrate enhanced accuracy (+10–20 points) and efficiency, indicating its strong potential in enterprise automation, robotics, and code generation.

RP-ReAct (Reasoner Planner–ReAct) is a multi-agent framework for complex tool-augmented reasoning and execution, characterized by explicit architectural separation between high-level task decomposition and low-level iterative interaction with tools. The central innovation is the decoupling of strategic planning (reasoner-planner) from micro-level tool invocation (reactive executor), a paradigm applicable across enterprise automation, robotics, code generation, and general multi-tool LLM reasoning. The approach addresses key limitations of monolithic plan–execute agent designs, including poor generalization, token context overflow, and trajectory instability, by introducing modular supervision, global planning, and formal context management (Molinari et al., 3 Dec 2025, Wei et al., 13 Nov 2025, Liu et al., 9 Oct 2025).

1. Architectural Principles and Agent Roles

RP-ReAct systematically decomposes the agentic workflow along two axes: reasoning/planning and acting. At its core are two (or more) specialized agents:

Reasoner-Planner Agent (RPA):

Receives user tasks and, using a large reasoning model (LRM), generates an ordered sequence of abstract sub-questions or constructs a global plan. The RPA supervises the overall process, consuming the results of execution, validating expectations, and engaging in diagnosis and replanning upon observed failure or deviation (Molinari et al., 3 Dec 2025).

Proxy Execution Agent (PEA):

Receives subtasks from the RPA and executes them by dynamically interacting with tool APIs (e.g., SQL interpreters, code evaluators, search engines) via a ReAct-style loop—alternating between "Thought", "Action", "Observation" phases. PEAs are isolated from the global plan and focus solely on concrete micro-execution (Molinari et al., 3 Dec 2025, Liu et al., 9 Oct 2025).

Alternative forms include frameworks where the planner emits a global DAG over tools (Planner-Centric RP-ReAct; Wei et al., 13 Nov 2025), or where the pipeline is further specialized with agents for planning, searching, code generation, and extraction (RA-Gen; Liu et al., 9 Oct 2025).

This design ensures trajectory stability, prevents context overload in the reasoning agent, and supports both sequential and parallel tool use. In enterprise applications, this separation also simplifies compliance by making sensitive tool calls occur in a sandbox isolated from planning logic (Molinari et al., 3 Dec 2025).

2. Formal Workflow and Context Management

The execution pipeline in RP-ReAct is governed by tightly specified algorithms for both agent roles:

RPA (Supervision and Planning):

Iteratively constructs a plan by querying the LRM for the next abstract step, submits each to the PEA, and incorporates execution results. On failure, the RPA invokes self-diagnosis and replanning, generating corrective queries and updating its plan history (Molinari et al., 3 Dec 2025).
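The supervision loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `run_rpa`, `StepResult`, and the three callables (`next_step`, `execute`, `diagnose`) are hypothetical names standing in for the LRM query, the PEA call, and the self-diagnosis prompt, and the step and retry limits are assumed values.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class StepResult:
    """Outcome reported back by the executor for one sub-question (hypothetical type)."""
    ok: bool
    summary: str


History = List[Tuple[str, StepResult]]


def run_rpa(
    task: str,
    next_step: Callable[[str, History], Optional[str]],
    execute: Callable[[str], StepResult],
    diagnose: Callable[[str, StepResult, History], str],
    max_steps: int = 10,
    max_retries: int = 2,
) -> History:
    """Iteratively plan, delegate, and replan on failure.

    next_step: stand-in for querying the LRM; returns None when the plan is complete.
    execute:   stand-in for handing a sub-question to the PEA.
    diagnose:  stand-in for self-diagnosis; returns a corrective sub-question.
    """
    history: History = []
    for _ in range(max_steps):
        sub_question = next_step(task, history)
        if sub_question is None:
            break  # the planner considers the task solved
        result = execute(sub_question)
        retries = 0
        while not result.ok and retries < max_retries:
            # Record the failed attempt, then ask for a corrective query.
            history.append((sub_question, result))
            sub_question = diagnose(sub_question, result, history)
            result = execute(sub_question)
            retries += 1
        history.append((sub_question, result))
    return history
```

Keeping failed attempts in the history is a design choice of this sketch: it lets the planner stand-in see what has already been tried when it proposes the next step.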

PEA (ReAct Executor):

Maintains a scratchpad recording reasoning traces and tool observations. At each step, it emits a "Thought", chooses an "Action" (tool call), observes the result, and appends the resulting triplet to the scratchpad. Termination is triggered by an explicit finish signal in the action or the observation (Molinari et al., 3 Dec 2025).
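A ReAct-style executor of this kind might look like the sketch below. The names `run_pea` and `policy`, the `"finish"` action label, and the step cap are assumptions made for illustration; the sketch only handles a finish signal carried in the action, and the policy callable stands in for the model that produces each thought and tool choice.

```python
from typing import Callable, Dict, List, Optional, Tuple

# One scratchpad entry: (thought, action, observation).
Scratchpad = List[Tuple[str, str, str]]
# The policy returns (thought, tool_name, tool_argument).
Policy = Callable[[str, Scratchpad], Tuple[str, str, str]]


def run_pea(
    subtask: str,
    policy: Policy,
    tools: Dict[str, Callable[[str], str]],
    max_steps: int = 8,
) -> Tuple[Optional[str], Scratchpad]:
    """Run a Thought/Action/Observation loop for a single subtask.

    Returns (answer, scratchpad); answer is None if the step cap is hit
    before the policy emits the assumed "finish" action.
    """
    scratchpad: Scratchpad = []
    for _ in range(max_steps):
        thought, tool_name, tool_arg = policy(subtask, scratchpad)
        if tool_name == "finish":
            # Explicit finish signal: the argument carries the final answer.
            scratchpad.append((thought, "finish", tool_arg))
            return tool_arg, scratchpad
        tool = tools.get(tool_name)
        if tool is None:
            observation = f"error: unknown tool {tool_name!r}"
        else:
            observation = tool(tool_arg)
        scratchpad.append((thought, f"{tool_name}[{tool_arg}]", observation))
    return None, scratchpad
```

Note that the executor sees only its own subtask and scratchpad, never the global plan, which mirrors the isolation between roles.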

A key architectural element is context window management. When a PEA encounters a tool output whose token count exceeds a predefined threshold $T$, the result is truncated for in-context preview, and the full output is externalized to a variable in the execution environment. The RPA receives only a summary and a reference, avoiding catastrophic token overflow (Molinari et al., 3 Dec 2025). This is crucial for open-weight LRMs with restricted context sizes.
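The thresholding step can be illustrated with a small helper. This is a sketch under simplifying assumptions: whitespace-separated words serve as a crude proxy for model tokens, the execution environment is a plain dictionary, and the names `externalize_if_large`, `var_prefix`, and `preview_tokens` are hypothetical.

```python
from typing import Dict


def externalize_if_large(
    output: str,
    env: Dict[str, str],
    threshold_tokens: int,
    var_prefix: str = "tool_out",
    preview_tokens: int = 50,
) -> str:
    """Return text that is safe to place in context.

    Outputs at or under the threshold pass through unchanged. Larger outputs
    are stored in full in `env` under a fresh variable name, and only a
    truncated preview plus a reference to that variable is returned.
    """
    tokens = output.split()  # crude token proxy; a real tokenizer would differ
    if len(tokens) <= threshold_tokens:
        return output

    # Pick a variable name that is not already used in the environment.
    index = len(env)
    while f"{var_prefix}_{index}" in env:
        index += 1
    var_name = f"{var_prefix}_{index}"
    env[var_name] = output

    preview = " ".join(tokens[:preview_tokens])
    return (
        f"{preview} ... [truncated; {len(tokens)} tokens total; "
        f"full output stored in variable {var_name!r}]"
    )
```

A later tool call can then read the stored variable from the environment directly, so the full output never has to pass through either agent's context.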

In planner-centric variants, the global plan is represented as a Directed Acyclic Graph (DAG) over tools, enabling parallel execution and efficient aggregation of results (Wei et al., 13 Nov 2025).
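To make the parallelism concrete, the sketch below executes a DAG plan in batches of mutually independent nodes using only the Python standard library. The plan encoding (a mapping from each node to its predecessors), the convention that each tool receives a dictionary of its predecessors' results, and the name `execute_plan` are assumptions for illustration rather than the format used in the cited work.

```python
from concurrent.futures import ThreadPoolExecutor
from graphlib import CycleError, TopologicalSorter
from typing import Any, Callable, Dict, Set


def execute_plan(
    deps: Dict[str, Set[str]],
    tools: Dict[str, Callable[[Dict[str, Any]], Any]],
    max_workers: int = 4,
) -> Dict[str, Any]:
    """Run every node of a DAG plan, in parallel where dependencies allow.

    deps:  node -> set of predecessor nodes whose results it consumes.
    tools: node -> callable taking {predecessor: result} and returning a result.
    """
    sorter = TopologicalSorter(deps)
    try:
        sorter.prepare()
    except CycleError as exc:
        raise ValueError("plan contains a cycle and is not a DAG") from exc

    results: Dict[str, Any] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while sorter.is_active():
            # All nodes returned here have every predecessor already finished.
            ready = list(sorter.get_ready())
            futures = {
                node: pool.submit(
                    tools[node], {p: results[p] for p in deps.get(node, ())}
                )
                for node in ready
            }
            for node, future in futures.items():
                results[node] = future.result()
                sorter.done(node)
    return results
```

This version waits for a whole batch before starting the next one, which is simple but not maximally parallel; a scheduler that marks nodes done as each future completes would start downstream work sooner.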

3. Learning and Optimization Strategies

RP-ReAct frameworks leverage both supervised and reinforcement learning protocols to enhance planning quality:

Supervised Fine-Tuning (SFT):

The planner LLM is fine-tuned on a dataset of (query, plan) pairs (e.g., the ComplexTool-Plan dataset in (Wei et al., 13 Nov 2025)), optimizing the likelihood of emitting the correct execution DAG.

Reinforcement Learning (GRPO):

After SFT, Group Relative Policy Optimization (GRPO) is applied on a filtered hard set, maximizing hierarchical rewards—penalizing syntax errors, cycles, disconnected plans, and incentivizing correct tool/edge selection (Wei et al., 13 Nov 2025).
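A hierarchical reward of the kind described can be sketched as a series of gates followed by a graded score. Everything specific here is an assumption for illustration: the JSON plan schema (`nodes`, `edges`), the penalty values, the equal weighting of node and edge F1, and the function names. The last function shows the group-relative normalization commonly associated with GRPO, in which each sampled plan's reward is standardized against the other samples for the same query.

```python
import json
from statistics import mean, pstdev
from typing import List, Set, Tuple

Edge = Tuple[str, str]


def _f1(pred: set, gold: set) -> float:
    if not pred and not gold:
        return 1.0
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall)


def _has_cycle(nodes: Set[str], edges: Set[Edge]) -> bool:
    # Kahn's algorithm: a cycle exists if some node never reaches in-degree 0.
    indegree = {n: 0 for n in nodes}
    for _, v in edges:
        indegree[v] += 1
    queue = [n for n, d in indegree.items() if d == 0]
    visited = 0
    while queue:
        node = queue.pop()
        visited += 1
        for u, v in edges:
            if u == node:
                indegree[v] -= 1
                if indegree[v] == 0:
                    queue.append(v)
    return visited != len(nodes)


def _is_connected(nodes: Set[str], edges: Set[Edge]) -> bool:
    # Weak connectivity: ignore edge direction.
    if not nodes:
        return False
    adjacent = {n: set() for n in nodes}
    for u, v in edges:
        adjacent[u].add(v)
        adjacent[v].add(u)
    start = next(iter(nodes))
    stack, seen = [start], {start}
    while stack:
        for neighbor in adjacent[stack.pop()]:
            if neighbor not in seen:
                seen.add(neighbor)
                stack.append(neighbor)
    return seen == nodes


def plan_reward(plan_text: str, gold_nodes: Set[str], gold_edges: Set[Edge]) -> float:
    """Gate on validity first, then score tool and edge selection (assumed weights)."""
    try:
        plan = json.loads(plan_text)
        nodes = set(plan["nodes"])
        edges = {(u, v) for u, v in plan["edges"]}
    except (ValueError, KeyError, TypeError):
        return -1.0  # syntax or schema error
    if any(u not in nodes or v not in nodes for u, v in edges):
        return -1.0  # edge refers to an undeclared tool
    if _has_cycle(nodes, edges):
        return -0.5
    if not _is_connected(nodes, edges):
        return -0.25
    return 0.5 * _f1(nodes, gold_nodes) + 0.5 * _f1(edges, gold_edges)


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Standardize rewards within one group of samples for the same query."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Ordering the checks from structural validity to selection quality means a plan only earns graded credit once it is parseable, acyclic, and connected, which is one way to read "hierarchical" in this context.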

In the code generation setting (RA-Gen), control over agent behavior and safety is augmented by user-settable constraints (e.g., maximum search depth, allowed tools), and by explicit static validation (e.g., CodeQL checks) before code release (Liu et al., 9 Oct 2025).

4. Benchmarking and Quantitative Performance

RP-ReAct frameworks have been evaluated on challenging multi-tool benchmarks:

ToolQA (Molinari et al., 3 Dec 2025):

Complex, multi-domain question answering requiring up to 13 tool functions per instance. RP-ReAct achieves superior accuracy—particularly on hard tasks requiring deep, sequential reasoning (+10–20 points on large models)—and demonstrates 50% lower accuracy standard deviation across models compared to ReAct or Reflexion baselines. Gains stem from reduced "trajectory drift" and robust context handling.

StableToolBench (Wei et al., 13 Nov 2025):

In a planner-centric RP-ReAct with the Qwen3-8B (RL) planner and GPT-4o executor, the framework yields 59.8% Solvable Pass Rate (SoPR), whereas GPT-4 ReAct achieves only 48.2%. RP-ReAct also requires significantly fewer inference steps (2.29 avg.) than iterative ReAct (~4.2 avg.), reflecting the efficiency of global planning.
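As a quick check on the figures quoted above, the gaps work out as follows. The ReAct step count is itself approximate and the two systems use different underlying models, so these are indicative differences rather than controlled comparisons.

$$
\Delta_{\text{SoPR}} = 59.8 - 48.2 = 11.6 \text{ percentage points}, \qquad \frac{11.6}{48.2} \approx 0.24 \;\; (\text{about } 24\% \text{ relative improvement})
$$

$$
1 - \frac{2.29}{4.2} \approx 0.45 \;\; (\text{about } 45\% \text{ fewer inference steps on average})
$$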

SVEN (Security/Code Generation) (Liu et al., 9 Oct 2025):

RA-Gen, an RP-ReAct instance, attains a 94.8% security rate and 95.8% pass rate on vulnerability-patching tasks. It outperforms base GPT-4 and Gemini 1.0 Pro, with explicit traceability for every step.

A common finding is that RP-ReAct requires LRMs with substantial base reasoning capacity; models below 7–8B parameters do not reliably solve complex benchmarks (Molinari et al., 3 Dec 2025).

5. Theoretical Foundations and Extensions

Early formalizations of the RP-ReAct paradigm are grounded in high-level transition models and reactive policies with planning (Saribatur et al., 2016). In these, execution proceeds through (a) a reasoner selecting subgoals based on clustered abstract states, and (b) an integrated planner computing plans to reach these subgoals. Theoretical analysis encompasses the soundness and completeness of plan synthesis, complexity (PSPACE in general), and properties of state clustering.
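The two-stage scheme, in which a reasoner picks a subgoal from the current abstract state and a planner then searches for a way to reach it, can be illustrated with a toy sketch. This is not the formalism of the cited work: the cluster function, the per-cluster target predicates, the breadth-first planner, and all names below are assumptions chosen to keep the example small.

```python
from collections import deque
from typing import Callable, Dict, Hashable, Iterable, List, Optional, Tuple

State = Hashable
Successors = Callable[[State], Iterable[Tuple[str, State]]]


def bfs_plan(
    start: State,
    is_goal: Callable[[State], bool],
    successors: Successors,
    max_depth: int = 50,
) -> Optional[List[str]]:
    """Shortest action sequence from start to a goal state, or None if none is found."""
    frontier = deque([(start, [])])
    visited = {start}
    while frontier:
        state, path = frontier.popleft()
        if is_goal(state):
            return path
        if len(path) >= max_depth:
            continue
        for action, nxt in successors(state):
            if nxt not in visited:
                visited.add(nxt)
                frontier.append((nxt, path + [action]))
    return None


def reactive_step(
    state: State,
    cluster_of: Callable[[State], str],
    targets: Dict[str, Callable[[State], bool]],
    successors: Successors,
) -> Optional[List[str]]:
    """Reasoner: map the state to its cluster and look up that cluster's subgoal.
    Planner: search for a concrete plan that reaches the subgoal.
    Returns None when the cluster has no subgoal or no plan exists.
    """
    cluster = cluster_of(state)
    is_goal = targets.get(cluster)
    if is_goal is None:
        return None
    return bfs_plan(state, is_goal, successors)
```

For instance, on a corridor of positions 0 to 9 split into a "left" cluster (positions below 5) and a "right" cluster, the reasoner's target for "left" can simply be "reach position 5", after which the "right" cluster's own target takes over. The reasoner never has to consider individual positions, only clusters.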

Later systems add online learning, partial observability, and multi-agent configurations. For example, RAE+UPOM+Learning interleaves deliberative acting and online planning with operational models, using Monte Carlo Tree Search and domain-learned heuristics to optimize real-time acting efficiency (Patra et al., 2020).

Controllability, safety, and interpretability are reinforced in code generation contexts by explicit agent modularity, transparent reasoning traces, and static verification gates (Liu et al., 9 Oct 2025).

6. Limitations and Future Trajectories

Current RP-ReAct variants are subject to several limitations:

Generalization is bounded by available datasets (ToolQA/ComplexTool-Plan) and the absence of broader, more diverse benchmarks (e.g., OfficeBench, Mint) (Molinari et al., 3 Dec 2025).
No supervised or reinforcement tuning has yet been applied to RPA and PEA in the original enterprise setting, leaving potential gains untapped (Molinari et al., 3 Dec 2025).
Context threshold and model temperature require further systematic exploration; integration of more advanced summarization and memory mechanisms is a prospective extension (Molinari et al., 3 Dec 2025).
Robustness with small models and real-time multi-agent deployments remains an open area, as does quantitative evaluation against traditional agent control architectures (Puerta-Merino et al., 17 Jan 2025).
Global plans may become intractably complex for very large toolsets without further abstraction, pruning strategies, or hierarchical decomposition.

Research trends indicate future directions including symbolic-abstraction refinement, richer plan forms (e.g., contingent, hierarchical), multi-agent orchestration, and explicit learning/refinement of reasoning and planning components over time (Saribatur et al., 2016, Wei et al., 13 Nov 2025).

7. Cross-Domain Relevance

RP-ReAct's architecture, which decouples global reasoning from local execution, has been instantiated broadly: from complex enterprise automation (Molinari et al., 3 Dec 2025), to multi-agent code generation with formal safety guarantees (Liu et al., 9 Oct 2025), to NPC control in simulated environments (Puerta-Merino et al., 17 Jan 2025), to planner-centric tool reasoning with parallel execution over DAGs (Wei et al., 13 Nov 2025), and foundationally to high-level policy synthesis in action languages (Saribatur et al., 2016). A plausible implication is that this modular decoupling is emerging as a standard design pattern in LLM-augmented agentic reasoning, especially under the context, compliance, and generalization constraints endemic to complex real-world domains.