
Reasoner-Planner Agent (RPA) Framework

Updated 21 April 2026
  • Reasoner-Planner Agent (RPA) is a modular agent architecture that decouples strategic reasoning and planning from low-level execution, mitigating trajectory instability and easing context management.
  • It employs hierarchical control with formal planning cycles, leveraging large reasoning models to generate subgoals, evaluate outcomes, and re-plan when needed.
  • RPA frameworks demonstrate improved task accuracy and reduced performance variance in complex multi-tool environments, supported by robust context and memory management techniques.

A Reasoner-Planner Agent (RPA) is a computational agent architecture that explicitly separates strategic reasoning and planning from low-level execution, enabling effective multi-step problem decomposition, reliable tool use, and robust adaptation to complex, multi-domain environments. This separation addresses fundamental issues—such as trajectory instability and context management—found in monolithic or purely reactive agent designs. RPA traces its roots to established concepts in AI planning, LLM-driven control, cognitive science, and formal logic, but its modern incarnations leverage large reasoning models (LRMs) or LLMs to supervise, interpret, and adapt agent behavior in real time (Molinari et al., 3 Dec 2025).

1. Architectural Principles and Modularization

Contemporary RPA frameworks, exemplified by the RP-ReAct architecture, enforce a hierarchical division: the RPA is responsible for global strategy, subgoal formulation, outcome analysis, and re-planning, while a Proxy-Execution Agent (PEA) or equivalent executor is tasked with translating subgoals into atomic tool calls, often via a local ReAct loop (Think → Action → Observe). The RPA interprets user intentions, decomposes them into discrete sub-questions, and leverages high-capacity LRMs for both planning and evaluation. Data flow in RP-ReAct is mediated by standardized query/result wrappers and centralized context/state management (Molinari et al., 3 Dec 2025).
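To make this data flow concrete, the following Python sketch models the standardized query/result wrappers exchanged between the RPA and the PEA. The class and field names are illustrative assumptions for exposition; the source does not publish a concrete schema.

```python
from dataclasses import dataclass, field

# Hypothetical wrappers for the RPA <-> PEA data flow described above.
# All names are illustrative, not the paper's actual API.

@dataclass
class SubQuery:
    """Structured sub-question emitted by the strategic RPA."""
    goal: str                     # natural-language subgoal
    expected_outcome: str         # what the RPA will check the result against
    context_refs: list[str] = field(default_factory=list)  # handles like "var_0"

@dataclass
class SubResult:
    """Wrapped outcome returned by the Proxy-Execution Agent (PEA)."""
    query: SubQuery
    preview: str                  # truncated tool output kept in context
    storage_handle: str | None    # pointer to the full payload, e.g. "var_1"
    raw_success: bool             # executor-level success flag
```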

Within modular multi-agent systems, the RPA is functionally analogous to System 2 (“deliberative reasoning”) in dual-process models: deliberative and planning actions—including tool orchestration, explicit environment modeling, and goal refinement—occur asynchronously or in parallel to rapid System 1 (“Talker” or conversational) agents, as with the Talker-Reasoner architecture (Christakopoulou et al., 2024). This decoupling enables low-latency user interaction alongside sophisticated background state tracking and adaptation.

2. Control Algorithms and Reasoning Workflow

The core operational workflow of an RPA centers on a deliberative control loop. In the RP-ReAct framework, the agent executes up to $\mathrm{MAX\_SEARCH} = 10$ planning cycles, with each cycle comprising the following operations:

  • Generation of the next sub-question via the LRM;
  • Emission of a structured query to the PEA;
  • Awaiting the sub-result $r$ and evaluating it with a success classifier:

$$\text{Success}(r) = \begin{cases} 1 & \text{if the LRM judges that } r \text{ matches expectations} \\ 0 & \text{otherwise} \end{cases}$$

  • If unsuccessful, invoking the LRM to replan via a loss-minimization objective over the failure trace:

$$\arg\min_{\pi} \mathcal{L}(\pi \mid \text{failure\_trace}), \qquad \mathcal{L} = \sum_{t \in \text{trace}} \bigl[1 - \text{Success}(r_t)\bigr]$$

This process tightly integrates high-level strategy (plan/subgoal generation, self-monitoring, re-planning) with low-level, context-sensitive action execution and observation cycles in the executor domain (Molinari et al., 3 Dec 2025, Chen et al., 2024).
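A minimal Python sketch of this control loop is given below. Here `llm_plan`, `llm_judge`, and `pea_execute` are placeholder callables standing in for the LRM planning call, the LRM-based success classifier, and the PEA's ReAct executor; the control details (e.g., when the failure trace is cleared) are illustrative assumptions, not the paper's specification.

```python
MAX_SEARCH = 10  # planning-cycle budget from RP-ReAct

def rpa_control_loop(user_goal, llm_plan, llm_judge, pea_execute):
    """Deliberative RPA loop: plan a sub-question, delegate it,
    judge the result, and re-plan on failure (illustrative sketch)."""
    failure_trace = []
    state = {"goal": user_goal, "results": []}
    for _ in range(MAX_SEARCH):
        # 1. The LRM generates the next sub-question from the goal and state.
        sub_question = llm_plan(state, failure_trace)
        if sub_question is None:          # planner signals the goal is satisfied
            return state["results"]
        # 2. Emit a structured query to the PEA and await the sub-result.
        result = pea_execute(sub_question)
        # 3. Success(r) in {0, 1}: the LRM judges whether r matches expectations.
        if llm_judge(sub_question, result):
            state["results"].append(result)
            failure_trace.clear()         # assumed: fresh slate after success
        else:
            # 4. Record the failure so the next llm_plan call can re-plan
            #    against it (loss minimization over the failure trace).
            failure_trace.append((sub_question, result))
    return state["results"]               # budget exhausted
```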

Alternative instantiations incorporate reinforcement learning (RL) controllers (e.g., HYDRA (Ke et al., 2024)), Monte-Carlo Tree Search for path optimization (Hao et al., 2023), or logic-programmed modules for epistemic state tracking (Burigana et al., 2020). The key is modular, stepwise, and potentially model-based reasoning, with explicit checkpoints for error correction and plan revision.

3. Context and Memory Management

A defining challenge for RPA design is efficient management of context under strict window limits, especially when using on-premise or privacy-constrained models. The RP-ReAct system formalizes context-saving as follows:

  • A truncation threshold $T$ is imposed on tool outputs; only the leading $T$ tokens are retained in context.
  • Full tool outputs are offloaded to external storage variables (e.g., $\text{var}_0$), referenced by handles passed back to the RPA/PEA.
  • The executing agent returns a response pairing a preview and a pointer to the full data:

$$\text{preview}(o) = o_{1:T}, \quad \text{store}(o) \to \text{var}_i, \quad \text{PEA\_response} = \bigl[\text{preview}(o),\ \text{``see } \text{var}_i\text{''}\bigr]$$

This context management procedure is pivotal for ensuring agentic stability and minimizing state drift, and is integral to robust operation with multi-modal tools and large result payloads.
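A minimal sketch of this truncate-and-offload scheme follows, assuming a simple in-memory store and whitespace tokenization; both are illustrative simplifications, and the threshold value is invented for the example.

```python
T = 256  # truncation threshold in tokens (value is illustrative)

_store: dict[str, str] = {}   # external storage for full tool outputs

def offload(output: str, var_index: int) -> tuple[str, str]:
    """Keep only the leading T tokens in context; stash the full
    output under a handle like 'var_0' for later retrieval."""
    tokens = output.split()                 # crude whitespace tokenizer
    preview = " ".join(tokens[:T])
    handle = f"var_{var_index}"
    _store[handle] = output                 # full payload leaves the context
    return preview, handle

def pea_response(output: str, var_index: int) -> str:
    """Response pairing the preview with a pointer to the full data."""
    preview, handle = offload(output, var_index)
    return f"{preview}\n[truncated; see {handle}]"
```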

4. Model Integration and Specialization

In practical deployments, RPAs operate atop open-weight LRM backends, with agent-specific specialization:

  • The strategic RPA module uses large models (e.g., gpt-oss-120B, Qwen3-32B) configured for reasoning fidelity (temperature 0.6, TopP 1.0), limited step horizons, and robust feedback evaluation.
  • Executor PEAs may employ identically sized or slightly downsized architectures, optimized for tool use and context discipline.
  • Fine-tuning and LoRA adapters enable on-device and cross-domain adaptation, with performance retention demonstrated through weight merging and task-specific evaluation (Chen et al., 2024).

Architectural choices (model scale, context/temperature scheduling, few-shot vs. fine-tuned prompting) critically affect reliability, latency, and generalization.
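As an illustration, these per-agent choices might be expressed as a configuration like the one below. The RPA decoding values (temperature 0.6, top-p 1.0) and the 10-cycle planning budget come from the text above; the dictionary layout and everything under "pea" are assumptions for exposition.

```python
# Hypothetical per-agent configuration for an RP-ReAct-style deployment.
AGENT_CONFIGS = {
    "rpa": {                        # strategic reasoner-planner
        "model": "gpt-oss-120b",    # e.g. gpt-oss-120B or Qwen3-32B
        "temperature": 0.6,
        "top_p": 1.0,
        "max_planning_cycles": 10,
    },
    "pea": {                        # proxy-execution agent (executor)
        "model": "Qwen3-32B",       # same-size or slightly smaller backbone
        "temperature": 0.2,         # assumed: executor settings are not reported
        "top_p": 1.0,
        "truncation_threshold_T": 256,  # illustrative value of T
    },
}
```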

5. Evaluation Benchmarks, Metrics, and Empirical Results

RPAs are typically benchmarked on orchestration-heavy multi-tool environments, the standard being ToolQA (Molinari et al., 3 Dec 2025). Key metrics include:

  • Task accuracy ($Acc$) on both easy and hard subsets;
  • Model-robust standard deviation ($Std$) of scores across multiple backbone models;
  • Saturation ($\mathrm{Sat}_a$) and a Combined Performance Score ($CPS$).

Empirical findings show that on hard tasks, RP-ReAct achieves up to +20 percentage point improvements over leading single-agent ReAct baselines, especially in context-poor or multi-domain settings. Standard deviation in performance is nearly halved relative to conventional approaches (0.12 vs. 0.26 on hard tasks). Notably, step-limit ablations confirm that merely increasing trajectory length in monolithic agents yields minimal gains, whereas explicit strategic planning in the RPA critically boosts success rates (Molinari et al., 3 Dec 2025).
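For concreteness, the cross-model robustness comparison can be computed as in the following sketch; the function name and the example scores are invented for illustration, not the paper's data.

```python
from statistics import mean, pstdev

def cross_model_robustness(scores_by_backbone: dict[str, float]) -> tuple[float, float]:
    """Mean task accuracy and its standard deviation across backbone
    models -- the Acc / Std pair used to compare agent frameworks."""
    scores = list(scores_by_backbone.values())
    return mean(scores), pstdev(scores)

# Illustrative call with invented scores (not the paper's numbers):
acc, std = cross_model_robustness({"model_a": 0.61, "model_b": 0.55, "model_c": 0.48})
```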

Multi-domain, on-device RPAs (Octo-planner) similarly demonstrate >97% in-domain accuracy with quantifiable trade-offs as LoRA domains are merged, supporting modular adaptation (Chen et al., 2024). Visual reasoning frameworks (HYDRA) show gains in compositional inference and cross-benchmark generalization, with RL-driven control outperforming purely prompt-based compositional strategies (Ke et al., 2024).

6. Limitations, Failure Modes, and Research Directions

Current RPA instantiations manifest distinct limitations:

  • Model scale dependency: performance collapses with sub-10B parameter models, due to unstable plan formation and premature answer emission (Molinari et al., 3 Dec 2025).
  • Single-RPA/single-executor architectures predominate; multi-executor (multi-PEA) configurations remain unexplored.
  • Context management relies on a static truncation threshold ($T$) rather than adaptive summarization.
  • Post-training techniques (SFT, RL) are absent from many pipelines; reported results are often zero-shot or few-shot baselines.
  • Latency, resource use, and tool integration complexities impose practical barriers to real-time and resource-constrained deployment, particularly in mobile and privacy-critical environments (Chen et al., 2024, Molinari et al., 3 Dec 2025).
  • RL controller capacity and LLM error propagation remain open problems in feedback-driven architectures (e.g., HYDRA) (Ke et al., 2024).
  • No formal large-scale user studies have yet evaluated human-interactive RPA systems in real-world contexts; existing demonstrations are qualitative (Christakopoulou et al., 2024).

Future research avenues include dynamic context scheduling, agent-specific decoding policies, application to broader benchmarks (e.g., OfficeBench, Mint), post-training for redundancy minimization, orchestration of hierarchical/multi-PEA networks, and advanced environment-specific world modeling (Molinari et al., 3 Dec 2025, Chen et al., 2024, Dinh et al., 2024).

7. Theoretical Underpinnings and Domain Extensions

RPAs formalize multi-step reasoning/planning as Markov Decision Processes (MDPs), with explicit decomposition into world-state tracking, action selection, and reward optimization (Hao et al., 2023, Dinh et al., 2024). Realizations span LLM-driven tree search (RAP), PDDL/UP hybridization (LLM Reasoner + Planner), symbolic epistemic state tracking (ASP-based epistemic planning with full modal/μ-calculus semantics), and reinforcement-learned operator selection (PRIMA's logic-based multi-task RL framework) (Lyu et al., 2022, Burigana et al., 2020).
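Concretely (stated here in generic MDP notation, which is standard rather than specific to any cited paper), the agent optimizes a policy $\pi$ over a tuple $\mathcal{M} = (S, A, P, R, \gamma)$ of states, actions, transitions, rewards, and discount factor:

$$\pi^{*} = \arg\max_{\pi}\ \mathbb{E}_{\pi}\!\left[\sum_{t \ge 0} \gamma^{t}\, R(s_t, a_t)\right]$$

where states track the world or epistemic situation, actions correspond to subgoals or tool calls, and the reward encodes task success.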

Across these paradigms, the unifying characteristics are modularization of logic, planning, and execution; reusability across new domains (by swapping out goal schemas and domain definitions); and preservation of interpretability and soundness via explicit, human-auditable state, plan, and memory representations.

In summary, the Reasoner-Planner Agent framework provides a scalable, modular, and empirically validated approach to orchestrated reasoning, decomposed planning, and reliable execution in complex, realistic environments. It advances both the efficiency and correctness of multi-step agents—bridging high-level strategy and practical tool chains—across classical, LLM-driven, and hybrid symbolic–neural domains.
