Planner–Coder–Critic Pipeline
- Planner–Coder–Critic Pipeline is a unified architectural paradigm that integrates symbolic planning, code-based policy generation, and critical feedback into an interactive decision-making loop.
- It combines reinforcement learning, formal action semantics, and RLHF to iteratively refine executable plans, significantly boosting sample efficiency and robustness.
- Empirical evaluations highlight notable improvements, including up to 91% reduction in query cost and a 23% increase in success rates compared to conventional methods.
A Planner–Coder–Critic Agentic Pipeline is a unified architectural paradigm that fuses structured planning, code-based policy refinement, and critical feedback mechanisms into an interactive loop for sequential decision-making, reinforcement learning, and complex task automation. This design coordinates symbolic reasoning, experiential learning, and human or automated critique to accelerate learning, improve sample efficiency, and foster robustness in dynamic or uncertain environments. Building on rigorous foundations in both classical planning and reinforcement learning, it integrates modules that generate, execute, and iteratively improve plans or policies through structured knowledge, code-based modeling, and feedback incorporation.
1. Architecture and Core Components
A Planner–Coder–Critic pipeline decomposes the agent’s cognition into three synergistic modules:
- Planner: Generates initial, often high-level or symbolic, plans leveraging formal domain knowledge, LLMs, or combinatorial search over structured representations such as action languages.
- Example: The PACMAN framework uses a symbolic planner over formal action descriptions (e.g., the action language ℬC) to produce goal-directed plans, employing sample-based planning that uses the current stochastic policy as an oracle (Lyu et al., 2019, Lyu et al., 2019).
- In CoPiC, LLMs generate multiple planning programs—instances of code that produce or refine candidate action sequences conditioned on task and environmental context (Tian et al., 23 Sep 2025).
- Coder: Converts (or directly produces) executable representations, such as Python functions or code blocks, which are evaluated or executed in the environment, possibly iteratively refined using feedback or interaction histories.
- In AdaPlanner, the coder is instantiated via code-style LLM prompt structures; the LLM generates “solution” functions in Python with assertions as checkpoints (Sun et al., 2023).
- In CP-Agent, the agent employs a persistent IPython kernel to iteratively write, debug, and verify code for constraint programming, dynamically refining its model with execution feedback (Szeider, 10 Aug 2025).
- Critic: Evaluates the quality and future utility of candidate policies or plans using value functions, learned reward models, or external human feedback, and provides corrective signals to the planner or coder.
- PACMAN allows human feedback signals to override the TD error, effectively treating the feedback $f_t$ as an alternative advantage estimate in the actor–critic update $\theta \leftarrow \theta + \alpha_\theta\, f_t\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)$ (Lyu et al., 2019).
- In CoPiC, a domain-adaptive critic (trained via RL with LoRA) scores LLM-generated planning programs for expected long-term reward alignment, selecting the optimal candidate for execution (Tian et al., 23 Sep 2025).
- PlanCritic formalizes critique using an RLHF-trained reward model to iteratively mutate and optimize PDDL constraint sets via a genetic algorithm, maximizing adherence to human preferences (Burns et al., 30 Nov 2024).
This modular segregation facilitates a bidirectional, interactive learning loop: plans are synthesized, executed (or simulated), critiqued, and refined, with each module leveraging different modalities (symbolic, code, behavioral feedback).
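A minimal, framework-agnostic sketch of this loop is given below. The `Planner`/`Coder`/`Critic` callables and the `Env` interface are hypothetical placeholders for illustration, not the APIs of any cited system.

```python
from dataclasses import dataclass
from typing import Callable, List, Protocol

class Env(Protocol):
    def execute(self, program: str) -> dict: ...  # returns an execution log

@dataclass
class Candidate:
    plan: str          # high-level plan (symbolic or natural language)
    program: str       # executable code realizing the plan
    score: float = 0.0 # critic's estimate of long-term utility

def planner_coder_critic_loop(
    plan_fn: Callable[[str, List[dict]], List[str]],    # Planner: task + history -> candidate plans
    code_fn: Callable[[str], str],                       # Coder: plan -> executable program
    critic_fn: Callable[[str, str, List[dict]], float],  # Critic: (task, program, history) -> score
    env: Env,
    task: str,
    max_rounds: int = 5,
) -> dict:
    """Synthesize, critique, execute, and refine until the task succeeds or rounds run out."""
    history: List[dict] = []
    for _ in range(max_rounds):
        # 1. Planner proposes candidate plans conditioned on the task and interaction history.
        candidates = [Candidate(plan=p, program=code_fn(p)) for p in plan_fn(task, history)]
        if not candidates:
            break
        # 2. Critic scores each executable candidate; the best one is selected.
        for c in candidates:
            c.score = critic_fn(task, c.program, history)
        best = max(candidates, key=lambda c: c.score)
        # 3. The selected program is executed; the log becomes feedback for the next round.
        log = env.execute(best.program)
        history.append(log)
        if log.get("success"):
            return log
    return history[-1] if history else {"success": False}
```

In practice the three callables may be backed by a symbolic planner, an LLM code generator, and a learned value model, respectively.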
2. Symbolic Planning and Sample-Based Augmentation
The planning component typically encodes prior knowledge or high-level abstractions to constrain search and improve exploration efficiency:
- Formal Action Semantics: Domains are encoded using fluent and action constants, with static and dynamic causal laws. PACMAN writes dynamic laws in the form $a \;\textbf{causes}\; A_0 \;\textbf{if}\; A_1, \ldots, A_m$ and augments these with predicates that enable integration with probabilistic policy sampling (Lyu et al., 2019).
- Sample-Based Planning: Rather than produce plans in isolation, the planner incorporates the current stochastic policy $\pi_\theta$ to bias action selection: the domain's static and dynamic laws are combined with facts obtained by sampling $\pi_\theta$ at each step, and the composite program is translated into an answer set program, producing executable, symbolically guided plan candidates (a minimal sketch of this policy-biased candidate generation follows this list).
- Underlying Rationale: By leveraging symbolic plans, the agent achieves “jump-started” exploration, biasing trajectories toward semantically meaningful regions of the state space and reducing sample complexity.
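The sketch below illustrates the policy-biased, sample-based candidate generation described above, under the simplifying assumption that the causal laws are exposed through hypothetical `legal_actions`, `transition`, and `goal_reached` helpers rather than an answer set solver.

```python
import random
from typing import Callable, List, Sequence

def sample_plan_candidates(
    state,                                             # initial symbolic state
    legal_actions: Callable[[object], Sequence[str]],  # actions allowed by the causal laws
    transition: Callable[[object, str], object],       # effect of an action under the dynamic laws
    goal_reached: Callable[[object], bool],
    policy: Callable[[object], dict],                  # current stochastic policy: state -> {action: prob}
    horizon: int = 20,
    n_candidates: int = 8,
) -> List[List[str]]:
    """Generate plan candidates whose action choices are biased by the learned policy.

    Instead of searching the symbolic model in isolation, each step samples among the
    actions permitted by the causal laws, weighted by the policy's probabilities.
    """
    candidates = []
    for _ in range(n_candidates):
        s, plan = state, []
        for _ in range(horizon):
            allowed = list(legal_actions(s))
            if not allowed:
                break
            probs = policy(s)
            weights = [probs.get(a, 1e-6) for a in allowed]  # small floor keeps every legal action possible
            a = random.choices(allowed, weights=weights, k=1)[0]
            plan.append(a)
            s = transition(s, a)
            if goal_reached(s):
                candidates.append(plan)
                break
    return candidates
```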
3. Integration with Reinforcement Learning and Code-Based Policy Refinement
Execution of planned policies is refined through standard or customized reinforcement learning techniques:
- Actor–Critic Algorithms: The agent learns through both environmental rewards and (optionally) human-provided feedback:
- Policy update: $\theta \leftarrow \theta + \alpha_\theta\, \delta_t\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)$
- Value function update: $w \leftarrow w + \alpha_w\, \delta_t\, \nabla_w V_w(s_t)$
- Advantage/TD error: $\delta_t = r_t + \gamma V_w(s_{t+1}) - V_w(s_t)$
Optionally, $\delta_t$ is replaced with the human feedback signal $f_t$ when such feedback is available. A minimal code sketch of this update appears after this list.
- Code-Driven Policy Generation: LLMs or agent frameworks synthesize executable code (e.g., Python plans) that interact with the environment, update beliefs, and process feedback iteratively. Candidate plans are refined based on execution logs, assertion failures, or summarized histories, exemplified by AdaPlanner’s code-based prompting and CoPiC’s planning-program evolution with history summarization (Tian et al., 23 Sep 2025, Sun et al., 2023, Szeider, 10 Aug 2025).
- Dual Network Coordination: In layered control, a dual variable is learned to enforce consensus between planner outputs and tracker execution, with convergence guarantees established in the LQR setting; the resulting tracking error contracts exponentially (Yang et al., 3 Aug 2024).
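The following is a minimal tabular sketch of the actor–critic updates above, assuming a softmax policy over logits; when a human feedback signal is supplied, it replaces the TD error as the advantage in the actor update, mirroring the PACMAN-style override. The encodings and hyperparameters are illustrative, not any paper's implementation.

```python
from typing import Optional
import numpy as np

def actor_critic_step(
    theta: np.ndarray,   # policy logits, shape (n_states, n_actions)
    v: np.ndarray,       # state-value estimates, shape (n_states,)
    s: int, a: int, r: float, s_next: int,
    gamma: float = 0.99,
    alpha_theta: float = 0.01,
    alpha_w: float = 0.1,
    human_feedback: Optional[float] = None,  # f_t, if a human signal was given this step
):
    """One actor-critic update; human feedback, when present, replaces the TD error."""
    # TD error (advantage estimate) from environment reward and value bootstrap.
    delta = r + gamma * v[s_next] - v[s]
    # Human feedback overrides the TD error as the advantage in the actor update.
    advantage = human_feedback if human_feedback is not None else delta
    # Critic update: move V(s) toward the bootstrapped target (assumed to still use delta).
    v[s] += alpha_w * delta
    # Actor update: gradient of log softmax-policy, i.e. one_hot(a) - pi(.|s), scaled by the advantage.
    probs = np.exp(theta[s] - theta[s].max())
    probs /= probs.sum()
    grad_log_pi = -probs
    grad_log_pi[a] += 1.0
    theta[s] += alpha_theta * advantage * grad_log_pi
    return theta, v
```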
4. Feedback, Critique, and Robustness Mechanisms
- Human-Centered Feedback: The PACMAN architecture incorporates human feedback directly as an alternative advantage estimate, supporting robustness to infrequent or inconsistent supervised signals without destabilizing policy updates (Lyu et al., 2019, Lyu et al., 2019).
- Automatic Plan Critique and RLHF: In PlanCritic, plans generated by a symbolic planner are critiqued via an RLHF-trained reward model; a genetic algorithm mutates constraint sets, and the best candidates are evolved according to adherence to user preferences (Burns et al., 30 Nov 2024).
- Domain-Adaptive Critics: CoPiC employs a critic LLM that scores candidate planning programs normalized by length and interaction context, then softmax-selects plans most likely to yield high long-term reward. The critic is refined by RL based on end-of-episode environment returns, ensuring long-horizon efficacy even with sparse intermediate rewards (Tian et al., 23 Sep 2025).
- Code-Embedded Verification: Assertion-driven code (as in AdaPlanner) provides natural points for critique—failed assertions or error messages trigger plan refinement (either localized or global), minimizing silent failure and constraining hallucinations through structured code interfaces (Sun et al., 2023).
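A schematic of this assertion-driven critique pattern appears below: the generated plan is an ordinary function whose assertions act as checkpoints, and a wrapper converts assertion failures into feedback for the next refinement round. The `env.step` interface and `refine_fn` callback are hypothetical stand-ins, not AdaPlanner's actual API.

```python
from typing import Callable

def solution(env) -> bool:
    """Illustrative code-form plan with assertions as checkpoints."""
    obs = env.step("open drawer 1")
    assert "drawer 1 is open" in obs, f"checkpoint failed: expected open drawer, got: {obs}"
    obs = env.step("take key from drawer 1")
    assert "you pick up the key" in obs, f"checkpoint failed: key not acquired, got: {obs}"
    obs = env.step("unlock door with key")
    return "door is unlocked" in obs

def run_with_refinement(env, plan_fn: Callable, refine_fn: Callable, max_refinements: int = 3) -> bool:
    """Execute a code plan; on a failed checkpoint, pass the error back for plan refinement."""
    for _ in range(max_refinements):
        try:
            if plan_fn(env):
                return True
        except AssertionError as err:
            # The failed assertion localizes the error; the refiner may patch just that step
            # (local refinement) or regenerate the whole plan (global refinement).
            plan_fn = refine_fn(plan_fn, feedback=str(err))
    return False
```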
5. Empirical Results and Comparative Evaluation
A consistent empirical advantage is observed across diverse sequential decision-making and control benchmarks:
| Framework | Key Benchmark / Domain | Performance Gains over Baselines | Robustness / Sample Efficiency |
|---|---|---|---|
| PACMAN | Four Rooms, Taxi | Jump-starts, rapid convergence, low variance | Robust to infrequent/inconsistent human feedback |
| CoPiC | ALFWorld, NetHack, StarCraft II | 23.33% higher success rate, 91.27% lower query cost vs. AdaPlanner | High success rate at substantially reduced token/query cost |
| PlanCritic | Waterway disaster recovery | 75% success rate vs. 53% for an LLM-only neurosymbolic baseline | RLHF + genetic algorithm corrects 88% of initial plan errors |
| AdaPlanner | ALFWorld, MiniWoB++ | Outperforms baselines with 2x–600x fewer demonstrations | Code-style interface dramatically reduces hallucinations |
| Layered RL | Linear and nonlinear (unicycle) control | Near-iLQR cost, ≤0.02 mean tracking deviation | Explicit dual variable enforces tracking–planning consensus |
Key findings include:
- Jump-Start and Efficiency: Architectures leveraging symbolic or code-driven planning provide substantial jump-start at early learning stages, accelerate convergence, and decrease variance relative to canonical RL or naively reactive LLMs (Lyu et al., 2019, Lyu et al., 2019, Tian et al., 23 Sep 2025, Sun et al., 2023).
- Query/Compute Cost Minimization: By selecting candidate plans using a learned critic, approaches such as CoPiC achieve high success rates with up to 80–90% reductions in token or query cost compared to iterative LLM feedback baselines (Tian et al., 23 Sep 2025); a sketch of this critic-guided selection follows the list.
- Robustness to Feedback Quality: Symbolic constraints and plan verification (e.g., assertion-based checks, hard-coded action preconditions) act as guardrails, preventing policy drift under misleading or sparse feedback (Lyu et al., 2019, Tian et al., 23 Sep 2025).
- Correction and Adaptivity: Genetic search and domain-adaptive critics enable dynamic correction of plan errors, surpassing static LLM-generated plan approaches (Burns et al., 30 Nov 2024).
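Referring back to the critic-guided selection noted above, the sketch below shows one plausible form of CoPiC-style candidate selection: length-normalized critic scores feed a softmax that picks a single planning program to execute, avoiding repeated LLM feedback rounds. The `critic_score` function stands in for the learned critic; the normalization and temperature are illustrative assumptions.

```python
import math
import random
from typing import Callable, List

def select_program(
    programs: List[str],
    critic_score: Callable[[str, str], float],  # (context, program) -> scalar score from the learned critic
    context: str,
    temperature: float = 1.0,
) -> str:
    """Softmax-select one candidate planning program using length-normalized critic scores."""
    # Normalize by program length (approximated here by whitespace tokens) so longer
    # programs are not favored merely for accumulating more score.
    scores = [critic_score(context, p) / max(len(p.split()), 1) for p in programs]
    m = max(scores)
    exps = [math.exp((s - m) / temperature) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    return random.choices(programs, weights=probs, k=1)[0]
```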
6. Theoretical Guarantees and Implementation Considerations
- Convergence: PACMAN's iterative planning/RL loop and layered RL with explicit dual-variable learning are proven to converge under suitable conditions (e.g., monotonicity arguments in UPOM (Patra et al., 2020); exponential contraction of the dual variable in the LQR setting (Yang et al., 3 Aug 2024)).
- Mutual Reinforcement: The planner steers exploration toward high-value or goal-relevant states, while RL-derived policy updates recursively improve subsequent plan proposals; human or critic feedback further aligns outputs with external goals or preferences.
- Limitations: Symbolic planning requires precise domain models, and a mismatch between symbolic abstractions and low-level environment dynamics can impede exploration if constraints are too rigid. Integrating stochastic feedback with symbolic rules adds further complexity, especially in high-dimensional or perception-heavy settings (Lyu et al., 2019).
- Generalizability: Modular code-based and sample-driven planning schemas enable adaptive generalization to new tasks, provided sufficient coverage and robust critic adaptation.
7. Relationship to Broader Agentic Methodologies
The Planner–Coder–Critic pipeline represents a convergence of symbolic AI (planning, logic programming), sample-based RL, LLM-powered code generation, and feedback-driven or RLHF-based critique. It stands in contrast to strictly end-to-end or black-box RL approaches by explicitly leveraging structured priors and modular verification. The loop architecture enables bidirectional transfer of information across planning, learned policy, and external feedback, bridging the gap between classical AI and modern learning-based agentic systems.
This paradigm increasingly informs state-of-the-art agentic designs in sequential decision-making, constraint programming, robotic control, and language-guided planning, serving as an extensible scaffold for integrating reasoning, execution, and adaptive critique.