
SGA-ACR: Subgoal Graph-Augmented Actor-Critic-Refiner

Updated 3 December 2025
  • The paper introduces SGA-ACR, a framework that overcomes LLM-plan alignment challenges through subgoal graph structuring, multi-agent separation, and real-time tracking.
  • It details a multi-stage planning pipeline where actor, critic, and refiner LLMs collaboratively generate, evaluate, and refine subgoal sequences for robust RL policy execution.
  • Experimental results on the Crafter environment demonstrate superior per-task success rates and effective long-horizon planning compared to state-of-the-art baselines.

Subgoal Graph-Augmented Actor-Critic-Refiner (SGA-ACR) is a framework designed to overcome the challenge of aligning high-level LLM-generated task plans with low-level, environment-executable behaviors in open-world reinforcement learning (RL) settings. SGA-ACR achieves this through environment-specific subgoal graph structuring, multi-LLM planning separation, and real-time subgoal tracking, enabling robust long-horizon planning and execution in complex RL domains such as Crafter (Fan, 26 Nov 2025).

1. Architectural Components and Knowledge Construction

SGA-ACR is constructed from three core components:

  • Subgoal Graph ($G$) and Entity Knowledge Base ($K$): Offline, the environment documentation ("Env_paper") and source code ("Env_code") are processed using LLM extraction to generate a directed subgoal graph $G=(V,E)$. Each node $v \in V$ models a subgoal with an associated name, description, preconditions, and postconditions. Directed edges encode prerequisite (AND/OR) relations, compositionality, and feasible action sequences. The entity knowledge base $K$ captures structured representations (name, type, description, associated subgoals) for each observable object, providing granular context for planning and execution.
  • Multi-LLM Actor–Critic–Refiner Pipeline: Instead of monolithic LLM planning, the pipeline separates roles across distinct LLM agent instances:

    1. Actor LLM generates $k$ candidate plans, each a sequence of three subgoals from $V$ with its own rationale, conditioned on textified observations, a verbalized subgoal graph, and currently observed entities.
    2. Critic LLM receives detailed subgoal information, ranks candidate plans via explicit feedback, and outputs a "Need_Modify" flag.
    3. Refiner LLM uses this critique to revise or select among plans, yielding a final executable sequence.
  • Subgoal Tracker: During online interaction, the tracker extracts object state deltas from observations, monitors achievement of each subgoal, provides auxiliary rewards, and updates graph success statistics.

Knowledge construction leverages LLM prompting to extract subgoal schemas and entity definitions, assigning pre/postconditions and forming the graph structure. This environment-specific structuring is critical for robust grounding and verification.
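For concreteness, the following is a minimal sketch of how the extracted knowledge could be represented in code; the dataclass layout and field names are illustrative assumptions, not the paper's schema.

```python
# Illustrative representation of the subgoal graph and entity knowledge base.
# Field names are assumptions; the paper does not prescribe a concrete schema.
from dataclasses import dataclass, field
from typing import Literal

@dataclass
class Subgoal:
    name: str                       # e.g. "collect_wood" (hypothetical identifier)
    description: str
    preconditions: list[str]        # conditions that must hold before attempting
    postconditions: list[str]       # observable effects checked by the tracker
    planned_count: int = 0          # N^p_v: times the subgoal appeared in a plan
    achieved_count: int = 0         # N^a_v: times it was actually achieved

    @property
    def success_rate(self) -> float:
        """omega(v) = N^a_v / N^p_v, injected into planning prompts."""
        return self.achieved_count / self.planned_count if self.planned_count else 0.0

@dataclass
class SubgoalGraph:
    nodes: dict[str, Subgoal] = field(default_factory=dict)
    # Directed prerequisite edges, each tagged AND/OR to distinguish conjunctive
    # from alternative prerequisites.
    edges: list[tuple[str, str, Literal["AND", "OR"]]] = field(default_factory=list)

@dataclass
class Entity:
    name: str
    type: str
    description: str
    associated_subgoals: list[str]  # subgoals this entity participates in

EntityKB = dict[str, Entity]
```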

2. Formal Problem Formulation

SGA-ACR models the RL environment as a partially observable Markov decision process (POMDP), extended with a finite set of learned subgoals $\mathcal{G}=V$:

  • State, Action, Observation: Standard POMDP tuple $(S, A, P, \Omega, O, R, \gamma)$, with partial observation $o_t$.
  • Goal-Conditioned Policy: The RL policy $\pi_\theta(a \mid o, g)$ is conditioned on a plan $g \in \mathcal{G}^3$, i.e., a list of three subgoals.
  • Reward Signals: The reward decomposes into the environment reward $r_t$ and an auxiliary subgoal reward $r'_t$:
    • $r_t = R(s_t, a_t)$ (e.g., $+1$ per new achievement, $\pm 0.1$ per health fluctuation)
    • $r'_t$ grants $+\alpha$ on first-time subgoal achievement: $r'_t = \sum_{i=1}^{n} \alpha \cdot \mathbb{1}\{\text{sg}_i \text{ achieved for the first time at } t\}$
    • Total discounted reward: $R_t^{\mathrm{total}} = \sum_{k=0}^{T} \gamma^k \left( r_{t+k} + r'_{t+k} \right)$
  • Curriculum via Graph Statistics: Each subgoal $v$ tracks a success rate $\omega(v) = N^a_v / N^p_v$ (achieved vs. planned count), which is injected into prompt contexts to bias LLM planning toward reliable solution paths.
  • RL Optimization: Proximal Policy Optimization (PPO) is applied, with the clipped surrogate loss

$$L^{\text{CLIP}}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\,A_t,\ \operatorname{clip}\left(r_t(\theta),\,1-\epsilon,\,1+\epsilon\right) A_t\right)\right]$$

where $r_t(\theta)$ denotes the policy probability ratio and $A_t = Q_\phi(s_t, a_t, g) - V_\phi(s_t, g)$. The critic network parameters $(Q_\phi, V_\phi)$ are trained by mean squared error on empirical returns.
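The following is a compact sketch of these two ingredients, the clipped surrogate for a goal-conditioned policy and the shaped return over $r_t + r'_t$; the policy interface (policy.log_prob) and hyperparameter values are assumptions for illustration, not the paper's implementation.

```python
# Minimal sketch of the goal-conditioned clipped PPO pieces described above.
# policy.log_prob and the hyperparameter values are illustrative assumptions.
import torch

def ppo_clip_loss(policy, old_log_probs, obs, goal_emb, actions, advantages, eps=0.2):
    """Clipped surrogate L^CLIP for pi_theta(a | o, g); advantages A_t = Q - V."""
    new_log_probs = policy.log_prob(obs, goal_emb, actions)   # log pi_theta(a_t | o_t, g)
    ratio = torch.exp(new_log_probs - old_log_probs)          # r_t(theta), the probability ratio
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()              # negate: we maximize the objective

def shaped_returns(env_rewards, aux_rewards, gamma=0.99):
    """Discounted return over the shaped reward r_t + r'_t, accumulated backwards."""
    returns, running = [], 0.0
    for r, r_aux in zip(reversed(env_rewards), reversed(aux_rewards)):
        running = (r + r_aux) + gamma * running
        returns.append(running)
    return torch.tensor(list(reversed(returns)))
```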

3. Online Planning, Subgoal Tracking, and Execution

SGA-ACR employs a structured planning loop at fixed intervals $H$ (or when the current subgoal plan is exhausted):

  1. Plan Generation:
    • Observations are textified into $o_{\text{text}}$; observed entities $E_{\text{obs}}$ are detected and mapped to entries in $K$.
    • The actor LLM proposes $k$ plans $\{(p_i, q_i)\}_{i=1}^{k}$, each a three-subgoal plan $p_i$ paired with its rationale $q_i$.
    • The critic LLM aggregates retrieved graph details $G(v)$, delivers feedback $\{f_i\}$, ranks the plans, and raises the "Need_Modify" flag if plan inadequacies are detected.
    • If refinement is required, the refiner LLM integrates the critiques to produce $p_{\text{final}}$.
  2. Integration with RL Agent:
    • The plan $p_{\text{final}}$ is embedded and provided to the goal-conditioned PPO policy for action generation.
    • Data $(o_t, p, a_t, r_t + r'_t, o_{t+1})$ is stored in a replay buffer $B$ for periodic PPO updates.
  3. Subgoal Achievement and Graph Update:
    • The subgoal tracker monitors object transitions to match postconditions, issues the auxiliary reward, and updates the planned/achieved counters $(N^p_v, N^a_v)$.
    • Success rates $\omega(v)$ directly inform future planning bias.

Pseudocode for these procedures is specified in the source, defining deterministic interactions and stateless modular interfaces.
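The Python sketch below mirrors the structure of that loop; the LLM wrapper methods (propose, evaluate, refine), the textify/detect_entities/embed helpers, and the tracker interface are assumed names for illustration rather than the paper's actual API.

```python
# Illustrative planning-and-execution loop; all wrapper methods and helpers
# are assumed interfaces, not the paper's actual API.
def planning_step(obs, graph, entity_kb, actor_llm, critic_llm, refiner_llm, k=3):
    o_text = textify(obs)                                   # natural-language observation
    entities = [entity_kb[e] for e in detect_entities(obs) if e in entity_kb]

    # 1. Actor proposes k candidate plans (three subgoals each) with rationales.
    candidates = actor_llm.propose(o_text, graph.verbalize(), entities, k=k)

    # 2. Critic ranks candidates against retrieved subgoal details and raises
    #    the Need_Modify flag if none is adequate as proposed.
    feedback, ranking, need_modify = critic_llm.evaluate(candidates, graph)

    # 3. Refiner revises or selects among the plans to yield the final sequence.
    plan = refiner_llm.refine(candidates, feedback) if need_modify \
        else candidates[ranking[0]].subgoals

    for sg in plan:                                         # update N^p_v counters
        graph.nodes[sg].planned_count += 1
    return plan

def rollout(env, policy, tracker, graph, entity_kb, llms, buffer, H=50):
    obs, plan, t, done = env.reset(), None, 0, False
    while not done:
        if plan is None or t % H == 0 or tracker.plan_exhausted(plan):
            plan = planning_step(obs, graph, entity_kb, *llms)
        action = policy.act(obs, embed(plan))               # goal-conditioned PPO policy
        next_obs, r_env, done, _ = env.step(action)
        # Tracker matches postconditions, issues r'_t, and updates N^a_v.
        r_aux = tracker.update(obs, next_obs, plan, graph)
        buffer.add(obs, plan, action, r_env + r_aux, next_obs)
        obs, t = next_obs, t + 1
```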

4. Experimental Protocols and Key Results

Evaluation is conducted on the open-world Crafter environment (22 achievements, $64 \times 64$ grid, $9 \times 7$ field of view). Key metrics include environment reward, per-task success rates, and a score defined as:

$$\text{Score} = \exp\left(\frac{1}{22}\sum_{i=1}^{22}\ln(1+s_i)\right) - 1$$

where $s_i$ is the per-task success rate.
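For reference, the helper below computes this score, assuming per-task success rates are expressed in percent (the usual Crafter convention, consistent with the reported score values).

```python
# Crafter-style score: geometric-mean-style aggregate of (1 + s_i), minus one.
# Assumes the per-task success rates s_i are given in percent.
import math

def crafter_score(success_rates_percent: list[float]) -> float:
    n = len(success_rates_percent)            # 22 achievements in Crafter
    log_sum = sum(math.log(1.0 + s) for s in success_rates_percent)
    return math.exp(log_sum / n) - 1.0

# Example: solving half the tasks 30% of the time and the rest never.
rates = [30.0] * 11 + [0.0] * 11
print(f"Score: {crafter_score(rates):.1f}%")  # ~4.6%
```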

Empirical benchmarks demonstrate:

Method              Score @1M steps    Reward @1M steps    Score @5M steps
SGA-ACR             17.6% ± 1.2        11.1 ± 1.3          29.6%
AdaRefiner          14.3%              10.1                26.8%
Causal-aware LLM    13.2%              10.3                27.7%
PPO (ResNet)        14.9%              10.2                n/a

SGA-ACR achieves the highest per-task success rate in 20/22 tasks at 1M steps and uniquely unlocks iron tools within this regime. Scalability is demonstrated for LLM backbones from 8B up to 235B parameters, with baselines improving only at larger scales (Fan, 26 Nov 2025).

5. Ablation Analyses and Insights

Ablation experiments indicate the necessity of both architectural components and structured knowledge:

  • Multi-LLM Separation: Removal of the critic (13.9%) or refiner (15.3%) substantially reduces overall score. Random plan selection and actor-only planning also underperform. The explicit critique and refinement cycle, governed by a "Need_Modify" flag, suppresses overconfident or over-refined planning artifacts.
  • Environment Grounding: The subgoal graph and entity KB are essential; omitting both reduces the score to 13.1%. Structured graph reasoning yields explicit, verifiable subgoal chains, improving interpretability and reliability over an unstructured text-RAG approach (15.8%).

These results underscore the role of environment-specific structuring and multi-level plan verification for bridging high-level LLM reasoning and actionable policies.

6. Comparison with Related Approaches

SGA-ACR builds on recent work in LLM-guided RL planning, including AdaRefiner, Causal-aware LLM planning, SPRING, Reflexion, and ReAct baselines, as well as non-LLM approaches such as PPO, Rainbow, and DreamerV3. It distinguishes itself by:

  • Explicitly separating LLM planning stages, countering overconfident and ungrounded plan proposals.
  • Integrating subgoal graph-induced curriculum to promote reliable skill acquisition by tracking $\omega(v)$ online.
  • Bridging environment-level abstraction and policy execution via structured knowledge extraction and plan tracking.

The framework provides a scalable, principled solution for long-horizon task decomposition and execution alignment in open-world RL, delivering gains that model scaling alone does not achieve.

7. Significance and Future Directions

SGA-ACR advances the alignment of LLM-generated planning with RL execution in complex environments by leveraging graph-structured curriculum, multi-agent LLM separation, and real-time achievement tracking. This suggests new research directions in hierarchical RL, grounding, and automated knowledge extraction for broader classes of goal-conditioned agents and environments. A plausible implication is that LLM-based planning for RL will increasingly benefit from modular, structured, and adaptive components that bridge abstract reasoning and direct physical interaction, particularly as open-world benchmarks increase in complexity and diversity (Fan, 26 Nov 2025).
