Beyond Trajectory-Level Attribution: Graph-Based Credit Assignment for Agentic Reinforcement Learning

Published 26 May 2026 in cs.LG and cs.AI | (2605.26684v1)

Abstract: Group-based reinforcement learning (RL) methods have achieved remarkable success in improving the performance of LLMs and have been rapidly extended to agentic tasks. However, their credit assignment relies heavily on coarse-grained trajectory-level attribution according to final outcomes, making it difficult to capture the contribution of individual steps, such as valuable steps obscured within failed trajectories. To uncover latent information and enable more faithful step-level credit assignment, we propose Graph-based Group Policy Optimization (GraphGPO), which first aggregates all rollout trajectories into a unified state-transition graph and then estimates the distance from each state to the task goal using the global information encoded in the graph. Finally, GraphGPO assigns credit to each edge by estimating a graph-based advantage, based on how much the transition reduces the distance to the task goal. In this way, GraphGPO significantly improves training efficiency and achieves state-of-the-art performance across a range of challenging benchmarks.

Abstract PDF Upgrade to Chat

Authors (7)

Summary

The paper introduces GraphGPO, a novel method that aggregates rollout trajectories into a unified state-transition graph for precise step-level credit assignment.
It leverages a graph-based distance metric to compute rewards, ensuring actions that reduce the shortest-path distance to the goal are appropriately incentivized.
Empirical evaluations on ALFWorld, WebShop, and Sokoban demonstrate significant success rate gains and robust learning efficiency with negligible computational overhead.

Graph-Based Credit Assignment for Agentic RL: Formal Summary

Motivation and Problem Statement

Agentic tasks solved by LLMs in complex, multi-turn environments require precise credit assignment for effective RL optimization. Traditional group-based RL frameworks, such as GRPO, rely on trajectory-level outcome attribution, granting uniform positive credit to all steps in successful trajectories and penalizing all steps in failed ones. This coarse mechanism fails to distinguish steps genuinely advancing toward the goal and overlooks progress hidden within failed trajectories. Empirical evidence shows that 22.0% of steps in failed trajectories actually contribute to task progress in ALFWorld, while 65.3% of steps in successful trajectories are non-progress steps (Figure 1).

Figure 1: Trajectory-level credit assignment misaligns step-level progress statistics in multi-turn ALFWorld tasks, with substantial progress obscured in failed trajectories and redundant actions rewarded in successful ones.

These findings motivate a finer-grained approach that transcends trajectory-level attribution to capture the true contribution of individual steps, especially in sparse-reward environments and long-horizon agentic settings.

Methodological Advances: GraphGPO

Aggregated State-Transition Graph Construction

GraphGPO introduces a paradigm shift by synthesizing all rollout trajectories into a unified directed state-transition graph. Each state encountered across all trajectories is mapped to a node, and each action-induced transition forms a directed edge with an associated cost. Identical states across different trajectories are merged into single nodes, facilitating the aggregation of global experiential information. The terminal states include both success (goal completion) and predefined failure (over-length or dead-end).

Figure 2: State-transition graph aggregates trajectory rollouts, merging identical states and encoding transitions across paths; credit is assigned via shortest-path distance to goal.

Graph-Based Step-Level Credit Assignment

For any state, the minimum-cost path to the goal is computed—yielding a distance metric $d(s)$ . Step-level rewards are defined as:

$R^G(s, \bm{a}, s') = r_{\text{succ}} \cdot \omega^{d(s') + c(s, \bm{a})}$

where $r_{\text{succ}}$ is the scalar reward for success, and $\omega$ is a discount factor. States empirically unable to reach the goal are penalized by substituting $d_{\max} + 1$ for $+\infty$ . Each outbound edge from a state is grouped, facilitating per-state group-based advantage estimation:

$A^G(s, \bm{a}, s') = \frac{R^G(s, \bm{a}, s') - \mu(G^G(s))}{\sigma(G^G(s))}$

where $\mu$ and $\sigma$ are group-level statistics. This advantage function is strictly monotonic: actions that move closer to the goal receive higher credit regardless of their trajectory outcome.

Policy Optimization Objective

To ensure robustness where graph-based credit could degenerate (e.g., singleton transitions), the final advantage used for policy optimization is a weighted sum of graph-based and episode-level advantage. Policy update follows a clipped importance-sampling formulation with a KL-divergence penalty to maintain reference policy proximity.

Empirical Results

GraphGPO is evaluated on challenging multi-turn agentic benchmarks—ALFWorld, WebShop, and Sokoban—using Qwen2.5-1.5B-Instruct and Qwen2.5-7B-Instruct as base LLMs, and Qwen2.5-VL-3B-Instruct for VLM scenarios. Results are benchmarked against competitive baselines: GRPO, GiGPO, PPO, prompting-based (ReAct, Reflexion), and closed-source LLMs (GPT-4o, Gemini-2.5-Pro).

GraphGPO achieves average success rate gains of 14.85% and 11.98% over GRPO for Qwen2.5-1.5B and Qwen2.5-7B-Instruct respectively on ALFWorld.
On WebShop, GraphGPO improves success rate by 7.30% and 5.31% over GRPO.
In Sokoban, deterministic vision-language settings, GraphGPO surpasses GRPO by 19.88% and GiGPO by 10.06%.

Training dynamics consistently show faster early-stage improvement, superior convergence, and persistent lead of GraphGPO versus baselines, attributable to globally-informed step credit signals.

Figure 3: GraphGPO training curves across ALFWorld, WebShop and Sokoban display improved convergence and sustained advantage over GiGPO and GRPO.

Validation success rates reinforce these training trends, and ablation studies demonstrate that step-level graph-based advantages ( $A^G$ ) deliver higher fidelity supervision than GiGPO's trajectory-based step grouping, especially when combined with episode-level advantage.

Figure 4: Validation episode success rates versus training steps show robust generalization and efficiency of GraphGPO’s credit assignment.

Computational Efficiency

Despite the global aggregation and shortest-path computations, GraphGPO incurs negligible time and memory overhead. The graph is maintained via hash table structures, and Dijkstra-based distance calculations are dominated by rollout sampling and policy update costs ( $R^G(s, \bm{a}, s') = r_{\text{succ}} \cdot \omega^{d(s') + c(s, \bm{a})}$ 0 of total iteration runtime). Thus, GraphGPO retains the critic-free, low-memory, stable convergence characteristics of group-based RL.

Figure 5: Per-iteration runtime breakdown demonstrates minimal computation overhead for graph construction and advantage calculation in GraphGPO.

Theoretical Implications

GraphGPO demonstrates two key theoretical properties:

Monotonicity: Actions reducing shortest-path distance to the goal always receive higher advantage, controlling for episode-level outcome.
Conditional variance reduction: Graph-based step-level feedback exhibits strictly lower conditional variance compared to episode-attributed step feedback (GiGPO/GRPO), ensuring higher signal-to-noise ratio for RL gradients and more efficient learning.

Practical Implications and Prospects

GraphGPO establishes a scalable, critic-free RL framework for LLM-powered agents that faithfully uncovers latent progress within failed trajectories, penalizes redundancy, and leverages global rollout structure for rapid policy improvement. The formulation is robust to observation noise (with embedding-based state matching available), and flexible to rollout group size. As LLM agents transition to increasingly complex, high-dimensional, and real-world settings, graph-based credit assignment will serve as a foundational principle for learning in sparse-reward, long-horizon agentic domains.

Further avenues include hierarchical graph aggregation, dynamic group sampling, and extension to multi-agent or tool-use RL paradigms, potentially advancing autonomous reasoning and planning for future generalist models.

Conclusion

GraphGPO enables fine-grained, globally-informed, step-level credit assignment by aggregating episode rollouts into a state-transition graph and assigning rewards based on progress toward the goal. This approach outperforms trajectory-level baselines across diverse agentic tasks, improves learning efficiency, and retains operational scalability. It delivers a principled mechanism for RL optimization in LLM-powered agents, complementing and extending existing group-based RL techniques for long-horizon environments (2605.26684).

Markdown Report Issue