
StarPO: RL Framework for LLM Agents

Updated 9 November 2025
  • StarPO Framework is a reinforcement learning pipeline that formalizes agent training as a trajectory-level MDP to optimize sequential decision-making.
  • It integrates autoregressive LLMs with modular components like environment interfaces, policy networks, and explicit reward shaping for structured outputs.
  • StarPO-S enhances stability by using uncertainty-based trajectory filtering, critic incorporation, and asymmetric gradient clipping to improve agent reasoning in complex tasks.

The StarPO (State–Thinking–Actions–Reward Policy Optimization) framework is a general trajectory-level reinforcement learning (RL) methodology for training LLMs as interactive, multi-turn agents. Introduced and analyzed within the RAGEN modular agent system, StarPO directly addresses the challenges of training LLMs for sequential decision-making in bandit, combinatorial, and symbolic gridworld settings, with a particular focus on the emergence and stabilization of agent reasoning (Wang et al., 24 Apr 2025). The framework includes a stabilized variant, StarPO-S, which incorporates trajectory filtering based on outcome uncertainty, critic incorporation, and gradient stabilization mechanisms to address instability phenomena (notably, the "Echo Trap" regime). StarPO provides a formalization and practical RL pipeline for evolving LLM agents beyond imitation learning and static language modeling.

1. Formal Structure and Objective

StarPO formalizes the agent training problem as a trajectory-level Markov Decision Process (MDP), $\mathcal{M} = (S, A, P)$, in which:

  • $S$ is the state space, comprising textual observations and the agent’s interaction history $\tau_{<t}$
  • $A$ is the action space, consisting of token sequences parsed into structured reasoning traces ($a_t^T$) and atomic environment commands ($a_t$)
  • $P$ defines the environment transition function $P(s_{t+1}, r_t \mid s_t, a_t)$, with reward shaping and state rendering possibly mediated through text

At every time step, the LLM agent observes $s_t$ and produces a structured output of the form:

```
<think>...thought_t...</think><answer>a_t</answer>
```

where the answer $a_t$ is executed in the environment. The agent optimizes the expected total return over a rollout horizon,

$J_{\mathrm{StarPO}}(\theta) = \mathbb{E}_{\tau \sim \pi_\theta(\cdot \mid \mathcal{M})}\left[ R(\tau) \right], \qquad R(\tau) = \sum_{t=0}^{K-1} r_t$

where $\pi_\theta$ is the parameterized policy (the LLM).
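
For concreteness, the sketch below shows one way the structured output could be parsed into a reasoning trace and an atomic action; the helper name and regex are assumptions for illustration, not part of the released RAGEN code.

```python
import re

# Hypothetical helper: extract the reasoning trace and the atomic action
# from the <think>...</think><answer>...</answer> output format.
THINK_ANSWER = re.compile(
    r"<think>(?P<think>.*?)</think>\s*<answer>(?P<answer>.*?)</answer>",
    re.DOTALL,
)

def parse_structured_output(text: str):
    """Return (thought, action) if the output is well formed, else None.

    A None result is the kind of event that would trigger the format-based
    penalty described later (an assumption about how the penalty is wired).
    """
    match = THINK_ANSWER.search(text)
    if match is None:
        return None
    return match.group("think").strip(), match.group("answer").strip()

# Example: parse_structured_output("<think>push box left</think><answer>Left</answer>")
# returns ("push box left", "Left").
```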

Policy improvement is performed via Proximal Policy Optimization (PPO) or critic-free Group Relative Policy Optimization (GRPO). Both use token-level objectives over full trajectories, with advantages computed either via GAE with a critic or from normalized returns in the critic-free regime.
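
To make the two advantage estimators concrete, the following sketch normalizes trajectory returns within the group of $N$ rollouts that share a prompt (critic-free case) and, alternatively, computes GAE from per-step rewards and critic values; the exact estimators in the paper may differ in detail.

```python
import numpy as np

def group_normalized_advantages(returns: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Critic-free, GRPO-style advantages for N rollouts of the same prompt.

    returns: shape (N,), total return R(tau_i) of each rollout. The normalized
    return is broadcast to every token of trajectory i as its advantage.
    """
    return (returns - returns.mean()) / (returns.std() + eps)

def gae_advantages(rewards, values, gamma: float = 1.0, lam: float = 1.0):
    """Critic-based alternative: Generalized Advantage Estimation over one
    trajectory, using the reported gamma = lambda = 1.0."""
    adv, running = np.zeros(len(rewards)), 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        adv[t] = running
    return adv
```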

2. System Architecture and Components

The StarPO pipeline is instantiated within the RAGEN agent framework, with the following core components:

  • Environment Interface: Standardizes access to symbolic and simulated environments (Bandit, Sokoban, Frozen Lake), providing textual state prompts and interpretable rendering for LLM consumption.
  • Policy Network (Actor): An autoregressive LLM (e.g., Qwen-2.5, 0.5B parameters) operating on text sequences, tasked with generating the structured "thought-action" output and providing the $\pi_\theta$ token probabilities.
  • Critic Head (optional, for PPO): Predicts state values $V_\phi(s_t)$, updated by temporal-difference error.
  • Rollout Manager: Generates multiple full-episode trajectories from a batch of $P$ prompts and $N$ rollouts per prompt, supporting diverse initializations and fresh sampling (Online-$k$).
  • Reward Shaper: Encodes environment-specific rewards (e.g., Sokoban: +10 on success, –0.1/step), as well as format-based penalties for missing required thinking/action structure.
  • Replay Buffer: (optional in on-policy RL) Buffer for storing and sampling past trajectories.
  • Optimizer: Supports PPO (with GAE, clipping, KL penalty) and GRPO; uses Adam with $\beta_1 = 0.9$, $\beta_2 = 0.999$, and configurable entropy regularization.

Hardware configuration in empirical studies included use of NVIDIA H100/A100 GPUs, FSDP sharding, and vLLM sampling.
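
A minimal rollout-loop sketch showing how these components could interact during data collection is given below; the `env`, `llm`, `shape_reward`, and `parse_fn` objects and their method names are assumptions for illustration, not the actual RAGEN API.

```python
def collect_trajectory(env, llm, shape_reward, parse_fn, max_turns: int = 5):
    """Roll out one full episode and return (trajectory, total_return).

    parse_fn is a structured-output parser such as the parse_structured_output
    helper sketched in Section 1; ending the episode on a malformed output is
    an assumption for this sketch.
    """
    trajectory, total_return = [], 0.0
    state_text = env.reset()                  # textual state prompt
    for _ in range(max_turns):
        output = llm.generate(state_text)     # "<think>...</think><answer>...</answer>"
        parsed = parse_fn(output)
        if parsed is None:                    # malformed output: apply format penalty, stop
            penalty = shape_reward(None)
            trajectory.append((state_text, output, penalty))
            total_return += penalty
            break
        _thought, action = parsed
        next_state, env_reward, done = env.step(action)
        reward = shape_reward(env_reward)     # e.g., Sokoban: +10 on success, -0.1 per step
        trajectory.append((state_text, output, reward))
        total_return += reward
        state_text = next_state
        if done:
            break
    return trajectory, total_return
```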

3. StarPO-S: Stabilization via Trajectory Filtering and Gradient Control

StarPO-S extends StarPO to address observed instability under agent RL, especially "Echo Trap" phenomena, where reward variance collapses and gradients spike irreversibly. The StarPO-S variant introduces:

  • Uncertainty-Based Trajectory Filtering: For each prompt, compute the standard deviation of returns across its $N$ rollouts, $U(\pi_\theta, s_0) = \mathrm{Std}_{\tau \sim \pi_\theta(\cdot \mid s_0)}[R(\tau)]$ (Equation 10). Only the top $p\%$ of prompts with the highest uncertainty are retained in each update to maintain an informative gradient signal (default $p = 25\%$).
  • Critic Incorporation: In PPO-S, use $A_{i,t} = R(\tau_i) - V(\tau_i)$ to decorrelate advantage estimates and further reduce estimator variance.
  • Gradient Stabilization:
    • KL Removal: The $D_{\mathrm{KL}}(\pi_\theta \,\|\, \pi_{\mathrm{old}})$ penalty is omitted, relying solely on clipping and entropy regularization.
    • Asymmetric Clipping: The probability ratio $\rho_{i,t}$ is clipped asymmetrically, e.g. to $[1 - \varepsilon_{\text{low}}, 1 + \varepsilon_{\text{high}}]$ with $\varepsilon_{\text{low}} = 0.2$, $\varepsilon_{\text{high}} = 0.28$.

Empirical results demonstrate that uncertainty filtering and asymmetric clipping are effective at delaying or preventing collapse, raising final task success rates from ~20–30% to ~35–50% in complex environments.
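
The two mechanisms can be sketched as follows, assuming returns have already been grouped per prompt; `keep_fraction` corresponds to the default $p = 25\%$, and the function and argument names are illustrative.

```python
import numpy as np
import torch

def filter_by_uncertainty(prompt_returns: dict, keep_fraction: float = 0.25):
    """Keep the top-p% highest-uncertainty prompts (StarPO-S trajectory filtering).

    prompt_returns maps a prompt id to the array of returns R(tau) of its N
    rollouts; U(pi_theta, s_0) is the standard deviation of those returns.
    """
    uncertainty = {p: float(np.std(r)) for p, r in prompt_returns.items()}
    k = max(1, int(len(uncertainty) * keep_fraction))
    ranked = sorted(uncertainty, key=uncertainty.get, reverse=True)
    return ranked[:k]

def asymmetric_clip_loss(ratio, advantage, eps_low: float = 0.2, eps_high: float = 0.28):
    """PPO surrogate with asymmetric clipping of the probability ratio rho."""
    clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high)
    return -torch.minimum(ratio * advantage, clipped * advantage).mean()
```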

4. Learning Dynamics and Rollout Configuration

Effective application of StarPO relies on robust rollout and sampling configurations:

  • Diverse Initial States: Training batches draw $P$ distinct prompts (e.g., $P = 8$ in the main experiments), each repeated for $N$ trajectories ($N = 16$). Best generalization is observed for moderate $N$ (4–8) per prompt.
  • Interaction Granularity: Allowing up to $A_{\max} = 5$–$6$ actions per turn optimizes generalization and performance in Sokoban and similar environments.
  • Sampling Frequency: Frequent, on-policy rollouts (Online-1) converge faster and avoid overfitting to stale trajectories.
  • Reward Shaping: Explicit negative rewards for missing the required structured output (<think>…</think>, <answer>…</answer>) discourage hallucinated or incomplete outputs; more detailed reward shaping for reasoning traces is an open direction.

A plausible implication is that diverse initialization and prompt-level uncertainty filtering are jointly necessary for escaping reward and gradient collapse, a recurring pathology in multi-turn agent RL.
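
These settings can be collected into a single rollout configuration; the sketch below uses the values reported above, but the class and field names are assumptions rather than the framework's actual configuration keys.

```python
from dataclasses import dataclass

@dataclass
class RolloutConfig:
    """Illustrative rollout/sampling settings mirroring the values reported above."""
    num_prompts: int = 8            # P distinct initial states per training batch
    rollouts_per_prompt: int = 16   # N trajectories per prompt (moderate N of 4-8 generalizes best)
    max_actions_per_turn: int = 5   # A_max; 5-6 reported as the sweet spot in Sokoban
    rollout_interval: int = 1       # Online-1: fresh on-policy rollouts at every update
```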

5. Empirical Evaluation and Analysis

Experiments in (Wang et al., 24 Apr 2025) cover bandit, deterministic, and stochastic environments of varying complexity. Key findings include:

  • StarPO Baseline: Early learning gains are followed by regime collapse (Echo Trap), detectable as a cliff in reward standard deviation and gradient-norm spikes. PPO exhibits improved stability relative to GRPO except in Frozen Lake.
  • StarPO-S Improvements: Uncertainty filtering (e.g., filtering out more than 50% of prompts) delays collapse beyond 200 updates or eliminates it entirely. Removing the KL penalty and applying asymmetric clipping further improve performance and robustness.
  • Reasoning Trace Effect: Enforcing the presence of reasoning traces (<think> blocks) improves symbolic generalization in bandit-like domains (81.3% to 100% success) but has marginal impact on multi-step planning tasks (Sokoban: 19–21% success, with reasoning length decaying over time). Without a reward signal targeted at reasoning, models default to shallow or hallucinated strategies.
  • Reward Shaping Deficiency: The emergence of deep reasoning is limited without fine-grained, reasoning-aware reward functions.
  • Learning Modality Gap: Supervised fine-tuning with BFS-generated demonstrations achieves much higher rates of success (e.g., 74.6% on Sokoban) than self-evolving RL agents with StarPO-S (20.3%), emphasizing the gap between imitation and self-play RL for reasoning behavior.

A plausible implication is that, for stable, robust agent evolution in LLMs, both trajectory uncertainty management and explicit reward signals for reasoning steps are essential.

6. Implementation Considerations and Scaling Behavior

  • Model: Qwen-2.5 (0.5B) main LLM; parameter-efficient LoRA variant is supported (rank 64, $\alpha = 64$).
  • Batching: $P \times N = 128$ trajectories per training iteration; mini-batch size $E = 32$.
  • RL Hyperparameters: $\gamma = 1.0$, $\lambda = 1.0$, entropy regularizer $\beta = 0.001$, StarPO KL coefficient $0.001$ (removed in StarPO-S).
  • Hardware: Experiments utilized H100/A100 GPUs, distributed with FSDP and vLLM prefill/sample modules.
  • Memory and Optimization: LoRA yields ~50% GPU memory savings at parity performance.

Resource requirements remain moderate for ~0.5B-parameter models, but the frequency of full-length, diverse rollouts may limit scalability at larger model or prompt sizes unless further parallelized.
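
For reference, the reported optimization settings can be gathered into a single configuration sketch; the key names are illustrative, not the framework's actual configuration schema.

```python
# Reported RL/optimization hyperparameters collected in one place (illustrative keys).
STARPO_HPARAMS = {
    "optimizer": "Adam",
    "adam_betas": (0.9, 0.999),
    "gamma": 1.0,                        # discount factor
    "gae_lambda": 1.0,
    "entropy_coef": 0.001,
    "kl_coef": 0.001,                    # removed in StarPO-S
    "clip_eps_low": 0.2,
    "clip_eps_high": 0.28,               # asymmetric upper bound (StarPO-S)
    "trajectories_per_iteration": 128,   # P x N
    "minibatch_size": 32,                # E
    "lora": {"rank": 64, "alpha": 64},
}
```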

7. Limitations, Open Problems, and Future Directions

  • Echo Trap/Catastrophe: Even with StarPO-S stabilization, reward distribution collapse remains a risk under long-horizon, sparse-reward, or high-variance tasks.
  • Supervised vs. RL Performance: There is a consistent and wide performance gap between imitation-learned (SFT) and RL-evolved LLM agents, particularly in environments with clear task decompositions (Sokoban).
  • Reasoning Reward Shaping: Without direct reward signals for intermediate reasoning steps, the model tends to revert to shallow or hallucinated thoughts.
  • Generalization: Optimal generalization requires balancing prompt diversity, rollout granularity, and sampling frequency. Over-sampling or reusing old rollouts degrades convergence.
  • Scaling: Scaling to larger models and more complex environments may require further architectural advances, e.g., higher-capacity critics, advanced uncertainty quantification, or adaptive rollout management.

The findings in (Wang et al., 24 Apr 2025) collectively indicate that for multi-turn LLM agent RL, stabilizing training, maintaining high-quality rollout diversity, and shaping rewards for genuine reasoning behavior are necessary for robust self-evolution, but achieving parity with demonstration-based methods in structured reasoning tasks remains an open challenge.

