StarPO: State-Thinking-Actions-Reward Policy Optimization
- The paper introduces a framework that integrates explicit thought actions with environment actions, leading to enhanced sample efficiency and improved performance in multi-turn scenarios.
- It defines a thought-MDP that extends classical MDPs by incorporating deterministic, zero-reward thought actions to enable local policy refinements through internal reasoning.
- It presents StarPO-S, a stabilized variant using uncertainty-based trajectory filtering, asymmetric clipping, and removal of KL penalties to mitigate training instabilities.
State-Thinking-Actions-Reward Policy Optimization (StarPO) is a trajectory-level reinforcement learning (RL) framework for agents—particularly LLMs—that decomposes decision processes into explicit state, thinking (reasoning), action, and reward components. Unlike standard RL, which typically maximizes return over actions alone, StarPO jointly models and optimizes both the reasoning trace (i.e., internal deliberation or "thinking") and external actions, integrating these into a single end-to-end policy that seeks to maximize cumulative trajectory reward. StarPO and its stabilized variant StarPO-S emerged as practical solutions to the unique challenges in training multi-turn, autoregressive RL agents and have provided crucial insight into the emergence and stability of reasoning behaviors in LLMs and other embodied agents (Hanna et al., 20 Jun 2025, Wang et al., 24 Apr 2025).
1. Conceptual Foundations and Formalism
StarPO builds from the "thought Markov decision process" (thought-MDP) framework, which minimally extends the classical MDP by augmenting the state-action space to explicitly include a set of thought states and thought actions. Formally, in a thought-MDP, an agent interacts with an environment according to the tuple:
- $\mathcal{M}_{\text{thought}} = \langle S, T, A, C, p, p_T, r, \gamma \rangle$
- $S$: environment-state space
- $A$: environment action set (executable actions)
- $T$: finite set of thought-states
- $C$: finite set of thought-actions
- $p$: standard environment transition kernel
- $p_T$: thought-state transition function (typically deterministic)
- $r$: nonnegative reward function (no direct reward for thinking)
- $\gamma$: discount factor
At each timestep, the agent observes the pair $(s_t, \tau_t) \in S \times T$ and selects either an environment action $a_t \in A$ or a thought action $a_t \in C$. If $a_t \in A$, the standard environment dynamics and reward apply; if $a_t \in C$, only the thought-state transitions via $p_T$, with no environment change and $r_t = 0$. The value function is $V^{\pi}(s, \tau) = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t} r_t \,\middle|\, s_0 = s, \tau_0 = \tau\right]$.
This structure enables the explicit modeling and optimization of periods of "thinking" or internal reasoning, treated formally as (zero-reward) actions that update an internal latent state (the thought state) but incur a temporal (discounting) cost.
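To make these dynamics concrete, below is a minimal sketch of a single thought-MDP transition in Python, assuming small tabular dynamics; the dictionary-based tables and the function name are illustrative assumptions, not from the papers.

```python
import random

def thought_mdp_step(s, tau, a, C, env_p, thought_p, reward):
    """One transition of a thought-MDP (illustrative tabular sketch).

    s: environment state, tau: thought state, a: chosen action.
    C: set of thought actions.
    env_p[(s, a)]: list of (next_state, probability) pairs.
    thought_p[(tau, a)]: next thought state (deterministic).
    reward[(s, a)]: nonnegative reward for environment actions.
    """
    if a in C:
        # Thought action: only the thought state changes, no reward is collected.
        return s, thought_p[(tau, a)], 0.0
    # Environment action: sample the next environment state; the thought state
    # is carried over unchanged (a simplifying assumption for this sketch).
    next_states, probs = zip(*env_p[(s, a)])
    s_next = random.choices(next_states, weights=probs)[0]
    return s_next, tau, reward[(s, a)]
```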
2. Theoretical Results on Thinking and Policy Improvement
A key theoretical result is that, in thought-MDPs, optimal policies never execute thought actions: any such step only delays reward collection without increasing the agent's expected return (Proposition 1). However, during learning, thought actions can emerge as policy-improvement steps for sub-optimal or pre-trained policies.
Theorem 1 establishes that a policy improvement step will invoke thinking (select some thought action $c \in C$) only if transitioning the thought-state from $\tau$ to $\tau' = p_T(\tau, c)$ raises the discounted continuation value, i.e., only if $\gamma V^{\pi}(s, \tau') > V^{\pi}(s, \tau)$. Thus, "thinking" corresponds exactly to a local policy refinement: when the currently active sub-policy is (temporarily) sub-optimal, the agent executes one or more thought actions to switch into a higher-value local strategy before acting in the environment.
A corollary is that chaining multiple thought-actions for further refinement is justified only if each successive thought-state strictly increases the associated value.
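To illustrate the trade-off, here is a short worked version of the improvement condition reconstructed above, using an illustrative discount factor:

```latex
% Thinking for one step defers reward by one discount factor, so it pays off
% only if the switched-to thought-state is sufficiently more valuable:
\gamma\, V^{\pi}(s, \tau') > V^{\pi}(s, \tau)
\quad\Longleftrightarrow\quad
V^{\pi}(s, \tau') > \tfrac{1}{\gamma}\, V^{\pi}(s, \tau).
% With \gamma = 0.95, the new sub-policy must be worth more than
% V^{\pi}(s,\tau)/0.95 \approx 1.053\, V^{\pi}(s,\tau);
% chaining k thought actions must overcome a factor of \gamma^{-k}.
```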
Additional analysis (Proposition 2) shows that, in goal-oriented tasks, access to reward-improving thought transitions strictly reduces the effective planning horizon—accelerating goal arrival when thinking allows access to better strategies not immediately available from the current policy.
3. Emergence Conditions for Thinking Through RL
The emergence of thinking actions in model-free RL depends on three central conditions:
- Policy Initialization: The initial policy must embed sub-policies (indexed by thought-state) whose returns differ for at least some environment states, so that switching via thought actions can expose higher-value sub-policies.
- Deterministic, No-Cost Thought Dynamics: Thought actions induce deterministic transitions and have no intrinsic cost beyond time-discounting.
- Nonnegative Reward Structure: The environment should offer nonnegative rewards with reachable positive reward from any state, ensuring agents are motivated to improve.
This set of requirements explains why pre-training or behavior cloning to initialize policies with "sub-skills" (i.e., high-value subpolicies in certain thought-states) can enable RL to exploit thinking: RL can then learn to invoke thought actions in order to trigger subpolicies that solve new or composite tasks more effectively.
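As a concrete illustration of this initialization condition, the toy sketch below represents the policy as a table of pre-trained sub-policies indexed by thought-state, so a thought action simply switches the active index; the skill names and structure are illustrative assumptions, not taken from the papers.

```python
# Toy illustration: pre-trained sub-policies indexed by thought-state.
# A thought action only switches the active index; environment actions
# come from whichever sub-policy the current thought-state selects.

SUBPOLICIES = {
    "skill_add":   lambda s: ("add", s),    # assumed pre-trained sub-skill
    "skill_carry": lambda s: ("carry", s),  # assumed pre-trained sub-skill
}

def act(s, tau, thought_action=None):
    """Return (next_thought_state, environment_action_or_None)."""
    if thought_action is not None:
        # Thought action: switch the active sub-policy (named here by the
        # target thought-state for simplicity); no environment action, zero reward.
        return thought_action, None
    return tau, SUBPOLICIES[tau](s)          # act with the current sub-skill
```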
Empirical evidence (LLM arithmetic tasks and a tailored Gridworld domain) demonstrates that agents initialized with such sub-policy structures plus thought actions achieve substantially better sample efficiency than agents lacking these properties.
4. StarPO Algorithm: Design and Implementation
StarPO operationalizes these theoretical principles as a practical RL algorithm. The objective is the maximization of the usual discounted return in the thought-MDP:
$J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[\sum_{t=0}^{\infty} \gamma^{t} r(s_t, a_t)\right]$
StarPO uses a single (typically neural) policy network, parameterized by $\theta$, which receives state embeddings for both $s_t$ and $\tau_t$ and outputs logits over the joint action space $A \cup C$:
$a_t \sim \pi_\theta(\cdot \mid s_t, \tau_t), \quad a_t \in A \cup C$
The environment and thought transition functions are assumed known or fixed (e.g., small deterministic tables).
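A minimal sketch of such a joint-action policy head, assuming a PyTorch-style setup; the module layout, names, and sizes are illustrative assumptions rather than the papers' implementation.

```python
import torch
import torch.nn as nn

class JointPolicy(nn.Module):
    """Policy over the joint action space A ∪ C, conditioned on (s_t, τ_t)."""

    def __init__(self, state_dim, thought_dim, hidden, n_env_actions, n_thought_actions):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(state_dim + thought_dim, hidden), nn.ReLU(),
        )
        # Single logit vector covering both environment and thought actions.
        self.head = nn.Linear(hidden, n_env_actions + n_thought_actions)

    def forward(self, s_emb, tau_emb):
        x = torch.cat([s_emb, tau_emb], dim=-1)
        return torch.distributions.Categorical(logits=self.head(self.body(x)))
```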
StarPO learning loop (policy gradient style):
```python
for iteration in range(N):
    trajectories = []
    # Roll out K episodes in the thought-MDP with the current policy pi_theta
    for _ in range(K):
        s, tau, done, trajectory = env.reset(), tau_0, False, []
        while not done:
            a = policy.sample(s, tau)          # a_t ~ pi_theta(. | s_t, tau_t)
            if a in C:                         # thought action: update tau only
                tau, r = p_T[(tau, a)], 0.0
            else:                              # environment action: step the env
                s, r, done = env.step(a)
            trajectory.append((s, tau, a, r))
        trajectories.append(trajectory)
    # Compute returns G_t and perform a policy-gradient update
    theta += alpha * policy_gradient(trajectories, gamma)
```
5. Advanced Stabilization: The StarPO-S Variant
Multi-turn RL with LLMs presents unique instability modes, notably the “Echo Trap” pattern where reward variance dramatically drops and gradients spike. StarPO-S introduces three empirical stabilization techniques (Wang et al., 24 Apr 2025):
- Uncertainty-Based Trajectory Filtering: For each batch, prompts with the lowest in-group reward variance (i.e., "solved" or "unsolvable" instances) are filtered out. This focuses updates on informative, learnable samples, improving both learning signal and stability.
- KL-Term Removal: The Kullback-Leibler divergence penalty term in PPO is omitted (i.e., its coefficient is set to zero), decoupling the learning dynamics from the initialization distribution and enhancing exploration.
- Asymmetric ("Clip-Higher") Clipping: Token-level policy-ratio clipping is applied asymmetrically, with a larger upper clip bound than lower bound ($\epsilon_{\text{high}} > \epsilon_{\text{low}}$), permitting larger positive updates while preserving protection against excessive divergence.
These modifications delay or eliminate collapse across diverse environments and improve peak performance and robustness.
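The following is a minimal sketch of how these stabilizers could be wired into a PPO-style update; the keep fraction, clip bounds, and helper names are illustrative assumptions, not values reported in the paper.

```python
import torch

def filter_by_uncertainty(groups, keep_frac=0.5):
    """Keep the prompts whose rollout rewards vary the most (StarPO-S-style filtering).

    groups: list of (prompt, rewards_per_rollout) pairs; low-variance groups
    ("solved" or "unsolvable" prompts) carry little learning signal and are dropped.
    """
    scored = sorted(groups,
                    key=lambda g: torch.tensor(g[1], dtype=torch.float).std(),
                    reverse=True)
    return scored[: max(1, int(keep_frac * len(scored)))]

def clip_higher_loss(ratio, advantage, eps_low=0.2, eps_high=0.3):
    """PPO-style clipped loss with an asymmetric clip range; no KL penalty is added."""
    clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high)
    return -torch.min(ratio * advantage, clipped * advantage).mean()
```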
6. Implementation in LLM Agent Contexts
In the LLM domain, StarPO decomposes each interaction turn as follows (a minimal parsing sketch is given after the list):
- State ($s_t$): the full transformer input prefix, including all previous “<think>…</think>” and “<answer>…</answer>” segments.
- Thinking ($\tau_t$): a token sequence containing both explicit reasoning (“<think>…</think>”) and the output action (“<answer> a_t </answer>”).
- Action ($a_t$): the environment-executable output extracted from the <answer> tags.
- Reward ($r_t$): externally supplied environment feedback, possibly with additional shaping or format-consistency penalties.
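A minimal sketch of this per-turn parsing, assuming the <think>/<answer> tag convention above; the regular expressions and the format flag are illustrative, not the papers' exact implementation.

```python
import re

def parse_turn(output_text):
    """Split one LLM turn into (thinking, action, format_ok)."""
    think = re.search(r"<think>(.*?)</think>", output_text, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", output_text, re.DOTALL)
    format_ok = think is not None and answer is not None
    thinking = think.group(1).strip() if think else ""
    action = answer.group(1).strip() if answer else ""
    return thinking, action, format_ok
```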
Policy optimization proceeds at the token level, with both actor-only (GRPO-style) and actor-critic (PPO-style) updates. The key equations include:
- StarPO (trajectory-level) objective: $J_{\text{StarPO}}(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ R(\tau) \right]$, where $R(\tau)$ is the cumulative trajectory reward.
- PPO token-level clipped loss: $L^{\text{clip}}(\theta) = \mathbb{E}_t\!\left[ \min\!\left( \rho_t(\theta)\, \hat{A}_t,\ \operatorname{clip}\!\left(\rho_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right) \hat{A}_t \right) \right]$,
where $\rho_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\text{old}}}(a_t \mid s_t)$ is the token-level probability ratio and $\hat{A}_t$ is the advantage estimate.
- GRPO normalized advantage: $\hat{A}_i = \dfrac{R_i - \operatorname{mean}(\{R_j\}_{j=1}^{G})}{\operatorname{std}(\{R_j\}_{j=1}^{G})}$, computed over a group of $G$ rollouts of the same prompt.
- Uncertainty filtering metric (StarPO-S): the within-group reward variation, e.g., $U(p) = \operatorname{std}\!\left(\{R(\tau_i)\}_{i=1}^{G}\right)$ for prompt $p$; prompts with the lowest $U(p)$ are filtered out.
The structure allows credit assignment not just to final decisions but to the entirety of the "reasoning" trace. Reward signals can be shaped to penalize missing or malformed <think> structure, further encouraging explicit, trackable deliberation.
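As a small sketch of how these pieces combine in code, the snippet below computes GRPO-style group-normalized advantages and applies an illustrative format-shaping penalty; the penalty value and function names are assumptions, not from the papers.

```python
import torch

def grpo_advantages(rewards):
    """Group-normalized advantages over G rollouts of the same prompt (GRPO-style)."""
    r = torch.tensor(rewards, dtype=torch.float)
    return (r - r.mean()) / (r.std() + 1e-8)

def shaped_reward(env_reward, format_ok, format_penalty=0.1):
    """Illustrative shaping: penalize turns with missing or malformed <think>/<answer> tags."""
    return env_reward - (0.0 if format_ok else format_penalty)
```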
7. Contrasts with Prior LLM-RL and Broader Significance
Previously, RL for LLM agents typically applied PPO or GRPO at the prompt–response level, treating each generation in isolation. Such approaches struggle to handle multi-turn, long-horizon tasks and cannot assign reward to intermediate reasoning. StarPO's trajectory-centric, reasoning-aware approach enables unified optimization over both actions and reasoning sequences, bridging the gap between token-level policy training and end-to-end agent performance.
Empirical results indicate that integrating reasoning structures and thought actions—when appropriately initialized—not only improves performance in compositional or multi-step tasks but also yields dramatic sample efficiency gains unattainable with naïve, purely reactive RL. The introduction of StarPO-S also demonstrates that trajectory-level stabilization is crucial to prevent collapse in high-variance, multi-turn RL settings (Wang et al., 24 Apr 2025).
A plausible implication is that broad, multi-task pre-training followed by StarPO fine-tuning could be a paradigm for scalable, self-improving reasoning agents, especially as environment and reasoning complexity increase.
In summary, State-Thinking-Actions-Reward Policy Optimization (StarPO) defines a principled and practical framework for end-to-end RL in environments requiring both external action and explicit, creditable reasoning. By extending the policy domain to include internal thought processes and stabilizing training with targeted sampling and gradient control (StarPO-S), this approach provides a viable path toward general, multi-turn, reasoning-capable agents in both language and non-language domains (Hanna et al., 20 Jun 2025, Wang et al., 24 Apr 2025).