Trajectory Purification in Reinforcement Learning

Updated 28 January 2026
  • Trajectory purification is a technique that systematically eliminates or repairs failure segments in RL trajectories, ensuring only successful steps contribute to learning.
  • The SAAR mechanism uses adaptive lookahead correction and semantic similarity measures to decide between shallow and deep repairs for effective error handling.
  • This method significantly improves sample efficiency and credit assignment, yielding notable accuracy gains on benchmarks like AIME24/25 and GPQA.

Trajectory purification refers to methods for systematically eliminating or repairing failure segments and associated errors in agentic reinforcement learning (RL) trajectories, producing “purified” paths that yield more informative gradients and more efficient policy optimization. In the context of LLMs augmented with tool use, such as Python code execution, trajectory purification is fundamental for mitigating spurious credit assignment and the variance induced by noisy feedback. The Similarity-Aware Adaptive Rollback (SAAR) mechanism, introduced in the CLEANER framework, offers an influential and rigorously evaluated approach to this problem, combining self-correction and semantic similarity heuristics to generate training data composed only of successful, self-consistent steps (Xu et al., 21 Jan 2026).

1. Formalization and Notation

Trajectory purification operates on RL rollouts (trajectories) comprising alternating sequences of reasoning, code, and observations. Let $x$ be the user query, $\mathcal{E}$ the code execution environment, and $\pi_\theta$ the agent’s policy. Each trajectory $\tau$ is a sequence

$$\tau = \bigl[x,\, (r_0, c_0, o_0),\, (r_1, c_1, o_1),\, \ldots,\, (r_T, c_T, o_T)\bigr],$$

where $r_t$ is a text-based chain-of-thought, $c_t$ the generated code, and $o_t$ the outcome, written $o_t^+$ for success and $o_t^-$ for failure.
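As a concrete rendering of this notation, the trajectory structure can be sketched in Python; the class and field names here are illustrative choices, not taken from the paper:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Step:
    """One (reasoning, code, outcome) triple (r_t, c_t, o_t)."""
    reasoning: str    # r_t: text-based chain-of-thought
    code: str         # c_t: generated code
    success: bool     # True for o_t^+, False for o_t^-
    output: str = ""  # execution feedback from the environment E

@dataclass
class Trajectory:
    """tau = [x, (r_0, c_0, o_0), ..., (r_T, c_T, o_T)]."""
    query: str        # x: the user query
    steps: List[Step] = field(default_factory=list)

traj = Trajectory(query="Compute 2**10")
traj.steps.append(Step(reasoning="Use Python exponentiation.",
                       code="print(2**10)", success=True, output="1024"))
```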

A failure segment at step $t$ is given by $(r_t, c_t, o_t^-)$. The model may attempt self-correction, producing a corrected pair $(r_t^\star, c_t^\star)$ with successful outcome $o_t^+$. The result of trajectory purification via SAAR is a new sequence $\tilde\tau$ in which each successfully corrected failure segment is replaced by a repaired tuple containing $c_t^\star$, $o_t^+$, and either the original or the corrected reasoning.

2. Mechanism: Similarity-Aware Adaptive Rollback (SAAR)

The SAAR mechanism is applied during data collection whenever an execution failure $o_t^-$ occurs, in two phases:

  • Phase I (Lookahead Correction): Augmenting the context with the error feedback $o_t^-$, the agent generates up to $K$ corrective attempts under policy $\pi_\theta$ until a successful execution $o_t^+$ is produced.
  • Phase II (Similarity-Aware Replacement): Compute the semantic similarity $s$ between the failed code $c_t$ and the corrected code $c_t^\star$, using difflib.SequenceMatcher. With a threshold $\delta$, the “rollback rule” specifies the replacement:

$$(r_t,\, c_t,\, o_t^-) \;\longrightarrow\;
\begin{cases}
(r_t,\, c_t^\star,\, o_t^+), & s \ge \delta \quad \text{(shallow repair)}\\
(r_t^\star,\, c_t^\star,\, o_t^+), & s < \delta \quad \text{(deep repair)}
\end{cases}$$

The resulting tuple then enters the purified trajectory $\tilde\tau$.

Additionally, after substitution, the log-probabilities $\log \pi_\theta$ of the substituted segment are recomputed (with RadixAttention) to ensure correct importance weighting.
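A minimal sketch of the Phase II rollback rule, using difflib.SequenceMatcher as named in the text; the threshold value and the function signature are illustrative assumptions:

```python
import difflib

DELTA = 0.6  # similarity threshold delta; illustrative value, not from the paper

def rollback(r_t, c_t, r_star, c_star, delta=DELTA):
    """Similarity-aware replacement: return the purified (reasoning, code) pair.

    r_t, c_t       -- original reasoning and failed code
    r_star, c_star -- reasoning and code of the successful corrective attempt
    """
    s = difflib.SequenceMatcher(None, c_t, c_star).ratio()
    if s >= delta:
        # Shallow repair: minor (e.g. syntactic) fix -- keep the original reasoning.
        return r_t, c_star
    # Deep repair: substantive change -- adopt the corrected reasoning as well.
    return r_star, c_star

# A one-character fix is highly similar, so the original reasoning survives:
r, c = rollback("sum the list", "print(sum(xs)",
                "fix the parenthesis", "print(sum(xs))")
```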

3. Adaptive Granularity and Purification Strategies

SAAR adaptively selects replacement granularity according to semantic similarity:

  • Shallow Repair ($s \ge \delta$): Minor corrections (e.g., syntactic changes), retaining the original reasoning $r_t$.
  • Deep Repair ($s < \delta$): Substantive corrections (e.g., alternative algorithmic logic), replacing the reasoning with $r_t^\star$.

This granular replacement ensures that error-recovery patterns are excised and only valid, direct reasoning chains are encoded in purified trajectories.

4. Algorithmic Description

The SAAR algorithm can be summarized as follows:

  1. Detect failure $o_t^-$ at step $t$.
  2. For up to $K$ retries, generate corrections $(r_t^\star, c_t^\star)$ and evaluate $\mathcal{E}(c_t^\star)$.
  3. If the execution succeeds, compute the similarity $s$ between $c_t$ and $c_t^\star$.
  4. If $s \ge \delta$, construct the tuple with the original $r_t$ (shallow repair); else, substitute $r_t^\star$ (deep repair).
  5. Substitute the repaired tuple into $\tilde\tau$.
  6. Update log-probabilities as needed.

This procedure results in trajectories for RL training that are free from superfluous error-correcting steps and systematically align credit assignment with actually successful reasoning and action choices.
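The algorithm above can be sketched end-to-end in Python. Here `env`, `policy`, the retry limit, and the threshold default are illustrative assumptions, and unrepaired failures are simply dropped (one possible handling; the paper may treat them differently):

```python
import difflib

def saar_purify(env, policy, trajectory, k_retries=2, delta=0.6):
    """Sketch of SAAR purification; k_retries and delta are illustrative defaults.

    env(code) -> (success, output) executes a code segment.
    policy(context) -> (reasoning, code) samples a corrective attempt.
    trajectory: list of (reasoning, code, success, output) tuples.
    """
    purified = []
    for r_t, c_t, ok, out in trajectory:
        if ok:
            purified.append((r_t, c_t, True, out))
            continue
        # Phase I: lookahead correction with the error feedback in context.
        for _ in range(k_retries):
            r_star, c_star = policy({"failed_code": c_t, "error": out})
            ok_star, out_star = env(c_star)
            if ok_star:
                # Phase II: similarity-aware replacement (shallow vs. deep).
                s = difflib.SequenceMatcher(None, c_t, c_star).ratio()
                r_new = r_t if s >= delta else r_star
                purified.append((r_new, c_star, True, out_star))
                break
        # Steps that no retry repairs are left out of the purified trajectory.
    return purified
```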

5. Theoretical Implications

Trajectory purification via SAAR addresses critical credit-assignment and variance-reduction problems in RL with outcome-only reward $R(\tau)$. Standard RL assigns positive reward to all steps in any successful trajectory, including failed actions that were subsequently self-corrected. Purification ensures that only directly successful actions are credited, yielding unbiased policy gradients: the gradient is estimated over purified trajectories $\tilde\tau$, every step of which carries a successful outcome $o_t^+$. Purification thereby improves sample efficiency, especially in parameter-constrained model regimes.
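As a sketch, assuming outcome-only reward $R(\tilde\tau) \in \{0,1\}$ and the notation of Section 1 (this is a standard REINFORCE-style form, not necessarily the paper's exact statement), the purified policy gradient can be written as:

```latex
\nabla_\theta J(\theta) \;\approx\;
\mathbb{E}_{\tilde\tau \sim \pi_\theta}\!\left[
  R(\tilde\tau) \sum_{t=0}^{T}
  \nabla_\theta \log \pi_\theta\bigl(r_t, c_t \mid x,\ \tilde\tau_{<t}\bigr)
\right]
```

Because every step of $\tilde\tau$ ends in $o_t^+$, positive reward no longer propagates through failed actions.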

6. Empirical Results and Experimental Context

CLEANER, using SAAR, was evaluated on AIME24/25, GPQA, and LiveCodeBench with 4B- and 7B-parameter LLMs. Key experimental settings are:

  • Retry limit $K$ for lookahead correction
  • Similarity threshold $\delta$ for the rollback rule
  • Curriculum mixing: SAAR on 70% of rollouts, 30% raw
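The curriculum-mixing step can be sketched as follows; the helper and its signature are hypothetical, and only the 70/30 split comes from the source:

```python
import random

SAAR_FRACTION = 0.7  # fraction of rollouts purified with SAAR (70/30 mix)

def collect_batch(rollouts, purify, rng=random):
    """Apply SAAR purification to ~70% of rollouts, keep ~30% raw.

    `purify` is any trajectory-purification function (e.g. the SAAR sketch).
    """
    batch = []
    for traj in rollouts:
        if rng.random() < SAAR_FRACTION:
            batch.append(purify(traj))   # purified rollout
        else:
            batch.append(traj)           # raw rollout
    return batch
```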

Comparison to DAPO and SOTA baselines is summarized:

| Method | AIME24 | AIME25 | GPQA | LiveCode | RL Steps |
|---|---|---|---|---|---|
| DAPO-baseline (4B) | 66.7 | 59.4 | 56.9 | 49.5 | 250 |
| CLEANER-4B (SAAR) | 72.7 | 67.1 | 60.2 | 54.9 | 250 |
| DemyAgent-4B (SOTA) | 72.6 | 70.0 | 58.5 | 51.7 | 750 |

Ablation with Qwen3-4B-Instruct demonstrates absolute accuracy gains of +6–7% on math (AIME24/25), +3% on GPQA, and +5% on LiveCodeBench versus the baseline. Notably, CLEANER with SAAR achieves or exceeds SOTA with only one-third the RL training steps. Figure 1 in the original text displays suppressed error rates and accelerated accuracy improvements over training.

7. Impact and Prospective Significance

The retrospective, self-purified trajectory construction enabled by SAAR ensures policies internalize correct reasoning patterns while minimizing error-recovery loops. The main effects are: (i) elimination of noisy and misleading credit assignments, (ii) reduction of policy gradient variance, and (iii) marked improvements in both accuracy and sample efficiency, particularly for compute-constrained LLMs. These outcomes have been corroborated on competitive mathematical, scientific, and code-based benchmarks (Xu et al., 21 Jan 2026).

A plausible implication is that trajectory purification, distinct from external filtering or reward shaping, constitutes a scalable paradigm for high-accuracy, efficient agentic RL with realistic tool-augmented LLMs.
