Trajectory Purification in Reinforcement Learning
- Trajectory purification is a technique that systematically eliminates or repairs failure segments in RL trajectories, ensuring only successful steps contribute to learning.
- The SAAR mechanism uses adaptive lookahead correction and semantic similarity measures to decide between shallow and deep repairs for effective error handling.
- This method significantly improves sample efficiency and credit assignment, yielding notable accuracy gains on benchmarks like AIME24/25 and GPQA.
Trajectory purification refers to methods for systematically eliminating or repairing failure segments and associated errors in agentic reinforcement learning (RL) trajectories, producing “purified” paths that yield more informative gradients and more efficient policy optimization. In the context of LLMs augmented with tool use—such as Python code execution—trajectory purification is fundamental for mitigating spurious credit assignment and the variance induced by noisy feedback. The Similarity-Aware Adaptive Rollback (SAAR) mechanism, introduced in the CLEANER framework, offers an influential and rigorously evaluated approach to this problem, combining self-correction and semantic similarity heuristics to generate training data composed solely of successful, self-consistent steps (Xu et al., 21 Jan 2026).
1. Formalization and Notation
Trajectory purification operates on RL rollouts (trajectories) comprising alternating sequences of reasoning, code, and observations. Let $q$ be the user query, $\mathcal{E}$ the code execution environment, and $\pi_\theta$ the agent’s policy. Each trajectory is a sequence $\tau = \big((t_1, c_1, o_1), \dots, (t_T, c_T, o_T)\big)$, where $t_i$ is a text-based chain-of-thought, $c_i$ the generated code, and $o_i \in \{0, 1\}$ the outcome, with $o_i = 1$ for success and $o_i = 0$ for failure.
A failure segment at step $i$ is given by $(t_i, c_i, o_i)$ with $o_i = 0$. The model may attempt self-correction, producing $(t_i', c_i', o_i')$ with $o_i' = 1$. The result of trajectory purification via SAAR is a new sequence $\tilde{\tau}$ in which each failed step that is successfully corrected is replaced by its repaired counterpart $(\tilde{t}_i, \tilde{c}_i, o_i')$.
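The notation above maps onto a minimal data structure. The following sketch is illustrative; the field names are chosen here for readability and are not from the paper:

```python
# Minimal representation of a trajectory step (t_i, c_i, o_i).
from dataclasses import dataclass
from typing import List

@dataclass
class Step:
    thought: str   # t_i: text-based chain-of-thought
    code: str      # c_i: generated code segment
    outcome: int   # o_i: 1 for successful execution, 0 for failure

Trajectory = List[Step]

def is_purified(traj: Trajectory) -> bool:
    """A purified trajectory contains only successful steps (o_i = 1)."""
    return all(step.outcome == 1 for step in traj)
```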
2. Mechanism: Similarity-Aware Adaptive Rollback (SAAR)
The SAAR mechanism is applied during data collection whenever an execution failure ($o_i = 0$) occurs, in two phases:
- Phase I (Lookahead Correction): Augmenting the context with the error message $e_i$, the agent generates up to $K$ corrective attempts under policy $\pi_\theta$ until a successful execution ($o_i' = 1$) is produced.
- Phase II (Similarity-Aware Replacement): Compute the semantic similarity $s = \mathrm{sim}(c_i, c_i')$ between the failed and corrected code segments, using difflib.SequenceMatcher. With a threshold $\delta$ (fixed to a default in CLEANER), the “rollback rule” specifies the replacement:

$$
(\tilde{t}_i, \tilde{c}_i) =
\begin{cases}
(t_i,\, c_i') & \text{if } s \ge \delta \quad \text{(shallow repair)} \\
(t_i',\, c_i') & \text{if } s < \delta \quad \text{(deep repair)}
\end{cases}
$$

The tuple $(\tilde{t}_i, \tilde{c}_i, o_i')$ then enters the purified trajectory.
Additionally, after substitution, the trajectory log-probability $\log \pi_\theta(\tilde{\tau})$ is recomputed (with RadixAttention) to ensure correct importance weighting.
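Phase II can be sketched directly with `difflib.SequenceMatcher`, which the paper names for the similarity computation. The threshold value below is an assumption for illustration only; the paper’s default is not reproduced here:

```python
import difflib

DELTA = 0.6  # assumed threshold for illustration; CLEANER fixes its own default

def similarity(failed_code: str, corrected_code: str) -> float:
    """Semantic similarity s = sim(c_i, c_i') between two code segments."""
    return difflib.SequenceMatcher(None, failed_code, corrected_code).ratio()

def rollback(step, corrected_thought, corrected_code, delta=DELTA):
    """Rollback rule: keep the original reasoning for shallow repairs
    (s >= delta), replace it for deep repairs (s < delta).
    `step` is a (thought, code, outcome) tuple."""
    thought, code, _ = step
    s = similarity(code, corrected_code)
    if s >= delta:
        return (thought, corrected_code, 1)        # shallow repair
    return (corrected_thought, corrected_code, 1)  # deep repair
```

For example, a one-character syntactic fix such as `print(x` → `print(x)` scores a high ratio and keeps the original reasoning, while a rewrite to entirely different logic falls below the threshold and replaces it.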
3. Adaptive Granularity and Purification Strategies
SAAR adaptively selects replacement granularity according to semantic similarity:
- Shallow Repair ($s \ge \delta$): Minor corrections (e.g., syntactic changes), retaining the original reasoning $t_i$.
- Deep Repair ($s < \delta$): Substantive corrections (e.g., alternative algorithmic logic), replacing the reasoning with $t_i'$.
This granular replacement ensures that error-recovery patterns are excised and only valid, direct reasoning chains are encoded in purified trajectories.
4. Algorithmic Description
The SAAR algorithm can be summarized as follows:
- Detect a failure $o_i = 0$ at step $i$.
- For up to $K$ retries, generate corrections $c_i^{(k)}$ and evaluate their outcomes $o_i^{(k)}$.
- If $o_i^{(k)} = 1$, compute $s = \mathrm{sim}(c_i, c_i^{(k)})$.
- If $s \ge \delta$, construct the purified tuple with the original reasoning $t_i$ (shallow repair); else, substitute the corrected reasoning $t_i^{(k)}$ (deep repair).
- Substitute the repaired step into $\tau$ to obtain $\tilde{\tau}$.
- Update log-probabilities as needed.
This procedure results in trajectories for RL training that are free from superfluous error-correcting steps and systematically align credit assignment with actually successful reasoning and action choices.
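The steps above can be combined into a single purification pass. This is a hedged sketch, not the paper’s implementation: `policy.correct` and `env.run` are hypothetical interfaces standing in for the agent and the code executor, and the retry limit and threshold values are illustrative:

```python
import difflib

def saar_purify(trajectory, policy, env, max_retries=3, delta=0.6):
    """Return a purified copy of `trajectory`, a list of
    (thought, code, outcome) tuples. Parameter values are illustrative."""
    purified = []
    for thought, code, outcome in trajectory:
        if outcome == 1:                      # already successful: keep as-is
            purified.append((thought, code, 1))
            continue
        # Phase I: lookahead correction under the current policy.
        repaired = None
        for _ in range(max_retries):
            new_thought, new_code = policy.correct(thought, code)
            if env.run(new_code) == 1:
                repaired = (new_thought, new_code)
                break
        if repaired is None:                  # unrecoverable: leave untouched
            purified.append((thought, code, outcome))
            continue
        # Phase II: similarity-aware replacement (rollback rule).
        new_thought, new_code = repaired
        s = difflib.SequenceMatcher(None, code, new_code).ratio()
        if s >= delta:
            purified.append((thought, new_code, 1))      # shallow repair
        else:
            purified.append((new_thought, new_code, 1))  # deep repair
    return purified
```

Recomputing the trajectory log-probability after substitution is omitted here, as it depends on the serving stack.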
5. Theoretical Implications
Trajectory purification via SAAR addresses critical credit assignment and variance reduction problems in RL with outcome-only reward $R(\tau) \in \{0, 1\}$. Standard RL assigns positive reward to all steps in any successful trajectory $\tau$, including failed actions if subsequently self-corrected. Purification ensures that only directly successful actions are credited, yielding unbiased policy gradients. The formal claim is that every step of the purified trajectory is successful:

$$
\tilde{o}_i = 1 \quad \text{for all } i \in \{1, \dots, T\},
$$

so the outcome reward reinforces only steps that directly contributed to success. Purification thereby improves sample efficiency, especially in parameter-constrained model regimes.
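A toy numeric illustration of this claim, under the common simplification that an outcome-only reward is broadcast uniformly to every step of a trajectory:

```python
def step_credits(trajectory, reward):
    """Broadcast the trajectory-level reward R(tau) to every step, as an
    outcome-only objective effectively does; returns (outcome, credit) pairs."""
    return [(outcome, reward) for (_, _, outcome) in trajectory]

# A raw successful trajectory containing a failure that was later self-corrected:
raw = [("t1", "c1", 1), ("t2", "c2", 0), ("t2'", "c2'", 1)]
# Its purified counterpart, with the failed step replaced:
purified = [("t1", "c1", 1), ("t2'", "c2'", 1)]

# In the raw trajectory, the failed step (outcome 0) still receives credit 1;
# in the purified trajectory, every credited step is a success.
assert any(o == 0 and credit == 1 for o, credit in step_credits(raw, 1))
assert all(o == 1 for o, _ in step_credits(purified, 1))
```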
6. Empirical Results and Experimental Context
CLEANER, using SAAR, was evaluated on AIME24/25, GPQA, and LiveCodeBench with 4B- and 7B-parameter LLMs. Key experimental settings are:
- Retry limit $K$ for lookahead correction
- Similarity threshold $\delta$ for the rollback rule
- Curriculum mixing: SAAR applied to 70% of rollouts, 30% kept raw
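A minimal sketch of that mixing schedule; only the 70/30 split comes from the text, and the per-rollout routing below is an assumed implementation:

```python
import random

def mix_rollouts(rollouts, purify, saar_fraction=0.7, seed=0):
    """Route each rollout to SAAR purification with probability
    `saar_fraction`; keep the remainder raw."""
    rng = random.Random(seed)
    return [purify(t) if rng.random() < saar_fraction else t for t in rollouts]
```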
Comparison to DAPO and SOTA baselines is summarized:
| Method | AIME24 | AIME25 | GPQA | LiveCode | RL Steps |
|---|---|---|---|---|---|
| DAPO-baseline (4B) | 66.7 | 59.4 | 56.9 | 49.5 | 250 |
| CLEANER-4B (SAAR) | 72.7 | 67.1 | 60.2 | 54.9 | 250 |
| DemyAgent-4B (SOTA) | 72.6 | 70.0 | 58.5 | 51.7 | 750 |
Ablation with Qwen3-4B-Instruct demonstrates absolute accuracy gains of +6–7% on math (AIME24/25), +3% on GPQA, and +5% on LiveCodeBench versus the baseline. Notably, CLEANER with SAAR achieves or exceeds SOTA with only one-third the RL training steps. Figure 1 in the original text displays suppressed error rates and accelerated accuracy improvements over training.
7. Impact and Prospective Significance
The retrospective, self-purified trajectory construction enabled by SAAR ensures policies internalize correct reasoning patterns while minimizing error-recovery loops. The main effects are: (i) elimination of noisy and misleading credit assignments, (ii) reduction of policy gradient variance, and (iii) marked improvements in both accuracy and sample efficiency, particularly for compute-constrained LLMs. These outcomes have been corroborated on competitive mathematical, scientific, and code-based benchmarks (Xu et al., 21 Jan 2026).
A plausible implication is that trajectory purification, distinct from external filtering or reward shaping, constitutes a scalable paradigm for high-accuracy, efficient agentic RL with realistic tool-augmented LLMs.