Trajectory Purification in Reinforcement Learning
- Trajectory purification is a technique that systematically eliminates or repairs failure segments in RL trajectories, ensuring only successful steps contribute to learning.
- The SAAR mechanism uses adaptive lookahead correction and semantic similarity measures to decide between shallow and deep repairs for effective error handling.
- This method significantly improves sample efficiency and credit assignment, yielding notable accuracy gains on benchmarks like AIME24/25 and GPQA.
Trajectory purification refers to methods for systematically eliminating or repairing failure segments and associated errors in agentic reinforcement learning (RL) trajectories, producing “purified” paths that yield more informative gradients and more efficient policy optimization. In the context of LLMs augmented with tool use—such as Python code execution—trajectory purification is fundamental for mitigating spurious credit assignment and the variance induced by noisy feedback. The Similarity-Aware Adaptive Rollback (SAAR) mechanism, introduced in the CLEANER framework, offers an influential and rigorously evaluated approach to this problem, combining self-correction and semantic similarity heuristics to generate training data composed solely of successful, self-consistent steps (Xu et al., 21 Jan 2026).
1. Formalization and Notation
Trajectory purification operates on RL rollouts (trajectories) comprising alternating sequences of reasoning, code, and observations. Let $q$ be the user query, $\mathcal{E}$ the code execution environment, and $\pi_\theta$ the agent’s policy. Each trajectory is a sequence $\tau = \big((t_1, c_1, o_1), \dots, (t_T, c_T, o_T)\big)$, where $t_i$ is a text-based chain-of-thought, $c_i$ the generated code, and $o_i \in \{0, 1\}$ the outcome, with $o_i = 1$ for success and $o_i = 0$ for failure.
A failure segment at step $i$ is given by $(t_i, c_i, o_i)$ with $o_i = 0$. The model may attempt self-correction, producing $(t_i', c_i', o_i')$ with $o_i' = 1$. The result of trajectory purification via SAAR is a new sequence $\tilde{\tau}$ in which each failed step that is successfully corrected is replaced by its repaired counterpart $(\tilde{t}_i, \tilde{c}_i, o_i')$.
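The notation above maps onto a minimal data structure. The following sketch is illustrative; the field names are chosen here for readability and are not from the paper:

```python
# Minimal representation of a trajectory step (t_i, c_i, o_i).
from dataclasses import dataclass
from typing import List

@dataclass
class Step:
    thought: str   # t_i: text-based chain-of-thought
    code: str      # c_i: generated code segment
    outcome: int   # o_i: 1 for successful execution, 0 for failure

Trajectory = List[Step]

def is_purified(traj: Trajectory) -> bool:
    """A purified trajectory contains only successful steps (o_i = 1)."""
    return all(step.outcome == 1 for step in traj)
```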
2. Mechanism: Similarity-Aware Adaptive Rollback (SAAR)
The SAAR mechanism is applied during data collection whenever an execution failure ($o_i = 0$) occurs, in two phases:
- Phase I (Lookahead Correction): Augmenting the context with the error message $e_i$, the agent generates up to $K$ corrective attempts under policy $\pi_\theta$ until a successful execution ($o_i' = 1$) is produced.
- Phase II (Similarity-Aware Replacement): Compute the semantic similarity $s = \mathrm{sim}(c_i, c_i')$ between the failed and corrected code segments, using difflib.SequenceMatcher. With a threshold $\delta$ (fixed to a default in CLEANER), the “rollback rule” specifies the replacement:

$$
(\tilde{t}_i, \tilde{c}_i) =
\begin{cases}
(t_i,\, c_i') & \text{if } s \ge \delta \quad \text{(shallow repair)} \\
(t_i',\, c_i') & \text{if } s < \delta \quad \text{(deep repair)}
\end{cases}
$$

The tuple $(\tilde{t}_i, \tilde{c}_i, o_i')$ then enters the purified trajectory.
Additionally, after substitution, the trajectory log-probability $\log \pi_\theta(\tilde{\tau})$ is recomputed (with RadixAttention) to ensure correct importance weighting.
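Phase II can be sketched directly with `difflib.SequenceMatcher`, which the paper names for the similarity computation. The threshold value below is an assumption for illustration only; the paper’s default is not reproduced here:

```python
import difflib

DELTA = 0.6  # assumed threshold for illustration; CLEANER fixes its own default

def similarity(failed_code: str, corrected_code: str) -> float:
    """Semantic similarity s = sim(c_i, c_i') between two code segments."""
    return difflib.SequenceMatcher(None, failed_code, corrected_code).ratio()

def rollback(step, corrected_thought, corrected_code, delta=DELTA):
    """Rollback rule: keep the original reasoning for shallow repairs
    (s >= delta), replace it for deep repairs (s < delta).
    `step` is a (thought, code, outcome) tuple."""
    thought, code, _ = step
    s = similarity(code, corrected_code)
    if s >= delta:
        return (thought, corrected_code, 1)        # shallow repair
    return (corrected_thought, corrected_code, 1)  # deep repair
```

For example, a one-character syntactic fix such as `print(x` → `print(x)` scores a high ratio and keeps the original reasoning, while a rewrite to entirely different logic falls below the threshold and replaces it.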
3. Adaptive Granularity and Purification Strategies
SAAR adaptively selects replacement granularity according to semantic similarity:
- Shallow Repair ($s \ge \delta$): Minor corrections (e.g., syntactic changes), retaining the original reasoning $t_i$.
- Deep Repair ($s < \delta$): Substantive corrections (e.g., alternative algorithmic logic), replacing the reasoning with $t_i'$.
This granular replacement ensures that error-recovery patterns are excised and only valid, direct reasoning chains are encoded in purified trajectories.
4. Algorithmic Description
The SAAR algorithm can be summarized as follows:
- Detect a failure $o_i = 0$ at step $i$.
- For up to $K$ retries, generate corrections $c_i^{(k)}$ and evaluate their outcomes $o_i^{(k)}$.
- If $o_i^{(k)} = 1$, compute $s = \mathrm{sim}(c_i, c_i^{(k)})$.
- If $s \ge \delta$, construct the purified tuple with the original reasoning $t_i$ (shallow repair); else, substitute the corrected reasoning $t_i^{(k)}$ (deep repair).
- Substitute the repaired step into $\tau$ to obtain $\tilde{\tau}$.
- Update log-probabilities as needed.
This procedure results in trajectories for RL training that are free from superfluous error-correcting steps and systematically align credit assignment with actually successful reasoning and action choices.
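The steps above can be combined into a single purification pass. This is a hedged sketch, not the paper’s implementation: `policy.correct` and `env.run` are hypothetical interfaces standing in for the agent and the code executor, and the retry limit and threshold values are illustrative:

```python
import difflib

def saar_purify(trajectory, policy, env, max_retries=3, delta=0.6):
    """Return a purified copy of `trajectory`, a list of
    (thought, code, outcome) tuples. Parameter values are illustrative."""
    purified = []
    for thought, code, outcome in trajectory:
        if outcome == 1:                      # already successful: keep as-is
            purified.append((thought, code, 1))
            continue
        # Phase I: lookahead correction under the current policy.
        repaired = None
        for _ in range(max_retries):
            new_thought, new_code = policy.correct(thought, code)
            if env.run(new_code) == 1:
                repaired = (new_thought, new_code)
                break
        if repaired is None:                  # unrecoverable: leave untouched
            purified.append((thought, code, outcome))
            continue
        # Phase II: similarity-aware replacement (rollback rule).
        new_thought, new_code = repaired
        s = difflib.SequenceMatcher(None, code, new_code).ratio()
        if s >= delta:
            purified.append((thought, new_code, 1))      # shallow repair
        else:
            purified.append((new_thought, new_code, 1))  # deep repair
    return purified
```

Recomputing the trajectory log-probability after substitution is omitted here, as it depends on the serving stack.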
5. Theoretical Implications
Trajectory purification via SAAR addresses critical credit assignment and variance reduction problems in RL with outcome-only reward $R(\tau) \in \{0, 1\}$. Standard RL assigns positive reward to all steps in any successful trajectory $\tau$, including failed actions if subsequently self-corrected. Purification ensures that only directly successful actions are credited, yielding unbiased policy gradients. The formal claim is that every step of the purified trajectory is successful:

$$
\tilde{o}_i = 1 \quad \text{for all } i \in \{1, \dots, T\},
$$

so the outcome reward reinforces only steps that directly contributed to success. Purification thereby improves sample efficiency, especially in parameter-constrained model regimes.
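A toy numeric illustration of this claim, under the common simplification that an outcome-only reward is broadcast uniformly to every step of a trajectory:

```python
def step_credits(trajectory, reward):
    """Broadcast the trajectory-level reward R(tau) to every step, as an
    outcome-only objective effectively does; returns (outcome, credit) pairs."""
    return [(outcome, reward) for (_, _, outcome) in trajectory]

# A raw successful trajectory containing a failure that was later self-corrected:
raw = [("t1", "c1", 1), ("t2", "c2", 0), ("t2'", "c2'", 1)]
# Its purified counterpart, with the failed step replaced:
purified = [("t1", "c1", 1), ("t2'", "c2'", 1)]

# In the raw trajectory, the failed step (outcome 0) still receives credit 1;
# in the purified trajectory, every credited step is a success.
assert any(o == 0 and credit == 1 for o, credit in step_credits(raw, 1))
assert all(o == 1 for o, _ in step_credits(purified, 1))
```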
6. Empirical Results and Experimental Context
CLEANER, using SAAR, was evaluated on AIME24/25, GPQA, and LiveCodeBench with 4B- and 7B-parameter LLMs. Key experimental settings are:
- Retry limit $K$ for lookahead correction
- Similarity threshold $\delta$ for the rollback rule
- Curriculum mixing: SAAR applied to 70% of rollouts, 30% kept raw
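A minimal sketch of that mixing schedule; only the 70/30 split comes from the text, and the per-rollout routing below is an assumed implementation:

```python
import random

def mix_rollouts(rollouts, purify, saar_fraction=0.7, seed=0):
    """Route each rollout to SAAR purification with probability
    `saar_fraction`; keep the remainder raw."""
    rng = random.Random(seed)
    return [purify(t) if rng.random() < saar_fraction else t for t in rollouts]
```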
Comparison to DAPO and SOTA baselines is summarized:
| Method | AIME24 | AIME25 | GPQA | LiveCode | RL Steps |
|---|---|---|---|---|---|
| DAPO-baseline (4B) | 66.7 | 59.4 | 56.9 | 49.5 | 250 |
| CLEANER-4B (SAAR) | 72.7 | 67.1 | 60.2 | 54.9 | 250 |
| DemyAgent-4B (SOTA) | 72.6 | 70.0 | 58.5 | 51.7 | 750 |
Ablation with Qwen3-4B-Instruct demonstrates absolute accuracy gains of +6–7% on math (AIME24/25), +3% on GPQA, and +5% on LiveCodeBench versus the baseline. Notably, CLEANER with SAAR achieves or exceeds SOTA with only one-third the RL training steps. Figure 1 in the original text displays suppressed error rates and accelerated accuracy improvements over training.
7. Impact and Prospective Significance
The retrospective, self-purified trajectory construction enabled by SAAR ensures policies internalize correct reasoning patterns while minimizing error-recovery loops. The main effects are: (i) elimination of noisy and misleading credit assignments, (ii) reduction of policy gradient variance, and (iii) marked improvements in both accuracy and sample efficiency, particularly for compute-constrained LLMs. These outcomes have been corroborated on competitive mathematical, scientific, and code-based benchmarks (Xu et al., 21 Jan 2026).
A plausible implication is that trajectory purification, distinct from external filtering or reward shaping, constitutes a scalable paradigm for high-accuracy, efficient agentic RL with realistic tool-augmented LLMs.