SAAR: Similarity-Aware Adaptive Rollback
- SAAR is a trajectory purification mechanism that addresses credit assignment in agentic RL by rewriting faulty rollouts using semantic code similarity.
- It leverages Python’s difflib.SequenceMatcher to distinguish between shallow and deep repairs, reducing gradient variance and improving policy optimization.
- Integration within the CLEANER framework yields measurable gains, including up to a 7.7% Pass@1 improvement across benchmarks with fewer training steps.
Similarity-Aware Adaptive Rollback (SAAR) is a trajectory purification mechanism introduced to address critical credit assignment issues in agentic reinforcement learning (RL) environments, particularly those involving LLMs augmented with tool use, such as Python interpreters. SAAR operates by leveraging the agent’s own self-correction behaviors to retrospectively rewrite failure-containing rollouts into counterfactual, error-free trajectories—directly mitigating the adverse effects of tool-induced execution failures on policy optimization. Embedded within the CLEANER framework, SAAR adaptively controls the granularity of rollout repair based on semantic code similarity, eliminating spurious credit assignment, reducing gradient variance, and achieving accelerated, stable convergence in RL training for parameter-constrained models (Xu et al., 21 Jan 2026).
1. Motivation and Challenges in Agentic RL with Tool-Augmented LLMs
Early agentic RL systems with tool-using LLMs—models typically in the 4B–7B parameter range—are characterized by frequent tool execution errors, especially during exploration. Under standard sparse outcome-based reward settings, a trajectory that ultimately achieves a positive result—despite intermediate erroneous steps—will reward all preceding actions, indiscriminately reinforcing correct and incorrect behaviors. This “trajectory noise” manifests as suboptimal credit assignment, corrupting policy learning.
Prior mitigation strategies have exhibited critical limitations:
- Dense reward shaping: Introducing intermediate rewards often induces reward hacking, as agents learn to maximize the synthetic signals rather than genuine task outcomes.
- Supersampling and post-hoc filtering: Methods requiring multiple rollouts per query (e.g., rStar2-Agent) at least double the computational cost, rendering them prohibitive for large-scale training.
SAAR, as realized in the CLEANER framework, directly confronts these issues by enabling in-situ rollout purification through model-driven self-correction and judicious replacement of failure segments, obviating the need for extra rollouts or dense handcrafted rewards (Xu et al., 21 Jan 2026).
2. Semantic Similarity Metric for Adaptive Replacement
The distinctive feature of SAAR lies in its use of a precision-oriented code-level similarity metric to govern the granularity of rollback during trajectory repair. The semantic similarity between a failed code segment $c$ and its self-corrected replacement $c'$ is quantified as

$$s(c, c') = \mathrm{SequenceMatcher}(c, c').\mathrm{ratio}() = \frac{2M}{|c| + |c'|},$$

where SequenceMatcher refers to Python's difflib.SequenceMatcher and $M$ is the total number of characters contained in the matching blocks. The similarity score represents the fraction of matching subsequences, capturing the semantic proximity of original and corrected code within the language's operational context.
This metric serves as the principal criterion for distinguishing shallow (implementation-level) versus deep (reasoning-level) repairs, facilitating fine-grained, adaptive trajectory purification.
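As a concrete check of the metric's behavior, the snippet below (with illustrative inputs, not drawn from the paper) compares a failed code segment with a near-identical correction:

```python
import difflib

# Illustrative example: the failed snippet and its correction differ only by
# a one-character variable typo (an implementation-level slip).
failed = "total = sum(vals) / len(val)"
fixed = "total = sum(vals) / len(vals)"

# ratio() = 2*M / (len(a) + len(b)), where M is the total number of
# characters in the matching blocks.
sim = difflib.SequenceMatcher(None, failed, fixed).ratio()
print(f"similarity = {sim:.3f}")
```

A score this close to 1.0 would classify the repair as shallow, whereas a correction that restructures the logic drives the ratio down and triggers a deep repair.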
3. Rollback Granularity Criteria and Procedure
At each step $t$, given a model-generated code snippet $c_t$ that has failed and a successfully generated correction $c_t'$, SAAR evaluates their similarity $s = s(c_t, c_t')$ against a fixed threshold $\delta$:
- Shallow Repair ($s \ge \delta$): If the similarity meets or exceeds the threshold, only the failed action $c_t$ and its associated output $o_t$ are replaced by the correction $c_t'$ and its successful output $o_t'$. The original reasoning is retained.
- Deep Repair ($s < \delta$): If the similarity falls below the threshold, the entire reasoning-code-output tuple is replaced. The agent's auxiliary correction thought and the successful code-output pair $(c_t', o_t')$ are spliced in.
The number of self-correction attempts at any failure point is empirically capped at a small constant $K$. These replacement operations are applied on-the-fly within the same rollout, minimizing disturbance to the valid context and yielding counterfactual paths that exhibit strictly correct behavior.
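The shallow/deep decision can be sketched as a single in-place repair step. The function below is a hypothetical illustration (the trajectory layout, names, and the threshold value are assumptions, not the paper's implementation):

```python
import difflib

TAU = 0.7  # illustrative threshold; the paper's exact value is not reproduced here

def apply_rollback(trajectory, t, thought_fix, code_fix, output_fix):
    """Repair step t of a trajectory in place, choosing rollback depth by similarity.

    trajectory: list of (thought, code, output) tuples; step t holds a failed
    code/output pair. thought_fix/code_fix/output_fix come from a successful
    self-correction attempt. (Hypothetical data layout for illustration.)
    """
    thought, code, _ = trajectory[t]
    sim = difflib.SequenceMatcher(None, code, code_fix).ratio()
    if sim >= TAU:
        # Shallow repair: keep the original reasoning, swap only code + output.
        trajectory[t] = (thought, code_fix, output_fix)
    else:
        # Deep repair: replace the whole tuple, splicing in the correction thought.
        trajectory[t] = (thought_fix, code_fix, output_fix)
    return sim
```

A near-identical fix keeps the original thought; a structurally different rewrite replaces it.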
4. Algorithmic Workflow and Pseudocode
The operational details of SAAR are as follows: For each query, the agent iteratively samples reasoning-code pairs, executes the code, and appends successful steps to the purified trajectory. Upon encountering a failed execution, the agent enters a correction phase—making up to $K$ additional attempts—until a successful correction is obtained or the rollout is aborted. The replacement of failed segments is then adaptively determined by the semantic similarity $s$.
A “curriculum mix” is implemented: SAAR is applied to 70% of rollouts, while 30% are left raw to preserve debugging capability.
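The workflow above, including the correction budget and the curriculum mix, can be sketched as a single rollout loop. This is a hedged illustration: the interfaces `sample_step` and `execute`, the constants, and the abort semantics are assumptions for exposition, not the paper's code:

```python
import difflib
import random

TAU = 0.7          # similarity threshold (illustrative value)
K_MAX = 3          # cap K on self-correction attempts per failure (illustrative)
PURIFY_FRAC = 0.7  # curriculum mix: 70% of rollouts purified, 30% left raw

def saar_rollout(sample_step, execute, max_steps=8):
    """Hypothetical sketch of one SAAR-purified rollout.

    sample_step(context) -> (thought, code): samples the next reasoning-code pair.
    execute(code) -> (ok, output): runs the code in the tool environment.
    Returns (trajectory, completed).
    """
    purify = random.random() < PURIFY_FRAC
    trajectory = []
    for _ in range(max_steps):
        thought, code = sample_step(trajectory)
        ok, output = execute(code)
        if ok or not purify:
            trajectory.append((thought, code, output))  # keep the raw step
            continue
        # Correction phase: up to K_MAX extra attempts at this failure point.
        for _ in range(K_MAX):
            thought_fix, code_fix = sample_step(trajectory + [(thought, code, output)])
            ok, output_fix = execute(code_fix)
            if ok:
                sim = difflib.SequenceMatcher(None, code, code_fix).ratio()
                if sim >= TAU:   # shallow repair: original reasoning retained
                    trajectory.append((thought, code_fix, output_fix))
                else:            # deep repair: splice in the correction thought
                    trajectory.append((thought_fix, code_fix, output_fix))
                break
        else:
            return trajectory, False  # abort: correction budget exhausted
    return trajectory, True
```

Only successful (original or repaired) steps ever enter the purified trajectory, so the RL update later sees no failed executions.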
The purified trajectory is then used in the RL update. Log-probabilities for rollout segments modified by SAAR are recomputed under the updated context using RadixAttention (SGLang) for efficient reuse of KV caches. This maintains compatibility with standard RL policy optimization pipelines (Xu et al., 21 Jan 2026).
5. Integration with RL Objectives: Group Relative Policy Optimization
SAAR is employed within the Group Relative Policy Optimization (GRPO) framework. A purified trajectory $\tau_i$ belongs to a group of $G$ candidate solutions for a query. Each is assigned a binary, outcome-only reward $r_i \in \{0, 1\}$. The advantage for the $i$-th purified rollout is computed by

$$\hat{A}_i = \frac{r_i - \mathrm{mean}(\{r_1, \ldots, r_G\})}{\mathrm{std}(\{r_1, \ldots, r_G\}) + \varepsilon},$$

where $\mathrm{mean}(\cdot)$ and $\mathrm{std}(\cdot)$ are the group mean and standard deviation, and $\varepsilon$ is a small positive constant. The PPO-style clipped surrogate objective remains unchanged except that it operates solely on clean, purified trajectories:

$$\mathcal{J}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|\tau_i|}\sum_{t=1}^{|\tau_i|}\min\Big(\rho_{i,t}\hat{A}_i,\ \mathrm{clip}\big(\rho_{i,t},\,1-\epsilon,\,1+\epsilon\big)\hat{A}_i\Big)\right],\quad \text{with } \rho_{i,t} = \frac{\pi_\theta(a_{i,t}\mid s_{i,t})}{\pi_{\theta_{\mathrm{old}}}(a_{i,t}\mid s_{i,t})}.$$
The primary difference is that the updates are driven by error-free, consistently annotated learning signals.
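The group-relative advantage is simple to compute directly. The sketch below assumes population standard deviation over the group (the paper may use the sample variant):

```python
import statistics

EPS = 1e-6  # small positive constant in the denominator

def grpo_advantages(rewards):
    """Group-relative advantages for one query's group of rollouts.

    rewards: binary outcome rewards r_i in {0, 1}, one per candidate solution.
    Returns A_i = (r_i - mean) / (std + EPS) for each rollout.
    """
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + EPS) for r in rewards]
```

For example, a group of four rollouts where exactly one succeeds, `grpo_advantages([1, 0, 0, 0])`, yields one positive advantage and three equal negative ones, so the gradient pushes probability toward the successful purified trajectory.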
6. Empirical Evaluation and Quantitative Impact
Empirical studies of SAAR within CLEANER demonstrate its scalability and effectiveness across diverse benchmarks:
- AIME24/AIME25 (Math Olympiad): Pass@1 improved from 66.7% to 72.7% (+6.0%) and 59.4% to 67.1% (+7.7%), respectively.
- GPQA (Graduate-level QA): Pass@1 increased from 56.9% to 60.2% (+3.3%).
- LiveCodeBench-v6 (Holistic Code Tasks): Pass@1 saw gains from 26.6% to 26.8% (+0.2%).
CLEANER, powered by SAAR, matches or surpasses state-of-the-art results—such as those achieved by rStar2-Agent-14B—using only one-third the number of RL training steps (250 vs. 750+). The average accuracy gains attained are approximately 6% on AIME, 3% on GPQA, and 5% on LiveCodeBench (Xu et al., 21 Jan 2026).
| Dataset | DAPO Baseline Pass@1 | CLEANER (SAAR) Pass@1 | Absolute Gain |
|---|---|---|---|
| AIME24 | 66.7% | 72.7% | +6.0% |
| AIME25 | 59.4% | 67.1% | +7.7% |
| GPQA | 56.9% | 60.2% | +3.3% |
| LiveCodeBench-v6 | 26.6% | 26.8% | +0.2% |
This suggests that trajectory purification via SAAR provides robust improvements to RL training efficiency and final accuracy without computationally expensive sampling strategies.
7. Theoretical and Practical Significance
By eliminating error-contaminated trajectory segments before objective optimization, SAAR achieves strict alignment between reward signals and correct agent behaviors. This directly resolves the “noisy success” credit assignment dilemma, reduces policy gradient variance, and prevents inadvertent reinforcement of suboptimal, trial-and-error solution paths.
Adaptive rollback ensures perturbation minimization: shallow, implementation-only repairs are preferred where possible, with deep, reasoning-level interventions reserved for substantial failures. This approach preserves valid reasoning structures while rectifying erroneous segments. Empirical training curves (see Figure 1 in (Xu et al., 21 Jan 2026)) corroborate the suppression of execution errors and the acceleration of accuracy gains.
A plausible implication is that SAAR’s lightweight, data-centric intervention constitutes a scalable and computationally economical alternative to supersampling and dense reward engineering, especially for parameter-constrained agentic RL systems.
In summary, Similarity-Aware Adaptive Rollback systematizes the self-purification of agentic RL trajectories by integrating semantic similarity-driven repair actions within the learning loop, yielding cleaner policy signals and demonstrably improved learning outcomes without additional rollout or reward engineering overhead (Xu et al., 21 Jan 2026).