
Retro-Search: Exploring Untaken Paths for Deeper and Efficient Reasoning (2504.04383v2)

Published 6 Apr 2025 in cs.AI, cs.CL, and cs.LG

Abstract: Large reasoning models exhibit remarkable reasoning capabilities via long, elaborate reasoning trajectories. Supervised fine-tuning on such reasoning traces, also known as distillation, can be a cost-effective way to boost reasoning capabilities of student models. However, empirical observations reveal that these reasoning trajectories are often suboptimal, switching excessively between different lines of thought, resulting in under-thinking, over-thinking, and even degenerate responses. We introduce Retro-Search, an MCTS-inspired search algorithm, for distilling higher quality reasoning paths from large reasoning models. Retro-Search retrospectively revises reasoning paths to discover better, yet shorter traces, which can then lead to student models with enhanced reasoning capabilities with shorter, thus faster inference. Our approach can enable two use cases: self-improvement, where models are fine-tuned on their own Retro-Search-ed thought traces, and weak-to-strong improvement, where a weaker model revises stronger model's thought traces via Retro-Search. For self-improving, R1-distill-7B, fine-tuned on its own Retro-Search-ed traces, reduces the average reasoning length by 31.2% while improving performance by 7.7% across seven math benchmarks. For weak-to-strong improvement, we retrospectively revise R1-671B's traces from the OpenThoughts dataset using R1-distill-32B as the Retro-Search-er, a model 20x smaller. Qwen2.5-32B, fine-tuned on this refined data, achieves performance comparable to R1-distill-32B, yielding an 11.3% reduction in reasoning length and a 2.4% performance improvement compared to fine-tuning on the original OpenThoughts data. Our work counters recently emergent viewpoints that question the relevance of search algorithms in the era of large reasoning models, by demonstrating that there are still opportunities for algorithmic advancements, even for frontier models.

Summary

  • The paper introduces Retro-Search, a novel algorithm that refines LLM reasoning by re-assessing and replacing suboptimal thought transitions.
  • It employs a retrospective analysis inspired by Monte-Carlo Tree Search to explore alternative reasoning paths and minimize redundant steps.
  • Evaluations show that fine-tuning models on Retro-Search-revised data yields up to 31% shorter reasoning trajectories with improved accuracy across benchmarks.

This paper introduces Retro-Search, a search algorithm designed to refine and improve reasoning trajectories generated by LLMs. The core problem addressed is that while LLMs, especially those fine-tuned on long reasoning traces (distillation), show improved reasoning, these traces are often suboptimal. They can exhibit "under-thinking" (prematurely abandoning promising lines of thought) or "over-thinking" (engaging in redundant steps after an answer is found), both leading to inefficient and sometimes incorrect reasoning.

Retro-Search aims to produce higher-quality reasoning paths that are both more accurate and shorter. These refined paths can then be used to fine-tune student models, enhancing their reasoning capabilities and inference speed.

How Retro-Search Works

The algorithm is inspired by Monte-Carlo Tree Search (MCTS) and retrospectively revises a given reasoning path.

  1. Trajectory Decomposition: A reasoning trajectory $T$ is viewed as a sequence of "thoughts" $(t_1, t_2, \ldots, t_\tau)$, identified by transition keywords (e.g., "alternatively," "wait," "however"). Each thought $t_\tau$ is further composed of intermediate steps $t_\tau^k$ (e.g., sub-conclusions, calculations), typically delimited by double newlines. The complete trajectory is represented as $T = \big\{\{t_1^1, t_1^2, \ldots, t_1^{k_1}\}, \{t_2^1, t_2^2, \ldots, t_2^{k_2}\}, \ldots, a\big\}$, where $a$ is the final answer (a minimal decomposition sketch follows this list).
  2. Identifying Suboptimal Switches: The algorithm iterates through the thoughts in a given trajectory. At each point where a thought switch occurs (e.g., from thought $t_\tau$ to $t_{\tau+1}$), Retro-Search explores an alternative.
  3. Collecting Alternative Rollouts: Instead of allowing the model to switch to a new thought $t_{\tau+1}$ after completing step $t_\tau^k$, Retro-Search prompts the revision model $\widehat{\mathcal{M}}$ to continue the current thought $t_\tau$. This is achieved by generating the next step $t_\tau^{k+1}$ while prohibiting thought-transition keywords during decoding for this immediate next step. Subsequent steps in the alternative rollout are generated without this constraint, allowing free exploration. The generation is $\{t_\tau^{k+1}, \ldots, a'\} \sim \widehat{\mathcal{M}}\big(t_1, \ldots, \{t_\tau^1, \ldots, t_\tau^k\}\big)$.
  4. Evaluating Alternative Rollouts: To compare the original continuation $\{t_{\tau+1}^1, \ldots, a\}$ with the new alternative rollout $\{t_\tau^{k+1}, \ldots, a'\}$, a value function $V(s_i, a^\star)$ is used. It scores a path starting from step $s_i$ by whether it reaches the correct ground-truth answer $a^\star$ and how many steps it takes (a sketch of this computation also follows the list):

    $V(s_i, a^\star) := \gamma^{N-i}\, R(a(s_i), a^\star)$

    where:

    • $N$ is the total number of steps in the trajectory $\{s_1, \ldots, a\}$.
    • $a(s_i)$ is the answer generated from the continuation starting at $s_i$.
    • $R(a, a^\star)$ is a binary reward ($1$ if $a = a^\star$, $0$ otherwise).
    • $\gamma$ is a discount factor (e.g., 0.9) penalizing longer paths.
  5. Updating the Trajectory: If the value of the alternative rollout's starting step, $V(t_\tau^{k+1}, a^\star)$, is greater than the value of the original next thought's starting step, $V(t_{\tau+1}^1, a^\star)$, the alternative path is better (correct and shorter). The original path from $t_{\tau+1}^1$ onwards is then replaced with the new rollout. This process helps mitigate under-thinking (by exploring promising paths deeper) and over-thinking (by finding shorter paths to the solution when earlier steps suffice). In practice, multiple alternative rollouts (e.g., two) are sampled, and the best one is chosen.
  6. Iterative Refinement: The algorithm then proceeds to the next thought-switch point in the (potentially updated) trajectory.
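
The decomposition in step 1 can be made concrete with a short sketch. This is a minimal illustration, not the authors' code: it assumes thoughts are split wherever a paragraph begins with one of the transition keywords listed under Implementation Details (an illustrative subset is used here), and steps are split on double newlines.

```python
import re

# Illustrative subset of the paper's transition keywords.
TRANSITION_KEYWORDS = ["But", "Wait", "Alternatively", "However", "Hmm", "Another"]
# Split at a blank line only when the next paragraph opens with a transition keyword.
SPLIT_PATTERN = re.compile(r"\n\n(?=(?:" + "|".join(TRANSITION_KEYWORDS) + r")\b)")

def decompose(trace: str) -> list[list[str]]:
    """Split a raw reasoning trace into thoughts (keyword-delimited),
    each a list of steps (double-newline-delimited)."""
    thoughts = SPLIT_PATTERN.split(trace)
    return [[step for step in thought.split("\n\n") if step.strip()]
            for thought in thoughts]

trace = ("First, compute 2 + 3 = 5.\n\nSo the sum is 5.\n\n"
         "Wait, the question asks for the product.\n\n2 * 3 = 6.")
print(decompose(trace))
# [['First, compute 2 + 3 = 5.', 'So the sum is 5.'],
#  ['Wait, the question asks for the product.', '2 * 3 = 6.']]
```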
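
The value comparison in steps 4 and 5 reduces to a one-line computation. Below is a minimal sketch of $V(s_i, a^\star) = \gamma^{N-i} R(a(s_i), a^\star)$; the exact-match answer check is a stand-in for whatever reward function the pipeline actually uses.

```python
def rollout_value(remaining_steps: int, predicted: str, gold: str,
                  gamma: float = 0.9) -> float:
    """V(s_i, a*) = gamma^(N - i) * R(a(s_i), a*): reward 1 for a correct final
    answer, discounted by the number of steps remaining after s_i."""
    reward = 1.0 if predicted.strip() == gold.strip() else 0.0  # binary reward R
    return (gamma ** remaining_steps) * reward

# A correct continuation that needs 4 more steps beats a correct one needing 9,
# and any correct continuation beats an incorrect one (value 0).
assert rollout_value(4, "42", "42") > rollout_value(9, "42", "42") > rollout_value(2, "41", "42")
```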

Use Cases and Key Findings

Retro-Search was evaluated in two main scenarios:

  1. Self-Improvement (Self-Retro-Search): A model revises its own generated reasoning traces.
    • Example: R1-distill-7B fine-tuned on its own Retro-Search-ed traces.
    • Result: Reduced average reasoning length by 31.2% and improved accuracy by 7.7% on seven math benchmarks (greedy decoding). This demonstrates that models can be improved without relying on stronger teacher models.
  2. Weak-to-Strong Revision (W2S-Retro-Search): A weaker (smaller, more efficient) model revises traces generated by a stronger, more expensive model.
    • Example: Traces from DeepSeek-R1 671B (from OpenThoughts dataset) revised by R1-distill-32B (a 20x smaller model).
    • Result: Qwen2.5-32B fine-tuned on this refined data achieved performance comparable to R1-distill-32B, with an 11.3% reduction in reasoning length and a 2.4% performance improvement compared to fine-tuning on the original OpenThoughts data (temperature sampling).
    • Significance: R1-distill-7B and R1-distill-32B fine-tuned on this W2S-revised data achieved new state-of-the-art reasoning performance at their respective scales with high inference efficiency.

Implementation Details and Considerations

  • Data Generation:
    • Used 40K math questions from NuminaMath.
    • For Self-Retro-R1-7B: R1-distilled Qwen2.5-7B generated initial responses, which were then revised by the same model.
    • For W2S-Retro-R1-32B: DeepSeek-R1 671B responses from OpenThoughts were revised by R1-distilled Qwen2.5-32B.
    • Transition Keywords: 'But', 'Wait', 'Alternatively', 'However', 'Hmm', 'Hmmm', 'Not sure', 'Going back', 'Backtrack', 'Trace back', 'Another'.
    • Decoding for Rollouts: Top-p sampling ($p = 0.98$, $T = 1.0$). Higher temperature was found to produce more diverse and beneficial training data (see the configuration sketch after this list).
    • Max generation length: 16384 tokens.
  • Model Training:
    • Models: Qwen2.5-7B-Instruct, R1-distilled Qwen2.5-7B, Qwen2.5-32B-Instruct, R1-distilled Qwen2.5-32B.
    • Supervised fine-tuning for 5 epochs, LR 1e-5, sequence length 16K.
    • Used HuggingFace TRL. Batch size 128, cosine LR scheduler, Adam optimizer.
  • Partial Revisions: For computationally expensive revision models (like R1-32B), a variant of Retro-Search was used where the revision process starts at a randomly sampled position in the trajectory instead of always from the beginning. This makes the process more efficient.
  • Evaluation:
    • Benchmarks: AIME25, AIME24, AMC23, GaoKao23English, OlympiadBench, GSM8K, MATH500.
    • Metrics: Accuracy (exact match) and average response length (number of output tokens).
    • Decoding for evaluation: Greedy (T=0) and temperature sampling (T=0.6, top-p=0.95, 5 seeds).
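
For concreteness, the rollout-decoding and fine-tuning settings above can be written as a configuration sketch. This is an assumed mapping of the reported hyperparameters onto HuggingFace GenerationConfig and TRL SFTConfig objects, not the authors' training scripts; the per-device batch size and gradient accumulation used to reach a global batch of 128 are guesses.

```python
from transformers import GenerationConfig
from trl import SFTConfig

# Rollout decoding for Retro-Search (top-p sampling, as reported above).
rollout_generation = GenerationConfig(
    do_sample=True,
    top_p=0.98,
    temperature=1.0,
    max_new_tokens=16384,
)

# Supervised fine-tuning on the revised traces (reported hyperparameters).
sft_config = SFTConfig(
    output_dir="retro-search-sft",        # hypothetical output path
    num_train_epochs=5,
    learning_rate=1e-5,
    lr_scheduler_type="cosine",
    per_device_train_batch_size=16,       # assumption: 16 x 8 = 128 global batch
    gradient_accumulation_steps=8,        # assumption
    max_seq_length=16384,                 # 16K context; arg name varies by TRL version
)
```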

Analysis of Retro-Search Effects

Analysis of the generated training data and the outputs of models fine-tuned on it showed:

  • Fewer Transition Keywords: Retro-Search data had significantly fewer thought-switches.
  • Deeper Thoughts: Consequently, thoughts in Retro-Search data had more steps on average.
  • Later Solution Appearance: The final solution tended to appear relatively later in the trajectory, suggesting a reduction in redundant thoughts after the answer was derived.

These characteristics were also observed in the student models trained on Retro-Search data, indicating that the more efficient reasoning patterns transfer to students (a minimal measurement sketch follows).
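
These trace-level characteristics are straightforward to measure. The sketch below is illustrative, not the paper's analysis code: it takes a trace already decomposed into thoughts (as in the earlier decomposition sketch) and reports the three statistics; using "\boxed" to locate the final solution is an assumption.

```python
def trace_statistics(trace: str, thoughts: list[list[str]],
                     solution_marker: str = "\\boxed") -> dict:
    """Statistics used to characterize Retro-Search data: number of thought
    switches, average steps per thought, and the relative position in the
    trace where the final solution first appears."""
    num_switches = len(thoughts) - 1
    avg_steps = sum(len(t) for t in thoughts) / len(thoughts)
    idx = trace.find(solution_marker)
    solution_position = idx / len(trace) if idx >= 0 else 1.0  # later = fewer redundant thoughts
    return {"thought_switches": num_switches,
            "avg_steps_per_thought": avg_steps,
            "solution_relative_position": solution_position}
```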

Pseudocode Overview

The following Python-style sketch summarizes the procedure; generate_fn (the revision model call, with transition keywords banned for the first generated step) and is_correct (the answer checker) are placeholders.

def retro_search(question, thoughts, answer, gold_answer,
                 generate_fn, is_correct, gamma=0.9, num_rollouts=2):
    # thoughts: list of thoughts, each a list of step strings; answer: the
    # trajectory's final answer. generate_fn(question, prefix_steps) returns
    # (rollout_thoughts, rollout_answer), where the first rollout thought
    # continues the current thought (its first step is decoded with
    # thought-transition keywords prohibited).

    def value(continuation_steps, final_answer):
        # V(s_i, a*) = gamma^(N - i) * R(a, a*); N - i = steps remaining
        # after the continuation's first step.
        reward = 1.0 if is_correct(final_answer, gold_answer) else 0.0
        return (gamma ** (continuation_steps - 1)) * reward

    tau = 0
    while tau < len(thoughts) - 1:          # the last thought has no switch after it
        prefix_steps = [s for t in thoughts[:tau + 1] for s in t]

        # Value of keeping the original switch to thought tau+1.
        original_steps = sum(len(t) for t in thoughts[tau + 1:])
        best_value = value(original_steps, answer)
        best_rollout = None

        # Sample alternative rollouts that keep extending thought tau instead.
        for _ in range(num_rollouts):
            rollout_thoughts, rollout_answer = generate_fn(question, prefix_steps)
            rollout_steps = sum(len(t) for t in rollout_thoughts)
            v_alt = value(rollout_steps, rollout_answer)
            if v_alt > best_value:
                best_value = v_alt
                best_rollout = (rollout_thoughts, rollout_answer)

        if best_rollout is not None:
            # Replace the original continuation: merge the rollout's first thought
            # into thought tau, then append any new thoughts it introduces.
            rollout_thoughts, answer = best_rollout
            thoughts = (thoughts[:tau]
                        + [thoughts[tau] + rollout_thoughts[0]]
                        + rollout_thoughts[1:])

        tau += 1   # next thought-switch point in the (possibly updated) trajectory

    return thoughts, answer

Conclusion

Retro-Search provides a practical method for improving the quality of reasoning data distilled from LLMs. By systematically exploring "untaken paths" and pruning inefficiencies, it generates shorter yet more effective reasoning traces. Fine-tuning models on this refined data leads to student models that are not only more accurate but also faster at inference due to generating shorter responses. The work effectively demonstrates that algorithmic advancements like search can still play a crucial role in enhancing the capabilities of even frontier LLMs, countering the notion that longer reasoning is always better or that search is becoming irrelevant.