
Distilling the Implicit Multi-Branch Structure in LLMs' Reasoning via Reinforcement Learning

Published 22 May 2025 in cs.CL (arXiv:2505.16142v3)

Abstract: Distilling reasoning paths from teacher to student models via supervised fine-tuning (SFT) provides a shortcut for improving the reasoning ability of smaller LLMs. However, the reasoning paths generated by teacher models often reflect only surface-level traces of their underlying authentic reasoning. Insights from cognitive neuroscience suggest that authentic reasoning involves a complex interweaving between meta-reasoning (which selects appropriate sub-problems from multiple candidates) and solving (which addresses the sub-problem). This implies authentic reasoning has an implicit multi-branch structure. Supervised fine-tuning collapses this rich structure into a flat sequence of token prediction in the teacher's reasoning path, preventing effective distillation of this structure to students. To address this limitation, we propose RLKD, a reinforcement learning (RL)-based distillation framework guided by a novel Generative Structure Reward Model (GSRM). Our GSRM converts reasoning paths into multiple meta-reasoning-solving steps and computes rewards to measure structural alignment between student and teacher reasoning. RLKD combines this reward with RL, enabling student LLMs to internalize the teacher's implicit multi-branch reasoning structure rather than merely mimicking fixed output paths. Experiments show RLKD surpasses standard SFT-RL pipelines even when trained on 0.1% of data under an RL-only regime, unlocking greater student reasoning potential than SFT-based distillation.

Summary

  • The paper demonstrates that RLKD, via a Generative Structure Reward Model, effectively transfers implicit multi-branch reasoning from teacher to student LLMs.
  • It employs a two-stage reward system to distinctly optimize meta-reasoning and solving phases, overcoming the limitations of traditional supervised fine-tuning.
  • Experiments on AIME and GPQA tasks show that RLKD significantly enhances reasoning performance using only 0.1% of the training data compared to conventional methods.

Distilling the Implicit Multi-Branch Structure in LLMs' Reasoning via Reinforcement Learning

The paper "Distilling the Implicit Multi-Branch Structure in LLMs' Reasoning via Reinforcement Learning" presents a novel method to enhance the reasoning capabilities of student LLMs by transferring the implicit multi-branch structure of reasoning from teacher LLMs through a reinforcement learning-based knowledge distillation framework named RLKD.

Introduction

LLMs have made significant strides in mastering complex reasoning tasks by generating extended reasoning paths. Supervised fine-tuning (SFT) is a popular approach to distill reasoning capabilities from larger, powerful models to smaller ones. However, research has shown that SFT-based distillation often leads to rote imitation rather than genuine comprehension. This paper revisits the cognitive neuroscience foundations of reasoning, identifying that authentic reasoning involves an implicit multi-branch structure characterized by meta-reasoning and solving phases. To address the limitations of token-level SFT, the paper introduces RLKD, a reinforcement learning-based knowledge distillation framework coupled with a Generative Structure Reward Model (GSRM). The GSRM evaluates the alignment between student and teacher models' reasoning paths in terms of their inherent structure.

Methodology

The core of the RLKD framework is the GSRM, which scores the alignment between teacher and student LLM reasoning paths in terms of multi-branch structures. Authentic reasoning is decomposed into sequences of meta-reasoning and solving steps. The GSRM uses these decompositions to compute a structured reward reflecting the quality of alignment between reasoning paths of student and teacher LLMs.

Generative Structure Reward Model

The paper introduces a two-stage GSRM that rewards the student LLM according to how closely its reasoning structure aligns with the teacher's. The GSRM maps generated reasoning paths into sequences of meta-reasoning and solving pairs, which are scored for alignment through the structured reward mechanism, as shown in Figure 1.

Figure 1: (a) The generated reasoning path has implicit multi-branch structure. (b) Distillation only based on SFT collapses the rich structure into a flat sequence of token prediction to memorize only the teacher's generated path. (c) Our proposed RL-based distillation can teach the student LLM to learn this structure by using a Generative Structure Reward Model to measure the alignment between the reasoning structure of the student and teacher, serving as the reward in RL.
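As a rough illustration of this alignment scoring, consider the following sketch. The function name, the step representation, and the exact scoring rule are assumptions for exposition, not the paper's implementation; in the paper, the semantic matching role is played by the generative reward model.

```python
# Hypothetical sketch: compare student and teacher reasoning structures.
# Each reasoning path is first decomposed into an ordered list of
# (meta_reasoning, solving) step pairs.

def structural_reward(student_steps, teacher_steps, match):
    """Score step-by-step alignment between two decomposed paths.

    `match(a, b)` is an assumed semantic-similarity predicate.
    Comparison stops at the first misaligned meta-reasoning step,
    so earlier structural agreement is rewarded.
    """
    aligned = 0
    for s, t in zip(student_steps, teacher_steps):
        if not match(s[0], t[0]):   # compare meta-reasoning choices
            break
        aligned += 1
    return aligned / max(len(teacher_steps), 1)

student = [("try modular arithmetic", "x = 2 (mod 5)"),
           ("check small cases", "n = 1, 2, 3 fail")]
teacher = [("try modular arithmetic", "x = 2 (mod 5)"),
           ("bound the solution", "n < 10")]

exact = lambda a, b: a == b
print(structural_reward(student, teacher, exact))  # 0.5: first step aligns
```

The prefix-matching choice here reflects the intuition that a wrong meta-reasoning branch early on invalidates everything after it; other aggregation rules are equally plausible.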

The GSRM is trained with a structured fine-grained training method that optimizes the meta-reasoning and solving tasks separately, dynamically adjusting their optimization weights according to task difficulty. It comprises two main parts: (i) a generative reward model for semantic understanding, and (ii) a rule-based reward model for interpretability and control.

Figure 2: An example of sequence generation for a math task within the in-context learning prompts for GPT-4o.
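The two reward components could be blended as follows. This is a minimal sketch under assumptions: the linear weighting scheme and the difficulty-based schedule are illustrative, not the paper's stated formula.

```python
# Hypothetical sketch of combining the two GSRM components: a generative
# reward for semantic understanding and a rule-based reward for control.
# The difficulty-dependent weighting below is an assumption.

def combined_reward(generative_score: float,
                    rule_based_score: float,
                    difficulty: float) -> float:
    """Blend semantic (generative) and rule-based reward signals.

    `difficulty` in [0, 1]: harder tasks lean more on the generative
    model's semantic judgment; easier tasks lean on cheap rule checks.
    """
    w_gen = 0.5 + 0.5 * difficulty   # assumed dynamic weight
    w_rule = 1.0 - w_gen
    return w_gen * generative_score + w_rule * rule_based_score

# Easy task: the rule-based signal carries half the weight.
print(combined_reward(0.8, 1.0, difficulty=0.0))  # 0.9
# Hard task: the generative signal dominates entirely.
print(combined_reward(0.8, 1.0, difficulty=1.0))  # 0.8
```

The design intuition is that rule-based checks are reliable on easy, well-structured problems, while harder problems need the generative model's semantic judgment.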

RL-based Knowledge Distillation

The authors propose RLKD, a novel reinforcement learning-based framework for distilling knowledge from teacher to student LLMs. Central to this approach is the Generative Structure Reward Model (GSRM), designed to provide structured rewards, thereby allowing the student model to capture the implicit multi-branch structure inherent in effective reasoning.

Generative Structure Reward Model (GSRM)

The GSRM distinguishes between meta-reasoning and solving components at each reasoning step by generating a structured sequence of meta-reasoning and solving pairs. The generations from the teacher and student models are then scored for alignment, ensuring the student mimics the teacher's implicit multi-branch reasoning structure rather than superficially reproducing output paths.

Experimental Results

Experiments were conducted using math tasks from datasets such as AIME and MATH-500, alongside graduate-level GPQA tasks, to evaluate the effectiveness of RLKD against existing baselines.

The results on AIME24 and GPQA-Diamond demonstrate that RLKD outperforms standard SFT-RL pipelines and existing RL baselines (Table 1). Notably, RLKD achieves superior performance using just 0.1% of the training data needed for SFT-RL, underlining its efficiency in exploiting limited data while delivering significant improvements in LLM reasoning capabilities.

Figure 3: The variations of metrics during the RL training process.
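During RL training, the structural reward presumably enters the objective alongside a standard answer-accuracy reward. The additive form and default weights below are assumptions for illustration; the paper's ablations study the ratio between the GSRM reward and the accuracy reward.

```python
# Hypothetical sketch: combining the outcome (accuracy) reward with the
# GSRM structure reward into a single scalar for the RL update.
# Additive form and weights are assumed, not taken from the paper.

def total_reward(answer_correct: bool,
                 gsrm_alignment: float,
                 r_acc: float = 1.0,
                 r_gsrm: float = 0.1) -> float:
    """Combine the outcome reward with the GSRM structure reward."""
    accuracy_term = r_acc if answer_correct else 0.0
    structure_term = r_gsrm * gsrm_alignment   # gsrm_alignment in [0, 1]
    return accuracy_term + structure_term

# A wrong answer can still earn partial credit for teacher-like structure,
# giving the policy a denser learning signal than accuracy alone.
print(total_reward(False, 0.5))  # 0.05
print(total_reward(True, 0.5))   # 1.05
```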

The ablation studies in Figure 4 and Figure 5 underscore the efficacy of Structured Fine-grained Training and the impact of reward weights on the RLKD mechanism.

Figure 4: Ablation study of our Structured Fine-grained Training method.

Figure 5: Ablation study of reward weights. Performance of pass@1 (n=16) and pass@16 (n=16) on AIME24 varying with $\frac{R_{gsrm}}{R_{acc}}$.
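The pass@1 and pass@16 metrics in these ablations are conventionally computed with the standard unbiased estimator over n sampled completions (Chen et al., 2021); the sketch below assumes that convention.

```python
from math import comb

# Standard unbiased pass@k estimator over n sampled completions of which
# c are correct; commonly used for metrics like the pass@1 and pass@16
# (n = 16) reported in the ablations.

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k draws (without replacement)
    from the n samples hits one of the c correct completions."""
    if n - c < k:          # every size-k subset contains a correct sample
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=16, c=4, k=1))   # 0.25
print(pass_at_k(n=16, c=4, k=16))  # 1.0
```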

The diversity analysis presented in Figure 6 highlights that RLKD aligns the student model's reasoning paths closely with those of the teacher model over multiple training steps, showcasing improvement in the student's ability to explore implicit problem-solving avenues.

Figure 6: Diversity among different reasoning paths at each meta-reasoning step as training progresses.

Comparative Analysis and Implications

Experimental findings reveal that RLKD surpasses supervised fine-tuning (SFT) in transferring the reasoning capabilities of LLMs in domain-shift and out-of-domain settings, as illustrated through detailed comparisons in Figure 7. Notably, RLKD also outperforms existing reinforcement learning strategies such as PPO and GRPO.

Figure 7: Comparison between SFT-based distillation and our RL-based distillation (RLKD) in domain-shift and out-of-domain settings. SFT and RLKD see the same data (32 samples) at each step.
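GRPO, one of the RL baselines compared against, estimates advantages by normalizing rewards within a group of responses sampled for the same prompt, so no learned value function is needed. A minimal sketch of that advantage computation (simplified: the KL penalty and PPO-style clipping of the full algorithm are omitted):

```python
from statistics import mean, pstdev

# Simplified sketch of GRPO-style advantage estimation: rewards for a
# group of responses to one prompt are z-normalized within the group.
# This is a generic illustration of the baseline, not RLKD's exact setup.

def group_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    mu = mean(rewards)
    sigma = pstdev(rewards)          # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled responses to one prompt: responses whose (structured)
# reward exceeds the group mean receive positive advantages.
advs = group_advantages([1.0, 0.0, 0.5, 0.5])
print([round(a, 2) for a in advs])  # [1.41, -1.41, 0.0, 0.0]
```

Under such a scheme, adding the GSRM reward differentiates responses that would all receive the same accuracy reward, which is one way a structure reward can densify the learning signal.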

By leveraging GSRM to provide structured rewards, RLKD encourages LLMs to explore and learn the authentic multi-branch structure in reasoning, thereby enhancing their potential to solve complex tasks across varied domains and datasets.

Conclusion

This research introduces RLKD, the first RL-based framework for distilling LLM reasoning paths, addressing SFT's failure to teach students the intricate multi-branch structure of reasoning. The GSRM supplies structure-aware learning signals that ground this distillation. Experimental results corroborate that RLKD enhances LLMs' reasoning capacity beyond classical distillation methods. While RLKD presents promising advances, further exploration is needed to confirm its scalability to larger LLMs. Future research may refine structural reward strategies and incorporate human cognitive insights to build more robust and efficient LLMs.
