- The paper introduces Reinforcement-Learned Teachers (RLTs) that generate detailed explanations bridging questions to solutions for effective student model training.
- The approach uses a dense reward function that combines the student's solution likelihood with an interpretability term, tackling the sparse-reward and teacher-student mismatch issues of standard RL.
- Results demonstrate improved student performance on reasoning benchmarks and successful zero-shot transfer to out-of-distribution tasks with lower computational cost.
The paper "Reinforcement Learning Teachers of Test Time Scaling" (2506.08388) introduces a novel framework for training LMs specifically designed to be effective teachers for downstream distillation. The core motivation stems from two key challenges in applying Reinforcement Learning (RL) to train reasoning LMs:
- Exploration Challenge: Traditional RL for reasoning relies on sparse, one-hot correctness rewards. This makes learning difficult unless the model already has a high chance of solving the task at initialization, limiting RL's ability to teach truly new skills or significantly improve smaller models.
- Teacher-Student Mismatch: Often, LMs trained with RL are not deployed directly but serve as teachers to generate reasoning traces for distilling smaller student models or cold-starting future RL iterations. The traditional objective of solving problems from scratch is not perfectly aligned with the objective of producing instructive explanations for learning.
The paper proposes Reinforcement-Learned Teachers (RLTs) to address these issues. Instead of asking the LM to solve a problem from scratch, RLTs are given both the question and the ground-truth solution and are tasked with generating a detailed, step-by-step explanation that connects the question to the solution. This simpler task formulation fundamentally avoids the sparse reward exploration problem inherent in correctness-based RL.
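For illustration, a minimal sketch of the two task formulations as prompt templates; the exact instruction wording below is an assumption, not the paper's verbatim prompts:

```python
# Hypothetical prompt templates; wording is illustrative only.

TRADITIONAL_RL_PROMPT = (
    "Question: {question}\n"
    "Think step by step, then give the final answer."
)  # reward: 1 if the generated solution matches the ground truth, else 0 (sparse)

RLT_PROMPT = (
    "Question: {question}\n"
    "Ground-truth solution: {solution}\n"
    "Write a detailed step-by-step explanation that connects the question to this solution."
)  # reward: dense score of how instructive the explanation is for a student model
```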
RLT Framework and Implementation:
The RLT framework involves training an LM with RL, but with a new task and reward function.
- Task Formulation: RLTs are prompted with the question (q_i) and the corresponding ground-truth solution (s_i) and are instructed to generate an explanation (the think tokens t_i^o) that bridges the gap between them. The output format is designed to be directly usable in student distillation datasets, structuring the response with tags for the explanation and solution sections. This departs from traditional RL, where the model receives only the question and must generate both the thinking process and the solution.
- Dense Reward Function: The quality of an RLT's explanation is evaluated using a dense reward signal derived from a separate student model. The reward function has two main components:
- r_SS: Measures how well the student model understands the ground-truth solution (s_i) given the question (q_i) and the teacher's explanation (t_i^o). It is based on the student's log-probabilities of the solution tokens, using both the average and the minimum log-probability so that every part of the solution is supported by the explanation.
- r_KL: Measures the interpretability of the explanation itself from the student's perspective. It uses the KL divergence between the teacher's distribution over the explanation tokens (conditioned on q_i and s_i) and the student's distribution (conditioned only on q_i and the preceding explanation tokens). This term penalizes explanations that rely on "logical leaps", i.e., tokens that are unlikely for the student given only the question context. It uses both the average and the maximum KL divergence.
The final RLT reward combines these terms as r_RLT = r_SS − λ · r_KL, where λ is a weighting coefficient. Using both average and min/max reductions prevents the reward from being dominated by a few tokens or by the length of the explanation.
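A minimal sketch of this reward in PyTorch, assuming the average and min/max reductions are simply summed and that the per-token KL term is estimated from the log-probabilities of the sampled explanation tokens (both details are assumptions; the paper's exact formulation may differ):

```python
import torch

def rlt_reward(
    student_solution_logprobs: torch.Tensor,  # log p_student(s_t | q, t^o, s_<t), shape [num_solution_tokens]
    teacher_think_logprobs: torch.Tensor,     # log p_teacher(t^o_t | q, s, t^o_<t), shape [num_think_tokens]
    student_think_logprobs: torch.Tensor,     # log p_student(t^o_t | q, t^o_<t),    shape [num_think_tokens]
    lam: float = 1.0,                         # weighting coefficient lambda (value is an assumption)
) -> torch.Tensor:
    # r_SS: how likely is the ground-truth solution for the student, given the explanation?
    # Combining the mean with the minimum keeps every solution token supported.
    r_ss = student_solution_logprobs.mean() + student_solution_logprobs.min()

    # r_KL: penalize "logical leaps" -- explanation tokens that the teacher (which sees
    # the solution) finds likely but the student (which sees only the question) does not.
    per_token_gap = teacher_think_logprobs - student_think_logprobs
    r_kl = per_token_gap.mean() + per_token_gap.max()

    return r_ss - lam * r_kl
```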
- Training: The RLT is trained with an online RL algorithm, GRPO (Shao et al., 2024), optimizing the objective J_RLT(θ), which incorporates the dense reward r_RLT. The paper describes a short supervised fine-tuning phase to familiarize the base LM with the RLT format before the RL phase. A separate student model (e.g., another Qwen-7B) is used only to compute the reward signal during RLT training; it is not itself updated in this phase.
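Continuing the sketch above, a schematic GRPO-style update for the teacher; the helpers `sample_explanations`, `score_with_student`, and `sequence_logprob` are hypothetical placeholders, and clipping and reference-model KL terms are omitted:

```python
def rlt_training_step(teacher, student, optimizer, question, solution,
                      group_size=64, lam=1.0):
    # 1. Sample a group of candidate explanations for the same (question, solution) pair.
    explanations = sample_explanations(teacher, question, solution, n=group_size)

    # 2. Score each explanation with the dense RLT reward; the student is frozen and
    #    only supplies log-probabilities for the reward.
    rewards = torch.stack([
        rlt_reward(*score_with_student(student, teacher, question, solution, e), lam=lam)
        for e in explanations
    ])

    # 3. Group-normalized advantages: how much better is each explanation than its peers?
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-6)

    # 4. Simple policy-gradient step on the teacher.
    logprobs = torch.stack([
        sequence_logprob(teacher, question, solution, e) for e in explanations
    ])
    loss = -(advantages.detach() * logprobs).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```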
Practical Implementation Details:
- Base Model: The experiments use Qwen2.5-7B-Instruct as the base model for the RLTs and students.
- Training Data: RLTs are trained on a dataset of question-solution pairs (fewer than 17K math and coding problems).
- RL Setup: Thanks to the dense reward, a relatively short RL phase (250 steps, less than one epoch) with a small batch size (256) and a constant learning rate of 1×10^-6 is sufficient. GRPO is used with a group size of 64, and reference-model synchronization helps stability.
- Reward Computation: The student model used for reward computation is initialized from a checkpoint already familiar with the student format. Offloading its parameters, or using a fast distributed inference engine such as vLLM (Kwon et al., 2023) for generation during online RL, mitigates the computational cost of querying the student model.
- Distillation: Student distillation datasets are created directly from the raw outputs of the trained RLTs for each question-solution pair; no additional postprocessing (such as heuristic filtering or refinement by other LMs) is needed, simplifying the pipeline compared to prior work (see the sketch after this list). Student models (7B and 32B) are then fine-tuned on the collected data using standard supervised learning recipes chosen according to data size.
- Evaluation: Students are evaluated on challenging reasoning benchmarks, including AIME, MATH, GPQA, LiveCodeBench (Jain et al., 2024), and OlympiadBench (He et al., 2024).
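As a rough illustration of the distillation step referenced above, the sketch below turns raw RLT outputs into supervised fine-tuning examples; the tag names, field names, and `rlt.generate` interface are assumptions, not the paper's actual format:

```python
def build_distillation_dataset(rlt, question_solution_pairs):
    examples = []
    for question, solution in question_solution_pairs:
        # Raw teacher output is used as-is: no filtering or refinement by other LMs.
        explanation = rlt.generate(question=question, solution=solution)
        completion = f"<think>\n{explanation}\n</think>\n{solution}"
        examples.append({"prompt": question, "completion": completion})
    return examples  # fed directly into a standard supervised fine-tuning recipe
```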
Key Results and Practical Applications:
The paper demonstrates significant improvements with RLTs:
- Superior Distillation: Distilling 7B and 32B students with the raw outputs of a 7B RLT achieves higher performance on AIME, MATH, and GPQA than pipelines that use reasoning traces from LMs orders of magnitude larger (such as DeepSeek-R1 (DeepSeek-AI, 2025)), even after those traces undergo extensive postprocessing. This shows that optimizing specifically for teaching is more effective than simply using a more powerful solver.
- Better RL Cold-starts: Data generated by RLTs provides a stronger cold-start initialization for traditional RL fine-tuning on the student format, leading to higher final performance compared to cold-starting from data generated by traditional RL-trained 7B teachers (even postprocessed with GPT-4) or Bespoke R1 traces. This indicates RLT explanations are more conducive to learning problem-solving skills via RL.
- Zero-shot Transfer: RLTs trained on math/coding problems can generate effective distillation data for entirely out-of-distribution tasks such as Countdown without any further training on that task. Students distilled from these zero-shot RLT traces outperform direct RL fine-tuning on the Countdown task itself, suggesting that the "teaching" skill learned by RLTs transfers better than the "solving" skill learned by traditional RL.
- Reward Efficacy: The RLT reward shows a strong correlation with downstream student performance, validating its design. Ablation studies highlight the importance of both the solution likelihood term (rSS) and the interpretability term (rKL), with the latter being crucial for preventing the teacher from simply repeating the solution.
Practical Implications:
The RLT framework offers practical benefits for developing reasoning LMs:
- Reduced Computational Cost: Training a smaller, specialized teacher (like the 7B RLT) is significantly cheaper than training or querying large, general-purpose reasoning LMs required by existing distillation pipelines.
- Simplified Pipeline: By directly using raw RLT outputs, the need for expensive and heuristic-driven postprocessing steps is eliminated.
- Democratization of RL Reasoning: The framework makes it more feasible to apply RL techniques to smaller models and leverage their capabilities for generating high-quality training data.
- Enhanced Reusability: RLTs demonstrate zero-shot transferability, allowing them to be used as teachers for diverse tasks beyond their training distribution, reducing the need for task-specific teacher training.
In essence, RLTs redefine the role of RL in reasoning LMs, shifting the focus from solving to teaching, and demonstrate that a smaller model optimized for explanation generation can be a more effective resource for training capable student models than larger models optimized purely for problem correctness.