Papers

Topics

Authors

Recent

View all

Detailed Answer

Quick Answer

Concise responses based on abstracts only

Detailed Answer

Well-researched responses based on abstracts and relevant paper content.

Custom Instructions Pro

Preferences or requirements that you'd like Emergent Mind to consider when generating responses

Gemini 2.5 Flash

Gemini 2.5 Flash 81 tok/s

Gemini 2.5 Pro 57 tok/s Pro

GPT-5 Medium 31 tok/s Pro

GPT-5 High 23 tok/s Pro

GPT-4o 104 tok/s Pro

GPT OSS 120B 460 tok/s Pro

Kimi K2 216 tok/s Pro

2000 character limit reached

Incentivizing Strong Reasoning from Weak Supervision (2505.20072v2)

Published 26 May 2025 in cs.CL and cs.AI

Abstract: LLMs have demonstrated impressive performance on reasoning-intensive tasks, but enhancing their reasoning abilities typically relies on either reinforcement learning (RL) with verifiable signals or supervised fine-tuning (SFT) with high-quality long chain-of-thought (CoT) demonstrations, both of which are expensive. In this paper, we study a novel problem of incentivizing the reasoning capacity of LLMs without expensive high-quality demonstrations and reinforcement learning. We investigate whether the reasoning capabilities of LLMs can be effectively incentivized via supervision from significantly weaker models. We further analyze when and why such weak supervision succeeds in eliciting reasoning abilities in stronger models. Our findings show that supervision from significantly weaker reasoners can substantially improve student reasoning performance, recovering close to 94% of the gains of expensive RL at a fraction of the cost. Experiments across diverse benchmarks and model architectures demonstrate that weak reasoners can effectively incentivize reasoning in stronger student models, consistently improving performance across a wide range of reasoning tasks. Our results suggest that this simple weak-to-strong paradigm is a promising and generalizable alternative to costly methods for incentivizing strong reasoning capabilities at inference-time in LLMs. The code is publicly available at https://github.com/yuanyige/w2sr.

Collections

Summary

The paper presents the W2SR paradigm, where strong student LLMs learn structured reasoning from weak teacher-generated chain-of-thought traces.
It shows that even imperfect reasoning traces can yield up to 94% of RL performance gains, emphasizing structured reasoning over teacher scale.
The approach offers a scalable and cost-effective alternative, significantly reducing computational costs compared to traditional reinforcement learning.

Incentivizing Strong Reasoning from Weak Supervision

Introduction

The research presented in "Incentivizing Strong Reasoning from Weak Supervision" addresses the challenge of effectively enhancing the reasoning capacities of LLMs without reliance on expensive high-quality demonstrations or reinforcement learning (RL). Traditional methods, such as RL with verifiable signals or supervised fine-tuning (SFT) with high-quality chain-of-thought (CoT) demonstrations, are resource-intensive, requiring significant computational power and data engineering, thereby limiting accessibility and scalability. This paper explores an alternative, cost-effective approach leveraging supervision from significantly weaker models to incentivize reasoning in stronger models.

Weak-to-Strong Reasoning Paradigm

The core innovation introduced is the Weak-to-Strong Reasoning (W2SR) paradigm, where a "strong" student model learns through supervision on reasoning traces generated by "weak" teacher models. In this paradigm, even considerably smaller and less accurate teachers provide useful and structured CoT reasoning traces. These traces, despite their potential inaccuracies, help in eliciting and amplifying the reasoning capabilities of stronger students.

The student model is fine-tuned via SFT using reasoning trajectories generated by the weaker teacher models. The principle is that these traces, though imperfect, are informative enough to unlock latent reasoning potential in students. The W2SR paradigm is tested against various baselines, demonstrating that it recovers a substantial portion of reasoning gains traditionally achieved through costly RL.

Experimental Setup

Datasets and Models

Experiments are conducted on diverse reasoning benchmarks including MATH, OlympiadBench, MinervaMath, AMC2023, and GPQA, ensuring a comprehensive evaluation across different reasoning tasks. The student models belong to the Qwen-2.5 series (ranging from 7B to 32B parameters), whereas the teachers are considerably weaker, ranging from 0.5B to 14B parameters.

Training and Evaluation

The training employs full-parameter fine-tuning with a consistent setup across various model scales. Evaluations measure performance through Pass@1 metrics and Reasoning Gap Recovered (RGR), quantifying the efficacy of weak-to-strong transfers compared to traditional RL.

Results and Analysis

Effectiveness of Weak Supervision

The W2SR paradigm shows that weak supervision can significantly enhance reasoning capabilities, achieving performance levels close to or even surpassing RL-trained models. For example, training a 7B student with a 1.5B reasoner recovers nearly 94% of the RL performance gain.

Figure 1: Benchmark performance of W2SR across student scales, demonstrating consistent strong reasoning with weak teachers across reasoning benchmarks.

Teacher Model Attributes

Experiments reveal that the reasoning capability—specifically the ability to generate structured CoT—is more critical than the scale or raw performance of the teacher model. Surprisingly, even reasoning traces that do not yield correct final answers can still provide significant learning value, indicating that the reasoning structure itself is a crucial component of effective learning.

Efficiency and Trade-Offs

W2SR presents a compelling efficiency advantage. Utilizing weaker teachers results in drastically reduced computational costs without a significant compromise in reasoning performance. This cost-effectiveness is evident when compared to the computational demands of RL approaches, making W2SR a practical alternative for incentivizing reasoning capabilities at scale.

Figure 2: Efficiency and performance comparison among GRPO, W2SR, and W2SR-P, highlighting substantial gains in training efficiency and performance using weaker teachers.

Conclusion

The "Incentivizing Strong Reasoning from Weak Supervision" paper illustrates a scalable, cost-effective alternative for enhancing LLM reasoning capabilities by utilizing weak-to-strong transfers. The results position W2SR as a promising approach for facilitating strong reasoning in LLMs outside the constraints of traditional, resource-heavy methods. Future research directions may include optimizing selection processes of weak supervision signals, extending the framework to multi-modal contexts, and refining theoretical foundations to further enhance weak-to-strong learning paradigms.

PDF Markdown

Paper Prompts

Explore 10 Community Prompts

Follow-up Questions

Authors (7)

YouTube

Show All Videos