
Beyond Scaling Law: A Data-Efficient Distillation Framework for Reasoning (2508.09883v1)

Published 13 Aug 2025 in cs.LG and cs.AI

Abstract: LLMs demonstrate remarkable reasoning capabilities in tasks such as algorithmic coding and mathematical problem-solving. Recent methods have improved reasoning through expanded corpora and multistage training that combines reinforcement learning and supervised fine-tuning. Although some methods suggest that a small but targeted dataset can incentivize reasoning via distillation alone, reasoning scaling laws are still taking shape, driving up computational costs. To address this, we propose a data-efficient distillation framework (DED) that optimizes the Pareto frontier of reasoning distillation. Inspired by the on-policy learning and diverse roll-out strategies of reinforcement learning, the key idea of our approach is threefold: (1) We identify that benchmark scores alone do not determine an effective teacher model. Through comprehensive comparisons of leading reasoning LLMs, we develop a method to select an optimal teacher model. (2) While scaling distillation can enhance reasoning, it often degrades out-of-domain performance. A carefully curated, smaller corpus achieves a balanced trade-off between in-domain and out-of-domain capabilities. (3) Diverse reasoning trajectories encourage the student model to develop robust reasoning skills. We validate our method through evaluations on mathematical reasoning (AIME 2024/2025, MATH-500) and code generation (LiveCodeBench), achieving state-of-the-art results with only 0.8k carefully curated examples, bypassing the need for extensive scaling. Our systematic analysis demonstrates that DED outperforms existing methods by considering factors beyond superficial hardness, token length, or teacher model capability. This work offers a practical and efficient pathway to advanced reasoning while preserving general capabilities.

Summary

  • The paper presents a novel distillation framework that bypasses scaling laws to achieve state-of-the-art performance on mathematical and code reasoning tasks.
  • It employs targeted teacher selection, rigorous corpus filtering, and diversity enhancement to reduce data needs while maintaining high accuracy.
  • Empirical results demonstrate significant gains on benchmarks like AIME, MATH, and LiveCodeBench, highlighting robust cross-domain generalization.

Data-Efficient Distillation for Reasoning: Challenging the Scaling Law Paradigm

Introduction

The paper "Beyond Scaling Law: A Data-Efficient Distillation Framework for Reasoning" (DED) introduces a principled approach to reasoning model distillation that departs from the prevailing scaling law paradigm. Instead of relying on ever-larger corpora and computational budgets, the DED framework leverages targeted teacher selection, corpus curation, and diversity enhancement to achieve state-of-the-art (SOTA) performance in mathematical and code reasoning tasks with minimal data. This essay provides a technical analysis of the framework, its empirical results, and its implications for future research in efficient LLM post-training.

Reasoning Scaling Laws and Their Limitations

Recent advances in LLM reasoning have been driven by two main strategies: reinforcement learning with verifiable rewards (RLVR) and supervised fine-tuning (SFT) on distilled chain-of-thought (CoT) trajectories. The scaling law hypothesis posits that reasoning performance increases monotonically with corpus size and model scale, but at the cost of diminishing returns and substantial resource requirements. Figure 1 illustrates this trend, showing that models fine-tuned from DeepSeek-R1-Distill-Qwen-32B adhere to a scaling curve, while the DED-trained NTele-R1-32B model breaks out of this regime and advances the Pareto frontier.

Figure 1: The performance on AIME 2024/2025 varies with the scale of the training corpus. Models fine-tuned from DeepSeek-R1-Distill-Qwen-32B exhibit a potential reasoning scaling law. Our model, NTele-R1-32B, breaks out of this trend and advances the Pareto frontier.

The DED Framework: Methodology

The DED framework is structured around three sequential stages, each designed to maximize reasoning gains under data constraints:

  1. Teacher Selection: Rather than defaulting to the highest-performing LLM on benchmarks, DED empirically evaluates candidate teacher models via smoke tests, distilling small corpora and measuring student performance. This process reveals that teaching ability is not strictly correlated with raw benchmark scores.
  2. Corpus Filtering and Compression: The framework applies rigorous quality checks (length, format, correctness) and compresses the question set by filtering out easy samples (high student pass rate). This ensures that only challenging, high-value examples are retained.
  3. Diversity Enhancement: Inspired by RL roll-out strategies, DED augments the corpus by selecting diverse reasoning trajectories for each question, measured via Levenshtein distance, to encourage robust student reasoning (a minimal selection sketch follows Figure 2).

    Figure 2: Overview of our data-efficient distillation framework.
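
The diversity stage is concrete enough to sketch. Below is a minimal, illustrative implementation that assumes the selection rule is a greedy max-min criterion over pairwise Levenshtein distance between candidate trajectories; the function names and the greedy criterion are assumptions, since the exact selection procedure is not specified here.

```python
# Hypothetical sketch of the diversity-enhancement stage: for each question,
# greedily keep the teacher trajectories that are most dissimilar under
# Levenshtein distance. The max-min greedy rule is an assumption.
from typing import List


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, O(len(a) * len(b))."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def select_diverse(trajectories: List[str], k: int) -> List[str]:
    """Greedily pick up to k trajectories, each maximizing its minimum
    edit distance to everything already selected."""
    pool = list(dict.fromkeys(trajectories))  # drop exact duplicates
    if not pool:
        return []
    selected = [pool[0]]
    while len(selected) < min(k, len(pool)):
        best = max(
            (t for t in pool if t not in selected),
            key=lambda t: min(levenshtein(t, s) for s in selected),
        )
        selected.append(best)
    return selected
```

On full CoT traces running to thousands of tokens, exact edit distance is expensive; in practice one would likely truncate or embed trajectories before comparing, but the max-min selection structure carries over.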

Empirical Results and Analysis

Mathematical Reasoning

DED achieves SOTA results on AIME 2024/2025 and MATH-500 benchmarks using only 0.8k curated examples, outperforming models trained on much larger corpora. Notably, NTele-R1-32B attains 81.87% and 77.29% accuracy on AIME 2024 and 2025, respectively, surpassing both its teacher models and other distillation baselines.

Teacher Model Specialization

Experiments demonstrate that QwQ-32B serves as a more effective teacher than DeepSeek-R1, Qwen3-32B, and Qwen3-235B-A22B, despite not being the top performer on math benchmarks. This contradicts the assumption that the strongest large reasoning model (LRM) is always the optimal teacher and highlights the importance of corpus affinity and token entropy.
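
A hedged sketch of the smoke-test loop makes the selection criterion explicit: the same student recipe is trained once per candidate teacher on a small probe corpus, and the teacher whose student scores highest wins. The distill, finetune, and evaluate callables below are illustrative placeholders, not the paper's API.

```python
from typing import Callable, Dict, List


def select_teacher(
    candidates: List[str],
    probe_questions: List[str],
    distill: Callable[[str, List[str]], List[dict]],  # teacher, questions -> CoT corpus
    finetune: Callable[[List[dict]], object],         # corpus -> fine-tuned student
    evaluate: Callable[[object], float],              # student -> benchmark accuracy
) -> str:
    """Rank teachers by the downstream student score on a small probe corpus,
    not by the teacher's own benchmark numbers."""
    scores: Dict[str, float] = {}
    for teacher in candidates:
        corpus = distill(teacher, probe_questions)  # cheap "smoke test" corpus
        student = finetune(corpus)                  # identical recipe for every teacher
        scores[teacher] = evaluate(student)
    return max(scores, key=scores.__getitem__)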

Corpus Compression and Diversity

Quality filtering and hard-example compression reduce the corpus size by 75% with only a modest performance drop. Diversity augmentation restores and even surpasses full-corpus performance, indicating that data quality and diversity are more critical than sheer quantity.
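
The hard-example compression step can be sketched as a pass-rate filter: sample the student several times per question and drop questions it already solves reliably. The sampling count, threshold, and callables (sample_answers, is_correct) below are illustrative assumptions.

```python
from typing import Callable, List


def compress_by_pass_rate(
    questions: List[str],
    sample_answers: Callable[[str, int], List[str]],  # question, n -> n sampled answers
    is_correct: Callable[[str, str], bool],           # question, answer -> verdict
    n: int = 8,
    max_pass_rate: float = 0.5,
) -> List[str]:
    """Retain only questions the current student passes at most
    max_pass_rate of the time; easy items carry little distillation signal."""
    kept = []
    for q in questions:
        answers = sample_answers(q, n)
        pass_rate = sum(is_correct(q, a) for a in answers) / n
        if pass_rate <= max_pass_rate:
            kept.append(q)
    return kept
```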

Code Generation

DED generalizes to code reasoning tasks, achieving SOTA on LiveCodeBench (LCB) with only 230 hard samples expanded to 925 via diversity augmentation. The largest gains appear on the medium and hard subsets, confirming the framework's data efficiency across domains.

Cross-Domain Generalization

Mixed training on math and code corpora yields improvements in both domains and enhances out-of-domain (OOD) generalization, as evidenced by doubled scores on the Aider benchmark and consistent gains across MMLU, CMMLU, C-EVAL, BBH, MBPP, GSM8K, and MATH.

Deep Analysis: Length, Entropy, and Latent Shifts

Token Length

Contrary to prior work, corpus and response length are not dominant factors in distillation performance. Models trained on shorter QwQ-32B responses outperform those trained on longer DeepSeek-R1 responses, and accuracy remains stable across length variations.

Token Entropy

Token entropy analysis reveals that QwQ-32B corpora exhibit lower entropy than DeepSeek-R1 corpora, resulting in more predictable, structured token distributions. This facilitates student convergence and enhances OOD robustness.

Figure 3: Comparison of token entropy distribution of teacher models.
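
One way to reproduce this kind of measurement, under stated assumptions: score each teacher's corpus with a single reference causal LM and average the per-token predictive entropy. The model name below is a stand-in, and using a shared scorer (rather than each teacher's own logits) is a simplification.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def mean_token_entropy(texts, model_name="Qwen/Qwen2.5-7B"):
    """Average per-token entropy of a reference LM's next-token
    distribution over a corpus; lower values mean more predictable text."""
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
    model.eval()
    per_text = []
    with torch.no_grad():
        for text in texts:
            ids = tok(text, return_tensors="pt", truncation=True, max_length=4096).input_ids
            logits = model(ids).logits.float()  # (1, seq_len, vocab)
            probs = torch.softmax(logits, dim=-1)
            ent = -(probs * probs.clamp_min(1e-12).log()).sum(-1)  # per-token entropy
            per_text.append(ent.mean().item())
    return sum(per_text) / len(per_text)
```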

PCA Shift

PCA offset analysis shows that models distilled from QwQ-32B undergo smaller latent representation shifts than those distilled from DeepSeek-R1, indicating greater stability and generalization. This supports the hypothesis that corpus affinity and representational consistency are key to efficient distillation.

Figure 4: PCA offset of DS-32B across various teacher models and tasks. dis denotes the Euclidean distance between the centroids of latent representations before and after training. Models trained on the QwQ-32B corpus exhibit smaller PCA offsets than those trained on the DeepSeek-R1 corpus across most tasks, indicating greater stability in their latent representations.
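
The offset statistic itself is straightforward to reproduce. Below is a sketch assuming hidden states are extracted from the same prompts before and after training and projected into a shared PCA basis; the layer choice and two-component projection are assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA


def pca_offset(hidden_before: np.ndarray, hidden_after: np.ndarray,
               n_components: int = 2) -> float:
    """Euclidean distance ("dis" in Figure 4) between centroids of
    pre- and post-training latent states in a shared PCA space.
    Inputs: (num_samples, hidden_dim) arrays from the same prompts."""
    pca = PCA(n_components=n_components)
    pca.fit(np.concatenate([hidden_before, hidden_after], axis=0))  # shared basis
    before = pca.transform(hidden_before)
    after = pca.transform(hidden_after)
    return float(np.linalg.norm(before.mean(axis=0) - after.mean(axis=0)))
```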

Implications and Future Directions

The DED framework demonstrates that reasoning scaling laws can be circumvented through principled teacher selection, corpus curation, and diversity enhancement. The findings challenge the reliance on superficial metrics such as teacher benchmark scores and token length, advocating for deeper analysis of token entropy and latent representation shifts. Practically, DED enables efficient reasoning model development in resource-constrained settings and provides a blueprint for cross-domain generalization.

Future research should explore the interpretability of DED-trained models, extend the framework to additional domains, and investigate the interplay between entropy, diversity, and latent stability in distillation. Theoretical work is needed to formalize the relationship between corpus affinity and generalization, and to develop automated methods for teacher and corpus selection.

Conclusion

The DED framework offers a data-efficient alternative to scaling-centric distillation, achieving SOTA reasoning performance with minimal data. By focusing on teacher specialization, corpus quality, and diversity, DED advances the Pareto frontier and provides robust generalization across domains. The results underscore the importance of token entropy and latent representation analysis in post-training, setting a new direction for efficient reasoning model development.
