
Small Models Struggle to Learn from Strong Reasoners (2502.12143v2)

Published 17 Feb 2025 in cs.AI

Abstract: LLMs excel in complex reasoning tasks, and distilling their reasoning capabilities into smaller models has shown promise. However, we uncover an interesting phenomenon, which we term the Small Model Learnability Gap: small models ($\leq$3B parameters) do not consistently benefit from long chain-of-thought (CoT) reasoning or distillation from larger models. Instead, they perform better when fine-tuned on shorter, simpler reasoning chains that better align with their intrinsic learning capacity. To address this, we propose Mix Distillation, a simple yet effective strategy that balances reasoning complexity by combining long and short CoT examples or reasoning from both larger and smaller models. Our experiments demonstrate that Mix Distillation significantly improves small model reasoning performance compared to training on either data alone. These findings highlight the limitations of direct strong model distillation and underscore the importance of adapting reasoning complexity for effective reasoning capability transfer.

The paper "Small Models Struggle to Learn from Strong Reasoners" introduces the "Small Model Learnability Gap", observing that smaller LLMs (\leq3B parameters) do not consistently benefit from long chain-of-thought (CoT) reasoning or distillation from larger models. Instead, these models perform better when fine-tuned on shorter, simpler reasoning chains that align with their intrinsic learning capacity. To mitigate this gap, the authors propose "Mix Distillation", which balances reasoning complexity by combining long and short CoT examples or reasoning from both larger and smaller models.

The paper begins by highlighting the advancements in complex reasoning tasks achieved by LLMs through chain-of-thought prompting and the need to distill these capabilities into smaller, resource-efficient models. The central argument revolves around the discovery that small models often fail to improve—and may even degrade—in performance when trained on the complex reasoning sequences generated by larger teacher models. This is attributed to the limited capacity of small models to internalize multi-step logic and generalize effectively from the distribution shifts introduced by stronger teachers.

To empirically demonstrate this phenomenon, the authors conduct a series of experiments using models from the Qwen and Llama families. They define the "Long CoT Gap", $\Delta_{\text{Long}} = P_{\text{Long}} - P_{\text{Short}}$, which measures the performance difference between student models fine-tuned on long CoT data ($P_{\text{Long}}$) and short CoT data ($P_{\text{Short}}$). Similarly, they define the "Large Model CoT Gap", $\Delta_{\text{Large}} = P_{\text{Large}} - P_{\text{Small}}$, which quantifies the performance difference between models fine-tuned on CoT data from a larger teacher ($P_{\text{Large}}$) and a smaller teacher ($P_{\text{Small}}$).
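Both gaps are plain differences in benchmark accuracy between two fine-tuning conditions. A minimal sketch of the computation, using hypothetical placeholder scores rather than figures from the paper:

```python
def learnability_gap(p_treatment: float, p_baseline: float) -> float:
    """Performance gap between two fine-tuning conditions, in accuracy points."""
    return p_treatment - p_baseline

# Hypothetical benchmark accuracies for a small student model (not from the paper).
p_long, p_short = 42.0, 55.0    # fine-tuned on long vs. short CoT data
p_large, p_small = 40.0, 52.0   # distilled from a larger vs. smaller teacher

delta_long = learnability_gap(p_long, p_short)     # Long CoT Gap
delta_large = learnability_gap(p_large, p_small)   # Large Model CoT Gap

# Negative gaps mean the student learns better from the simpler source --
# the signature of the Small Model Learnability Gap.
print(f"Delta_Long = {delta_long:+.1f}, Delta_Large = {delta_large:+.1f}")
```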

The experimental results indicate that small student models benefit more from short CoT examples and smaller teacher models, while larger student models benefit more from long CoT examples and larger teacher models. For example, the Qwen2.5-1.5B-Instruct model scores over 10 points lower on the MATH and AMC benchmarks when fine-tuned on long CoT data rather than short CoT data, whereas the Qwen2.5-32B-Instruct model improves significantly with long CoT. Similarly, the Qwen2.5-0.5B-Instruct model's performance on the AMC benchmark degrades when it is trained on a larger teacher model's CoT.

The authors propose Mix Distillation to bridge the Small Model Learnability Gap. Mix Distillation combines long and short CoT examples (Mix-Long) or mixes responses from larger and smaller models (Mix-Large). In Mix-Long, a small model is exposed to both detailed and concise reasoning steps, while in Mix-Large, the small model learns from reasoning chains that are better suited to its capacity.
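At the data level, Mix Distillation can be viewed as sampling fine-tuning examples from two response sources under a mixing weight. A minimal sketch, assuming a single ratio hyperparameter (the weight value, names, and sampling scheme below are illustrative, not the paper's exact configuration):

```python
import random

def mix_distillation_data(source_a, source_b, weight_a=0.2, n_samples=10000, seed=0):
    """Build a fine-tuning set by mixing two CoT response sources.

    For Mix-Long, source_a / source_b are lists of (prompt, response) pairs
    with long-CoT and short-CoT answers; for Mix-Large, they are responses
    from a larger and a smaller teacher. weight_a is the probability of
    drawing from source_a -- an illustrative hyperparameter, not the
    paper's reported value.
    """
    rng = random.Random(seed)
    mixed = []
    for _ in range(n_samples):
        source = source_a if rng.random() < weight_a else source_b
        mixed.append(rng.choice(source))
    return mixed
```

The same routine covers both variants; only the two response sources change between Mix-Long and Mix-Large.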

The experimental results demonstrate that Mix Distillation consistently improves the reasoning performance of small models. For example, the Qwen2.5-3B-Instruct model improves by more than 8 points on MATH and AMC using Mix-Long and more than 7 points on MATH, AIME, and AMC using Mix-Large.

Further analysis reveals that math expert models exhibit a smaller learnability gap than general models, suggesting that limited domain knowledge contributes to the learnability challenges faced by small models. Additionally, base models show a more pronounced learnability gap than instruction-tuned models. An analysis of speaking styles indicates that long CoT and large-teacher CoT primarily shift the student model's distribution over tokens associated with speaking styles.
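One simple way to probe such a stylistic shift is to compare the relative frequency of style-marker tokens in model generations before and after fine-tuning. A rough sketch, assuming a hand-picked marker list (the markers and function below are illustrative and not taken from the paper):

```python
from collections import Counter

STYLE_MARKERS = {"wait", "hmm", "alternatively", "check"}  # illustrative markers

def style_marker_rate(generations: list[str]) -> float:
    """Fraction of tokens across model outputs that are style-marker words."""
    tokens = [tok.lower().strip(".,!?") for text in generations for tok in text.split()]
    counts = Counter(tokens)
    marker_hits = sum(counts[m] for m in STYLE_MARKERS)
    return marker_hits / max(len(tokens), 1)

# Comparing this rate for generations before vs. after long-CoT fine-tuning
# would surface a shift toward verbose, self-reflective phrasing.
```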

In summary, this paper identifies the Small Model Learnability Gap, provides empirical evidence of its existence, and proposes Mix Distillation as a method to improve the reasoning capabilities of small LLMs by carefully balancing reasoning complexity in the training data.

Authors (8)
  1. Yuetai Li
  2. Xiang Yue
  3. Zhangchen Xu
  4. Fengqing Jiang
  5. Luyao Niu
  6. Bill Yuchen Lin
  7. Bhaskar Ramasubramanian
  8. Radha Poovendran