The paper "Small Models Struggle to Learn from Strong Reasoners" introduces the "Small Model Learnability Gap", observing that smaller LLMs (3B parameters) do not consistently benefit from long chain-of-thought (CoT) reasoning or distillation from larger models. Instead, these models perform better when fine-tuned on shorter, simpler reasoning chains that align with their intrinsic learning capacity. To mitigate this gap, the authors propose "Mix Distillation", which balances reasoning complexity by combining long and short CoT examples or reasoning from both larger and smaller models.
The paper begins by highlighting the advancements in complex reasoning tasks achieved by LLMs through chain-of-thought prompting and the need to distill these capabilities into smaller, resource-efficient models. The central argument revolves around the discovery that small models often fail to improve—and may even degrade—in performance when trained on the complex reasoning sequences generated by larger teacher models. This is attributed to the limited capacity of small models to internalize multi-step logic and generalize effectively from the distribution shifts introduced by stronger teachers.
To empirically demonstrate this phenomenon, the authors conduct a series of experiments using models from the Qwen and Llama families. They define the "Long CoT Gap" $\Delta_{\text{Long}}(S) = P_{\text{Long}}(S) - P_{\text{Short}}(S)$, which measures the performance difference between a student model $S$ fine-tuned on long CoT data ($P_{\text{Long}}(S)$) and on short CoT data ($P_{\text{Short}}(S)$). Similarly, they define the "Large Model CoT Gap" $\Delta_{\text{Large}}(S) = P_{\text{Large}}(S) - P_{\text{Small}}(S)$, which quantifies the performance difference between the same student fine-tuned on CoT data from a larger teacher ($P_{\text{Large}}(S)$) and from a smaller teacher ($P_{\text{Small}}(S)$).
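A minimal sketch of how these gaps could be computed from benchmark scores is shown below; the function names and the accuracies are illustrative placeholders, not values reported in the paper.

```python
# Minimal sketch of the two gap definitions above; the example accuracies
# are placeholders, not results from the paper.

def long_cot_gap(acc_long_cot: float, acc_short_cot: float) -> float:
    """Delta_Long(S): accuracy of a student fine-tuned on long CoT data
    minus its accuracy when fine-tuned on short CoT data."""
    return acc_long_cot - acc_short_cot

def large_model_cot_gap(acc_large_teacher: float, acc_small_teacher: float) -> float:
    """Delta_Large(S): accuracy of a student distilled from a large teacher
    minus its accuracy when distilled from a small teacher."""
    return acc_large_teacher - acc_small_teacher

# A negative gap means the student learned more from the simpler supervision.
print(long_cot_gap(acc_long_cot=0.42, acc_short_cot=0.55))                   # ~ -0.13
print(large_model_cot_gap(acc_large_teacher=0.40, acc_small_teacher=0.48))   # ~ -0.08
```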
The experimental results indicate that small student models benefit more from short CoT examples and smaller teacher models, while larger student models benefit more from longer CoT examples and larger teacher models. For example, the Qwen2.5-1.5B-Instruct model scores over 10 points lower on the MATH and AMC benchmarks when fine-tuned with long CoT data than with short CoT data, whereas the Qwen2.5-32B-Instruct model improves significantly with long CoT. Similarly, the Qwen2.5-0.5B-Instruct model's performance on the AMC benchmark degrades when it is trained on a larger teacher's CoT rather than a smaller teacher's.
The authors propose Mix Distillation to bridge the Small Model Learnability Gap. Mix Distillation combines long and short CoT examples (Mix-Long) or mixes responses from larger and smaller models (Mix-Large). In Mix-Long, a small model is exposed to both detailed and concise reasoning steps, while in Mix-Large, the small model learns from reasoning chains that are better suited to its capacity.
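The sketch below illustrates one way such a mixed fine-tuning set could be assembled; the 1:4 long-to-short ratio and the data format are assumptions for illustration, not the exact configuration used in the paper.

```python
import random

def build_mix_long_dataset(long_cot_examples, short_cot_examples,
                           long_fraction=0.2, size=10_000, seed=0):
    """Assemble a Mix-Long style fine-tuning set by sampling long and short
    CoT examples at a fixed ratio (here 1:4, an illustrative choice).

    Both inputs are assumed to be lists of {"prompt": ..., "response": ...}
    dicts produced by the respective teacher generations.
    """
    rng = random.Random(seed)
    n_long = int(size * long_fraction)
    mixed = (rng.choices(long_cot_examples, k=n_long) +
             rng.choices(short_cot_examples, k=size - n_long))
    rng.shuffle(mixed)
    return mixed
```

A Mix-Large set would be built the same way, substituting responses from a large and a small teacher for the long and short CoT pools.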
The experimental results demonstrate that Mix Distillation consistently improves the reasoning performance of small models. For example, the Qwen2.5-3B-Instruct model improves by more than 8 points on MATH and AMC using Mix-Long and more than 7 points on MATH, AIME, and AMC using Mix-Large.
Further analysis reveals that math expert models exhibit a smaller learnability gap than general-purpose models, suggesting that limited domain knowledge contributes to the learnability challenges faced by small models. Additionally, base models show a more pronounced learnability gap than instruction-tuned models. An analysis of token distributions further indicates that long CoT and large-teacher CoT primarily shift the student model's distribution over tokens associated with speaking style.
In summary, this paper identifies the Small Model Learnability Gap, provides empirical evidence of its existence, and proposes Mix Distillation as a method to improve the reasoning capabilities of small LLMs by carefully balancing reasoning complexity in the training data.