Distilling Reasoning Capabilities into Smaller Language Models (2212.00193v2)

Published 1 Dec 2022 in cs.LG and cs.CL

Abstract: Step-by-step reasoning approaches like chain of thought (CoT) have proved to be very effective in inducing reasoning capabilities in LLMs. However, the success of the CoT approach is fundamentally tied to the model size, and billion parameter-scale models are often needed to get CoT to work. In this paper, we propose a knowledge distillation approach that leverages the step-by-step CoT reasoning capabilities of larger models and distills these abilities into smaller models. In this work, we propose an alternative reasoning scheme, Socratic CoT, that learns a decomposition of the original problem into a sequence of subproblems and uses it to guide the intermediate reasoning steps. We use Socratic CoT to train a combination of two small distilled models: a problem decomposer and a subproblem solver. In practice, given a new problem, the two distilled models work in sync to decompose and solve complex problems. On multiple reasoning datasets (GSM8K, StrategyQA, and SVAMP), our proposed distillation strategies boost the performance of smaller models by over 70% compared to the baselines. Finally, we investigate when Socratic CoT is an effective alternative to CoT, demonstrating cases where a much smaller model (GPT-2 large) can outperform a 10X larger model (GPT-3 6B). Our code is available here: https://github.com/kumar-shridhar/Distiiling-LM

Distilling Reasoning Capabilities into Smaller Language Models

This paper, authored by Kumar Shridhar, Alessandro Stolfo, and Mrinmaya Sachan, addresses the challenge of transferring the reasoning abilities characteristic of LLMs into smaller models via knowledge distillation. The paper focuses on step-by-step reasoning methodologies such as chain-of-thought (CoT) prompting, which substantially improves LLMs' performance on reasoning tasks but typically requires billion-parameter-scale models to be effective.
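To ground the comparison with Socratic CoT below, here is a minimal sketch of what a standard CoT distillation target might look like: the teacher LLM's free-form rationale is paired with the question, and the student model is trained to reproduce the rationale before emitting the answer. The field names and the GSM8K-style example are illustrative assumptions, not the paper's exact data format.

```python
# Hypothetical standard-CoT training pair (field names are assumptions):
# the student learns to emit the teacher's full rationale, then the answer.
cot_example = {
    "question": (
        "A robe takes 2 bolts of blue fiber and half that much white fiber. "
        "How many bolts in total does it take?"
    ),
    "rationale": "It takes 2 / 2 = 1 bolt of white fiber, so the total is "
                 "2 + 1 = 3 bolts.",
    "answer": "3",
}
```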

Methodological Approach

The authors propose a novel distillation framework aimed at transferring the CoT reasoning capacity from larger to smaller models. They introduce an alternative reasoning scheme called Socratic CoT, which involves decomposing the primary problem into a sequence of subproblems, thus constructing a subproblem-solution guided reasoning chain.
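Under Socratic CoT, the same training signal is restructured: rather than one monolithic rationale, each problem is annotated with a sequence of subquestions and the intermediate answers that resolve them. A minimal sketch of such an annotated example follows (field names and formatting are assumptions for illustration):

```python
# Hypothetical Socratic CoT annotation of the same GSM8K-style problem:
# the monolithic rationale is split into subquestion/subanswer pairs.
socratic_example = {
    "question": (
        "A robe takes 2 bolts of blue fiber and half that much white fiber. "
        "How many bolts in total does it take?"
    ),
    "subquestions": [
        "How many bolts of white fiber does it take?",
        "How many bolts does it take in total?",
    ],
    "subanswers": [
        "It takes 2 / 2 = 1 bolt of white fiber.",
        "So the total is 2 + 1 = 3 bolts.",
    ],
    "final_answer": "3",
}
```

The decomposer can then be trained on question-to-subquestion pairs and the solver on subquestion-to-subanswer pairs, so each student model learns a narrower task than reproducing the entire reasoning chain at once.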

Within this framework, two smaller distilled models are trained: a problem decomposer and a subproblem solver. Once trained, the two models work in tandem on a new problem: the decomposer generates a sequence of subquestions, and the solver answers them one by one. The approach was tested on reasoning datasets including GSM8K, StrategyQA, and SVAMP.
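The interaction at inference time can be pictured as a simple alternating loop. The sketch below is an illustration under stated assumptions, not the paper's implementation: `decomposer` and `solver` stand in for the two distilled models, `.generate()` for their text-generation interface, and the `[END]` stop marker is invented for the example.

```python
# Minimal sketch of the two-model inference loop. The model objects, their
# .generate() interface, and the "[END]" stop marker are all assumptions.

def solve(problem: str, decomposer, solver) -> str:
    """Alternate between the distilled decomposer and solver on a new problem."""
    context = problem
    answer = ""
    while True:
        # The decomposer proposes the next subquestion, conditioned on the
        # problem and the subquestion-answer pairs produced so far.
        subquestion = decomposer.generate(context)
        if subquestion.strip() == "[END]":  # no further subquestions
            break
        # The solver answers the current subquestion in the same context.
        answer = solver.generate(context + "\n" + subquestion)
        context += "\n" + subquestion + "\n" + answer
    return answer  # the last intermediate answer carries the final result
```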

Experimental Results

The paper reports significant improvements in the reasoning capabilities of smaller models trained under this distillation framework, with performance gains of over 70% compared to baseline models trained without distillation. Notably, the Socratic CoT method allowed a much smaller model (GPT-2 large) to outperform a model roughly ten times its size (GPT-3 6B) in some settings.

Theoretical and Practical Implications

The research provides a compelling argument for the utility of knowledge distillation as a tool to transfer sophisticated reasoning abilities from large, resource-intensive models to more computationally efficient small models. It underscores the potential for wider deployment of reasoning-capable models in environments where resources are constrained.

Theoretical implications include advancing how reasoning tasks are formulated in AI, emphasizing semantic decomposition as a route to more accurate reasoning. From a practical standpoint, this could democratize the deployment of AI reasoning capabilities, enabling more applications to access robust reasoning without substantial computational overhead.

Future Research Directions

Future developments could explore further granularity in the decomposition of reasoning tasks, seeking even more efficient ways to train subproblem identification and solving in smaller models. Moreover, continued improvements in LLM-generated annotations could augment accuracy and facilitate broader adoption of these distillation practices.

In summary, this paper offers a structured approach to distilling reasoning capabilities into smaller language models, showing substantial performance gains and the feasibility of deploying effective reasoning in constrained computational environments. These advances not only refine current AI capabilities but also open new horizons for AI applications across diverse fields.

Authors (3)
  1. Kumar Shridhar (25 papers)
  2. Alessandro Stolfo (12 papers)
  3. Mrinmaya Sachan (124 papers)
Citations (127)