Teaching Small LLMs to Reason
The paper "Teaching Small LLMs to Reason" addresses a salient challenge in the domain of NLP: enhancing the reasoning capabilities of smaller LLMs. While recent techniques such as chain-of-thought (CoT) prompting have significantly improved the reasoning performance in LLMs comprising at least tens of billions of parameters, these advancements do not extend to models with fewer parameters, which often produce ineffective reasoning outputs. This research investigates the transfer of reasoning skills from extensive LLMs to smaller counterparts via knowledge distillation methodologies.
Methodology
The core approach is a two-step pipeline. First, chains of thought are generated by large teacher LLMs, such as PaLM 540B and GPT-3 175B, and used to annotate existing supervised datasets. This data generation uses a modified few-shot prompting technique: the gold answer for each example is included in the prompt, which steers the teacher toward a reasoning sequence that reaches the correct solution. Second, smaller student models, specifically T5 models of various sizes, are fine-tuned on these teacher-generated chains of thought using teacher forcing. The goal is to give smaller models the reasoning behavior observed in larger ones without their parameter counts.
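To make the second step concrete, here is a minimal sketch of fine-tuning a T5 student on chain-of-thought targets with teacher forcing, using Hugging Face's T5 classes. The model size, optimizer settings, and the toy example are illustrative assumptions rather than the paper's exact configuration; the key point is that passing labels to the model yields the teacher-forced cross-entropy loss over the CoT target.

```python
# Minimal sketch of step 2: fine-tuning a T5 student on teacher-generated
# chains of thought with teacher forcing. Model size, hyperparameters, and
# the example data are assumptions, not the authors' exact setup.
import torch
from transformers import T5ForConditionalGeneration, T5TokenizerFast

tokenizer = T5TokenizerFast.from_pretrained("t5-base")   # stand-in for T5 XXL
model = T5ForConditionalGeneration.from_pretrained("t5-base")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Each example pairs a question with the teacher LLM's chain of thought
# ending in the final answer (hypothetical data for illustration).
examples = [
    {
        "question": "Ann has 3 apples and buys 2 more. How many apples does she have?",
        "cot_target": "Ann starts with 3 apples. She buys 2 more, so 3 + 2 = 5. The answer is 5.",
    },
]

model.train()
for ex in examples:
    inputs = tokenizer(ex["question"], return_tensors="pt")
    labels = tokenizer(ex["cot_target"], return_tensors="pt").input_ids
    # Passing labels computes the cross-entropy loss with teacher forcing:
    # each decoder step is conditioned on the gold target prefix.
    loss = model(**inputs, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```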
Results and Analysis
The experiments show notable gains across arithmetic, commonsense, and symbolic reasoning datasets. When fine-tuned on CoT data generated by the large teacher models, T5 XXL's accuracy on GSM8K rose from 8.11% to over 21%. Improvements were also observed in commonsense reasoning, with gains reported on StrategyQA. The size of the improvement varied, however, and was more pronounced on tasks where factual knowledge was less pivotal.
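Accuracy on such benchmarks is typically computed by extracting the final answer from the generated chain of thought and comparing it to the gold label. The sketch below illustrates that scoring idea, assuming generations end with a phrase like "The answer is 5."; it is an illustration, not necessarily the paper's exact evaluation procedure.

```python
# Sketch of exact-match scoring on generated chains of thought, assuming each
# generation ends with "The answer is X." (illustrative assumption only).
import re

def extract_answer(generation: str) -> str | None:
    match = re.search(r"[Tt]he answer is\s*([-\d.,]+)", generation)
    return match.group(1).rstrip(".,") if match else None

def accuracy(generations: list[str], gold_answers: list[str]) -> float:
    correct = sum(
        extract_answer(g) == gold for g, gold in zip(generations, gold_answers)
    )
    return correct / len(gold_answers)

# Example: accuracy(["... so 3 + 2 = 5. The answer is 5."], ["5"]) -> 1.0
```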
An important insight from the ablation studies is that providing the expected task answer in the teacher's prompt significantly improved the quality of the generated CoT outputs, which in turn made fine-tuning more effective.
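A rough illustration of this answer-conditioned prompting idea is sketched below. The exemplar wording and the teacher_generate() call are hypothetical placeholders for whatever few-shot exemplars and teacher LLM (e.g., PaLM 540B) one has access to; the part that reflects the ablation's finding is simply appending the gold answer to the question.

```python
# Sketch of building an answer-conditioned few-shot prompt for the teacher LLM.
# The exemplar text and teacher_generate() are hypothetical placeholders.

FEW_SHOT_EXEMPLAR = (
    "Q: Tom has 4 pens and gives away 1. How many pens remain? (Answer: 3)\n"
    "A: Tom starts with 4 pens. Giving away 1 leaves 4 - 1 = 3. The answer is 3.\n\n"
)

def build_cot_prompt(question: str, gold_answer: str) -> str:
    """Append the gold answer to the question so the teacher's chain of
    thought is steered toward the correct solution."""
    return f"{FEW_SHOT_EXEMPLAR}Q: {question} (Answer: {gold_answer})\nA:"

# Hypothetical usage with some teacher LLM client:
# cot = teacher_generate(build_cot_prompt("Ann has 3 apples and buys 2 more...", "5"))
# The generated chain of thought then becomes the student's fine-tuning target.
```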
Implications and Future Directions
The findings underscore the potential of knowledge distillation for improving the efficiency and capability of smaller LLMs. This approach allows smaller models to leverage detailed reasoning processes without scaling in size. Practically, this could lead to more efficient deployment in resource-constrained environments where maintaining large models is not feasible.
Theoretically, the results from this paper suggest that CoT and other structured reasoning processes developed using larger models can be effectively transferred to smaller models, expanding the applicability of cutting-edge reasoning techniques across various model sizes.
Promising directions for future research include integrating these distilled reasoning approaches into multi-task learning settings and having models generate robust reasoning data autonomously. Evaluating the trade-offs between model size and dataset size in reaching optimal performance is another avenue worth exploring.
In summary, this research offers valuable insights into transferring reasoning capabilities across model sizes and demonstrates a practical paradigm for advancing LLM efficiency and effectiveness beyond mere parameter scaling.