Symbolic Chain-of-Thought Distillation: Small Models Can Also "Think" Step-by-Step (2306.14050v2)
Abstract: Chain-of-thought prompting (e.g., "Let's think step-by-step") primes LLMs to verbalize rationalization for their predictions. While chain-of-thought can lead to dramatic performance gains, benefits appear to emerge only for sufficiently large models (beyond 50B parameters). We show that orders-of-magnitude smaller models (125M -- 1.3B parameters) can still benefit from chain-of-thought prompting. To achieve this, we introduce Symbolic Chain-of-Thought Distillation (SCoTD), a method to train a smaller student model on rationalizations sampled from a significantly larger teacher model. Experiments across several commonsense benchmarks show that: 1) SCoTD enhances the performance of the student model in both supervised and few-shot settings, and especially for challenge sets; 2) sampling many reasoning chains per instance from the teacher is paramount; and 3) after distillation, student chain-of-thoughts are judged by humans as comparable to the teacher, despite orders of magnitude fewer parameters. We test several hypotheses regarding what properties of chain-of-thought samples are important, e.g., diversity vs. teacher likelihood vs. open-endedness. We release our corpus of chain-of-thought samples and code.
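The abstract summarizes the SCoTD recipe: sample many chain-of-thought rationales per training instance from a large teacher model, then fine-tune a much smaller student on the resulting (input, rationale, answer) corpus. Below is a minimal sketch of the teacher-sampling step, assuming a HuggingFace-style setup; the teacher model name, prompt template, and sampling hyperparameters are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of the teacher-sampling stage of Symbolic Chain-of-Thought
# Distillation (SCoTD). Model names, the prompt template, and NUM_SAMPLES are
# placeholders for illustration only.
from transformers import AutoModelForCausalLM, AutoTokenizer

NUM_SAMPLES = 30  # the paper emphasizes sampling many reasoning chains per instance

teacher_name = "teacher-llm-placeholder"  # hypothetical; the paper uses a much larger LLM teacher
student_name = "facebook/opt-125m"        # one of the small student sizes studied (125M)

teacher_tok = AutoTokenizer.from_pretrained(teacher_name)
teacher = AutoModelForCausalLM.from_pretrained(teacher_name)

def sample_rationales(question: str, choices: str) -> list[str]:
    """Sample NUM_SAMPLES chain-of-thought rationales from the teacher for one instance."""
    prompt = f"Q: {question}\nAnswer choices: {choices}\nLet's think step-by-step:"
    inputs = teacher_tok(prompt, return_tensors="pt")
    outputs = teacher.generate(
        **inputs,
        do_sample=True,                     # stochastic decoding -> diverse rationales
        temperature=1.0,
        num_return_sequences=NUM_SAMPLES,
        max_new_tokens=128,
    )
    return [teacher_tok.decode(o, skip_special_tokens=True) for o in outputs]

# The sampled (prompt, rationale + answer) pairs form the distillation corpus on which
# the small student (e.g., OPT-125M or OPT-1.3B) is fine-tuned with a standard LM loss.
```

The key design choice highlighted in the abstract is drawing many samples per instance rather than a single greedy rationale, which is why the sketch uses stochastic decoding with multiple return sequences.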