
Symbolic Chain-of-Thought Distillation: Small Models Can Also "Think" Step-by-Step (2306.14050v2)

Published 24 Jun 2023 in cs.CL

Abstract: Chain-of-thought prompting (e.g., "Let's think step-by-step") primes LLMs to verbalize rationalizations for their predictions. While chain-of-thought can lead to dramatic performance gains, benefits appear to emerge only for sufficiently large models (beyond 50B parameters). We show that orders-of-magnitude smaller models (125M -- 1.3B parameters) can still benefit from chain-of-thought prompting. To achieve this, we introduce Symbolic Chain-of-Thought Distillation (SCoTD), a method to train a smaller student model on rationalizations sampled from a significantly larger teacher model. Experiments across several commonsense benchmarks show that: 1) SCoTD enhances the performance of the student model in both supervised and few-shot settings, and especially for challenge sets; 2) sampling many reasoning chains per instance from the teacher is paramount; and 3) after distillation, student chain-of-thoughts are judged by humans as comparable to the teacher's, despite orders of magnitude fewer parameters. We test several hypotheses regarding which properties of chain-of-thought samples are important, e.g., diversity vs. teacher likelihood vs. open-endedness. We release our corpus of chain-of-thought samples and code.
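The distillation pipeline the abstract describes can be sketched in a few lines: sample many chain-of-thought rationales per training instance from a large teacher, then fine-tune the student on targets that pair each rationale with its predicted label. The sketch below is illustrative only, under stated assumptions: `sample_teacher_rationales` is a stub standing in for prompted sampling from a real teacher LLM (e.g., via an API), and all function names and the target format are hypothetical, not taken from the paper's released code.

```python
import random

def sample_teacher_rationales(question, n_samples=30):
    """Stub standing in for sampling from a large teacher LLM.

    In SCoTD the teacher is prompted with a few chain-of-thought
    exemplars and sampled many times per instance; the paper reports
    that drawing many samples per instance is paramount.
    """
    return [
        (f"[sampled rationale {i} for: {question}]",
         random.choice(["yes", "no"]))
        for i in range(n_samples)
    ]

def build_distillation_corpus(questions, n_samples=30):
    """Build (input, target) pairs for student fine-tuning.

    Each target is a rationale followed by the predicted label, so the
    student learns to verbalize reasoning before answering.
    """
    corpus = []
    for q in questions:
        for rationale, label in sample_teacher_rationales(q, n_samples):
            corpus.append((q, f"{rationale} So the answer is: {label}"))
    return corpus

# Example: 5 sampled chains for one commonsense question.
corpus = build_distillation_corpus(
    ["Can a suit of armor conduct electricity?"], n_samples=5
)
```

In practice the resulting `corpus` would be fed to a standard sequence-to-sequence or causal-LM fine-tuning loop for the small student model; optionally, sampled chains can be filtered (e.g., by label correctness on supervised data) before training.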

