Learning to Maximize Mutual Information for Chain-of-Thought Distillation (2403.03348v3)
Abstract: Knowledge distillation, the technique of transferring knowledge from large, complex models to smaller ones, marks a pivotal step towards efficient AI deployment. Distilling Step-by-Step (DSS), a novel method utilizing chain-of-thought (CoT) distillation, has demonstrated promise by imbuing smaller models with the superior reasoning capabilities of their larger counterparts. In DSS, the distilled model learns to generate rationales and predict labels concurrently through a multi-task learning framework. However, DSS overlooks the intrinsic relationship between the two training tasks, leading to ineffective integration of CoT knowledge with the task of label prediction. To address this, we investigate the mutual relationship of the two tasks from an Information Bottleneck perspective and formulate it as maximizing the mutual information between the representation features of the two tasks. We propose a variational approach to solve this optimization problem using a learning-based method. Our experimental results across four datasets demonstrate that our method outperforms the state-of-the-art DSS. Our findings offer insightful guidance for future research on LLM distillation as well as applications involving CoT. Code is available at https://github.com/xinchen9/cot_distillation_ACL2024.
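To make the idea concrete, below is a minimal, hypothetical sketch (not the authors' released implementation) of the kind of learned variational bound the abstract describes: an InfoNCE-style lower bound on the mutual information between the label-prediction representation and the rationale-generation representation, added as an auxiliary term to a DSS-style multi-task loss. All names (`MIEstimator`, `hidden_dim`, `proj_dim`, the loss weights) are illustrative assumptions.

```python
# Hypothetical sketch of a learned variational MI lower bound between the two
# task representations in a DSS-style multi-task setup; names are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MIEstimator(nn.Module):
    """Critic scoring (label, rationale) representation pairs; maximizing the
    InfoNCE objective it defines is one variational way to maximize their MI."""
    def __init__(self, hidden_dim: int, proj_dim: int = 128):
        super().__init__()
        self.proj_label = nn.Linear(hidden_dim, proj_dim)
        self.proj_rat = nn.Linear(hidden_dim, proj_dim)

    def forward(self, z_label: torch.Tensor, z_rat: torch.Tensor) -> torch.Tensor:
        # z_label, z_rat: (batch, hidden_dim) pooled representations of the
        # label-prediction and rationale-generation tasks for the same inputs.
        a = F.normalize(self.proj_label(z_label), dim=-1)
        b = F.normalize(self.proj_rat(z_rat), dim=-1)
        logits = a @ b.t()                                  # (batch, batch) scores
        targets = torch.arange(a.size(0), device=a.device)  # matched pairs are positives
        # Negative cross-entropy is a lower bound on I(z_label; z_rat)
        # up to an additive log(batch) constant (InfoNCE bound).
        return -F.cross_entropy(logits, targets)

# Possible use inside a training step (loss weights are assumptions):
# mi_bound = mi_estimator(z_label, z_rat)
# loss = label_loss + lambda_rat * rationale_loss - lambda_mi * mi_bound
```

The design choice here is to train a small critic jointly with the student so that the intractable mutual information is replaced by a tractable lower bound that can be maximized by gradient descent alongside the two distillation tasks.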
Authors: Xin Chen, Hanxian Huang, Yanjun Gao, Yi Wang, Jishen Zhao, Ke Ding