Multistage Collaborative Knowledge Distillation from a Large Language Model for Semi-Supervised Sequence Generation (2311.08640v4)
Abstract: We study semi-supervised sequence generation tasks, where labeled examples are too scarce to finetune a model and, meanwhile, few-shot prompted LLMs leave room for improvement. In this paper, we present the discovery that a student model distilled from a few-shot prompted LLM can commonly generalize better than its teacher to unseen examples on such tasks. We find that during knowledge distillation (KD) the student learns a general pattern from the high-quality pseudolabels produced by the teacher and, favorably, does not pick up a general pattern from the low-quality pseudolabels. Leveraging this discovery, we propose a new method, Multistage Collaborative Knowledge Distillation from an LLM (MCKD), for these tasks. MCKD first few-shot prompts an LLM to produce pseudolabels for unlabeled data. Then, at each stage of an iterative KD process, a new pair of students is trained on disjoint partitions of the pseudolabeled data, and each produces new and improved pseudolabels for the partition it has not seen. We conduct extensive experiments on four syntactic and semantic parsing datasets and show the effectiveness of MCKD for low-resource semi-supervised sequence generation. On CRAFT biomedical parsing, for example, 3-stage MCKD with 50 labeled examples outperforms the LLM teacher and vanilla KD by 7.5% and 3.7% parsing F1, respectively, and matches the performance of supervised finetuning with 500 labeled examples.
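To make the multistage, cross-partition relabeling loop described above concrete, here is a minimal Python sketch. It assumes hypothetical helpers `llm_few_shot_label` (few-shot prompting the LLM) and `train_student` (finetuning a small sequence-to-sequence student and returning its inference function); these names and the final full-pool distillation step are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of the MCKD pseudolabeling loop, under the assumptions above.
from typing import Callable, List

LabelFn = Callable[[List[str]], List[str]]


def mckd(
    unlabeled: List[str],
    llm_few_shot_label: LabelFn,                       # hypothetical: few-shot prompted LLM
    train_student: Callable[[List[str], List[str]], LabelFn],  # hypothetical: finetune + return predictor
    num_stages: int = 3,
) -> LabelFn:
    """Multistage collaborative KD: cross-partition relabeling at every stage."""
    # Stage 0: the few-shot prompted LLM pseudolabels every unlabeled input.
    labels = llm_few_shot_label(unlabeled)

    # Split the unlabeled pool into two fixed, disjoint partitions A and B.
    mid = len(unlabeled) // 2
    xs_a, xs_b = unlabeled[:mid], unlabeled[mid:]
    ys_a, ys_b = labels[:mid], labels[mid:]

    for _ in range(num_stages):
        # Each student in the pair is trained on one partition's pseudolabels...
        student_a = train_student(xs_a, ys_a)
        student_b = train_student(xs_b, ys_b)
        # ...and relabels the partition it has NOT seen, so the next stage's
        # pseudolabel for every example comes from a held-out student.
        ys_b = student_a(xs_b)
        ys_a = student_b(xs_a)

    # Assumed final step: distill one student on the full, freshly relabeled pool.
    return train_student(xs_a + xs_b, ys_a + ys_b)
```

The design point the sketch highlights is that no student ever predicts pseudolabels for data it was trained on, which is what lets each stage improve label quality rather than reinforce its own errors.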