Towards Cross-Tokenizer Distillation: the Universal Logit Distillation Loss for LLMs (2402.12030v2)
Abstract: Deploying LLMs with several billion parameters can be impractical in most industrial use cases due to constraints such as cost, latency limitations, and hardware accessibility. Knowledge distillation (KD) offers a solution by compressing knowledge from resource-intensive large models into smaller ones. Various strategies exist, some relying on the text generated by the teacher model and optionally utilizing its logits to enhance learning. However, these logit-based methods often require both teacher and student models to share the same tokenizer, limiting their applicability across different LLM families. In this paper, we introduce the Universal Logit Distillation (ULD) loss, grounded in optimal transport, to address this limitation. Our experimental results demonstrate the effectiveness of ULD loss in enabling distillation across models with different architectures and tokenizers, paving the way for more widespread use of distillation techniques.
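The abstract names the key ingredient, a logit-distillation loss grounded in optimal transport that does not require teacher and student to share a tokenizer, without spelling out how such a loss can be computed. The sketch below is a minimal PyTorch illustration of that idea, not the authors' exact formulation: the function name `uld_style_loss`, the truncation-based sequence alignment, the zero-padding of the smaller vocabulary, and the sorted-probability L1 gap used as the transport cost are all assumptions made for this example.

```python
import torch
import torch.nn.functional as F

def uld_style_loss(student_logits: torch.Tensor,
                   teacher_logits: torch.Tensor) -> torch.Tensor:
    """Optimal-transport-style gap between next-token distributions of a
    student and a teacher that use different tokenizers.

    student_logits: (batch, seq_len_s, vocab_s)
    teacher_logits: (batch, seq_len_t, vocab_t)
    """
    # The two tokenizers generally yield different sequence lengths; a real
    # implementation needs a proper alignment step. Truncating to the shared
    # length is an assumption made for this sketch.
    seq_len = min(student_logits.size(1), teacher_logits.size(1))
    p_s = F.softmax(student_logits[:, :seq_len], dim=-1)
    p_t = F.softmax(teacher_logits[:, :seq_len], dim=-1)

    # Zero-pad the smaller vocabulary so both distributions have equal size.
    pad = p_t.size(-1) - p_s.size(-1)
    if pad > 0:
        p_s = F.pad(p_s, (0, pad))
    elif pad < 0:
        p_t = F.pad(p_t, (0, -pad))

    # Sort each distribution in decreasing order and compare element-wise:
    # the L1 gap between sorted probability vectors acts as a transport cost
    # that never references token identities, so no shared vocabulary is needed.
    p_s_sorted, _ = torch.sort(p_s, dim=-1, descending=True)
    p_t_sorted, _ = torch.sort(p_t, dim=-1, descending=True)
    return (p_s_sorted - p_t_sorted).abs().sum(dim=-1).mean()
```

In a distillation run, a term like this would typically be added, with a weighting coefficient, to the student's standard next-token cross-entropy on the teacher-generated text; that combination and the alignment choices above are assumptions of the sketch rather than the paper's exact recipe.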
Authors: Nicolas Boizard, Céline Hudelot, Pierre Colombo, Kevin El Haddad