MT-PATCHER: Selective and Extendable Knowledge Distillation from Large Language Models for Machine Translation (2403.09522v2)
Abstract: Large language models (LLMs) have demonstrated strong ability in the field of machine translation (MT), yet they suffer from high computational cost and latency. Transferring translation knowledge from giant LLMs to medium-sized MT models is therefore a promising research direction. However, traditional knowledge distillation methods do not take the capabilities of the student and teacher models into consideration: they repeatedly teach the student knowledge it has already learned and fail to extend to novel contexts and knowledge. In this paper, we propose a framework called MT-Patcher, which transfers knowledge from LLMs to existing MT models in a selective, comprehensive, and proactive manner. Considering the current translation ability of the student MT model, we identify and correct only its translation errors instead of distilling the whole translation from the teacher. Leveraging the strong language abilities of LLMs, we instruct the LLM teacher to synthesize diverse contexts and anticipate more potential errors for the student. Experimental results on both specific language phenomena and general MT benchmarks demonstrate that finetuning the student MT model on about 10% of the examples achieves results comparable to traditional knowledge distillation, while the synthesized potential errors and diverse contexts further improve translation performance on unseen contexts and words.
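To make the selective and proactive distillation idea concrete, below is a minimal sketch of the data-construction loop the abstract describes: the student translates, the teacher corrects only erroneous outputs, and the teacher additionally synthesizes new contexts to anticipate further errors. All function names (`student_translate`, `teacher_feedback`, `teacher_synthesize`) are hypothetical stand-ins for illustration, not the paper's actual API.

```python
# Sketch of a selective, extendable distillation loop in the spirit of MT-Patcher.
# The callables below are assumed interfaces, not the authors' implementation.

from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Example:
    source: str
    target: str  # corrected translation used to finetune (patch) the student


def build_patch_data(
    sources: List[str],
    student_translate: Callable[[str], str],
    teacher_feedback: Callable[[str, str], Optional[str]],
    teacher_synthesize: Callable[[str, str], List[str]],
) -> List[Example]:
    """Collect finetuning data only where the student actually errs.

    teacher_feedback(src, hyp) returns a corrected translation if the
    hypothesis contains an error, otherwise None (the student already
    handles this input, so no distillation example is needed).
    teacher_synthesize(src, hyp) proposes new source sentences that place
    the error-prone words or phenomena in diverse contexts.
    """
    patch_data: List[Example] = []
    for src in sources:
        hyp = student_translate(src)
        correction = teacher_feedback(src, hyp)
        if correction is None:
            continue  # selective: skip knowledge the student has mastered
        patch_data.append(Example(src, correction))

        # Proactive: anticipate further errors by synthesizing new contexts
        # and keeping only those the student still gets wrong.
        for new_src in teacher_synthesize(src, hyp):
            new_hyp = student_translate(new_src)
            new_correction = teacher_feedback(new_src, new_hyp)
            if new_correction is not None:
                patch_data.append(Example(new_src, new_correction))
    return patch_data
```

Under this reading, the resulting `patch_data` is a small, targeted finetuning set (roughly 10% of what full sequence-level distillation would produce, per the abstract) rather than a full re-translation of the training corpus.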
Authors: Jiahuan Li, Shanbo Cheng, Shujian Huang, Jiajun Chen