Accurate Knowledge Distillation with n-best Reranking (2305.12057v4)

Published 20 May 2023 in cs.CL

Abstract: We propose using n-best reranking to enhance Sequence-Level Knowledge Distillation (Kim and Rush, 2016): we extract pseudo-labels for the student model's training data from the top n-best hypotheses, leveraging a diverse set of models with different inductive biases, objective functions, or architectures, including some publicly available LLMs, to pick the highest-quality hypotheses as labels. The effectiveness of our proposal is validated through experiments on the WMT'21 German-English and Chinese-English translation tasks. Our results demonstrate that pseudo-labels generated by our n-best reranker lead to a significantly more accurate student model. In fact, our best student model achieves accuracy comparable to that of a 4.7-billion-parameter translation model from Tran et al. (2021) while having two orders of magnitude fewer parameters.
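The pipeline the abstract describes can be sketched in a few lines. Below is a minimal, illustrative Python sketch, not the authors' implementation: it assumes a teacher that has already produced n-best hypothesis lists and a set of scorer callables standing in for the diverse reranking models. The names (`rerank_nbest`, `build_distillation_data`) and the toy scorers are hypothetical, and the interpolation weights would in practice be tuned on a development set rather than set by hand.

```python
from typing import Callable, Sequence

# A scorer maps (source, hypothesis) -> score; higher is better.
# In the paper's setting these would be diverse models (different
# architectures, objectives, LLMs); here they are simple stand-ins.
Scorer = Callable[[str, str], float]

def rerank_nbest(source: str, nbest: Sequence[str],
                 scorers: Sequence[Scorer],
                 weights: Sequence[float]) -> str:
    """Return the hypothesis maximizing the weighted sum of scorer scores."""
    def combined(hyp: str) -> float:
        return sum(w * s(source, hyp) for w, s in zip(weights, scorers))
    return max(nbest, key=combined)

def build_distillation_data(sources: Sequence[str],
                            nbest_lists: Sequence[Sequence[str]],
                            scorers: Sequence[Scorer],
                            weights: Sequence[float]):
    """Pair each source with its reranked hypothesis as a pseudo-label."""
    return [(src, rerank_nbest(src, hyps, scorers, weights))
            for src, hyps in zip(sources, nbest_lists)]

if __name__ == "__main__":
    # Toy example with two hypothetical scorers: a length-ratio penalty
    # and a scorer that rewards a particular lexical choice.
    length_ratio = lambda src, hyp: -abs(
        len(hyp.split()) / max(len(src.split()), 1) - 1.0)
    prefers_formal = lambda src, hyp: 1.0 if "cannot" in hyp else 0.0

    data = build_distillation_data(
        sources=["Das geht nicht."],
        nbest_lists=[["That can't work.", "That cannot work.", "No."]],
        scorers=[length_ratio, prefers_formal],
        weights=[1.0, 0.5],
    )
    print(data)  # [('Das geht nicht.', 'That cannot work.')]
```

The student is then trained on these (source, pseudo-label) pairs exactly as in standard sequence-level knowledge distillation; the substance of the paper lies in which scorers to combine and how to weight them, which this sketch abstracts away.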

References (50)
  1. Findings of the IWSLT 2023 Evaluation Campaign. In Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023), pages 1–61, Toronto, Canada (in-person and online). Association for Computational Linguistics.
  2. Findings of the 2021 conference on machine translation (WMT21). In Proceedings of the Sixth Conference on Machine Translation, pages 1–88, Online. Association for Computational Linguistics.
  3. Mikel Artetxe and Holger Schwenk. 2018. Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond. CoRR, abs/1812.10464.
  4. Barry Haddow. 2021. WMT21 News Systems and Evaluations. https://github.com/wmt-conference/wmt21-news-systems/blob/main/scores/automatic-scores.tsv (accessed May 5, 2023).
  5. Massive exploration of neural machine translation architectures. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1442–1451, Copenhagen, Denmark. Association for Computational Linguistics.
  6. Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2):263–311.
  7. Cristian Buciluǎ, Rich Caruana, and Alexandru Niculescu-Mizil. 2006. Model compression. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’06, pages 535–541, New York, NY, USA. Association for Computing Machinery.
  8. Colin Cherry and George Foster. 2012. Batch tuning strategies for statistical machine translation. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 427–436, Montréal, Canada. Association for Computational Linguistics.
  9. Online large-margin training of syntactic and structural translation features. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 224–233, Honolulu, Hawaii. Association for Computational Linguistics.
  10. Distilling multiple domains for neural machine translation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4500–4511, Online. Association for Computational Linguistics.
  11. Fast and robust neural network joint models for statistical machine translation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1370–1380, Baltimore, Maryland. Association for Computational Linguistics.
  12. Thomas G. Dietterich. 2000. Ensemble methods in machine learning. In Proceedings of the First International Workshop on Multiple Classifier Systems, MCS ’00, pages 1–15, Berlin, Heidelberg. Springer-Verlag.
  13. Understanding back-translation at scale. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 489–500, Brussels, Belgium. Association for Computational Linguistics.
  14. Beyond English-centric multilingual machine translation. arXiv preprint.
  15. Mara Finkelstein and Markus Freitag. 2024. MBR and QE finetuning: Training-time distillation of the best and most expensive decoding methods. In The Twelfth International Conference on Learning Representations.
  16. MBR and QE finetuning: Training-time distillation of the best and most expensive decoding methods.
  17. No one representation to rule them all: Overlapping features of training methods. In International Conference on Learning Representations.
  18. Effective strategies in zero-shot neural machine translation. In Proceedings of the 14th International Conference on Spoken Language Translation, pages 105–112, Tokyo, Japan. International Workshop on Spoken Language Translation.
  19. Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network.
  20. Google’s multilingual neural machine translation system: Enabling zero-shot translation. Transactions of the Association for Computational Linguistics, 5:339–351.
  21. Nearest neighbor machine translation. In International Conference on Learning Representations (ICLR).
  22. Distilling the knowledge of large-scale generative models into retrieval models for efficient open-domain conversation. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 3357–3373, Punta Cana, Dominican Republic. Association for Computational Linguistics.
  23. Yoon Kim and Alexander M. Rush. 2016. Sequence-level knowledge distillation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1317–1327, Austin, Texas. Association for Computational Linguistics.
  24. Findings of the 2022 conference on machine translation (WMT22). In Proceedings of the Seventh Conference on Machine Translation (WMT), pages 1–45, Abu Dhabi, United Arab Emirates (Hybrid). Association for Computational Linguistics.
  25. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, ACL ’07, pages 177–180, Stroudsburg, PA, USA. Association for Computational Linguistics.
  26. Multilingual neural machine translation with deep encoder and multiple shallow decoders. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 1613–1624, Online. Association for Computational Linguistics.
  27. Shankar Kumar and William Byrne. 2004. Minimum Bayes-risk decoding for statistical machine translation. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics: HLT-NAACL 2004, pages 169–176, Boston, Massachusetts, USA. Association for Computational Linguistics.
  28. The NiuTrans machine translation systems for WMT19. In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 257–266, Florence, Italy. Association for Computational Linguistics.
  29. Agreement on target-bidirectional neural machine translation. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 411–416, San Diego, California. Association for Computational Linguistics.
  30. Multilingual denoising pre-training for neural machine translation.
  31. Mega: Moving average equipped gated attention. In The Eleventh International Conference on Learning Representations.
  32. Combination of neural machine translation systems at WMT20. In Proceedings of the Fifth Conference on Machine Translation, pages 230–238, Online. Association for Computational Linguistics.
  33. Adaptive machine translation with large language models. In Proceedings of the 24th Annual Conference of the European Association for Machine Translation, pages 227–237, Tampere, Finland. European Association for Machine Translation.
  34. Crosslingual generalization through multitask finetuning.
  35. No language left behind: Scaling human-centered machine translation.
  36. A smorgasbord of features for statistical machine translation. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics: HLT-NAACL 2004, pages 161–168, Boston, Massachusetts, USA. Association for Computational Linguistics.
  37. Robert Östling and Jörg Tiedemann. 2016. Efficient word alignment with Markov Chain Monte Carlo. Prague Bulletin of Mathematical Linguistics, 106:125–146.
  38. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of NAACL-HLT 2019: Demonstrations.
  39. Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.
  40. Maja Popović. 2015. chrF: character n-gram F-score for automatic MT evaluation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 392–395, Lisbon, Portugal. Association for Computational Linguistics.
  41. Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186–191, Brussels, Belgium. Association for Computational Linguistics.
  42. The volctrans GLAT system: Non-autoregressive translation meets WMT21. In Proceedings of the Sixth Conference on Machine Translation, pages 187–196, Online. Association for Computational Linguistics.
  43. Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners.
  44. COMET-22: Unbabel-IST 2022 submission for the metrics shared task. In Proceedings of the Seventh Conference on Machine Translation (WMT), pages 578–585, Abu Dhabi, United Arab Emirates (Hybrid). Association for Computational Linguistics.
  45. Discriminative reranking for machine translation. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics: HLT-NAACL 2004, pages 177–184, Boston, Massachusetts, USA. Association for Computational Linguistics.
  46. Facebook AI’s WMT21 news translation task submission. In Proceedings of the Sixth Conference on Machine Translation, pages 205–215, Online. Association for Computational Linguistics.
  47. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.
  48. Nearest neighbor knowledge distillation for neural machine translation. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 5546–5556, Seattle, United States. Association for Computational Linguistics.
  49. Simple and effective noisy channel modeling for neural machine translation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5696–5701, Hong Kong, China. Association for Computational Linguistics.
  50. Synchronous bidirectional neural machine translation. Transactions of the Association for Computational Linguistics, 7:91–105.
