Leveraging Diverse Modeling Contexts with Collaborating Learning for Neural Machine Translation (2402.18428v1)

Published 28 Feb 2024 in cs.CL

Abstract: Autoregressive (AR) and non-autoregressive (NAR) models are two types of generative models for Neural Machine Translation (NMT). AR models predict tokens word by word and can effectively capture the distribution of real translations. NAR models predict tokens by extracting bidirectional contextual information, which improves inference speed but comes with a degradation in translation quality. Previous works either used AR models to enhance NAR models by reducing the complexity of the training data, or injected global information into AR models with the help of NAR models. However, these methods exploit the contextual information of only a single type of model, neglecting the diverse contextual information that different types of models can provide. In this paper, we propose DCMCL, a novel and generic collaborative learning method in which AR and NAR models are treated as collaborators rather than teachers and students. To hierarchically leverage the bilateral contextual information, token-level mutual learning and sequence-level contrastive learning are adopted between the AR and NAR models. Extensive experiments on four widely used benchmarks show that DCMCL simultaneously improves both AR and NAR models, by up to 1.38 and 2.98 BLEU respectively, and outperforms the current best unified model by up to 0.97 BLEU for both AR and NAR decoding.
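
The two training signals named in the abstract can be made concrete with a short sketch. The following is a minimal, illustrative PyTorch snippet, not the paper's exact formulation: the symmetric-KL form of the token-level mutual learning, the mean-pooled sentence vectors, the InfoNCE-style contrastive objective, and the temperature value are all assumptions made here for clarity.

```python
import torch
import torch.nn.functional as F

def token_mutual_learning_loss(ar_logits, nar_logits):
    """Token-level mutual learning: symmetric KL between the AR and NAR
    token distributions for the same target positions.

    ar_logits, nar_logits: (batch, seq_len, vocab) logits from the two
    collaborating models. Assumed symmetric-KL form; padding masks omitted.
    """
    ar_logp = F.log_softmax(ar_logits, dim=-1)
    nar_logp = F.log_softmax(nar_logits, dim=-1)
    # F.kl_div(input, target) computes KL(target || input); batchmean
    # reduction sums over tokens and divides by the batch size.
    kl_ar_to_nar = F.kl_div(nar_logp, ar_logp, reduction="batchmean", log_target=True)
    kl_nar_to_ar = F.kl_div(ar_logp, nar_logp, reduction="batchmean", log_target=True)
    return 0.5 * (kl_ar_to_nar + kl_nar_to_ar)

def sequence_contrastive_loss(ar_hidden, nar_hidden, temperature=0.1):
    """Sequence-level contrastive learning: pull together the AR and NAR
    representations of the same sentence, push apart other sentences in
    the batch (InfoNCE-style; an assumption, as is the mean pooling).

    ar_hidden, nar_hidden: (batch, seq_len, dim) decoder hidden states.
    """
    ar_vec = F.normalize(ar_hidden.mean(dim=1), dim=-1)    # (batch, dim)
    nar_vec = F.normalize(nar_hidden.mean(dim=1), dim=-1)  # (batch, dim)
    logits = ar_vec @ nar_vec.t() / temperature            # (batch, batch)
    targets = torch.arange(logits.size(0), device=logits.device)
    # Diagonal entries are the matching AR/NAR pairs (positives).
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```

In a full training setup these two auxiliary terms would be added, with suitable weights and target-token masking, to each model's usual cross-entropy objective; the exact weighting and scheduling choices are described in the paper and are not reproduced here.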

Authors (3)
  1. Yusheng Liao (16 papers)
  2. Yanfeng Wang (211 papers)
  3. Yu Wang (939 papers)