Did Translation Models Get More Robust Without Anyone Even Noticing? (2403.03923v1)

Published 6 Mar 2024 in cs.CL

Abstract: Neural machine translation (MT) models achieve strong results across a variety of settings, but it is widely believed that they are highly sensitive to "noisy" inputs, such as spelling errors, abbreviations, and other formatting issues. In this paper, we revisit this insight in light of recent multilingual MT models and LLMs applied to machine translation. Somewhat surprisingly, we show through controlled experiments that these models are far more robust to many kinds of noise than previous models, even when they perform similarly on clean data. This is notable because, even though LLMs have more parameters and more complex training processes than past models, none of the open ones we consider use any techniques specifically designed to encourage robustness. Next, we show that similar trends hold for social media translation experiments -- LLMs are more robust to social media text. We include an analysis of the circumstances in which source correction techniques can be used to mitigate the effects of noise. Altogether, we show that robustness to many types of noise has increased.
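The abstract describes controlled experiments that perturb source sentences with synthetic noise and compare translation quality against clean inputs. The sketch below illustrates that general experimental pattern only; the noise types, rates, function names, and metric hooks here are hypothetical placeholders, not the authors' actual setup.

```python
import random

# Minimal sketch of a controlled noise-injection experiment (illustrative,
# not the paper's exact method). Idea: perturb source sentences with
# typo-like noise at a chosen rate, translate both clean and noisy versions,
# and compare metric scores (e.g., chrF or COMET) to measure the robustness gap.

def add_char_noise(sentence: str, rate: float = 0.1, seed: int = 0) -> str:
    """Randomly swap adjacent characters in a fraction of the words."""
    rng = random.Random(seed)
    noisy = []
    for word in sentence.split():
        if len(word) > 3 and rng.random() < rate:
            i = rng.randrange(len(word) - 1)
            word = word[:i] + word[i + 1] + word[i] + word[i + 2:]
        noisy.append(word)
    return " ".join(noisy)

def robustness_gap(translate, score, sources, references, rate=0.1):
    """Quality drop between clean and noise-perturbed inputs.

    `translate` and `score` are placeholders for whichever MT system and
    evaluation metric are under test.
    """
    clean_hyps = [translate(s) for s in sources]
    noisy_hyps = [translate(add_char_noise(s, rate)) for s in sources]
    return score(clean_hyps, references) - score(noisy_hyps, references)
```

Comparing this gap across older bilingual NMT models and newer multilingual or LLM-based systems, at matched clean-data quality, is the kind of comparison the paper reports.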

Authors (2)
  1. Ben Peters (8 papers)
  2. André F. T. Martins (113 papers)
Citations (1)