
Crossing the Threshold: Idiomatic Machine Translation through Retrieval Augmentation and Loss Weighting (2310.07081v2)

Published 10 Oct 2023 in cs.CL

Abstract: Idioms are common in everyday language, but often pose a challenge to translators because their meanings do not follow from the meanings of their parts. Despite significant advances, machine translation systems still struggle to translate idiomatic expressions. We provide a simple characterization of idiomatic translation and related issues. This allows us to conduct a synthetic experiment revealing a tipping point at which transformer-based machine translation models correctly default to idiomatic translations. To expand multilingual resources, we compile a dataset of ~4k natural sentences containing idiomatic expressions in French, Finnish, and Japanese. To improve translation of natural idioms, we introduce two straightforward yet effective techniques: the strategic upweighting of training loss on potentially idiomatic sentences, and using retrieval-augmented models. This not only improves the accuracy of a strong pretrained MT model on idiomatic sentences by up to 13% in absolute accuracy, but also holds potential benefits for non-idiomatic sentences.
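To make the loss-upweighting idea concrete, here is a minimal sketch of one way a per-sentence weight could enter a translation training loss. The abstract does not give the exact scheme, so the function name `weighted_nll`, the `idiom_weight` value, and the choice to normalize by the weighted token count are illustrative assumptions, not the authors' implementation.

```python
import math


def weighted_nll(token_log_probs, is_idiomatic, idiom_weight=2.0):
    """Batch negative log-likelihood with flagged sentences upweighted.

    token_log_probs: one list per sentence, holding the model's log-probability
        of each reference token (hypothetical input format).
    is_idiomatic: one bool per sentence, True if the sentence is flagged as
        potentially idiomatic.
    idiom_weight: assumed multiplier for flagged sentences (not from the paper).
    """
    total = 0.0
    norm = 0.0
    for log_probs, flagged in zip(token_log_probs, is_idiomatic):
        weight = idiom_weight if flagged else 1.0
        # Sentence-level negative log-likelihood, scaled by its weight.
        total += weight * -sum(log_probs)
        # Normalize by weighted token count so the loss scale stays comparable.
        norm += weight * len(log_probs)
    return total / norm


# Toy batch: one ordinary sentence and one flagged as potentially idiomatic.
batch = [[math.log(0.5), math.log(0.5)], [math.log(0.25)]]
flags = [False, True]
loss = weighted_nll(batch, flags, idiom_weight=2.0)
```

With `idiom_weight=1.0` this reduces to the ordinary token-averaged loss. Larger values shift the average toward the flagged sentences.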
