
Exploring the Necessity of Visual Modality in Multimodal Machine Translation using Authentic Datasets (2404.06107v1)

Published 9 Apr 2024 in cs.CL

Abstract: Recent research in multimodal machine translation (MMT) has suggested that the visual modality is either dispensable or offers only marginal benefits. However, most of these conclusions are drawn from experiments on a limited set of bilingual sentence-image pairs, such as Multi30k. In such datasets, each bilingual parallel sentence pair is well represented by a manually annotated image, a condition that differs from real-world translation scenarios. In this work, we adopt the universal multimodal machine translation framework proposed by Tang et al. (2022), which allows us to examine the impact of the visual modality on translation quality using real-world translation datasets. Through a comprehensive set of probing tasks, we find that the visual modality is advantageous for the majority of authentic translation datasets, and that translation performance hinges primarily on the alignment and coherence between textual and visual content. Our results further suggest that visual information plays a supplementary role in multimodal translation and can be substituted.
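Since the abstract's central finding is that translation performance hinges on the alignment between textual and visual content, a common way to quantify such alignment is the cosine similarity between a text embedding and an image embedding (e.g., from a CLIP-style encoder). The sketch below is illustrative only: the embedding values are toy stand-ins, not real encoder output, and this metric is an assumption rather than the paper's own probing procedure.

```python
import math

def cosine_alignment(text_emb, image_emb):
    """Cosine similarity between a text embedding and an image embedding.

    Both inputs are equal-length sequences of floats; values near 1
    indicate strong text-image alignment, values near 0 little relation.
    """
    dot = sum(t * v for t, v in zip(text_emb, image_emb))
    text_norm = math.sqrt(sum(t * t for t in text_emb))
    image_norm = math.sqrt(sum(v * v for v in image_emb))
    return dot / (text_norm * image_norm)

# Toy stand-ins for real encoder outputs (hypothetical values):
text_emb = [0.2, 0.7, 0.1, 0.5]
image_emb = [0.3, 0.6, 0.0, 0.4]
score = cosine_alignment(text_emb, image_emb)  # value in [-1, 1]
```

A threshold on such a score could, in principle, be used to filter sentence-image pairs by how coherently the image reflects the sentence, which is the property the paper identifies as decisive.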

References (59)
  1. Alfred V. Aho and Jeffrey D. Ullman. 1972. The Theory of Parsing, Translation and Compiling, volume 1. Prentice-Hall, Englewood Cliffs, NJ.
  2. American Psychological Association. 1983. Publications Manual. American Psychological Association, Washington, DC.
  3. Rie Kubota Ando and Tong Zhang. 2005. A framework for learning predictive structures from multiple tasks and unlabeled data. Journal of Machine Learning Research, 6:1817–1853.
  4. Galen Andrew and Jianfeng Gao. 2007. Scalable training of L1-regularized log-linear models. In Proceedings of the 24th International Conference on Machine Learning, pages 33–40.
  5. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
  6. Findings of the third shared task on multimodal machine translation. In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, volume 2, pages 308–327.
  7. Multimodal attention for neural machine translation. arXiv preprint arXiv:1609.03976.
  8. Probing the need for visual context in multimodal machine translation. In Proceedings of the 2019 Conference of the North, pages 4159–4170. Association for Computational Linguistics.
  9. Iacer Calixto and Qun Liu. 2017. Incorporating global visual features into attention-based neural machine translation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 992–1003.
  10. Doubly-attentive decoder for multi-modal neural machine translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1913–1924.
  11. Alternation. Journal of the Association for Computing Machinery, 28(1):114–133.
  12. On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259.
  13. Christos Christodouloupoulos and Mark Steedman. 2015. A massively parallel corpus: The Bible in 100 languages. Language Resources and Evaluation, 49(2):375–395.
  14. Better hypothesis testing for statistical machine translation: Controlling for optimizer instability. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 176–181.
  15. James W. Cooley and John W. Tukey. 1965. An algorithm for the machine calculation of complex Fourier series. Mathematics of Computation, 19(90):297–301.
  16. Jean-Benoit Delbrouck and Stéphane Dupont. 2017a. An empirical study on the effectiveness of images in multimodal neural machine translation. arXiv preprint arXiv:1707.00995.
  17. Jean-Benoit Delbrouck and Stephane Dupont. 2017b. Multimodal compact bilinear pooling for multimodal neural machine translation. arXiv preprint arXiv:1703.08084.
  18. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  19. MultiUN: A multilingual corpus from United Nation documents. In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC’10).
  20. Desmond Elliott. 2018. Adversarial evaluation of multimodal machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2974–2978.
  21. Findings of the second shared task on multimodal machine translation and multilingual image description. In Proceedings of the Second Conference on Machine Translation, pages 215–233.
  22. Multilingual image description with neural sequence models. arXiv preprint arXiv:1510.04709.
  23. Multi30k: Multilingual English-German image descriptions. arXiv preprint arXiv:1605.00459.
  24. A convolutional encoder model for neural machine translation. arXiv preprint arXiv:1611.02344.
  25. Alex Graves. 2012. Long short-term memory. Supervised sequence labelling with recurrent neural networks, pages 37–45.
  26. The MeMAD submission to the WMT18 multimodal translation task. In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, pages 603–611, Belgium, Brussels. Association for Computational Linguistics.
  27. The MeMAD submission to the WMT18 multimodal translation task. arXiv preprint arXiv:1808.10802.
  28. Dan Gusfield. 1997. Algorithms on Strings, Trees and Sequences. Cambridge University Press, Cambridge, UK.
  29. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778.
  30. Attention-based multimodal neural machine translation. In Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers, pages 639–645.
  31. Distilling translations with visual awareness. arXiv preprint arXiv:1906.07701.
  32. OpenNMT: Open-source toolkit for neural machine translation. In Proceedings of ACL 2017, System Demonstrations, pages 67–72.
  33. The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale. International Journal of Computer Vision, 128(7):1956–1981.
  34. Sheffield submissions for WMT18 multimodal translation shared task. In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, pages 624–631, Belgium, Brussels. Association for Computational Linguistics.
  35. On vision features in multimodal machine translation. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6327–6337, Dublin, Ireland. Association for Computational Linguistics.
  36. Explicit sentence compression for neural machine translation. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 8311–8318.
  37. Data-dependent gaussian prior objective for language generation. In Eighth International Conference on Learning Representations.
  38. Dynamic context-guided capsule network for multimodal machine translation. In Proceedings of the 28th ACM International Conference on Multimedia, pages 1320–1329.
  39. Microsoft COCO: Common objects in context. In European conference on computer vision, pages 740–755. Springer.
  40. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 311–318.
  41. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR.
  42. Mohammad Sadegh Rasooli and Joel R. Tetreault. 2015. Yara parser: A fast and accurate dependency parser. Computing Research Repository, arXiv:1503.06733. Version 2.
  43. Faster R-CNN: Towards real-time object detection with region proposal networks. Advances in neural information processing systems, 28.
  44. A shared task on multimodal machine translation and crosslingual image description. In Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers, pages 543–553.
  45. Multi-modal neural machine translation with deep semantic interactions. Information Sciences, 554:47–60.
  46. Multimodal neural machine translation with search engine based image retrieval. In Proceedings of the 9th Workshop on Asian Translation, pages 89–98.
  47. YFCC100M: The new data in multimedia research. Communications of the ACM, 59(2):64–73.
  48. Jörg Tiedemann. 2012. Parallel data, tools and interfaces in OPUS. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12), Istanbul, Turkey. European Language Resources Association (ELRA).
  49. Peter D Turney. 2000. Learning algorithms for keyphrase extraction. Information retrieval, 2(4):303–336.
  50. Attention is all you need. Advances in neural information processing systems, 30.
  51. Sequence to sequence - video to text. In Proceedings of the IEEE international conference on computer vision, pages 4534–4542.
  52. Kea: Practical automated keyphrase extraction. In Design and Usability of Digital Libraries: Case Studies in the Asia Pacific, pages 129–152. IGI global.
  53. Good for misconceived reasons: An empirical revisiting on the need for visual context in multimodal machine translation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 6153–6166, Online. Association for Computational Linguistics.
  54. Show, attend and tell: Neural image caption generation with visual attention. In International conference on machine learning, pages 2048–2057. PMLR.
  55. NICT-NAIST system for WMT17 multimodal translation task. In Proceedings of the Second Conference on Machine Translation, pages 477–482.
  56. Neural machine translation with universal visual representation. In International Conference on Learning Representations.
  57. Semantics-aware BERT for language understanding. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 9628–9635.
  58. Word-region alignment-guided multimodal neural machine translation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 30:244–259.
  59. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE international conference on computer vision, pages 19–27.
Authors (6)
  1. Zi Long (9 papers)
  2. Zhenhao Tang (7 papers)
  3. Xianghua Fu (11 papers)
  4. Jian Chen (257 papers)
  5. Shilong Hou (2 papers)
  6. Jinze Lyu (1 paper)
Citations (2)