Adding Multimodal Capabilities to a Text-only Translation Model (2403.03045v1)

Published 5 Mar 2024 in cs.CL

Abstract: While most current work in multimodal machine translation (MMT) uses the Multi30k dataset for training and evaluation, we find that the resulting models overfit to the Multi30k dataset to an extreme degree. Consequently, these models perform very poorly when evaluated on standard text-only test sets such as the WMT newstest datasets. To perform well on both Multi30k and typical text-only datasets, we use a performant text-only machine translation (MT) model as the starting point of our MMT model. We add vision-text adapter layers connected via gating mechanisms to the MT model and incrementally transform it into an MMT model by 1) pre-training using vision-based masking of the source text and 2) fine-tuning on Multi30k.
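As a concrete illustration of the gated vision-text adapter idea in the abstract, here is a minimal PyTorch sketch. All names, dimensions, and the zero-initialized tanh gate are assumptions for illustration (zero-initialized gating is a known way to make an added module a no-op at the start of training, as in Flamingo-style adapters); the paper's actual architecture may differ.

```python
# Minimal sketch (assumption, not the paper's exact code): a gated
# vision-text adapter that could be inserted into a text-only MT encoder.
import torch
import torch.nn as nn

class GatedVisionTextAdapter(nn.Module):
    """Cross-attends from text hidden states to image features, then adds
    the result through a learned gate initialized to zero, so the module
    contributes nothing at the start of training and the underlying
    text-only MT model's behavior is preserved."""
    def __init__(self, d_model: int, d_image: int, n_heads: int = 8):
        super().__init__()
        self.img_proj = nn.Linear(d_image, d_model)  # map image features into model space
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)
        self.gate = nn.Parameter(torch.zeros(1))     # tanh(0) = 0: adapter starts as a no-op

    def forward(self, text_states: torch.Tensor, image_feats: torch.Tensor) -> torch.Tensor:
        # text_states: (batch, src_len, d_model); image_feats: (batch, n_patches, d_image)
        img = self.img_proj(image_feats)
        attended, _ = self.cross_attn(self.norm(text_states), img, img)
        return text_states + torch.tanh(self.gate) * attended

# Usage example with hypothetical shapes (e.g. CLIP-style patch features):
adapter = GatedVisionTextAdapter(d_model=512, d_image=768)
text = torch.randn(2, 20, 512)    # source-token hidden states
image = torch.randn(2, 49, 768)   # image patch features
out = adapter(text, image)        # same shape as `text`
```

In the two-stage recipe the abstract describes, a module like this would first be pre-trained while visually grounded source tokens are masked, forcing the model to draw the missing information from the image, and the whole model would then be fine-tuned on Multi30k.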

Authors (5)
  1. Vipin Vijayan
  2. Braeden Bowen
  3. Scott Grigsby
  4. Timothy Anderson
  5. Jeremy Gwinnup
Citations (3)
