Detecting Concrete Visual Tokens for Multimodal Machine Translation (2403.03075v1)
Abstract: The challenge of visual grounding and masking in multimodal machine translation (MMT) has prompted varied approaches to detecting and selecting visually grounded text tokens to mask. We introduce new methods for detecting visually and contextually relevant (concrete) tokens in source sentences: NLP-based detection, object-detection-based detection, and a joint detection-verification technique. We also introduce new strategies for selecting among the detected tokens: the shortest $n$ tokens, the longest $n$ tokens, or all detected concrete tokens. Using the GRAM MMT architecture, we train models on synthetically collated multimodal datasets that pair source images with masked sentences, and we show gains over the baseline model in both translation performance and use of visual context.
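The detection-then-selection pipeline described in the abstract can be illustrated in code. Below is a minimal sketch of the NLP-based detection variant plus the shortest-$n$/longest-$n$/all selection strategies, assuming an NLTK/WordNet concreteness heuristic (a noun counts as concrete if one of its senses descends from WordNet's `physical_entity.n.01` synset). The function names, the `[MASK]` symbol, and the heuristic itself are illustrative assumptions, not the paper's exact implementation.

```python
# Sketch: detect concrete tokens with NLP, then select and mask them.
# Assumptions are noted inline; this is not the paper's reference code.

import nltk
from nltk.corpus import wordnet as wn

# One-time downloads (uncomment on first run):
# nltk.download("punkt")
# nltk.download("averaged_perceptron_tagger")
# nltk.download("wordnet")

def is_concrete(word: str) -> bool:
    """Assumed heuristic: a word is concrete if any noun sense of it
    has physical_entity.n.01 somewhere in its hypernym paths."""
    physical = wn.synset("physical_entity.n.01")
    for synset in wn.synsets(word, pos=wn.NOUN):
        # hypernym_paths() lists synset chains from the WordNet root down.
        if any(physical in path for path in synset.hypernym_paths()):
            return True
    return False

def detect_concrete_tokens(sentence: str) -> list[str]:
    """POS-tag the sentence and keep nouns that pass the concreteness test."""
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    return [tok for tok, tag in tagged
            if tag.startswith("NN") and is_concrete(tok.lower())]

def select_tokens(detected: list[str], strategy: str = "all",
                  n: int = 3) -> list[str]:
    """Selection strategies from the abstract: shortest n, longest n, or all."""
    if strategy == "shortest":
        return sorted(detected, key=len)[:n]
    if strategy == "longest":
        return sorted(detected, key=len, reverse=True)[:n]
    return detected  # "all detected concrete tokens"

def mask_sentence(sentence: str, to_mask: list[str],
                  mask: str = "[MASK]") -> str:
    """Replace each selected token with a mask symbol."""
    masked = {t.lower() for t in to_mask}
    return " ".join(mask if tok.lower() in masked else tok
                    for tok in nltk.word_tokenize(sentence))

if __name__ == "__main__":
    sent = "A man in a red jacket rides a bicycle past the old church."
    detected = detect_concrete_tokens(sent)            # e.g. man, jacket, bicycle, church
    selected = select_tokens(detected, "shortest", n=2)
    print(mask_sentence(sent, selected))               # masked source sentence
```

In the paper's setting, the masked sentence and its paired source image would then be fed to the GRAM MMT model, which must draw on visual context to recover the masked concrete tokens during translation.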