On Vision Features in Multimodal Machine Translation (2203.09173v1)

Published 17 Mar 2022 in cs.CL

Abstract: Previous work on multimodal machine translation (MMT) has focused on how to incorporate vision features into translation, but little attention has been paid to the quality of the vision models themselves. In this work, we investigate the impact of vision models on MMT. Given that Transformers are becoming popular in computer vision, we experiment with various strong models (such as the Vision Transformer) and enhanced features (such as object detection and image captioning). We develop a selective attention model to study the patch-level contribution of an image in MMT. On detailed probing tasks, we find that stronger vision models are helpful for learning translation from the visual modality. Our results also suggest the need for carefully examining MMT models, especially when current benchmarks are small-scale and biased. Our code can be found at \url{https://github.com/libeineu/fairseq_mmt}.
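The selective attention idea described above can be sketched as follows: text decoder states attend over ViT patch embeddings, and a learned gate decides how much visual context to mix back into each text state. This is a minimal illustrative sketch, not the paper's exact implementation; module names, dimensions, and the gating formulation here are assumptions.

```python
import torch
import torch.nn as nn

class SelectiveAttention(nn.Module):
    """Hypothetical sketch of selective attention for MMT:
    text states query image patch features, then a sigmoid gate
    controls how much visual information is fused in."""

    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        # Cross-attention: text queries, image-patch keys/values
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Gate computed from concatenated text and attended-visual states
        self.gate = nn.Linear(2 * d_model, d_model)

    def forward(self, text: torch.Tensor, patches: torch.Tensor) -> torch.Tensor:
        # text:    (batch, seq_len, d_model)  -- e.g. Transformer text states
        # patches: (batch, n_patches, d_model) -- e.g. ViT patch embeddings
        vis, _ = self.cross_attn(query=text, key=patches, value=patches)
        g = torch.sigmoid(self.gate(torch.cat([text, vis], dim=-1)))
        return text + g * vis  # gated residual fusion of visual context

# Usage: 10 text tokens attending over 49 image patches (7x7 ViT grid)
text = torch.randn(2, 10, 512)
patches = torch.randn(2, 49, 512)
out = SelectiveAttention(d_model=512)(text, patches)
print(out.shape)  # torch.Size([2, 10, 512])
```

The gate lets the model suppress visual input for tokens where the image is uninformative, which is one plausible way to probe patch-level contributions as the abstract describes.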

Authors (7)
  1. Bei Li (51 papers)
  2. Chuanhao Lv (3 papers)
  3. Zefan Zhou (3 papers)
  4. Tao Zhou (398 papers)
  5. Tong Xiao (119 papers)
  6. Anxiang Ma (4 papers)
  7. Jingbo Zhu (79 papers)
Citations (56)