Multilingual Multimodal Pre-training for Zero-Shot Cross-Lingual Transfer of Vision-Language Models (2103.08849v3)

Published 16 Mar 2021 in cs.CV and cs.CL

Abstract: This paper studies zero-shot cross-lingual transfer of vision-language models. Specifically, we focus on multilingual text-to-video search and propose a Transformer-based model that learns contextualized multilingual multimodal embeddings. Under a zero-shot setting, we empirically demonstrate that performance degrades significantly when we query the multilingual text-video model with non-English sentences. To address this problem, we introduce a multilingual multimodal pre-training strategy, and collect a new multilingual instructional video dataset (Multi-HowTo100M) for pre-training. Experiments on VTT show that our method significantly improves video search in non-English languages without additional annotations. Furthermore, when multilingual annotations are available, our method outperforms recent baselines by a large margin in multilingual text-to-video search on VTT and VATEX, as well as in multilingual text-to-image search on Multi30K. Our model and Multi-HowTo100M are available at http://github.com/berniebear/Multi-HT100M.
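The core idea, learning a shared embedding space where captions in any language and videos can be compared directly, is commonly trained with a symmetric contrastive objective over in-batch negatives. The sketch below illustrates that general pattern, not the paper's released implementation: the class and function names are hypothetical, the encoders are stood in by linear projections over pre-pooled features, and the InfoNCE-style loss is an assumption about the training objective.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalEmbedder(nn.Module):
    """Hypothetical sketch: project text and video features into a shared
    embedding space. The paper uses a Transformer to produce contextualized
    multilingual multimodal embeddings; here, pooled encoder outputs and
    simple linear projections stand in for those components."""
    def __init__(self, text_dim=768, video_dim=1024, embed_dim=256):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, embed_dim)
        self.video_proj = nn.Linear(video_dim, embed_dim)

    def forward(self, text_feats, video_feats):
        # L2-normalize so dot products below are cosine similarities.
        t = F.normalize(self.text_proj(text_feats), dim=-1)
        v = F.normalize(self.video_proj(video_feats), dim=-1)
        return t, v

def contrastive_loss(t, v, temperature=0.07):
    """Symmetric InfoNCE over in-batch negatives: matched (text, video)
    pairs are positives; every other pairing in the batch is a negative."""
    logits = t @ v.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(t.size(0))         # diagonal entries are positives
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

# Toy usage with random "encoder outputs" for 8 caption-clip pairs.
model = CrossModalEmbedder()
text_feats = torch.randn(8, 768)    # e.g., pooled multilingual sentence encodings
video_feats = torch.randn(8, 1024)  # e.g., pooled video backbone features
t, v = model(text_feats, video_feats)
loss = contrastive_loss(t, v)
loss.backward()
```

At retrieval time, videos are ranked by cosine similarity to the query embedding, so a query in any language that shares the text encoder's space can retrieve videos zero-shot; this is what makes the degradation on non-English queries, and the gains from multilingual multimodal pre-training, measurable on VTT, VATEX, and Multi30K.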

Authors (6)
  1. Po-Yao Huang (31 papers)
  2. Mandela Patrick (7 papers)
  3. Junjie Hu (111 papers)
  4. Graham Neubig (342 papers)
  5. Florian Metze (79 papers)
  6. Alexander Hauptmann (46 papers)
Citations (54)