UC2: Universal Cross-lingual Cross-modal Vision-and-Language Pre-training (2104.00332v1)

Published 1 Apr 2021 in cs.CV

Abstract: Vision-and-language pre-training has achieved impressive success in learning multimodal representations between vision and language. To generalize this success to non-English languages, we introduce UC2, the first machine translation-augmented framework for cross-lingual cross-modal representation learning. To tackle the scarcity of multilingual captions for image datasets, we first augment existing English-only datasets with other languages via machine translation (MT). Then we extend the standard Masked Language Modeling and Image-Text Matching training objectives to a multilingual setting, where alignment between different languages is captured through shared visual context (i.e., using the image as a pivot). To facilitate the learning of a joint embedding space of images and all languages of interest, we further propose two novel pre-training tasks, namely Masked Region-to-Token Modeling (MRTM) and Visual Translation Language Modeling (VTLM), leveraging MT-enhanced translated data. Evaluation on multilingual image-text retrieval and multilingual visual question answering benchmarks demonstrates that our proposed framework achieves a new state of the art on diverse non-English benchmarks while maintaining performance comparable to monolingual pre-trained models on English tasks.
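The VTLM objective described above masks tokens over a caption concatenated with its machine translation, so the model must recover each masked word from the visual context and the other language. The sketch below illustrates that masking step in plain Python; the function name, signature, and special tokens are illustrative assumptions, not the paper's actual code.

```python
import random

def vtlm_mask(src_tokens, trans_tokens, mask_rate=0.15,
              mask_token="[MASK]", sep_token="[SEP]", seed=0):
    """Sketch of VTLM-style masking (assumed interface, not UC2's code):
    concatenate a caption with its machine-translated counterpart and
    randomly mask tokens in the combined sequence. The masked positions
    become prediction targets conditioned on the image and both languages."""
    rng = random.Random(seed)
    tokens = src_tokens + [sep_token] + trans_tokens
    masked, labels = [], []
    for tok in tokens:
        if tok != sep_token and rng.random() < mask_rate:
            masked.append(mask_token)
            labels.append(tok)   # target token the model must predict
        else:
            masked.append(tok)
            labels.append(None)  # not a prediction target
    return masked, labels

# Example: an English caption paired with its machine-translated German caption
en = ["a", "dog", "runs", "on", "grass"]
de = ["ein", "hund", "läuft", "auf", "gras"]
masked, labels = vtlm_mask(en, de, mask_rate=0.3, seed=42)
```

In the full model, the masked sequence would be fed to the multimodal encoder together with image region features, and the loss is computed only at positions where `labels` is not `None`.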

Authors (7)
  1. Mingyang Zhou (27 papers)
  2. Luowei Zhou (31 papers)
  3. Shuohang Wang (69 papers)
  4. Yu Cheng (354 papers)
  5. Linjie Li (89 papers)
  6. Zhou Yu (206 papers)
  7. Jingjing Liu (139 papers)
Citations (81)