RC3: Regularized Contrastive Cross-lingual Cross-modal Pre-training (2305.07927v1)

Published 13 May 2023 in cs.CL

Abstract: Multilingual vision-language (V&L) pre-training has achieved remarkable progress in learning universal representations across different modalities and languages. Despite this recent success, challenges remain that limit further improvement of V&L pre-trained models in multilingual settings. In particular, current V&L pre-training methods rely heavily on strictly-aligned multilingual image-text pairs generated from English-centric datasets through machine translation, yet the cost of collecting and translating such strictly-aligned datasets is usually prohibitive. In this paper, we propose Regularized Contrastive Cross-lingual Cross-modal (RC3) pre-training, which further exploits more abundant weakly-aligned multilingual image-text pairs. Specifically, we design a regularized cross-lingual visio-textual contrastive learning objective that constrains the representation proximity of weakly-aligned visio-textual inputs according to textual relevance. In addition, existing V&L pre-training approaches mainly handle visual inputs as either region-of-interest (ROI) features or patch embeddings; we flexibly integrate both forms of visual features into our model for pre-training and downstream multi-modal tasks. Extensive experiments on 5 downstream multi-modal tasks across 6 languages demonstrate that our method outperforms competitive contrastive baselines and exhibits stronger zero-shot capability.
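
To make the regularized contrastive objective concrete, here is a minimal PyTorch sketch of a relevance-weighted cross-modal contrastive loss in the spirit of what the abstract describes. The function name `rc3_contrastive_loss`, the `relevance` scores, and the tensor shapes are illustrative assumptions, not the paper's actual implementation; the idea is that weakly-aligned image-text pairs are pulled together only in proportion to their estimated textual relevance.

```python
import torch
import torch.nn.functional as F

def rc3_contrastive_loss(img_emb, txt_emb, relevance, temperature=0.07):
    """Sketch of a relevance-weighted cross-modal contrastive loss.

    img_emb:   (B, D) image representations
    txt_emb:   (B, D) text representations (possibly weakly aligned)
    relevance: (B,) scores in [0, 1] estimating how well each
               image-text pair is aligned (hypothetical input, e.g.
               derived from textual similarity across languages).
    """
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)

    logits = img_emb @ txt_emb.t() / temperature   # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)

    # Standard symmetric InfoNCE terms (image->text and text->image).
    loss_i2t = F.cross_entropy(logits, targets, reduction="none")
    loss_t2i = F.cross_entropy(logits.t(), targets, reduction="none")

    # Regularization: down-weight the attraction for weakly aligned
    # pairs, so their representations are pulled together only in
    # proportion to their estimated textual relevance.
    per_pair = 0.5 * (loss_i2t + loss_t2i)
    return (relevance * per_pair).mean()
```

With `relevance` fixed at 1.0 for every pair, this reduces to the standard symmetric InfoNCE loss one would apply to strictly-aligned data.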

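The abstract also notes that RC3 consumes visual inputs either as ROI features or as patch embeddings. Below is a minimal, hypothetical sketch of how a model might project the two forms into one shared visual token sequence; the module name, dimensions, and forward signature are assumptions for illustration, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class FlexibleVisualEmbedder(nn.Module):
    """Hypothetical module mapping either detector ROI features or raw
    image patches into a shared visual embedding space, so one encoder
    can consume whichever form is available."""

    def __init__(self, roi_dim=2048, patch_size=16, hidden=768):
        super().__init__()
        self.roi_proj = nn.Linear(roi_dim, hidden)        # detector ROI features
        self.patch_proj = nn.Conv2d(3, hidden,
                                    kernel_size=patch_size,
                                    stride=patch_size)    # ViT-style patchify

    def forward(self, rois=None, image=None):
        seqs = []
        if rois is not None:                  # (B, N, roi_dim)
            seqs.append(self.roi_proj(rois))
        if image is not None:                 # (B, 3, H, W)
            p = self.patch_proj(image)        # (B, hidden, H/ps, W/ps)
            seqs.append(p.flatten(2).transpose(1, 2))
        assert seqs, "provide rois and/or image"
        return torch.cat(seqs, dim=1)         # unified visual token sequence
```

Projecting both forms into the same token sequence lets the downstream cross-modal encoder stay agnostic to which visual representation a given pre-training corpus or task provides.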
Authors (6)
  1. Chulun Zhou (13 papers)
  2. Yunlong Liang (33 papers)
  3. Fandong Meng (174 papers)
  4. Jinan Xu (64 papers)
  5. Jinsong Su (96 papers)
  6. Jie Zhou (687 papers)