
Cross-View Language Modeling: Towards Unified Cross-Lingual Cross-Modal Pre-training (2206.00621v2)

Published 1 Jun 2022 in cs.CL, cs.AI, cs.CV, and cs.LG

Abstract: In this paper, we introduce Cross-View Language Modeling, a simple and effective pre-training framework that unifies cross-lingual and cross-modal pre-training with shared architectures and objectives. Our approach is motivated by a key observation that cross-lingual and cross-modal pre-training share the same goal of aligning two different views of the same object into a common semantic space. To this end, the cross-view language modeling framework considers both multi-modal data (i.e., image-caption pairs) and multi-lingual data (i.e., parallel sentence pairs) as two different views of the same object, and trains the model to align the two views by maximizing the mutual information between them with conditional masked language modeling and contrastive learning. We pre-train CCLM, a Cross-lingual Cross-modal Language Model, with the cross-view language modeling framework. Empirical results on IGLUE, a multi-lingual multi-modal benchmark, and two multi-lingual image-text retrieval datasets show that while conceptually simpler, CCLM significantly outperforms the prior state-of-the-art with an average absolute improvement of over 10%. Moreover, CCLM is the first multi-lingual multi-modal pre-trained model that surpasses the translate-test performance of representative English vision-language models by zero-shot cross-lingual transfer.

Cross-View Language Modeling: Towards Unified Cross-Lingual Cross-Modal Pre-training

The paper "Cross-View LLMing: Towards Unified Cross-Lingual Cross-Modal Pre-training" introduces a novel pre-training framework named Cross-View LLMing (CVLM). The primary goal of this framework is to unify the methodologies employed in cross-lingual and cross-modal pre-training. Cross-lingual pre-training typically focuses on alignments across different languages, while cross-modal pre-training aligns representations across different data modalities, such as text and images. This paper posits that both tasks fundamentally aim to align disparate data views within a shared semantic space.

Methodology

The core innovation of this paper is treating multi-modal data (image-caption pairs) and multi-lingual data (parallel sentence pairs) as different views of the same semantic object. The CVLM framework leverages Transformer-based architectures, applying conditional masked language modeling and contrastive learning to maximize the mutual information between paired data views. Specifically, both multi-modal and multi-lingual data serve interchangeably as input to CVLM, with representations fused through a shared cross-attention mechanism.
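
As a concrete illustration of the contrastive half of this objective, the following is a minimal sketch (not the authors' released code) of an InfoNCE-style cross-view contrastive loss; minimizing it over paired embeddings corresponds to maximizing a lower bound on the mutual information between the two views. The tensor shapes, temperature value, and function names are illustrative assumptions.

```python
# Minimal sketch of a cross-view contrastive (InfoNCE) loss: two "views" of
# the same object (an image and its caption, or a sentence and its
# translation) are encoded separately, and paired views are pulled together
# in a shared space against in-batch negatives.
import torch
import torch.nn.functional as F

def cross_view_contrastive_loss(view_a, view_b, temperature=0.07):
    """view_a, view_b: (batch, dim) embeddings of paired views."""
    a = F.normalize(view_a, dim=-1)
    b = F.normalize(view_b, dim=-1)
    logits = a @ b.t() / temperature                      # (batch, batch) similarities
    targets = torch.arange(a.size(0), device=a.device)    # true pair sits on the diagonal
    # Symmetric cross-entropy: each view must retrieve its pair in both directions.
    loss_a2b = F.cross_entropy(logits, targets)
    loss_b2a = F.cross_entropy(logits.t(), targets)
    return (loss_a2b + loss_b2a) / 2

if __name__ == "__main__":
    img_emb = torch.randn(8, 256)   # hypothetical image-encoder outputs
    txt_emb = torch.randn(8, 256)   # hypothetical text-encoder outputs
    print(cross_view_contrastive_loss(img_emb, txt_emb).item())
```

In the unified setting described above, the same loss would be applied whether the pair is (image, caption) or (sentence, translation); the conditional masked language modeling term would be added on top of this contrastive term.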

Model and Pre-training Strategy

The framework is instantiated in the Cross-lingual Cross-modal Language Model (CCLM), which combines pre-trained image and text encoders with a fusion model that produces an integrated representation. CCLM is evaluated across multiple tasks, benefiting from the mutual information maximization achieved through the multi-type input data. The modularity of CCLM's architecture allows it to transition seamlessly between tasks using image-caption pairs or parallel sentence pairs without altering the underlying training process.
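
The summary above describes a dual-encoder-plus-fusion layout in which the text stream cross-attends to the other view. The sketch below, with assumed layer sizes and module names (it is not the paper's implementation), shows how a single fusion layer of this kind can accept either image patches or tokens of a parallel sentence as the second view:

```python
# Sketch of one cross-attention fusion layer: the text stream attends to a
# second view, which may be image patches or a parallel sentence.
import torch
import torch.nn as nn

class CrossAttentionFusionLayer(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(dim) for _ in range(3))

    def forward(self, text_tokens, other_view_tokens):
        # Self-attention over the text stream (pre-norm residual block).
        h = self.norm1(text_tokens)
        x = text_tokens + self.self_attn(h, h, h)[0]
        # Cross-attention: text queries attend to the other view (image
        # patches or tokens of a parallel sentence), so the same fusion
        # weights serve both cross-modal and cross-lingual inputs.
        q = self.norm2(x)
        x = x + self.cross_attn(q, other_view_tokens, other_view_tokens)[0]
        # Position-wise feed-forward block.
        x = x + self.ffn(self.norm3(x))
        return x

if __name__ == "__main__":
    fusion = CrossAttentionFusionLayer()
    text = torch.randn(2, 16, 256)    # (batch, text tokens, dim)
    other = torch.randn(2, 49, 256)   # (batch, image patches or sentence tokens, dim)
    print(fusion(text, other).shape)  # torch.Size([2, 16, 256])
```

Because the fusion layer only sees a sequence of feature vectors for the second view, the same weights can be shared across cross-modal and cross-lingual inputs, which is the architectural property the unified framework relies on.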

Experimental Results

CCLM was empirically validated on the IGLUE benchmark, which encompasses diverse vision-language understanding and retrieval tasks across multiple languages, as well as on two multi-lingual image-text retrieval datasets. CCLM demonstrated an average improvement exceeding 10% over prior state-of-the-art multi-lingual multi-modal models. Furthermore, its zero-shot cross-lingual transfer surpassed the translate-test performance of representative English vision-language models, validating the utility of the unified pre-training approach.

Moreover, the ablation studies substantiate the necessity of shared architectures and objectives for effective cross-lingual and cross-modal transfer. The inclusion of parallel text data, largely neglected by previous multi-lingual multi-modal models, proved pivotal in enhancing representations within the common semantic space.

Practical and Theoretical Implications

From a practical perspective, the CVLM framework, particularly in its CCLM instantiation, offers a significant advance in the applicability of multi-lingual, multi-modal pre-trained models. By narrowing the performance gap on non-English tasks, CCLM broadens the range of real-world applications for multi-modal systems. Theoretically, CVLM underscores the viability of generalized pre-training strategies in which cross-lingual and cross-modal tasks are unified rather than treated separately, drawing on the synergies between these traditionally distinct areas.

Future Directions

The research sets a compelling precedent for future work on integrating additional modalities, such as audio and video, within the CVLM framework. Expanding CVLM to broader modalities could further generalize the pre-training recipe and inspire new approaches to unified model pre-training.

In conclusion, this paper constitutes a significant advancement in the integration of cross-lingual and cross-modal learning, reflecting promising avenues for both theoretical research and practical applications in AI.

Authors (5)
  1. Yan Zeng (46 papers)
  2. Wangchunshu Zhou (73 papers)
  3. Ao Luo (30 papers)
  4. Ziming Cheng (6 papers)
  5. Xinsong Zhang (13 papers)
Citations (26)