
Cross-View Language Modeling: Towards Unified Cross-Lingual Cross-Modal Pre-training (2206.00621v2)

Published 1 Jun 2022 in cs.CL, cs.AI, cs.CV, and cs.LG

Abstract: In this paper, we introduce Cross-View Language Modeling, a simple and effective pre-training framework that unifies cross-lingual and cross-modal pre-training with shared architectures and objectives. Our approach is motivated by a key observation that cross-lingual and cross-modal pre-training share the same goal of aligning two different views of the same object into a common semantic space. To this end, the cross-view language modeling framework considers both multi-modal data (i.e., image-caption pairs) and multi-lingual data (i.e., parallel sentence pairs) as two different views of the same object, and trains the model to align the two views by maximizing the mutual information between them with conditional masked language modeling and contrastive learning. We pre-train CCLM, a Cross-lingual Cross-modal Language Model, with the cross-view language modeling framework. Empirical results on IGLUE, a multi-lingual multi-modal benchmark, and two multi-lingual image-text retrieval datasets show that while conceptually simpler, CCLM significantly outperforms the prior state-of-the-art with an average absolute improvement of over 10%. Moreover, CCLM is the first multi-lingual multi-modal pre-trained model that surpasses the translate-test performance of representative English vision-language models by zero-shot cross-lingual transfer.

Cross-View Language Modeling: Towards Unified Cross-Lingual Cross-Modal Pre-training

The paper "Cross-View LLMing: Towards Unified Cross-Lingual Cross-Modal Pre-training" introduces a novel pre-training framework named Cross-View LLMing (CVLM). The primary goal of this framework is to unify the methodologies employed in cross-lingual and cross-modal pre-training. Cross-lingual pre-training typically focuses on alignments across different languages, while cross-modal pre-training aligns representations across different data modalities, such as text and images. This paper posits that both tasks fundamentally aim to align disparate data views within a shared semantic space.

Methodology

The core innovation of this paper is treating multi-modal data (image-caption pairs) and multi-lingual data (parallel sentence pairs) as different views of the same semantic object. The CVLM framework leverages Transformer-based architectures, applying conditional masked language modeling and contrastive learning to maximize the mutual information between paired data views. Specifically, both multi-modal and multi-lingual data serve interchangeably as input to CVLM, with representations fused through a shared cross-attention mechanism.
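
As a concrete illustration of the contrastive half of this objective, the following is a minimal sketch (not the authors' released code) of an InfoNCE-style cross-view contrastive loss; minimizing it over paired embeddings corresponds to maximizing a lower bound on the mutual information between the two views. The tensor shapes, temperature value, and function names are illustrative assumptions.

```python
# Minimal sketch of a cross-view contrastive (InfoNCE) loss: two "views" of
# the same object (an image and its caption, or a sentence and its
# translation) are encoded separately, and paired views are pulled together
# in a shared space against in-batch negatives.
import torch
import torch.nn.functional as F

def cross_view_contrastive_loss(view_a, view_b, temperature=0.07):
    """view_a, view_b: (batch, dim) embeddings of paired views."""
    a = F.normalize(view_a, dim=-1)
    b = F.normalize(view_b, dim=-1)
    logits = a @ b.t() / temperature                      # (batch, batch) similarities
    targets = torch.arange(a.size(0), device=a.device)    # true pair sits on the diagonal
    # Symmetric cross-entropy: each view must retrieve its pair in both directions.
    loss_a2b = F.cross_entropy(logits, targets)
    loss_b2a = F.cross_entropy(logits.t(), targets)
    return (loss_a2b + loss_b2a) / 2

if __name__ == "__main__":
    img_emb = torch.randn(8, 256)   # hypothetical image-encoder outputs
    txt_emb = torch.randn(8, 256)   # hypothetical text-encoder outputs
    print(cross_view_contrastive_loss(img_emb, txt_emb).item())
```

In the unified setting described above, the same loss would be applied whether the pair is (image, caption) or (sentence, translation); the conditional masked language modeling term would be added on top of this contrastive term.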

Model and Pre-training Strategy

The framework is instantiated in the Cross-lingual Cross-modal Language Model (CCLM), which combines pre-trained image and text encoders with a fusion model that produces an integrated representation. CCLM is evaluated across multiple tasks, benefiting from the mutual information maximization achieved through the multi-type input data. The modularity of CCLM's architecture allows it to transition seamlessly between tasks using image-caption pairs or parallel sentence pairs without altering the underlying training process.
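
The summary above describes a dual-encoder-plus-fusion layout in which the text stream cross-attends to the other view. The sketch below, with assumed layer sizes and module names (it is not the paper's implementation), shows how a single fusion layer of this kind can accept either image patches or tokens of a parallel sentence as the second view:

```python
# Sketch of one cross-attention fusion layer: the text stream attends to a
# second view, which may be image patches or a parallel sentence.
import torch
import torch.nn as nn

class CrossAttentionFusionLayer(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(dim) for _ in range(3))

    def forward(self, text_tokens, other_view_tokens):
        # Self-attention over the text stream (pre-norm residual block).
        h = self.norm1(text_tokens)
        x = text_tokens + self.self_attn(h, h, h)[0]
        # Cross-attention: text queries attend to the other view (image
        # patches or tokens of a parallel sentence), so the same fusion
        # weights serve both cross-modal and cross-lingual inputs.
        q = self.norm2(x)
        x = x + self.cross_attn(q, other_view_tokens, other_view_tokens)[0]
        # Position-wise feed-forward block.
        x = x + self.ffn(self.norm3(x))
        return x

if __name__ == "__main__":
    fusion = CrossAttentionFusionLayer()
    text = torch.randn(2, 16, 256)    # (batch, text tokens, dim)
    other = torch.randn(2, 49, 256)   # (batch, image patches or sentence tokens, dim)
    print(fusion(text, other).shape)  # torch.Size([2, 16, 256])
```

Because the fusion layer only sees a sequence of feature vectors for the second view, the same weights can be shared across cross-modal and cross-lingual inputs, which is the architectural property the unified framework relies on.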

Experimental Results

CCLM was empirically validated on the IGLUE benchmark, which encompasses diverse vision-language understanding and retrieval tasks across multiple languages, as well as on two multi-lingual image-text retrieval datasets. CCLM demonstrated an average improvement exceeding 10% over prior state-of-the-art multi-lingual multi-modal models. Furthermore, its zero-shot cross-lingual transfer surpassed the translate-test performance of representative English vision-language models, validating the utility of the unified pre-training approach.

Moreover, the ablation studies substantiate the necessity of shared architectures and objectives for effective cross-lingual and cross-modal transfer. The inclusion of parallel text data, largely neglected by previous multi-lingual multi-modal models, proved pivotal in enhancing representations within the common semantic space.

Practical and Theoretical Implications

From a practical perspective, the CVLM framework, particularly in its CCLM instantiation, offers a significant advance in the applicability of multi-lingual, multi-modal pre-trained models. By narrowing the performance gap on non-English tasks, CCLM broadens the range of real-world applications for multi-modal systems. Theoretically, CVLM underscores the viability of generalized pre-training strategies in which cross-lingual and cross-modal tasks are unified rather than treated separately, drawing on the synergies between these traditionally distinct areas.

Future Directions

The research sets a compelling precedent for future work on integrating additional modalities, such as audio and video, within the CVLM framework. Expanding CVLM to broader modalities could further generalize the pre-training recipe and inspire new approaches to unified model pre-training.

In conclusion, this paper constitutes a significant advancement in the integration of cross-lingual and cross-modal learning, reflecting promising avenues for both theoretical research and practical applications in AI.

Authors (5)
  1. Yan Zeng (46 papers)
  2. Wangchunshu Zhou (73 papers)
  3. Ao Luo (30 papers)
  4. Ziming Cheng (6 papers)
  5. Xinsong Zhang (13 papers)
Citations (26)