Overview of Large-Scale Multi-Modal Pre-trained Models
The paper "Large-scale Multi-Modal Pre-trained Models: A Comprehensive Survey" provides an expansive exploration of the landscape of multi-modal pre-trained models (MM-PTMs), which are increasingly gaining traction in artificial intelligence research. This survey is a thorough examination of these models, drawing on foundational work in single-modality pre-training and extending into the complexities of multi-modal environments. The paper is a valuable reference for researchers seeking to grasp the current state and future directions of MM-PTMs.
Background and Motivation
The authors begin with a discussion of the historical context, emphasizing the breakthroughs in recognition performance achieved by networks like AlexNet, which laid the groundwork for contemporary deep learning models. However, despite the progress in single domains such as computer vision, natural language processing, and speech processing, challenges in generalization prompted the emergence of models that can leverage multiple modalities. This is where MM-PTMs have made significant advancements, aiming to capture and synthesize information from diverse sources like text, images, and audio.
Framework for Multi-Modal Pre-training
The paper identifies the core components of multi-modal pre-training, detailing task definitions and the key challenges in this domain: acquiring and processing large-scale multi-modal data, designing network architectures capable of fusing information across modalities, and developing effective pre-training objectives. The authors also consider the computational demands of training these models, which often require supercomputing resources given the scale of their parameters and data.
Pre-training Data and Objectives
A significant portion of the survey is devoted to the datasets and objectives used to train MM-PTMs. The paper catalogs numerous datasets that underpin multi-modal research and offers a detailed critique of pre-training objectives such as contrastive loss, masked language modeling, and cross-modal alignment. These components are essential for achieving the representational richness needed for downstream tasks like image captioning, visual question answering, and multi-modal machine translation.
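To make the contrastive objective concrete, the sketch below shows a symmetric, CLIP-style image-text contrastive (InfoNCE) loss in PyTorch. This is an illustration rather than the survey's own formulation; the function name, the temperature value, and the assumption of pre-computed, paired embeddings are ours.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb: torch.Tensor,
                     text_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss; row i of image_emb and text_emb is a matched pair."""
    # L2-normalize so dot products become cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (batch, batch) similarity matrix; the diagonal holds the positive pairs.
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Contrast in both directions: image-to-text and text-to-image.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```

Masked language modeling and cross-modal alignment objectives are typically combined with such a loss, each encouraging the model to reconstruct or match information across modalities.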
Network Architectures and Knowledge Integration
The authors review various network architectures, prominently featuring the Transformer as a unifying model for multi-modal inputs. They discuss both single-stream and cross-stream architectures and delve into the mechanisms of modality interactive learning, such as co-attention and cross-attention layers. Moreover, the integration of structured and unstructured knowledge to enhance pre-training models is examined, pointing to future research opportunities in knowledge fusion and reasoning within these frameworks.
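To make the cross-stream idea concrete, here is a minimal PyTorch sketch (illustrative, not taken from the survey; module names and dimensions are assumptions) of a cross-attention fusion block, in which tokens of one modality form the queries while tokens of the other supply the keys and values. Applying such a block in both directions yields the co-attention pattern the survey describes.

```python
import torch
import torch.nn as nn

class CrossAttentionBlock(nn.Module):
    """One cross-stream fusion step: modality A's tokens attend to modality B's."""
    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, a_tokens: torch.Tensor, b_tokens: torch.Tensor) -> torch.Tensor:
        # a_tokens: (batch, len_a, dim) queries; b_tokens: (batch, len_b, dim) keys/values.
        fused, _ = self.attn(query=a_tokens, key=b_tokens, value=b_tokens)
        return self.norm(a_tokens + fused)  # residual connection, then LayerNorm

# Co-attention: run the block in both directions so each stream conditions on the other.
text = torch.randn(2, 16, 512)    # e.g., 16 text tokens
image = torch.randn(2, 49, 512)   # e.g., 7x7 grid of image patch features
text_out = CrossAttentionBlock()(text, image)   # text attends to image patches
image_out = CrossAttentionBlock()(image, text)  # image attends to text tokens
```

A single-stream architecture, by contrast, concatenates the token sequences of both modalities and processes them with one shared Transformer using ordinary self-attention.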
Evaluations and Implications
The surveyed models demonstrate improved performance across a range of downstream tasks. The paper emphasizes the need for comprehensive evaluation metrics and benchmarks, presenting evidence from several representative tasks. These results affirm not only the practicality of MM-PTMs but also their emerging role in advancing multi-modal understanding and reasoning, opening new challenges and opportunities for AI development.
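One representative benchmark for MM-PTMs is cross-modal retrieval, commonly reported as Recall@K. Below is a minimal sketch, assuming paired image and text embeddings where row i of each tensor corresponds to the same example (the function name is ours, not from the survey).

```python
import torch
import torch.nn.functional as F

def recall_at_k(image_emb: torch.Tensor, text_emb: torch.Tensor, k: int = 5) -> float:
    """Text-to-image Recall@K: fraction of text queries whose true image ranks in the top k."""
    # Cosine similarity between every text query and every image.
    sims = F.normalize(text_emb, dim=-1) @ F.normalize(image_emb, dim=-1).t()
    topk = sims.topk(k, dim=-1).indices               # top-k image indices per text query
    targets = torch.arange(sims.size(0)).unsqueeze(-1)  # true index for each query
    return (topk == targets).any(dim=-1).float().mean().item()
```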
Future Directions
The survey concludes with discussions on prospective research areas, such as expanding multi-modal capabilities to include a wider range of sensory inputs, refining the pre-training process to be more adaptive and computationally efficient, and enhancing models' logical reasoning abilities through advanced knowledge integration. The call for incremental learning strategies and improved prompt learning methods reflects a broader trend towards making MM-PTMs more dynamic and flexible in real-world applications.
In summary, this comprehensive survey provides an invaluable resource for understanding the current trends and future potential of large-scale multi-modal pre-trained models, catalyzing further advancements in the cohesive integration of diverse information sources within AI systems.