Overview of "Towards Artificial General Intelligence via a Multimodal Foundation Model"
The paper "Towards Artificial General Intelligence via a Multimodal Foundation Model," authored by Nanyi Fei et al., presents a comprehensive paper on developing a multimodal foundation model named BriVL (Bridging-Vision-and-Language). This work aims to advance the field towards achieving AGI by leveraging large-scale multimodal data and sophisticated model architectures.
Contributions and Methodology
The central contribution of this paper is the BriVL model, pre-trained on a massive dataset of 650 million image-text pairs collected from the internet. The pairs exhibit only weak semantic correlation (e.g., an image accompanied by loosely related text), in contrast with prior work that typically relies on strongly correlated data such as image-caption pairs. The authors argue that this weak semantic alignment enriches the model's cognitive capability by exposing it to human emotions and abstract concepts that strongly aligned caption data rarely capture.
The BriVL model employs a two-tower architecture, with separate encoders for images and texts, each paired with a momentum-updated counterpart to facilitate cross-modal contrastive learning. This approach, adapted from the MoCo paradigm, maintains queues of negative samples across batches, avoiding the prohibitively large batch sizes other contrastive methods require and thus making the pre-training process more accessible in terms of compute resources.
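To make this scheme concrete, below is a minimal sketch, not the authors' released code, of MoCo-style cross-modal contrastive learning with momentum encoders and negative-sample queues. The linear encoder stand-ins, feature dimensions, queue size, momentum, and temperature are illustrative assumptions, not the paper's actual settings.

```python
# A minimal sketch (not the authors' code) of MoCo-style cross-modal
# contrastive learning with momentum encoders and negative-sample queues.
import torch
import torch.nn.functional as F

dim, queue_size, momentum, temperature = 256, 8192, 0.99, 0.07

# Stand-in encoders: in BriVL these would be an image tower (CNN/ViT) and a
# text tower (Transformer); simple linear layers keep the sketch runnable.
img_enc = torch.nn.Linear(2048, dim)
txt_enc = torch.nn.Linear(768, dim)
img_enc_m = torch.nn.Linear(2048, dim)   # momentum (key) encoders
txt_enc_m = torch.nn.Linear(768, dim)
img_enc_m.load_state_dict(img_enc.state_dict())
txt_enc_m.load_state_dict(txt_enc.state_dict())

# Queues of past keys from the other modality serve as extra negatives.
img_queue = F.normalize(torch.randn(queue_size, dim), dim=1)
txt_queue = F.normalize(torch.randn(queue_size, dim), dim=1)

def momentum_update(online, target, m=momentum):
    """Exponential moving average of the online encoder's weights."""
    for p_o, p_t in zip(online.parameters(), target.parameters()):
        p_t.data.mul_(m).add_(p_o.data, alpha=1.0 - m)

def contrastive_step(img_feats, txt_feats):
    """One training step: image-to-text and text-to-image InfoNCE losses."""
    q_img = F.normalize(img_enc(img_feats), dim=1)           # queries
    q_txt = F.normalize(txt_enc(txt_feats), dim=1)
    with torch.no_grad():
        momentum_update(img_enc, img_enc_m)
        momentum_update(txt_enc, txt_enc_m)
        k_img = F.normalize(img_enc_m(img_feats), dim=1)      # keys
        k_txt = F.normalize(txt_enc_m(txt_feats), dim=1)

    def info_nce(q, k_pos, neg_queue):
        # Positive logit from the matched pair; negatives come from the queue.
        l_pos = (q * k_pos).sum(dim=1, keepdim=True)
        l_neg = q @ neg_queue.t()
        logits = torch.cat([l_pos, l_neg], dim=1) / temperature
        labels = torch.zeros(q.size(0), dtype=torch.long)
        return F.cross_entropy(logits, labels)

    loss = info_nce(q_img, k_txt, txt_queue) + info_nce(q_txt, k_img, img_queue)
    return loss, k_img.detach(), k_txt.detach()   # keys would then be enqueued

# Example step with random stand-in features for a batch of 8 pairs.
loss, new_img_keys, new_txt_keys = contrastive_step(torch.randn(8, 2048), torch.randn(8, 768))
```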
Two crucial innovations are highlighted in this paper:
- Multi-Scale Patch Pooling (MSPP): This module extracts fine-grained, patch-level visual features directly from the image feature map, without relying on computationally intensive object detectors (a schematic sketch follows this list).
- Use of Momentum Encoders: Inspired by the MoCo framework, the use of momentum encoders allows the model to maintain large-scale queues of negative samples, which is critical for the effectiveness of the contrastive learning paradigm applied to this multimodal context.
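Below is a schematic sketch of how multi-scale patch pooling might be realized on a convolutional feature map; the grid scales, feature-map shape, and the function name multi_scale_patch_pool are illustrative assumptions rather than the paper's exact design.

```python
# A schematic sketch of multi-scale patch pooling (MSPP): patch features are
# pooled from a convolutional feature map over several grid scales, replacing
# the region features an object detector would normally provide.
import torch
import torch.nn.functional as F

def multi_scale_patch_pool(feature_map, scales=(1, 2, 4)):
    """feature_map: (B, C, H, W) -> (B, sum(s*s for s in scales), C)."""
    patches = []
    for s in scales:
        # Adaptive average pooling splits the map into an s x s grid of patches.
        pooled = F.adaptive_avg_pool2d(feature_map, output_size=(s, s))
        patches.append(pooled.flatten(2).transpose(1, 2))    # (B, s*s, C)
    return torch.cat(patches, dim=1)

# Example: a ResNet-like feature map of shape (batch, channels, 14, 14).
fmap = torch.randn(2, 2048, 14, 14)
patch_feats = multi_scale_patch_pool(fmap)    # (2, 21, 2048) for scales 1, 2, 4
print(patch_feats.shape)
```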
Empirical Evaluation
The paper reports a series of rigorous empirical evaluations across a range of tasks, probing the model's ability to generalize and transfer learned knowledge. Notably, the model performs strongly on both cross-domain tasks (e.g., remote sensing scene classification, news classification) and cross-modal tasks (e.g., visual question answering, cross-modal retrieval).
- Zero-Shot Learning on Remote Sensing: BriVL achieves superior zero-shot classification accuracy on the UC Merced and AID datasets, outperforming models such as CLIP and highlighting its cross-domain knowledge transfer capabilities (a minimal sketch of this zero-shot protocol follows the list).
- Text-to-Image Generation and Neural Visualization: Neural network visualization and text-to-image generation are used to probe the model's imaginative capabilities. The visualizations suggest complex scene understanding and abstract conceptualization, which the authors present as evidence of progress towards AGI.
- Zero-Shot News Classification: Improved performance on Chinese news classification datasets demonstrates the model's potential to enhance single-modal (text-only) tasks through multimodal pre-training.
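As referenced above, here is a minimal sketch of a CLIP-style zero-shot classification protocol in which class names are embedded by the text tower and compared to image embeddings by cosine similarity. The encode_image/encode_text placeholders, prompt template, and English class names are illustrative assumptions (BriVL itself is trained on Chinese text), not the paper's exact evaluation code.

```python
# A minimal sketch of zero-shot classification via image-text embedding
# similarity: the class with the highest cosine similarity wins.
import torch
import torch.nn.functional as F

def encode_image(images):                 # placeholder for the pre-trained image tower
    return torch.randn(images.size(0), 256)

def encode_text(prompts):                 # placeholder for the pre-trained text tower
    return torch.randn(len(prompts), 256)

def zero_shot_classify(images, class_names, template="an aerial photo of {}"):
    txt = F.normalize(encode_text([template.format(c) for c in class_names]), dim=1)
    img = F.normalize(encode_image(images), dim=1)
    sims = img @ txt.t()                  # cosine similarity matrix (B, num_classes)
    return sims.argmax(dim=1)             # predicted class index per image

classes = ["airport", "farmland", "forest", "harbor"]
preds = zero_shot_classify(torch.randn(4, 3, 224, 224), classes)
print([classes[i] for i in preds.tolist()])
```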
Implications and Future Directions
The findings suggest that pre-training on weakly correlated multimodal data can substantially enhance a model's cognitive abilities and move the research community closer to AGI. This approach has implications for multiple fields, including neuroscience and healthcare, by providing general models that can be efficiently adapted to domain-specific applications. The parallels the authors draw between multimodal representation learning and multisensory integration in the human brain also suggest a bridge between this line of work and neuroscience.
Future work suggested by the paper includes extending the model to additional modalities, such as video and audio, and improving the interpretability of its learned representations. The authors also note the ethical ramifications of building such powerful models, highlighting the need for careful monitoring to mitigate biases and to prevent misuse in generating misleading content.
In summary, the paper by Fei et al. lays out a pathway towards AGI through large-scale multimodal pre-training, advocating a shift away from single-modality, task-specific AI systems and towards foundation models trained on weakly correlated image-text data.