UNIMO: Towards Unified-Modal Understanding and Generation via Cross-Modal Contrastive Learning

Published 31 Dec 2020 in cs.CL | arXiv:2012.15409v4

Abstract: Existing pre-training methods focus on either single-modal or multi-modal tasks and cannot effectively adapt to the other. They can only utilize single-modal data (i.e., text or images) or limited multi-modal data (i.e., image-text pairs). In this work, we propose a unified-modal pre-training architecture, namely UNIMO, which can effectively adapt to both single-modal and multi-modal understanding and generation tasks. Large-scale free text corpora and image collections are utilized to improve the capabilities of visual and textual understanding, and cross-modal contrastive learning (CMCL) is leveraged to align textual and visual information into a unified semantic space over a corpus of image-text pairs. Since non-paired single-modal data is very rich, our model can utilize a much larger scale of data to learn more generalizable representations. Moreover, textual knowledge and visual knowledge can enhance each other in the unified semantic space. The experimental results show that UNIMO significantly improves the performance of several single-modal and multi-modal downstream tasks. Our code and pre-trained models are public at the UNIMO project page https://unimo-ptm.github.io/

Citations (358)

Summary

  • The paper presents a unified pre-training architecture that integrates single- and multi-modal datasets to enhance robust representation learning.
  • It employs cross-modal contrastive learning to align visual and textual information by constructing effective positive and negative pairs.
  • The model outperforms state-of-the-art approaches on tasks like visual question answering, image captioning, and image-text retrieval.

Overview of "UNIMO: Towards Unified-Modal Understanding and Generation via Cross-Modal Contrastive Learning"

The paper "UNIMO: Towards Unified-Modal Understanding and Generation via Cross-Modal Contrastive Learning" presents a novel pre-training framework, UNIMO, designed to seamlessly integrate single-modal and multi-modal understanding and generation tasks. This work addresses limitations in existing pre-training architectures, which traditionally focus on either single-modal or multi-modal tasks, often resulting in suboptimal performance when needing to adapt between modalities.

Key Contributions

  1. Unified-Modal Pre-training Architecture: UNIMO is structured to leverage both single-modal (text or image) and multi-modal (image-text pair) datasets. By doing so, it enhances its capacity to learn robust representations that are effective across different tasks and modalities (a minimal backbone sketch follows this list).
  2. Cross-Modal Contrastive Learning (CMCL): This technique is central to the model, allowing UNIMO to align visual and textual information in a shared semantic space. Positive and negative pairs are constructed through text rewriting and by retrieving related images and texts, enforcing semantic alignment at multiple granularities (see the loss sketch after this list).
  3. Comprehensive Dataset Usage: The model capitalizes on a large corpus comprising both massive text collections and image datasets. This breadth enables the model to learn and generalize from a wealth of non-paired and paired data.
  4. Simultaneous Visual and Textual Learning: The model performs visual learning via masked region prediction and language learning via bidirectional and sequence-to-sequence prediction (a masking sketch closes this section). This dual approach strengthens the model's understanding and generation capacities across both domains.
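To make the unified architecture concrete, below is a minimal sketch of a single Transformer encoder that accepts either text alone or text plus pre-extracted image regions (the paper uses detected region features; the module names, dimensions, and projection layer here are illustrative assumptions, not the paper's exact implementation):

```python
# A minimal sketch of a unified encoder over text tokens and image regions,
# assuming region features are pre-extracted (e.g., by an object detector)
# and projected into the token embedding space. Sizes are illustrative.
import torch
import torch.nn as nn

class UnifiedEncoder(nn.Module):
    def __init__(self, vocab_size=30522, dim=768, region_dim=2048,
                 layers=12, heads=12):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, dim)
        self.region_proj = nn.Linear(region_dim, dim)  # map regions to token space
        enc_layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, layers)

    def forward(self, token_ids, region_feats=None):
        x = self.token_emb(token_ids)            # (batch, tokens, dim)
        if region_feats is not None:             # multi-modal input:
            v = self.region_proj(region_feats)   # (batch, regions, dim)
            x = torch.cat([v, x], dim=1)         # concatenate both modalities
        return self.encoder(x)                   # shared semantic space
```

Because the same encoder processes both input types, single-modal corpora and image-text pairs share one set of parameters, which is the property CMCL then exploits.

The paper's CMCL scores positive and negative image-text pairs, with positives produced by multi-granularity text rewriting and additional samples obtained by retrieving related images and texts. The sketch below shows only the generic in-batch contrastive (InfoNCE) formulation that such a loss builds on; the function names and temperature value are assumptions for illustration:

```python
# A minimal sketch of an in-batch image-text contrastive loss, assuming
# pooled (batch, dim) embeddings from the encoder. Diagonal entries are
# positive pairs; all other in-batch combinations act as negatives.
import torch
import torch.nn.functional as F

def cmcl_loss(image_emb: torch.Tensor,
              text_emb: torch.Tensor,
              temperature: float = 0.07) -> torch.Tensor:
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature   # (batch, batch) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric loss: align image-to-text and text-to-image directions.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```

For the language-learning side, here is a minimal sketch of BERT-style bidirectional masked prediction (the 15% masking rate and helper names are assumptions; the paper's visual learning applies the analogous idea by masking region features and predicting them):

```python
# A minimal sketch of masking for bidirectional prediction, assuming
# BERT-style random token masking. Names are illustrative.
import torch

def mask_for_prediction(input_ids: torch.Tensor,
                        mask_token_id: int,
                        mask_prob: float = 0.15):
    """Corrupt a token sequence and build labels for masked prediction."""
    labels = input_ids.clone()
    mask = torch.rand(input_ids.shape, device=input_ids.device) < mask_prob
    labels[~mask] = -100                  # cross_entropy ignores -100 positions
    corrupted = input_ids.clone()
    corrupted[mask] = mask_token_id       # replace selected tokens with [MASK]
    return corrupted, labels

# Usage with logits from a vocabulary prediction head over encoder outputs:
# loss = torch.nn.functional.cross_entropy(
#     logits.view(-1, vocab_size), labels.view(-1), ignore_index=-100)
```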
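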
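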

Experimental Results

UNIMO demonstrates superior performance across both single-modal and multi-modal tasks, including visual question answering, image captioning, image-text retrieval, and various natural language processing benchmarks. The model outperforms several existing state-of-the-art models, showing strong versatility and adaptability across different downstream tasks.

Implications and Future Directions

The UNIMO framework bridges the gap between single-modal and multi-modal learning, introducing a versatile architecture capable of handling a diverse range of tasks. This has significant implications for the development of AI systems that require flexibility and efficiency across varying input types and tasks.

The use of CMCL in aligning cross-modal representations paves the way for a more sophisticated semantic understanding that could improve performance in tasks such as multi-modal machine translation and visual dialogue systems. As future research directions, expanding the model's scale and data diversity could further enhance its generalization capabilities and robustness. Additionally, exploring end-to-end visual-linguistic training and integration could extend UNIMO's applications in real-world AI systems requiring nuanced understanding and generation across modalities.

In conclusion, UNIMO marks a significant step toward truly adaptable AI architectures, enabling seamless integration and understanding across diverse data modalities and tasks.
