
UNIMO: Towards Unified-Modal Understanding and Generation via Cross-Modal Contrastive Learning (2012.15409v4)

Published 31 Dec 2020 in cs.CL

Abstract: Existing pre-training methods either focus on single-modal tasks or multi-modal tasks, and cannot effectively adapt to each other. They can only utilize single-modal data (i.e. text or image) or limited multi-modal data (i.e. image-text pairs). In this work, we propose a unified-modal pre-training architecture, namely UNIMO, which can effectively adapt to both single-modal and multi-modal understanding and generation tasks. Large-scale free text corpora and image collections can be utilized to improve the capability of visual and textual understanding, and cross-modal contrastive learning (CMCL) is leveraged to align the textual and visual information into a unified semantic space over a corpus of image-text pairs. As non-paired single-modal data is very rich, our model can utilize a much larger scale of data to learn more generalizable representations. Moreover, the textual knowledge and visual knowledge can enhance each other in the unified semantic space. The experimental results show that UNIMO significantly improves the performance of several single-modal and multi-modal downstream tasks. Our code and pre-trained models are public at the UNIMO project page https://unimo-ptm.github.io/

Overview of "UNIMO: Towards Unified-Modal Understanding and Generation via Cross-Modal Contrastive Learning"

The paper "UNIMO: Towards Unified-Modal Understanding and Generation via Cross-Modal Contrastive Learning" presents a novel pre-training framework, UNIMO, designed to seamlessly integrate single-modal and multi-modal understanding and generation tasks. This work addresses limitations in existing pre-training architectures, which traditionally focus on either single-modal or multi-modal tasks, often resulting in suboptimal performance when needing to adapt between modalities.

Key Contributions

  1. Unified-Modal Pre-training Architecture: UNIMO is structured to leverage both single-modal (texts or images) and multi-modal (image-text pairs) datasets. By doing so, it enhances its capacity to learn robust representations that are effective across different tasks and modalities.
  2. Cross-Modal Contrastive Learning (CMCL): This technique is central to the model, allowing UNIMO to align visual and textual information in a shared semantic space. Positive and negative pairs are constructed through text rewriting and by retrieving related images and texts, which encourages fine-grained semantic alignment (a minimal sketch of such an objective follows this list).
  3. Comprehensive Datasets Usage: The model capitalizes on a large corpus comprising both massive text collections and image datasets. This broad dataset spectrum enables the model to learn and generalize from a wealth of non-paired and paired data.
  4. Simultaneous Visual and Textual Learning: The model performs visual learning via masked region prediction and language learning via bidirectional and sequence-to-sequence prediction. This dual approach strengthens the model's understanding and generation capacities across both domains (see the attention-mask sketch after this list).
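
To make item 2 concrete, here is a minimal PyTorch sketch of an InfoNCE-style cross-modal contrastive objective in the spirit of CMCL. The function name `cmcl_loss`, the tensor shapes, and the use of cosine similarity with a temperature are illustrative assumptions for exposition, not the authors' released implementation.

```python
# Minimal sketch of a cross-modal contrastive objective (CMCL-style).
# Names, shapes, and the temperature value are illustrative assumptions.
import torch
import torch.nn.functional as F

def cmcl_loss(image_emb, pos_text_emb, neg_text_emb, temperature=0.1):
    """Contrastive loss for one image against its positive and negative texts.

    image_emb:    (d,)    embedding of the image anchor
    pos_text_emb: (P, d)  positive texts (original caption plus rewritten
                          paraphrases / retrieved related texts)
    neg_text_emb: (N, d)  negative texts (e.g. rewrites that break alignment)
    """
    image_emb = F.normalize(image_emb, dim=-1)
    pos = F.normalize(pos_text_emb, dim=-1)
    neg = F.normalize(neg_text_emb, dim=-1)

    # Cosine similarities scaled by temperature.
    pos_logits = pos @ image_emb / temperature   # (P,)
    neg_logits = neg @ image_emb / temperature   # (N,)

    # InfoNCE-style objective: positives should outscore all negatives.
    numerator = torch.logsumexp(pos_logits, dim=0)
    denominator = torch.logsumexp(torch.cat([pos_logits, neg_logits]), dim=0)
    return -(numerator - denominator)

# Toy usage with random embeddings.
d = 256
loss = cmcl_loss(torch.randn(d), torch.randn(4, d), torch.randn(16, d))
```

Averaging this loss over a batch of image-text anchors pulls matched visual and textual representations together in the shared semantic space while pushing mismatched pairs apart.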
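
For item 4, the sketch below illustrates one common way a single Transformer can serve both bidirectional (understanding) and sequence-to-sequence (generation) prediction: by swapping the self-attention mask, in the style of UniLM-like unified language models. This is an assumption about how such unification can be realized, not a claim about UNIMO's exact implementation.

```python
# Illustration of bidirectional vs. sequence-to-sequence attention masks
# sharing one Transformer. Shapes and names are illustrative assumptions.
import torch

def bidirectional_mask(seq_len):
    """Every token may attend to every other token (understanding tasks)."""
    return torch.ones(seq_len, seq_len, dtype=torch.bool)

def seq2seq_mask(src_len, tgt_len):
    """Source tokens attend within the source; target tokens attend to the
    full source plus earlier target tokens only (generation tasks)."""
    total = src_len + tgt_len
    mask = torch.zeros(total, total, dtype=torch.bool)
    mask[:, :src_len] = True                                  # all rows see the source
    causal = torch.tril(torch.ones(tgt_len, tgt_len, dtype=torch.bool))
    mask[src_len:, src_len:] = causal                         # causal over the target
    return mask

# The same Transformer weights can be trained under either mask, so one model
# handles both understanding (bidirectional) and generation (seq2seq) objectives.
print(seq2seq_mask(3, 4))
```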

Experimental Results

UNIMO demonstrates superior performance across both single-modal and multi-modal tasks, including visual question answering, image captioning, image-text retrieval, and various natural language processing benchmarks. The model outperforms several existing state-of-the-art models, showing strong adaptability across these diverse downstream tasks.

Implications and Future Directions

The UNIMO framework bridges the gap between single-modal and multi-modal learning, introducing a versatile architecture capable of handling a diverse range of tasks. This has significant implications for the development of AI systems that require flexibility and efficiency across varying input types and tasks.

The use of CMCL to align cross-modal representations paves the way for more sophisticated semantic understanding that could improve performance in tasks such as multi-modal machine translation and visual dialogue systems. As future research directions, expanding the model's scale and data diversity could further enhance its generalization capabilities and robustness. Additionally, exploring end-to-end visual-linguistic training and integration could extend UNIMO's applications to real-world AI systems requiring nuanced understanding and generation across modalities.

In conclusion, UNIMO marks a significant step toward truly adaptable AI architectures, enabling seamless integration and understanding across diverse data modalities and tasks.

Authors (8)
  1. Wei Li (1121 papers)
  2. Can Gao (20 papers)
  3. Guocheng Niu (5 papers)
  4. Xinyan Xiao (41 papers)
  5. Hao Liu (497 papers)
  6. Jiachen Liu (45 papers)
  7. Hua Wu (191 papers)
  8. Haifeng Wang (194 papers)
Citations (358)