UNITER: UNiversal Image-TExt Representation Learning (1909.11740v3)

Published 25 Sep 2019 in cs.CV, cs.CL, and cs.LG

Abstract: Joint image-text embedding is the bedrock for most Vision-and-Language (V+L) tasks, where multimodality inputs are simultaneously processed for joint visual and textual understanding. In this paper, we introduce UNITER, a UNiversal Image-TExt Representation, learned through large-scale pre-training over four image-text datasets (COCO, Visual Genome, Conceptual Captions, and SBU Captions), which can power heterogeneous downstream V+L tasks with joint multimodal embeddings. We design four pre-training tasks: Masked Language Modeling (MLM), Masked Region Modeling (MRM, with three variants), Image-Text Matching (ITM), and Word-Region Alignment (WRA). Different from previous work that applies joint random masking to both modalities, we use conditional masking on pre-training tasks (i.e., masked language/region modeling is conditioned on full observation of image/text). In addition to ITM for global image-text alignment, we also propose WRA via the use of Optimal Transport (OT) to explicitly encourage fine-grained alignment between words and image regions during pre-training. Comprehensive analysis shows that both conditional masking and OT-based WRA contribute to better pre-training. We also conduct a thorough ablation study to find an optimal combination of pre-training tasks. Extensive experiments show that UNITER achieves new state of the art across six V+L tasks (over nine datasets), including Visual Question Answering, Image-Text Retrieval, Referring Expression Comprehension, Visual Commonsense Reasoning, Visual Entailment, and NLVR$^2$. Code is available at https://github.com/ChenRocks/UNITER.

UNITER: Universal Image-TExt Representation Learning

The UNITER model presents a significant advancement in the domain of Vision-and-Language (V+L) representation learning. This paper introduces an architecture designed to learn joint representations for both images and text through large-scale pre-training, utilizing datasets such as COCO, Visual Genome, Conceptual Captions, and SBU Captions. The model is evaluated across a variety of downstream V+L tasks, showcasing its versatility and robustness.

Model Architecture

UNITER leverages a Transformer-based architecture for its core model, employing the self-attention mechanism to learn contextualized embeddings. This approach allows for the effective integration of visual and textual information within a unified framework. The architecture consists of an Image Embedder and a Text Embedder, which process the image regions and textual tokens, respectively. These are followed by a multi-layer Transformer that learns cross-modal contextualized embeddings.
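
As a rough illustration of this layout, the sketch below shows the two embedders feeding a shared Transformer encoder. It is a minimal sketch with hypothetical layer sizes and class names, assuming pre-extracted Faster R-CNN region features and a 7-dimensional region-location encoding, not the authors' exact implementation.

```python
# Minimal sketch of the UNITER encoder layout (hypothetical sizes and names).
import torch
import torch.nn as nn

class ImageEmbedder(nn.Module):
    def __init__(self, feat_dim=2048, pos_dim=7, hidden=768):
        super().__init__()
        self.feat_fc = nn.Linear(feat_dim, hidden)   # project region features
        self.pos_fc = nn.Linear(pos_dim, hidden)     # project region locations
        self.norm = nn.LayerNorm(hidden)

    def forward(self, region_feats, region_pos):
        # Sum the projected visual features and location encodings, then normalize.
        return self.norm(self.feat_fc(region_feats) + self.pos_fc(region_pos))

class TextEmbedder(nn.Module):
    def __init__(self, vocab_size=30522, max_len=512, hidden=768):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, hidden)
        self.pos = nn.Embedding(max_len, hidden)
        self.norm = nn.LayerNorm(hidden)

    def forward(self, token_ids):
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        return self.norm(self.tok(token_ids) + self.pos(positions))

class UniterEncoder(nn.Module):
    def __init__(self, hidden=768, layers=12, heads=12):
        super().__init__()
        self.img_emb = ImageEmbedder(hidden=hidden)
        self.txt_emb = TextEmbedder(hidden=hidden)
        layer = nn.TransformerEncoderLayer(hidden, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, layers)

    def forward(self, token_ids, region_feats, region_pos):
        # Concatenate text and region embeddings; self-attention mixes modalities.
        seq = torch.cat([self.txt_emb(token_ids),
                         self.img_emb(region_feats, region_pos)], dim=1)
        return self.encoder(seq)
```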

Pre-training Tasks

The authors propose four key pre-training tasks to learn this joint multimodal representation:

  1. Masked Language Modeling (MLM): Conditional masking is applied, allowing the model to predict masked words from the surrounding text and the full visual context (see the conditional-masking sketch after this list).
  2. Masked Region Modeling (MRM): Three variants are used to predict masked visual regions: feature regression, classification, and KL-divergence-based classification.
  3. Image-Text Matching (ITM): This task involves predicting the alignment between image-text pairs.
  4. Word-Region Alignment (WRA): Utilizing Optimal Transport, this task ensures fine-grained alignment between textual tokens and image regions.
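
To make the conditional masking used in MLM concrete, the sketch below masks only word tokens while leaving every image region fully observed. It is a hedged sketch with assumed BERT-style constants, not the authors' exact code.

```python
# Minimal sketch of conditional masking for MLM (hypothetical constants):
# word tokens are masked while all image regions stay fully observed.
import torch

MASK_ID = 103      # assumed BERT-style [MASK] token id
MASK_PROB = 0.15   # BERT-style masking rate

def conditional_mask_text(token_ids, special_token_mask):
    """Mask ~15% of ordinary word tokens; region features are left untouched."""
    probs = torch.rand(token_ids.shape, device=token_ids.device)
    to_mask = (probs < MASK_PROB) & ~special_token_mask   # skip [CLS]/[SEP]/pad
    labels = torch.where(to_mask, token_ids, torch.full_like(token_ids, -100))
    masked_ids = torch.where(to_mask,
                             torch.full_like(token_ids, MASK_ID), token_ids)
    return masked_ids, labels   # label -100 is ignored by the cross-entropy loss
```

Per the paper, conditioning each masked-modality prediction on the fully observed other modality avoids cases where a masked word and its corresponding masked region would have to be inferred from each other simultaneously; MRM mirrors this by masking regions while keeping the full sentence visible.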

Results and Comparative Analysis

Empirically, UNITER achieves state-of-the-art performance across several benchmarks, including Visual Question Answering, Image-Text Retrieval, and Visual Entailment, among others. Notably, the Optimal Transport-based WRA objective markedly improves word-region alignment, yielding gains particularly on tasks that require region-level recognition. Conditional masking further mitigates potential misalignments between modalities, improving both learning efficiency and accuracy.
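
To make the WRA objective concrete, the sketch below computes an entropic-OT transport cost between word and region embeddings with plain Sinkhorn iterations. Note that the paper itself approximates OT with the IPOT algorithm, so this is only an illustrative stand-in with hypothetical function names and hyperparameters.

```python
# Illustrative entropic-OT (Sinkhorn) sketch of the Word-Region Alignment cost;
# the paper uses the IPOT approximation, and these names are hypothetical.
import torch
import torch.nn.functional as F

def wra_ot_cost(word_emb, region_emb, eps=0.1, iters=50):
    """Approximate OT cost between T word and R region embeddings."""
    w = F.normalize(word_emb, dim=-1)               # (T, d)
    r = F.normalize(region_emb, dim=-1)             # (R, d)
    cost = 1.0 - w @ r.t()                          # cosine transport cost (T, R)

    a = torch.full((w.size(0),), 1.0 / w.size(0), device=cost.device)  # word mass
    b = torch.full((r.size(0),), 1.0 / r.size(0), device=cost.device)  # region mass
    K = torch.exp(-cost / eps)                      # Gibbs kernel

    u = torch.ones_like(a)
    for _ in range(iters):                          # Sinkhorn fixed-point updates
        v = b / (K.t() @ u)
        u = a / (K @ v)
    transport = u.unsqueeze(1) * K * v.unsqueeze(0) # approximate OT plan (T, R)
    return (transport * cost).sum()                 # used as the WRA loss term
```

A small transport cost means the words and regions of a matched pair can be coupled cheaply, which is the fine-grained alignment signal the WRA objective rewards during pre-training.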

Theoretical and Practical Implications

The comprehensive analysis presented indicates that conditional masking and OT-based WRA jointly contribute to robust pre-training. This design of the pre-training tasks helps the representations learned by UNITER generalize across varied V+L tasks. The work points toward universal representations that transcend task-specific constraints, a step toward more efficient multimodal models.

Future Directions

Moving forward, enabling earlier interaction between text tokens and raw image pixels (rather than pre-extracted region features) presents an intriguing avenue for research, promising finer-grained cross-modal embeddings.

Overall, UNITER offers a well-validated and theoretically sound approach to universal image-text representation learning, setting a new standard for future research in this domain.

Authors (8)
  1. Yen-Chun Chen (33 papers)
  2. Linjie Li (89 papers)
  3. Licheng Yu (47 papers)
  4. Ahmed El Kholy (4 papers)
  5. Faisal Ahmed (16 papers)
  6. Zhe Gan (135 papers)
  7. Yu Cheng (354 papers)
  8. Jingjing Liu (139 papers)
Citations (436)