UFO: A UniFied TransfOrmer for Vision-Language Representation Learning (2111.10023v1)

Published 19 Nov 2021 in cs.CV

Abstract: In this paper, we propose a single UniFied transfOrmer (UFO), which is capable of processing either unimodal inputs (e.g., image or language) or multimodal inputs (e.g., the concatenation of the image and the question), for vision-language (VL) representation learning. Existing approaches typically design an individual network for each modality and/or a specific fusion network for multimodal tasks. To simplify the network architecture, we use a single transformer network and enforce multi-task learning during VL pre-training, which includes the image-text contrastive loss, the image-text matching loss, and the masked language modeling loss based on the bidirectional and the seq2seq attention masks. The same transformer network is used as the image encoder, the text encoder, or the fusion network in different pre-training tasks. Empirically, we observe less conflict among different tasks and achieve new state-of-the-art results on visual question answering, COCO image captioning (cross-entropy optimization), and nocaps (in SPICE). On other downstream tasks, e.g., image-text retrieval, we also achieve competitive performance.
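
The abstract's central mechanism is that one set of transformer weights plays three roles, with the attention mask selecting the behavior: a fully bidirectional mask for encoding/matching tasks and a seq2seq mask for generation-style masked language modeling. The sketch below illustrates how such masks could be built over a concatenated image-plus-text sequence. It is a minimal illustration, not the authors' code; the helper names (`make_bidirectional_mask`, `make_seq2seq_mask`) and the exact mask conventions are assumptions.

```python
# Hypothetical sketch of the two attention masks described in the abstract.
# A True entry at (i, j) means query token i may attend to key token j.
import torch

def make_bidirectional_mask(n_img: int, n_txt: int) -> torch.Tensor:
    """Every token (image patch or text token) attends to every token.
    Used here to illustrate the encoding/matching-style tasks."""
    n = n_img + n_txt
    return torch.ones(n, n, dtype=torch.bool)

def make_seq2seq_mask(n_img: int, n_txt: int) -> torch.Tensor:
    """Image tokens attend bidirectionally among themselves; text tokens
    attend to all image tokens but only to themselves and earlier text
    tokens, which supports left-to-right generation (e.g., captioning)."""
    n = n_img + n_txt
    mask = torch.zeros(n, n, dtype=torch.bool)
    mask[:n_img, :n_img] = True   # image <-> image, fully bidirectional
    mask[n_img:, :n_img] = True   # text -> image, always visible
    causal = torch.tril(torch.ones(n_txt, n_txt, dtype=torch.bool))
    mask[n_img:, n_img:] = causal  # text -> current and earlier text only
    return mask

# Usage: the same shared transformer would consume the concatenated
# sequence with one mask or the other, so identical weights serve as
# image encoder, text encoder, or fusion network.
print(make_seq2seq_mask(2, 3).int())
```

Under this reading, multi-task pre-training just alternates which inputs and which mask are fed to the shared network, with the contrastive, matching, and masked-LM losses applied to the corresponding outputs.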

Authors (8)
  1. Jianfeng Wang (149 papers)
  2. Xiaowei Hu (54 papers)
  3. Zhe Gan (135 papers)
  4. Zhengyuan Yang (86 papers)
  5. Xiyang Dai (53 papers)
  6. Zicheng Liu (153 papers)
  7. Yumao Lu (8 papers)
  8. Lijuan Wang (133 papers)
Citations (55)