XGPT: Cross-modal Generative Pre-Training for Image Captioning (2003.01473v2)

Published 3 Mar 2020 in cs.CL, cs.CV, and cs.LG

Abstract: While many BERT-based cross-modal pre-trained models produce excellent results on downstream understanding tasks like image-text retrieval and VQA, they cannot be applied to generation tasks directly. In this paper, we propose XGPT, a new method of Cross-modal Generative Pre-Training for Image Captioning that is designed to pre-train text-to-image caption generators through three novel generation tasks, including Image-conditioned Masked Language Modeling (IMLM), Image-conditioned Denoising Autoencoding (IDA), and Text-conditioned Image Feature Generation (TIFG). As a result, the pre-trained XGPT can be fine-tuned without any task-specific architecture modifications to create state-of-the-art models for image captioning. Experiments show that XGPT obtains new state-of-the-art results on the benchmark datasets, including COCO Captions and Flickr30k Captions. We also use XGPT to generate new image captions as data augmentation for the image retrieval task and achieve significant improvement on all recall metrics.
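To make the first pre-training objective concrete, below is a minimal sketch of Image-conditioned Masked Language Modeling (IMLM): masked caption tokens are predicted from a joint encoding of image region features and the corrupted caption. The architecture, dimensions, and names here (`ImlmSketch`, `HIDDEN`, the 2048-d region features, the two-layer Transformer) are illustrative assumptions for the general technique, not XGPT's actual model.

```python
# Minimal IMLM sketch (assumed architecture, not the paper's). PyTorch.
import torch
import torch.nn as nn

VOCAB_SIZE = 30522   # assumed BERT-style vocabulary size
HIDDEN = 256         # assumed hidden size
MASK_ID = 103        # assumed [MASK] token id

class ImlmSketch(nn.Module):
    def __init__(self):
        super().__init__()
        self.token_emb = nn.Embedding(VOCAB_SIZE, HIDDEN)
        # Project detector-style region features (e.g. 2048-d) into the text space.
        self.image_proj = nn.Linear(2048, HIDDEN)
        layer = nn.TransformerEncoderLayer(d_model=HIDDEN, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(HIDDEN, VOCAB_SIZE)

    def forward(self, token_ids, image_feats):
        # Jointly encode [image regions; caption tokens], then predict
        # vocabulary logits at the text positions only.
        text = self.token_emb(token_ids)                  # (B, T, H)
        regions = self.image_proj(image_feats)            # (B, R, H)
        fused = self.encoder(torch.cat([regions, text], dim=1))
        return self.lm_head(fused[:, regions.size(1):])   # (B, T, V)

model = ImlmSketch()
tokens = torch.randint(0, VOCAB_SIZE, (2, 12))   # toy caption batch
masked = tokens.clone()
masked[:, 5] = MASK_ID                           # mask one position per caption
regions = torch.randn(2, 36, 2048)               # toy region features
logits = model(masked, regions)
# IMLM loss: cross-entropy only at the masked positions.
loss = nn.functional.cross_entropy(logits[:, 5], tokens[:, 5])
print(loss.item())
```

IDA and TIFG would follow the same conditioning pattern but swap the target: IDA reconstructs the full caption from a corrupted one, and TIFG regresses image features from text, per the abstract's task descriptions.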

Authors (10)
  1. Qiaolin Xia (7 papers)
  2. Haoyang Huang (27 papers)
  3. Nan Duan (172 papers)
  4. Dongdong Zhang (79 papers)
  5. Lei Ji (33 papers)
  6. Zhifang Sui (89 papers)
  7. Edward Cui (5 papers)
  8. Taroon Bharti (6 papers)
  9. Xin Liu (820 papers)
  10. Ming Zhou (182 papers)
Citations (68)