IDEA: Increasing Text Diversity via Online Multi-Label Recognition for Vision-Language Pre-training (2207.05333v2)

Published 12 Jul 2022 in cs.CV and cs.LG

Abstract: Vision-Language Pre-training (VLP) with large-scale image-text pairs has demonstrated superior performance in various fields. However, the image-text pairs that co-occur on the Internet typically lack explicit alignment information, which is suboptimal for VLP. Existing methods adopt an off-the-shelf object detector to exploit additional image tag information. However, the object detector is time-consuming and can only identify pre-defined object categories, which limits model capacity. Inspired by the observation that texts carry incomplete fine-grained image information, we introduce IDEA, which stands for increasing text diversity via online multi-label recognition for VLP. IDEA shows that multi-label learning with image tags extracted from the texts can be jointly optimized during VLP. Moreover, IDEA can identify valuable image tags online to provide more explicit textual supervision. Comprehensive experiments demonstrate that IDEA can significantly boost performance on multiple downstream datasets at a small extra computational cost.
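
The abstract describes two ideas: deriving multi-label targets from tags found in the caption text, and identifying further tags online from the model's own predictions. The following is a minimal sketch of how that could look, not the authors' implementation. The tag vocabulary, the whitespace-based tag matching, the 0.9 confidence threshold, the logit values, and all function names are assumptions made for illustration.

```python
import math

# Hypothetical tag vocabulary; the paper's actual tag list is not given here.
TAG_VOCAB = ["dog", "ball", "grass", "person", "car"]


def tags_from_caption(caption, vocab=TAG_VOCAB):
    """Multi-hot target: 1.0 where a vocabulary word appears in the caption."""
    words = set(caption.lower().replace(".", " ").replace(",", " ").split())
    return [1.0 if tag in words else 0.0 for tag in vocab]


def multilabel_bce(logits, targets):
    """Mean binary cross-entropy over tags, computed from raw logits."""
    total = 0.0
    for z, y in zip(logits, targets):
        # Numerically stable form: max(z, 0) - z*y + log(1 + exp(-|z|))
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)


def online_tags(logits, targets, vocab=TAG_VOCAB, threshold=0.9):
    """Tags predicted with high confidence that the caption did not mention.

    The threshold is an assumed value, chosen only for this sketch.
    """
    found = []
    for z, y, tag in zip(logits, targets, vocab):
        prob = 1.0 / (1.0 + math.exp(-z))  # sigmoid
        if y == 0.0 and prob >= threshold:
            found.append(tag)
    return found


caption = "A dog chases a ball."
targets = tags_from_caption(caption)
logits = [4.0, 3.0, 5.0, -4.0, -6.0]  # made-up classifier outputs, one per tag
tag_loss = multilabel_bce(logits, targets)
extra_tags = online_tags(logits, targets)
```

In a joint setup, a term like `tag_loss` would be added to the usual pre-training losses, and tags like those in `extra_tags` could be appended to the caption as additional textual supervision. How the paper weights and uses them is not stated in the abstract.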

Authors (10)
  1. Xinyu Huang (75 papers)
  2. Youcai Zhang (44 papers)
  3. Ying Cheng (17 papers)
  4. Weiwei Tian (5 papers)
  5. Ruiwei Zhao (2 papers)
  6. Rui Feng (67 papers)
  7. Yuejie Zhang (31 papers)
  8. Yaqian Li (17 papers)
  9. Yandong Guo (78 papers)
  10. Xiaobo Zhang (19 papers)
Citations (13)