Grid-VLP: Revisiting Grid Features for Vision-Language Pre-training (2108.09479v1)

Published 21 Aug 2021 in cs.MM, cs.CL, and cs.CV

Abstract: Existing approaches to vision-language pre-training (VLP) heavily rely on an object detector based on bounding boxes (regions), where salient objects are first detected from images and then a Transformer-based model is used for cross-modal fusion. Despite their superior performance, these approaches are bounded by the capability of the object detector in terms of both effectiveness and efficiency. Besides, the presence of object detection imposes unnecessary constraints on model designs and makes it difficult to support end-to-end training. In this paper, we revisit grid-based convolutional features for vision-language pre-training, skipping the expensive region-related steps. We propose a simple yet effective grid-based VLP method that works surprisingly well with the grid features. By pre-training only with in-domain datasets, the proposed Grid-VLP method can outperform most competitive region-based VLP methods on three examined vision-language understanding tasks. We hope that our findings help to further advance the state of the art of vision-language pre-training, and provide a new direction towards effective and efficient VLP.

Authors (7)
  1. Ming Yan (190 papers)
  2. Haiyang Xu (67 papers)
  3. Chenliang Li (92 papers)
  4. Bin Bi (24 papers)
  5. Junfeng Tian (19 papers)
  6. Min Gui (4 papers)
  7. Wei Wang (1793 papers)
Citations (9)
