COPA: Efficient Vision-Language Pre-training Through Collaborative Object- and Patch-Text Alignment (2308.03475v2)

Published 7 Aug 2023 in cs.MM

Abstract: Vision-Language Pre-training (VLP) methods based on object detection enjoy the rich knowledge of fine-grained object-text alignment but at the cost of computationally expensive inference. Recent Vision Transformer (ViT)-based approaches circumvent this issue but struggle with long visual sequences that lack detailed cross-modal alignment information. This paper introduces a ViT-based VLP technique that efficiently incorporates object information through a novel patch-text alignment mechanism. Specifically, we convert object-level signals into patch-level ones and devise a Patch-Text Alignment pre-training task (PTA) to learn a text-aware patch detector. Using off-the-shelf fine-grained object annotations on 5% of the training images, we jointly train PTA with other conventional VLP objectives in an end-to-end manner, bypassing the high computational cost of object detection and yielding an effective patch detector that accurately detects text-relevant patches. This considerably shortens patch sequences and accelerates computation within the ViT backbone. Our experiments on a variety of widely used benchmarks show that our method achieves a speedup of nearly 88% over prior VLP models while maintaining competitive or superior performance on downstream tasks with similar model size and data scale.
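
To make the patch-reduction idea concrete, below is a minimal PyTorch sketch of text-aware patch selection. It is not the authors' implementation: the function name, the cosine-similarity scoring (a stand-in for the learned PTA patch detector), and the keep_ratio parameter are illustrative assumptions. The sketch scores each ViT patch token against a pooled text embedding and keeps only the top-scoring patches, so later transformer layers operate on a much shorter visual sequence.

```python
# Minimal sketch (assumptions, not the authors' code) of text-aware patch
# selection: score each ViT patch token against a pooled text embedding and
# keep only the most text-relevant patches.
import torch
import torch.nn.functional as F


def select_text_relevant_patches(patch_embeds: torch.Tensor,
                                 text_embed: torch.Tensor,
                                 keep_ratio: float = 0.5):
    """patch_embeds: (B, N, D) patch tokens; text_embed: (B, D) pooled text.

    Returns the kept patch tokens (B, K, D) and their indices (B, K).
    Cosine similarity stands in for the learned PTA patch detector, which
    COPA trains from object annotations converted to patch-level labels.
    """
    p = F.normalize(patch_embeds, dim=-1)               # (B, N, D)
    t = F.normalize(text_embed, dim=-1)                 # (B, D)
    scores = torch.einsum("bnd,bd->bn", p, t)           # per-patch alignment score

    k = max(1, int(patch_embeds.size(1) * keep_ratio))  # number of patches to keep
    top_idx = scores.topk(k, dim=1).indices             # (B, K)
    kept = torch.gather(
        patch_embeds, 1,
        top_idx.unsqueeze(-1).expand(-1, -1, patch_embeds.size(-1)),
    )                                                   # (B, K, D)
    return kept, top_idx


# Example: halve a 196-patch sequence for a batch of 2 images.
patches = torch.randn(2, 196, 768)
text = torch.randn(2, 768)
kept, idx = select_text_relevant_patches(patches, text, keep_ratio=0.5)
print(kept.shape)  # torch.Size([2, 98, 768])
```

In COPA itself, the per-patch relevance scores come from the PTA-trained detector supervised by object-level signals converted to patch-level labels, rather than from raw image-text cosine similarity as in this sketch.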

Authors (10)
  1. Chaoya Jiang (15 papers)
  2. Haiyang Xu (67 papers)
  3. Wei Ye (110 papers)
  4. Qinghao Ye (31 papers)
  5. Chenliang Li (92 papers)
  6. Ming Yan (190 papers)
  7. Bin Bi (24 papers)
  8. Shikun Zhang (82 papers)
  9. Ji Zhang (176 papers)
  10. Fei Huang (408 papers)
Citations (8)