COPA: Efficient Vision-Language Pre-training Through Collaborative Object- and Patch-Text Alignment (2308.03475v2)
Abstract: Vision-Language Pre-training (VLP) methods based on object detection benefit from rich, fine-grained object-text alignment knowledge, but at the cost of computationally expensive inference. Recent Vision Transformer (ViT)-based approaches avoid this cost, yet they must process long visual sequences without detailed cross-modal alignment information. This paper introduces a ViT-based VLP technique that efficiently incorporates object information through a novel patch-text alignment mechanism. Specifically, we convert object-level signals into patch-level ones and devise a Patch-Text Alignment pre-training task (PTA) to learn a text-aware patch detector. Using off-the-shelf detailed object annotations for only 5% of the training images, we jointly train PTA with other conventional VLP objectives in an end-to-end manner. This bypasses the high computational cost of object detection and yields an effective patch detector that accurately identifies text-relevant patches, considerably shortening the patch sequence and accelerating computation within the ViT backbone. Experiments on a range of widely used benchmarks show that our method achieves a speedup of nearly 88% over prior VLP models while maintaining competitive or superior performance on downstream tasks at a similar model size and data scale.
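To make the abstract's mechanism concrete, the sketch below illustrates one plausible way to (a) map object bounding boxes to patch-level alignment labels and (b) train a text-aware patch detector that keeps only the top-scoring patches, shortening the ViT sequence. This is a minimal illustration based on the abstract, not the authors' implementation; the function and class names (`boxes_to_patch_labels`, `PatchDetector`), the scoring head, and the keep ratio are all assumptions.

```python
# Illustrative sketch (not the authors' code): converting object boxes into
# patch-level labels and pruning ViT patches with a text-aware detector.
import torch
import torch.nn as nn
import torch.nn.functional as F


def boxes_to_patch_labels(boxes, image_size=224, patch_size=16):
    """Mark every patch whose cell overlaps a text-relevant object box.

    boxes: (N, 4) tensor of [x1, y1, x2, y2] in pixels.
    Returns a (num_patches,) float tensor of 0/1 patch labels.
    """
    grid = image_size // patch_size                      # e.g. a 14 x 14 patch grid
    labels = torch.zeros(grid, grid)
    for x1, y1, x2, y2 in boxes.tolist():
        c1, r1 = int(x1 // patch_size), int(y1 // patch_size)
        c2, r2 = int(x2 // patch_size), int(y2 // patch_size)
        labels[r1:r2 + 1, c1:c2 + 1] = 1.0               # patches covered by the box
    return labels.flatten()


class PatchDetector(nn.Module):
    """Scores each patch for text relevance and keeps the top-k patches."""

    def __init__(self, dim=768, keep_ratio=0.5):
        super().__init__()
        self.score_head = nn.Linear(2 * dim, 1)          # patch feature + text feature
        self.keep_ratio = keep_ratio

    def forward(self, patch_feats, text_feat, patch_labels=None):
        # patch_feats: (B, P, D); text_feat: (B, D) pooled text representation.
        text = text_feat.unsqueeze(1).expand(-1, patch_feats.size(1), -1)
        logits = self.score_head(torch.cat([patch_feats, text], dim=-1)).squeeze(-1)

        # Patch-Text Alignment loss, computed only when object-derived labels
        # exist (the small annotated subset mentioned in the abstract).
        pta_loss = None
        if patch_labels is not None:
            pta_loss = F.binary_cross_entropy_with_logits(logits, patch_labels)

        # Keep the highest-scoring patches to shorten the ViT sequence.
        k = max(1, int(self.keep_ratio * patch_feats.size(1)))
        topk = logits.topk(k, dim=1).indices
        kept = torch.gather(
            patch_feats, 1, topk.unsqueeze(-1).expand(-1, -1, patch_feats.size(-1)))
        return kept, pta_loss


if __name__ == "__main__":
    boxes = torch.tensor([[32.0, 48.0, 96.0, 112.0]])    # one annotated object
    labels = boxes_to_patch_labels(boxes).unsqueeze(0)   # (1, 196)
    detector = PatchDetector()
    kept, loss = detector(torch.randn(1, 196, 768), torch.randn(1, 768), labels)
    print(kept.shape, loss.item())                       # torch.Size([1, 98, 768]) ...
```

In this reading, the detector is supervised only on the annotated subset, while the top-k pruning is applied to every image, which is how the approach can cut the visual sequence length without running an object detector at inference time.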
Authors: Chaoya Jiang, Haiyang Xu, Wei Ye, Qinghao Ye, Chenliang Li, Ming Yan, Bin Bi, Shikun Zhang, Ji Zhang, Fei Huang