SyCoCa: Symmetrizing Contrastive Captioners with Attentive Masking for Multimodal Alignment (2401.02137v1)
Abstract: Multimodal alignment between language and vision is a fundamental topic in current vision-language model research. Contrastive Captioners (CoCa), a representative method, integrates Contrastive Language-Image Pretraining (CLIP) and Image Captioning (IC) into a unified framework, yielding impressive results. CLIP imposes bidirectional constraints on the global representations of entire images and sentences. Although IC performs unidirectional image-to-text generation on local representations, it lacks any constraint on local text-to-image reconstruction, which limits the ability to understand images at a fine-grained level when aligning them with texts. To achieve multimodal alignment from both global and local perspectives, this paper proposes Symmetrizing Contrastive Captioners (SyCoCa), which introduces bidirectional interactions between images and texts at both the global and local representation levels. Specifically, we add a Text-Guided Masked Image Modeling (TG-MIM) head on top of the image-text contrastive (ITC) and IC heads. The improved SyCoCa can further leverage textual cues to reconstruct contextual images and visual cues to predict textual contents. When implementing bidirectional local interactions, the local contents of images tend to be cluttered or unrelated to their textual descriptions. Thus, we employ an attentive masking strategy to select effective image patches for interaction. Extensive experiments on five vision-language tasks, including image-text retrieval, image captioning, visual question answering, and zero-shot/finetuned image classification, validate the effectiveness of the proposed method.
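As a rough illustration only (not the authors' released code), the sketch below shows how the three objectives described in the abstract could be combined: a symmetric image-text contrastive loss (ITC), an autoregressive captioning loss (IC), and a text-guided masked-image-modeling loss (TG-MIM), together with an attentive-masking step that keeps the image patches most related to the paired text. All function names, the cosine-similarity scoring, the 50% mask ratio, and the loss weights are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def attentive_mask(patch_tokens, text_embed, mask_ratio=0.5):
    """Hypothetical attentive-masking step: keep the patches most relevant to the text.

    patch_tokens: (B, N, D) local image patch embeddings
    text_embed:   (B, D)    global text embedding
    Returns indices of kept (text-relevant) patches and masked patches.
    """
    # Score every patch against the sentence embedding (assumed: cosine similarity).
    scores = F.cosine_similarity(patch_tokens, text_embed.unsqueeze(1), dim=-1)  # (B, N)
    num_keep = int(patch_tokens.size(1) * (1 - mask_ratio))
    keep_idx = scores.topk(num_keep, dim=1).indices                 # most text-relevant patches
    mask_idx = (-scores).topk(patch_tokens.size(1) - num_keep, dim=1).indices
    return keep_idx, mask_idx

def sycoca_style_loss(img_global, txt_global, caption_logits, caption_targets,
                      recon_patches, target_patches, temperature=0.07,
                      w_itc=1.0, w_cap=1.0, w_mim=1.0):
    """Combine ITC, IC, and TG-MIM objectives (weights are illustrative, not from the paper)."""
    # ITC: symmetric InfoNCE over global image / text embeddings.
    img = F.normalize(img_global, dim=-1)
    txt = F.normalize(txt_global, dim=-1)
    logits = img @ txt.t() / temperature
    labels = torch.arange(img.size(0), device=img.device)
    loss_itc = 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

    # IC: autoregressive captioning loss (image -> text prediction on local tokens).
    loss_cap = F.cross_entropy(caption_logits.flatten(0, 1), caption_targets.flatten())

    # TG-MIM: reconstruct masked patches conditioned on the text (text -> image).
    loss_mim = F.mse_loss(recon_patches, target_patches)

    return w_itc * loss_itc + w_cap * loss_cap + w_mim * loss_mim
```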
- Nocaps: Novel object captioning at scale. In ICCV, pages 8948–8957, 2019.
- Beit: Bert pre-training of image transformers. arXiv preprint arXiv:2106.08254, 2021.
- Vlmo: Unified vision-language pre-training with mixture-of-modality-experts. In NeurIPS, pages 32897–32912, 2022.
- Food-101 – Mining discriminative components with random forests. In ECCV, pages 446–461, 2014.
- Coyo-700m: Image-text pair dataset. https://github.com/kakaobrain/coyo-dataset, 2022.
- Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. In CVPR, pages 3558–3568, 2021.
- Generative pretraining from pixels. In ICML, pages 1691–1703. PMLR, 2020a.
- Uniter: Universal image-text representation learning. In ECCV, pages 104–120, 2020b.
- Reproducible scaling laws for contrastive language-image learning. arXiv preprint arXiv:2212.07143, 2022.
- Describing textures in the wild. In CVPR, pages 3606–3613, 2014.
- An analysis of single-layer networks in unsupervised feature learning. In AISTATS, pages 215–223, 2011.
- Imagenet: A large-scale hierarchical image database. In CVPR, pages 248–255, 2009.
- Unsupervised visual representation learning by context prediction. In ICCV, pages 1422–1430, 2015.
- Maskclip: Masked self-distillation advances contrastive language-image pretraining. In CVPR, pages 10995–11005, 2023.
- An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
- Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories. In CVPRW, pages 178–178, 2004.
- Cyclip: Cyclic contrastive language-image pretraining. In NeurIPS, pages 6704–6719, 2022.
- Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In CVPR, pages 6904–6913, 2017.
- Masked autoencoders are scalable vision learners. In CVPR, pages 16000–16009, 2022.
- The many faces of robustness: A critical analysis of out-of-distribution generalization. In ICCV, pages 8340–8349, 2021a.
- Natural adversarial examples. In CVPR, pages 15262–15271, 2021b.
- Scaling up vision-language pre-training for image captioning. In CVPR, pages 17980–17989, 2022.
- Scaling up visual and vision-language representation learning with noisy text supervision. In ICML, pages 4904–4916, 2021.
- Novel dataset for fine-grained image categorization. In CVPRW, 2011.
- Vilt: Vision-and-language transformer without convolution or region supervision. In ICML, pages 5583–5594, 2021.
- Learning multiple layers of features from tiny images. 2009.
- Mnist handwritten digit database. AT&T Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, 2010.
- mplug: Effective and efficient vision-language learning by cross-modal skip-connections. arXiv preprint arXiv:2205.12005, 2022a.
- Align before fuse: Vision and language representation learning with momentum distillation. In NeurIPS, pages 9694–9705, 2021.
- Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In ICML, pages 12888–12900, 2022b.
- Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023a.
- Grounded language-image pre-training. In CVPR, pages 10965–10975, 2022c.
- Clipa-v2: Scaling clip training with 81.1% zero-shot imagenet accuracy within a $10,000 budget; an extra $4,000 unlocks 81.8% accuracy. arXiv preprint arXiv:2306.15658, 2023b.
- Microsoft coco: Common objects in context. In ECCV, pages 740–755, 2014.
- Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
- Slip: Self-supervision meets language-image pre-training. In ECCV, pages 529–544, 2022.
- Automated flower classification over a large number of classes. In ICVGIP, pages 722–729, 2008.
- Cats and dogs. In CVPR, pages 3498–3505, 2012.
- Context encoders: Feature learning by inpainting. In CVPR, pages 2536–2544, 2016.
- Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In ICCV, pages 2641–2649, 2015.
- Learning transferable visual models from natural language supervision. In ICML, pages 8748–8763. PMLR, 2021.
- Do imagenet classifiers generalize to imagenet? In ICML, pages 5389–5400, 2019.
- Laion-5b: An open large-scale dataset for training next generation image-text models. In NeurIPS, pages 25278–25294, 2022.
- Grad-cam: Visual explanations from deep networks via gradient-based localization. In ICCV, pages 618–626, 2017.
- Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2556–2565, 2018.
- How much can clip benefit vision-and-language tasks? arXiv preprint arXiv:2107.06383, 2021.
- Flava: A foundational language and vision alignment model. In CVPR, pages 15638–15650, 2022.
- Eva-clip: Improved training techniques for clip at scale. arXiv preprint arXiv:2303.15389, 2023.
- Lxmert: Learning cross-modality encoder representations from transformers. arXiv preprint arXiv:1908.07490, 2019.
- Selfie: Self-supervised pretraining for image embedding. arXiv preprint arXiv:1906.02940, 2019.
- The caltech-ucsd birds-200-2011 dataset. 2011.
- Image as a foreign language: Beit pretraining for all vision and vision-language tasks. arXiv preprint arXiv:2208.10442, 2022.
- Image as a foreign language: Beit pretraining for vision and vision-language tasks. In CVPR, pages 19175–19186, 2023a.
- Vilta: Enhancing vision-language pre-training through textual augmentation. In ICCV, pages 3158–3169, 2023b.
- Simvlm: Simple visual language model pretraining with weak supervision. arXiv preprint arXiv:2108.10904, 2021.
- Simmim: A simple framework for masked image modeling. In CVPR, pages 9653–9663, 2022.
- mplug-2: A modularized multi-modal foundation model across text, image and video. arXiv preprint arXiv:2302.00402, 2023.
- Chinese clip: Contrastive vision-language pretraining in chinese. arXiv preprint arXiv:2211.01335, 2022a.
- Unified contrastive learning in image-text-label space. In CVPR, pages 19163–19173, 2022b.
- Filip: Fine-grained interactive language-image pre-training. arXiv preprint arXiv:2111.07783, 2021.
- Coca: Contrastive captioners are image-text foundation models. arXiv preprint arXiv:2205.01917, 2022.
- Florence: A new foundation model for computer vision. arXiv preprint arXiv:2111.11432, 2021.
- Lit: Zero-shot transfer with locked-image text tuning. In CVPR, pages 18123–18133, 2022.
- Sigmoid loss for language image pre-training. arXiv preprint arXiv:2303.15343, 2023.
- Vinvl: Revisiting visual representations in vision-language models. In CVPR, pages 5579–5588, 2021.
- Regionclip: Region-based language-image pretraining. In CVPR, pages 16793–16803, 2022.
Authors: Ziping Ma, Furong Xu, Jian Liu, Ming Yang, Qingpei Guo