Mining Fine-Grained Image-Text Alignment for Zero-Shot Captioning via Text-Only Training (2401.02347v1)
Abstract: Image captioning aims to generate descriptive and meaningful textual descriptions of images, enabling a broad range of vision-language applications. Prior works have demonstrated that harnessing the power of Contrastive Language-Image Pre-training (CLIP) offers a promising approach to zero-shot captioning, eliminating the need for expensive caption annotations. However, the widely observed modality gap in the CLIP latent space harms zero-shot captioning performance by breaking the alignment between paired image and text features. To address this issue, we conduct an analysis of the CLIP latent space that leads to two findings. First, we observe that CLIP's visual features of image subregions can lie closer to the paired caption than the global image feature does, owing to the inherent information loss in text descriptions. Second, we show that the modality gap between a paired image and text can be empirically modeled as a zero-mean Gaussian distribution. Motivated by these findings, we propose a novel zero-shot image captioning framework with text-only training to reduce the modality gap. In particular, we introduce a subregion feature aggregation that leverages local region information to produce a compact visual representation for matching the text representation. Moreover, we incorporate a noise-injection and CLIP-reranking strategy to boost captioning performance. We also extend our framework to build a zero-shot VQA pipeline, demonstrating its generality. Through extensive experiments on common captioning and VQA datasets such as MSCOCO, Flickr30k, and VQAv2, we show that our method achieves remarkable performance improvements. Code is available at https://github.com/Artanic30/MacCap.
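Below is a minimal sketch of two of the mechanisms the abstract names: zero-mean Gaussian noise injection during text-only training, and CLIP-similarity reranking of candidate captions at inference. It is not the authors' implementation: `clip_model` (with `encode_text`/`encode_image` returning pooled embeddings) and `decoder` (a prefix-conditioned caption generator) are hypothetical wrappers, the noise scale is illustrative, and the subregion feature aggregation step is omitted.

```python
import torch
import torch.nn.functional as F

def train_step(clip_model, decoder, captions, noise_std=0.1):
    """Text-only training: reconstruct each caption from its noised CLIP text embedding."""
    with torch.no_grad():
        text_emb = F.normalize(clip_model.encode_text(captions), dim=-1)  # (B, D)
    # Model the image-text modality gap as zero-mean Gaussian noise (illustrative std).
    noisy_emb = F.normalize(text_emb + noise_std * torch.randn_like(text_emb), dim=-1)
    # The decoder learns to generate the original caption from the noisy embedding,
    # so at test time it can accept an image embedding in its place.
    return decoder(prefix=noisy_emb, target=captions)  # language-modeling loss

@torch.no_grad()
def caption_with_rerank(clip_model, decoder, image, num_candidates=5):
    """Inference: decode several candidates from the image embedding, keep the one CLIP scores highest."""
    img_emb = F.normalize(clip_model.encode_image(image), dim=-1)                 # (1, D)
    candidates = decoder.generate(prefix=img_emb, num_sequences=num_candidates)   # list[str]
    cand_emb = F.normalize(clip_model.encode_text(candidates), dim=-1)            # (num_candidates, D)
    scores = (cand_emb @ img_emb.T).squeeze(-1)                                   # cosine similarities
    return candidates[int(scores.argmax())]
```

The essential points are that the injected noise is zero-mean and applied only at training time, and that at inference the text embedding is replaced by the image embedding, with CLIP's own image-text similarity used to select among candidate captions.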
Authors: Longtian Qiu, Shan Ning, Xuming He