UniFine: A Unified and Fine-grained Approach for Zero-shot Vision-Language Understanding (2307.00862v1)
Abstract: Vision-language tasks such as VQA, SNLI-VE, and VCR are challenging because they require a model to reason about the semantics of both the visual world and natural language. Supervised methods for these tasks have been well studied, but solving them in a zero-shot setting remains underexplored. Since Contrastive Language-Image Pre-training (CLIP) has shown remarkable zero-shot performance on image-text matching, previous works exploit its strong zero-shot ability by converting vision-language tasks into an image-text matching problem, but they mainly consider global-level matching (e.g., the whole image or sentence). However, we find that fine-grained visual and textual information, e.g., keywords in the sentence and objects in the image, can be highly informative for understanding semantics. Inspired by this, we propose a unified framework that exploits fine-grained information for zero-shot vision-language learning, covering multiple tasks such as VQA, SNLI-VE, and VCR. Our experiments show that the framework outperforms previous zero-shot methods on VQA and achieves substantial improvements on SNLI-VE and VCR, and our ablation studies confirm the effectiveness and generalizability of the proposed method. Code will be available at https://github.com/ThreeSR/UniFine
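The abstract frames zero-shot vision-language understanding as CLIP image-text matching augmented with fine-grained cues (keywords in the text, objects in the image). The sketch below illustrates that general recipe for VQA only; it is not the paper's implementation, and the statement template, the source of the object boxes, and the weighting factor `alpha` are assumptions made for illustration.

```python
# A minimal sketch of zero-shot VQA as CLIP image-text matching, combining a
# global image-statement score with a fine-grained score over object crops.
# Illustrative only: the template, box source, and `alpha` are assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


def clip_scores(images, texts):
    """Return similarity logits between every image and every text."""
    inputs = processor(text=texts, images=images, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    return out.logits_per_image  # shape: [num_images, num_texts]


def zero_shot_vqa(image, question, answers, object_boxes=None, alpha=0.5):
    # Convert each (question, answer) pair into a statement for matching
    # (hypothetical template, not the paper's exact prompt).
    statements = [f"question: {question} answer: {a}" for a in answers]

    # Global matching: the whole image against each candidate statement.
    global_logits = clip_scores([image], statements)[0]            # [num_answers]

    # Fine-grained matching: object crops against each statement, averaged
    # over crops. In practice the boxes would come from an off-the-shelf detector.
    if object_boxes:
        crops = [image.crop(box) for box in object_boxes]
        local_logits = clip_scores(crops, statements).mean(dim=0)  # [num_answers]
    else:
        local_logits = torch.zeros_like(global_logits)

    combined = alpha * global_logits + (1 - alpha) * local_logits
    return answers[int(combined.argmax())]


# Usage example (paths and boxes are placeholders):
# img = Image.open("example.jpg")
# zero_shot_vqa(img, "What is the man holding?", ["a bat", "a phone", "a dog"],
#               object_boxes=[(30, 40, 200, 220)])
```

The weighted sum of global and object-level scores stands in for however the framework actually fuses coarse and fine-grained evidence; the key point is that answer candidates are ranked purely by CLIP similarity, with no task-specific training.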
- Rui Sun
- Zhecan Wang
- Haoxuan You
- Noel Codella
- Kai-Wei Chang
- Shih-Fu Chang