UniFine: A Unified and Fine-grained Approach for Zero-shot Vision-Language Understanding (2307.00862v1)

Published 3 Jul 2023 in cs.CV and cs.CL

Abstract: Vision-language tasks such as VQA, SNLI-VE, and VCR are challenging because they require the model to reason about the semantics of both the visual world and natural language. Supervised methods for these tasks have been well studied, but solving them in a zero-shot setting remains far less explored. Since Contrastive Language-Image Pre-training (CLIP) has shown remarkable zero-shot performance on image-text matching, previous works exploit its strong zero-shot ability by converting vision-language tasks into an image-text matching problem, mainly considering global-level matching (e.g., the whole image or sentence). However, we find that fine-grained visual and textual information, e.g., keywords in the sentence and objects in the image, can be highly informative for semantic understanding. Inspired by this, we propose a unified framework that exploits fine-grained information for zero-shot vision-language learning, covering multiple tasks such as VQA, SNLI-VE, and VCR. Our experiments show that our framework outperforms previous zero-shot methods on VQA and achieves substantial improvements on SNLI-VE and VCR. Furthermore, our ablation studies confirm the effectiveness and generalizability of the proposed method. Code will be available at https://github.com/ThreeSR/UniFine
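The abstract describes casting tasks like VQA as CLIP image-text matching and supplementing global-level matching with fine-grained cues. Below is a minimal sketch of that idea for zero-shot VQA, assuming the Hugging Face transformers CLIP API and the "openai/clip-vit-base-patch32" checkpoint; the prompt template, the stand-in fine-grained term, and the weighting alpha are illustrative assumptions, not the authors' implementation (see their repository for the actual method).

```python
# Minimal sketch of zero-shot VQA via CLIP image-text matching:
# global matching between the image and a (question, answer) prompt,
# plus a simple stand-in for fine-grained (keyword/object-level) evidence.
# NOT the UniFine implementation; names and templates are illustrative.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

def clip_scores(image, texts):
    """Return CLIP image-text similarity logits for one image and a list of texts."""
    inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.logits_per_image.squeeze(0)  # shape: (len(texts),)

def zero_shot_vqa(image_path, question, candidate_answers, alpha=0.5):
    image = Image.open(image_path).convert("RGB")

    # Global-level matching: turn each (question, answer) pair into a prompt
    # and score it against the whole image.
    prompts = [f"question: {question} answer: {ans}" for ans in candidate_answers]
    global_scores = clip_scores(image, prompts)

    # Fine-grained term (illustrative): score each answer word on its own,
    # standing in for keyword/object-level matching against image regions.
    fine_scores = clip_scores(image, candidate_answers)

    total = alpha * global_scores + (1 - alpha) * fine_scores
    return candidate_answers[int(total.argmax())], total

# Example usage (path and candidates are placeholders):
# answer, scores = zero_shot_vqa("example.jpg", "What is the man holding?",
#                                ["umbrella", "surfboard", "dog", "phone"])
# print(answer)
```

A fuller version in the paper's spirit would extract keywords from the question with a text model and match them against detected object regions rather than scoring the bare answer strings, but the combination of a global score and a fine-grained score is the core idea.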

Authors (6)
  1. Rui Sun (105 papers)
  2. Zhecan Wang (18 papers)
  3. Haoxuan You (33 papers)
  4. Noel Codella (21 papers)
  5. Kai-Wei Chang (292 papers)
  6. Shih-Fu Chang (131 papers)
Citations (2)