Synthesize, Diagnose, and Optimize: Towards Fine-Grained Vision-Language Understanding (2312.00081v2)
Abstract: Vision-language models (VLMs) have demonstrated remarkable performance across various downstream tasks. However, understanding fine-grained visual-linguistic concepts, such as attributes and inter-object relationships, remains a significant challenge. While several benchmarks aim to evaluate VLMs at a finer granularity, their primary focus remains on the linguistic aspect, neglecting the visual dimension. Here, we highlight the importance of evaluating VLMs from both a textual and a visual perspective. We introduce a progressive pipeline to synthesize images that vary in a specific attribute while ensuring consistency in all other aspects. Utilizing this data engine, we carefully design a benchmark, SPEC, to diagnose the comprehension of object size, position, existence, and count. Subsequently, we conduct a thorough evaluation of four leading VLMs on SPEC. Surprisingly, their performance is close to random guessing, revealing significant limitations. With this in mind, we propose a simple yet effective approach to improve VLMs' fine-grained understanding, achieving significant gains on SPEC without compromising zero-shot performance. Results on two additional fine-grained benchmarks show consistent improvements as well, further validating the transferability of our approach. Code and data are available at https://github.com/wjpoom/SPEC.
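The benchmark described above scores a VLM by whether it can pick the correct caption for an image (and vice versa) out of candidates that differ only in one fine-grained attribute such as count or position. As a rough illustration of that evaluation protocol, the sketch below computes image-to-text matching accuracy with an off-the-shelf CLIP model from Hugging Face `transformers`; the sample layout (image path, candidate captions, correct index) is a hypothetical stand-in, not the official SPEC data format.

```python
# Minimal sketch of SPEC-style image-to-text matching with a CLIP model.
# The `samples` structure is an illustrative assumption, not the SPEC format.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical sample: caption candidates differ only in object count.
samples = [
    {
        "image_path": "example.jpg",  # placeholder path
        "captions": [
            "a photo of one dog on the grass",
            "a photo of two dogs on the grass",
            "a photo of three dogs on the grass",
        ],
        "label": 1,  # index of the caption that matches the image
    },
]

correct = 0
with torch.no_grad():
    for sample in samples:
        image = Image.open(sample["image_path"]).convert("RGB")
        inputs = processor(
            text=sample["captions"], images=image,
            return_tensors="pt", padding=True,
        )
        # logits_per_image has shape (1, num_captions): similarity of the
        # image to each candidate caption.
        logits = model(**inputs).logits_per_image
        correct += int(logits.argmax(dim=-1).item() == sample["label"])

print(f"image-to-text accuracy: {correct / len(samples):.3f}")
```

Text-to-image matching works symmetrically: fix a caption, score it against a set of candidate images that differ in the same attribute, and check whether the true image ranks highest.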
Authors: Wujian Peng, Sicheng Xie, Zuyao You, Shiyi Lan, Zuxuan Wu