
The Neglected Tails in Vision-Language Models (2401.12425v3)

Published 23 Jan 2024 in cs.CV, cs.CL, and cs.LG

Abstract: Vision-Language Models (VLMs) excel in zero-shot recognition but their performance varies greatly across different visual concepts. For example, although CLIP achieves impressive accuracy on ImageNet (60-80%), its performance drops below 10% for more than ten concepts like night snake, presumably due to their limited presence in the pretraining data. However, measuring the frequency of concepts in VLMs' large-scale datasets is challenging. We address this by using LLMs to count the number of pretraining texts that contain synonyms of these concepts. Our analysis confirms that popular datasets, such as LAION, exhibit a long-tailed concept distribution, yielding biased performance in VLMs. We also find that downstream applications of VLMs, including visual chatbots (e.g., GPT-4V) and text-to-image models (e.g., Stable Diffusion), often fail to recognize or generate images of rare concepts identified by our method. To mitigate the imbalanced performance of zero-shot VLMs, we propose REtrieval-Augmented Learning (REAL). First, instead of prompting VLMs using the original class names, REAL uses their most frequent synonyms found in pretraining texts. This simple change already outperforms costly human-engineered and LLM-enriched prompts over nine benchmark datasets. Second, REAL trains a linear classifier on a small yet balanced set of pretraining data retrieved using concept synonyms. REAL surpasses the previous zero-shot SOTA, using 400x less storage and 10,000x less training time!
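The abstract describes two concrete mechanisms: estimating a concept's pretraining frequency by counting captions that mention any of its synonyms, and prompting the VLM with the most frequent synonym rather than the original class name. The sketch below illustrates both ideas under loose assumptions: it uses the Hugging Face `transformers` CLIP API, a toy caption list, and hand-written synonym lists (the paper instead mines synonyms with LLMs and counts over LAION-scale text), so it is an illustration of the idea rather than the authors' implementation.

```python
from collections import Counter

from transformers import CLIPModel, CLIPProcessor

# Step 1: estimate concept frequency by counting captions that mention any
# synonym of the concept. The synonym lists and captions below are toy
# placeholders, not the paper's data.
synonyms = {
    "night snake": ["night snake", "hypsiglena"],  # hypothetical synonym list
    "tiger": ["tiger", "panthera tigris"],
}
captions = [
    "a tiger resting in tall grass",
    "hypsiglena coiled on a rock at dusk",
    "portrait of a bengal tiger",
]

def concept_frequency(captions, synonyms):
    """Count how many captions mention at least one synonym of each concept."""
    counts = Counter()
    for concept, names in synonyms.items():
        for cap in captions:
            text = cap.lower()
            if any(name in text for name in names):
                counts[concept] += 1
    return counts

def most_frequent_synonym(captions, names):
    """Pick the synonym that appears in the most captions."""
    return max(names, key=lambda n: sum(n in cap.lower() for cap in captions))

freq = concept_frequency(captions, synonyms)
best_name = {c: most_frequent_synonym(captions, names) for c, names in synonyms.items()}

# Step 2: REAL-Prompt-style zero-shot classification, prompting CLIP with the
# most frequent synonym instead of the original class name.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

class_names = list(synonyms)
prompts = [f"a photo of a {best_name[c]}" for c in class_names]

# `image` would be a PIL.Image of the query photo (omitted here):
# inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
# logits = model(**inputs).logits_per_image          # shape: (1, num_classes)
# predicted = class_names[logits.argmax(dim=-1).item()]
```

The second stage described in the abstract (REAL-Linear) would then retrieve a small, balanced set of pretraining images per concept via these synonyms and fit a linear classifier on frozen CLIP features; that step is omitted from this sketch.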

Authors (8)
  1. Shubham Parashar (6 papers)
  2. Zhiqiu Lin (19 papers)
  3. Tian Liu (80 papers)
  4. Xiangjue Dong (16 papers)
  5. Yanan Li (54 papers)
  6. Deva Ramanan (152 papers)
  7. James Caverlee (56 papers)
  8. Shu Kong (50 papers)
Citations (20)