Benchmarking Zero-Shot Robustness of Multimodal Foundation Models: A Pilot Study (2403.10499v1)

Published 15 Mar 2024 in cs.LG, cs.AI, cs.CL, and cs.CV

Abstract: Pre-training image representations from raw text about images enables zero-shot vision transfer to downstream tasks. Through pre-training on millions of samples collected from the internet, multimodal foundation models such as CLIP produce state-of-the-art zero-shot results that are often competitive with fully supervised methods without any task-specific training. Beyond the encouraging classification accuracy, these models are reported to close the robustness gap by matching the performance of supervised models trained on ImageNet under natural distribution shift. Because robustness is critical to real-world applications, especially safety-critical ones, we present a comprehensive evaluation based on a large-scale robustness benchmark covering 7 natural distribution shifts, 3 synthetic distribution shifts, and 11 adversarial attacks, using CLIP as a pilot study. We show that CLIP suffers a significant robustness drop compared to supervised ImageNet models on our benchmark, especially under synthetic distribution shift and adversarial attacks. Furthermore, data overlap analysis suggests that the observed robustness under natural distribution shifts can be attributed, at least in part, to data overlap. In summary, our evaluation shows that a comprehensive assessment of robustness is necessary and that there is a significant need to improve the robustness of zero-shot multimodal models.

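The abstract describes zero-shot transfer as matching image embeddings against text embeddings built from class-name prompts, then measuring accuracy on a shifted test distribution. Below is a minimal sketch of such an evaluation loop, not the authors' released code: the `clip` package usage is standard, but the dataset path, prompt template, and batch size are illustrative assumptions rather than the paper's exact setup.

```python
# Hedged sketch: zero-shot CLIP accuracy on a distribution-shifted test set.
# Assumes the open-source `clip` package (github.com/openai/CLIP) and a
# hypothetical ImageFolder-style dataset at `data/imagenet-r`.
import torch
import clip
from torchvision import datasets

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

dataset = datasets.ImageFolder("data/imagenet-r", transform=preprocess)
loader = torch.utils.data.DataLoader(dataset, batch_size=64, num_workers=4)

# Build zero-shot classifier weights from class-name prompts.
prompts = clip.tokenize([f"a photo of a {c}" for c in dataset.classes]).to(device)
with torch.no_grad():
    text_features = model.encode_text(prompts)
    text_features /= text_features.norm(dim=-1, keepdim=True)

correct = total = 0
with torch.no_grad():
    for images, labels in loader:
        image_features = model.encode_image(images.to(device))
        image_features /= image_features.norm(dim=-1, keepdim=True)
        logits = image_features @ text_features.T  # cosine similarities
        correct += (logits.argmax(dim=-1).cpu() == labels).sum().item()
        total += labels.size(0)

print(f"zero-shot top-1 accuracy: {correct / total:.3f}")
```

The same loop can be reused for the synthetic-shift and adversarial settings the abstract mentions by swapping the test set or perturbing `images` before encoding; the benchmark's specific shifts and attacks are defined in the paper itself.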
Authors (4)
  1. Chenguang Wang (59 papers)
  2. Ruoxi Jia (88 papers)
  3. Xin Liu (820 papers)
  4. Dawn Song (229 papers)
Citations (1)

