
Are Vision Language Models Texture or Shape Biased and Can We Steer Them? (2403.09193v1)

Published 14 Mar 2024 in cs.CV, cs.AI, cs.LG, and q-bio.NC

Abstract: Vision language models (VLMs) have drastically changed the computer vision model landscape in only a few years, opening an exciting array of new applications from zero-shot image classification, over image captioning, to visual question answering. Unlike pure vision models, they offer an intuitive way to access visual content through language prompting. The wide applicability of such models encourages us to ask whether they also align with human vision - specifically, how far they adopt human-induced visual biases through multimodal fusion, or whether they simply inherit biases from pure vision models. One important visual bias is the texture vs. shape bias, or the dominance of local over global information. In this paper, we study this bias in a wide range of popular VLMs. Interestingly, we find that VLMs are often more shape-biased than their vision encoders, indicating that visual biases are modulated to some extent through text in multimodal models. If text does indeed influence visual biases, this suggests that we may be able to steer visual biases not just through visual input but also through language: a hypothesis that we confirm through extensive experiments. For instance, we are able to steer shape bias from as low as 49% to as high as 72% through prompting alone. For now, the strong human bias towards shape (96%) remains out of reach for all tested VLMs.

Exploring the Texture vs. Shape Bias in Vision Language Models (VLMs)

Introduction

Vision Language Models (VLMs) have become a pivotal component at the intersection of computer vision and natural language processing, enabling a range of applications from zero-shot image classification to image captioning and visual question answering. A natural question in this context is how well VLMs align with human visual perception, particularly in how they balance texture and shape cues. Historically, vision-only models displayed a pronounced preference for texture over shape, a pattern that diverges from human vision, which strongly favors shape. This paper examines the texture vs. shape bias in a wide range of VLMs and assesses whether that bias can be moderated or redirected through linguistic prompts, laying the groundwork for deeper inquiry into how these models perceive and interpret visual information.

Texture vs. Shape Bias in VLMs

An extensive analysis of popular VLMs reveals a nuanced landscape: contrary to prior vision-only models, many VLMs lean more strongly toward shape when processing visual information. This shift suggests that multimodal training on text and images does not merely transplant the vision encoders' biases into VLMs but modulates them through linguistic integration. Crucially, while VLMs are more shape-oriented than their vision-only counterparts, they still fall well short of the strong human preference for shape (about 96% shape bias). Notably, certain models adjust their bias depending on the task, displaying different levels of shape preference in visual question answering (VQA) and image captioning.
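The shape-bias numbers discussed here follow the standard cue-conflict protocol of Geirhos et al.: the model classifies images whose shape comes from one class and whose texture from another, and shape bias is the fraction of shape decisions among all decisions that match either cue. Below is a minimal Python sketch of that metric, assuming the model's free-form answers have already been parsed into class labels; the tuples in the example are illustrative, not taken from the paper.

```python
# Standard shape-bias metric on texture-shape cue-conflict images:
# the fraction of shape decisions among all decisions that match
# either the shape class or the texture class of the image.
# `decisions` holds (predicted_class, shape_class, texture_class) tuples.

def shape_bias(decisions):
    shape_hits = sum(pred == shape for pred, shape, _ in decisions)
    texture_hits = sum(pred == texture for pred, _, texture in decisions)
    valid = shape_hits + texture_hits  # answers matching neither cue are ignored
    return shape_hits / valid if valid else float("nan")

# Illustrative example: 3 shape decisions, 1 texture decision -> 0.75
example = [
    ("cat", "cat", "elephant"),
    ("car", "car", "clock"),
    ("dog", "dog", "bottle"),
    ("knife", "bear", "knife"),
]
print(shape_bias(example))  # 0.75
```

Because answers matching neither cue are dropped from the denominator, shape bias and texture bias always sum to one under this definition.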

Investigation of Bias Modulation

The central question of whether and how the visual biases in VLMs can be influenced through language yields compelling answers. By employing task-specific prompting and by altering the visual input through pre-processing techniques such as patch shuffling and noise addition, the paper probes how malleable the shape and texture biases are. Intriguingly, text-based manipulation alone can steer shape bias over a considerable range (from 49% to as high as 72% in the reported experiments), although not as strongly as visual alterations can. This finding opens intriguing avenues for research into how textual and visual information jointly guide model perception.
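For concreteness, the sketch below implements the two kinds of visual manipulation mentioned above: patch shuffling, which destroys global shape while preserving local texture, and additive Gaussian noise, which degrades fine texture more than coarse shape. The patch size and noise level are illustrative assumptions, not the paper's exact settings.

```python
import numpy as np

def patch_shuffle(img, patch=56, rng=None):
    """Randomly permute non-overlapping patches of an HxWxC uint8 image
    (H and W are assumed divisible by `patch`)."""
    rng = rng or np.random.default_rng(0)
    h, w, c = img.shape
    patches = (img.reshape(h // patch, patch, w // patch, patch, c)
                  .transpose(0, 2, 1, 3, 4)
                  .reshape(-1, patch, patch, c))
    patches = patches[rng.permutation(len(patches))]
    return (patches.reshape(h // patch, w // patch, patch, patch, c)
                   .transpose(0, 2, 1, 3, 4)
                   .reshape(h, w, c))

def add_gaussian_noise(img, sigma=25.0, rng=None):
    """Add pixel-wise Gaussian noise to a uint8 image and clip to [0, 255]."""
    rng = rng or np.random.default_rng(0)
    noisy = img.astype(np.float32) + rng.normal(0.0, sigma, img.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)
```

Applying such transforms before querying a model and then re-computing the shape-bias metric above gives a rough handle on how much of the bias is driven by the visual input itself, as opposed to the prompt.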

Implications and Future Directions

The findings of this paper have broad implications, both theoretical and practical. On a theoretical level, the evidence that VLMs’ visual biases can be partially steered through linguistic inputs enriches our understanding of multimodal learning dynamics and the complex interplay between text and image processing. Practically, the ability to modulate visual biases in VLMs could enhance model performance across tasks that require nuanced visual understanding, from improved accessibility tools to more accurate visual search and annotation systems.
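As a hedged illustration of what prompt-based steering could look like in practice, the sketch below evaluates a few candidate instructions on a set of cue-conflict images and reports the resulting shape bias. The prompts, the `query_vlm` call, and the `cue_conflict_set` loader are hypothetical placeholders, not the paper's actual setup; `shape_bias` refers to the metric sketched earlier.

```python
# Hypothetical prompt-steering evaluation loop.
PROMPTS = [
    "Which object is shown in the image? Answer with a single word.",
    "Identify the object by its shape, ignoring texture. Answer with a single word.",
    "Identify the object by its texture, ignoring shape. Answer with a single word.",
]

def evaluate_prompt(prompt, cue_conflict_set, query_vlm):
    """Compute shape bias for one prompt over (image, shape_class, texture_class) triples."""
    decisions = []
    for image, shape_class, texture_class in cue_conflict_set:
        pred = query_vlm(image, prompt)  # hypothetical VLM call returning a parsed class label
        decisions.append((pred, shape_class, texture_class))
    return shape_bias(decisions)  # metric defined in the earlier sketch

# Usage (assuming query_vlm and cue_conflict_set exist):
# for prompt in PROMPTS:
#     print(prompt, evaluate_prompt(prompt, cue_conflict_set, query_vlm))
```

Comparing the resulting shape-bias values across prompts is the simplest way to quantify how far language alone moves the bias for a given model.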

Looking ahead, this exploration sets the stage for further studies into the multimodal workings of VLMs, encouraging a deeper dive into the mechanisms that underpin bias modulation. Additionally, given the rapid evolution of VLM technologies, future work could extend beyond texture and shape bias to uncover other potential biases and the extent to which they can be shaped through multimodal interactions.

Conclusion

This paper provides a foundational exploration of the texture vs. shape bias in VLMs, revealing a marked departure from the tendencies observed in vision-only models. Through careful experimentation, it establishes that VLMs exhibit a stronger shape bias than their vision-only counterparts, and that this bias can be further steered through linguistic prompts, to a meaningful but still limited degree. These insights enrich our understanding of how VLMs operate and offer practical pathways to improve their alignment with human visual perception, marking a step toward more intuitive and effective multimodal models.

Authors (8)
  1. Paul Gavrikov (13 papers)
  2. Jovita Lukasik (13 papers)
  3. Steffen Jung (13 papers)
  4. Robert Geirhos (28 papers)
  5. Bianca Lamm (5 papers)
  6. Muhammad Jehanzeb Mirza (10 papers)
  7. Margret Keuper (77 papers)
  8. Janis Keuper (66 papers)
Citations (9)