GABInsight: Exploring Gender-Activity Binding Bias in Vision-Language Models (2407.21001v3)

Published 30 Jul 2024 in cs.CV, cs.AI, and cs.LG

Abstract: Vision-language models (VLMs) are intensively used in many downstream tasks, including those requiring assessments of individuals appearing in the images. While VLMs perform well in simple single-person scenarios, in real-world applications, we often face complex situations in which there are persons of different genders doing different activities. We show that in such cases, VLMs are biased towards identifying the individual with the expected gender (according to ingrained gender stereotypes in the model or other forms of sample selection bias) as the performer of the activity. We refer to this bias in associating an activity with the gender of its actual performer in an image or text as the Gender-Activity Binding (GAB) bias and analyze how this bias is internalized in VLMs. To assess this bias, we have introduced the GAB dataset with approximately 5500 AI-generated images that represent a variety of activities, addressing the scarcity of real-world images for some scenarios. To have extensive quality control, the generated images are evaluated for their diversity, quality, and realism. We have tested 12 renowned pre-trained VLMs on this dataset in the context of text-to-image and image-to-text retrieval to measure the effect of this bias on their predictions. Additionally, we have carried out supplementary experiments to quantify the bias in VLMs' text encoders and to evaluate VLMs' capability to recognize activities. Our experiments indicate that VLMs experience an average performance decline of about 13.2% when confronted with gender-activity binding bias.

Examination of Gender-Activity Binding Bias in Vision-Language Models

The paper "GABInsight: Exploring Gender-Activity Binding Bias in Vision-Language Models" presents a focused analysis of biases inherent in Vision-Language Models (VLMs). It examines the Gender-Activity Binding (GAB) bias, whereby VLMs incorrectly associate certain activities with specific genders due to ingrained stereotypes or sample selection biases in the training data. Such biases are particularly consequential for VLM applications in real-world scenarios, where gender representation can affect model decisions and perpetuate gender stereotypes.

Methodology

To systematically examine the extent of GAB bias, the authors introduce the Gender-Activity Binding (GAB) dataset, consisting of approximately 5500 AI-generated images. The dataset was curated using DALL-E 3 for image generation, with extensive prompt refinement for diversity and realism, thus addressing the scarcity of suitable real-world images for such scenarios. The chosen activities are categorized as stereotypical, everyday, or gender-biased, based on insights from GPT-4 and the LAION-400M dataset. This dataset provides a structured platform for evaluating the effect of gender biases in VLMs across various experimental settings.
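
As an illustration of how such a dataset could be assembled, the sketch below generates gender-activity image pairs with the OpenAI Images API (DALL-E 3). The activity list, prompt wording, and file naming are hypothetical placeholders, not the paper's actual generation pipeline or prompts.

```python
# Hypothetical sketch: generating gender-activity image pairs with DALL-E 3.
# Activities, prompt wording, and file naming are illustrative, not the paper's pipeline.
from openai import OpenAI
import urllib.request

client = OpenAI()  # reads OPENAI_API_KEY from the environment

activities = ["fixing a car", "arranging flowers", "reading a newspaper"]  # placeholder activities
genders = ["man", "woman"]

for activity in activities:
    for gender in genders:
        prompt = (f"A realistic photo of a {gender} {activity}, "
                  "natural lighting, varied background")
        response = client.images.generate(
            model="dall-e-3",
            prompt=prompt,
            n=1,                 # DALL-E 3 generates one image per request
            size="1024x1024",
        )
        url = response.data[0].url
        filename = f"{gender}_{activity.replace(' ', '_')}.png"
        urllib.request.urlretrieve(url, filename)  # download the generated image
```

In practice, the paper also applies quality-control checks on diversity, quality, and realism before images enter the dataset; those filtering steps are not shown here.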

Evaluation of Vision-Language Models

The research conducts a comprehensive benchmark of 12 notable VLMs, assessing their performance on image-to-text and text-to-image retrieval tasks under the influence of GAB bias. The authors find that retrieval accuracy declines by an average of 13.2% when VLMs encounter scenarios that contradict stereotypical gender roles. In image-to-text retrieval, performance drops markedly when the unexpected gender performs the activity in a scene containing both genders; when only one gender is depicted, models are considerably more accurate.
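
To make the retrieval setup concrete, the following sketch scores a single image against two captions that differ only in which gender is bound to the activity, using an off-the-shelf CLIP checkpoint from Hugging Face. The image path, captions, and choice of model are illustrative assumptions; the paper benchmarks 12 different VLMs with its own caption templates.

```python
# Illustrative image-to-text retrieval probe with a standard CLIP checkpoint.
# The image and captions are placeholders for a GAB-style test pair.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("scene_with_both_genders.png")  # hypothetical image: woman fixing a car, man nearby
captions = [
    "a man fixing a car while a woman stands nearby",
    "a woman fixing a car while a man stands nearby",
]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-to-text similarity scores; the retrieved caption is the argmax.
# A biased model tends to select the stereotype-consistent caption even when the image contradicts it.
probs = outputs.logits_per_image.softmax(dim=-1)[0]
for caption, p in zip(captions, probs.tolist()):
    print(f"{p:.3f}  {caption}")
```

Aggregating the argmax decision over many such image-caption pairs yields the retrieval accuracy whose decline under gender-activity mismatch the paper quantifies.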

For text-to-image retrieval, VLMs perform close to chance, at approximately 50% accuracy, indicating that they cannot reliably ground gender-activity bindings in the image alone. This suggests that while the image encoders in VLMs struggle to identify the performer of an activity, the text encoders exhibit a pronounced bias toward traditional gender roles, as observed through embedding similarities.
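
A minimal way to probe the text encoder in isolation, in the spirit of the embedding-similarity analysis mentioned above, is to compare a gender-neutral activity phrase against its gendered variants. The sentences and checkpoint below are illustrative assumptions, not the paper's exact probe.

```python
# Illustrative probe of the text encoder alone: compare a gender-neutral activity
# phrase with its gendered variants. Sentences and checkpoint are assumptions.
import torch
from transformers import CLIPModel, CLIPTokenizer

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

sentences = ["a person fixing a car", "a man fixing a car", "a woman fixing a car"]
inputs = tokenizer(sentences, return_tensors="pt", padding=True)

with torch.no_grad():
    emb = model.get_text_features(**inputs)
emb = emb / emb.norm(dim=-1, keepdim=True)  # L2-normalize for cosine similarity

# A systematic gap between the two similarities suggests the text encoder binds
# the activity more strongly to one gender.
sim_man = (emb[0] @ emb[1]).item()
sim_woman = (emb[0] @ emb[2]).item()
print(f"neutral vs. man: {sim_man:.3f}   neutral vs. woman: {sim_woman:.3f}")
```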

Discussion and Implications

The findings imply that the GAB bias is absorbed predominantly by the text encoders of VLMs rather than the image encoders, revealing a critical target for bias mitigation. The paper highlights the need to address gender stereotypes in training datasets and the influence of biased pre-training, emphasizing that these models may inadvertently reinforce societal biases if integrated into decision-making systems.

Future Prospects and Considerations

Looking ahead, the research identifies several pathways for uncovering and mitigating the biases embedded in VLMs. Extending this framework to assess other biases, such as racial and age stereotypes, is an essential future direction. Comprehensive scrutiny of the datasets underpinning VLM training would also help pinpoint the origins of these biases.

Overall, this paper makes a significant contribution to understanding and addressing gender biases in VLMs, highlighting the complexity and nuance of these biases and the need for further refinement of model training and evaluation frameworks. The introduction of a dedicated dataset and the accompanying analysis pave the way for more equitable model development that could meaningfully improve VLM applications in diverse real-world environments.

Authors (9)
  1. Ali Abdollahi (7 papers)
  2. Mahdi Ghaznavi (3 papers)
  3. Mohammad Reza Karimi Nejad (1 paper)
  4. Arash Mari Oriyad (1 paper)
  5. Reza Abbasi (8 papers)
  6. Ali Salesi (1 paper)
  7. Melika Behjati (3 papers)
  8. Mohammad Hossein Rohban (43 papers)
  9. Mahdieh Soleymani Baghshah (49 papers)