VisoGender: A dataset for benchmarking gender bias in image-text pronoun resolution (2306.12424v3)
Abstract: We introduce VisoGender, a novel dataset for benchmarking gender bias in vision-language models. We focus on occupation-related biases within a hegemonic system of binary gender, inspired by Winograd and Winogender schemas, where each image is associated with a caption containing a pronoun relationship of subjects and objects in the scene. VisoGender is balanced by gender representation in professional roles, supporting bias evaluation in two ways: i) resolution bias, where we evaluate the difference between pronoun resolution accuracies for image subjects with gender presentations perceived as masculine versus feminine by human annotators, and ii) retrieval bias, where we compare ratios of professionals perceived to have masculine and feminine gender presentations retrieved for a gender-neutral search query. We benchmark several state-of-the-art vision-language models and find that they demonstrate bias in resolving binary gender in complex scenes. While the direction and magnitude of gender bias depend on the task and the model being evaluated, captioning models are generally less biased than vision-language encoders. Dataset and code are available at https://github.com/oxai/visogender
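The two evaluation modes named in the abstract can be illustrated with a minimal Python sketch. It is not the code from the VisoGender repository: the function names, the "masculine"/"feminine" label strings, and the normalized top-K difference used for retrieval are assumptions chosen for illustration, and the paper's exact metric definitions may differ.

```python
from typing import Sequence


def resolution_accuracy_gap(correct: Sequence[bool],
                            perceived_gender: Sequence[str]) -> float:
    """Accuracy on masculine-perceived subjects minus accuracy on
    feminine-perceived subjects (hypothetical helper, labels assumed)."""

    def accuracy(label: str) -> float:
        outcomes = [c for c, g in zip(correct, perceived_gender) if g == label]
        if not outcomes:
            raise ValueError(f"no examples with label {label!r}")
        return sum(outcomes) / len(outcomes)

    return accuracy("masculine") - accuracy("feminine")


def retrieval_gap_at_k(ranked_genders: Sequence[str], k: int) -> float:
    """Normalized difference between masculine- and feminine-perceived
    images in the top-k results for a gender-neutral query.
    Ranges from -1 (all feminine) to +1 (all masculine); 0 is balanced.
    This particular formula is an assumption, not taken from the abstract."""
    top_k = list(ranked_genders[:k])
    n_masc = top_k.count("masculine")
    n_fem = top_k.count("feminine")
    if n_masc + n_fem == 0:
        return 0.0
    return (n_masc - n_fem) / (n_masc + n_fem)


# Toy data: whether the model resolved the pronoun correctly for each image,
# and the annotator-perceived gender presentation of the image subject.
correct = [True, True, False, True, False, False]
genders = ["masculine"] * 3 + ["feminine"] * 3
gap = resolution_accuracy_gap(correct, genders)

# Toy ranking returned for a neutral query, best match first.
ranking = ["masculine", "masculine", "feminine", "masculine", "feminine"]
skew = retrieval_gap_at_k(ranking, k=4)
```

In this toy setup a positive value from either function means the model favors subjects perceived as masculine, a negative value means it favors subjects perceived as feminine, and values near zero indicate balance.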
Authors: Siobhan Mackenzie Hall, Fernanda Gonçalves Abrantes, Hanwen Zhu, Grace Sodunke, Aleksandar Shtedritski, Hannah Rose Kirk