New Job, New Gender? Measuring the Social Bias in Image Generation Models (2401.00763v3)
Abstract: Image generation models can generate or edit images from a given text prompt. Recent advances in image generation technology, exemplified by DALL-E and Midjourney, have been groundbreaking. Despite their impressive capabilities, these models are often trained on massive Internet datasets, making them susceptible to generating content that perpetuates social stereotypes and biases, which can lead to severe consequences. Prior research on assessing bias in image generation models suffers from several shortcomings, including limited accuracy, reliance on extensive human labor, and a lack of comprehensive analysis. In this paper, we propose BiasPainter, a novel evaluation framework that can accurately, automatically, and comprehensively trigger social bias in image generation models. BiasPainter uses a diverse set of seed images of individuals and prompts the image generation models to edit these images using gender-, race-, and age-neutral queries. These queries span 62 professions, 39 activities, 57 types of objects, and 70 personality traits. The framework then compares the edited images to the original seed images, focusing on significant changes in gender, race, and age. BiasPainter's key insight is that these characteristics should not be modified under neutral prompts. Built on this design, BiasPainter can trigger social bias and evaluate the fairness of image generation models. We use BiasPainter to evaluate six widely used image generation models, including Stable Diffusion and Midjourney. Experimental results show that BiasPainter can successfully trigger social bias in image generation models. According to our human evaluation, BiasPainter achieves 90.8% accuracy in automatic bias detection, significantly higher than the results reported in previous work.
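The core of BiasPainter, as the abstract describes it, is a metamorphic invariant: editing a seed image with a gender-, race-, and age-neutral prompt should leave those three characteristics unchanged. The following is a minimal sketch of that check, assuming hypothetical attribute dictionaries (in the actual framework these would come from automated analysis of the seed and edited images; the function and value names here are illustrative, not from the paper).

```python
# Sketch of BiasPainter's metamorphic check: under a neutral prompt,
# gender, race, and age of the person in the seed image should not change.
NEUTRAL_ATTRIBUTES = ("gender", "race", "age")

def detect_bias(seed_attrs: dict, edited_attrs: dict) -> list:
    """Return the protected attributes that changed after a neutral edit.

    Any change is flagged as a potential social-bias instance, since the
    editing prompt (e.g. "make the person a lawyer") said nothing about
    gender, race, or age.
    """
    return [a for a in NEUTRAL_ATTRIBUTES
            if seed_attrs.get(a) != edited_attrs.get(a)]

# Hypothetical example: a neutral profession edit flips the perceived gender.
seed = {"gender": "female", "race": "Asian", "age": "30-40"}
edited = {"gender": "male", "race": "Asian", "age": "30-40"}
print(detect_bias(seed, edited))  # → ['gender']
```

Aggregating such flags over the 62 professions, 39 activities, 57 object types, and 70 personality traits mentioned in the abstract would yield the kind of per-model fairness statistics the paper reports.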
- Wenxuan Wang
- Haonan Bai
- Jen-tse Huang
- Yuxuan Wan
- Youliang Yuan
- Haoyi Qiu
- Nanyun Peng
- Michael R. Lyu