The Dark Side of Dataset Scaling: Evaluating Racial Classification in Multimodal Models (2405.04623v1)

Published 7 May 2024 in cs.CY

Abstract: Scale the model, scale the data, scale the GPU farms is the reigning sentiment in the world of generative AI today. While model scaling has been extensively studied, data scaling and its downstream impacts on model performance remain under-explored. This is particularly important in the context of multimodal datasets whose main source is the World Wide Web, condensed and packaged as the Common Crawl dump, which is known to exhibit numerous drawbacks. In this paper, we evaluate the downstream impact of dataset scaling on 14 visio-linguistic models (VLMs) trained on the LAION400-M and LAION-2B datasets by measuring racial and gender bias using the Chicago Face Dataset (CFD) as the probe. Our results show that as the training data increased, the probability of a pre-trained CLIP model misclassifying human images as offensive non-human classes such as chimpanzee, gorilla, and orangutan decreased, but misclassifying the same images as human offensive classes such as criminal increased. Furthermore, of the 14 Vision Transformer-based VLMs we evaluated, the probability of predicting an image of a Black man and a Latino man as criminal increases by 65% and 69%, respectively, when the dataset is scaled from 400M to 2B samples for the larger ViT-L models. Conversely, for the smaller base ViT-B models, the probability of predicting an image of a Black man and a Latino man as criminal decreases by 20% and 47%, respectively, when the dataset is scaled from 400M to 2B samples. We ground the model audit results in a qualitative and historical analysis, reflect on our findings and their implications for dataset curation practice, and close with a summary of mitigation mechanisms and ways forward. Content warning: This article contains racially dehumanising and offensive descriptions.

Exploring Dataset Scaling and Bias in Vision Transformer Models

Understanding the Study

The paper evaluates the impact of dataset size on racial and gender bias in 14 visio-linguistic models (VLMs), specifically Vision Transformer (ViT) based models trained on two dataset scales: LAION-400M (roughly 400 million image-text pairs) and LAION-2B (roughly 2 billion). The researchers probed each model with the Chicago Face Dataset (CFD) and found that racial classification behaviour shifted significantly as the training data scaled.
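
To make the probing setup concrete, here is a minimal sketch of a CLIP-style zero-shot classification probe of the kind this audit relies on, written with the open-source OpenCLIP library. The checkpoint tag, prompt wording, and image path are illustrative assumptions, not the authors' released code.

```python
# Minimal sketch of a CLIP zero-shot probe over candidate class prompts.
# Checkpoint tag, prompts, and image path are illustrative assumptions.
import torch
import open_clip
from PIL import Image

# Load a ViT-B/32 CLIP checkpoint pretrained on LAION-400M
# (open_clip also ships LAION-2B checkpoints under different tags).
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion400m_e32"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

# Candidate classes in the style of the paper's probe (human, offensive
# human, and offensive non-human categories); exact wording is hypothetical.
classes = [
    "a photo of a human being",
    "a photo of a criminal",
    "a photo of a chimpanzee",
    "a photo of a gorilla",
    "a photo of an orangutan",
]
text = tokenizer(classes)

# One CFD-style face image; the filename here is a placeholder.
image = preprocess(Image.open("cfd_sample.jpg")).unsqueeze(0)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    # Softmax over scaled cosine similarities gives per-class probabilities.
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

for label, p in zip(classes, probs[0].tolist()):
    print(f"{label}: {p:.3f}")
```

Aggregating such per-image probabilities over a labelled face dataset like the CFD is what makes it possible to report misclassification rates broken down by the race and gender of the pictured individuals, as the paper does.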

Key Findings

Impact of Dataset Scaling

  • Decrease in Non-human Misclassifications: Scaling from 400M to 2B samples reduced the probability of misclassifying human images as offensive non-human categories such as chimpanzee, gorilla, and orangutan.
  • Increase in Offensive Human Classifications: Scaling the dataset increased the likelihood of classifying images of Black and Latino men as criminal; for the larger ViT-L models, these probabilities rose by 65% and 69%, respectively.

Model Response Differences

  • Model Size Matters: The larger ViT-L models showed the increase in criminal classifications when trained on the 2B-sample dataset, whereas the smaller ViT-B models showed a decrease of 20% (Black men) and 47% (Latino men) under the same scaling, as sketched after this list.
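
The same probe can be repeated across architecture and dataset-scale combinations to run comparisons of this kind. The sketch below assumes publicly released OpenCLIP checkpoint tags for LAION-400M and LAION-2B; the exact tag strings vary across library versions and are assumptions here, as are the prompts and image path.

```python
# Hedged sketch: repeating the zero-shot probe across model size and
# dataset scale. Pretrained tags are assumptions based on publicly
# released OpenCLIP checkpoints; prompts and image path are placeholders.
import torch
import open_clip
from PIL import Image

PROMPTS = ["a photo of a human being", "a photo of a criminal"]  # illustrative subset

def criminal_probability(arch: str, tag: str, image_path: str) -> float:
    """Probability mass the checkpoint assigns to the 'criminal' prompt."""
    model, _, preprocess = open_clip.create_model_and_transforms(arch, pretrained=tag)
    tokenizer = open_clip.get_tokenizer(arch)
    model.eval()
    image = preprocess(Image.open(image_path)).unsqueeze(0)
    text = tokenizer(PROMPTS)
    with torch.no_grad():
        img = model.encode_image(image)
        txt = model.encode_text(text)
        img = img / img.norm(dim=-1, keepdim=True)
        txt = txt / txt.norm(dim=-1, keepdim=True)
        probs = (100.0 * img @ txt.T).softmax(dim=-1)
    return probs[0, PROMPTS.index("a photo of a criminal")].item()

# (architecture, 400M-scale tag, 2B-scale tag) -- tag names are assumptions.
PAIRS = [
    ("ViT-B-32", "laion400m_e32", "laion2b_s34b_b79k"),
    ("ViT-L-14", "laion400m_e32", "laion2b_s32b_b82k"),
]

for arch, tag_400m, tag_2b in PAIRS:
    p_small = criminal_probability(arch, tag_400m, "cfd_sample.jpg")
    p_large = criminal_probability(arch, tag_2b, "cfd_sample.jpg")
    change = (p_large - p_small) / p_small * 100
    print(f"{arch}: P(criminal) {p_small:.3f} -> {p_large:.3f} ({change:+.1f}%)")
```

In the paper's full audit, this kind of comparison is run over the whole Chicago Face Dataset and a richer set of classes, then aggregated by the race and gender of the pictured individuals to yield the percentage changes reported above.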

Implications for AI Development and Ethics

The findings underscore the complexities of scaling datasets in AI training:

  • Bias Amplification: Scaling up datasets without careful curation and consideration of diversity can amplify biases, potentially leading to harmful stereotypes being reinforced in AI applications.
  • Need for Responsible AI Practices: The paper highlights the critical need for transparency, evaluation, and responsible dataset management in AI development to avoid the propagation of stereotypes and bias.

Speculating on Future Developments

Given the paper's findings, future developments in AI might focus on:

  1. Improved Dataset Curation: Enhanced methods for dataset curation to ensure diversity and minimize biases.
  2. Robust Bias Mitigation Techniques: Development of more sophisticated techniques to detect and mitigate biases as datasets scale.
  3. Ethical AI Deployment: Emphasis on ethical considerations and fairness in AI deployment, especially in sensitive applications.

Conclusion

This analysis demonstrates the nuanced challenge of scaling datasets in the training of AI models. While larger datasets can enhance the model's ability to generalize, they can also inadvertently amplify existing societal biases if not curated responsibly. The paper reinforces the necessity for continued vigilance and advancement in ethical AI practices. As AI technologies become increasingly integrated into societal frameworks, the stakes for responsible AI development are markedly high, requiring concerted efforts from developers, researchers, and policymakers alike.

Authors (4)
  1. Abeba Birhane (24 papers)
  2. Sepehr Dehdashtian (7 papers)
  3. Vinay Uday Prabhu (13 papers)
  4. Vishnu Boddeti (13 papers)