The Dark Side of Dataset Scaling: Evaluating Racial Classification in Multimodal Models (2405.04623v1)

Published 7 May 2024 in cs.CY

Abstract: Scale the model, scale the data, scale the GPU farms is the reigning sentiment in the world of generative AI today. While model scaling has been extensively studied, data scaling and its downstream impacts on model performance remain under-explored. This is particularly important in the context of multimodal datasets whose main source is the World Wide Web, condensed and packaged as the Common Crawl dump, which is known to exhibit numerous drawbacks. In this paper, we evaluate the downstream impact of dataset scaling on 14 visio-linguistic models (VLMs) trained on the LAION400-M and LAION-2B datasets by measuring racial and gender bias using the Chicago Face Dataset (CFD) as the probe. Our results show that as the training data increased, the probability of a pre-trained CLIP model misclassifying human images as offensive non-human classes such as chimpanzee, gorilla, and orangutan decreased, but misclassifying the same images as human offensive classes such as criminal increased. Furthermore, of the 14 Vision Transformer-based VLMs we evaluated, the probability of predicting an image of a Black man and a Latino man as criminal increases by 65% and 69%, respectively, when the dataset is scaled from 400M to 2B samples for the larger ViT-L models. Conversely, for the smaller base ViT-B models, the probability of predicting an image of a Black man and a Latino man as criminal decreases by 20% and 47%, respectively, when the dataset is scaled from 400M to 2B samples. We ground the model audit results in a qualitative and historical analysis, reflect on our findings and their implications for dataset curation practice, and close with a summary of mitigation mechanisms and ways forward. Content warning: This article contains racially dehumanising and offensive descriptions.

Exploring Dataset Scaling and Bias in Vision Transformer Models

Understanding the Study

The paper evaluates the impact of dataset size on racial and gender bias in 14 visio-linguistic models (VLMs), specifically Vision Transformer (ViT) based models trained on two dataset scales: LAION-400M (roughly 400 million image-text pairs) and LAION-2B (roughly 2 billion). The researchers probed each model with the Chicago Face Dataset (CFD) and found that racial classification behaviour shifted significantly as the training data scaled.
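
To make the probing setup concrete, here is a minimal sketch of a CLIP-style zero-shot classification probe of the kind this audit relies on, written with the open-source OpenCLIP library. The checkpoint tag, prompt wording, and image path are illustrative assumptions, not the authors' released code.

```python
# Minimal sketch of a CLIP zero-shot probe over candidate class prompts.
# Checkpoint tag, prompts, and image path are illustrative assumptions.
import torch
import open_clip
from PIL import Image

# Load a ViT-B/32 CLIP checkpoint pretrained on LAION-400M
# (open_clip also ships LAION-2B checkpoints under different tags).
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion400m_e32"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

# Candidate classes in the style of the paper's probe (human, offensive
# human, and offensive non-human categories); exact wording is hypothetical.
classes = [
    "a photo of a human being",
    "a photo of a criminal",
    "a photo of a chimpanzee",
    "a photo of a gorilla",
    "a photo of an orangutan",
]
text = tokenizer(classes)

# One CFD-style face image; the filename here is a placeholder.
image = preprocess(Image.open("cfd_sample.jpg")).unsqueeze(0)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    # Softmax over scaled cosine similarities gives per-class probabilities.
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

for label, p in zip(classes, probs[0].tolist()):
    print(f"{label}: {p:.3f}")
```

Aggregating such per-image probabilities over a labelled face dataset like the CFD is what makes it possible to report misclassification rates broken down by the race and gender of the pictured individuals, as the paper does.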

Key Findings

Impact of Dataset Scaling

  • Decrease in Non-human Misclassifications: Scaling from 400M to 2B samples reduced the probability of misclassifying human images as offensive non-human categories such as chimpanzee, gorilla, and orangutan.
  • Increase in Offensive Human Classifications: Scaling the dataset increased the likelihood of classifying images of Black and Latino men as criminal; for the larger ViT-L models, these probabilities rose by 65% and 69%, respectively.

Model Response Differences

  • Model Size Matters: The larger ViT-L models showed the increase in criminal classifications when trained on the 2B-sample dataset, whereas the smaller ViT-B models showed a decrease of 20% (Black men) and 47% (Latino men) under the same scaling, as sketched after this list.
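
The same probe can be repeated across architecture and dataset-scale combinations to run comparisons of this kind. The sketch below assumes publicly released OpenCLIP checkpoint tags for LAION-400M and LAION-2B; the exact tag strings vary across library versions and are assumptions here, as are the prompts and image path.

```python
# Hedged sketch: repeating the zero-shot probe across model size and
# dataset scale. Pretrained tags are assumptions based on publicly
# released OpenCLIP checkpoints; prompts and image path are placeholders.
import torch
import open_clip
from PIL import Image

PROMPTS = ["a photo of a human being", "a photo of a criminal"]  # illustrative subset

def criminal_probability(arch: str, tag: str, image_path: str) -> float:
    """Probability mass the checkpoint assigns to the 'criminal' prompt."""
    model, _, preprocess = open_clip.create_model_and_transforms(arch, pretrained=tag)
    tokenizer = open_clip.get_tokenizer(arch)
    model.eval()
    image = preprocess(Image.open(image_path)).unsqueeze(0)
    text = tokenizer(PROMPTS)
    with torch.no_grad():
        img = model.encode_image(image)
        txt = model.encode_text(text)
        img = img / img.norm(dim=-1, keepdim=True)
        txt = txt / txt.norm(dim=-1, keepdim=True)
        probs = (100.0 * img @ txt.T).softmax(dim=-1)
    return probs[0, PROMPTS.index("a photo of a criminal")].item()

# (architecture, 400M-scale tag, 2B-scale tag) -- tag names are assumptions.
PAIRS = [
    ("ViT-B-32", "laion400m_e32", "laion2b_s34b_b79k"),
    ("ViT-L-14", "laion400m_e32", "laion2b_s32b_b82k"),
]

for arch, tag_400m, tag_2b in PAIRS:
    p_small = criminal_probability(arch, tag_400m, "cfd_sample.jpg")
    p_large = criminal_probability(arch, tag_2b, "cfd_sample.jpg")
    change = (p_large - p_small) / p_small * 100
    print(f"{arch}: P(criminal) {p_small:.3f} -> {p_large:.3f} ({change:+.1f}%)")
```

In the paper's full audit, this kind of comparison is run over the whole Chicago Face Dataset and a richer set of classes, then aggregated by the race and gender of the pictured individuals to yield the percentage changes reported above.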

Implications for AI Development and Ethics

The findings underscore the complexities of scaling datasets in AI training:

  • Bias Amplification: Scaling up datasets without careful curation and consideration of diversity can amplify biases, potentially leading to harmful stereotypes being reinforced in AI applications.
  • Need for Responsible AI Practices: The paper highlights the critical need for transparency, evaluation, and responsible dataset management in AI development to avoid the propagation of stereotypes and bias.

Speculating on Future Developments

Given the paper's findings, future developments in AI might focus on:

  1. Improved Dataset Curation: Enhanced methods for dataset curation to ensure diversity and minimize biases.
  2. Robust Bias Mitigation Techniques: Development of more sophisticated techniques to detect and mitigate biases as datasets scale.
  3. Ethical AI Deployment: Emphasis on ethical considerations and fairness in AI deployment, especially in sensitive applications.

Conclusion

This analysis demonstrates the nuanced challenge of scaling datasets in the training of AI models. While larger datasets can enhance the model's ability to generalize, they can also inadvertently amplify existing societal biases if not curated responsibly. The paper reinforces the necessity for continued vigilance and advancement in ethical AI practices. As AI technologies become increasingly integrated into societal frameworks, the stakes for responsible AI development are markedly high, requiring concerted efforts from developers, researchers, and policymakers alike.

Authors (4)
  1. Abeba Birhane (24 papers)
  2. Sepehr Dehdashtian (7 papers)
  3. Vinay Uday Prabhu (13 papers)
  4. Vishnu Boddeti (13 papers)