An Analysis of "On Hate Scaling Laws for Data-Swamps"
This paper, authored by Birhane et al., rigorously analyzes how scaling up training datasets affects the prevalence of hateful content and bias in visio-linguistic models. It highlights a critical oversight in generative AI: effort has concentrated predominantly on scaling models, while the consequences of scaling the data have been inadequately examined.
Core Contributions and Findings
- Data Scaling and Hateful Content: The authors investigate the LAION family of datasets, comparing LAION-400M with its successor, LAION-2B-en. Using the Hate Content Rate (HCR) metric, computed with the Pysentimiento model, the paper documents an increase of approximately 12% in hateful content as the dataset grows from 400 million to 2 billion samples. The increase appears across the hateful, targeted, and aggressive content categories (a minimal HCR sketch follows this list).
- Visio-Linguistic Model Bias: The downstream effects of these datasets are examined through their impact on visio-linguistic models such as CLIP. An experiment with the Chicago Face Dataset (CFD) shows that racial biases worsen in models trained on the larger dataset. Notably, the propensity of Black individuals to be associated with derogatory classifications such as "criminal" increased substantially, roughly doubling for Black females and quintupling for Black males, as the dataset expanded (a zero-shot probing sketch follows this list).
- Qualitative and Historical Analysis: The paper situates these findings historically, linking the observed biases to stereotypical and dehumanizing narratives long imposed on marginalized communities, and argues that scaled web-crawled datasets perpetuate and amplify those narratives.
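The HCR measurement can be approximated with a short script. The sketch below uses Pysentimiento's English hate-speech analyzer; the 0.5 probability threshold and the toy caption list are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch: estimating a Hate Content Rate (HCR) over a sample of alt-text captions
# using pysentimiento's English hate-speech analyzer. The 0.5 threshold and the
# toy caption list are illustrative assumptions, not the authors' exact setup.
from pysentimiento import create_analyzer

analyzer = create_analyzer(task="hate_speech", lang="en")

def hate_content_rate(captions, threshold=0.5):
    """Fraction of captions whose 'hateful' probability exceeds the threshold."""
    flagged = 0
    for text in captions:
        result = analyzer.predict(text)
        # probas holds scores for the 'hateful', 'targeted', and 'aggressive' labels
        if result.probas.get("hateful", 0.0) > threshold:
            flagged += 1
    return flagged / max(len(captions), 1)

sample_captions = ["a dog playing in the park", "an example caption from the crawl"]
print(f"HCR on sample: {hate_content_rate(sample_captions):.4f}")
```

The same loop would be run over a large random sample of alt-texts from each dataset, so the rates for LAION-400M and LAION-2B-en can be compared directly.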
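The CFD probing experiment follows the standard zero-shot classification recipe for CLIP. A minimal sketch is below; the checkpoint, the face image path, and the label set are assumptions chosen for illustration, not the authors' precise configuration.

```python
# Sketch: zero-shot classification of a face image with CLIP, in the spirit of the
# paper's Chicago Face Dataset probe. The checkpoint, image path, and label set
# are illustrative assumptions rather than the authors' exact configuration.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

labels = ["a photo of a human being", "a photo of a criminal",
          "a photo of a suspicious person", "a photo of a thief"]

image = Image.open("cfd_face.jpg")  # hypothetical path to one CFD portrait

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image gives the image's similarity to each text prompt
probs = outputs.logits_per_image.softmax(dim=-1).squeeze(0)
for label, p in zip(labels, probs.tolist()):
    print(f"{label}: {p:.3f}")
```

Aggregating these per-image label probabilities by demographic group, for models trained on the 400M versus 2B datasets, is what surfaces the disparities the paper reports.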
Implications
The research raises significant concerns about the current trajectory of AI dataset scaling, suggesting that it might reinforce and spread societal biases rather than mitigate them. It calls for a re-evaluation of dataset curation practices, urging the community to focus not just on size but also on diversity and representational fairness.
Future Directions
The authors advocate for more transparent and accountable dataset curation practices, emphasizing the need for standardized metrics that weigh ethical implications alongside model performance. They suggest that future research investigate alternative linguistic and racial representations in datasets to mitigate bias propagation.
Conclusion
This paper offers a comprehensive critique of the "scale is all you need" paradigm in AI dataset development. By exposing the risks of expanding data size without assessing its quality, Birhane et al. make a crucial contribution to the discourse on ethical AI, urging balanced development that foregrounds equity and accountability.