Multimodal datasets: misogyny, pornography, and malignant stereotypes (2110.01963v1)

Published 5 Oct 2021 in cs.CY

Abstract: We have now entered the era of trillion parameter machine learning models trained on billion-sized datasets scraped from the internet. The rise of these gargantuan datasets has given rise to formidable bodies of critical work that have called for caution while generating these large datasets. These address concerns surrounding the dubious curation practices used to generate these datasets, the sordid quality of alt-text data available on the world wide web, the problematic content of the CommonCrawl dataset often used as a source for training LLMs, and the entrenched biases in large-scale visio-linguistic models (such as OpenAI's CLIP model) trained on opaque datasets (WebImageText). In the backdrop of these specific calls of caution, we examine the recently released LAION-400M dataset, which is a CLIP-filtered dataset of Image-Alt-text pairs parsed from the Common-Crawl dataset. We found that the dataset contains troublesome and explicit images and text pairs of rape, pornography, malign stereotypes, racist and ethnic slurs, and other extremely problematic content. We outline numerous implications, concerns and downstream harms regarding the current state of large scale datasets while raising open questions for various stakeholders including the AI community, regulators, policy makers and data subjects.

Analysis of Multimodal Datasets: Biases and Content Concerns

The paper "Multimodal datasets: misogyny, pornography, and malignant stereotypes" critically examines the risks embedded in the large-scale datasets assembled for training state-of-the-art AI models. The research focuses primarily on LAION-400M, a dataset of image-alt-text pairs extracted from the CommonCrawl corpus and filtered using OpenAI's CLIP model. The paper identifies serious problems with both the dataset's content and the methodology used to build it, raising concerns for a wide range of AI stakeholders.

Key Findings

The authors present a thorough analysis of the LAION-400M dataset, highlighting the prevalence of offensive and explicit material, including misogynistic and racist imagery and text. Specific examples show that even benign queries frequently return not-safe-for-work (NSFW) results, perpetuating harmful stereotypes. Misogyny is especially prevalent: searches for terms related to womanhood routinely return sexualized imagery.

The paper also questions the efficacy of using CLIP for dataset filtering, citing documented biases in CLIP's own categorization behaviour, such as its misclassification of people from certain racial groups. Moreover, the cosine-similarity threshold used for filtering, though intended to enforce semantic alignment between image and caption, was shown to admit ethnically and culturally offensive pairings into the dataset.
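
The CLIP-based filter the authors scrutinize reduces, in essence, to a cosine-similarity threshold applied to each image/alt-text pair. The sketch below illustrates that mechanism using the Hugging Face transformers CLIP implementation; the checkpoint name and the 0.3 cutoff are illustrative assumptions rather than a reproduction of the exact LAION-400M pipeline.

```python
# Minimal sketch of CLIP-score filtering for image/alt-text pairs.
# Assumes the Hugging Face `transformers` CLIP implementation; the checkpoint
# and threshold are illustrative, not the exact LAION-400M configuration.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
THRESHOLD = 0.3  # pairs scoring below this cosine similarity are discarded

def keep_pair(image: Image.Image, alt_text: str) -> bool:
    """Return True if the image/alt-text pair passes the CLIP-score filter."""
    inputs = processor(text=[alt_text], images=image,
                       return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                          attention_mask=inputs["attention_mask"])
    # Cosine similarity between L2-normalised embeddings
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    score = (img_emb * txt_emb).sum(dim=-1).item()
    return score >= THRESHOLD
```

Because the score measures only how well the caption matches the image, a well-aligned but abusive or pornographic pair passes the filter unimpeded, which is exactly the gap the paper documents.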

Implications and Critiques

Practical Implications

The practical consequences of training models on these datasets are significant. Models risk amplifying societal biases through exposure to the grotesque stereotypes captured in the training images and text. For practitioners, this makes mitigating bias in AI systems trained on such data considerably more difficult.

The paper also questions the adequacy of existing filtering processes, pointing to the asymmetry between the effort invested in gathering data and that invested in detoxifying it. It further underscores the emotional toll on researchers who must handle such distressing content, revealing gaps in the field's consideration of the human costs of dataset curation.

Theoretical Implications

The paper calls for revisiting the "scale beats noise" philosophy commonly used to justify curating massive datasets without stringent quality checks. The authors stress that ignoring the qualitative aspects of data at this scale risks embedding problematic ideologies in AI models and thereby reinforcing systemic discrimination. This critique matters both for AI ethics discussions and for the development of more robust dataset collection methodologies.

Recommendations for Future Developments

  1. Improved Filtering Protocols: Develop and deploy more advanced automated and manual filtering techniques that better account for content sensitivities and societal norms.
  2. Bias Audits and Corrections: Regularly conduct bias audits on both datasets and trained models, and employ corrective measures to mitigate identified biases.
  3. Data Collection Transparency: Establish transparent criteria for data collection to better align with ethical considerations, possibly incorporating consent and data ownership semantics.
  4. Multimodal Research: Advance research on the interaction between modalities, examining how biases in one modality propagate to or are exacerbated in another, particularly in joint embeddings for text and images (a toy probe of such associations is sketched below).
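
As a concrete illustration of the kind of audit that recommendations 2 and 4 gesture at, the snippet below probes associations between hypothetical occupation prompts and gendered prompts in CLIP's embedding space. The prompts, checkpoint, and the use of a raw similarity matrix as an audit signal are illustrative assumptions, not a method taken from the paper.

```python
# Toy association probe in CLIP's shared embedding space (illustrative only).
# A fuller audit would replace the attribute prompts with image embeddings
# (via get_image_features) and cover many more targets and demographic axes.
import torch
from transformers import CLIPModel, CLIPTokenizer

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

def embed(prompts):
    """L2-normalised CLIP text embeddings for a list of prompts."""
    tokens = tokenizer(prompts, return_tensors="pt", padding=True)
    with torch.no_grad():
        emb = model.get_text_features(**tokens)
    return emb / emb.norm(dim=-1, keepdim=True)

targets = embed(["a photo of a doctor", "a photo of a nurse"])   # neutral targets
attributes = embed(["a photo of a man", "a photo of a woman"])   # attribute poles

# Rows = targets, columns = attributes. Systematic asymmetries in this matrix
# are one signal a bias audit would flag and track across dataset/model versions.
print(targets @ attributes.T)
```

Because the text and image encoders share this embedding space, associations measured on the text side translate directly into which images a prompt retrieves, which is how biases in one modality surface in the other.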

Conclusion

The paper offers a critical perspective on the pitfalls associated with scaling AI systems using improperly curated multimodal datasets. It highlights the urgent need for better data governance frameworks and ethical norms that prioritize transparency, fairness, and accountability in AI development. This research serves as an essential reminder for the AI community to balance technological ambition with societal responsibility.

Authors (3)
  1. Abeba Birhane (24 papers)
  2. Vinay Uday Prabhu (13 papers)
  3. Emmanuel Kahembwe (7 papers)
Citations (296)