Who's in and who's out? A case study of multimodal CLIP-filtering in DataComp (2405.08209v2)

Published 13 May 2024 in cs.CY, cs.CL, cs.CV, and cs.LG

Abstract: As training datasets are increasingly drawn from unstructured, uncontrolled environments such as the web, researchers and industry practitioners have come to rely on data filtering techniques to "filter out the noise" of web-scraped data. While datasets have been widely shown to reflect the biases and values of their creators, in this paper we contribute to an emerging body of research that assesses the filters used to create these datasets. We show that image-text data filtering also has biases and is value-laden, encoding specific notions of what counts as "high-quality" data. In our work, we audit a standard approach of image-text CLIP-filtering on the academic benchmark DataComp's CommonPool by analyzing discrepancies of filtering through various annotation techniques across multiple modalities of image, text, and website source. We find that data relating to several imputed demographic groups -- such as LGBTQ+ people, older women, and younger men -- are associated with higher rates of exclusion. Moreover, we demonstrate cases of exclusion amplification: not only are certain marginalized groups already underrepresented in the unfiltered data, but CLIP-filtering excludes data from these groups at higher rates. The data-filtering step in the machine learning pipeline can therefore exacerbate representation disparities already present in the data-gathering step, especially when existing filters are designed to optimize a specifically chosen downstream performance metric like zero-shot image classification accuracy. Finally, we show that the NSFW filter fails to remove sexually explicit content from CommonPool, and that CLIP-filtering includes several categories of copyrighted content at high rates. Our conclusions point to a need for fundamental changes in dataset creation and filtering practices.
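
The CLIP-filtering approach audited here retains an image-text pair only if the cosine similarity between its CLIP image and text embeddings clears a threshold (in DataComp-style pipelines, a cutoff chosen to retain some top fraction of the pool by score). Below is a minimal sketch of that scoring-and-filtering step, assuming the Hugging Face transformers CLIP API; the ViT-B/32 checkpoint and the 0.3 cutoff are illustrative placeholders, not the exact configuration studied in the paper.

```python
# Minimal sketch of CLIP-score filtering, assuming the Hugging Face
# `transformers` CLIP API. The checkpoint and the 0.3 cutoff are
# illustrative placeholders, not the configuration audited in the paper.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image: Image.Image, caption: str) -> float:
    """Cosine similarity between the CLIP image and text embeddings."""
    inputs = processor(text=[caption], images=[image],
                       return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        img = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt = model.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])
    # Normalize so the dot product is a cosine similarity.
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    return (img @ txt.T).item()

THRESHOLD = 0.3  # hypothetical cutoff, for illustration only

def keep(image: Image.Image, caption: str) -> bool:
    """Retain a pair only if its CLIP score clears the threshold."""
    return clip_score(image, caption) >= THRESHOLD

def exclusion_rate(group_scores: list[float]) -> float:
    """Fraction of a group's pairs the filter removes."""
    return sum(s < THRESHOLD for s in group_scores) / len(group_scores)
```

Under such a filter, a group's exclusion rate is simply the fraction of its pairs scoring below the cutoff; comparing these rates across imputed demographic groups is how the exclusion amplification the paper reports becomes visible, since a group that is both rare in the unfiltered pool and filtered at an above-average rate is doubly diminished downstream.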

Authors (4)
  1. Rachel Hong
  2. William Agnew
  3. Tadayoshi Kohno
  4. Jamie Morgenstern