Toxicity of the Commons: Curating Open-Source Pre-Training Data (2410.22587v2)

Published 29 Oct 2024 in cs.CL

Abstract: Open-source LLMs are becoming increasingly available and popular among researchers and practitioners. While significant progress has been made on open-weight models, open training data is a practice yet to be adopted by the leading open-weight model creators. At the same time, researchers are working to make LLMs safer. We propose a data curation pipeline to reduce harmful outputs by models trained on public domain data. There are unique challenges to working with public domain data, as these sources differ from web text in both form and content. Many sources are historical documents and are the result of Optical Character Recognition (OCR). Consequently, current state-of-the-art approaches to toxicity filtering are often infeasible or inappropriate for open data models. In this paper, we introduce a new fully open-source pipeline for open-data toxicity filtering. Our contributions are threefold. We create a custom training dataset, ToxicCommons, which is composed of texts that have been classified across five different dimensions (racial/origin-based, gender/sex-based, religious, and ability-based discrimination, and violence). We use this dataset to train a custom classifier, Celadon, that can be used to detect toxic content in open data more efficiently and at a larger scale. Finally, we describe a balanced approach to content filtration that optimizes safety filtering with respect to the filtered data available for training.


Summary

  • The paper introduces a custom dataset (ToxicCommons) and a toxicity filtering pipeline that reduces harmful biases in open-source pre-training data.
  • It trains the Celadon classifier, evaluated with weighted accuracy, to detect nuanced toxic content, particularly in historical texts affected by OCR errors.
  • The study offers actionable guidelines for maintaining data utility while ensuring safety, laying the groundwork for ethical AI model development.

Expert Analysis of "Toxicity of the Commons: Curating Open-Source Pre-Training Data"

The paper "Toxicity of the Commons: Curating Open-Source Pre-Training Data" offers a methodological contribution to the emerging field of ethical data curation for open-source LLMs. As researchers strive to develop safer and more transparent AI systems, this paper addresses the often-overlooked dimension of pre-training data openness and safety. The authors focus on the unique challenges posed by using public domain texts, primarily historical documents subjected to Optical Character Recognition (OCR).

Contributions and Methods

The authors present a comprehensive pipeline for toxicity filtering in pre-training datasets. Their process comprises three primary components:

  1. Creation of a Custom Dataset (ToxicCommons): This dataset labels texts across five dimensions of bias: racial/origin-based, gender/sex-based, religious, and ability-based discrimination, as well as violence. The dataset is curated using human annotations to guide an LLM annotation process, balancing scalability and accuracy.
  2. Introduction of the Celadon Classifier: This classifier is trained on the ToxicCommons dataset to detect toxic content efficiently across multiple dimensions. Notably, Celadon is designed to handle the out-of-domain challenges presented by historical texts and the noise introduced by OCR errors.
  3. Synthetic Content Moderation Strategy: The approach differentiates content based on toxicity levels, recommending either preservation, content-warning labeling, or synthetic re-writing for the most egregious texts. This nuanced method seeks to maintain data utility while mitigating harmful content exposure (a minimal sketch of this decision logic follows the list).
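
To make the three-tier filtering decision concrete, here is a minimal sketch of how such logic could be implemented. The five dimension names follow the paper, but the 0-3 score scale, the thresholds, and the function name filtering_decision are illustrative assumptions rather than the authors' published configuration.

```python
# Hypothetical sketch of the three-way filtering decision described above.
# The 0-3 score scale and the thresholds below are assumptions for
# illustration, not values reported in the paper.

DIMENSIONS = ["race_origin", "gender_sex", "religion", "ability", "violence"]

def filtering_decision(scores: dict[str, int]) -> str:
    """Map per-dimension toxicity scores to one of three actions."""
    top = max(scores.get(d, 0) for d in DIMENSIONS)
    total = sum(scores.get(d, 0) for d in DIMENSIONS)
    if top >= 3 or total >= 7:      # assumed cutoff for the most egregious texts
        return "rewrite"            # synthetic re-writing
    if top >= 2:                    # assumed cutoff for flagged content
        return "content_warning"    # preserve, but label with a warning
    return "keep"                   # preserve as-is

# Example: a document flagged only for mild violence is kept with a warning.
print(filtering_decision({"violence": 2, "religion": 0}))
```

In practice, the per-dimension scores would come from a classifier such as Celadon run over each document, with the thresholds tuned against how much training data one can afford to filter out.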

Numerical Results and Claims

The classifier's results are evaluated using metrics beyond traditional accuracy, emphasizing weighted accuracy to account for the skewed distribution of samples across toxicity levels. The authors report a weighted accuracy of 74% for violence detection, suggesting reliable agreement with human annotations when assigning toxicity levels. Furthermore, Celadon improves markedly over existing generic toxicity screens by calibrating its sensitivity to the nuances of historical and public-domain texts.
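
For readers unfamiliar with the metric, the snippet below sketches one common reading of weighted accuracy, the mean of per-class recalls (balanced accuracy), which offsets the heavy skew toward non-toxic samples. The paper's exact weighting scheme is not reproduced here, so this variant is an assumption for illustration.

```python
import numpy as np

def weighted_accuracy(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Mean of per-class recalls, so rare toxicity levels count as much as common ones."""
    classes = np.unique(y_true)
    per_class_recall = [np.mean(y_pred[y_true == c] == c) for c in classes]
    return float(np.mean(per_class_recall))

# Toy example with an imbalanced label distribution (toxicity levels 0-3).
y_true = np.array([0] * 90 + [1] * 5 + [2] * 3 + [3] * 2)
y_pred = np.array([0] * 90 + [0] * 5 + [2] * 3 + [3] * 2)
print(weighted_accuracy(y_true, y_pred))  # 0.75, while plain accuracy would be 0.95
```

Under such a metric, a classifier that predicted "non-toxic" everywhere would score poorly, which is why a 74% figure is more informative than raw accuracy on a skewed dataset.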

Implications and Future Directions

This paper has significant implications for the development of open-data LLMs. By introducing an open-source, replicable methodology, the authors contribute to a foundation for other researchers aiming to balance openness and safety in AI model training. The acknowledgment of historical context enriches the discourse around AI-generated content, promoting a more inclusive and equitable approach to LLM design.

The paper lays groundwork for future exploration of domain-specific toxicity filtering, especially efforts to diversify the linguistic and cultural representation in datasets. The balanced approach encourages further research into dynamic data curation techniques and could serve as a prototype for legislative and ethical standardization in AI development.

In conclusion, "Toxicity of the Commons" advances the field by offering a pragmatic, scalable solution for managing the safety of open-source LLM pre-training data. The proposed pipeline, dataset, and classifier not only improve immediate practices but also invite continued ethical discourse and technological refinement in AI safety. The paper is a crucial step towards more responsible and transparent AI systems aligned with societal values.