Toxicity of the Commons: Curating Open-Source Pre-Training Data (2410.22587v2)
Abstract: Open-source LLMs are increasingly available and popular among researchers and practitioners. While significant progress has been made on open-weight models, open training data is a practice the leading open-weight model creators have yet to adopt. At the same time, researchers are working to make LLMs safer. We propose a data curation pipeline to reduce harmful outputs by models trained on public domain data. Working with public domain data poses unique challenges, as these sources differ from web text in both form and content: many are historical documents produced by Optical Character Recognition (OCR). Consequently, current state-of-the-art approaches to toxicity filtering are often infeasible or inappropriate for open data models. In this paper, we introduce a new fully open-source pipeline for open-data toxicity filtering. Our contributions are threefold. First, we create a custom training dataset, ToxicCommons, composed of texts classified along five dimensions of toxicity (racial/origin-based, gender/sex-based, religious, and ability-based discrimination, and violence). Second, we use this dataset to train a custom classifier, Celadon, which can detect toxic content in open data more efficiently and at larger scale. Finally, we describe a balanced approach to content filtration that optimizes safety filtering with respect to the amount of filtered data that remains available for training.
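The filtering stage described above can be sketched as a simple decision rule over per-document scores in the five toxicity dimensions. This is an illustrative sketch only: the dimension names follow the abstract, but the score scale, threshold values, and function names here are assumptions, not the paper's actual configuration.

```python
# Hypothetical balanced filter over per-dimension toxicity scores.
# Dimension names follow the paper; the 0-3 score scale and both
# thresholds are illustrative assumptions.

DIMENSIONS = ["race_origin", "gender_sex", "religion", "ability", "violence"]

def filter_decision(scores, per_dim_max=2, total_max=4):
    """Keep a document only if no single dimension exceeds per_dim_max
    and the summed score across all dimensions stays at or below total_max."""
    if any(scores[d] > per_dim_max for d in DIMENSIONS):
        return "drop"
    if sum(scores[d] for d in DIMENSIONS) > total_max:
        return "drop"
    return "keep"

# Example: a document with a single extreme dimension is dropped even
# though its total score is low.
decision = filter_decision(
    {"race_origin": 3, "gender_sex": 0, "religion": 0,
     "ability": 0, "violence": 0}
)
```

Two thresholds capture the "balanced" trade-off the abstract alludes to: a per-dimension cap catches concentrated toxicity, while a total cap catches diffuse toxicity, and raising either retains more data for training at the cost of safety.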