Toxicity of the Commons: Curating Open-Source Pre-Training Data (2410.22587v2)

Published 29 Oct 2024 in cs.CL

Abstract: Open-source LLMs are becoming increasingly available and popular among researchers and practitioners. While significant progress has been made on open-weight models, open training data is a practice yet to be adopted by the leading open-weight model creators. At the same time, researchers are working to make LLMs safer. We propose a data curation pipeline to reduce harmful outputs by models trained on public domain data. There are unique challenges to working with public domain data, as these sources differ from web text in both form and content. Many sources are historical documents and are the result of Optical Character Recognition (OCR). Consequently, current state-of-the-art approaches to toxicity filtering are often infeasible or inappropriate for open data models. In this paper, we introduce a new fully open-source pipeline for open-data toxicity filtering. Our contributions are threefold. We create a custom training dataset, ToxicCommons, which is composed of texts that have been classified across five different dimensions (racial/origin-based, gender/sex-based, religious, and ability-based discrimination, and violence). We use this dataset to train a custom classifier, Celadon, that can be used to detect toxic content in open data more efficiently and at a larger scale. Finally, we describe a balanced approach to content filtration that optimizes safety filtering with respect to the filtered data available for training.


Summary

  • The paper introduces a custom dataset (ToxicCommons) and a toxicity filtering pipeline that reduces harmful biases in open-source pre-training data.
  • It trains the Celadon classifier, evaluated with weighted accuracy, to detect nuanced toxic content, particularly in historical texts affected by OCR errors.
  • The study offers actionable guidelines for maintaining data utility while ensuring safety, laying the groundwork for ethical AI model development.

Expert Analysis of "Toxicity of the Commons: Curating Open-Source Pre-Training Data"

The paper "Toxicity of the Commons: Curating Open-Source Pre-Training Data" offers a methodological contribution to the emerging field of ethical data curation for open-source LLMs. As researchers strive to develop safer and more transparent AI systems, this paper addresses the often-overlooked dimension of pre-training data openness and safety. The authors focus on the unique challenges posed by using public domain texts, primarily historical documents subjected to Optical Character Recognition (OCR).

Contributions and Methods

The authors present a comprehensive pipeline for toxicity filtering in pre-training datasets. Their process comprises three primary components:

  1. Creation of a Custom Dataset (ToxicCommons): This dataset labels texts across five dimensions of bias: racial/origin-based, gender/sex-based, religious, and ability-based discrimination, as well as violence. The dataset is curated using human annotations to guide an LLM annotation process, balancing scalability and accuracy.
  2. Introduction of the Celadon Classifier: This classifier is trained on the ToxicCommons dataset to detect toxic content efficiently across multiple dimensions. Notably, Celadon is designed to handle the out-of-domain challenges presented by historical texts and the noise introduced by OCR errors.
  3. Synthetic Content Moderation Strategy: The approach differentiates content based on toxicity levels, recommending either preservation, content-warning labeling, or synthetic re-writing for the most egregious texts. This nuanced method seeks to maintain data utility while mitigating harmful content exposure (a minimal sketch of this decision logic follows the list).
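
To make the three-tier filtering decision concrete, here is a minimal sketch of how such logic could be implemented. The five dimension names follow the paper, but the 0-3 score scale, the thresholds, and the function name filtering_decision are illustrative assumptions rather than the authors' published configuration.

```python
# Hypothetical sketch of the three-way filtering decision described above.
# The 0-3 score scale and the thresholds below are assumptions for
# illustration, not values reported in the paper.

DIMENSIONS = ["race_origin", "gender_sex", "religion", "ability", "violence"]

def filtering_decision(scores: dict[str, int]) -> str:
    """Map per-dimension toxicity scores to one of three actions."""
    top = max(scores.get(d, 0) for d in DIMENSIONS)
    total = sum(scores.get(d, 0) for d in DIMENSIONS)
    if top >= 3 or total >= 7:      # assumed cutoff for the most egregious texts
        return "rewrite"            # synthetic re-writing
    if top >= 2:                    # assumed cutoff for flagged content
        return "content_warning"    # preserve, but label with a warning
    return "keep"                   # preserve as-is

# Example: a document flagged only for mild violence is kept with a warning.
print(filtering_decision({"violence": 2, "religion": 0}))
```

In practice, the per-dimension scores would come from a classifier such as Celadon run over each document, with the thresholds tuned against how much training data one can afford to filter out.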

Numerical Results and Claims

The classifier's results are evaluated using metrics beyond traditional accuracy, emphasizing weighted accuracy to account for the skewed distribution of samples across toxicity levels. The authors report a weighted accuracy of 74% for violence detection, suggesting reliable agreement with human annotations when assigning toxicity levels. Furthermore, Celadon improves markedly over existing generic toxicity screens by calibrating its sensitivity to the nuances of historical and public-domain texts.
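
For readers unfamiliar with the metric, the snippet below sketches one common reading of weighted accuracy, the mean of per-class recalls (balanced accuracy), which offsets the heavy skew toward non-toxic samples. The paper's exact weighting scheme is not reproduced here, so this variant is an assumption for illustration.

```python
import numpy as np

def weighted_accuracy(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Mean of per-class recalls, so rare toxicity levels count as much as common ones."""
    classes = np.unique(y_true)
    per_class_recall = [np.mean(y_pred[y_true == c] == c) for c in classes]
    return float(np.mean(per_class_recall))

# Toy example with an imbalanced label distribution (toxicity levels 0-3).
y_true = np.array([0] * 90 + [1] * 5 + [2] * 3 + [3] * 2)
y_pred = np.array([0] * 90 + [0] * 5 + [2] * 3 + [3] * 2)
print(weighted_accuracy(y_true, y_pred))  # 0.75, while plain accuracy would be 0.95
```

Under such a metric, a classifier that predicted "non-toxic" everywhere would score poorly, which is why a 74% figure is more informative than raw accuracy on a skewed dataset.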

Implications and Future Directions

This paper has significant implications for the development of open-data LLMs. By introducing an open-source, replicable methodology, the authors contribute to a foundation for other researchers aiming to balance openness and safety in AI model training. The acknowledgment of historical context enriches the discourse around AI-generated content, promoting a more inclusive and equitable approach to LLM design.

The paper lays groundwork for future exploration of domain-specific toxicity filtering, especially efforts to diversify the linguistic and cultural representation in datasets. The balanced approach encourages further research into dynamic data curation techniques and could serve as a prototype for legislative and ethical standardization in AI development.

In conclusion, "Toxicity of the Commons" advances the field by offering a pragmatic, scalable solution for managing the safety of open-source LLM pre-training data. The proposed pipeline, dataset, and classifier not only improve immediate practices but also invite continued ethical discourse and technological refinement in AI safety. The paper is a crucial step towards more responsible and transparent AI systems aligned with societal values.