Towards Safer Pretraining: Analyzing and Filtering Harmful Content in Webscale datasets for Responsible LLMs (2505.02009v2)

Published 4 May 2025 in cs.CL and cs.LG

Abstract: LLMs have become integral to various real-world applications, leveraging massive, web-sourced datasets like Common Crawl, C4, and FineWeb for pretraining. While these datasets provide linguistic data essential for high-quality natural language generation, they often contain harmful content, such as hate speech, misinformation, and biased narratives. Training LLMs on such unfiltered data risks perpetuating toxic behaviors, spreading misinformation, and amplifying societal biases which can undermine trust in LLM-driven applications and raise ethical concerns about their use. This paper presents a large-scale analysis of inappropriate content across these datasets, offering a comprehensive taxonomy that categorizes harmful webpages into Topical and Toxic based on their intent. We also introduce a prompt evaluation dataset, a high-accuracy Topical and Toxic Prompt (TTP), and a transformer-based model (HarmFormer) for harmful content filtering. Additionally, we create a new multi-harm open-ended toxicity benchmark (HAVOC) and provide crucial insights into how models respond to adversarial toxic inputs. We share TTP, TTP-Eval, HAVOC and a sample of C4 inferenced on HarmFormer. Our work offers insights into ensuring safer LLM pretraining and serves as a resource for Responsible AI (RAI) compliance.

Towards Safer Pretraining: Analyzing and Filtering Harmful Content in Webscale Datasets for Responsible LLMs

This paper addresses a pressing issue in the development and deployment of LLMs: the presence of harmful content in the web-scale datasets used for pretraining. Datasets such as Common Crawl, C4, and FineWeb have become essential training resources, yet they frequently contain toxic material, including hate speech, misinformation, and explicit content. Mendu et al. analyze and systematically categorize this content at scale and propose a framework for mitigating its influence on LLM outputs.

The paper introduces a three-tier taxonomy that categorizes content as Safe, Topical, or Toxic across five harm classes: Hate Content & Violence, Ideological Harm, Sexual Harm, Illegal Activity Harm, and Self-Inflicted Harm. Unlike traditional binary classification schemes, this taxonomy distinguishes harmful intent from neutral but topically relevant discourse (for example, a news report about violence versus a call to violence), enabling the nuanced content moderation needed for ethical model pretraining and, in turn, safer and more responsible AI models.
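
One way to picture the taxonomy is as a per-page annotation carrying one intent label per harm class. The following is a minimal sketch of such a labeling scheme; the identifier names and the filtering rule are illustrative assumptions, not the paper's released artifacts:

```python
from dataclasses import dataclass
from enum import Enum


class HarmClass(Enum):
    """The five harm dimensions described in the paper."""
    HATE_VIOLENCE = "hate_content_and_violence"
    IDEOLOGICAL = "ideological_harm"
    SEXUAL = "sexual_harm"
    ILLEGAL_ACTIVITY = "illegal_activity_harm"
    SELF_INFLICTED = "self_inflicted_harm"


class Intent(Enum):
    """Three-tier labels: Safe, Topical (neutral discussion), Toxic (harmful intent)."""
    SAFE = 0
    TOPICAL = 1
    TOXIC = 2


@dataclass
class PageLabel:
    """Hypothetical per-webpage annotation: one intent label per harm class."""
    url: str
    labels: dict[HarmClass, Intent]

    def should_filter(self) -> bool:
        # Only pages showing harmful intent in some dimension are dropped;
        # Topical pages (e.g. news coverage of violence) are retained.
        return any(intent is Intent.TOXIC for intent in self.labels.values())
```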

At the core of this research are the Topical and Toxic Prompt (TTP), a high-accuracy prompt for identifying harmful content, its accompanying evaluation set (TTP-Eval), and HarmFormer, a transformer-based model for filtering harmful content at scale. TTP-Eval quantifies the accuracy of these filtering mechanisms, and HarmFormer is validated against existing moderation tools, showing significant improvements particularly on long-form text, an area where conventional systems falter.
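
The authors' HarmFormer is not reproduced here; as a rough illustration of the idea, the sketch below applies a long-context transformer classifier with Safe/Topical/Toxic outputs to whole webpages via Hugging Face Transformers. The checkpoint name, sequence length, and label order are assumptions rather than the released model:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

CHECKPOINT = "your-org/harm-classifier-longcontext"  # hypothetical fine-tuned checkpoint
LABELS = ["safe", "topical", "toxic"]                # assumed label order

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForSequenceClassification.from_pretrained(CHECKPOINT, num_labels=len(LABELS))
model.eval()


def classify_page(text: str) -> dict[str, float]:
    """Score one (possibly long) webpage; a long-context encoder avoids truncating most of it."""
    inputs = tokenizer(text, truncation=True, max_length=4096, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = torch.softmax(logits, dim=-1).squeeze(0)
    return {label: probs[i].item() for i, label in enumerate(LABELS)}
```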

To evaluate model outputs under adversarially toxic inputs, the paper introduces HAVOC, a multi-harm benchmark for open-ended toxicity assessment of LLM responses. The benchmark shows that a substantial proportion of outputs from state-of-the-art models exhibit harmful intent even when the input is contextually neutral, underscoring vulnerabilities in current toxicity management frameworks.
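
As a hedged illustration of how a HAVOC-style leakage measurement could work, the sketch below feeds lead-in snippets to a generator, lets it continue them open-endedly, and scores each continuation with the classify_page() sketch above. The prompts, the stand-in generator (gpt2), and the 0.5 threshold are illustrative, not the benchmark's actual contents or protocol:

```python
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")  # stand-in for the model under test

prompts = [
    "The rally turned violent after",    # topical lead-in
    "Those people deserve nothing but",  # adversarial lead-in
]

leaked = 0
for prompt in prompts:
    out = generator(prompt, max_new_tokens=64, do_sample=True)[0]["generated_text"]
    continuation = out[len(prompt):]                  # score only the model's continuation
    if classify_page(continuation)["toxic"] > 0.5:    # illustrative threshold
        leaked += 1

print(f"Toxicity leakage rate: {leaked / len(prompts):.1%}")
```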

Numerically, the paper finds that toxic content makes up 2.11% to 4.1% of the analyzed web corpora, depending on the dataset, a nontrivial share that underscores the need for stronger curation methodologies. The HAVOC benchmark further shows that 26.7% of model generations can leak toxicity when provoked with harmful content, indicating gaps in current pretraining safety measures.

Filtering pretraining data with the taxonomy-driven HarmFormer yields an 18% reduction in toxic content generation without compromising performance on topical tasks. This demonstrates the effectiveness of taxonomy-driven data curation for Responsible AI (RAI) compliance and provides a foundation for advancing LLM safety in future applications.
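
Applied to a pretraining corpus, such a classifier can act as a streaming filter that drops only pages with harmful intent while keeping Topical ones. The sketch below illustrates this over a C4 shard with the Hugging Face datasets library; the threshold and the reuse of classify_page() from the earlier sketch are assumptions:

```python
# Illustrative filtering pass over a pretraining shard: drop documents the
# classifier scores as Toxic, keep Safe and Topical ones.
from datasets import load_dataset

shard = load_dataset("allenai/c4", "en", split="train", streaming=True)


def keep(example: dict) -> bool:
    scores = classify_page(example["text"])  # classify_page() from the sketch above
    return scores["toxic"] < 0.5             # illustrative threshold: Topical pages survive


clean_shard = shard.filter(keep)
```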

The paper envisages future AI models trained on curated datasets that emphasize responsible content engagement. By openly sharing TTP, TTP-Eval, HAVOC, and a sample of C4 annotated with HarmFormer, the authors enable practitioners to reproduce and extend these studies. The work reflects Microsoft's commitment to ethical AI deployment and encourages further research into safer, more responsible AI systems.

In conclusion, the work presented by Mendu et al. is a critical resource for researchers and developers in AI, offering advanced tools and insights to responsibly manage the risks associated with deploying LLMs in sensitive contexts.

Authors (5)
  1. Sai Krishna Mendu (2 papers)
  2. Harish Yenala (1 paper)
  3. Aditi Gulati (1 paper)
  4. Shanu Kumar (14 papers)
  5. Parag Agrawal (11 papers)