
A Review of the Challenges with Massive Web-mined Corpora Used in Large Language Models Pre-Training (2407.07630v1)

Published 10 Jul 2024 in cs.CL

Abstract: This article presents a comprehensive review of the challenges associated with using massive web-mined corpora for pre-training LLMs. The review identifies key challenges in this domain, including noise (irrelevant or misleading information), duplicated content, low-quality or incorrect information, biases, and the inclusion of sensitive or personal information in web-mined corpora. Addressing these issues is crucial for the development of accurate, reliable, and ethically responsible LLMs. Through an examination of current methodologies for data cleaning, pre-processing, and bias detection and mitigation, we highlight the gaps in existing approaches and suggest directions for future research. Our discussion aims to catalyze advancements in developing more sophisticated and ethically responsible LLMs.
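One of the challenges the abstract names, duplicated content, is commonly addressed in corpus pipelines by hashing a normalized form of each document and keeping only the first occurrence. The sketch below is a minimal illustration of that general exact-deduplication idea, not a method taken from the paper; the `normalize` and `deduplicate` helpers are hypothetical names chosen for this example.

```python
import hashlib

def normalize(text: str) -> str:
    # Lowercase and collapse whitespace so trivial variants hash identically.
    return " ".join(text.lower().split())

def deduplicate(docs: list[str]) -> list[str]:
    # Keep the first occurrence of each normalized document.
    seen: set[str] = set()
    unique: list[str] = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

docs = ["Hello   World", "hello world", "A different page"]
print(deduplicate(docs))  # → ['Hello   World', 'A different page']
```

Real pipelines typically go further, using fuzzy techniques such as MinHash to catch near-duplicates that exact hashing misses.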

Authors (2)
  1. Michał Perełkiewicz (7 papers)
  2. Rafał Poświata (9 papers)