WanJuan-CC: A Safe and High-Quality Open-sourced English Webtext Dataset (2402.19282v6)
Abstract: This paper presents WanJuan-CC, a safe and high-quality open-sourced English webtext dataset derived from Common Crawl data. The study addresses the challenges of constructing large-scale pre-training datasets for LLMs, which require vast amounts of high-quality data. A comprehensive pipeline was designed to handle Common Crawl data, comprising extraction, heuristic rule filtering, fuzzy deduplication, content safety filtering, and data quality filtering. From approximately 68 billion original English documents, we obtained 2.22T tokens of safe data and selected 1.0T tokens of high-quality data as part of WanJuan-CC. We have open-sourced 100B tokens from this dataset. The paper also provides statistical information on data quality, enabling users to select appropriate data according to their needs. To evaluate the quality and utility of the dataset, we trained 1B-parameter and 3B-parameter models on WanJuan-CC and on another dataset, RefinedWeb. Results show that models trained on WanJuan-CC perform better on validation datasets and downstream tasks.
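The fuzzy-deduplication stage of the pipeline is commonly implemented with MinHash signatures and locality-sensitive hashing (see the Broder and datasketch references below). The following is a minimal sketch of that step in Python using the `datasketch` library; the function names and the shingle size, permutation count, and similarity threshold are illustrative assumptions, not the actual parameters used for WanJuan-CC.

```python
# Minimal sketch of MinHash/LSH fuzzy deduplication (Broder, 1997)
# using the open-source `datasketch` library. Parameters are illustrative.
from datasketch import MinHash, MinHashLSH

def shingles(text: str, n: int = 5):
    """Yield word n-grams (shingles) from a document."""
    words = text.lower().split()
    for i in range(max(len(words) - n + 1, 1)):
        yield " ".join(words[i:i + n])

def minhash(text: str, num_perm: int = 128) -> MinHash:
    """Build a MinHash signature over a document's shingles."""
    m = MinHash(num_perm=num_perm)
    for s in shingles(text):
        m.update(s.encode("utf-8"))
    return m

def deduplicate(docs: dict, threshold: float = 0.8) -> list:
    """Return IDs of documents kept after near-duplicate removal."""
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    kept = []
    for doc_id, text in docs.items():
        sig = minhash(text)
        if lsh.query(sig):      # a near-duplicate is already indexed
            continue
        lsh.insert(doc_id, sig)
        kept.append(doc_id)
    return kept

# Example usage:
# docs = {"a": "the quick brown fox jumps over the lazy dog",
#         "b": "the quick brown fox jumps over the lazy dog today"}
# print(deduplicate(docs))
```

In practice, this step runs over billions of documents, so the signatures are typically computed in a distributed map stage and the LSH index is sharded; the single-process loop above only illustrates the logic.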
- Language models are few-shot learners. In Hugo Larochelle, Marc’Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin, editors, Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020.
- BLOOM: A 176B-parameter open-access multilingual language model. CoRR, abs/2211.05100, 2022.
- LLaMA: Open and efficient foundation language models. CoRR, abs/2302.13971, 2023.
- PaLM: Scaling language modeling with pathways. J. Mach. Learn. Res., 24:240:1–240:113, 2023.
- The RefinedWeb dataset for Falcon LLM: Outperforming curated corpora with web data, and web data only. CoRR, abs/2306.01116, 2023.
- OpenAI. GPT-4 technical report. CoRR, abs/2303.08774, 2023.
- InternLM Team. InternLM: A multilingual language model with progressively enhanced capabilities. https://github.com/InternLM/InternLM, 2023.
- ChatGLM-6B fine-tuning for cultural and creative products advertising words. In 2023 International Conference on Culture-Oriented Science and Technology (CoST), pages 291–295. IEEE, 2023.
- Baichuan 2: Open large-scale language models. CoRR, abs/2309.10305, 2023.
- Training compute-optimal large language models. CoRR, abs/2203.15556, 2022.
- Common Crawl - open repository of web crawl data, 2023.
- Asynchronous pipeline for processing huge corpora on medium to low resource infrastructures. In 7th Workshop on the Challenges in the Management of Large Corpora (CMLC-7). Leibniz-Institut für Deutsche Sprache, 2019.
- Together Computer. RedPajama: An open dataset for training large language models, October 2023.
- Dolma: An open corpus of 3 trillion tokens for language model pretraining research. Allen Institute for AI, Tech. Rep., 2023.
- The BigScience ROOTS corpus: A 1.6TB composite multilingual dataset. In Sanmi Koyejo, S. Mohamed, A. Agarwal, Danielle Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022, 2022.
- Toxic comment classification challenge, 2017.
- WuDaoCorpora: A super large-scale Chinese corpora for pre-training language models. AI Open, 2:65–68, 2021.
- Adrien Barbaresi. Trafilatura: A web scraping library and command-line tool for text discovery and extraction. In Heng Ji, Jong C. Park, and Rui Xia, editors, Proceedings of the Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL 2021 - System Demonstrations, Online, August 1-6, 2021, pages 122–131. Association for Computational Linguistics, 2021.
- Scaling language models: Methods, analysis & insights from training Gopher. CoRR, abs/2112.11446, 2021.
- Documenting large webtext corpora: A case study on the Colossal Clean Crawled Corpus. In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih, editors, Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 7-11 November, 2021, pages 1286–1305. Association for Computational Linguistics, 2021.
- Andrei Z. Broder. On the resemblance and containment of documents. In Bruno Carpentieri, Alfredo De Santis, Ugo Vaccaro, and James A. Storer, editors, Compression and Complexity of SEQUENCES 1997, Positano, Amalfitan Coast, Salerno, Italy, June 11-13, 1997, Proceedings, pages 21–29. IEEE, 1997.
- ekzhu/datasketch: First stable release, February 2017.
- D4: Improving LLM pretraining via document de-duplication and diversification. CoRR, abs/2308.12284, 2023.
- The Pile: An 800GB dataset of diverse text for language modeling. CoRR, abs/2101.00027, 2021.
- TinyStories: How small can language models be and still speak coherent English? CoRR, abs/2305.07759, 2023.