CCI3.0-HQ: a large-scale Chinese dataset of high quality designed for pre-training large language models (2410.18505v2)
Abstract: We present CCI3.0-HQ (https://huggingface.co/datasets/BAAI/CCI3-HQ), a high-quality 500GB subset of the Chinese Corpora Internet 3.0 (CCI3.0) (https://huggingface.co/datasets/BAAI/CCI3-Data), developed with a novel two-stage hybrid filtering pipeline that significantly enhances data quality. To evaluate its effectiveness, we trained a 0.5B-parameter model from scratch on 100B tokens drawn from each candidate dataset; the model trained on CCI3.0-HQ achieves superior zero-shot performance across 10 benchmarks compared to models trained on CCI3.0, SkyPile, and WanjuanV1. The quality-filtering process effectively distills the capabilities of the Qwen2-72B-Instruct model into a compact 0.5B classifier, which attains the optimal F1 score for Chinese web data quality classification. We believe this open-access dataset will facilitate broader access to high-quality LLMs.
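As a rough illustration of the filtering step described in the abstract, the sketch below shows how a compact quality classifier, distilled from Qwen2-72B-Instruct annotations, could be used to score Chinese web documents and keep only those above a threshold. This is a minimal sketch under stated assumptions, not the authors' released pipeline: the checkpoint name, the keep/drop threshold, the label index, and the document fields are all illustrative placeholders.

```python
# Hedged sketch of classifier-based quality filtering (not the official CCI3.0-HQ code).
# Assumptions: a hypothetical binary sequence-classification checkpoint whose class 1
# means "high quality", a 0.5 score threshold, and documents shaped as {"text": ...}.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

CLASSIFIER = "BAAI/cci3-hq-quality-classifier"  # hypothetical checkpoint name
THRESHOLD = 0.5                                 # assumed keep/drop cutoff

tokenizer = AutoTokenizer.from_pretrained(CLASSIFIER)
model = AutoModelForSequenceClassification.from_pretrained(CLASSIFIER)
model.eval()


def quality_score(text: str) -> float:
    """Return the classifier's probability that `text` is high-quality training data."""
    inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    # Assumes a binary head where index 1 is the "high quality" class.
    return torch.softmax(logits, dim=-1)[0, 1].item()


def filter_corpus(docs):
    """Yield only documents whose quality score clears the threshold."""
    for doc in docs:
        if quality_score(doc["text"]) >= THRESHOLD:
            yield doc


if __name__ == "__main__":
    sample = [{"text": "这是一段示例中文网页文本。"}, {"text": "click here!!! 广告 广告"}]
    kept = list(filter_corpus(sample))
    print(f"kept {len(kept)} of {len(sample)} documents")
```

In practice such a scorer would run after coarse rule-based cleaning and deduplication, with the threshold tuned against held-out human or LLM quality labels.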
- Abhimanyu Dubey et al. The Llama 3 herd of models, 2024.
- An Yang, Baosong Yang, et al. Qwen2 technical report, 2024.
- The Pile: An 800GB dataset of diverse text for language modeling, 2020.
- Common Crawl. Common Crawl Corpus. https://commoncrawl.org, 2024.
- Qwen Team. Qwen2.5: A party of foundation models, September 2024.
- Nemotron-4 340B technical report, 2024.
- The FineWeb datasets: Decanting the web for the finest text data at scale, 2024.
- Together Computer. RedPajama: An open source recipe to reproduce LLaMA training dataset, 2023.
- WuDaoCorpora: A super large-scale Chinese corpora for pre-training language models. AI Open, 2:65–68, 2021.
- Skywork: A more open bilingual foundation model, 2023.
- WanJuan: A comprehensive multimodal dataset for advancing English and Chinese large models, 2023.
- A. Broder. On the resemblance and containment of documents. In Proceedings of the Compression and Complexity of Sequences 1997, SEQUENCES ’97, page 21, USA, 1997. IEEE Computer Society.
- ChineseWebText: Large-scale high-quality Chinese web text extracted with effective evaluation model, 2023.
- Efficient memory management for large language model serving with PagedAttention, 2023.
- BGE M3-Embedding: Multi-lingual, multi-functionality, multi-granularity text embeddings through self-knowledge distillation, 2024.
- StarCoder: May the source be with you!, 2023.
- LightEval: A lightweight framework for LLM evaluation, 2023.
- C-Eval: A multi-level multi-discipline Chinese evaluation suite for foundation models. In Advances in Neural Information Processing Systems, 2023.
- CMMLU: Measuring massive multitask language understanding in Chinese, 2024.
- Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge, 2018.
- HellaSwag: Can a machine really finish your sentence?, 2019.
- The winograd schema challenge. In Proceedings of the Thirteenth International Conference on Principles of Knowledge Representation and Reasoning, KR’12, page 552–561. AAAI Press, 2012.
- Measuring massive multitask language understanding, 2021.
- Careful selection of knowledge to solve open book question answering. In Anna Korhonen, David Traum, and Lluís Màrquez, editors, Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6120–6129, Florence, Italy, July 2019. Association for Computational Linguistics.
- PIQA: Reasoning about physical commonsense in natural language, 2019.
- SocialIQA: Commonsense reasoning about social interactions, 2019.
- DataComp-LM: In search of the next generation of training sets for language models, 2024.
- Teknium. OpenHermes 2.5: An open dataset of synthetic data for generalist LLM assistants, 2023.
- Aquila2 technical report, 2024.
- AquilaMoE: Efficient training for MoE models with scale-up and scale-out strategies, 2024.