CCI3.0-HQ: a large-scale Chinese dataset of high quality designed for pre-training large language models (2410.18505v2)

Published 24 Oct 2024 in cs.CL

Abstract: We present CCI3.0-HQ (https://huggingface.co/datasets/BAAI/CCI3-HQ), a high-quality 500GB subset of the Chinese Corpora Internet 3.0 (CCI3.0)(https://huggingface.co/datasets/BAAI/CCI3-Data), developed using a novel two-stage hybrid filtering pipeline that significantly enhances data quality. To evaluate its effectiveness, we trained a 0.5B parameter model from scratch on 100B tokens across various datasets, achieving superior performance on 10 benchmarks in a zero-shot setting compared to CCI3.0, SkyPile, and WanjuanV1. The high-quality filtering process effectively distills the capabilities of the Qwen2-72B-instruct model into a compact 0.5B model, attaining optimal F1 scores for Chinese web data classification. We believe this open-access dataset will facilitate broader access to high-quality LLMs.
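
The abstract describes the two-stage hybrid filtering only at a high level: a cheap rule-based pass over raw web text, followed by a model-based quality score from a compact 0.5B classifier distilled from Qwen2-72B-instruct labels. The Python sketch below is a minimal, hypothetical illustration of that shape, not the paper's actual pipeline: the specific heuristics, the length and CJK-ratio thresholds, and the stand-in `model_quality_score` function are all assumptions made for illustration.

```python
import re

def rule_based_filter(doc: str) -> bool:
    """Stage 1 (hypothetical rules): cheap heuristics that discard obviously bad text."""
    if len(doc) < 200:                 # too short to carry useful signal
        return False
    cjk = sum('\u4e00' <= ch <= '\u9fff' for ch in doc)
    if cjk / len(doc) < 0.3:           # require mostly-Chinese content
        return False
    if re.search(r'(.)\1{20,}', doc):  # a run of 21+ identical characters -> spam
        return False
    return True

def model_quality_score(doc: str) -> float:
    """Stage 2 placeholder. In the paper, this role is played by a compact 0.5B
    classifier distilled from Qwen2-72B-instruct quality labels; here we fake a
    score from character diversity so the sketch runs standalone."""
    return min(len(set(doc)) / 100.0, 1.0)

def filter_corpus(docs, threshold=0.1):
    """Keep documents that pass the rule stage and clear the model-score threshold."""
    return [d for d in docs
            if rule_based_filter(d) and model_quality_score(d) >= threshold]

if __name__ == "__main__":
    sample = [
        "太短",                                         # fails the length rule
        "好" * 300,                                     # fails the repeated-character rule
        "这是一段关于预训练语料质量过滤的示例文字。" * 20,  # passes both stages
    ]
    kept = filter_corpus(sample)
    print(f"kept {len(kept)} of {len(sample)} documents")
```

In a real setup, the stage-2 score would be the trained classifier's probability for a "high quality" label, and the threshold would be tuned against held-out quality judgments rather than fixed by hand.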

References (30)
  1. Abhimanyu Dubey et al. The Llama 3 herd of models, 2024.
  2. An Yang, Baosong Yang, et al. Qwen2 technical report, 2024.
  3. The Pile: An 800GB dataset of diverse text for language modeling, 2020.
  4. Common Crawl. Common Crawl Corpus. https://commoncrawl.org, 2024.
  5. Qwen Team. Qwen2.5: A party of foundation models, September 2024.
  6. Nemotron-4 340B technical report, 2024.
  7. The FineWeb datasets: Decanting the web for the finest text data at scale, 2024.
  8. Together Computer. RedPajama: An open source recipe to reproduce the LLaMA training dataset, 2023.
  9. WuDaoCorpora: A super large-scale Chinese corpora for pre-training language models. AI Open, 2:65–68, 2021.
  10. Skywork: A more open bilingual foundation model, 2023.
  11. WanJuan: A comprehensive multimodal dataset for advancing English and Chinese large models, 2023.
  12. A. Broder. On the resemblance and containment of documents. In Proceedings of the Compression and Complexity of Sequences 1997, SEQUENCES ’97, page 21, USA, 1997. IEEE Computer Society.
  13. ChineseWebText: Large-scale high-quality Chinese web text extracted with effective evaluation model, 2023.
  14. Efficient memory management for large language model serving with PagedAttention, 2023.
  15. BGE M3-Embedding: Multi-lingual, multi-functionality, multi-granularity text embeddings through self-knowledge distillation, 2024.
  16. StarCoder: may the source be with you!, 2023.
  17. LightEval: A lightweight framework for LLM evaluation, 2023.
  18. C-Eval: A multi-level multi-discipline Chinese evaluation suite for foundation models. In Advances in Neural Information Processing Systems, 2023.
  19. CMMLU: Measuring massive multitask language understanding in Chinese, 2024.
  20. Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge, 2018.
  21. HellaSwag: Can a machine really finish your sentence?, 2019.
  22. The Winograd Schema Challenge. In Proceedings of the Thirteenth International Conference on Principles of Knowledge Representation and Reasoning, KR’12, pages 552–561. AAAI Press, 2012.
  23. Measuring massive multitask language understanding, 2021.
  24. Careful selection of knowledge to solve open book question answering. In Anna Korhonen, David Traum, and Lluís Màrquez, editors, Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6120–6129, Florence, Italy, July 2019. Association for Computational Linguistics.
  25. PIQA: Reasoning about physical commonsense in natural language, 2019.
  26. SocialIQA: Commonsense reasoning about social interactions, 2019.
  27. DataComp-LM: In search of the next generation of training sets for language models, 2024.
  28. Teknium. OpenHermes 2.5: An open dataset of synthetic data for generalist LLM assistants, 2023.
  29. Aquila2 technical report, 2024.
  30. AquilaMoE: Efficient training for MoE models with scale-up and scale-out strategies, 2024.
