An Integrated Data Processing Framework for Pretraining Foundation Models
Abstract: The capability of foundation models relies heavily on large-scale, diverse, and high-quality pretraining data. To improve data quality, researchers and practitioners often have to manually curate datasets from different sources and develop a dedicated data cleansing pipeline for each data repository. Without a unified data processing framework, this process is repetitive and cumbersome. To mitigate this issue, we propose a data processing framework that integrates a Processing Module, which consists of a series of operators at different granularity levels, and an Analyzing Module, which supports probing and evaluation of the refined data. The proposed framework is easy to use and highly flexible. In this demo paper, we first show how to use the framework through example use cases and then demonstrate its effectiveness in improving data quality through an automated evaluation with ChatGPT and an end-to-end evaluation in pretraining the GPT-2 model. The code and demonstration videos are accessible on GitHub.
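To make the two-module design concrete, the following is a minimal sketch, not the framework's actual API. Every name in it (normalize_whitespace, drop_short, exact_dedup, process, analyze) is hypothetical, and the split into document-level and dataset-level operators is an assumed reading of "different granularity levels".

```python
import re
from typing import Callable, Dict, List, Optional

# Hypothetical operator signatures (illustrative only, not the framework's API).
# A document-level operator returns the cleaned text, or None to drop the document.
DocOperator = Callable[[str], Optional[str]]
# A dataset-level operator sees the whole collection at once.
DatasetOperator = Callable[[List[str]], List[str]]


def normalize_whitespace(doc: str) -> Optional[str]:
    """Document-level cleaner: collapse runs of whitespace into single spaces."""
    return re.sub(r"\s+", " ", doc).strip()


def drop_short(min_chars: int) -> DocOperator:
    """Document-level filter: drop documents shorter than min_chars."""
    def op(doc: str) -> Optional[str]:
        return doc if len(doc) >= min_chars else None
    return op


def exact_dedup(docs: List[str]) -> List[str]:
    """Dataset-level operator: remove exact duplicates, keeping first occurrences."""
    seen = set()
    kept = []
    for doc in docs:
        if doc not in seen:
            seen.add(doc)
            kept.append(doc)
    return kept


def process(docs: List[str],
            doc_ops: List[DocOperator],
            dataset_ops: List[DatasetOperator]) -> List[str]:
    """Processing step: apply document-level operators, then dataset-level ones."""
    refined = []
    for doc in docs:
        current: Optional[str] = doc
        for op in doc_ops:
            current = op(current)
            if current is None:  # a filter rejected this document
                break
        if current is not None:
            refined.append(current)
    for dataset_op in dataset_ops:
        refined = dataset_op(refined)
    return refined


def analyze(docs: List[str]) -> Dict[str, float]:
    """Analyzing step: report simple statistics over the refined data."""
    count = len(docs)
    total_chars = sum(len(doc) for doc in docs)
    return {"num_docs": count, "avg_chars": total_chars / count if count else 0.0}


raw = [
    "Hello   world,  this is a test.",
    "Hello world, this is a test.",
    "hi",
    "Another\n\nclean document here.",
]
refined = process(raw, [normalize_whitespace, drop_short(10)], [exact_dedup])
stats = analyze(refined)
print(refined)
print(stats)
```

Representing operators as plain callables keeps the pipeline order explicit and lets the same statistics be computed before and after processing to see what each operator changed.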