
An Integrated Data Processing Framework for Pretraining Foundation Models

Published 26 Feb 2024 in cs.LG, cs.CL, and cs.IR | (2402.16358v2)

Abstract: The capabilities of foundation models rely heavily on large-scale, diverse, and high-quality pretraining data. To improve data quality, researchers and practitioners often have to manually curate datasets from different sources and develop a dedicated data cleansing pipeline for each data repository. Without a unified data processing framework, this process is repetitive and cumbersome. To mitigate this issue, we propose a data processing framework that integrates a Processing Module, which consists of a series of operators at different granularity levels, and an Analyzing Module, which supports probing and evaluation of the refined data. The proposed framework is easy to use and highly flexible. In this demo paper, we first introduce how to use the framework through example use cases, and then demonstrate its effectiveness in improving data quality with an automated evaluation using ChatGPT and an end-to-end evaluation in pretraining the GPT-2 model. The code and demonstration videos are accessible on GitHub.
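As a rough illustration of the two-module design described in the abstract, the following minimal Python sketch chains operators at two granularity levels (per-document cleaning and filtering, then dataset-level deduplication) and computes simple statistics on the refined data. All names here (strip_whitespace, min_words, exact_dedup, process, analyze) are hypothetical and chosen for illustration; they are not the framework's actual API, and exact-match deduplication stands in for whatever deduplication method the framework uses.

```python
from typing import Callable, Dict, List

# NOTE: every name below is a hypothetical stand-in, not the paper's API.

def strip_whitespace(doc: str) -> str:
    """Document-level cleaner: collapse runs of whitespace into single spaces."""
    return " ".join(doc.split())

def min_words(n: int) -> Callable[[str], bool]:
    """Document-level filter factory: keep documents with at least n words."""
    def keep(doc: str) -> bool:
        return len(doc.split()) >= n
    return keep

def exact_dedup(docs: List[str]) -> List[str]:
    """Dataset-level operator: drop exact duplicates, preserving first-seen order."""
    seen = set()
    unique = []
    for doc in docs:
        if doc not in seen:
            seen.add(doc)
            unique.append(doc)
    return unique

def process(
    docs: List[str],
    cleaners: List[Callable[[str], str]],
    filters: List[Callable[[str], bool]],
    dataset_ops: List[Callable[[List[str]], List[str]]],
) -> List[str]:
    """Processing step: apply cleaners, then filters, then dataset-level operators."""
    for clean in cleaners:
        docs = [clean(d) for d in docs]
    for keep in filters:
        docs = [d for d in docs if keep(d)]
    for op in dataset_ops:
        docs = op(docs)
    return docs

def analyze(docs: List[str]) -> Dict[str, float]:
    """Analyzing step: report document count and mean word count."""
    counts = [len(d.split()) for d in docs]
    mean = sum(counts) / len(counts) if counts else 0.0
    return {"num_docs": len(docs), "mean_words": mean}

raw = ["hello   world  again", "hello world again", "hi", "a b c d"]
refined = process(raw, [strip_whitespace], [min_words(3)], [exact_dedup])
stats = analyze(refined)
```

Running the same analysis on the raw and the refined data gives a quick before-and-after comparison, which is the kind of probing an analysis module is meant to support.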
