Rho-1: Not All Tokens Are What You Need (2404.07965v3)

Published 11 Apr 2024 in cs.CL and cs.AI

Abstract: Previous language model pre-training methods have uniformly applied a next-token prediction loss to all training tokens. Challenging this norm, we posit that "Not all tokens in a corpus are equally important for language model training". Our initial analysis examines token-level training dynamics of language models, revealing distinct loss patterns for different tokens. Leveraging these insights, we introduce a new language model called Rho-1. Unlike traditional LMs that learn to predict every next token in a corpus, Rho-1 employs Selective Language Modeling (SLM), which selectively trains on useful tokens that align with the desired distribution. This approach involves scoring pretraining tokens using a reference model, and then training the language model with a focused loss on tokens with higher scores. When continually pretrained on the 15B-token OpenWebMath corpus, Rho-1 yields an absolute improvement in few-shot accuracy of up to 30% across 9 math tasks. After fine-tuning, Rho-1-1B and 7B achieve state-of-the-art results of 40.6% and 51.8% on the MATH dataset, respectively, matching DeepSeekMath with only 3% of the pretraining tokens. Furthermore, when continually pretrained on 80B general tokens, Rho-1 achieves a 6.8% average enhancement across 15 diverse tasks, increasing both the efficiency and performance of language model pre-training.

Authors (11)
  1. Zhenghao Lin
  2. Zhibin Gou
  3. Yeyun Gong
  4. Xiao Liu
  5. Yelong Shen
  6. Ruochen Xu
  7. Chen Lin
  8. Yujiu Yang
  9. Jian Jiao
  10. Nan Duan
  11. Weizhu Chen

Summary

  • The paper introduces Selective Language Modeling (SLM), showing that focusing the training loss on tokens with high excess loss relative to a reference model improves training efficiency.
  • Empirical results show up to a 30% absolute few-shot accuracy gain on math tasks, with state-of-the-art performance on the MATH dataset.
  • The study underscores resource-efficient training and a nuanced understanding of token-level dynamics, paving the way for adaptive training strategies.

Introducing Rho-1: Advancing Efficiency in LLM Training with Selective Language Modeling

Overview of Selective Language Modeling (SLM)

This research centers on Selective Language Modeling (SLM), a methodology that departs from the traditional practice of treating every token in the training corpus as equally important. The paper argues that not all tokens contribute equally to effective LLM training. SLM uses a reference model to score pretraining tokens by their utility and alignment with the desired distribution, then concentrates the training loss on tokens with higher excess loss, i.e., tokens on which the training model still does markedly worse than the reference model. This marks a strategic shift toward more efficient, targeted pre-training.
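To make the selection mechanism concrete, here is a minimal PyTorch sketch of an excess-loss filter of this kind. It assumes per-token logits from the training model and from a frozen reference model (computed under torch.no_grad()) are already available; the function and parameter names (selective_lm_loss, keep_ratio) are illustrative, not taken from the Rho-1 codebase.

```python
import torch
import torch.nn.functional as F

def selective_lm_loss(logits, ref_logits, input_ids, keep_ratio=0.6):
    """Minimal sketch of Selective Language Modeling (SLM).

    Tokens are scored by their excess loss (training-model loss minus
    reference-model loss), and only the top `keep_ratio` fraction of tokens
    contributes to the training loss. Names and the keep ratio are
    illustrative choices.
    """
    # Shift so that the logits at position t predict the token at t+1.
    logits = logits[:, :-1, :]
    ref_logits = ref_logits[:, :-1, :]          # assumed computed with no_grad
    targets = input_ids[:, 1:]

    # Per-token cross-entropy under the training and reference models.
    ce = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)), targets.reshape(-1), reduction="none"
    )
    ref_ce = F.cross_entropy(
        ref_logits.reshape(-1, ref_logits.size(-1)), targets.reshape(-1), reduction="none"
    )

    # Excess loss: how much worse the training model is than the reference.
    excess = ce - ref_ce

    # Keep only the top-k% of tokens by excess loss.
    k = max(1, int(keep_ratio * excess.numel()))
    selected = torch.topk(excess, k).indices
    mask = torch.zeros_like(ce)
    mask[selected] = 1.0

    # Average the next-token loss over the selected tokens only.
    return (ce * mask).sum() / mask.sum()
```

Because the reference model is trained on a small, high-quality curated corpus, a large excess loss flags tokens that match the target distribution but that the training model has not yet learned, which is exactly where the focused loss is spent.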

Empirical Validation and Results

Empirical results substantiate the efficacy of SLM, showing significant improvements across a range of tasks. When applied to the mathematical domain through continual pretraining on the 15B-token OpenWebMath corpus, the Rho-1 models achieve an absolute few-shot accuracy improvement of up to 30% across nine math tasks. After fine-tuning, the 1B and 7B Rho-1 models reach state-of-the-art accuracies of 40.6% and 51.8% on the MATH dataset, respectively, matching DeepSeekMath while using only about 3% of its pretraining tokens. In general-domain pretraining on 80 billion tokens, Rho-1 delivers a 6.8% average improvement across fifteen diverse tasks. These results reinforce the premise that SLM not only enhances model performance but also yields a more resource-efficient training procedure.

Implications and Theoretical Contributions

SLM's methodological contributions extend beyond the empirical results, carrying several theoretical and practical implications:

  • Efficiency in Training: By pinpointing and prioritizing tokens that are pivotal for model learning, SLM conserves computational resources and accelerates the training cycle.
  • Token Dynamics Understanding: The differentiation between "easy" and "hard" tokens introduces a nuanced understanding of token-level learning dynamics, providing insight into how models interact with different subsets of the data during training (see the sketch after this list).
  • Strategic Data Utilization: SLM embodies a strategic approach to data utilization, ensuring that training efforts are concentrated on data segments that promise the greatest returns in model performance.
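As a hedged illustration of how such token-level dynamics might be inspected, the sketch below buckets tokens by whether their loss is high (H) or low (L) at an early and a late checkpoint, mirroring the H→H, L→H, H→L, and L→L trajectories the paper analyzes. The median threshold and the helper name categorize_tokens are illustrative choices, not the paper's exact procedure.

```python
import numpy as np

def categorize_tokens(early_loss, late_loss, threshold=None):
    """Bucket tokens by their loss trajectory between two checkpoints.

    `early_loss` and `late_loss` are per-token cross-entropy losses from an
    early and a late checkpoint. Each token is labeled H (high) or L (low)
    at both checkpoints, yielding the four trajectories H->H, L->H, H->L,
    and L->L. The median threshold is an illustrative choice.
    """
    early_loss = np.asarray(early_loss)
    late_loss = np.asarray(late_loss)
    if threshold is None:
        threshold = np.median(early_loss)

    early_high = early_loss > threshold
    late_high = late_loss > threshold

    labels = np.empty(len(early_loss), dtype=object)
    labels[early_high & late_high] = "H->H"    # persistently hard tokens
    labels[~early_high & late_high] = "L->H"   # tokens that get harder
    labels[early_high & ~late_high] = "H->L"   # tokens the model learns
    labels[~early_high & ~late_high] = "L->L"  # consistently easy tokens
    return labels
```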

Future Directions

The promising results of SLM open avenues for further exploration and refinement. Future work could delve into the optimization of token selection criteria, exploring dynamic or adaptive mechanisms that evolve with the model's learning trajectory. Moreover, the application of SLM across broader domains and model architectures presents an interesting frontier, potentially unveiling domain-specific insights and customization strategies for model training.

Conclusion

The introduction of Selective Language Modeling (SLM) prompts a reconsideration of how training resources are allocated when developing LLMs. By prioritizing the quality of tokens over their quantity, SLM delivers notable gains in efficiency and effectiveness, pointing to a practical path for optimizing LLM pre-training. The method concentrates the training signal on the most beneficial data points, a step toward more capable and resource-aware models.
