Take the Bull by the Horns: Hard Sample-Reweighted Continual Training Improves LLM Generalization (2402.14270v2)

Published 22 Feb 2024 in cs.LG

Abstract: In the rapidly advancing arena of LLMs, a key challenge is to enhance their capabilities amid a looming shortage of high-quality training data. Our study starts from an empirical strategy for the lightweight continual training of LLMs using their original pre-training datasets, with a specific focus on selective retention of samples that incur moderately high losses. These samples are deemed informative and beneficial for model refinement, contrasting with the highest-loss samples, which would be discarded due to their correlation with data noise and complexity. We then formalize this strategy into a principled framework of Instance-Reweighted Distributionally Robust Optimization (IR-DRO). IR-DRO is designed to dynamically prioritize the training focus on informative samples through an instance reweighting mechanism, streamlined by a closed-form solution for straightforward integration into established training protocols. Through rigorous experimentation with various models and datasets, our findings indicate that our sample-targeted methods significantly improve LLM performance across multiple benchmarks, in both continual pre-training and instruction tuning scenarios. Our code is available at https://github.com/VITA-Group/HardFocusTraining.
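
The closed-form instance reweighting described in the abstract can be illustrated with a short sketch. This is not the paper's implementation (see the linked repository for that); it assumes the closed-form weights take the softmax-of-scaled-losses form common in KL-regularized DRO, and all names here (ir_dro_loss, temperature, the batch keys) are hypothetical. The empirical step of discarding the very highest-loss samples is omitted for brevity.

import torch
import torch.nn.functional as F

def ir_dro_loss(per_sample_losses: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
    # Closed-form instance weights: softmax over detached per-sample losses,
    # so higher-loss (more informative) samples receive larger weights and
    # gradients do not flow through the weights themselves.
    weights = F.softmax(per_sample_losses.detach() / temperature, dim=0)
    return (weights * per_sample_losses).sum()

def training_step(model, batch, optimizer, temperature: float = 1.0) -> float:
    # One reweighted causal-LM step; the model is assumed to map
    # input_ids of shape (B, T) to logits of shape (B, T, V).
    logits = model(batch["input_ids"])
    token_loss = F.cross_entropy(
        logits.transpose(1, 2), batch["labels"],
        reduction="none", ignore_index=-100,
    )                                        # (B, T) per-token losses
    per_sample = token_loss.mean(dim=1)      # (B,) per-sequence loss
    loss = ir_dro_loss(per_sample, temperature)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

In this sketch, temperature controls how sharply the weights concentrate on high-loss samples: a large value approaches uniform averaging (ordinary training), while a small value focuses each update on the hardest samples in the batch.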
