SmallToLarge (S2L): Scalable Data Selection for Fine-tuning Large Language Models by Summarizing Training Trajectories of Small Models (2403.07384v2)

Published 12 Mar 2024 in cs.CL, cs.AI, and cs.LG

Abstract: Despite the effectiveness of data selection for LLMs during pretraining and instruction fine-tuning phases, improving data efficiency in supervised fine-tuning (SFT) for specialized domains poses significant challenges due to the complexity of fine-tuning data. To bridge this gap, we introduce an effective and scalable data selection method for SFT, SmallToLarge (S2L), which leverages training trajectories from small models to guide the data selection for larger models. We demonstrate through extensive experiments that S2L significantly improves data efficiency in SFT for mathematical problem-solving, reducing the training data to just 11% of the original MathInstruct dataset (Yue et al., 2023) to match full dataset performance while outperforming state-of-the-art data selection algorithms by an average of 4.7% across 6 in- and out-domain evaluation datasets. Remarkably, selecting only 50K data for SFT, S2L achieves a 32.7% accuracy on the most challenging MATH (Hendrycks et al., 2021) benchmark, improving Phi-2 (Li et al., 2023b) by 16.6%. In clinical text summarization on the MIMIC-III dataset (Johnson et al., 2016), S2L again outperforms training on the full dataset using only 50% of the data. Notably, S2L can perform data selection using a reference model 40x smaller than the target model, proportionally reducing the cost of data selection.

Scalable Data Selection for Efficient Fine-tuning of LLMs

The paper "S2L: Scalable Data Selection for Fine-tuning LLMs by Summarizing Training Trajectories of Small Models" presents an innovative approach to optimizing data selection for supervised fine-tuning (SFT) of LLMs in specialized domains. The authors highlight the prevalent challenges associated with data efficiency in these models, especially when transitioning from generalist capabilities to domain-specific expertise. The introduction of {S2L}—a method that leverages training trajectories from smaller models—addresses these challenges effectively.

Problem Statement and Methodology

The paper begins by identifying a significant gap in data efficiency during SFT, particularly for specialized domains whose data distributions differ markedly from the pretraining distribution. Existing data selection methods often fall short in these scenarios because they rely on generalist models or on selection criteria that are not tailored to the target domain.

S2L distinguishes itself through a scalable selection process that identifies and clusters training-trajectory patterns observed on smaller proxy models. The approach rests on the observation that training dynamics tend to be consistent across models of different scales, as evidenced by Xia et al. (2023). By summarizing these dynamics and sampling from every trajectory cluster, S2L selects a subset of data that preserves coverage of the topics and learning patterns present in the full set.
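The paper's released implementation is not reproduced here, but a minimal sketch conveys the mechanism: record each example's loss at several checkpoints of a small proxy model, cluster the resulting trajectory vectors, and sample clusters in a balanced way until the selection budget is met. The `n_clusters` and `budget` parameters, the use of scikit-learn's k-means, and the round-robin sampling order are illustrative assumptions, not the paper's exact settings.

```python
import numpy as np
from sklearn.cluster import KMeans

def select_by_trajectories(loss_trajectories: np.ndarray,
                           budget: int,
                           n_clusters: int = 100,
                           seed: int = 0) -> np.ndarray:
    """Select a data subset from per-example loss trajectories.

    loss_trajectories: shape (n_examples, n_checkpoints), the loss of each
    training example recorded at several checkpoints of a small proxy model.
    Returns the indices of at most `budget` selected examples.
    """
    rng = np.random.default_rng(seed)

    # Cluster examples whose losses evolve similarly during proxy training.
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=seed).fit_predict(loss_trajectories)

    # Group example indices by cluster, shuffled within each cluster.
    clusters = []
    for c in range(n_clusters):
        members = np.flatnonzero(labels == c)
        rng.shuffle(members)
        clusters.append(list(members))

    # Balanced (round-robin) sampling: take one example per cluster in turn,
    # so small clusters are fully covered and large clusters are capped.
    selected = []
    while len(selected) < budget and any(clusters):
        for members in clusters:
            if members and len(selected) < budget:
                selected.append(members.pop())
    return np.asarray(selected)
```

Because the trajectories can be logged during a single training run of the small proxy, the cost of this selection step scales with the proxy rather than with the target model.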

Results and Analysis

The experimental results are compelling. On the MathInstruct dataset, the authors show that S2L reaches full-dataset performance using only 11% of the training data, and that it outperforms state-of-the-art data selection methods by an average of 4.7% across six in-domain and out-of-domain evaluation datasets. On the challenging MATH benchmark, fine-tuning on only 50K S2L-selected examples achieves 32.7% accuracy, improving Phi-2 by 16.6%.

In terms of scalability, S2L can perform data selection with a reference model up to 40 times smaller than the target model, proportionally reducing the computational cost of selection. This is validated empirically by transferring subsets selected with a small model to larger targets such as Phi-2 (2.7B), demonstrating S2L's cross-model applicability.

Implications and Future Directions

From a practical perspective, S2L presents a cost-effective solution for practitioners aiming to fine-tune LLMs for specialized applications such as mathematical reasoning and clinical text summarization. The success of S2L in these domains suggests its potential utility across other specialized fields, paving the way for more efficient use of training resources and energy.

Theoretically, the method opens avenues for further exploration into the uniformity of training dynamics across models and tasks, inviting research into the underlying mechanisms that ensure this consistency. Additionally, potential enhancements could explore automated adjustments to trajectory clustering parameters or adaptive sampling strategies that respond dynamically to the complexity of fine-tuning tasks.
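As one concrete, hypothetical illustration of such an enhancement (not something evaluated in the paper), the trajectory cluster count could be chosen automatically from the trajectories themselves, for example with a silhouette criterion over a small candidate grid; the grid and scoring metric below are assumptions for the sketch.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def pick_cluster_count(trajectories: np.ndarray,
                       candidates=(50, 100, 200, 400),
                       seed: int = 0) -> int:
    """Return the candidate cluster count with the best silhouette score."""
    best_k, best_score = candidates[0], -1.0
    sample = min(2000, len(trajectories))  # subsample to keep scoring cheap
    for k in candidates:
        labels = KMeans(n_clusters=k, n_init=10,
                        random_state=seed).fit_predict(trajectories)
        score = silhouette_score(trajectories, labels,
                                 sample_size=sample, random_state=seed)
        if score > best_score:
            best_k, best_score = k, score
    return best_k
```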

Conclusion

The introduction of S2L marks a significant step toward optimizing data efficiency during the fine-tuning phase of LLM development. By capitalizing on the training trajectories of smaller models, the strategy reduces the data volume required for strong performance while remaining scalable across model sizes. This makes S2L a practical tool for extending LLM capabilities in specialized domains without incurring excessive computational cost. As AI systems continue to grow, such methods will be integral to sustainable and efficient training practices.

References (71)
  1. Semdedup: Data-efficient learning at web-scale through semantic deduplication. In ICLR 2023 Workshop on Multimodal Representation Learning: Perks and Pitfalls, 2023.
  2. Gpt-4 technical report. 2023. URL https://api.semanticscholar.org/CorpusID:257532815.
  3. Llemma: An open language model for mathematics. arXiv preprint arXiv:2310.10631, 2023.
  4. An experimental design framework for label-efficient supervised finetuning of large language models. arXiv preprint arXiv:2401.06692, 2024.
  5. Emergent and predictable memorization in large language models. arXiv preprint arXiv:2304.11158, 2023a.
  6. Pythia: A suite for analyzing large language models across training and scaling. In International Conference on Machine Learning, pp.  2397–2430. PMLR, 2023b.
  7. Language models are few-shot learners. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., and Lin, H. (eds.), Advances in Neural Information Processing Systems, volume 33, pp.  1877–1901. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper_files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf.
  8. Alpagasus: Training a better alpaca with fewer data. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=FdVXgSJhvz.
  9. Adapting large language models via reading comprehension. arXiv preprint arXiv:2309.09530, 2023.
  10. Palm: Scaling language modeling with pathways. Journal of Machine Learning Research, 24(240):1–113, 2023.
  11. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
  12. Selection via proxy: Efficient data selection for deep learning. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=HJg2b0VYDr.
  13. Advancing mathematics by guiding human intuition with ai. Nature, 600(7887):70–74, 2021.
  14. Overview of the RadSum23 shared task on multi-modal and multi-anatomical radiology report summarization. In Demner-fushman, D., Ananiadou, S., and Cohen, K. (eds.), The 22nd Workshop on Biomedical Natural Language Processing and BioNLP Shared Tasks, pp.  478–482, Toronto, Canada, July 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.bionlp-1.45. URL https://aclanthology.org/2023.bionlp-1.45.
  15. The 22nd Workshop on Biomedical Natural Language Processing and BioNLP Shared Tasks, Toronto, Canada, July 2023. Association for Computational Linguistics. URL https://aclanthology.org/2023.bionlp-1.0.
  16. Robust learning with progressive data expansion against spurious correlation. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=9QEVJ9qm46.
  17. The faiss library. 2024.
  18. Tinystories: How small can language models be and still speak coherent english? arXiv preprint arXiv:2305.07759, 2023.
  19. Measuring mathematical problem solving with the MATH dataset. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021. URL https://openreview.net/forum?id=7Bywt2mQsCe.
  20. Exploring the benefits of training expert language models over instruction tuning. In Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., and Scarlett, J. (eds.), Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pp.  14702–14729. PMLR, 23–29 Jul 2023. URL https://proceedings.mlr.press/v202/jang23a.html.
  21. Mimic-iii, a freely accessible critical care database. Scientific data, 3(1):1–9, 2016.
  22. Data-efficient contrastive self-supervised learning: Most beneficial examples for supervised learning contribute the least. In International conference on machine learning, pp.  15356–15370. PMLR, 2023.
  23. Grad-match: Gradient matching based data subset selection for efficient deep model training. In International Conference on Machine Learning, pp.  5464–5474. PMLR, 2021a.
  24. Glister: Generalization based data subset selection for efficient and robust learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pp.  8110–8118, 2021b.
  25. MAWPS: A math word problem repository. In Knight, K., Nenkova, A., and Rambow, O. (eds.), Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp.  1152–1157, San Diego, California, June 2016. Association for Computational Linguistics. doi: 10.18653/v1/N16-1136. URL https://aclanthology.org/N16-1136.
  26. Solving quantitative reasoning problems with language models. Advances in Neural Information Processing Systems, 35:3843–3857, 2022.
  27. Textbooks are all you need ii: phi-1.5 technical report. arXiv preprint arXiv:2309.05463, 2023a.
  28. Textbooks are all you need ii: phi-1.5 technical report. arXiv preprint arXiv:2309.05463, 2023b.
  29. Lin, C.-Y. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pp.  74–81, Barcelona, Spain, July 2004. Association for Computational Linguistics. URL https://aclanthology.org/W04-1013.
  30. TinyGSM: achieving >80% on GSM8K with small language models. arXiv preprint arXiv:2312.09241, 2023.
  31. Wizardmath: Empowering mathematical reasoning for large language models via reinforced evol-instruct. arXiv preprint arXiv:2308.09583, 2023a.
  32. Wizardcoder: Empowering code large language models with evol-instruct. arXiv preprint arXiv:2306.08568, 2023b.
  33. Sieve: Multimodal dataset pruning using image captioning models. arXiv preprint arXiv:2310.02110, 2023.
  34. When less is more: Investigating data pruning for pretraining llms at scale. arXiv preprint arXiv:2309.04564, 2023.
  35. Coresets for data-efficient training of machine learning models. In III, H. D. and Singh, A. (eds.), Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pp.  6950–6960. PMLR, 13–18 Jul 2020. URL https://proceedings.mlr.press/v119/mirzasoleiman20a.html.
  36. NumGLUE: A suite of fundamental yet challenging mathematical reasoning tasks. In Muresan, S., Nakov, P., and Villavicencio, A. (eds.), Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.  3505–3523, Dublin, Ireland, May 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.acl-long.246. URL https://aclanthology.org/2022.acl-long.246.
  37. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
  38. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp.  311–318, 2002.
  39. Are NLP models really able to solve simple math word problems? In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp.  2080–2094, Online, June 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.naacl-main.168. URL https://aclanthology.org/2021.naacl-main.168.
  40. Deep learning on a data diet: Finding important examples early in training. Advances in Neural Information Processing Systems, 34:20596–20607, 2021.
  41. Adaptive second order coresets for data-efficient machine learning. In Chaudhuri, K., Jegelka, S., Song, L., Szepesvari, C., Niu, G., and Sabato, S. (eds.), Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pp.  17848–17869. PMLR, 17–23 Jul 2022. URL https://proceedings.mlr.press/v162/pooladzandi22a.html.
  42. Nessa: Near-storage data selection for accelerated machine learning training. In Proceedings of the 15th ACM Workshop on Hot Topics in Storage and File Systems, HotStorage ’23, pp.  8–15, New York, NY, USA, 2023. Association for Computing Machinery. ISBN 9798400702242. doi: 10.1145/3599691.3603404. URL https://doi.org/10.1145/3599691.3603404.
  43. Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950, 2023.
  44. Large language models encode clinical knowledge. Nature, 620(7972):172–180, 2023a.
  45. Towards expert-level medical question answering with large language models. arXiv preprint arXiv:2305.09617, 2023b.
  46. Beyond neural scaling laws: beating power law scaling via data pruning. Advances in Neural Information Processing Systems, 35:19523–19536, 2022.
  47. Dataset cartography: Mapping and diagnosing datasets with training dynamics. In Webber, B., Cohn, T., He, Y., and Liu, Y. (eds.), Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp.  9275–9293, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-main.746. URL https://aclanthology.org/2020.emnlp-main.746.
  48. Galactica: A large language model for science. arXiv preprint arXiv:2211.09085, 2022.
  49. Lamda: Language models for dialog applications. arXiv preprint arXiv:2201.08239, 2022.
  50. D4: Improving LLM pretraining via document de-duplication and diversification. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023. URL https://openreview.net/forum?id=CG0L2PFrb1.
  51. An empirical study of example forgetting during deep neural network learning. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=BJlxm30cKm.
  52. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
  53. Towards generalist biomedical ai. arXiv preprint arXiv:2307.14334, 2023.
  54. Clinical text summarization: adapting large language models can outperform human experts. arXiv preprint arXiv:2309.07430, 2023.
  55. Let the model decide its curriculum for multitask learning. In Cherry, C., Fan, A., Foster, G., Haffari, G. R., Khadivi, S., Peng, N. V., Ren, X., Shareghi, E., and Swayamdipta, S. (eds.), Proceedings of the Third Workshop on Deep Learning for Low-Resource Natural Language Processing, pp.  117–125, Hybrid, July 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.deeplo-1.13. URL https://aclanthology.org/2022.deeplo-1.13.
  56. Finetuned language models are zero-shot learners. In International Conference on Learning Representations, 2022a. URL https://openreview.net/forum?id=gEZrGCozdqR.
  57. Chain of thought prompting elicits reasoning in large language models. In Oh, A. H., Agarwal, A., Belgrave, D., and Cho, K. (eds.), Advances in Neural Information Processing Systems, 2022b. URL https://openreview.net/forum?id=_VjQlMeSB_J.
  58. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp.  38–45, Online, October 2020. Association for Computational Linguistics. URL https://www.aclweb.org/anthology/2020.emnlp-demos.6.
  59. Bloomberggpt: A large language model for finance. arXiv preprint arXiv:2303.17564, 2023a.
  60. Self-evolved diverse data sampling for efficient instruction tuning. arXiv preprint arXiv:2311.08182, 2023b.
  61. Training trajectories of language models across scales. In Rogers, A., Boyd-Graber, J., and Okazaki, N. (eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.  13711–13738, Toronto, Canada, July 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.acl-long.767. URL https://aclanthology.org/2023.acl-long.767.
  62. Not all poisons are created equal: Robust training against data poisoning. In Chaudhuri, K., Jegelka, S., Song, L., Szepesvari, C., Niu, G., and Sabato, S. (eds.), Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pp.  25154–25165. PMLR, 17–23 Jul 2022. URL https://proceedings.mlr.press/v162/yang22j.html.
  63. Identifying spurious biases early in training through the lens of simplicity bias. arXiv preprint arXiv:2305.18761, 2023a.
  64. Towards sustainable learning: Coresets for data-efficient deep learning. In Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., and Scarlett, J. (eds.), Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pp.  39314–39330. PMLR, 23–29 Jul 2023b.
  65. Decoding data quality via synthetic corruptions: Embedding-guided pruning of code data. arXiv preprint arXiv:2312.02418, 2023c.
  66. Metamath: Bootstrap your own mathematical questions for large language models. arXiv preprint arXiv:2309.12284, 2023.
  67. Mammoth: Building math generalist models through hybrid instruction tuning. arXiv preprint arXiv:2309.05653, 2023.
  68. Bertscore: Evaluating text generation with bert. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=SkeHuCVFDr.
  69. Pytorch fsdp: experiences on scaling fully sharded data parallel. arXiv preprint arXiv:2304.11277, 2023.
  70. LIMA: Less is more for alignment. In Thirty-seventh Conference on Neural Information Processing Systems, 2023a. URL https://openreview.net/forum?id=KBMOKmX2he.
  71. Lobass: Gauging learnability in supervised fine-tuning data. arXiv preprint arXiv:2310.13008, 2023b.
Authors (4)
  1. Yu Yang (213 papers)
  2. Siddhartha Mishra (76 papers)
  3. Baharan Mirzasoleiman (51 papers)
  4. Jeffrey N Chiang (1 paper)