
SMART: Submodular Data Mixture Strategy for Instruction Tuning (2403.08370v3)

Published 13 Mar 2024 in cs.CL, cs.AI, and cs.LG

Abstract: Instruction tuning involves fine-tuning an LLM on a collection of instruction-formatted datasets in order to enhance the model's generalization to unseen tasks. Studies have shown the importance of balancing different task proportions during fine-tuning, but finding the right balance remains challenging, and there is currently no systematic method beyond manual tuning or practitioners' intuition. In this paper, we introduce SMART (Submodular data Mixture strAtegy for instRuction Tuning), a novel data mixture strategy that uses a submodular function to assign importance scores to tasks, which are then used to determine the mixture weights. Given a fine-tuning budget, SMART redistributes the budget among tasks and selects non-redundant samples from each task. Experimental results demonstrate that SMART significantly outperforms traditional methods such as examples-proportional mixing and equal mixing. Furthermore, SMART facilitates the creation of data mixtures from only a few representative subsets of tasks, and through task-pruning analysis we show that, in a limited-budget setting, allocating the budget among a subset of representative tasks yields better performance than distributing it among all tasks. The code for reproducing our results is open-sourced at https://github.com/kowndinya-renduchintala/SMART.
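
The abstract describes a two-stage procedure: a submodular function scores tasks to set mixture weights, then each task's share of the fine-tuning budget is spent on non-redundant samples. The sketch below illustrates that idea with a facility-location objective and a naive greedy maximizer over embeddings. The objective choice, the function names (`greedy_facility_location`, `smart_style_mixture`), the proportional budget split, and the toy data are all illustrative assumptions, not the paper's exact method; see the linked repository for the authors' implementation.

```python
"""Minimal sketch of a SMART-style data mixture pipeline (illustrative only).

Assumes task-level and instance-level embeddings are already available as NumPy
arrays. The facility-location objective and the naive greedy maximizer are
assumptions chosen for clarity, not the paper's exact formulation.
"""
import numpy as np


def greedy_facility_location(features, k):
    """Greedily pick up to k rows of `features` maximizing a facility-location
    objective: the sum over all points of their max similarity to the selected
    set. Returns the selected indices and each pick's marginal gain."""
    feats = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = feats @ feats.T                       # pairwise cosine similarity
    n = sim.shape[0]
    selected, gains = [], []
    covered = np.zeros(n)                       # current max similarity per point
    remaining = set(range(n))
    for _ in range(min(k, n)):
        best_gain, best_idx, best_cov = -np.inf, None, None
        for c in remaining:
            new_cov = np.maximum(covered, sim[:, c])
            gain = new_cov.sum() - covered.sum()
            if gain > best_gain:
                best_gain, best_idx, best_cov = gain, c, new_cov
        selected.append(best_idx)
        gains.append(best_gain)
        covered = best_cov
        remaining.remove(best_idx)
    return selected, np.array(gains)


def smart_style_mixture(task_embeddings, instance_embeddings, budget):
    """Two-stage sketch: (1) score tasks with the submodular objective over task
    embeddings and normalize the scores into mixture weights; (2) spend each
    task's slice of the budget on a greedy, non-redundant subset of its
    instances."""
    n_tasks = len(task_embeddings)
    order, gains = greedy_facility_location(task_embeddings, n_tasks)
    scores = np.zeros(n_tasks)
    scores[order] = gains                       # importance score per task
    weights = scores / scores.sum()             # mixture weights

    selection = {}
    for t, inst in enumerate(instance_embeddings):
        k_t = int(round(weights[t] * budget))   # this task's share of the budget
        idx, _ = greedy_facility_location(inst, k_t)
        selection[t] = idx
    return weights, selection


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    task_embs = rng.normal(size=(5, 32))                         # 5 toy tasks
    inst_embs = [rng.normal(size=(100, 32)) for _ in range(5)]   # 100 instances each
    weights, picks = smart_style_mixture(task_embs, inst_embs, budget=60)
    print("mixture weights:", np.round(weights, 3))
    print("samples kept per task:", {t: len(v) for t, v in picks.items()})
```

Because facility-location gains shrink as coverage saturates, tasks that are redundant with already-selected ones receive small weights, which is the intuition the abstract appeals to when pruning to a few representative tasks.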
