LESS: Selecting Influential Data for Targeted Instruction Tuning (2402.04333v3)

Published 6 Feb 2024 in cs.CL, cs.AI, and cs.LG

Abstract: Instruction tuning has unlocked powerful capabilities in LLMs, effectively using combined datasets to develop general-purpose chatbots. However, real-world applications often require a specialized suite of skills (e.g., reasoning). The challenge lies in identifying the most relevant data from these extensive datasets to effectively develop specific capabilities, a setting we frame as targeted instruction tuning. We propose LESS, an optimizer-aware and practically efficient algorithm to effectively estimate data influences and perform Low-rank gradiEnt Similarity Search for instruction data selection. Crucially, LESS adapts existing influence formulations to work with the Adam optimizer and variable-length instruction data. LESS first constructs a highly reusable and transferable gradient datastore with low-dimensional gradient features and then selects examples based on their similarity to few-shot examples embodying a specific capability. Experiments show that training on a LESS-selected 5% of the data can often outperform training on the full dataset across diverse downstream tasks. Furthermore, the selected data is highly transferable: smaller models can be leveraged to select useful data for larger models and models from different families. Our qualitative analysis shows that our method goes beyond surface form cues to identify data that exemplifies the necessary reasoning skills for the intended downstream application.

LESS: An Efficient Algorithm for Targeted Instruction Tuning in LLMs

Introduction to LESS

LLMs have gained significant traction for their ability to serve as general-purpose chatbots, capable of generating human-like text based on provided instructions. However, for real-world applications that demand specialized capabilities, such as advanced reasoning, the challenge of sifting through extensive instruction tuning datasets to identify and utilize the most relevant data becomes apparent. This process, termed "targeted instruction tuning," is crucial for developing specific skills within LLMs without having to train on the entire dataset, which may contain irrelevant or even counterproductive information.

The proposed solution to this challenge is LESS (Low-rank gradiEnt Similarity Search), an algorithm for selecting influential data from large instruction tuning datasets. LESS estimates data influences with optimizer-aware formulations and performs a low-rank gradient similarity search to pinpoint the examples most likely to improve the model's performance on a given task.
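To make the selection step concrete, the following is a minimal sketch of the final ranking stage, assuming that low-dimensional gradient features for both the candidate pool and a handful of target-task examples have already been computed (how that datastore is built is sketched further below). The tensor and function names are illustrative, not the authors' released API.

```python
import torch

def select_top_fraction(candidate_feats: torch.Tensor,
                        target_feats: torch.Tensor,
                        fraction: float = 0.05) -> torch.Tensor:
    """Return indices of the top `fraction` of candidates by similarity.

    candidate_feats: (N, d) projected gradient features of the training pool.
    target_feats:    (M, d) projected gradient features of the few-shot
                     examples embodying the target capability.
    """
    cand = torch.nn.functional.normalize(candidate_feats, dim=-1)
    targ = torch.nn.functional.normalize(target_feats, dim=-1)
    # Score each candidate by its average cosine similarity to the targets.
    scores = cand @ targ.mean(dim=0)
    k = max(1, int(fraction * scores.numel()))
    return torch.topk(scores, k).indices
```

In this sketch the 5% budget reported in the paper corresponds to `fraction=0.05`; the returned indices identify the subset of the instruction pool used for fine-tuning.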

LESS: The Underlying Mechanism

Compatibility with Instruction Tuning

At its core, LESS adapts existing influence estimation methods to work with the Adam optimizer and with variable-length instruction data. These adaptations matter because classical influence formulations assume plain (stochastic) gradient descent, where the model moves along the raw gradient; LLMs are almost always fine-tuned with Adam, whose per-parameter adaptive updates follow a different direction. LESS therefore measures influence along the Adam update direction, and it relies on normalized (cosine) similarity so that the larger gradient norms typical of shorter sequences do not dominate the selection.
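As a rough illustration of what "optimizer-aware" means here, the snippet below computes the per-example update direction that Adam would take from its running first- and second-moment estimates; it is this direction, rather than the raw gradient, that the influence score is built on. The moment tensors, hyperparameters, and the omission of bias correction are simplifying assumptions for illustration, not the authors' exact implementation.

```python
import torch

def adam_update_direction(grad: torch.Tensor,
                          exp_avg: torch.Tensor,
                          exp_avg_sq: torch.Tensor,
                          beta1: float = 0.9,
                          beta2: float = 0.999,
                          eps: float = 1e-8) -> torch.Tensor:
    """Direction Adam would move the parameters given this example's gradient."""
    m = beta1 * exp_avg + (1 - beta1) * grad           # updated first-moment estimate
    v = beta2 * exp_avg_sq + (1 - beta2) * grad ** 2   # updated second-moment estimate
    return m / (v.sqrt() + eps)                        # element-wise Adam step
```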

Efficiency Through LoRA and Random Projections

To address the computational and storage overhead of working with full-model gradients, LESS employs LoRA (Low-Rank Adaptation) and random projections to construct a gradient datastore. LoRA restricts gradients to a small set of adapter parameters, and random projections compress them further into low-dimensional features that still approximately preserve inner products (in the spirit of the Johnson-Lindenstrauss lemma). The resulting datastore of gradient features supports efficient dataset selection and can be reused for new target tasks, significantly reducing the overall computational cost.
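The sketch below shows one way such a datastore could be assembled, under the assumption that the model's only trainable parameters are its LoRA adapters and that the dataloader yields one example per batch. The dense random projection matrix and `loss_fn` helper are purely illustrative; for realistic parameter counts a memory-efficient projection (e.g., the kind used in TRAK) would replace the explicit matrix.

```python
import torch

def lora_grad_vector(model) -> torch.Tensor:
    """Flatten the gradients of all trainable (LoRA) parameters into one vector."""
    grads = [p.grad.reshape(-1) for p in model.parameters()
             if p.requires_grad and p.grad is not None]
    return torch.cat(grads)

def build_gradient_datastore(model, dataloader, loss_fn,
                             proj_dim: int = 8192) -> torch.Tensor:
    """Return an (N, proj_dim) matrix of randomly projected LoRA gradients."""
    feats, proj = [], None
    for batch in dataloader:                  # assume one example per batch
        model.zero_grad()
        loss_fn(model, batch).backward()      # per-example loss on the adapters
        g = lora_grad_vector(model)
        if proj is None:                      # fixed random projection, d -> proj_dim
            proj = torch.randn(g.numel(), proj_dim, device=g.device) / proj_dim ** 0.5
        feats.append(g @ proj)
    return torch.stack(feats)                 # reusable low-dimensional features
```

Because the projection is fixed, the same datastore can be queried later for any new target task without recomputing gradients over the candidate pool.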

Transferable Knowledge Across Models

A significant advantage of LESS is its ability to select data using gradients from smaller models to induce strong performance in larger models or even different model families. This transferability is crucial for practical applications where computational resources may be limited.

Interpretable Data Selection

LESS diverges from traditional methods that often rely on surface form cues for data selection. Instead, it focuses on identifying data that showcases similar reasoning and skill types required for the target task. This approach ensures that the selected data aligns more closely with the specific capabilities being targeted, rather than merely matching on language or topic.

Experimental Findings and Implications

The effectiveness of LESS is demonstrated through experiments on diverse downstream tasks, where training on a 5% subset selected by LESS often outperforms training on the full dataset. This result underscores the potential of LESS to enable more focused and efficient training, especially when the available instruction-tuning pool is far larger than the in-domain data a specialized task actually requires.

Additionally, the ability of LESS to select transferable data across models introduces a promising avenue for reducing the computational costs associated with data selection and model training. Smaller models can be utilized to curate training datasets for larger, more complex models, facilitating a more resource-efficient workflow without compromising performance.

The Road Ahead

While LESS presents a significant advance in targeted instruction tuning for LLMs, several avenues remain open for further exploration. These include extending LESS for real-time model adaptation, optimizing the algorithm for even greater efficiency, and investigating its potential for reducing unintended model biases by selectively focusing on data that promotes fairness and inclusivity.

In summary, LESS stands as a testament to the potential of intelligent data selection in unlocking more specialized and efficient capabilities within the field of LLMs, paving the way for their broader application across a myriad of tasks demanding high degrees of specificity and complexity.

Authors (5)
  1. Mengzhou Xia
  2. Sadhika Malladi
  3. Suchin Gururangan
  4. Sanjeev Arora
  5. Danqi Chen