An Experimental Design Framework for Label-Efficient Supervised Finetuning of Large Language Models (2401.06692v3)

Published 12 Jan 2024 in cs.CL, cs.AI, and cs.LG

Abstract: Supervised finetuning (SFT) on instruction datasets has played a crucial role in achieving the remarkable zero-shot generalization capabilities observed in modern LLMs. However, the annotation efforts required to produce high quality responses for instructions are becoming prohibitively expensive, especially as the number of tasks spanned by instruction datasets continues to increase. Active learning is effective in identifying useful subsets of samples to annotate from an unlabeled pool, but its high computational cost remains a barrier to its widespread applicability in the context of LLMs. To mitigate the annotation cost of SFT and circumvent the computational bottlenecks of active learning, we propose using experimental design. Experimental design techniques select the most informative samples to label, and typically maximize some notion of uncertainty and/or diversity. In our work, we implement a framework that evaluates several existing and novel experimental design techniques and find that these methods consistently yield significant gains in label efficiency with little computational overhead. On generative tasks, our methods achieve the same generalization performance with only $50\%$ of annotation cost required by random sampling.

Authors (12)
  1. Gantavya Bhatt
  2. Yifang Chen
  3. Arnav M. Das
  4. Jifan Zhang
  5. Sang T. Truong
  6. Stephen Mussmann
  7. Yinglun Zhu
  8. Jeffrey Bilmes
  9. Simon S. Du
  10. Kevin Jamieson
  11. Jordan T. Ash
  12. Robert D. Nowak

Summary

Introduction to Experimental Design in LLM Fine-tuning

Supervised fine-tuning (SFT) on instruction datasets is a powerful way to enhance the performance of LLMs. Instruction datasets are collections of natural language prompts paired with expert-written responses that help LLMs learn to generalize across tasks. The challenge is that creating these datasets is costly, since it requires substantial annotation effort from human experts. Active learning is the traditional remedy for this kind of labeling cost, but its computational demands make it hard to apply to large-scale LLMs. The contribution of this paper is to apply experimental design to SFT, with a specific emphasis on label efficiency and low computational overhead.

Moving Beyond Active Learning

Active learning has traditionally been the go-to approach for label-efficient model training. By iteratively training a model and using it to select informative samples for labeling, active learning aims to reduce the number of annotations required. Yet for LLMs, the repeated training cycles and pool-wide inference passes become a substantial computational barrier, calling for a more efficient alternative.
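
To make that cost concrete, here is a minimal, purely illustrative sketch of a pool-based active learning loop; it is not the paper's method, and train, score, and annotate are hypothetical helpers (train([]) can be read as returning the base model). The point is structural: the model is retrained and the whole unlabeled pool is re-scored in every round, which is exactly what becomes prohibitive at LLM scale.

```python
# Illustrative pool-based active learning loop (hypothetical helpers).
def active_learning(pool, budget, rounds, train, score, annotate):
    labeled = []
    per_round = budget // rounds
    for _ in range(rounds):
        model = train(labeled)                 # full retraining every round
        # Re-score the entire remaining pool with the current model.
        ranked = sorted(pool, key=lambda x: score(model, x), reverse=True)
        batch, pool = ranked[:per_round], ranked[per_round:]
        labeled += [(x, annotate(x)) for x in batch]   # human annotation
    return train(labeled)
```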

Experimental Design as a Solution

Experimental design is a methodology for organizing an experiment so as to gain the most information about an object of interest, in this case the subset of unlabeled prompts whose annotation will most improve the fine-tuned model. By selecting a representative set of prompts in a single step before any labeling takes place, experimental design bypasses much of the computational expense associated with active learning. The selection relies on measures of uncertainty and diversity over the unlabeled data, enabling a label-efficient way to fine-tune an LLM.
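
In contrast with the loop above, the one-shot pattern scores every prompt once with the frozen base model and spends the whole annotation budget in a single pass. The sketch below is a minimal illustration of that pattern under these assumptions; the uncertainty scorer is a hypothetical stand-in for whichever uncertainty measure is used.

```python
import numpy as np

def one_shot_selection(prompts, uncertainty, budget):
    # Score each unlabeled prompt once with the (frozen) base model;
    # no retraining or re-scoring happens during selection.
    scores = np.array([uncertainty(p) for p in prompts])
    top = np.argsort(-scores)[:budget]        # most uncertain prompts first
    return [prompts[i] for i in top]
```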

Implementation and Impact

The researchers evaluated several experimental design strategies, some of them new to this paper, such as maximum token uncertainty and the use of the facility location function to ensure diversity among selected samples. These strategies were implemented within a unified evaluation framework, and the results show a considerable improvement in label efficiency: on generative tasks, the methods reach the same generalization performance with only half the annotation cost required by random sampling.
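
On the diversity side, facility location is a standard monotone submodular objective, f(S) = sum_j max_{i in S} sim(i, j), and the usual way to approximately maximize it is greedy selection, which comes with a (1 - 1/e) approximation guarantee. The sketch below is a generic greedy implementation over prompt embeddings, not the paper's exact code; cosine similarity is an assumption.

```python
import numpy as np

def facility_location_greedy(embeddings, k):
    """Greedily pick k indices approximately maximizing the facility
    location objective sum_j max_{i in S} sim(i, j) (generic sketch)."""
    X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = X @ X.T                          # cosine similarities, shape (n, n)
    n = sim.shape[0]
    selected, covered = [], np.zeros(n)    # covered[j] = max similarity of j to S
    for _ in range(min(k, n)):
        # Marginal gain of adding each candidate i to the current set S.
        gains = np.maximum(sim, covered).sum(axis=1) - covered.sum()
        if selected:
            gains[selected] = -np.inf      # never re-select a chosen prompt
        i = int(np.argmax(gains))
        selected.append(i)
        covered = np.maximum(covered, sim[i])
    return selected
```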

The findings of this paper suggest a promising direction for training LLMs when both accuracy and computational efficiency matter. As the number of tasks over which LLMs are fine-tuned continues to grow, experimental design may offer the balance between performance and cost that broader adoption of these models requires. Future work will likely refine these experimental design approaches further and explore their integration with existing and novel fine-tuning methodologies.