
Tailored-LLaMA: Optimizing Few-Shot Learning in Pruned LLaMA Models with Task-Specific Prompts (2410.19185v1)

Published 24 Oct 2024 in cs.AI

Abstract: LLMs demonstrate impressive proficiency in language understanding and generation. Nonetheless, training these models from scratch, even the least complex billion-parameter variant, demands significant computational resources, rendering it economically impractical for many organizations. With LLMs functioning as general-purpose task solvers, this paper investigates their task-specific fine-tuning. We employ task-specific datasets and prompts to fine-tune two pruned LLaMA models with 5 billion and 4 billion parameters. This process utilizes the pre-trained weights and focuses on a subset of weights using the LoRA method. One challenge in fine-tuning the LLaMA model is crafting a precise prompt tailored to the specific task. To address this, we propose a novel approach to fine-tune the LLaMA model under two primary constraints: task specificity and prompt effectiveness. Our approach, Tailored-LLaMA, initially employs structural pruning to reduce the model sizes from 7B to 5B and 4B parameters. Subsequently, it applies a carefully designed prompt specific to the task and utilizes the LoRA method to accelerate the fine-tuning process. Moreover, fine-tuning a model pruned by 50% for less than one hour restores the mean accuracy of classification tasks to 95.68% at a 20% compression ratio and to 86.54% at a 50% compression ratio through few-shot learning with 50 shots. Our validation of Tailored-LLaMA on these two pruned variants demonstrates that even when compressed to 50%, the models maintain over 65% of the baseline model accuracy in few-shot classification and generation tasks. These findings highlight the efficacy of our tailored approach in maintaining high performance with significantly reduced model sizes.

Tailored-LLaMA: Optimizing Few-Shot Learning in Pruned LLaMA Models with Task-Specific Prompts

The paper presents a focused exploration of fine-tuning LLMs through the proposed Tailored-LLaMA approach. It addresses a central challenge in deploying LLMs: adapting these typically expansive models to task-specific applications while mitigating computational demands. Tailored-LLaMA employs a three-part strategy, combining structural pruning, task-specific prompt engineering, and the LoRA method, to fine-tune pruned LLaMA variants effectively.

Structural Pruning and Efficiency

The process begins with structural pruning, designed to reduce model size without disproportionately degrading performance. The authors employ a method analogous to DepGraph, assessing parameter interdependencies within LLaMA's architecture through a dependency graph. The pruning is quantitatively driven, operating on groups of interdependent parameters rather than on individual weights. This first phase compresses the model from 7B parameters to 5B and 4B parameters at different compression ratios without significant performance losses; the results indicate substantial accuracy retention, with a mean recovery rate of 95.68% at a 20% compression ratio.
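
The snippet below is a minimal sketch of this group-wise pruning idea, not the authors' implementation: interdependent tensors that share a dimension are scored together and the lowest-scoring channels are removed jointly. The L2-norm importance criterion, tensor shapes, and function names are assumptions for illustration.

```python
# Hypothetical sketch of dependency-aware structural pruning: parameters that
# share a dimension form a group, are scored together, and are pruned together.
import torch

def group_importance(weights: list[torch.Tensor]) -> torch.Tensor:
    """L2 importance per channel, summed over every tensor in a dependency group.
    Each tensor is assumed to have the shared (prunable) dimension on dim 0."""
    return sum(w.flatten(1).pow(2).sum(dim=1) for w in weights)

def prune_group(weights: list[torch.Tensor], keep_ratio: float) -> list[torch.Tensor]:
    """Keep the top `keep_ratio` fraction of channels in every coupled tensor."""
    scores = group_importance(weights)
    k = max(1, int(keep_ratio * scores.numel()))
    keep = torch.topk(scores, k).indices.sort().values
    return [w.index_select(0, keep) for w in weights]

# Example: two projection matrices that share a hidden dimension (placed on dim 0
# here) must be pruned jointly so their shapes stay consistent.
coupled = [torch.randn(4096, 4096), torch.randn(4096, 11008)]
pruned = prune_group(coupled, keep_ratio=0.8)   # roughly 20% compression of this group
print([t.shape for t in pruned])
```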

Task-Specific Prompting and Fine-Tuning

After pruning, the paper emphasizes the critical role of task-specific prompts. The authors develop a prompt-evaluation strategy to identify the prompts most likely to improve the pruned models' performance on a given task. With the selected prompts, few-shot fine-tuning restores mean classification accuracy to 95.68% at a 20% compression ratio and 86.54% at a 50% compression ratio, affirming the efficacy of tailored prompts.
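
A hypothetical sketch of such a prompt-evaluation loop follows: each candidate template is scored by few-shot accuracy on a small validation split, and the best one is kept. The dataset format, the example templates, and the `model_generate` interface are assumptions for illustration, not the paper's code.

```python
# Score each candidate task prompt by few-shot accuracy and keep the best one.

def few_shot_accuracy(model_generate, prompt_template, shots, val_set):
    """shots and val_set are lists of (text, label) pairs."""
    context = "".join(prompt_template.format(text=t, label=l) for t, l in shots)
    correct = 0
    for text, label in val_set:
        query = context + prompt_template.format(text=text, label="").rstrip()
        prediction = model_generate(query)          # the model's completion string
        correct += prediction.strip().startswith(label)
    return correct / len(val_set)

def select_best_prompt(model_generate, candidates, shots, val_set):
    scored = {p: few_shot_accuracy(model_generate, p, shots, val_set) for p in candidates}
    return max(scored, key=scored.get)

# Example candidate templates for a yes/no classification task such as BoolQ.
candidates = [
    "Question: {text}\nAnswer: {label}\n",
    "Read the passage and answer yes or no.\n{text}\nAnswer: {label}\n",
]
# best = select_best_prompt(model_generate, candidates, shots=train[:50], val_set=val)
```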

Implementation of LoRA Method

The final phase applies the Low-Rank Adaptation (LoRA) method, which expedites fine-tuning while requiring only limited data. LoRA keeps the pretrained weights frozen and trains only a pair of low-rank matrices per adapted layer, which sharply reduces the number of trainable parameters and the data required. The result is a notable decrease in computational overhead, allowing fine-tuning on a single GPU in under one hour, a potentially transformative improvement for deploying LLMs in resource-constrained environments.
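
To make the mechanism concrete, here is a minimal LoRA-style linear layer in PyTorch. The rank and scaling values are illustrative, not the paper's exact configuration.

```python
# Minimal LoRA sketch: the pretrained weight is frozen and only the low-rank
# factors A and B are trained, giving an effective update W + (alpha / r) * B @ A.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)        # freeze pretrained weights
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero-init: no change at start
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)

# Wrapping a projection layer: only A and B receive gradients.
layer = LoRALinear(nn.Linear(4096, 4096))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 2 * 8 * 4096 = 65,536 trainable vs. ~16.8M frozen base parameters
```

Because only A and B receive gradients, gradient and optimizer-state memory scale with the rank rather than with the full weight matrices, which is what makes single-GPU fine-tuning in under an hour plausible.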

Implications and Future Directions

The paper's results hold meaningful implications for deploying LLaMA models and similar architectures within constrained computational infrastructures. By demonstrating that high performance can be maintained with substantially reduced model sizes, this work opens avenues for pursuing further efficiencies in other large-scale models across different application domains. The adaptability of Tailored-LLaMA suggests potential for scalable approaches in diverse AI and NLP tasks beyond those explored here.

The paper sets a solid foundation for additional research into fine-tuning and pruning strategies, particularly concerning the impact of task-specific prompting and low-rank parameter adaptations. Future investigations might delve into automating the selection of optimal prompts or further optimizing the structural pruning methodologies to enable widespread, efficient utilization of LLMs across varying scales and disciplines.

In conclusion, this paper makes a significant contribution to the field by bridging the gap between large-scale model capabilities and practical deployment needs, offering methodologies that enhance the efficiency of AI systems while maintaining their effectiveness on specified tasks.

References (51)
  1. The falcon series of open language models. arXiv preprint arXiv:2311.16867, 2023.
  2. Fluctuation-based adaptive structured pruning for large language models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 10865–10873, 2024.
  3. Piqa: Reasoning about physical commonsense in natural language. In Thirty-Fourth AAAI Conference on Artificial Intelligence, 2020.
  4. Language models are few-shot learners. arXiv:2005.14165 [cs], July 2020. URL http://arxiv.org/abs/2005.14165.
  5. Once-for-all: Train one network and specialize it for efficient deployment. In International Conference on Learning Representations, 2019.
  6. Palm: Scaling language modeling with pathways. Journal of Machine Learning Research, 24(240):1–113, 2023.
  7. Deep reinforcement learning from human preferences. Advances in neural information processing systems, 30, 2017.
  8. BoolQ: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2924–2936, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. 10.18653/v1/N19-1300. URL https://aclanthology.org/N19-1300.
  9. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv:1803.05457v1, 2018.
  10. R. Collobert and J. Weston. A unified architecture for natural language processing: deep neural networks with multitask learning. In Proceedings of the 25th international conference on Machine learning, ICML ’08, pages 160–167, New York, NY, USA, July 2008. Association for Computing Machinery. ISBN 978-1-60558-205-4. 10.1145/1390156.1390177. URL https://doi.org/10.1145/1390156.1390177.
  11. Model compression and hardware acceleration for neural networks: A comprehensive survey. Proceedings of the IEEE, 108(4):485–532, 2020.
  12. Bert: Pre-training of deep bidirectional transformers for language understanding. In North American Chapter of the Association for Computational Linguistics, 2019. URL https://api.semanticscholar.org/CorpusID:52967399.
  13. Depgraph: Towards any structural pruning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16091–16101, 2023.
  14. J. Frankle and M. Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In International Conference on Learning Representations, 2018.
  15. E. Frantar and D. Alistarh. Sparsegpt: Massive language models can be accurately pruned in one-shot. In International Conference on Machine Learning, pages 10323–10337. PMLR, 2023.
  16. A framework for few-shot language model evaluation, December 2023. URL https://zenodo.org/records/10256836.
  17. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. In International Conference on Learning Representations, 2016.
  18. Dynabert: Dynamic bert with adaptive width and depth. Advances in Neural Information Processing Systems, 33:9782–9793, 2020.
  19. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=nZeVKeeFYf9.
  20. Shortened llama: A simple depth pruning for large language models. ICLR Workshop on Mathematical and Empirical Understanding of Foundation Models (ME-FoMo), 2024. URL https://openreview.net/forum?id=18VGxuOdpu.
  21. Ziplm: Hardware-aware structured pruning of language models. arXiv preprint arXiv:2302.04089, 2023.
  22. Block pruning for faster transformers. arXiv preprint arXiv:2109.04838, 2021.
  23. Optimal brain damage. In D. Touretzky, editor, Advances in Neural Information Processing Systems, volume 2. Morgan-Kaufmann, 1989. URL https://proceedings.neurips.cc/paper_files/paper/1989/file/6c9882bbac1c7093bd25041881277658-Paper.pdf.
  24. Pruning filters for efficient convnets. In International Conference on Learning Representations, 2016.
  25. Train big, then compress: Rethinking model size for efficient training and inference of transformers. In International Conference on machine learning, pages 5958–5968. PMLR, 2020.
  26. Learning efficient convolutional networks through network slimming. In Proceedings of the IEEE international conference on computer vision, pages 2736–2744, 2017.
  27. Llm-pruner: On the structural pruning of large language models. Advances in neural information processing systems, 36:21702–21720, 2023.
  28. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330, 1993. URL https://www.aclweb.org/anthology/J93-2004.
  29. Pointer sentinel mixture models. In International Conference on Learning Representations, 2017. URL https://openreview.net/forum?id=Byj72udxe.
  30. Can a suit of armor conduct electricity? a new dataset for open book question answering. In EMNLP, 2018.
  31. OpenAI. Gpt-4 technical report. ArXiv, abs/2303.08774, 2023.
  32. Improving language understanding by generative pre-training. ArXiv, 2018.
  33. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
  34. Winogrande: An adversarial winograd schema challenge at scale. Communications of the ACM, 64(9):99–106, 2021.
  35. Movement pruning: Adaptive sparsity by fine-tuning. Advances in Neural Information Processing Systems, 33:20378–20389, 2020.
  36. Bloom: A 176b-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100, 2022.
  37. Learning to summarize with human feedback. Advances in Neural Information Processing Systems, 33:3008–3021, 2020.
  38. A simple and effective pruning approach for large language models. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=PxoFut3dWW.
  39. MosaicML NLP Team. Introducing MPT-7B: A new standard for open-source, commercially usable LLMs, 2023. URL www.mosaicml.com/blog/mpt-7b.
  40. Lamda: Language models for dialog applications. arXiv preprint arXiv:2201.08239, 2022.
  41. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
  42. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pages 6000–6010, 2017.
  43. Eigendamage: Structured pruning in the kronecker-factored eigenbasis. In International conference on machine learning, pages 6566–6575, 2019.
  44. Structured pruning of large language models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6151–6162, 2020.
  45. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022.
  46. Learning structured sparsity in deep neural networks. Advances in neural information processing systems, 29, 2016.
  47. Structured pruning learns compact and accurate models. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1513–1528, Dublin, Ireland, May 2022. Association for Computational Linguistics. 10.18653/v1/2022.acl-long.107.
  48. Re-reading improves reasoning in language models. arXiv:2309.06275, 2023.
  49. Hellaswag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019.
  50. Loraprune: Pruning meets low-rank parameter-efficient fine-tuning. arXiv preprint arXiv:2305.18403, 2023.
  51. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In The IEEE International Conference on Computer Vision (ICCV), December 2015.
Authors (2)
  1. Danyal Aftab (2 papers)
  2. Steven Davy (6 papers)