Birbal: An efficient 7B instruct-model fine-tuned with curated datasets (2403.02247v1)

Published 4 Mar 2024 in cs.CL

Abstract: LLMOps incur significant costs due to hardware requirements, hindering their widespread accessibility. Additionally, a lack of transparency in model training methods and data contributes to the majority of models being non-reproducible. To tackle these challenges, the LLM Efficiency Challenge was introduced at the NeurIPS Workshop, aiming to adapt foundation models to a diverse set of tasks via fine-tuning on a single GPU (RTX 4090 or A100 with 40GB) within a 24-hour timeframe. In this system description paper, we introduce Birbal, our Mistral-7B based winning model, fine-tuned on a single RTX 4090 for 16 hours. Birbal's success lies in curating high-quality instructions covering diverse tasks, resulting in a 35% performance improvement over the second-best, Qwen-14B based submission.


Summary

  • The paper demonstrates a 35% performance improvement over the second-best (Qwen-14B based) submission by fine-tuning Mistral-7B on a single RTX 4090 within the challenge's 24-hour limit.
  • The methodology emphasizes meticulous dataset curation and leverages a 4-bit QLoRA fine-tuning approach to optimize performance.
  • The work underscores the viability of efficient LLM fine-tuning under resource constraints, democratizing access to advanced AI models.

Efficiency in LLM Fine-Tuning Demonstrated by Birbal on a Single GPU

Introduction to Birbal's Success

LLMs have delivered notable few-shot advances across a wide range of NLP tasks. Amid this progress, high operational costs and limited reproducibility, stemming from undisclosed training methods and data, remain persistent obstacles. Addressing these issues, the LLM Efficiency Challenge was conceived, focusing on fine-tuning an open-source foundation model on a single GPU within a 24-hour limit. This paper introduces "Birbal," a Mistral-7B based model that won the challenge with a 35% performance improvement over its nearest competitor by leveraging a carefully curated dataset.

LLM Efficiency Challenge

The challenge asked participants to adapt an open-source base LLM to a broad spectrum of tasks, emphasizing efficiency and accessibility. Models such as Mistral-7B had to be fine-tuned under strict hardware and time constraints (a single RTX 4090 or 40GB A100 for at most 24 hours), using only open-source data. The initiative demonstrated that meaningful LLM adaptation is feasible without extensive computational resources.

Our Approach

Design Choices

Given the competition's constraints, we chose Mistral-7B as the base model because it offered the best balance between size and performance within the available memory budget. Our strategy was twofold: invest effort in dataset curation rather than hardware-level optimization, and prioritize high-quality, task-oriented data over sheer quantity.
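
As a rough back-of-the-envelope illustration of this memory budget (not a measurement from the paper), the sketch below estimates why a 4-bit quantized 7B model fits comfortably within the 24 GB of an RTX 4090; the LoRA adapter size and optimizer-state breakdown are assumed values.

```python
# Rough memory estimate for a 4-bit 7B model on a 24 GB RTX 4090.
# All numbers are illustrative assumptions, not figures from the paper.

def estimate_gib(n_params: float, bytes_per_param: float) -> float:
    """Approximate memory in GiB needed to store n_params at a given precision."""
    return n_params * bytes_per_param / 1024**3

base = estimate_gib(7.2e9, 0.5)         # Mistral-7B weights at 4 bits (~0.5 byte/param)
adapter = estimate_gib(2e7, 2 + 4 + 8)  # assumed ~20M LoRA params: bf16 weights + fp32 grads + Adam states

print(f"4-bit base weights:       ~{base:.1f} GiB")
print(f"LoRA adapter + optimizer: ~{adapter:.1f} GiB")
# The remaining budget is consumed by activations and CUDA overhead,
# which scale with sequence length and batch size.
```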

Data Curation

The curation process assembled a diverse dataset designed for broad task coverage, built by carefully selecting and sampling from existing open datasets into a unified collection of prompts and responses spanning various NLP domains. Dataset sizes for fine-tuning were chosen so that the required number of epochs could be completed within the time limit, ensuring efficient use of resources.
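
To make this curation step concrete, here is a minimal sketch of how a mixed instruction set might be assembled; the source files, field names, per-source caps, and output format are hypothetical stand-ins, not the exact recipe used for Birbal.

```python
# Illustrative sketch: sample from several open instruction datasets into a
# unified prompt/response collection. File names and caps are hypothetical.
import json
import random

random.seed(0)

def to_record(instruction: str, context: str, output: str) -> dict:
    """Normalize one example into a single prompt/response pair."""
    prompt = instruction if not context else f"{instruction}\n\n{context}"
    return {"prompt": prompt, "response": output}

# Per-source caps so no single dataset dominates the mix (assumed values).
sources = {
    "dolly.jsonl": 10_000,
    "flan_sample.jsonl": 50_000,
    "oasst.jsonl": 20_000,
}

mixed = []
for path, cap in sources.items():
    with open(path) as f:
        rows = [json.loads(line) for line in f]
    random.shuffle(rows)
    mixed.extend(
        to_record(r["instruction"], r.get("input", ""), r["output"])
        for r in rows[:cap]
    )

random.shuffle(mixed)
with open("birbal_mix.jsonl", "w") as f:
    for r in mixed:
        f.write(json.dumps(r) + "\n")
```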

Fine-Tuning Methodology

Fine-tuning employed 4-bit QLoRA, configured to keep the Mistral-7B model within the memory and time budget. Dataset sizes were adjusted so that the necessary epochs completed within the stipulated 24-hour window, and benchmarking against validation sets guided the selection of the optimal checkpoint for the competition submission.
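
For concreteness, the following is a minimal sketch of what a 4-bit QLoRA setup looks like with Hugging Face transformers, peft, and bitsandbytes; the LoRA rank, alpha, dropout, target modules, and checkpoint name are illustrative defaults rather than the paper's exact configuration, and the training loop itself is omitted.

```python
# Minimal 4-bit QLoRA setup for a Mistral-7B style model (hyperparameters
# are assumed defaults, not the competition configuration).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "mistralai/Mistral-7B-v0.1"

# Load the base model with NF4 4-bit quantization and bf16 compute.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

# Attach LoRA adapters to the attention projections; only these are trained.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # a small fraction of the 7B parameters
```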

Evaluation and Results

The Birbal models underwent rigorous assessment across multiple evaluation stages, demonstrating strong performance on a diverse array of tasks. Although not every variant advanced beyond the initial stages, Birbal-200K excelled, underscoring the effectiveness of the dataset curation and fine-tuning strategy in achieving high efficiency on a single GPU.

Conclusion and Broader Impact

The development of Birbal exemplifies how strategic dataset curation and fine-tuning approaches can significantly elevate LLM performance under stringent resource constraints. This work contributes to democratizing access to efficient LLM fine-tuning, potentially broadening participation in cutting-edge AI research. Nonetheless, it also underscores the inherent biases present within base models and source datasets, raising crucial considerations for future endeavors in this domain.

Acknowledgments and Reproducibility

The success of the Birbal model could not have been achieved without support from Lambda Labs for compute resources. Our commitment to transparency and reproducibility is evidenced by the public availability of datasets, fine-tuning scripts, and model artifacts, facilitating further exploration and application within the research community.