Learning to Generate Instruction Tuning Datasets for Zero-Shot Task Adaptation (2402.18334v3)

Published 28 Feb 2024 in cs.CL and cs.LG

Abstract: We introduce Bonito, an open-source model for conditional task generation that converts unannotated text into task-specific training datasets for instruction tuning. We aim to enable zero-shot task adaptation of LLMs on users' specialized, private data. We train Bonito by fine-tuning a pretrained LLM on a new large-scale dataset with 1.65M examples created by remixing existing instruction tuning datasets into meta-templates. The meta-templates for a dataset produce training examples where the input is the unannotated text and the task attribute and the output consists of the instruction and the response. We use Bonito to generate synthetic tasks for seven datasets from specialized domains with unannotated text across three task types -- yes-no question answering, extractive question answering, and natural language inference -- and adapt LLMs. We show that Bonito significantly improves the average performance of pretrained and instruction tuned models over the de facto self supervised baseline. For example, adapting Mistral-Instruct-v2 and instruction tuned variants of Mistral and Llama2 with Bonito improves the strong zero-shot performance by 22.1 F1 points whereas the next word prediction objective undoes some of the benefits of instruction tuning and reduces the average performance by 0.8 F1 points. We conduct additional experiments with Bonito to understand the effects of the domain, the size of the training set, and the choice of alternative synthetic task generators. Overall, we show that learning with synthetic instruction tuning datasets is an effective way to adapt LLMs to new domains. The model, dataset, and code are available at https://github.com/BatsResearch/bonito.
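
The meta-template mechanism described in the abstract can be illustrated with a short sketch. The field names, template wording, and the helper `apply_meta_template` below are illustrative assumptions rather than the paper's exact templates; they only show how an annotated example from an existing instruction tuning dataset is remixed so that the model's input is the unannotated passage plus a task attribute and its output is the full instruction and response.

```python
# Illustrative sketch of a Bonito-style meta-template (assumed field names
# and wording; the paper's actual templates live in its released dataset).

def apply_meta_template(example: dict, task_type: str) -> dict:
    """Remix an annotated (context, question, answer) example into a
    training pair for conditional task generation: unannotated text plus
    a task attribute in, instruction plus response out."""
    model_input = (
        f"Task type: {task_type}\n"
        f"Context: {example['context']}"
    )
    model_output = (
        f"Instruction: {example['question']}\n"
        f"Response: {example['answer']}"
    )
    return {"input": model_input, "output": model_output}

# Hypothetical extractive QA record from an existing annotated dataset.
record = {
    "context": "The Amazon rainforest covers much of the Amazon basin of South America.",
    "question": "What does the Amazon rainforest cover?",
    "answer": "much of the Amazon basin of South America",
}
print(apply_meta_template(record, "extractive question answering"))
```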


Summary

  • The paper introduces Bonito, an open-source model that converts unannotated text into instruction tuning datasets, enabling zero-shot task adaptation of LLMs.
  • Bonito is trained on 1.65M examples created by remixing existing instruction tuning datasets with meta-templates, and it then generates synthetic tasks for specialized-domain corpora.
  • Empirical results show Bonito outperforms the self-supervised next-word prediction baseline by up to 39.1 F1 points across pretrained and instruction-tuned models.

Enhancing Zero-Shot Task Adaptation with Bonito: A Model for Generating Instruction Tuning Datasets

Introduction to Bonito

In the field of LLMs, zero-shot adaptation to tasks in specialized domains remains a challenge, and Bonito was developed to address it. While instruction tuning has improved LLMs' ability to generalize to unseen tasks, existing instruction tuning datasets primarily cover generic tasks, leaving a gap in models' ability to handle specialized domains. Bonito addresses this gap by automating the creation of instruction tuning datasets from unannotated text in those domains. The paper details Bonito's development and evaluates its impact on zero-shot task adaptation, positioning it as a significant step toward adapting LLMs to new domains.

Automating Conditional Task Generation

Bonito is a model for conditional task generation: it transforms unannotated text into task-specific training datasets suitable for instruction tuning, enabling zero-shot adaptation of LLMs to specialized, user-specific data with minimal human intervention. Bonito is trained on a new dataset of 1.65M examples created by applying meta-templates to existing instruction tuning datasets, and it produces synthetic tasks across multiple specialized domains and task types. Its effectiveness is reflected in the improved average performance of both pretrained and instruction-tuned models on specialized-domain tasks.
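
As a concrete illustration, the following is a minimal sketch of generating one synthetic task with the released model via Hugging Face transformers. The checkpoint name, the special-token prompt layout, and the sampling settings are assumptions drawn from the project's public repository and may differ from the actual interface; consult the linked code for exact usage.

```python
# Minimal conditional task generation sketch. The checkpoint name and the
# prompt layout below are assumptions; see the Bonito repository for the
# exact interface.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "BatsResearch/bonito-v1"  # assumed checkpoint name

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

context = "The agreement may be terminated by either party with 30 days written notice."
task_type = "natural language inference"

# Assumed layout: the model conditions on a task type and an unannotated
# passage, then generates an instruction/response pair for that passage.
prompt = f"<|tasktype|>\n{task_type}\n<|context|>\n{context}\n<|task|>\n"

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True, top_p=0.95)
generated = outputs[0][inputs["input_ids"].shape[1]:]
print(tokenizer.decode(generated, skip_special_tokens=True))
```

In practice, generating tasks over a full unannotated corpus would be batched through a high-throughput serving backend rather than called one example at a time.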

Key Contributions and Results

The paper highlights several contributions: an open-source model for conditional task generation, empirical evidence that Bonito outperforms the self-supervised baseline, and an analysis of how the domain, the size of the training set, and the choice of synthetic task generator affect results. Notably, Bonito improves performance by up to 39.1 F1 points over the self-supervised baseline, underscoring the effectiveness of synthetic instruction tuning data for zero-shot task adaptation.

Technical Details and Experimentation

The paper describes the structure and training process of the Bonito model and the setup for generating synthetic tasks. Experiments assess Bonito's impact across seven specialized-domain datasets spanning three task types: yes-no question answering, extractive question answering, and natural language inference. Across these settings, models adapted with Bonito-generated tasks consistently outperform the baselines, with substantially larger F1 gains than the self-supervised next-word prediction objective.
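
To make the adaptation step concrete, below is a minimal sketch of fine-tuning a target model on Bonito-generated instruction/response pairs using parameter-efficient LoRA adapters. The base model name, hyperparameters, and the single hand-written example are illustrative assumptions; the paper's actual configuration and evaluation protocol are described in its released code.

```python
# Sketch of adapting a target model on Bonito-generated pairs with LoRA.
# Model name, hyperparameters, and the example record are illustrative.
from datasets import Dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

BASE_MODEL = "mistralai/Mistral-7B-v0.1"  # one of the target models in the paper

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(BASE_MODEL)
model = get_peft_model(model, LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM"))

# A hand-written stand-in for a Bonito-generated NLI instruction/response pair.
synthetic = Dataset.from_list([{
    "text": "Premise: The contract may be terminated with 30 days notice.\n"
            "Hypothesis: Termination requires one month of notice.\n"
            "Answer: entailment"
}])

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

train_set = synthetic.map(tokenize, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bonito-adapted", num_train_epochs=1,
                           per_device_train_batch_size=1),
    train_dataset=train_set,
    # Causal LM collator copies input_ids into labels for next-token loss.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

After training, the adapted model is evaluated zero-shot on held-out annotated tasks from the same domain, which is how the F1 improvements reported in the paper are measured.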

Impact and Theoretical Implications

The introduction of Bonito is positioned as a significant advancement in the field of LLMs, primarily for its potential to democratize access to specialized domain models. By enabling effective zero-shot task adaptation without extensive annotated datasets, Bonito paves the way for more tailored and accessible LLM applications across diverse fields. Theoretically, this research contributes to a deeper understanding of instruction tuning's potential when paired with innovative dataset generation methods, challenging existing paradigms of model training and adaptation.

Future Directions

The paper speculates on future advancements in AI, hinting at the exploration of more nuanced task generation models and improved methodologies for synthetic dataset creation. The adaptability of Bonito to varied domains and task types opens avenues for further research into domain-specific LLM adaptations, potentially revolutionizing how models are trained for specialized applications.

Limitations and Ethical Considerations

Acknowledging its limitations, the paper notes the reliance on substantial amounts of unannotated domain text and that the findings are demonstrated on three task types across seven datasets. It also highlights potential risks from model biases and from generating factually incorrect synthetic data, stressing the need for ethical considerations when deploying models like Bonito.

Conclusion

Bonito represents a meaningful leap towards enhancing the adaptability of LLMs to specialized domains through the automated generation of instruction tuning datasets. Its proven effectiveness in improving zero-shot task adaptation performance not only marks it as a significant contribution to the field but also lays the groundwork for future explorations aimed at further refining the capabilities of generative AI models.