LLM2LLM: Boosting LLMs with Novel Iterative Data Enhancement (2403.15042v2)

Published 22 Mar 2024 in cs.CL

Abstract: Pretrained LLMs are currently state-of-the-art for solving the vast majority of natural language processing tasks. While many real-world applications still require fine-tuning to reach satisfactory levels of performance, many of them are in the low-data regime, making fine-tuning challenging. To address this, we propose LLM2LLM, a targeted and iterative data augmentation strategy that uses a teacher LLM to enhance a small seed dataset by augmenting additional data that can be used for fine-tuning on a specific task. LLM2LLM (1) fine-tunes a baseline student LLM on the initial seed data, (2) evaluates and extracts data points that the model gets wrong, and (3) uses a teacher LLM to generate synthetic data based on these incorrect data points, which are then added back into the training data. This approach amplifies the signal from incorrectly predicted data points by the LLM during training and reintegrates them into the dataset to focus on more challenging examples for the LLM. Our results show that LLM2LLM significantly enhances the performance of LLMs in the low-data regime, outperforming both traditional fine-tuning and other data augmentation baselines. LLM2LLM reduces the dependence on labor-intensive data curation and paves the way for more scalable and performant LLM solutions, allowing us to tackle data-constrained domains and tasks. We achieve improvements up to 24.2% on the GSM8K dataset, 32.6% on CaseHOLD, 32.0% on SNIPS, 52.6% on TREC and 39.8% on SST-2 over regular fine-tuning in the low-data regime using a Llama-2-7B student model. Our code is available at https://github.com/SqueezeAILab/LLM2LLM .

Enhancing LLMs in Low-Data Regimes through Iterative Data Augmentation

Introduction

LLMs have emerged as versatile tools for a wide array of NLP tasks. Their application in specialized or data-scarce settings remains challenging, however, largely because conventional fine-tuning is ineffective when task-specific data is limited. To address this, the paper introduces LLM2LLM, a targeted and iterative data augmentation method that boosts LLM performance in low-data regimes by generating synthetic data focused on the areas where the model is weakest.

LLM2LLM Framework

LLM2LLM uses a teacher-student setup. The process begins by fine-tuning a baseline student LLM on the available seed data and then evaluating it to identify the data points it predicts incorrectly. The key step is prompting a teacher LLM to generate synthetic examples modeled on these failure cases; this targeted data is then added to the student's training set. Applied iteratively, the method concentrates training on progressively more challenging examples, improving the student's performance.
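
A minimal Python sketch of this loop is shown below, assuming three placeholder callables (fine_tune, is_correct, teacher_augment) that stand in for student training, answer checking, and teacher-side generation; the names are illustrative and do not come from the authors' released code.

```python
from typing import Any, Callable, List

def llm2llm_loop(
    fine_tune: Callable[[List[Any]], Any],        # trains a fresh student on the given data
    is_correct: Callable[[Any, Any], bool],       # checks the student's answer on one example
    teacher_augment: Callable[[Any], List[Any]],  # teacher LLM writes new examples modeled on a failure
    seed_data: List[Any],
    num_iterations: int = 3,
) -> Any:
    """Sketch of the LLM2LLM teacher-student cycle; helper callables are hypothetical placeholders."""
    train_data = list(seed_data)
    student = fine_tune(train_data)
    for _ in range(num_iterations):
        # Evaluate on the seed set and keep only the examples the student still gets wrong.
        wrong = [ex for ex in seed_data if not is_correct(student, ex)]
        if not wrong:
            break
        # The teacher generates targeted synthetic examples from each failure case,
        # and these are folded back into the training pool.
        for ex in wrong:
            train_data.extend(teacher_augment(ex))
        # Re-train the student on the enlarged dataset, restarting from the base
        # checkpoint rather than continuing from the previous student (see the
        # ablation discussion below).
        student = fine_tune(train_data)
    return student
```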

Empirical Evaluations

The effectiveness of LLM2LLM was evaluated on several datasets: GSM8K, CaseHOLD, SNIPS, TREC, and SST-2, chosen for their diversity in task type and complexity, spanning mathematical reasoning, text classification, and sentiment analysis. With a Llama-2-7B student, LLM2LLM improved over regular fine-tuning by up to 24.2% on GSM8K, 32.6% on CaseHOLD, 32.0% on SNIPS, 52.6% on TREC, and 39.8% on SST-2. These gains were most pronounced when the initial seed data was smallest, underlining LLM2LLM's value in data-sparse settings.

Comparative Analysis

LLM2LLM was benchmarked against several data augmentation baselines, including traditional fine-tuning, Easy Data Augmentation (EDA), and AugGPT. Across all datasets, LLM2LLM outperformed these methods, indicating that it generates more effective and task-relevant training data. A further study of teacher model choice showed that the quality of the generated data, and with it the size of the performance gains, depends on the teacher's capabilities.
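
For context on these baselines, EDA perturbs existing examples at the token level rather than generating new, targeted ones. The sketch below shows two of its four operations, random swap and random deletion (synonym replacement and random insertion additionally require a thesaurus such as WordNet); it is a simplified illustration rather than the reference EDA implementation, and the example sentence is invented.

```python
import random

def random_swap(tokens: list, n: int = 1) -> list:
    """EDA-style random swap: exchange two randomly chosen token positions, n times."""
    out = tokens[:]
    for _ in range(n):
        if len(out) < 2:
            break
        i, j = random.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out

def random_deletion(tokens: list, p: float = 0.1) -> list:
    """EDA-style random deletion: drop each token with probability p, keeping at least one."""
    kept = [t for t in tokens if random.random() > p]
    return kept if kept else [random.choice(tokens)]

# Label-preserving surface perturbations of a single (invented) intent-classification utterance.
sentence = "book a table for two at the italian place".split()
print(random_swap(sentence))
print(random_deletion(sentence))
```

Such perturbations are untargeted, which is one plausible reason they trail LLM2LLM's failure-driven generation in these experiments.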

Ablation Studies

Through a series of ablation studies, the paper delineates the impact of core components and design choices within the LLM2LLM framework. These studies confirmed the necessity of the iterative nature of data generation and the specific focus on augmenting based on incorrectly predicted examples. Moreover, the decision to periodically reset the student model before each fine-tuning phase was shown to prevent overfitting and facilitate more robust learning across iterations.
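
As a small illustration of that design choice, the toggle below contrasts the two training regimes from the ablation; the fine_tune callable and all names here are hypothetical stand-ins, not code from the paper.

```python
from typing import Any, Callable, List

def train_student(
    fine_tune: Callable[[Any, List[Any]], Any],  # (starting checkpoint, data) -> trained student
    base_checkpoint: Any,
    train_data: List[Any],
    prev_student: Any = None,
    reset: bool = True,
) -> Any:
    """Illustrative toggle for the reset ablation; names are not taken from the paper's code."""
    # reset=True: restart every iteration from the original base weights, the variant the
    # ablations favor because it avoids compounding overfitting to earlier synthetic batches.
    # reset=False: continue fine-tuning the previous iteration's student instead.
    start = base_checkpoint if reset or prev_student is None else prev_student
    return fine_tune(start, train_data)
```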

Implications and Future Directions

LLM2LLM presents a promising avenue for enhancing LLMs in specialized or data-sparse environments. By generating targeted synthetic data, it alleviates the need for extensive and potentially costly data collection efforts. Additionally, the iterative approach ensures that the model's evolving capabilities are continually matched with new and appropriately challenging data, fostering more effective learning. Looking ahead, further research could explore the integration of LLM2LLM with other model adaptation and data augmentation techniques, potentially opening up new realms of application for LLMs across diverse domains.

References (72)
  1. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
  2. Palm 2 technical report, 2023.
  3. Ext5: Towards extreme multi-task scaling for transfer learning. In International Conference on Learning Representations, 2021.
  4. Synthetic and natural noise both break neural machine translation. arXiv preprint arXiv:1711.02173, 2017.
  5. Weak-to-strong generalization: Eliciting strong capabilities with weak supervision, 2023.
  6. Instruction mining: When data mining meets large language model finetuning, 2023.
  7. Alpagasus: Training a better alpaca with fewer data, 2023.
  8. Self-play fine-tuning converts weak language models to strong language models. arXiv preprint arXiv:2401.01335, 2024.
  9. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, March 2023.
  10. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416, 2022.
  11. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
  12. Snips voice platform: an embedded spoken language understanding system for private-by-design voice interfaces. arXiv preprint arXiv:1805.10190, 2018.
  13. Claude Coulombe. Text data augmentation made simple by leveraging nlp cloud apis. arXiv preprint arXiv:1812.04718, 2018.
  14. Auggpt: Leveraging chatgpt for text data augmentation, 2023.
  15. Rephrase and respond: Let large language models ask better questions for themselves, 2023.
  16. GPT-4 Turbo vs. GPT-4 comparison. https://github.com/da03/implicit_chain_of_thought/tree/main/gpt4_baselines, 2023.
  17. Jon Durbin. Jondurbin/airoboros-l2-70b-3.1.2 · hugging face, Oct 2023.
  18. Using gpt-4 to augment unbalanced data for automatic scoring, 2023.
  19. A survey of data augmentation approaches for NLP. In Chengqing Zong, Fei Xia, Wenjie Li, and Roberto Navigli, editors, Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 968–988, Online, August 2021. Association for Computational Linguistics.
  20. Chain-of-thought hub: A continuous effort to measure large language models’ reasoning performance, 2023.
  21. Specializing smaller language models towards multi-step reasoning. arXiv preprint arXiv:2301.12726, 2023.
  22. Koala: A dialogue model for academic research. Blog post, April 2023.
  23. Reinforced self-training (rest) for language modeling. arXiv preprint arXiv:2308.08998, 2023.
  24. Language models can teach themselves to program better, 2023.
  25. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020.
  26. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
  27. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
  28. Knowledge-augmented reasoning distillation for small language models in knowledge-intensive tasks, 2023.
  29. Data augmentation using pre-trained transformer models. In Proceedings of the 2nd Workshop on Life-long Learning for Spoken Language Systems, pages 18–26, 2020.
  30. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in Neural Information Processing Systems, 33:9459–9474, 2020.
  31. Self-alignment with instruction backtranslation. arXiv preprint arXiv:2308.06259, 2023.
  32. Learning question classifiers. In COLING 2002: The 19th International Conference on Computational Linguistics, 2002.
  33. Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, 2004.
  34. TinyGSM: achieving >80% on GSM8K with small language models, 2023.
  35. Lost in the middle: How language models use long contexts. arXiv preprint arXiv:2307.03172, 2023.
  36. The flan collection: designing data and methods for effective instruction tuning. In Proceedings of the 40th International Conference on Machine Learning, ICML’23. JMLR.org, 2023.
  37. Self-refine: Iterative refinement with self-feedback. arXiv preprint arXiv:2303.17651, 2023.
  38. Peft: State-of-the-art parameter-efficient fine-tuning methods. https://github.com/huggingface/peft, 2022.
  39. Cross-task generalization via natural language crowdsourcing instructions. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3470–3487, 2022.
  40. Orca 2: Teaching small language models how to reason. arXiv preprint arXiv:2311.11045, 2023.
  41. Crosslingual generalization through multitask finetuning. In Annual Meeting of the Association for Computational Linguistics, 2023.
  42. Orca: Progressive learning from complex explanation traces of gpt-4. arXiv preprint arXiv:2306.02707, 2023.
  43. Can generalist foundation models outcompete special-purpose tuning? case study in medicine. arXiv preprint arXiv:2311.16452, 2023.
  44. Training language models to follow instructions with human feedback (InstructGPT), 2022.
  45. Rephrase, augment, reason: Visual grounding of questions for vision-language models, 2023.
  46. Arthur L Samuel. Some studies in machine learning using the game of checkers. IBM Journal of research and development, 44(1.2):206–226, 2000.
  47. Multitask prompted training enables zero-shot task generalization. In International Conference on Learning Representations, 2021.
  48. Beyond human data: Scaling self-training for problem-solving with language models. arXiv preprint arXiv:2312.06585, 2023.
  49. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 conference on empirical methods in natural language processing, pages 1631–1642, 2013.
  50. Alpaca: A strong, replicable instruction-following model. Stanford Center for Research on Foundation Models. https://crfm.stanford.edu/2023/03/13/alpaca.html, 3(6):7, 2023.
  51. Gerald Tesauro et al. Temporal difference learning and td-gammon. Communications of the ACM, 38(3):58–68, 1995.
  52. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
  53. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
  54. Zeroshotdataaug: Generating and augmenting training data with chatgpt. arXiv preprint arXiv:2304.14334, 2023.
  55. Self-instruct: Aligning language models with self-generated instructions. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors, Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 13484–13508, Toronto, Canada, July 2023. Association for Computational Linguistics.
  56. Super-naturalinstructions: Generalization via declarative instructions on 1600+ nlp tasks. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 5085–5109, 2022.
  57. Knowda: All-in-one knowledge mixture model for data augmentation in few-shot nlp. arXiv preprint arXiv:2206.10265, 2022.
  58. Finetuned language models are zero-shot learners. In International Conference on Learning Representations, 2021.
  59. EDA: Easy data augmentation techniques for boosting performance on text classification tasks. In Kentaro Inui, Jing Jiang, Vincent Ng, and Xiaojun Wan, editors, Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 6382–6388, Hong Kong, China, November 2019. Association for Computational Linguistics.
  60. Instructiongpt-4: A 200-instruction paradigm for fine-tuning minigpt-4, 2023.
  61. Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244, 2023.
  62. Zeroprompt: Scaling prompt-based pretraining to 1,000 tasks improves zero-shot generalization. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 4235–4252, 2022.
  63. Gpt3mix: Leveraging large-scale language models for text augmentation. arXiv preprint arXiv:2104.08826, 2021.
  64. Metamath: Bootstrap your own mathematical questions for large language models, 2023.
  65. Self-rewarding language models. arXiv preprint arXiv:2401.10020, 2024.
  66. Self-taught optimizer (stop): Recursively self-improving code generation, 2023.
  67. Star: Bootstrapping reasoning with reasoning, 2022.
  68. mixup: Beyond empirical risk minimization. In International Conference on Learning Representations, 2018.
  69. When does pretraining help? assessing self-supervised learning for law and the casehold dataset. In Proceedings of the 18th International Conference on Artificial Intelligence and Law. Association for Computing Machinery, 2021.
  70. Agieval: A human-centric benchmark for evaluating foundation models. arXiv preprint arXiv:2304.06364, 2023.
  71. Lima: Less is more for alignment, 2023.
  72. Minigpt-4: Enhancing vision-language understanding with advanced large language models, 2023.
Authors (9)
  1. Nicholas Lee (29 papers)
  2. Thanakul Wattanawong (2 papers)
  3. Sehoon Kim (30 papers)
  4. Karttikeya Mangalam (32 papers)
  5. Sheng Shen (68 papers)
  6. Michael W. Mahoney (233 papers)
  7. Kurt Keutzer (199 papers)
  8. Amir Gholami (60 papers)
  9. Gopala Anumanchipalli (30 papers)