LLM2LLM: Boosting LLMs with Novel Iterative Data Enhancement (2403.15042v2)

Published 22 Mar 2024 in cs.CL

Abstract: Pretrained LLMs are currently state-of-the-art for solving the vast majority of natural language processing tasks. While many real-world applications still require fine-tuning to reach satisfactory levels of performance, many of them are in the low-data regime, making fine-tuning challenging. To address this, we propose LLM2LLM, a targeted and iterative data augmentation strategy that uses a teacher LLM to enhance a small seed dataset by augmenting additional data that can be used for fine-tuning on a specific task. LLM2LLM (1) fine-tunes a baseline student LLM on the initial seed data, (2) evaluates and extracts data points that the model gets wrong, and (3) uses a teacher LLM to generate synthetic data based on these incorrect data points, which are then added back into the training data. This approach amplifies the signal from incorrectly predicted data points by the LLM during training and reintegrates them into the dataset to focus on more challenging examples for the LLM. Our results show that LLM2LLM significantly enhances the performance of LLMs in the low-data regime, outperforming both traditional fine-tuning and other data augmentation baselines. LLM2LLM reduces the dependence on labor-intensive data curation and paves the way for more scalable and performant LLM solutions, allowing us to tackle data-constrained domains and tasks. We achieve improvements up to 24.2% on the GSM8K dataset, 32.6% on CaseHOLD, 32.0% on SNIPS, 52.6% on TREC and 39.8% on SST-2 over regular fine-tuning in the low-data regime using a Llama-2-7B student model. Our code is available at https://github.com/SqueezeAILab/LLM2LLM .

Enhancing LLMs in Low-Data Regimes through Iterative Data Augmentation

Introduction

LLMs have emerged as versatile tools for a wide array of NLP tasks. Their application in specialized or data-scarce settings remains challenging, however, largely because conventional fine-tuning is ineffective when task-specific data is limited. To address this, the paper introduces LLM2LLM, a targeted and iterative data augmentation method that boosts LLM performance in low-data regimes by generating synthetic data focused on the areas where the model is weakest.

LLM2LLM Framework

LLM2LLM uses a teacher-student setup. The process begins by fine-tuning a baseline student LLM on the available seed data and then evaluating it to identify the data points it predicts incorrectly. The key step is prompting a teacher LLM to generate synthetic examples modeled on these failure cases; this targeted data is then added to the student's training set. Applied iteratively, the method concentrates training on progressively more challenging examples, improving the student's performance.
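
A minimal Python sketch of this loop is shown below, assuming three placeholder callables (fine_tune, is_correct, teacher_augment) that stand in for student training, answer checking, and teacher-side generation; the names are illustrative and do not come from the authors' released code.

```python
from typing import Any, Callable, List

def llm2llm_loop(
    fine_tune: Callable[[List[Any]], Any],        # trains a fresh student on the given data
    is_correct: Callable[[Any, Any], bool],       # checks the student's answer on one example
    teacher_augment: Callable[[Any], List[Any]],  # teacher LLM writes new examples modeled on a failure
    seed_data: List[Any],
    num_iterations: int = 3,
) -> Any:
    """Sketch of the LLM2LLM teacher-student cycle; helper callables are hypothetical placeholders."""
    train_data = list(seed_data)
    student = fine_tune(train_data)
    for _ in range(num_iterations):
        # Evaluate on the seed set and keep only the examples the student still gets wrong.
        wrong = [ex for ex in seed_data if not is_correct(student, ex)]
        if not wrong:
            break
        # The teacher generates targeted synthetic examples from each failure case,
        # and these are folded back into the training pool.
        for ex in wrong:
            train_data.extend(teacher_augment(ex))
        # Re-train the student on the enlarged dataset, restarting from the base
        # checkpoint rather than continuing from the previous student (see the
        # ablation discussion below).
        student = fine_tune(train_data)
    return student
```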

Empirical Evaluations

The effectiveness of LLM2LLM was evaluated on several datasets: GSM8K, CaseHOLD, SNIPS, TREC, and SST-2, chosen for their diversity in task type and complexity, spanning mathematical reasoning, text classification, and sentiment analysis. With a Llama-2-7B student, LLM2LLM improved over regular fine-tuning by up to 24.2% on GSM8K, 32.6% on CaseHOLD, 32.0% on SNIPS, 52.6% on TREC, and 39.8% on SST-2. These gains were most pronounced when the initial seed data was smallest, underlining LLM2LLM's value in data-sparse settings.

Comparative Analysis

LLM2LLM was benchmarked against several data augmentation baselines, including traditional fine-tuning, Easy Data Augmentation (EDA), and AugGPT. Across all datasets, LLM2LLM outperformed these methods, indicating that it generates more effective and task-relevant training data. A further study of teacher model choice showed that the quality of the generated data, and with it the size of the performance gains, depends on the teacher's capabilities.
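
For context on these baselines, EDA perturbs existing examples at the token level rather than generating new, targeted ones. The sketch below shows two of its four operations, random swap and random deletion (synonym replacement and random insertion additionally require a thesaurus such as WordNet); it is a simplified illustration rather than the reference EDA implementation, and the example sentence is invented.

```python
import random

def random_swap(tokens: list, n: int = 1) -> list:
    """EDA-style random swap: exchange two randomly chosen token positions, n times."""
    out = tokens[:]
    for _ in range(n):
        if len(out) < 2:
            break
        i, j = random.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out

def random_deletion(tokens: list, p: float = 0.1) -> list:
    """EDA-style random deletion: drop each token with probability p, keeping at least one."""
    kept = [t for t in tokens if random.random() > p]
    return kept if kept else [random.choice(tokens)]

# Label-preserving surface perturbations of a single (invented) intent-classification utterance.
sentence = "book a table for two at the italian place".split()
print(random_swap(sentence))
print(random_deletion(sentence))
```

Such perturbations are untargeted, which is one plausible reason they trail LLM2LLM's failure-driven generation in these experiments.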

Ablation Studies

Through a series of ablation studies, the paper delineates the impact of core components and design choices within the LLM2LLM framework. These studies confirmed the necessity of the iterative nature of data generation and the specific focus on augmenting based on incorrectly predicted examples. Moreover, the decision to periodically reset the student model before each fine-tuning phase was shown to prevent overfitting and facilitate more robust learning across iterations.
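
As a small illustration of that design choice, the toggle below contrasts the two training regimes from the ablation; the fine_tune callable and all names here are hypothetical stand-ins, not code from the paper.

```python
from typing import Any, Callable, List

def train_student(
    fine_tune: Callable[[Any, List[Any]], Any],  # (starting checkpoint, data) -> trained student
    base_checkpoint: Any,
    train_data: List[Any],
    prev_student: Any = None,
    reset: bool = True,
) -> Any:
    """Illustrative toggle for the reset ablation; names are not taken from the paper's code."""
    # reset=True: restart every iteration from the original base weights, the variant the
    # ablations favor because it avoids compounding overfitting to earlier synthetic batches.
    # reset=False: continue fine-tuning the previous iteration's student instead.
    start = base_checkpoint if reset or prev_student is None else prev_student
    return fine_tune(start, train_data)
```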

Implications and Future Directions

LLM2LLM presents a promising avenue for enhancing LLMs in specialized or data-sparse environments. By generating targeted synthetic data, it alleviates the need for extensive and potentially costly data collection efforts. Additionally, the iterative approach ensures that the model's evolving capabilities are continually matched with new and appropriately challenging data, fostering more effective learning. Looking ahead, further research could explore the integration of LLM2LLM with other model adaptation and data augmentation techniques, potentially opening up new realms of application for LLMs across diverse domains.

References (72)
  1. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
  2. Palm 2 technical report, 2023.
  3. Ext5: Towards extreme multi-task scaling for transfer learning. In International Conference on Learning Representations, 2021.
  4. Synthetic and natural noise both break neural machine translation. arXiv preprint arXiv:1711.02173, 2017.
  5. Weak-to-strong generalization: Eliciting strong capabilities with weak supervision, 2023.
  6. Instruction mining: When data mining meets large language model finetuning, 2023.
  7. Alpagasus: Training a better alpaca with fewer data, 2023.
  8. Self-play fine-tuning converts weak language models to strong language models. arXiv preprint arXiv:2401.01335, 2024.
  9. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, March 2023.
  10. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416, 2022.
  11. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
  12. Snips voice platform: an embedded spoken language understanding system for private-by-design voice interfaces. arXiv preprint arXiv:1805.10190, 2018.
  13. Claude Coulombe. Text data augmentation made simple by leveraging nlp cloud apis. arXiv preprint arXiv:1812.04718, 2018.
  14. Auggpt: Leveraging chatgpt for text data augmentation, 2023.
  15. Rephrase and respond: Let large language models ask better questions for themselves, 2023.
  16. GPT-4 Turbo vs. GPT-4 comparison. https://github.com/da03/implicit_chain_of_thought/tree/main/gpt4_baselines, 2023.
  17. Jon Durbin. Jondurbin/airoboros-l2-70b-3.1.2 · hugging face, Oct 2023.
  18. Using gpt-4 to augment unbalanced data for automatic scoring, 2023.
  19. A survey of data augmentation approaches for NLP. In Chengqing Zong, Fei Xia, Wenjie Li, and Roberto Navigli, editors, Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 968–988, Online, August 2021. Association for Computational Linguistics.
  20. Chain-of-thought hub: A continuous effort to measure large language models’ reasoning performance, 2023.
  21. Specializing smaller language models towards multi-step reasoning. arXiv preprint arXiv:2301.12726, 2023.
  22. Koala: A dialogue model for academic research. Blog post, April 2023.
  23. Reinforced self-training (rest) for language modeling. arXiv preprint arXiv:2308.08998, 2023.
  24. Language models can teach themselves to program better, 2023.
  25. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020.
  26. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
  27. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
  28. Knowledge-augmented reasoning distillation for small language models in knowledge-intensive tasks, 2023.
  29. Data augmentation using pre-trained transformer models. In Proceedings of the 2nd Workshop on Life-long Learning for Spoken Language Systems, pages 18–26, 2020.
  30. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in Neural Information Processing Systems, 33:9459–9474, 2020.
  31. Self-alignment with instruction backtranslation. arXiv preprint arXiv:2308.06259, 2023.
  32. Learning question classifiers. In COLING 2002: The 19th International Conference on Computational Linguistics, 2002.
  33. Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, 2004.
  34. TinyGSM: achieving >80% on GSM8K with small language models, 2023.
  35. Lost in the middle: How language models use long contexts. arXiv preprint arXiv:2307.03172, 2023.
  36. The flan collection: designing data and methods for effective instruction tuning. In Proceedings of the 40th International Conference on Machine Learning, ICML’23. JMLR.org, 2023.
  37. Self-refine: Iterative refinement with self-feedback. arXiv preprint arXiv:2303.17651, 2023.
  38. Peft: State-of-the-art parameter-efficient fine-tuning methods. https://github.com/huggingface/peft, 2022.
  39. Cross-task generalization via natural language crowdsourcing instructions. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3470–3487, 2022.
  40. Orca 2: Teaching small language models how to reason. arXiv preprint arXiv:2311.11045, 2023.
  41. Crosslingual generalization through multitask finetuning. In Annual Meeting of the Association for Computational Linguistics, 2023.
  42. Orca: Progressive learning from complex explanation traces of gpt-4. arXiv preprint arXiv:2306.02707, 2023.
  43. Can generalist foundation models outcompete special-purpose tuning? case study in medicine. arXiv preprint arXiv:2311.16452, 2023.
  44. Training language models to follow instructions with human feedback (InstructGPT), 2022.
  45. Rephrase, augment, reason: Visual grounding of questions for vision-language models, 2023.
  46. Arthur L Samuel. Some studies in machine learning using the game of checkers. IBM Journal of research and development, 44(1.2):206–226, 2000.
  47. Multitask prompted training enables zero-shot task generalization. In International Conference on Learning Representations, 2021.
  48. Beyond human data: Scaling self-training for problem-solving with language models. arXiv preprint arXiv:2312.06585, 2023.
  49. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 conference on empirical methods in natural language processing, pages 1631–1642, 2013.
  50. Alpaca: A strong, replicable instruction-following model. Stanford Center for Research on Foundation Models. https://crfm.stanford.edu/2023/03/13/alpaca.html, 3(6):7, 2023.
  51. Gerald Tesauro et al. Temporal difference learning and td-gammon. Communications of the ACM, 38(3):58–68, 1995.
  52. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
  53. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
  54. Zeroshotdataaug: Generating and augmenting training data with chatgpt. arXiv preprint arXiv:2304.14334, 2023.
  55. Self-instruct: Aligning language models with self-generated instructions. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors, Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 13484–13508, Toronto, Canada, July 2023. Association for Computational Linguistics.
  56. Super-naturalinstructions: Generalization via declarative instructions on 1600+ nlp tasks. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 5085–5109, 2022.
  57. Knowda: All-in-one knowledge mixture model for data augmentation in few-shot nlp. arXiv preprint arXiv:2206.10265, 2022.
  58. Finetuned language models are zero-shot learners. In International Conference on Learning Representations, 2021.
  59. EDA: Easy data augmentation techniques for boosting performance on text classification tasks. In Kentaro Inui, Jing Jiang, Vincent Ng, and Xiaojun Wan, editors, Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 6382–6388, Hong Kong, China, November 2019. Association for Computational Linguistics.
  60. Instructiongpt-4: A 200-instruction paradigm for fine-tuning minigpt-4, 2023.
  61. Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244, 2023.
  62. Zeroprompt: Scaling prompt-based pretraining to 1,000 tasks improves zero-shot generalization. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 4235–4252, 2022.
  63. Gpt3mix: Leveraging large-scale language models for text augmentation. arXiv preprint arXiv:2104.08826, 2021.
  64. Metamath: Bootstrap your own mathematical questions for large language models, 2023.
  65. Self-rewarding language models. arXiv preprint arXiv:2401.10020, 2024.
  66. Self-taught optimizer (stop): Recursively self-improving code generation, 2023.
  67. Star: Bootstrapping reasoning with reasoning, 2022.
  68. mixup: Beyond empirical risk minimization. In International Conference on Learning Representations, 2018.
  69. When does pretraining help? assessing self-supervised learning for law and the casehold dataset. In Proceedings of the 18th International Conference on Artificial Intelligence and Law. Association for Computing Machinery, 2021.
  70. Agieval: A human-centric benchmark for evaluating foundation models. arXiv preprint arXiv:2304.06364, 2023.
  71. Lima: Less is more for alignment, 2023.
  72. Minigpt-4: Enhancing vision-language understanding with advanced large language models, 2023.
Authors (9)
  1. Nicholas Lee (29 papers)
  2. Thanakul Wattanawong (2 papers)
  3. Sehoon Kim (30 papers)
  4. Karttikeya Mangalam (32 papers)
  5. Sheng Shen (68 papers)
  6. Michael W. Mahoney (233 papers)
  7. Kurt Keutzer (199 papers)
  8. Amir Gholami (60 papers)
  9. Gopala Anumanchipalli (30 papers)