Not All LLM-Generated Data Are Equal: Rethinking Data Weighting in Text Classification (2410.21526v2)
Abstract: Synthetic data augmentation via LLMs allows researchers to leverage additional training data, thereby enhancing performance on downstream tasks, especially when real-world data is scarce. However, the generated data can deviate from real-world data, and this misalignment can degrade performance when the trained model is applied in practice. Therefore, we propose efficient weighted-loss approaches that align synthetic data with the real-world distribution by emphasizing high-quality and diverse LLM-generated data, using only a small amount of real-world data. We empirically assess the effectiveness of our method on multiple text classification tasks; the results show that applying our approaches to a BERT-level model robustly outperforms standard cross-entropy and other data-weighting approaches, providing a practical way to leverage synthetic data from any suitable data generator for model training.
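To make the idea of a weighted loss over synthetic examples concrete, here is a minimal PyTorch sketch. It is illustrative only: the specific weighting scheme (scoring each synthetic example with a reference model fine-tuned on the small real-world set, then normalizing weights within the batch) and the names `weighted_cross_entropy`, `example_weights_from_reference`, and `ref_model` are assumptions for this sketch, not the paper's exact formulation.

```python
# Minimal sketch of per-example weighted cross-entropy for synthetic training data.
# ASSUMPTION: weights come from a reference model tuned on a little real-world data;
# this is an illustrative choice, not the paper's exact weighting scheme.
import torch
import torch.nn.functional as F


def weighted_cross_entropy(logits: torch.Tensor,
                           targets: torch.Tensor,
                           weights: torch.Tensor) -> torch.Tensor:
    """Cross-entropy combined with per-example weights (weights sum to 1 per batch)."""
    per_example_loss = F.cross_entropy(logits, targets, reduction="none")
    return (weights * per_example_loss).sum()


def example_weights_from_reference(ref_logits: torch.Tensor,
                                   targets: torch.Tensor) -> torch.Tensor:
    """Hypothetical weighting: treat the reference model's probability of the
    synthetic label as a quality score, normalized across the batch."""
    probs = ref_logits.softmax(dim=-1)
    quality = probs.gather(1, targets.unsqueeze(1)).squeeze(1)
    return quality / quality.sum()


# Usage sketch: `model` is the classifier trained on LLM-generated data,
# `ref_model` a copy fine-tuned on the small real-world set (both hypothetical here).
# logits = model(input_ids).logits
# with torch.no_grad():
#     ref_logits = ref_model(input_ids).logits
# weights = example_weights_from_reference(ref_logits, labels)
# loss = weighted_cross_entropy(logits, labels, weights)
```

The design intent is simply that synthetic examples the small real-world signal deems plausible contribute more to the gradient than off-distribution ones; any comparable quality or diversity score could replace the reference-model probability.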