
Synthetic Data Generation with Large Language Models for Text Classification: Potential and Limitations (2310.07849v2)

Published 11 Oct 2023 in cs.CL and cs.AI

Abstract: The collection and curation of high-quality training data is crucial for developing text classification models with superior performance, but it is often associated with significant costs and time investment. Researchers have recently explored using LLMs to generate synthetic datasets as an alternative approach. However, the effectiveness of LLM-generated synthetic data in supporting model training is inconsistent across different classification tasks. To better understand the factors that moderate the effectiveness of LLM-generated synthetic data, in this study we examine how the performance of models trained on such synthetic data varies with the subjectivity of the classification task. Our results indicate that subjectivity, at both the task level and the instance level, is negatively associated with the performance of models trained on synthetic data. We conclude by discussing the implications of our work for the potential and limitations of leveraging LLMs for synthetic data generation.
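The pipeline the abstract describes can be sketched as follows. This is a minimal illustration, not the authors' exact setup: the prompt template is an assumption, and the `generate` function is a placeholder that in practice would call an LLM API to produce candidate examples for each class label.

```python
import random

def build_prompt(label: str, task: str, n: int) -> str:
    """Construct a zero-shot prompt asking an LLM for n labeled examples.
    The template wording is illustrative, not from the paper."""
    return (
        f"Generate {n} diverse examples of {task} text "
        f"that would be labeled '{label}'. One example per line."
    )

def generate(prompt: str) -> list[str]:
    # Placeholder standing in for an LLM API call; returns canned text
    # so the sketch runs without network access.
    return [f"synthetic example {i}" for i in range(3)]

def make_synthetic_dataset(task: str, labels: list[str], per_label: int):
    """Collect (text, label) pairs to train a downstream classifier."""
    dataset = []
    for label in labels:
        prompt = build_prompt(label, task, per_label)
        for text in generate(prompt):
            dataset.append((text, label))
    random.shuffle(dataset)
    return dataset

dataset = make_synthetic_dataset(
    "movie review sentiment", ["positive", "negative"], per_label=3
)
```

The resulting `dataset` would then be used in place of (or alongside) human-labeled data to train a text classifier; the paper's finding is that this substitution works less well as the classification task becomes more subjective.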

Authors (4)
  1. Zhuoyan Li (7 papers)
  2. Hangxiao Zhu (3 papers)
  3. Zhuoran Lu (7 papers)
  4. Ming Yin (70 papers)
Citations (39)