
Data Generation Using Large Language Models for Text Classification: An Empirical Case Study (2407.12813v2)

Published 27 Jun 2024 in cs.CL and cs.AI

Abstract: Using LLMs to generate synthetic data for model training has become increasingly popular in recent years. While LLMs are capable of producing realistic training data, the effectiveness of data generation is influenced by various factors, including the choice of prompt, task complexity, and the quality, quantity, and diversity of the generated data. In this work, we focus exclusively on using synthetic data for text classification tasks. Specifically, we use natural language understanding (NLU) models trained on synthetic data to assess the quality of synthetic data from different generation approaches. This work provides an empirical analysis of the impact of these factors and offers recommendations for better data generation practices.
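The workflow the abstract describes — prompt an LLM for labeled examples, then train a small classification model on that synthetic data and use its held-out accuracy as a proxy for data quality — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `generate_synthetic` stub stands in for an actual LLM call, and the from-scratch Naive Bayes classifier stands in for the NLU models the authors train.

```python
from collections import Counter, defaultdict
import math

def generate_synthetic(label, n):
    # Hypothetical stand-in for an LLM call. In a real pipeline this would
    # send a prompt such as "Write a {label} movie review" to an LLM and
    # collect the generations as labeled training examples.
    canned = {
        "positive": ["a wonderful heartfelt film",
                     "great acting and a great script",
                     "loved every wonderful minute"],
        "negative": ["a dull lifeless bore",
                     "bad acting and a bad script",
                     "hated every dull minute"],
    }
    return [(canned[label][i % len(canned[label])], label) for i in range(n)]

class NaiveBayes:
    """Bag-of-words Naive Bayes with add-one smoothing."""
    def __init__(self):
        self.word_counts = defaultdict(Counter)
        self.label_counts = Counter()
        self.vocab = set()

    def fit(self, examples):
        for text, label in examples:
            self.label_counts[label] += 1
            for w in text.split():
                self.word_counts[label][w] += 1
                self.vocab.add(w)

    def predict(self, text):
        best, best_lp = None, float("-inf")
        total = sum(self.label_counts.values())
        for label in self.label_counts:
            lp = math.log(self.label_counts[label] / total)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for w in text.split():
                lp += math.log((self.word_counts[label][w] + 1) / denom)
            if lp > best_lp:
                best, best_lp = label, lp
        return best

# Train on synthetic data only, then evaluate on held-out "real" examples;
# the resulting accuracy serves as a proxy for synthetic-data quality.
train = generate_synthetic("positive", 3) + generate_synthetic("negative", 3)
model = NaiveBayes()
model.fit(train)

test = [("a wonderful script", "positive"), ("a dull bad film", "negative")]
accuracy = sum(model.predict(t) == y for t, y in test) / len(test)
print(accuracy)
```

In the paper's actual experiments, the factors the abstract lists (prompt choice, task complexity, and the quantity and diversity of generations) would be varied at the `generate_synthetic` step, with downstream classifier accuracy comparing the approaches.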

Authors (5)
  1. Yinheng Li (14 papers)
  2. Rogerio Bonatti (24 papers)
  3. Sara Abdali (14 papers)
  4. Justin Wagle (4 papers)
  5. Kazuhito Koishida (22 papers)
Citations (1)