Generating Faithful Synthetic Data with Large Language Models: A Case Study in Computational Social Science (2305.15041v1)

Published 24 May 2023 in cs.CL

Abstract: LLMs have democratized synthetic data generation, which in turn has the potential to simplify and broaden a wide gamut of NLP tasks. Here, we tackle a pervasive problem in synthetic data generation: its generative distribution often differs from the distribution of real-world data researchers care about (in other words, it is unfaithful). In a case study on sarcasm detection, we study three strategies to increase the faithfulness of synthetic data: grounding, filtering, and taxonomy-based generation. We evaluate these strategies using the performance of classifiers trained with generated synthetic data on real-world data. While all three strategies improve the performance of classifiers, we find that grounding works best for the task at hand. As synthetic data generation plays an ever-increasing role in NLP research, we expect this work to be a stepping stone in improving its utility. We conclude this paper with some recommendations on how to generate high(er)-fidelity synthetic data for specific tasks.
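The abstract describes three strategies and an evaluation protocol but no code. Below is a minimal Python sketch of how they could look in practice, not the authors' implementation: `generate` is a hypothetical stand-in for any LLM completion call, the prompts are illustrative, and the TF-IDF + logistic-regression classifier is an assumption (the abstract does not name the downstream model).

```python
import random

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def generate(prompt: str) -> str:
    """Hypothetical LLM call; replace with your provider's completion API."""
    raise NotImplementedError


def zero_shot(n: int) -> list[str]:
    # Baseline: unconditioned generation; outputs often drift from the
    # real-world distribution (the "unfaithfulness" the paper targets).
    return [generate("Write a sarcastic social-media post.") for _ in range(n)]


def grounded(real_examples: list[str], n: int) -> list[str]:
    # Grounding: condition each generation on a sampled real example so
    # outputs stay close to the target distribution.
    return [
        generate(
            "Here is a real post:\n"
            + random.choice(real_examples)
            + "\nWrite a new sarcastic post in a similar style."
        )
        for _ in range(n)
    ]


def taxonomy_based(taxonomy: list[str], n: int) -> list[str]:
    # Taxonomy-based generation: spread generations across subtypes of the
    # phenomenon (e.g., irony, hyperbole) to cover the label space.
    return [
        generate(f"Write a sarcastic post that uses {random.choice(taxonomy)}.")
        for _ in range(n)
    ]


def filtered(candidates: list[str], score, threshold: float) -> list[str]:
    # Filtering: keep only candidates that a scoring function (e.g., a
    # realism or quality classifier) judges sufficiently faithful.
    return [c for c in candidates if score(c) >= threshold]


def evaluate_on_real(synth_texts, synth_labels, real_texts, real_labels) -> float:
    # The paper's evaluation protocol: train on synthetic data, test on
    # real data; higher accuracy suggests more faithful synthetic data.
    clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    clf.fit(synth_texts, synth_labels)
    return clf.score(real_texts, real_labels)
```

Per the abstract, grounding gave the largest gains among the three strategies in the sarcasm case study, which fits the intuition that conditioning on real examples most directly constrains the generative distribution.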

Authors (6)
  1. Veniamin Veselovsky (17 papers)
  2. Manoel Horta Ribeiro (44 papers)
  3. Akhil Arora (15 papers)
  4. Martin Josifoski (16 papers)
  5. Ashton Anderson (31 papers)
  6. Robert West (154 papers)
Citations (24)