Generating Faithful Synthetic Data with Large Language Models: A Case Study in Computational Social Science (2305.15041v1)

Published 24 May 2023 in cs.CL

Abstract: LLMs have democratized synthetic data generation, which in turn has the potential to simplify and broaden a wide gamut of NLP tasks. Here, we tackle a pervasive problem in synthetic data generation: its generative distribution often differs from the distribution of real-world data researchers care about (in other words, it is unfaithful). In a case study on sarcasm detection, we study three strategies to increase the faithfulness of synthetic data: grounding, filtering, and taxonomy-based generation. We evaluate these strategies using the performance of classifiers trained with generated synthetic data on real-world data. While all three strategies improve the performance of classifiers, we find that grounding works best for the task at hand. As synthetic data generation plays an ever-increasing role in NLP research, we expect this work to be a stepping stone in improving its utility. We conclude this paper with some recommendations on how to generate high(er)-fidelity synthetic data for specific tasks.
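The abstract describes three strategies and an evaluation protocol but no code. Below is a minimal Python sketch of how they could look in practice, not the authors' implementation: `generate` is a hypothetical stand-in for any LLM completion call, the prompts are illustrative, and the TF-IDF + logistic-regression classifier is an assumption (the abstract does not name the downstream model).

```python
import random

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def generate(prompt: str) -> str:
    """Hypothetical LLM call; replace with your provider's completion API."""
    raise NotImplementedError


def zero_shot(n: int) -> list[str]:
    # Baseline: unconditioned generation; outputs often drift from the
    # real-world distribution (the "unfaithfulness" the paper targets).
    return [generate("Write a sarcastic social-media post.") for _ in range(n)]


def grounded(real_examples: list[str], n: int) -> list[str]:
    # Grounding: condition each generation on a sampled real example so
    # outputs stay close to the target distribution.
    return [
        generate(
            "Here is a real post:\n"
            + random.choice(real_examples)
            + "\nWrite a new sarcastic post in a similar style."
        )
        for _ in range(n)
    ]


def taxonomy_based(taxonomy: list[str], n: int) -> list[str]:
    # Taxonomy-based generation: spread generations across subtypes of the
    # phenomenon (e.g., irony, hyperbole) to cover the label space.
    return [
        generate(f"Write a sarcastic post that uses {random.choice(taxonomy)}.")
        for _ in range(n)
    ]


def filtered(candidates: list[str], score, threshold: float) -> list[str]:
    # Filtering: keep only candidates that a scoring function (e.g., a
    # realism or quality classifier) judges sufficiently faithful.
    return [c for c in candidates if score(c) >= threshold]


def evaluate_on_real(synth_texts, synth_labels, real_texts, real_labels) -> float:
    # The paper's evaluation protocol: train on synthetic data, test on
    # real data; higher accuracy suggests more faithful synthetic data.
    clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    clf.fit(synth_texts, synth_labels)
    return clf.score(real_texts, real_labels)
```

Per the abstract, grounding gave the largest gains among the three strategies in the sarcasm case study, which fits the intuition that conditioning on real examples most directly constrains the generative distribution.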

Authors (6)
  1. Veniamin Veselovsky (17 papers)
  2. Manoel Horta Ribeiro (44 papers)
  3. Akhil Arora (15 papers)
  4. Martin Josifoski (16 papers)
  5. Ashton Anderson (31 papers)
  6. Robert West (154 papers)
Citations (24)