Papers
Topics
Authors
Recent
Search
2000 character limit reached

In Defense of Synthetic Data

Published 3 May 2019 in cs.DB | (1905.01351v1)

Abstract: Synthetic datasets have long been thought of as second-rate, to be used only when "real" data collected directly from the real world is unavailable. But this perspective assumes that raw data is clean, unbiased, and trustworthy, which it rarely is. Moreover, the benefits of synthetic data for privacy and for bias correction are becoming increasingly important in any domain that works with people. Curated synthetic datasets - synthetic data derived from minimal perturbations of real data - enable early stage product development and collaboration, protect privacy, afford reproducibility, increase dataset diversity in research, and protect disadvantaged groups from problematic inferences on the original data that reflects systematic discrimination. Rather than representing a departure from the true state of the world, in this paper we argue that properly generated synthetic data is a step towards responsible and equitable research and development of machine learning systems.

Citations (3)

Summary

Paper to Video (Beta)

Whiteboard

No one has generated a whiteboard explanation for this paper yet.

Open Problems

We haven't generated a list of open problems mentioned in this paper yet.

Continue Learning

We haven't generated follow-up questions for this paper yet.

Authors (2)

Collections

Sign up for free to add this paper to one or more collections.