Synthcity: facilitating innovative use cases of synthetic data in different data modalities (2301.07573v1)

Published 18 Jan 2023 in cs.LG and cs.AI

Abstract: Synthcity is an open-source software package for innovative use cases of synthetic data in ML fairness, privacy and augmentation across diverse tabular data modalities, including static data, regular and irregular time series, data with censoring, multi-source data, composite data, and more. Synthcity provides the practitioners with a single access point to cutting edge research and tools in synthetic data. It also offers the community a playground for rapid experimentation and prototyping, a one-stop-shop for SOTA benchmarks, and an opportunity for extending research impact. The library can be accessed on GitHub (https://github.com/vanderschaarlab/synthcity) and pip (https://pypi.org/project/synthcity/). We warmly invite the community to join the development effort by providing feedback, reporting bugs, and contributing code.

Synthcity: Facilitating Innovative Use Cases of Synthetic Data

The paper presents "Synthcity," an open-source software package for working with synthetic data across multiple data modalities. Synthcity targets key challenges in machine learning, particularly fairness, privacy, and data augmentation. It supports a broad range of tabular data types, including static data, regular and irregular time series, and censored data. Synthcity serves as a single access point to state-of-the-art synthetic data methods, with resources for benchmarking, rapid prototyping, and extending research impact.

Synthetic Data Technology in AI

The utility of AI models is often constrained by data limitations, including scarcity, privacy issues, and bias. This lack of high-quality datasets impedes the development of AI systems, particularly in high-stakes fields. Synthetic data offers a solution by generating high-fidelity data while adhering to constraints like differential privacy and fairness, thus fostering robust, privacy-preserving AI models.

Challenges in Synthetic Data Software Development

The practical use of synthetic data remains underdeveloped despite significant academic progress. The paper identifies two primary challenges:

Diverse Problem Settings: Different data modalities and use cases give rise to many distinct problem settings, and no single generator can address them all. Synthcity aims to fill this gap with a platform that integrates many methods.
Contextual Model Choice: Generative models have different strengths, so practitioners need a broad set of tools to choose from for each application. Although many generative models exist, their implementations often lack modularity and interoperability, which Synthcity seeks to address.
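
To make the modularity point concrete, here is a minimal sketch of a shared generator interface with a name-based registry, so that generators can be swapped without changing the calling code. It is an illustration only and does not reproduce Synthcity's actual API: the names `Generator`, `register`, `get_generator`, `available_generators`, `BootstrapGenerator`, and `UniformRangeGenerator` are all hypothetical.

```python
import random
from abc import ABC, abstractmethod

# Hypothetical registry mapping a generator name to its class.
_REGISTRY = {}


class Generator(ABC):
    """Common interface every generator in this sketch implements."""

    @abstractmethod
    def fit(self, rows):
        """Learn from a list of numeric tuples and return self."""

    @abstractmethod
    def generate(self, count):
        """Return `count` synthetic rows as tuples."""


def register(name):
    """Class decorator that adds a generator class to the registry."""
    def decorator(cls):
        _REGISTRY[name] = cls
        return cls
    return decorator


def get_generator(name, **kwargs):
    """Instantiate a registered generator by name."""
    return _REGISTRY[name](**kwargs)


def available_generators():
    """List the registered generator names in sorted order."""
    return sorted(_REGISTRY)


@register("bootstrap")
class BootstrapGenerator(Generator):
    """Resamples real rows with replacement (no privacy protection)."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)

    def fit(self, rows):
        self.rows = list(rows)
        return self

    def generate(self, count):
        return [self.rng.choice(self.rows) for _ in range(count)]


@register("uniform_range")
class UniformRangeGenerator(Generator):
    """Samples each column uniformly between its observed min and max."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)

    def fit(self, rows):
        self.bounds = [(min(col), max(col)) for col in zip(*rows)]
        return self

    def generate(self, count):
        return [
            tuple(self.rng.uniform(lo, hi) for lo, hi in self.bounds)
            for _ in range(count)
        ]


# Usage: the same calling code works for every registered generator.
data = [(1.0, 2.0), (3.0, 4.0), (5.0, 6.0)]
samples = {
    name: get_generator(name, seed=0).fit(data).generate(5)
    for name in available_generators()
}
```

Because every generator exposes the same `fit` and `generate` methods, a benchmark loop can iterate over names rather than special-casing each model.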

Synthcity Library Features

Synthcity is composed of several key components that enable it to serve as a comprehensive solution for synthetic data generation:

Comprehensive Workflow: It provides a standardized workflow encapsulating dataset loading, generator training, synthetic data production, and evaluation.
Tabular Data Focus: Initially focused on tabular data due to its industrial relevance, encompassing static datasets, time series, and censored data.
Diverse Use Cases: Supports various applications including standard data generation, fairness (both balancing and causal fairness), privacy preservation, and cross-domain augmentation.
Evaluation Metrics: Synthcity incorporates a rich suite of evaluation metrics to assess synthetic data quality, covering fidelity, utility, and privacy aspects.
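
The four workflow stages listed above (dataset loading, generator training, synthetic data production, evaluation) can be sketched end to end with the standard library alone. This is a toy under stated assumptions, not Synthcity's implementation: the dataset is simulated, the generator treats each column as an independent Gaussian, and `fidelity_gap` and `min_distance_to_real` are hypothetical stand-ins for the fidelity and privacy metric families the library provides.

```python
import math
import random
import statistics


def load_dataset(n_rows=200, seed=0):
    """Stage 1: stand-in for dataset loading; simulates (age, income) rows."""
    rng = random.Random(seed)
    return [(rng.gauss(40, 10), rng.gauss(50_000, 8_000)) for _ in range(n_rows)]


class GaussianMarginalGenerator:
    """Toy generator: fits an independent Gaussian to each column."""

    def fit(self, rows):
        # Stage 2: "training" here is just estimating a mean and std per column.
        self.params = [
            (statistics.mean(col), statistics.stdev(col)) for col in zip(*rows)
        ]
        return self

    def generate(self, count, seed=1):
        # Stage 3: sample each column independently from its fitted Gaussian.
        rng = random.Random(seed)
        return [
            tuple(rng.gauss(mu, sd) for mu, sd in self.params)
            for _ in range(count)
        ]


def fidelity_gap(real, synth):
    """Largest relative gap between real and synthetic column means."""
    gaps = []
    for real_col, synth_col in zip(zip(*real), zip(*synth)):
        real_mean = statistics.mean(real_col)
        gaps.append(abs(real_mean - statistics.mean(synth_col)) / abs(real_mean))
    return max(gaps)


def min_distance_to_real(real, synth):
    """Crude privacy proxy: smallest Euclidean distance from any synthetic
    row to any real row. Values near zero suggest a near-copy of a record."""
    return min(math.dist(s, r) for s in synth for r in real)


# Stage 4: evaluate the synthetic sample against the real one.
real = load_dataset()
generator = GaussianMarginalGenerator().fit(real)
synth = generator.generate(count=200)
report = {
    "fidelity_gap": fidelity_gap(real, synth),
    "min_distance_to_real": min_distance_to_real(real, synth),
}
print(report)
```

The independent-marginal model ignores correlations between columns, so it would score poorly on any fidelity metric that compares joint distributions; it is used here only because it keeps the sketch short.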

Comparative Analysis

The paper compares Synthcity with existing synthetic data libraries. The comparison indicates that Synthcity supports more data modalities and use cases, and offers a broader range of data generation algorithms and evaluation metrics.

Implications and Future Directions

Synthcity is a step toward closing the gap between synthetic data research and real-world application. As a unified, community-driven platform, it could encourage wider adoption of synthetic data methods and further work on them. Future developments could expand its support for additional data modalities and enhance native support for data with missing elements.

In conclusion, Synthcity addresses several of the practical challenges of synthetic data generation. Its focus on modularity, interoperability, and comprehensive evaluation positions it to support both practical applications and further research on synthetic data across domains.

Authors (3)

Zhaozhi Qian
Bogdan-Constantin Cebere
Mihaela van der Schaar

Citations (42)
