Real-valued (Medical) Time Series Generation with Recurrent Conditional GANs (1706.02633v2)

Published 8 Jun 2017 in stat.ML and cs.LG

Abstract: Generative Adversarial Networks (GANs) have shown remarkable success as a framework for training models to produce realistic-looking data. In this work, we propose a Recurrent GAN (RGAN) and Recurrent Conditional GAN (RCGAN) to produce realistic real-valued multi-dimensional time series, with an emphasis on their application to medical data. RGANs make use of recurrent neural networks in the generator and the discriminator. In the case of RCGANs, both of these RNNs are conditioned on auxiliary information. We demonstrate our models in a set of toy datasets, where we show visually and quantitatively (using sample likelihood and maximum mean discrepancy) that they can successfully generate realistic time-series. We also describe novel evaluation methods for GANs, where we generate a synthetic labelled training dataset, and evaluate on a real test set the performance of a model trained on the synthetic data, and vice-versa. We illustrate with these metrics that RCGANs can generate time-series data useful for supervised training, with only minor degradation in performance on real test data. This is demonstrated on digit classification from 'serialised' MNIST and by training an early warning system on a medical dataset of 17,000 patients from an intensive care unit. We further discuss and analyse the privacy concerns that may arise when using RCGANs to generate realistic synthetic medical time series data.

In "Real-valued (Medical) Time Series Generation with Recurrent Conditional GANs," Cristóbal Esteban, Stephanie L. Hyland, and Gunnar Rätsch propose Recurrent Generative Adversarial Networks (RGANs) and a conditional variant (RCGANs) for generating realistic real-valued, multi-dimensional time series, with an emphasis on medical applications. Both the generator and the discriminator are recurrent neural networks (RNNs), which are well suited to capturing the sequential dependencies inherent in time-series data.

Key Contributions and Methods

The authors emphasize several primary contributions:

Adversarial Training for Time-series Data: They adapt the GAN framework to model real-valued sequential data using RNNs, resulting in RGAN and RCGAN.
Evaluation Approaches: They introduce novel evaluation methods for GANs that produce labelled data: Train on Synthetic, Test on Real (TSTR) and Train on Real, Test on Synthetic (TRTS).
Medical Data Generation: They generate synthetic medical time series, with the aim of enabling data sharing without compromising patient privacy.
Empirical Privacy Analysis: The paper addresses privacy concerns by conducting empirical analyses and exploring the use of differentially private training methods.

Sequential Data Generation with GANs

The RGAN and RCGAN frameworks are tailored for time-series generation. The key architectural components include:

Recurrent Networks in GANs: Both the generator and discriminator in these GANs are implemented using RNNs, capable of handling the temporal aspects of sequential data.
Conditional Inputs: In RCGANs, both the generator and the discriminator are conditioned on auxiliary information (for example, a class label), making it possible to generate time series of a specified type.
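The conditioning idea can be illustrated with a minimal sketch. It assumes that the condition vector is simply concatenated to the latent noise at every time step, and it uses a plain tanh RNN with random, untrained weights in place of whatever recurrent cell the authors used. The class name TinyConditionalGenerator, the dimensions, and the weight scale are all hypothetical choices made for illustration.

```python
import math
import random


def make_matrix(rows, cols, rng, scale=0.1):
    """Random weight matrix as a list of rows (illustrative initialisation)."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Matrix-vector product for list-of-rows matrices."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class TinyConditionalGenerator:
    """Untrained tanh RNN standing in for a conditional generator (hypothetical)."""

    def __init__(self, latent_dim, cond_dim, hidden_dim, out_dim, seed=0):
        rng = random.Random(seed)
        self.hidden_dim = hidden_dim
        self.w_in = make_matrix(hidden_dim, latent_dim + cond_dim, rng)
        self.w_rec = make_matrix(hidden_dim, hidden_dim, rng)
        self.w_out = make_matrix(out_dim, hidden_dim, rng)

    def generate(self, noise_seq, condition):
        hidden = [0.0] * self.hidden_dim
        outputs = []
        for z_t in noise_seq:
            # Assumed conditioning scheme: append the condition to every step's input.
            x_t = list(z_t) + list(condition)
            from_input = matvec(self.w_in, x_t)
            from_state = matvec(self.w_rec, hidden)
            hidden = [math.tanh(a + b) for a, b in zip(from_input, from_state)]
            outputs.append(matvec(self.w_out, hidden))
        return outputs


rng = random.Random(1)
seq_len, latent_dim = 5, 3
noise = [[rng.gauss(0.0, 1.0) for _ in range(latent_dim)] for _ in range(seq_len)]
gen = TinyConditionalGenerator(latent_dim, cond_dim=2, hidden_dim=4, out_dim=1)

# Same noise, different one-hot conditions: the outputs differ.
series_a = gen.generate(noise, condition=[1.0, 0.0])
series_b = gen.generate(noise, condition=[0.0, 1.0])
print(len(series_a), len(series_a[0]), series_a != series_b)
```

A discriminator could be conditioned the same way, by appending the condition to each observed time step before it enters the recurrent network.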

Evaluation Methodologies

Because GANs are difficult to evaluate, especially on time-series data, the authors use two kinds of measures:

Maximum Mean Discrepancy (MMD): Used to compare statistics of real and generated samples. This approach helps in evaluating whether the GANs learn the underlying data distribution effectively.
TSTR and TRTS Metrics: In TSTR, a supervised model is trained on a synthetic labelled dataset and evaluated on a real test set; TRTS reverses the roles. Together they indicate how useful and how realistic the generated data are.
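As a rough illustration of the MMD comparison, the sketch below computes the standard unbiased estimate of squared MMD with a Gaussian (RBF) kernel, treating each sequence as a flat vector. The fixed bandwidth sigma=4.0, the sine-wave toy data, and the function names are assumptions made for this example; the paper's own kernel and bandwidth selection may differ.

```python
import math
import random


def rbf_kernel(x, y, sigma):
    """Gaussian (RBF) kernel between two equal-length sequences."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def mmd2_unbiased(xs, ys, sigma=4.0):
    """Unbiased estimate of squared MMD; it can be slightly negative."""
    n, m = len(xs), len(ys)
    k_xx = sum(rbf_kernel(xs[i], xs[j], sigma)
               for i in range(n) for j in range(n) if i != j)
    k_yy = sum(rbf_kernel(ys[i], ys[j], sigma)
               for i in range(m) for j in range(m) if i != j)
    k_xy = sum(rbf_kernel(x, y, sigma) for x in xs for y in ys)
    return k_xx / (n * (n - 1)) + k_yy / (m * (m - 1)) - 2.0 * k_xy / (n * m)


def sine_wave(rng, length=20, offset=0.0):
    """Toy series: a sine wave with random phase and optional vertical shift."""
    phase = rng.uniform(0.0, 2.0 * math.pi)
    return [math.sin(0.3 * t + phase) + offset for t in range(length)]


rng = random.Random(0)
real = [sine_wave(rng) for _ in range(40)]
good_fake = [sine_wave(rng) for _ in range(40)]              # same distribution
bad_fake = [sine_wave(rng, offset=1.0) for _ in range(40)]   # shifted distribution

mmd_good = mmd2_unbiased(real, good_fake)
mmd_bad = mmd2_unbiased(real, bad_fake)
print(f"MMD^2 same distribution:    {mmd_good:.4f}")
print(f"MMD^2 shifted distribution: {mmd_bad:.4f}")
```

A value near zero suggests the two sample sets are hard to distinguish under this kernel, while a clearly positive value indicates a mismatch.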

Experiments and Results

The paper demonstrates the capabilities of RGANs and RCGANs through multiple experiments:

Toy Datasets: The models effectively generated simple time-series data such as sine waves and Gaussian process samples, validated visually and quantitatively.
MNIST Sequential Data: By serialising MNIST digits, so that each image is treated as a sequence, the authors showed that the model can generate digit sequences conditioned on a label, and they evaluated the generated data on digit classification.
Intensive Care Unit (ICU) Data: Applied to data from the eICU database covering about 17,000 ICU patients, the models generated synthetic time series of vital signs (SpO2, HR, RR, MAP). In the TSTR evaluation, an early warning system trained on synthetic data showed only minor degradation in performance on real test data.
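The TSTR and TRTS protocols used in these experiments can be sketched as follows. Everything here is a stand-in: the "synthetic" set is just noisier toy data rather than GAN output, the classifier is a nearest-centroid rule chosen for brevity, and the helper names are hypothetical. The point is only the order of operations (fit on one source, score on the other).

```python
import random


def make_dataset(rng, n_per_class, noise):
    """Toy labelled series of length 8: falling (label 0) or rising (label 1)."""
    data = []
    for label in (0, 1):
        slope = 1.0 if label == 1 else -1.0
        for _ in range(n_per_class):
            series = [slope * t / 7.0 + rng.gauss(0.0, noise) for t in range(8)]
            data.append((series, label))
    return data


def fit_centroids(data):
    """Nearest-centroid 'training': average the series of each class."""
    sums, counts = {}, {}
    for series, label in data:
        acc = sums.setdefault(label, [0.0] * len(series))
        for i, value in enumerate(series):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def predict(centroids, series):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    def sq_dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(series, centroid))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))


def accuracy(centroids, data):
    return sum(predict(centroids, s) == y for s, y in data) / len(data)


rng = random.Random(0)
real_train = make_dataset(rng, 50, noise=0.3)
real_test = make_dataset(rng, 50, noise=0.3)
synthetic = make_dataset(rng, 50, noise=0.5)  # stand-in for labelled generator output

tstr = accuracy(fit_centroids(synthetic), real_test)   # train synthetic, test real
trts = accuracy(fit_centroids(real_train), synthetic)  # train real, test synthetic
trtr = accuracy(fit_centroids(real_train), real_test)  # real-only baseline
print(f"TSTR={tstr:.2f}  TRTS={trts:.2f}  real baseline={trtr:.2f}")
```

Comparing TSTR against the real-only baseline gives a sense of how much performance is lost by training on synthetic data.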

Addressing Privacy Concerns

Given the sensitivity of medical data, the authors analyse how much the models reveal about their training data and explore a way to limit it:

Reconstruction Error Analysis: They compare the distributions of reconstruction errors on training and test data to check for overfitting to the training set.
Latent Space Interpolation: Interpolating between points in latent space produced smooth transitions in the generated samples, suggesting that the model had not merely memorized training examples.
Differentially Private GAN Training: Using differentially private stochastic gradient descent (DP-SGD), they trained the GANs with differential privacy guarantees and found that the generated data retained utility for downstream tasks despite the privacy constraints.
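For readers unfamiliar with DP-SGD, the sketch below shows the core of one update as it is commonly described: clip each example's gradient to a maximum L2 norm, sum the clipped gradients, add Gaussian noise scaled by the clipping norm, then average and step. It is applied here to a toy linear regression rather than a GAN, it does no privacy accounting (no epsilon is computed), and the function names and hyperparameters are assumptions chosen for illustration.

```python
import math
import random


def clip_by_l2_norm(grad, max_norm):
    """Scale a gradient down so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)
    return [g * max_norm / norm for g in grad]


def dp_sgd_step(weights, batch, lr, max_norm, noise_multiplier, rng):
    """One noisy, clipped gradient step for squared-error linear regression."""
    summed = [0.0] * len(weights)
    for features, target in batch:
        pred = sum(w * x for w, x in zip(weights, features))
        grad = [2.0 * (pred - target) * x for x in features]  # per-example gradient
        for i, g in enumerate(clip_by_l2_norm(grad, max_norm)):
            summed[i] += g
    noise_std = noise_multiplier * max_norm
    noisy = [s + rng.gauss(0.0, noise_std) for s in summed]
    return [w - lr * g / len(batch) for w, g in zip(weights, noisy)]


def mean_squared_error(weights, data):
    return sum((sum(w * x for w, x in zip(weights, f)) - t) ** 2
               for f, t in data) / len(data)


rng = random.Random(0)
data = []
for _ in range(200):
    x = [rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)]
    data.append((x, 2.0 * x[0] - 1.0 * x[1]))  # toy target: y = 2*x0 - x1

weights = [0.0, 0.0]
initial_loss = mean_squared_error(weights, data)
for _ in range(300):
    batch = rng.sample(data, 32)
    weights = dp_sgd_step(weights, batch, lr=0.1, max_norm=1.0,
                          noise_multiplier=1.0, rng=rng)
final_loss = mean_squared_error(weights, data)
print(f"loss before: {initial_loss:.3f}, after: {final_loss:.3f}")
```

Larger noise multipliers or smaller clipping norms strengthen the privacy protection at the cost of noisier updates, which is the utility trade-off the summary refers to.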

Implications and Future Directions

The implications of this research are manifold:

Synthetic Data Sharing: The generation of realistic synthetic medical data without compromising privacy opens avenues for data-sharing and collaborative research.
Evaluation Techniques: The novel evaluation methodologies can be widely adopted for other time-series generation tasks.
Potential for Enhanced Privacy: Advancements in differentially private training methods can further enhance the privacy guarantees of synthetic data generative models.

Future work may focus on improving both the fidelity and the privacy guarantees of such models, for instance by considering alternative architectures or training paradigms such as unitary RNNs, which could help enforce Lipschitz constraints and so stabilise training. Practical applications could also explore the utility of generated synthetic data in other medical and time-series prediction tasks, supporting broader adoption of GANs in sensitive data domains.

Authors (3)

Cristóbal Esteban
Stephanie L. Hyland
Gunnar Rätsch
