Overview of "Language Models are Realistic Tabular Data Generators"
This paper presents an approach that uses transformer-based large language models (LLMs) to generate realistic synthetic tabular data, addressing the heterogeneous feature types and structural complexity typical of such datasets. The method converts tabular records into textual sequences, enabling LLMs to capture complex dependencies between heterogeneous data types without the extensive preprocessing that conventional methods require.
Introduction
Tabular data is widely used in machine learning but is often hampered by imbalanced class distributions, privacy concerns, and data impurities. Existing generative approaches such as GANs and VAEs require intensive preprocessing (e.g., encoding categorical variables numerically), which often discards contextual information. The proposed method, Generation of Realistic Tabular data (GReaT), leverages autoregressive LLMs to overcome these barriers by converting tabular datasets into textual sequences, preserving the structure and semantics of the original data.
Key Contributions
The paper posits that pretrained LLMs are well suited to modeling tabular data: their proficiency with text carries over once tabular rows are expressed as subject-predicate-object sequences (e.g., "Age is 42"). Key innovations of the GReaT method include:
- Textual Encoding and Random Ordering: The transformation of tabular data into sequences of meaningful text avoids lossy preprocessing and encodes rich contextual knowledge. Permutations applied to feature order remove artificial dependencies and provide flexibility for arbitrary conditioning during data generation.
- Pretraining Adaptation: Utilizing existing transformer architectures allows GReaT to benefit from the substantial contextual knowledge that LLMs acquire from vast text corpora. This enhances the synthetic data generation capability beyond that of conventional numeric transformations.
- Arbitrary Conditioning Capability: GReaT allows data generation conditioned on any subset of features, thus providing fine control over synthetic data distribution. This offers potential applications in imputation and oversampling within datasets.
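To make the textual encoding, feature-order permutation, and conditioning concrete, here is a minimal sketch in Python. The function name and the exact "feature is value" template are illustrative assumptions, not the paper's implementation:

```python
import random

def row_to_text(row, rng=None, condition_on=None):
    """Encode one tabular record as a textual 'feature is value' sequence,
    following the subject-predicate-object scheme described above.

    The feature order is randomly permuted so the language model does not
    learn an artificial left-to-right dependency between columns. For
    arbitrary conditioning, the given features are moved to the front; a
    fine-tuned LLM would then be prompted to complete the remaining ones.
    """
    rng = rng or random.Random()
    items = list(row.items())
    rng.shuffle(items)  # random feature-order permutation
    if condition_on:
        # Stable sort: conditioning features first, rest keep shuffled order.
        items.sort(key=lambda kv: kv[0] not in condition_on)
    return ", ".join(f"{k} is {v}" for k, v in items)

row = {"Age": 42, "Education": "Bachelors", "Income": ">50K"}
print(row_to_text(row, rng=random.Random(0)))
print(row_to_text(row, condition_on={"Income"}))  # prompt for conditional sampling
```

In the full pipeline, sentences like these would be used to fine-tune an autoregressive LLM; at sampling time, the conditional variant serves as the prompt and the model generates the missing feature-value pairs.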
Experimental Evaluation
The efficacy of GReaT is demonstrated through various experiments involving real-world and synthetic datasets. Results indicated that the method achieves state-of-the-art performance across different aspects:
- ML Efficiency: Models trained on GReaT-generated synthetic data and evaluated on real test data perform nearly as well as models trained on the original data, following the standard "train on synthetic, test on real" protocol for measuring fidelity.
- Discriminator Performance: Classifiers trained to distinguish real from synthetic records achieve low accuracy on GReaT samples, indicating that the generated data is hard to tell apart from real data.
- Distance to Closest Records (DCR): The DCR analysis shows that generated samples are not near-copies of training records, supporting the originality and diversity of the synthetic data.
- Qualitative Analysis: Demonstrates robust joint feature distribution preservation, outperforming existing models in capturing interdependencies between features.
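The DCR metric mentioned above is straightforward to sketch: for each synthetic sample, compute its minimum distance to any record in the original training set. A DCR of zero would indicate a copied training row. The Hamming-style distance over mixed-type features below is an illustrative assumption; the paper's exact distance choice may differ:

```python
def dcr(sample, training_set):
    """Distance to Closest Record: the minimum distance between one
    synthetic sample and every record in the original training data.
    Uses a simple count of mismatched features (assumption, not the
    paper's exact metric)."""
    def dist(a, b):
        return sum(x != y for x, y in zip(a, b))
    return min(dist(sample, row) for row in training_set)

train = [(42, "Bachelors", ">50K"), (35, "HS-grad", "<=50K")]
synthetic = (42, "Masters", ">50K")
print(dcr(synthetic, train))  # → 1 (differs from the first row only in Education)
```

Aggregating DCR over all synthetic samples (e.g., its distribution or minimum) gives a rough check that the generator produces novel records rather than memorized ones.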
Conclusion and Future Directions
The use of LLMs for tabular data generation is a significant step toward merging NLP capabilities with traditional data synthesis, offering a versatile and efficient alternative to conventional methods. Future research may explore better encoding strategies for numerical values and the application of context-rich models to further data domains. The results suggest broad applicability in challenging settings such as healthcare and finance, where advanced synthetic data can alleviate data scarcity and privacy constraints.
This paper thereby positions itself as a pioneering exploration into harnessing LLMs' capabilities, opening new avenues for future research in AI-driven data synthesis methodologies.