
SiloFuse: Cross-silo Synthetic Data Generation with Latent Tabular Diffusion Models (2404.03299v1)

Published 4 Apr 2024 in cs.LG, cs.CR, cs.DB, and cs.DC

Abstract: Synthetic tabular data is crucial for sharing and augmenting data across silos, especially for enterprises with proprietary data. However, existing synthesizers are designed for centrally stored data. Hence, they struggle with real-world scenarios where features are distributed across multiple silos, necessitating on-premise data storage. We introduce SiloFuse, a novel generative framework for high-quality synthesis from cross-silo tabular data. To ensure privacy, SiloFuse utilizes a distributed latent tabular diffusion architecture. Through autoencoders, latent representations are learned for each client's features, masking their actual values. We employ stacked distributed training to improve communication efficiency, reducing the number of rounds to a single step. Under SiloFuse, we prove the impossibility of data reconstruction for vertically partitioned synthesis and quantify privacy risks through three attacks using our benchmark framework. Experimental results on nine datasets showcase SiloFuse's competence against centralized diffusion-based synthesizers. Notably, SiloFuse achieves 43.8 and 29.8 higher percentage points over GANs in resemblance and utility. Experiments on communication show stacked training's fixed cost compared to the growing costs of end-to-end training as the number of training iterations increases. Additionally, SiloFuse proves robust to feature permutations and varying numbers of clients.


Summary

  • The paper introduces a novel framework that combines local autoencoders with latent diffusion models to synthesize high-quality synthetic tabular data while preserving privacy.
  • It tackles vertical partitioning challenges by encoding mixed data types efficiently and reducing communication overhead through a stacked training approach.
  • Benchmarking reveals significant improvements over GAN-based methods, with gains of 43.8 and 29.8 percentage points in data resemblance and utility, respectively.

SiloFuse: A Novel Approach for Cross-Silo Synthetic Data Generation using Latent Tabular Diffusion Models

Introduction

The proliferation of proprietary datasets across enterprises presents both an opportunity for collaborative knowledge discovery and a challenge, owing to privacy regulations such as the GDPR. Generating high-quality synthetic data that accurately mirrors the statistical properties of real datasets, without compromising privacy, remains a pivotal concern in distributed environments. Addressing this, the paper introduces SiloFuse, a framework that synthesizes cross-silo tabular data through a distributed latent tabular diffusion architecture.

Synthesis Challenge in Vertical Partitioning

Traditional synthesizers struggle with vertically partitioned datasets, where features are distributed across silos and must remain stored on-premise. The main challenges are:

  • Handling mixed data types necessitates innovative encoding strategies for both continuous and categorical variables, avoiding issues like sparsity and poor feature obfuscation inherent in one-hot encoding.
  • Ensuring the synthetic data captures cross-silo feature correlations without centralizing the original datasets, thus respecting privacy constraints.
  • Communicating efficiently during distributed training, avoiding costly data exchanges that grow with the number of training iterations.
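To make the encoding challenge concrete, the sketch below contrasts one-hot encoding of a high-cardinality categorical column with the compact dense latents an autoencoder-style embedding would produce. The column size, cardinality, and embedding width are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy categorical column with many distinct values, as is common in
# enterprise tables (e.g. product codes). Assumed data for illustration.
categories = rng.integers(0, 1000, size=5000)  # 1000 distinct codes

# One-hot encoding: one column per category -> very wide and extremely sparse.
one_hot = np.zeros((categories.size, 1000))
one_hot[np.arange(categories.size), categories] = 1.0
sparsity = 1.0 - one_hot.mean()  # fraction of zero entries

# A learned embedding (as a silo's autoencoder would produce) maps each code
# to a short dense vector instead: 8 dimensions rather than 1000.
embedding_table = rng.standard_normal((1000, 8))
latents = embedding_table[categories]  # (5000, 8), fully dense

print(one_hot.shape, latents.shape)   # (5000, 1000) (5000, 8)
print(round(sparsity, 3))             # 0.999
```

Because each row activates exactly one of 1000 columns, 99.9% of the one-hot matrix is zeros, while the dense latents carry the same information in a fraction of the width.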

Framework Design: SiloFuse

SiloFuse introduces a novel architecture combining autoencoders with latent diffusion models to synthesize data:

  • Local Autoencoders: Silos first encode their data features into continuous latents, addressing data diversity (continuous and categorical features) and reducing sparsity. These encoded latents are then sent to a central coordinator.
  • Latent Diffusion Model: At the coordinator, these latents are synthesized using a backbone generative Gaussian diffusion model, ensuring global feature correlations are learned in the latent space.
  • Stacked Training Paradigm: Autoencoders and the diffusion model undergo separate training phases—local autoencoder training followed by centralized diffusion model training—significantly reducing the communication overhead to a single round of latent exchange.
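The three components above can be sketched end to end. This is a deliberately minimal numpy stand-in, not SiloFuse's actual architecture: PCA plays the role of each silo's autoencoder, and a fitted Gaussian plays the role of the latent diffusion backbone; silo counts, feature widths, and data are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def fit_linear_autoencoder(x, k):
    """PCA as a minimal stand-in for a silo's autoencoder."""
    mu = x.mean(axis=0)
    _, _, vt = np.linalg.svd(x - mu, full_matrices=False)
    return mu, vt[:k].T  # mean and (features -> k latents) encoder

# Two silos hold disjoint feature sets for the same 1000 rows (toy data).
n = 1000
silo_a = rng.normal(size=(n, 6))
silo_b = rng.normal(size=(n, 4))

# Phase 1: each silo trains its autoencoder locally and encodes its features.
mu_a, enc_a = fit_linear_autoencoder(silo_a, k=3)
mu_b, enc_b = fit_linear_autoencoder(silo_b, k=2)
z_a = (silo_a - mu_a) @ enc_a
z_b = (silo_b - mu_b) @ enc_b

# Single communication round: latents (never raw features) go to the
# coordinator, which concatenates them so cross-silo correlations live in
# one latent table.
z = np.concatenate([z_a, z_b], axis=1)  # (1000, 5)

# Phase 2: the coordinator trains a generative model on z. A fitted Gaussian
# stands in here for SiloFuse's Gaussian diffusion backbone.
mean, cov = z.mean(axis=0), np.cov(z, rowvar=False)
z_synth = rng.multivariate_normal(mean, cov, size=n)

# Each silo decodes only its own slice of the synthetic latents locally.
synth_a = z_synth[:, :3] @ enc_a.T + mu_a
print(z.shape, synth_a.shape)  # (1000, 5) (1000, 6)
```

Note how the protocol needs exactly one latent upload per silo regardless of how many iterations the central generative model trains for, which is the source of the fixed communication cost discussed below.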

The framework is theoretically grounded by a proof that raw-feature reconstruction is impossible under vertically partitioned synthesis, strengthening its privacy guarantees.

Benchmarking and Evaluation

SiloFuse has been rigorously evaluated against centralized methods on nine datasets. The framework:

  • Demonstrates notable gains over GAN-based baselines, achieving 43.8 and 29.8 percentage points improvement in resemblance and utility, respectively.
  • Incurs a fixed communication cost, in contrast to end-to-end training whose cost grows with the number of training iterations, thanks to the stacked training approach.
  • Maintains robustness against feature permutations and varying numbers of clients, indicating a high degree of flexibility and resilience in diverse data distribution scenarios.
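Utility of synthetic tabular data is commonly measured in a train-on-synthetic, test-on-real fashion: fit a model on synthetic data and compare its held-out accuracy on real data against a model fit on real data. The sketch below illustrates that protocol on a toy dataset, with a nearest-centroid classifier standing in for the gradient-boosted models typically used in such benchmarks; it is not the paper's evaluation code.

```python
import numpy as np

rng = np.random.default_rng(2)

def make_table(n):
    """Toy labelled table; a stand-in for one of the benchmark datasets."""
    x = rng.normal(size=(n, 5))
    y = (x @ np.array([1.0, -2.0, 0.5, 0.0, 1.5]) > 0).astype(int)
    return x, y

def fit_centroids(x, y):
    """Nearest-centroid classifier: a minimal stand-in for XGBoost."""
    return np.stack([x[y == c].mean(axis=0) for c in (0, 1)])

def predict(centroids, x):
    d = np.linalg.norm(x[:, None, :] - centroids[None], axis=2)
    return d.argmin(axis=1)

x_real, y_real = make_table(2000)
x_test, y_test = make_table(500)
# "Synthetic" data here is just lightly perturbed real data, standing in
# for a synthesizer's output so the comparison runs end to end.
x_synth = x_real + rng.normal(scale=0.1, size=x_real.shape)

acc_real = (predict(fit_centroids(x_real, y_real), x_test) == y_test).mean()
acc_synth = (predict(fit_centroids(x_synth, y_real), x_test) == y_test).mean()
print(round(acc_real, 2), round(acc_synth, 2))
```

The closer the synthetic-trained accuracy tracks the real-trained accuracy, the higher the utility score; resemblance metrics (e.g. per-column divergences and correlation differences) are computed separately on the data distributions themselves.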

Implications and Future Directions

SiloFuse's approach to synthetic data generation not only addresses the pressing need for privacy-preserving data sharing across silos but also opens new avenues for collaborative data analysis without compromising data privacy. Its ability to efficiently manage communication costs and maintain data utility under privacy constraints presents a scalable solution for enterprises looking to leverage shared knowledge. Future work could explore enhancements in the model's ability to handle even more diverse datasets, or investigate novel paradigms for secure, privacy-preserving computation to further enrich collaborative data science endeavors.

Conclusion

SiloFuse represents a significant advancement in the field of synthetic data generation, especially for vertically partitioned, cross-silo scenarios. By marrying the concepts of latent diffusion models with autoencoders within a distributed architecture, it innovatively tackles the dual challenge of privacy preservation and data utility. This framework sets a new bar for future research in distributed synthetic data generation, paving the way for more sophisticated and privacy-compliant data collaboration techniques in the digital age.