RelDiff: Relational Data Generative Modeling with Graph-Based Diffusion Models (2506.00710v1)

Published 31 May 2025 in cs.LG

Abstract: Real-world databases are predominantly relational, comprising multiple interlinked tables that contain complex structural and statistical dependencies. Learning generative models on relational data has shown great promise in generating synthetic data and imputing missing values. However, existing methods often struggle to capture this complexity, typically reducing relational data to conditionally generated flat tables and imposing limiting structural assumptions. To address these limitations, we introduce RelDiff, a novel diffusion generative model that synthesizes complete relational databases by explicitly modeling their foreign key graph structure. RelDiff combines a joint graph-conditioned diffusion process across all tables for attribute synthesis, and a $2K+$SBM graph generator based on the Stochastic Block Model for structure generation. The decomposition of graph structure and relational attributes ensures both high fidelity and referential integrity, both of which are crucial aspects of synthetic relational database generation. Experiments on 11 benchmark datasets demonstrate that RelDiff consistently outperforms prior methods in producing realistic and coherent synthetic relational databases. Code is available at https://github.com/ValterH/RelDiff.

Summary

Relational Data Generative Modeling with Graph-Based Diffusion Models

This paper introduces RelDiff, a framework for generative modeling of relational data using graph-based diffusion models. The work addresses the challenges of synthesizing relational databases, which are characterized by complex structural and statistical dependencies across interconnected tables. Existing methods often sidestep this complexity by flattening relational data into conditionally generated single tables or by imposing structural assumptions that fail to capture inter-table correlations. RelDiff instead models the foreign key graph structure explicitly, aiming for both high fidelity and referential integrity.

Key Contributions

Graph-Based Structure Generation: RelDiff employs a $2K+$SBM (Stochastic Block Model) graph generator to model the hierarchical structure and foreign key relationships of relational data, so that the synthetic data preserves the cardinalities and dependencies present in the original data.
Joint Diffusion Process: The framework synthesizes mixed-type attributes across tables using a graph-conditioned diffusion process, leveraging Graph Neural Networks (GNNs). This approach captures both intra-table and inter-table dependencies, ensuring coherent attribute generation.
Extensive Benchmarking: The authors evaluate RelDiff on 11 datasets with diverse schemas, where it consistently outperforms existing methods. The reported results show up to 80% improvement over prior approaches in preserving column correlation between connected tables.
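
To make the terms "referential integrity" and "cardinalities" above concrete, here is a minimal sketch of two checks one might run on a synthetic parent/child table pair. The table names, the toy key values, and the helper functions are hypothetical illustrations and are not taken from the RelDiff codebase or its evaluation suite.

```python
from collections import Counter

def dangling_foreign_keys(parent_ids, child_fks):
    """Return child foreign keys that do not match any parent primary key."""
    parents = set(parent_ids)
    return [fk for fk in child_fks if fk not in parents]

def cardinality_histogram(parent_ids, child_fks):
    """Map 'number of children' -> 'number of parents with that many children'.

    Parents with zero children are counted explicitly, since they are part
    of the cardinality distribution a generator should preserve.
    """
    counts = Counter(child_fks)
    return Counter(counts.get(pid, 0) for pid in parent_ids)

# Hypothetical toy database: customers (parent) and orders (child).
customers = [1, 2, 3, 4]
orders_customer_id = [1, 1, 2, 4, 4, 4]

dangling = dangling_foreign_keys(customers, orders_customer_id)
hist = cardinality_histogram(customers, orders_customer_id)
```

An empty `dangling` list means every child row points at an existing parent. Comparing `hist` between the real and the synthetic database gives a rough view of how well child counts per parent are preserved.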

Methodology

RelDiff decomposes the generative modeling task into two components: graph structure generation and attribute synthesis. The graph structure is dictated by the foreign key relationships within the relational data and is modeled using the $2K+$SBM framework, described here as a nonparametric Bayesian approach that preserves the hierarchical and modular organization typical of relational databases.
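
The idea of generating foreign key structure from a block model can be illustrated with a deliberately simplified sketch. This is a plain SBM-style sampler, not the $2K+$SBM generator: it does not preserve joint degree distributions and it takes the block assignments as given instead of inferring them. The function name, the block labels, and the affinity matrix are all assumptions made for illustration.

```python
import random

def sample_fk_edges(parent_blocks, child_blocks, affinity, seed=0):
    """Assign each child row exactly one parent row, SBM-style.

    parent_blocks: dict parent_id -> block index
    child_blocks:  dict child_id  -> block index
    affinity[cb][pb]: relative rate at which a child in block cb links
                      to any single parent in block pb (hypothetical values)
    """
    rng = random.Random(seed)

    # Group parent ids by their block.
    parents_by_block = {}
    for pid, block in parent_blocks.items():
        parents_by_block.setdefault(block, []).append(pid)
    blocks = sorted(parents_by_block)

    edges = {}
    for cid, cb in child_blocks.items():
        # Weight of a parent block = per-pair rate times block size.
        weights = [affinity[cb][pb] * len(parents_by_block[pb]) for pb in blocks]
        pb = rng.choices(blocks, weights=weights, k=1)[0]
        # Uniform choice of a parent inside the selected block.
        edges[cid] = rng.choice(parents_by_block[pb])
    return edges

# Hypothetical example: two blocks with no cross-block links.
parent_blocks = {"c1": 0, "c2": 0, "c3": 1}
child_blocks = {"o1": 0, "o2": 0, "o3": 1, "o4": 1}
affinity = [[1.0, 0.0],
            [0.0, 1.0]]
edges = sample_fk_edges(parent_blocks, child_blocks, affinity, seed=0)
```

Because every child is assigned exactly one existing parent, referential integrity holds by construction. Attributes would then be generated in a separate step, conditioned on this sampled structure.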

For attribute synthesis, RelDiff uses a diffusion model: noise is added progressively to the data in a forward process, and a GNN-based denoiser learns the reverse process used to generate synthetic attributes. Because the denoiser operates over the foreign key graph, the joint model respects relational dependencies and aims to reproduce the statistical properties of the original data.
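
The forward (noising) half of a diffusion process can be sketched for a single numerical column as follows. This uses the standard variance-preserving Gaussian formulation with a linear noise schedule, which is an assumption made for illustration; the summary does not specify which noise schedule RelDiff uses or how categorical attributes are handled. The function names and the example values are hypothetical.

```python
import math
import random

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances, increasing linearly (assumed schedule)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    alpha_bars, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        alpha_bars.append(prod)
    return alpha_bars

def q_sample(x0, t, alpha_bars, rng):
    """Draw x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    a = alpha_bars[t]
    return [math.sqrt(a) * x + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0)
            for x in x0]

betas = linear_beta_schedule(1000)
alpha_bars = cumulative_alphas(betas)

rng = random.Random(0)
column = [0.5, -1.2, 0.3, 2.0]                 # standardized numeric attribute
slightly_noisy = q_sample(column, 10, alpha_bars, rng)
almost_noise = q_sample(column, 999, alpha_bars, rng)
```

At small `t` the sample stays close to the original values, while at the final step almost none of the signal remains. A denoiser, which in RelDiff is GNN-based, would be trained to invert these steps while also seeing the attributes of rows connected through foreign keys.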

Practical and Theoretical Implications

From a practical standpoint, RelDiff opens avenues for generating high-quality synthetic relational data for settings where access to real data is restricted by privacy concerns, such as the healthcare and financial industries. The synthesized data can be used for tasks like missing value imputation and data augmentation, supporting model development when the original records cannot be shared.

Theoretically, the paper advances relational data synthesis by integrating diffusion models with GNNs, so that attribute generation is conditioned on the foreign key graph rather than on a single flattened table. This approach lays the groundwork for synthesizing more complex relational structures and helps bridge the gap between data availability and ethical data usage.

Future Directions

The authors hint at several promising directions for future research. Expanding the framework to accommodate larger-scale databases and ensuring scalability remains a key challenge. Moreover, extending the relational synthesis to include provable privacy guarantees could enrich the practical applicability of RelDiff. Additionally, exploring alternative graph generation techniques and diffusion processes might uncover new insights into the nuances of relational data modeling.

In conclusion, RelDiff offers a blueprint for synthetic relational database generation that combines an SBM-based graph generator with graph-conditioned diffusion models and neural networks. Its ability to maintain high fidelity and referential integrity makes it relevant to domains where real relational data cannot be shared directly.