SynthGuard: Redefining Synthetic Data Generation with a Scalable and Privacy-Preserving Workflow Framework (2507.10489v1)

Published 14 Jul 2025 in cs.CR

Abstract: The growing reliance on data-driven applications in sectors such as healthcare, finance, and law enforcement underscores the need for secure, privacy-preserving, and scalable mechanisms for data generation and sharing. Synthetic data generation (SDG) has emerged as a promising approach but often relies on centralized or external processing, raising concerns about data sovereignty, domain ownership, and compliance with evolving regulatory standards. To overcome these issues, we introduce SynthGuard, a framework designed to ensure computational governance by enabling data owners to maintain control over SDG workflows. SynthGuard supports modular and privacy-preserving workflows, ensuring secure, auditable, and reproducible execution across diverse environments. In this paper, we demonstrate how SynthGuard addresses the complexities at the intersection of domain-specific needs and scalable SDG by aligning with requirements for data sovereignty and regulatory compliance. Developed iteratively with domain expert input, SynthGuard has been validated through real-world use cases, demonstrating its ability to balance security, privacy, and scalability while ensuring compliance. The evaluation confirms its effectiveness in implementing and executing SDG workflows and integrating privacy and utility assessments across various computational environments.

Summary

The paper presents a modular, privacy-preserving workflow that enforces data sovereignty and regulatory compliance for synthetic data generation.
It combines container orchestration tools (Kubernetes, Kubeflow Pipelines) with built-in utility and privacy evaluations so that workflows run reproducibly and at scale.
Empirical validation on datasets from 1K to 100K rows demonstrates sublinear runtime scaling and practical efficiency in diverse, regulated environments.

SynthGuard: A Scalable and Privacy-Preserving Workflow Framework for Synthetic Data Generation

SynthGuard presents a comprehensive framework for synthetic data generation (SDG) that addresses the operational, legal, and technical challenges inherent in privacy-sensitive domains such as healthcare, finance, and law enforcement. The framework is motivated by the increasing demand for data-driven innovation under stringent regulatory constraints, particularly in the European context, where data sovereignty and compliance with GDPR and related directives are paramount.

Architectural Principles and Requirements

SynthGuard is grounded in the principles of domain ownership, computational governance, and data sovereignty, drawing inspiration from the Data Mesh paradigm. The requirements elicitation process, conducted across multiple EU research projects (LAGO and TEADAL), identified a set of cross-domain needs:

Anonymization and secure data generation to protect sensitive information.
Data sovereignty, ensuring data owners retain control and avoid off-premises exposure.
Compliance validation for both privacy and utility before data sharing.
Standardization and interoperability of pipeline specifications.
Scalability and modularity to support large-scale, flexible SDG workflows.

SynthGuard’s architecture is designed to ensure that all SDG processes, from specification to evaluation, are executed within the data owner’s controlled environment. This approach minimizes privacy risks associated with data transfer and external processing, and aligns with evolving legal interpretations of anonymization and data protection.

Technical Implementation

SynthGuard operationalizes its architectural model through a modular, containerized workflow system built on Nix, Kubernetes, and Kubeflow Pipelines, with Argo Workflows as the pipeline specification standard. The core components include:

Pipeline Specification: SDG workflows are defined as directed acyclic graphs (DAGs) of modular components, each encapsulating a specific function (e.g., data loading, preprocessing, generation, evaluation). These are implemented in Python and R, and packaged for portability and reproducibility.
Deployment Models: The framework supports both on-premises and cloud-based deployments, with options for container- or VM-based isolation. This flexibility allows data owners to tailor security and scalability according to their infrastructure and regulatory requirements.
Evaluation Mechanisms: SynthGuard integrates utility and privacy assessments directly into the pipeline. Utility is measured via univariate, bivariate, and population-level metrics (e.g., propensity score MSE, Kolmogorov-Smirnov distance), while privacy is evaluated using metrics such as CategoricalCAP, NewRowSynthesis, and inference attack scores.
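
The DAG structure described under Pipeline Specification can be illustrated with a minimal sketch. It is not SynthGuard's actual pipeline definition, which the source says is expressed through Kubeflow Pipelines and Argo Workflows. The step names and the `execution_stages` helper are hypothetical; only Python's standard `graphlib` module is used.

```python
from graphlib import TopologicalSorter

# Hypothetical step names (assumption, not taken from the paper).
# Each step maps to the set of steps it depends on.
PIPELINE = {
    "load": set(),
    "preprocess": {"load"},
    "generate": {"preprocess"},
    "evaluate_utility": {"preprocess", "generate"},
    "evaluate_privacy": {"preprocess", "generate"},
    "report": {"evaluate_utility", "evaluate_privacy"},
}


def execution_stages(dag):
    """Group steps into ordered stages; steps in one stage do not depend on each other."""
    sorter = TopologicalSorter(dag)
    sorter.prepare()  # raises graphlib.CycleError if the graph contains a cycle
    stages = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # every step whose dependencies are done
        stages.append(ready)
        sorter.done(*ready)
    return stages


if __name__ == "__main__":
    for index, stage in enumerate(execution_stages(PIPELINE)):
        print(index, stage)
```

In this sketch the two evaluation steps land in the same stage, which is the property that lets a workflow engine schedule them side by side.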

The framework does not introduce new SDG models but provides a standardized, auditable environment for integrating existing methods (e.g., CTGAN, rule-based generators) into compliant workflows.

Validation and Empirical Results

SynthGuard was validated across six use cases in the LAGO and TEADAL projects, encompassing domains with diverse operational and regulatory requirements. The validation process included:

Iterative development with domain expert input, ensuring alignment with legal and technical constraints.
Deployment in local, on-premises, and cloud environments, demonstrating adaptability and compliance with data sovereignty mandates.
Scalability benchmarking: On datasets ranging from 1K to 100K rows, total pipeline runtime scaled sublinearly: 1.6 min for 1K rows and 16 min for 100K rows on a VM with 8 vCPUs and 12 GB RAM, so a 100-fold increase in rows raised runtime roughly 10-fold. Privacy and quality evaluations dominated runtime at scale, but parallel execution via Kubeflow mitigated this bottleneck.
Comprehensive reporting: Each pipeline execution produced detailed utility and privacy reports, supporting compliance verification and auditability.
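
To make the sublinear-scaling claim concrete, the two runtimes reported above (1.6 min at 1K rows, 16 min at 100K rows) can be turned into a rough scaling exponent. The power-law form runtime ~ rows^k is an assumption made here for illustration, and two data points are not enough to confirm it. The function name `scaling_exponent` is hypothetical.

```python
import math


def scaling_exponent(rows_small, minutes_small, rows_large, minutes_large):
    """Estimate k in runtime ~ rows**k from two (rows, runtime) measurements."""
    return math.log(minutes_large / minutes_small) / math.log(rows_large / rows_small)


if __name__ == "__main__":
    k = scaling_exponent(1_000, 1.6, 100_000, 16.0)
    print(round(k, 3))  # prints 0.5; a value below 1 means sublinear growth
```

Under that assumption the exponent is about 0.5, so runtime grows roughly with the square root of the row count between the two reported endpoints.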

SynthGuard’s modular design enabled concurrent execution of evaluation tasks, supporting efficient scaling without compromising privacy or utility assessments.
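
The idea of running independent evaluation tasks concurrently can be sketched as follows. This is an illustration only: in SynthGuard the concurrency comes from separate pipeline steps scheduled by Kubeflow, whereas the thread pool here merely shows that the two tasks share no state. `ks_distance` is a plain two-sample Kolmogorov-Smirnov statistic. `new_row_share` is a simplified stand-in for a new-row style privacy check and is not the implementation behind the NewRowSynthesis metric named earlier. All function names and the toy data are hypothetical.

```python
from bisect import bisect_right
from concurrent.futures import ThreadPoolExecutor


def ks_distance(real, synthetic):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between the empirical CDFs."""
    real, synthetic = sorted(real), sorted(synthetic)
    return max(
        abs(bisect_right(real, x) / len(real) - bisect_right(synthetic, x) / len(synthetic))
        for x in set(real) | set(synthetic)
    )


def new_row_share(real_rows, synthetic_rows):
    """Fraction of synthetic rows that do not appear verbatim in the real data."""
    seen = set(real_rows)
    return sum(row not in seen for row in synthetic_rows) / len(synthetic_rows)


def evaluate(real_rows, synthetic_rows, column):
    """Run one utility check and one privacy check as independent tasks."""
    real_col = [row[column] for row in real_rows]
    syn_col = [row[column] for row in synthetic_rows]
    with ThreadPoolExecutor(max_workers=2) as pool:
        utility = pool.submit(ks_distance, real_col, syn_col)
        privacy = pool.submit(new_row_share, real_rows, synthetic_rows)
        return {"ks_distance": utility.result(), "new_row_share": privacy.result()}


if __name__ == "__main__":
    real = [(1, "a"), (2, "b"), (3, "c"), (4, "d")]        # toy data, rows as tuples
    synthetic = [(1, "a"), (2, "x"), (5, "c"), (6, "d")]
    print(evaluate(real, synthetic, column=0))
```

Because neither task reads the other's result, the same two functions could just as well run as separate containerized steps, which is closer to how the source describes the framework.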

Implications and Future Directions

SynthGuard’s approach has several practical and theoretical implications:

Operationalization of Data Sovereignty: By localizing all SDG processes, SynthGuard provides a concrete mechanism for enforcing data sovereignty and computational governance, addressing a critical gap in existing SDG solutions.
Standardization and Reproducibility: The use of declarative pipeline specifications and containerized components facilitates reproducibility, auditability, and cross-domain interoperability, supporting the emergence of synthetic data marketplaces and federated data ecosystems.
Compliance and Trust: Integrated privacy and utility evaluations, combined with auditable workflows, enhance trust and facilitate compliance with evolving regulatory standards.

The framework’s current limitations include its focus on single-table scenarios and selected use cases. Future work will extend support to multi-table relational data, integrate advanced compliance validation (e.g., secure multi-party computation, trusted execution environments), and adopt dataspace protocols for broader interoperability. There is also scope for automating workflow generation and optimizing performance via hardware acceleration.

Conclusion

SynthGuard offers a scalable, privacy-preserving, and governance-oriented approach to deploying synthetic data generation workflows in practice. Its modular, programmable architecture and its validation across real-world use cases make it a credible foundation for responsible data sharing in regulated environments. As data-sharing ecosystems evolve, frameworks like SynthGuard can help balance innovation with privacy, security, and compliance.
