
Knowledge-based Synthetic Data

Updated 2 March 2026
  • Knowledge-based synthetic data is artificially generated by explicitly integrating structured domain knowledge such as ontologies, rules, and knowledge graphs to enhance data validity and compliance.
  • Methodological approaches include multi-modal routing, knowledge infusion in GANs, rule-adhering generation, and physics-based simulation to ensure both statistical fidelity and regulatory compliance.
  • These techniques address challenges like data scarcity, privacy constraints, and long-tail scenario coverage, leading to improved utility in applications ranging from medical imaging to cybersecurity.

Knowledge-based synthetic data refers to artificial data generated by explicitly incorporating structured domain knowledge—such as ontologies, knowledge graphs, annotated rules, physical models, or metadata—into the synthesis process, rather than relying solely on empirical data distributions or unconstrained generative models. This paradigm addresses challenges of data scarcity, privacy constraints, lack of coverage for long-tail or rare scenarios, and the necessity for downstream utility and compliance in specialized tasks. Recent research demonstrates that leveraging domain knowledge not only corrects spurious correlations and enforces rule compliance but also enhances robustness, privacy, and transfer learning capabilities across tabular, textual, visual, and multimodal domains.

1. Architectures and Paradigms for Knowledge-Based Synthesis

Knowledge-based synthetic data generation encompasses a broad spectrum of architectural strategies:

  • Multi-modal routing with domain resources: RouteNator combines a weighted router, text-to-text, and vision-to-text LLMs, explicitly integrating domain metadata and structured knowledge graphs (KGs) for multi-faceted data creation. The weighted router samples generative “routes” so that synthetic data matches joint distributions over function type, content type, and length bins observed in real-world usage. Heuristic, text-based, and visually grounded routes are invoked as appropriate, with domain metadata injected into prompts or used for template composition (Belavadi et al., 15 May 2025).
  • Knowledge infusion into generative deep learning models: KIPPS extends conditional WGANs by appending domain-context and knowledge-infusion layers. Structured KGs map feature values and domain rules to rule-indicator vectors and property masks, which are concatenated with standard tabular inputs. A dedicated knowledge regularizer in the objective enforces output compliance with regulatory/domain constraints in addition to adversarial and DP-critic losses (Kotal et al., 2024).
  • Knowledge-guided adversarial training: KiNETGAN fuses a standard conditional GAN with a two-headed discriminator, whereby the additional D_KG component queries a knowledge graph reasoner for rule compliance (e.g., protocol-port consistency). Generator penalties are imposed for semantic violations, ensuring that generated network activity records are both statistically plausible and domain-valid (Kotal et al., 2024).
  • Rule-adherence in structured data synthesis: Rule-adhering generators introduce soft and/or hard constraints in the generative loss, enforcing domain expertise (e.g., encoding legal or feasible value combinations) either via penalty terms or rejection sampling, thus ensuring downstream safety and compliance (Platzer et al., 2022).
  • Physics-based and biophysical simulation: In medical imaging, frameworks such as S-SYNTH and T-SYNTH generate synthetic skin or breast images by simulating multi-layer tissue architectures, lesion growth, light or X-ray transport, and clinically meaningful anatomical variations. These deterministic or stochastic models encode both expert knowledge and biophysical process priors, producing images with precise annotation masks for downstream supervised learning (Kim et al., 2024, Wiedeman et al., 5 Jul 2025).
  • Knowledge graph-driven QA/data augmentation: GraphGen, SK-VQA, and SoG frameworks build fine-grained knowledge graphs from textual corpora or multi-modal sources, identify explicit knowledge gaps via calibration error metrics, and generate QA pairs or contextual samples targeting under-represented facts, long-tail relations, or cross-document associations (Chen et al., 26 May 2025, Su et al., 2024, Jiang et al., 2 May 2025).
  • Federated privacy-preserving synthesis: FDKT leverages differentially private few-shot demonstrations extracted from sensitive local data via DP-SGD, which are then used to guide LLM-based data augmentation at a server, enabling domain-specific knowledge transfer without exposing raw samples (Li et al., 2024).
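The weighted-router idea in the first bullet can be illustrated with a minimal sketch. It assumes a small, invented joint distribution over (function type, content type) cells and a hypothetical mapping from each cell to a generation route. The names `TARGET`, `ROUTE_FOR`, and `sample_routes` are illustrative and do not come from the cited paper.

```python
import random
from collections import Counter

# Hypothetical target joint distribution over (function type, content type).
# The probabilities are illustrative, not figures from the cited work.
TARGET = {
    ("generate", "text"): 0.5,
    ("generate", "image"): 0.3,
    ("edit", "image"): 0.2,
}

# Hypothetical mapping from a sampled cell to the route that would produce it.
ROUTE_FOR = {
    ("generate", "text"): "text_to_text_llm",
    ("generate", "image"): "vision_to_text_llm",
    ("edit", "image"): "heuristic_template",
}


def sample_routes(n, seed=0):
    """Draw n (cell, route) pairs with probability proportional to TARGET."""
    rng = random.Random(seed)
    cells = list(TARGET)
    weights = [TARGET[c] for c in cells]
    drawn = rng.choices(cells, weights=weights, k=n)
    return [(cell, ROUTE_FOR[cell]) for cell in drawn]


def empirical_distribution(samples):
    """Fraction of samples that fell into each target cell."""
    counts = Counter(cell for cell, _ in samples)
    return {cell: counts[cell] / len(samples) for cell in TARGET}


samples = sample_routes(10_000)
empirical = empirical_distribution(samples)
```

With enough draws, `empirical` approaches `TARGET`, which is the property the router is meant to provide. A real system would call the selected generator for each sampled route rather than only recording its name.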

2. Formalization and Integration of Structured Knowledge

Knowledge encoding and integration methods vary by data domain and task:

  • Knowledge Graphs (KG): Represented as directed graphs $\mathcal{G} = (V,E)$, where nodes encapsulate entities, actions, or semantic classes, and edges encode expert-defined relations or co-occurrences. KGs serve as sources for prompt engineering, random walks (for compositional sampling), or explicit rule checks within discriminators or loss functions (Belavadi et al., 15 May 2025, Kotal et al., 2024, Kotal et al., 2024, Chen et al., 26 May 2025, Jiang et al., 2 May 2025).
  • Domain Metadata: Structured as JSON-like key–value pairs annotating each asset, providing inputs for prompt templates in text or multi-modal LLMs, or input feature maps for deep networks (Belavadi et al., 15 May 2025).
  • Rule Constraints and Domain Logic: Encoded as indicator functions $C_k(x)$, mapping synthesized records to binary violation flags based on domain logic (e.g., permitted attribute combinations, medically plausible parameter ranges). Enforced through soft penalties in the objective, hard rejection during sampling, or as validity queries in graph reasoners (Platzer et al., 2022).
  • Simulation Models: Explicit mathematical and biophysical models encapsulate knowledge of anatomy, physics, or domain dynamics, parameterized according to literature priors or empirical ranges (e.g., photon attenuation in tissues, lesion growth via advection–reaction–diffusion PDEs) (Kim et al., 2024, Wiedeman et al., 5 Jul 2025).
  • Calibration Error and Coverage Metrics: In knowledge-rich QA, expected calibration error (ECE) and comprehension loss $\mathcal{L}_{C_{R_i}}$ are measured per fact or edge in a KG, guiding prioritization towards inadequately modeled knowledge (Chen et al., 26 May 2025).
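As a rough illustration of the knowledge-graph bullet, the sketch below stores a toy directed graph as an adjacency list, takes a short random walk over it, and turns the visited triples into a prompt. The entities, relations, and the prompt wording are invented for the example. None of the cited frameworks is claimed to work exactly this way.

```python
import random

# Toy directed knowledge graph G = (V, E): node -> list of (relation, target).
# Entities and relations are invented for illustration only.
KG = {
    "melanoma": [("is_a", "skin_lesion"), ("imaged_by", "dermoscopy")],
    "skin_lesion": [("located_in", "epidermis")],
    "dermoscopy": [("uses", "polarized_light")],
    "epidermis": [],
    "polarized_light": [],
}


def random_walk(graph, start, max_hops, rng):
    """Follow outgoing edges from start for at most max_hops steps."""
    path = []
    node = start
    for _ in range(max_hops):
        edges = graph.get(node, [])
        if not edges:  # dead end: stop early
            break
        relation, target = rng.choice(edges)
        path.append((node, relation, target))
        node = target
    return path


def path_to_prompt(path):
    """Render the walked triples as a (hypothetical) multi-hop QA prompt."""
    facts = "; ".join(f"{h} {r.replace('_', ' ')} {t}" for h, r, t in path)
    return f"Write a question that requires combining these facts: {facts}."


rng = random.Random(0)
walk = random_walk(KG, "melanoma", max_hops=2, rng=rng)
prompt = path_to_prompt(walk)
```

Because each step continues from the previous target, the resulting triples chain together, which is what makes such walks useful for composing multi-hop samples.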

3. Optimization Objectives and Constraint Enforcement

Knowledge-based synthetic data generation employs compound objective functions to jointly optimize statistical realism, knowledge compliance, and (where relevant) privacy:

  • Soft knowledge penalties: Generator loss includes cross-entropy or binary penalties comparing output indicators to rule-specified masks (e.g., $L_K = E_{z,c}[H(y_{KG}(c), \hat{y})]$) (Kotal et al., 2024).
  • Hard constraints and rejection sampling: A sample is retained only if it satisfies all domain rules ($\sum_k C_k(\hat{x}) = 0$), guaranteeing strict compliance. When used in isolation, this may require over-sampling (Platzer et al., 2022).
  • Knowledge-informed adversarial regularization: Additional generator penalties actively discourage semantically invalid outputs (e.g., $E_{z,C}[1 - Q(G_C(z,C))]$, where $Q$ is the KG-validity query) (Kotal et al., 2024).
  • Causal/compositional constraints: CoInD introduces a Fisher divergence penalty enforcing that the conditional data distribution can be factorized across mutually independent domain attributes, critical for synthesizing unattested combinations and counterfactuals (e.g., rare subpopulation tuples) (Gaudi et al., 6 Mar 2025).
  • Differential privacy: DP constraints are enforced via per-example gradient clipping and Gaussian noise addition within the discriminator updates (DP-SGD), with total privacy cost tracked per standard mechanisms (Kotal et al., 2024, Li et al., 2024).
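The soft-penalty and hard-constraint bullets above can be sketched with rule indicator functions C_k that return 1 on a violation. The two rules, the record fields, and `toy_generator` are assumptions made for illustration. The penalty here is a plain violation rate. Gradient-based training would need a differentiable surrogate such as the cross-entropy term described above.

```python
import random

# Hypothetical rules for network-flow records; each C_k returns 1 on violation.
RULES = [
    lambda r: int(r["service"] == "https" and r["port"] != 443),  # service/port mismatch
    lambda r: int(r["duration"] < 0),                              # negative duration
]


def violations(rec):
    """Sum of C_k(rec) over all rules; 0 means the record is fully compliant."""
    return sum(rule(rec) for rule in RULES)


def soft_penalty(batch, weight=1.0):
    """Weighted mean violation count over a batch (a non-differentiable stand-in)."""
    return weight * sum(violations(r) for r in batch) / len(batch)


def toy_generator(rng):
    """Stand-in for a trained generator; deliberately emits some invalid records."""
    return {
        "service": rng.choice(["https", "dns"]),
        "port": rng.choice([443, 53, 8080]),
        "duration": rng.uniform(-1.0, 10.0),
    }


def rejection_sample(n, rng, max_draws=100_000):
    """Keep drawing until n compliant records are collected or the budget runs out."""
    kept, draws = [], 0
    while len(kept) < n and draws < max_draws:
        rec = toy_generator(rng)
        draws += 1
        if violations(rec) == 0:
            kept.append(rec)
    return kept, draws


rng = random.Random(0)
raw_batch = [toy_generator(rng) for _ in range(1000)]
penalty = soft_penalty(raw_batch)
valid, draws = rejection_sample(200, rng)
```

The gap between `draws` and `len(valid)` makes the over-sampling cost of pure rejection concrete: the less compliant the generator, the more candidates are discarded, which is one reason to combine rejection with a penalty during training.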

4. Distributional Alignment and Quality Assurance

Robust knowledge-based synthetic data must reflect true data distributions and preserve utility for downstream tasks.

5. Application Domains and Empirical Impact

Knowledge-based synthetic data approaches have demonstrated state-of-the-art results in diverse settings:

  • Function calling LLMs: Router-based synthetic data generation using KG and metadata enables robust fine-tuning of function-calling models, achieving $F_1$ up to 0.881 and content-type accuracy up to 0.756 vs. 0.239–0.676 for traditional or heuristic-only approaches (Belavadi et al., 15 May 2025).
  • Tabular data with domain constraints: KIPPS enforces both statistical fidelity and hard regulatory compliance under DP, yielding tabular synthetic data with strict rule satisfaction and high downstream utility (ML accuracy within 0.02 of the real-data baseline) while lowering vulnerability to membership or attribute inference attacks (Kotal et al., 2024).
  • Network security: Knowledge-infused GANs in intrusion detection preserve valid protocol-rule relations, improving NIDS F1 by 3–5 points compared to plain GANs or CTGAN and reducing the effectiveness of privacy attacks (Kotal et al., 2024).
  • Multimodal QA and context-augmented RAG: SK-VQA and GraphGen demonstrate that KG-based QA generation and context sampling yield datasets with greater question diversity and topical coverage, which, in turn, improve generalization across domains without compromising in-domain performance. GraphGen’s calibration-driven generation yields up to +4.73 ROUGE-F improvement over best baselines in multi-hop QA (Chen et al., 26 May 2025, Su et al., 2024).
  • Medical imaging under biophysical priors: S-SYNTH and T-SYNTH systematically control anatomical and physical parameters, enabling the synthesis of high-fidelity images with exhaustive annotation support. Augmenting limited real datasets with these samples yields consistent improvements (e.g., +2.8% Dice for skin lesion segmentation, +7% sensitivity for mammography detection), and synthetic evaluation mirrors known biases in the real world (Kim et al., 2024, Wiedeman et al., 5 Jul 2025).
  • Compositional robustness: CoInD’s Fisher divergence penalty restores compositional generalization, yielding worst-group accuracy of 80.7% on CelebA—outperforming both standard synthetic and real-data ERM under partial support (Gaudi et al., 6 Mar 2025).
  • System identification: Knowledge transfer from similar systems, modeled via a pre-trained meta-model, allows generation of synthetic trajectories that regularize downstream system identification. Regularization by synthetic data increases test R² from 0.889 to 0.956 under data scarcity (Piga et al., 2024).

6. Limitations, Open Questions, and Best Practices

Knowledge-based approaches rely critically on the quality and expressiveness of the encoded domain knowledge:

  • Coverage and expressivity: If rules, KGs, or simulation models do not fully capture the real-world domain, synthetic data may perpetuate or introduce new biases. Selection of rules, attribute sets, and KGs must be empirically validated (Platzer et al., 2022, Gaudi et al., 6 Mar 2025).
  • Hyperparameter sensitivity: Weighting of knowledge penalties, path sampling parameters in graph-guided generation, and DP budgets all require validation-based tuning to avoid mode collapse, over-regularization, or data utility loss (Kotal et al., 2024, Jiang et al., 2 May 2025, Li et al., 2024).
  • Scalability and computation: Fine-grained simulation, Monte Carlo path tracing, and LLM-based knowledge extraction can be computationally demanding but are increasingly tractable on modern hardware (Su et al., 2024, Wiedeman et al., 5 Jul 2025).
  • Data privacy and formal guarantees: Explicit DP mechanisms (gradient noise, composition accounting) must be integrated—privacy by design is essential when synthetic data originate from sensitive domains (Kotal et al., 2024, Li et al., 2024).
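The gradient-noise mechanism named in the last bullet follows the usual DP-SGD pattern: clip each example's gradient to a fixed L2 norm, sum, add Gaussian noise scaled to that norm, and average. The sketch below shows only that single step on toy vectors. The function names and parameter values are illustrative, and privacy accounting across steps is deliberately left out.

```python
import math
import random


def l2_norm(vec):
    return math.sqrt(sum(v * v for v in vec))


def clip(vec, max_norm):
    """Scale vec down so that its L2 norm is at most max_norm."""
    norm = l2_norm(vec)
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [v * scale for v in vec]


def dp_average_gradient(per_example_grads, max_norm, noise_multiplier, rng):
    """Clip each per-example gradient, sum, add Gaussian noise, then average."""
    dim = len(per_example_grads[0])
    clipped = [clip(g, max_norm) for g in per_example_grads]
    summed = [sum(g[i] for g in clipped) for i in range(dim)]
    sigma = noise_multiplier * max_norm  # noise std is tied to the clipping norm
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [x / len(per_example_grads) for x in noisy]


rng = random.Random(0)
grads = [[3.0, 4.0], [0.3, 0.4], [-6.0, 8.0]]  # toy per-example gradients
private_grad = dp_average_gradient(grads, max_norm=1.0, noise_multiplier=1.1, rng=rng)
# Setting the noise multiplier to 0 isolates the effect of clipping alone.
noiseless = dp_average_gradient(grads, max_norm=1.0, noise_multiplier=0.0, rng=rng)
```

Clipping bounds how much any single record can move the update, and the noise masks what remains. The total privacy cost still has to be tracked over all training steps with a separate accountant.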

Best Practices:

  • Curate and validate domain-specific KGs, rule sets, or simulation models.
  • Empirically monitor statistical similarity to real data at all relevant marginals.
  • Apply both soft and hard constraint mechanisms as appropriate to the downstream safety requirements.
  • Optimize knowledge weights via cross-validation or empirical tuning for target tasks.
  • In privacy-sensitive contexts, enforce differential privacy at synthesis or model-update time and track composition budgets.
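One simple way to act on the marginal-monitoring practice above is to compare per-column frequency tables of real and synthetic records with total variation distance. The sketch below assumes categorical columns and uses invented toy records. Other distances, or tests for numeric columns, would be equally reasonable choices.

```python
from collections import Counter


def marginal(rows, column):
    """Empirical distribution of one categorical column."""
    counts = Counter(row[column] for row in rows)
    return {value: n / len(rows) for value, n in counts.items()}


def total_variation(p, q):
    """Total variation distance between two discrete distributions (0 = identical)."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(v, 0.0) - q.get(v, 0.0)) for v in support)


def marginal_report(real, synthetic, columns):
    """Per-column total variation distance between real and synthetic data."""
    return {
        c: total_variation(marginal(real, c), marginal(synthetic, c))
        for c in columns
    }


# Toy records, invented for illustration.
real = [
    {"protocol": "tcp", "label": "benign"},
    {"protocol": "tcp", "label": "attack"},
    {"protocol": "udp", "label": "benign"},
    {"protocol": "udp", "label": "benign"},
]
synthetic = [
    {"protocol": "tcp", "label": "benign"},
    {"protocol": "tcp", "label": "benign"},
    {"protocol": "tcp", "label": "benign"},
    {"protocol": "udp", "label": "attack"},
]
report = marginal_report(real, synthetic, ["protocol", "label"])
```

In this toy case the `label` marginal matches exactly while the `protocol` marginal does not, and the joint distribution of the two columns differs as well. That is why checking one-way marginals alone is not sufficient, and pairwise or task-relevant joint marginals are worth monitoring too.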
