Graph Generating Dependencies (GGDs)
- Graph Generating Dependencies (GGDs) are a formal language that unifies tuple-generating, equality-generating, and differential constraints to specify both topological and attribute-based conditions in property graphs.
- They enable practical applications such as schema discovery, entity resolution, and data quality enhancement by enforcing existence, similarity, and identification constraints.
- Leveraging universal–existential semantics and efficient mining algorithms, GGDs underpin generative modeling and realistic graph synthesis while addressing complex reasoning challenges.
Graph Generating Dependencies (GGDs) provide a unified formalism for expressing, reasoning about, and generating constraints over graph-structured data, particularly within the property-graph data model. GGDs subsume tuple-generating, equality-generating, and differential constraints, supporting both topological and attribute-based requirements. Their impact spans graph database constraint languages, data quality and entity resolution frameworks, and foundational models of dependency structure in generative and learning settings.
1. Formal Definition and Syntax of GGDs
A GGD over property graphs is a statement of the form

(Q_s[x̄], δ_s) → (Q_t[x̄, ȳ], δ_t)

where:
- Q_s[x̄] and Q_t[x̄, ȳ] are graph patterns, i.e., small labeled, directed graphs whose nodes and/or edges are variables.
- δ_s, δ_t are finite sets of differential constraints on the variables, including:
- δ_A(x.A, c) ≤ t_A: a property value is within threshold t_A (under distance function δ_A) of a constant c,
- δ(x.A₁, y.A₂) ≤ t: two property values are similar,
- x.id = y.id or x.id ≠ y.id: identification or non-identification constraints.
A property graph G = (V, E, η, λ, ν) provides vertices V, edges E, an endpoint function η, a label function λ, and a property function ν. A homomorphism h maps pattern variables to graph elements respecting structure, directionality, and labels (a wildcard label is permitted).
Satisfaction: G ⊨ (Q_s, δ_s) → (Q_t, δ_t) iff for every source match h_s of Q_s with G, h_s ⊨ δ_s, there exists an extension h_t matching Q_t (agreeing with h_s on x̄) such that G, h_t ⊨ δ_t (Shimomura et al., 2020, Shimomura et al., 2022).
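To make the differential constraints concrete, the following sketch encodes the three constraint kinds as tuples and evaluates them against a variable binding. The tuple encoding and the `edit_distance` helper are illustrative choices for this sketch, not the papers' implementation.

```python
# Minimal sketch of GGD-style differential constraints (illustrative encoding).

def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def holds(constraint, binding) -> bool:
    """Evaluate one differential constraint against a match (binding)."""
    kind = constraint[0]
    if kind == "const":   # delta_A(x.A, c) <= t: value close to a constant
        _, var, attr, c, t = constraint
        return edit_distance(str(binding[var][attr]), str(c)) <= t
    if kind == "sim":     # delta(x.A1, y.A2) <= t: two property values similar
        _, v1, a1, v2, a2, t = constraint
        return edit_distance(str(binding[v1][a1]), str(binding[v2][a2])) <= t
    if kind == "id":      # x.id = y.id: identification constraint
        _, v1, v2 = constraint
        return binding[v1]["id"] == binding[v2]["id"]
    raise ValueError(f"unknown constraint kind: {kind}")

binding = {"x": {"id": 1, "name": "Jon Doe"}, "y": {"id": 2, "name": "John Doe"}}
assert holds(("sim", "x", "name", "y", "name", 1), binding)  # one edit apart
assert not holds(("id", "x", "y"), binding)                  # distinct entities
```

A set δ_s then holds for a binding iff every constraint in it does, e.g. `all(holds(c, binding) for c in delta_s)`.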
GGDs strictly generalize previous dependency and constraint formalisms on property graphs: they unify and extend Graph Functional Dependencies (GFDs), Graph Entity Dependencies (GEDs), and Graph Differential Dependencies (GDDs), and they bring tuple-generating dependencies (TGDs) to the property-graph context (Shimomura et al., 2020, Shimomura et al., 2022).
2. Semantic and Reasoning Framework
GGDs operate under a universal–existential (∀∃) semantics: every match of the source side must be witnessed by at least one match of the target side. This supports both:
- Tuple-generating dependencies (TGD): “Whenever a source pattern occurs (subject to value constraints), a target pattern must also occur (possibly introducing fresh vertices or edges).”
- Equality-generating dependencies (EGD): “Whenever the source constraints hold, certain nodes/edges must be identified.”
Validation asks whether G ⊨ Σ for a set Σ of GGDs. The problem is Π₂^p-complete in general, due to the universal–existential alternation over source and target matches. For patterns of bounded treewidth, data complexity becomes polynomial (Shimomura et al., 2020, Shimomura et al., 2022, Shimomura et al., 2024).
Other reasoning tasks:
- Satisfiability: Does there exist a graph G such that G ⊨ Σ? Undecidable in general; coNP for weakly acyclic GGDs.
- Implication: Given a set Σ and a single GGD σ, does every model of Σ also satisfy σ? This is coNP-complete under consistency assumptions (Shimomura et al., 2022).
Repair and entity resolution: Violations of a GGD can be “repaired” by generating missing vertices/edges or merging entities as dictated by the target side of the dependency. This yields a declarative basis for graph cleaning and enrichment workflows.
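A repair step of the kind described can be sketched as follows: for each source match lacking a valid target extension, materialize the edge dictated by the target side. The triple-based graph encoding and the callbacks are illustrative assumptions of this sketch.

```python
# Sketch: repair GGD violations by generating the missing target edges.
# A graph is a set of (src, label, dst) triples; an illustrative encoding.

def repair(graph, source_matches, has_target, make_edge):
    """For every source match without a valid target extension,
    generate the edge dictated by the GGD's target side."""
    added = set()
    for m in source_matches:
        if not has_target(graph, m):
            added.add(make_edge(m))
    return graph | added

graph = {(1, "knows", 2)}
matches = [(1, 2), (1, 3)]          # source matches (pairs of entities)
repaired = repair(
    graph,
    matches,
    has_target=lambda g, m: (m[0], "sameAs", m[1]) in g,  # target check
    make_edge=lambda m: (m[0], "sameAs", m[1]),           # target generation
)
assert (1, "sameAs", 2) in repaired and (1, "sameAs", 3) in repaired
```

Merging entities (the EGD case) would instead rewrite node identifiers, but follows the same violation-driven loop.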
3. Expressivity and Applications
GGDs subsume classical graph dependencies and provide fine-grained control over graph structure and attribute-based constraints:
- Existence constraints: Enforce the presence of patterns (paths, cliques, etc.) conditional on attributes.
- Similarity/join constraints: Impose (possibly thresholded) similarities between node/edge properties (edit-distance, numerics, etc.).
- Identification/merging: Unify entities according to equality/distance conditions.
Case studies include:
- Entity resolution: e.g., generating a sameAs edge or merging person nodes when names and birthdates approximately match (Shimomura et al., 2020).
- Schema discovery and profiling: Mining GGDs from data enables reconstructing frequent topological/attribute patterns and correlations, giving users both schema-level and value-level insights (Shimomura et al., 2024).
- Data quality and cleaning: GGDs with confidence or support below 1 flag potential repair points, automating the search for missing or anomalous relationships.
GGDs also underpin models and principles in generative modeling, learning theory, and software graph analysis (Chanpuriya et al., 2023, Abélès et al., 2024, Li et al., 2024, 0802.2306, Musco et al., 2014).
4. GGDs in Graph Generative Modeling and Statistical Inference
Recent work generalizes the notion of GGDs beyond constraint satisfaction to govern dependencies in graph generation and statistical modeling.
Graph Generative Model Dependency Hierarchies
A three-level hierarchy characterizes generative models by edge dependencies (Chanpuriya et al., 2023):
- Edge-independent (EI): Each edge is generated independently (e.g., Erdős–Rényi, Stochastic Block Model).
- Node-independent (NI): Each node samples an embedding; edges are then independent conditional on embeddings (e.g., VGAE).
- Fully dependent (FD): Arbitrary joint distribution over all edges.
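The first two levels of the hierarchy can be sketched as samplers. The parameterizations below (Bernoulli edge probabilities, Gaussian embeddings with a sigmoid link) are illustrative choices, not any specific paper's models.

```python
import math
import random

def sample_ei(p):
    """Edge-independent (EI): each undirected edge (i, j) is an
    independent Bernoulli draw with probability p[i][j]."""
    n = len(p)
    return {(i, j) for i in range(n) for j in range(i + 1, n)
            if random.random() < p[i][j]}

def sample_ni(n, d=4):
    """Node-independent (NI): sample one latent embedding per node; edges
    are then independent conditional on the embeddings (sigmoid link)."""
    z = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
    def sigmoid(t):
        return 1 / (1 + math.exp(-t))
    return {(i, j) for i in range(n) for j in range(i + 1, n)
            if random.random() < sigmoid(sum(a * b for a, b in zip(z[i], z[j])))}

# Sanity checks: probability 1 yields the complete graph, probability 0 none.
assert sample_ei([[1.0] * 4 for _ in range(4)]) == \
    {(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)}
assert sample_ei([[0.0] * 4 for _ in range(4)]) == set()
```

An FD model has no such factored sampler: it may place arbitrary joint probability on whole edge sets, e.g. by planting entire dense subgraphs at once.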
A central contribution is the overlap–accuracy–diversity tradeoff: overlap measures the expected fraction of edges shared by two independently generated graphs (a proxy for memorization). At any fixed overlap level, the dependency class bounds the achievable density of higher-order motifs (e.g., triangles, cycles): EI models admit the fewest triangles, NI models strictly more, and FD models the most, with the three classes separated by polynomial factors (Chanpuriya et al., 2023).
Dense-subgraph-planted generative models (via maximal clique enumeration) are introduced to attain near-optimal triangle/overlap tradeoffs for each class (Chanpuriya et al., 2023).
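For EI models the overlap statistic has a closed form: two independent samples both contain edge (i, j) with probability p_ij², so the expected fraction of shared edges is Σ p_ij² / Σ p_ij. A sketch under that definition of overlap:

```python
def ei_overlap(p):
    """Expected fraction of shared edges between two independent samples
    of an edge-independent model with edge probabilities p[i][j] (i < j)."""
    probs = [p[i][j] for i in range(len(p)) for j in range(i + 1, len(p))]
    return sum(q * q for q in probs) / sum(probs)

# Erdos-Renyi G(n, q): every edge probability equals q, so overlap = q.
n, q = 6, 0.3
p = [[q] * n for _ in range(n)]
assert abs(ei_overlap(p) - q) < 1e-12
```

This makes the memorization tradeoff visible: pushing individual p_ij toward 0 or 1 (accurately reproducing one graph) drives overlap toward 1 and diversity toward 0.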
Learning Under Graph-Structured Dependencies
A formalism for (G,η)-mixing variables encodes GGDs in statistical learning: dependencies decay quantitatively as a function of graph distance via rates (Abélès et al., 2024). The fractional -chromatic number quantifies effective block-independence. These lead to generalization bounds that blend mixing bias (via ) and statistical variance (via ). This framework subsumes temporal-mixing and graphical-independence bounds, providing a toolkit for online-to-PAC reductions in arbitrarily dependent graph-structured data.
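The role of the chromatic number can be illustrated with a greedy proper coloring, which upper-bounds the fractional chromatic number and partitions the variables into color classes that are mutually non-adjacent in the dependency graph (a sketch only; the bounds themselves use the fractional version):

```python
def greedy_color(adj):
    """Greedy proper coloring of a dependency graph given as adjacency sets.
    The number of colors used upper-bounds the (fractional) chromatic number;
    each color class is a block of mutually non-adjacent variables."""
    color = {}
    for v in sorted(adj, key=lambda u: -len(adj[u])):  # high degree first
        used = {color[u] for u in adj[v] if u in color}
        color[v] = next(c for c in range(len(adj)) if c not in used)
    return color

# 5-cycle: chromatic number 3 (the fractional chromatic number is 5/2).
adj = {0: {1, 4}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3, 0}}
coloring = greedy_color(adj)
assert all(coloring[u] != coloring[v] for u in adj for v in adj[u])
```

Within one color class, variables share no dependency edge, so classical i.i.d.-style concentration applies blockwise; the price paid is the number of blocks.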
Structured Generative Modeling in DAGs
In directed acyclic graph (DAG) generation, GGDs appear as directional dependencies (across layers) and logical dependencies (within layers), as modeled in LayerDAG (Li et al., 2024). Here, autoregressive generation enforces partial order, while intra-layer discrete diffusion models capture arbitrary logical relationships, enabling synthesis of realistic synthetic graphs with high structural fidelity.
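The layerwise principle can be illustrated with a toy generator (not LayerDAG itself; node-count and edge-probability choices here are arbitrary): node counts per layer are drawn sequentially, and edges are only allowed from earlier layers into the newest one, which guarantees acyclicity by construction.

```python
import random

def generate_layered_dag(n_layers, max_width=3, p_edge=0.5,
                         rng=random.Random(0)):
    """Toy layerwise DAG generator: each new layer's nodes may only
    receive edges from previously generated layers, so the output is
    acyclic by construction."""
    layers, edges, next_id = [], set(), 0
    for _ in range(n_layers):
        width = rng.randint(1, max_width)
        layer = list(range(next_id, next_id + width))
        next_id += width
        for earlier in layers:
            for u in earlier:
                for v in layer:
                    if rng.random() < p_edge:
                        edges.add((u, v))
        layers.append(layer)
    return layers, edges

layers, edges = generate_layered_dag(3)
# Node ids increase layer by layer, so every edge respects a topological order.
assert all(u < v for u, v in edges)
```

LayerDAG replaces the uniform draws above with learned autoregressive and diffusion components, but keeps the same acyclicity-by-layering invariant.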
5. Algorithms, Mining, and Practical Implementations
Validation and Reasoning
The standard validation procedure (for a single GGD σ on a graph G):
- Enumerate all matches h_s of the source pattern Q_s.
- For each h_s satisfying δ_s, seek an extension h_t matching Q_t and satisfying δ_t (agreeing with h_s on the shared variables).
- If some h_s has no valid h_t, report a violation; otherwise the GGD is valid (Shimomura et al., 2020, Shimomura et al., 2022, Shimomura et al., 2024).
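This validation loop can be sketched with brute-force homomorphism enumeration. Patterns here are lists of labeled variable edges and differential constraints are omitted for brevity; both the encoding and the exhaustive search are illustrative simplifications.

```python
from itertools import product

def matches(pattern, graph, fixed=None):
    """Brute-force homomorphisms of a pattern (list of (var, label, var)
    edges) into a graph (set of (node, label, node) triples)."""
    variables = sorted({v for s, _, t in pattern for v in (s, t)})
    nodes = {x for s, _, t in graph for x in (s, t)}
    fixed = fixed or {}
    free = [v for v in variables if v not in fixed]
    for combo in product(nodes, repeat=len(free)):
        h = dict(fixed, **dict(zip(free, combo)))
        if all((h[s], lbl, h[t]) in graph for s, lbl, t in pattern):
            yield h

def validate(source, target, graph):
    """G satisfies the GGD iff every source match extends to a target match."""
    return all(any(matches(target, graph, fixed=hs))
               for hs in matches(source, graph))

source = [("x", "knows", "y")]
target = [("x", "sameAs", "y")]
g1 = {("a", "knows", "b")}                       # violation: no sameAs edge
g2 = g1 | {("a", "sameAs", "b")}                 # repaired graph
assert not validate(source, target, g1)
assert validate(source, target, g2)
```

Real engines replace this exhaustive enumeration with indexed pattern joins; the Π₂^p-hard alternation is visible in the nested `all(any(...))`.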
GGD Discovery and Profiling
GGDMiner (Shimomura et al., 2024) is an end-to-end framework for discovering approximate GGDs from property-graph data:
- Preprocessing: Frequent label extraction, attribute-pair selection, and similarity clustering.
- Candidate generation: Lattice expansion over frequent patterns, differential constraints, and their matches.
- GGD extraction: Candidate pairs are scored by confidence—the fraction of source matches validated by some target match—using a compact, factorized "Answer Graph" representation for efficient match enumeration.
- Approximate GGDs: Output rules are filtered by support and confidence thresholds. This enables profiling at both schema and data-level granularity.
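The confidence score described above reduces to a ratio over source matches. In this sketch, `has_target` abstracts away the Answer-Graph lookup, and the support/confidence thresholds are arbitrary illustrative values.

```python
def confidence(source_matches, has_target):
    """Fraction of source matches validated by at least one target match."""
    if not source_matches:
        return 1.0
    return sum(has_target(m) for m in source_matches) / len(source_matches)

source_matches = [1, 2, 3, 4]                 # placeholder match objects
conf = confidence(source_matches, has_target=lambda m: m != 4)
assert conf == 0.75

# A candidate GGD is kept if its support (here: raw match count) and
# confidence both clear user-chosen thresholds.
keep = len(source_matches) >= 3 and conf >= 0.7
assert keep
```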
Empirical studies report high scalability (10×–100× speedup via answer graphs) and high coverage (70–97%) for realistic graph sizes (Shimomura et al., 2024).
Data Cleaning and Inconsistency Reporting
Given a set of GGDs, inconsistencies are identified by reporting all source matches that lack a corresponding valid target extension (via pattern join operations). In practice, anti-join and outer-join strategies suffice for millions of nodes and edges, despite theoretical complexity (Shimomura et al., 2022).
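The anti-join strategy amounts to subtracting the source matches whose key already has a target match. A sketch with in-memory sets (production systems push this into the database's join engine):

```python
def inconsistencies(source_matches, target_matches, key):
    """Anti-join: source matches whose key has no corresponding target match."""
    matched = {key(t) for t in target_matches}
    return [s for s in source_matches if key(s) not in matched]

src = [("a", "b"), ("a", "c"), ("d", "e")]    # all source-side matches
tgt = [("a", "b")]                            # matches with a valid target
assert inconsistencies(src, tgt, key=lambda m: m) == [("a", "c"), ("d", "e")]
```

Each reported pair is a concrete repair point: either the target pattern is generated (TGD case) or the entities are merged (EGD case).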
6. Relationships to Broader Dependency and Generative Models
GGDs interface with several other formalisms:
- Software dependency graphs: As generative models for software evolution, GGDs (in probabilistic or rule-based interpretations) explain in/out-degree distributions and other global graph properties, as in the asymmetry modeled by programmer awareness (0802.2306, Musco et al., 2014).
- Covariance graphs and statistical dependencies: For bi-directed graphical models, dependence statements can be read off exactly via connectivity-based graphical criteria, under the (WTC) graphoid axioms (Peña, 2010).
- Service dependency and microservice architecture: GGDs underlie the random-graph generators for synthesizing realistic microservice topologies, capturing repeated call patterns and interface-level variations, as shown in microservice graph synthesis and resource scaling frameworks (Du et al., 2024).
7. Key Takeaways, Limitations, and Future Directions
GGDs offer a fully declarative, expressive, and semantically rigorous language for structural and value-based constraints in property graphs and network models. They unify the best of database dependency theory, statistical graphical modeling, and practical data-profiling needs.
Limitations:
- General reasoning is computationally hard: validation is Π₂^p-complete and satisfiability is undecidable in general, though both become tractable for patterns of bounded size and treewidth.
- Real-world constraints (cloning, coarsening, temporal evolution) may exceed the expressive reach of basic GGD syntax or require richer forms (copying splits, package-level dependencies) (0802.2306, Musco et al., 2014).
Current and future research directions:
- Discovery of GGDs from data at scale, with automated support selection and confidence estimation (Shimomura et al., 2024).
- Incremental validation and live cleaning in high-throughput graph databases.
- Integration of GGDs in generative models for more accurate motif–diversity and data-driven network simulation (Chanpuriya et al., 2023).
- Study of implication, satisfiability, and repair strategies for richer or domain-specific GGD classes (Shimomura et al., 2022, Shimomura et al., 2020).
- Extension to temporal, multi-graph, or attributed settings and partial-dependence/higher-order dependency hierarchies (Chanpuriya et al., 2023, Abélès et al., 2024).
GGDs thus constitute the foundational formalism for declarative reasoning, synthesis, and profiling in modern graph-structured data applications.
Key References (arXiv ID):
- GGDs: Graph Generating Dependencies (Shimomura et al., 2020)
- Reasoning on Property Graphs with GGDs (Shimomura et al., 2022)
- Discovering GGDs for Property Graph Profiling (Shimomura et al., 2024)
- On the Role of Edge Dependency in Graph Generative Models (Chanpuriya et al., 2023)
- Online-to-PAC Generalization Bounds under Graph-Mixing Dependencies (Abélès et al., 2024)
- LayerDAG: Layerwise Autoregressive Diffusion Model for DAG Generation (Li et al., 2024)
- Software Graphs and Programmer Awareness (0802.2306)
- A Generative Model of Software Dependency Graphs (Musco et al., 2014)
- Reading Dependencies from Covariance Graphs (Peña, 2010)
- A Microservice Graph Generator with Production Characteristics (Du et al., 2024)