Dependency Discovery in Data Systems

Updated 18 March 2026

Dependency discovery is the systematic identification of structural and statistical dependencies within data systems, crucial for optimizing queries, process mining, and knowledge graph reasoning.
It uses formal models and algorithmic principles, such as hypergraph-based enumeration and distributed primitives, to efficiently uncover dependencies such as functional dependencies (FDs), unique column combinations (UCCs), and inclusion dependencies (INDs).
Recent advances include redundancy-driven, incremental, and approximate techniques that address challenges of scalability, noise, and interpretability in diverse applications.

Dependency discovery is the systematic identification of structural or statistical dependencies within or between datasets, systems, or complex objects. Such dependencies encompass a broad range of formalisms, including but not limited to functional dependencies, unique column combinations, inclusion dependencies, order dependencies, file-level and process dependencies, and statistical or causal relationships. Dependency discovery enables fundamental data management operations, query optimization, process mining, software management, module integration, and knowledge graph reasoning.

1. Formal Models and Classes of Dependencies

A dependency formalism specifies a class of constraints or relations among components of a structured system. In relational databases, the primary classes include:

Functional Dependencies (FDs): $X \to A$ holds if, for every pair of tuples $t_1, t_2$, $t_1[X]=t_2[X] \implies t_1[A]=t_2[A]$ (Wan et al., 15 Jan 2026, Bläsius et al., 2021).
Unique Column Combinations (UCCs): $X$ is a UCC if the projection on $X$ yields unique row identifiers (Bläsius et al., 2021).
Inclusion Dependencies (INDs): $R[X] \subseteq S[Y]$ if all values in $R[X]$ appear in $S[Y]$.
Order Dependencies (ODs): $X \mapsto Y$ if sort order on $X$ implies sorted order on $Y$ (Lindner et al., 2024).
Graph Entity Dependencies (GEDs): Rules over matches in property graphs of the form $Q[\bar{x}](X \to Y)$, where $Q[\bar{x}]$ is a graph pattern and $X$, $Y$ are sets of literals over the pattern variables (Zhou et al., 2023).
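
The relational definitions above can be checked directly on a small instance. The following is a minimal Python sketch, assuming relations are represented as lists of dictionaries; the helper names (`holds_fd`, `is_ucc`, `holds_ind`) and the sample data are illustrative and not taken from the cited works.

```python
from typing import Dict, List, Sequence, Tuple

Row = Dict[str, object]


def project(rows: Sequence[Row], attrs: Sequence[str]) -> List[Tuple]:
    """Project every row onto the given attributes (duplicates are kept)."""
    return [tuple(r[a] for a in attrs) for r in rows]


def holds_fd(rows: Sequence[Row], lhs: Sequence[str], rhs: str) -> bool:
    """X -> A holds if rows that agree on X also agree on A."""
    seen: Dict[Tuple, object] = {}
    for r in rows:
        key = tuple(r[a] for a in lhs)
        # setdefault stores the first A-value seen for this X-value.
        if seen.setdefault(key, r[rhs]) != r[rhs]:
            return False
    return True


def is_ucc(rows: Sequence[Row], attrs: Sequence[str]) -> bool:
    """X is a UCC if no two rows agree on all attributes in X."""
    proj = project(rows, attrs)
    return len(set(proj)) == len(proj)


def holds_ind(r_rows: Sequence[Row], x: Sequence[str],
              s_rows: Sequence[Row], y: Sequence[str]) -> bool:
    """R[X] is included in S[Y] if every X-value of R occurs as a Y-value of S."""
    return set(project(r_rows, x)) <= set(project(s_rows, y))


# Illustrative data (hypothetical).
employees = [
    {"id": 1, "dept": "A", "floor": 1},
    {"id": 2, "dept": "A", "floor": 1},
    {"id": 3, "dept": "B", "floor": 2},
]
departments = [{"name": "A"}, {"name": "B"}, {"name": "C"}]
```

On this data, `dept -> floor` holds, `id` is a UCC while `dept` is not, and `employees[dept]` is included in `departments[name]` but not the reverse.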

Process mining, software analysis, and knowledge graph reasoning generalize dependency discovery to event logs (dependency graphs), file-level dependencies (static/dynamic), genetic and protein co-dependencies, and logic-based connectivity relations (Tavakoli-Zaniani et al., 2022, Prokhorenko et al., 2023, Zhang et al., 2020, Silva et al., 6 Mar 2026).

Statistical and causal dependency discovery interprets dependencies as significant departures from independence, expressed as association rules, or as conditional independencies in graphical/causal models (Hämäläinen et al., 2017, Ng et al., 31 Aug 2025, Oyen et al., 2013, Lahti et al., 2011).

2. Algorithmic and Theoretical Foundations

Dependency discovery comprises two related problems: detection (does a dependency exist?) and discovery or enumeration (listing all, or the most useful, dependencies). The computational backbone, especially for minimal FDs and UCCs, is hitting-set enumeration on hypergraphs constructed from tuple or object difference sets (Wan et al., 15 Jan 2026, Bläsius et al., 2021, Xu et al., 22 Jan 2026):

Hypergraph Representation: For FD discovery, the hyperedges correspond to minimal sets of attributes differentiating tuples with different right-hand-side values. Minimal hitting sets correspond to minimal LHSs defining FDs.
Enumeration Hardness: UCC-discovery ≡ minimal hitting set enumeration, FD-discovery ≡ union of transversal hypergraph enumeration, IND-discovery ≡ maximal satisfying assignments of antimonotone 3-normalized Boolean formulas (Bläsius et al., 2021).
Complexity: Detection of UCCs and FDs is W[2]-complete in parameterized complexity; inclusion dependency detection is W[3]-complete. Discovery is as hard as the underlying combinatorial enumeration, and in practice, requires heuristic or special-case optimizations for tractability (Bläsius et al., 2021).
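
The hitting-set view can be made concrete for UCCs: each pair of rows contributes a hyperedge containing the attributes on which the two rows differ, and a column set is a UCC exactly when it intersects every hyperedge. The sketch below is a brute-force illustration in Python, exponential in the number of attributes and intended only for tiny inputs; it is not one of the enumeration algorithms from the cited papers, and the function names are hypothetical.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Sequence, Set

Row = Dict[str, object]


def difference_sets(rows: Sequence[Row], attrs: Sequence[str]) -> Set[FrozenSet[str]]:
    """One hyperedge per row pair: the attributes on which the two rows differ."""
    return {
        frozenset(a for a in attrs if r1[a] != r2[a])
        for r1, r2 in combinations(rows, 2)
    }


def minimal_hitting_sets(edges: Set[FrozenSet[str]],
                         attrs: Sequence[str]) -> List[FrozenSet[str]]:
    """Enumerate minimal hitting sets by increasing size (brute force)."""
    if frozenset() in edges:
        # Two identical rows: no column combination can separate them.
        return []
    found: List[FrozenSet[str]] = []
    for size in range(len(attrs) + 1):
        for cand in combinations(attrs, size):
            c = frozenset(cand)
            # Supersets of an already found solution are not minimal.
            if any(f <= c for f in found):
                continue
            if all(c & e for e in edges):
                found.append(c)
    return found


# Illustrative relation (hypothetical): no single column is unique.
rows = [
    {"a": 1, "b": 1, "c": 1},
    {"a": 1, "b": 2, "c": 2},
    {"a": 2, "b": 1, "c": 2},
]
uccs = minimal_hitting_sets(difference_sets(rows, ["a", "b", "c"]), ["a", "b", "c"])
```

Here the hyperedges are `{b, c}`, `{a, c}`, and `{a, b}`, so the minimal UCCs are the three two-column combinations.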

Distributed dependency discovery relies on decomposing core algorithms into primitives (partitioning, evidence set generation, join, refinement testing, and minimal set cover), accounting for both computation and network communication costs in distributed or parallel systems (Saxena et al., 2019).

Non-relational dependency discovery adapts statistical approaches (hypothesis testing, association measures, multiple-testing control) for robust, statistically sound identification of dependencies that generalize beyond the observed sample (Hämäläinen et al., 2017). Causal dependency discovery employs constraint-based algorithms (e.g., PC), structural equation modeling, and intervention-calculus for uncovering causal, not merely correlational, dependencies (Ng et al., 31 Aug 2025).
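
As an illustration of the statistical approach, the following Python sketch computes a one-sided Fisher exact p-value for a candidate rule $X \to A$ from its contingency counts and then applies a Bonferroni correction over all tested rules. It uses only the standard library; the function names and the example counts are assumptions made for illustration and do not reproduce the procedure of any cited paper.

```python
from math import comb
from typing import List, Sequence


def fisher_right_tail(n_xa: int, n_x: int, n_a: int, n: int) -> float:
    """P(co-occurrence count >= n_xa) under independence.

    n    : total number of rows
    n_x  : rows where the antecedent X holds
    n_a  : rows where the consequent A holds
    n_xa : rows where both hold
    The count follows a hypergeometric distribution under the null.
    """
    upper = min(n_x, n_a)
    total = comb(n, n_x)
    tail = sum(comb(n_a, k) * comb(n - n_a, n_x - k)
               for k in range(n_xa, upper + 1))
    return tail / total


def bonferroni_keep(p_values: Sequence[float], alpha: float = 0.05) -> List[bool]:
    """Keep a rule only if its p-value passes alpha divided by the number of tests."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


# Hypothetical counts: X and A each hold in 5 of 10 rows and always co-occur.
p_strong = fisher_right_tail(n_xa=5, n_x=5, n_a=5, n=10)
kept = bonferroni_keep([p_strong, 0.04], alpha=0.05)
```

With two tested rules the per-rule threshold is 0.025, so the second rule, with an uncorrected p-value of 0.04, is rejected even though it would pass an uncorrected test at 0.05.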

3. Methodological Advances and Recent Algorithms

Redundancy-Driven and Top-$k$ Discovery

Instead of exhaustively enumerating all dependencies, redundancy-driven algorithms, such as SDP, target the $k$ most informative FDs according to a quantitative criterion, the redundancy count, and use anti-monotone upper bounds on that count for pruning (Wan et al., 15 Jan 2026). Optimizations include:

Attribute ordering by partition cardinality for early discovery,
Pairwise partition cardinality matrices for tighter pruning,
Global best-first scheduling across all right-hand-sides.

These techniques are reported to reduce memory and computation on wide/high-cardinality relations (Wan et al., 15 Jan 2026).
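
The idea of best-first top-$k$ search with an anti-monotone bound can be sketched as follows. This is not the SDP algorithm: the redundancy count is assumed here to be $n - |\pi_X(r)|$ (the number of rows minus the number of distinct left-hand-side values), which is one plausible stand-in, and the function names are hypothetical. Under that assumption, adding attributes to $X$ can only lower the score, so the score of $X$ bounds the score of every superset and a single global priority queue across all right-hand sides yields FDs in descending order.

```python
import heapq
from typing import Dict, List, Sequence, Set, Tuple

Row = Dict[str, object]
FD = Tuple[Tuple[str, ...], str, int]  # (lhs, rhs, assumed redundancy score)


def distinct_count(rows: Sequence[Row], attrs: Sequence[str]) -> int:
    return len({tuple(r[a] for a in attrs) for r in rows})


def top_k_fds(rows: Sequence[Row], attrs: Sequence[str], k: int) -> List[FD]:
    n = len(rows)
    # Heap entries: (-score, |lhs|, lhs, rhs); ties favour smaller lhs,
    # so a minimal lhs is always examined before its supersets.
    heap = [(-(n - distinct_count(rows, (a,))), 1, (a,), rhs)
            for rhs in attrs for a in attrs if a != rhs]
    heapq.heapify(heap)
    seen: Set[Tuple[Tuple[str, ...], str]] = set()
    found: List[FD] = []
    while heap and len(found) < k:
        neg_score, _, lhs, rhs = heapq.heappop(heap)
        # Skip non-minimal candidates.
        if any(r == rhs and set(l) <= set(lhs) for l, r, _ in found):
            continue
        # X -> A holds iff adding A does not refine the partition on X.
        if distinct_count(rows, lhs) == distinct_count(rows, lhs + (rhs,)):
            found.append((lhs, rhs, -neg_score))
            continue
        # Invalid: extend the lhs by one attribute and re-queue.
        for a in attrs:
            if a != rhs and a not in lhs:
                ext = tuple(sorted(lhs + (a,)))
                if (ext, rhs) not in seen:
                    seen.add((ext, rhs))
                    score = n - distinct_count(rows, ext)
                    heapq.heappush(heap, (-score, len(ext), ext, rhs))
    return found


# Illustrative data (hypothetical).
rows = [
    {"id": 1, "zip": "10115", "city": "Berlin"},
    {"id": 2, "zip": "10115", "city": "Berlin"},
    {"id": 3, "zip": "10117", "city": "Berlin"},
    {"id": 4, "zip": "20095", "city": "Hamburg"},
]
best = top_k_fds(rows, ["id", "zip", "city"], k=1)
```

On this data the search examines `city -> id` and `city -> zip` first because they have the highest bound, finds both invalid, and then returns `zip -> city` without ever validating the key-based FDs, whose assumed score is zero.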

Incremental and Scalable Discovery

The EAIFD algorithm reframes FD discovery under incremental updates as minimal hitting set enumeration on partial hypergraphs, enabled by a memory-bounded multi-attribute hash table (MHT) for fast candidate validation (Xu et al., 22 Jan 2026). The authors give theoretical bounds on MHT memory use and report empirical speedups and memory reductions compared to prior incremental algorithms.
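
To illustrate why a hash index makes validation under insertions cheap, the sketch below maintains, for one candidate FD, a map from left-hand-side values to the right-hand-side value first seen with them. It is a simplified stand-in and not the MHT structure of EAIFD; the class name `IncrementalFDChecker` is hypothetical, and only insertions are handled.

```python
from typing import Dict, Sequence, Tuple

Row = Dict[str, object]


class IncrementalFDChecker:
    """Tracks whether lhs -> rhs still holds as rows are inserted one by one."""

    def __init__(self, lhs: Sequence[str], rhs: str) -> None:
        self.lhs = tuple(lhs)
        self.rhs = rhs
        self.index: Dict[Tuple, object] = {}  # lhs value -> first rhs value seen
        self.valid = True

    def insert(self, row: Row) -> bool:
        """Process one new row in O(|lhs|) expected time; return current validity."""
        key = tuple(row[a] for a in self.lhs)
        first = self.index.setdefault(key, row[self.rhs])
        if first != row[self.rhs]:
            # A single conflicting row invalidates the FD for good
            # (as long as rows are only inserted, never deleted).
            self.valid = False
        return self.valid


checker = IncrementalFDChecker(lhs=["zip"], rhs="city")
history = [
    checker.insert({"zip": "10115", "city": "Berlin"}),
    checker.insert({"zip": "20095", "city": "Hamburg"}),
    checker.insert({"zip": "10115", "city": "Potsdam"}),  # conflicts with row 1
    checker.insert({"zip": "80331", "city": "Munich"}),
]
```

Each insertion touches only one hash bucket, so the cost per update does not grow with the number of rows already seen. Handling deletions would require keeping counts per right-hand-side value rather than a single value.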

Approximate and Statistical Dependency Mining

For error-tolerant discovery, approaches such as FastAGEDs use semantics-based error measures to identify dependencies that almost hold, discovering both exact and approximate rules efficiently via necessary-set reductions and depth-first search (Zhou et al., 2023). Statistically sound pattern discovery frameworks employ corrected hypothesis tests (Fisher, $\chi^2$) and false discovery rate control to ensure extracted dependencies are not spurious (Hämäläinen et al., 2017).
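
A common error measure for approximate relational FDs is $g_3$: the minimum fraction of rows that must be removed for the FD to hold exactly. It is used below as a simple illustration of an "almost holds" criterion; it is a relational measure swapped in for clarity, not the graph-based measure of FastAGEDs, and the function name and threshold are assumptions.

```python
from collections import Counter, defaultdict
from typing import DefaultDict, Dict, Sequence, Tuple

Row = Dict[str, object]


def g3_error(rows: Sequence[Row], lhs: Sequence[str], rhs: str) -> float:
    """Fraction of rows to delete so that lhs -> rhs holds on the remainder."""
    if not rows:
        return 0.0
    groups: DefaultDict[Tuple, Counter] = defaultdict(Counter)
    for r in rows:
        groups[tuple(r[a] for a in lhs)][r[rhs]] += 1
    # Within each lhs group, keep the most frequent rhs value and drop the rest.
    kept = sum(max(counts.values()) for counts in groups.values())
    return 1 - kept / len(rows)


# Illustrative data (hypothetical): one row contradicts zip -> city.
rows = [
    {"zip": "10115", "city": "Berlin"},
    {"zip": "10115", "city": "Berlin"},
    {"zip": "10115", "city": "Potsdam"},
    {"zip": "20095", "city": "Hamburg"},
]
error = g3_error(rows, ["zip"], "city")
almost_holds = error <= 0.3  # threshold chosen for illustration only
```

One of four rows has to be removed, so the error is 0.25 and the FD is accepted under the illustrative threshold of 0.3.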

Domain-Specific Approaches

Software/Package Systems: Algorithms for discovering package dependencies, conflicts, and defects via combinatorial group-testing and cover-free families minimize installation tests while guaranteeing exact recovery under bounded unknown constraint counts (Basat et al., 2018).
Knowledge Graphs: RuleDict discovers interpretable connectivity dependencies (EAR, CAR, bisEAR, RofR) via binomial null-hypothesis testing of entity and path overlaps under uniform groundings, providing calibrated, traceable rule strengths for link prediction and KG completion (Zhang et al., 2020).
Process Mining: ILP-based discovery of optimal dependency graphs in event logs guarantees fitness, simplicity, and connectivity, optimizing over dependency measures and loop statistics with explicit graph-theoretic constraints (Tavakoli-Zaniani et al., 2022).
Genetic/Protein Networks: HIDDENdb infers gene/protein co-dependencies by integrating multi-omic screens and statistical co-essentiality via Z-scores, with further module/community detection and structural enrichment analyses (Silva et al., 6 Mar 2026).
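
For the process mining case, the kind of dependency measure that such methods optimize over can be illustrated with the classic heuristic-miner formula, $(|a > b| - |b > a|) / (|a > b| + |b > a| + 1)$, computed from directly-follows counts. This sketch shows only the measure, not the ILP formulation of the cited work, and the event log is hypothetical.

```python
from collections import Counter
from typing import Sequence, Tuple


def directly_follows(log: Sequence[Sequence[str]]) -> Counter:
    """Count how often activity a is immediately followed by b across all traces."""
    counts: Counter = Counter()
    for trace in log:
        for a, b in zip(trace, trace[1:]):
            counts[(a, b)] += 1
    return counts


def dependency_measure(df: Counter, a: str, b: str) -> float:
    """Heuristic-miner style measure in (-1, 1); values near 1 suggest a -> b."""
    ab, ba = df[(a, b)], df[(b, a)]
    return (ab - ba) / (ab + ba + 1)


# Hypothetical event log with three traces.
log = [["a", "b", "c"], ["a", "b", "c"], ["a", "c", "b"]]
df = directly_follows(log)
a_to_b = dependency_measure(df, "a", "b")
b_to_c = dependency_measure(df, "b", "c")
```

Here `a` is followed by `b` twice and never the other way round, giving a strong dependency of 2/3, while `b` and `c` occur in both orders and score only 0.25.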

4. Workload-Driven and Application-Oriented Dependency Discovery

Effective utilization of discovered dependencies in real systems requires fast, context-aware discovery and integration with downstream optimizers:

Query Optimization: Workload-driven techniques extract FDs, UCCs, ODs, and INDs in milliseconds from cached query plan structures and validate them via metadata-aware routines, significantly boosting throughput in multiple DBMSs; efficiencies arise from both SQL rewrites and optimizer-embedded dependency propagation (Lindner et al., 2024).
Distributed Services: eBPF-assisted packet metadata tagging enables robust, protocol-agnostic, NAT-resilient reconstruction of distributed service dependency graphs, and is reported to retain precision and recall even across complex network configurations (Landau et al., 17 Oct 2025).
Software Configuration: Dynamic system-call analysis tracks realized generate-use dependencies and notifiers in infrastructure-as-code tools, reconciling observed dependencies with declared ones to find missing or incorrect specifications (Sotiropoulos et al., 2019).
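
The reconciliation step for configuration dependencies can be sketched as follows: derive generate-use dependencies from an ordered trace of file accesses, then compare them with what the configuration declares. The trace format, resource names, and function names below are assumptions for illustration; real system-call traces carry far more detail.

```python
from typing import Dict, Iterable, Set, Tuple

Event = Tuple[str, str, str]  # (resource, operation, path), in execution order
Dep = Tuple[str, str]         # (consumer, producer)


def observed_dependencies(trace: Iterable[Event]) -> Set[Dep]:
    """A resource that reads a path depends on the resource that last wrote it."""
    last_writer: Dict[str, str] = {}
    deps: Set[Dep] = set()
    for resource, op, path in trace:
        if op == "write":
            last_writer[path] = resource
        elif op == "read":
            producer = last_writer.get(path)
            if producer is not None and producer != resource:
                deps.add((resource, producer))
    return deps


def reconcile(declared: Set[Dep], observed: Set[Dep]) -> Dict[str, Set[Dep]]:
    """Missing: used at run time but never declared. Unobserved: declared only."""
    return {"missing": observed - declared, "unobserved": declared - observed}


# Hypothetical trace and declarations.
trace = [
    ("package", "write", "/etc/app.conf"),
    ("service", "read", "/etc/app.conf"),
    ("service", "write", "/run/app.pid"),
]
declared = {("service", "user")}
report = reconcile(declared, observed_dependencies(trace))
```

The report flags `service -> package` as a missing declaration, which is the kind of fault that can cause ordering failures when resources are applied in a different order. Entries under `unobserved` are not necessarily wrong, since a single run may not exercise every declared dependency.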

5. Challenges, Limitations, and Future Directions

Several challenges and open questions remain central:

Scalability and Complexity: Exact minimal discovery remains intractable for high-dimensional data; top-$k$, approximate, or workload-restricted variants offer practical scalability (Wan et al., 15 Jan 2026, Zhou et al., 2023, Xu et al., 22 Jan 2026).
Noise and Statistical Significance: Statistical and causal dependency discovery must account for multiple-testing, latent confounders, and uncertainty in causal graph structure (Hämäläinen et al., 2017, Ng et al., 31 Aug 2025).
Integration and Incrementality: Efficient incremental discovery (e.g., EAIFD) is needed for evolving data, but deletions remain an open challenge (Xu et al., 22 Jan 2026); further, practical systems require tight coupling between discovery and usage.
Expressiveness: Beyond classic FDs/UCCs/INDs, richer dependency types (graph, temporal, higher-order, dynamic) and cross-modal relations are increasingly relevant (Zhou et al., 2023, Zhang et al., 2020).
Interpretability and Reasoning: Rule-based approaches deliver glass-box dependency explanations, but trade off smooth generalization; hybrid neural-symbolic systems are an active area of research (Zhang et al., 2020, Silva et al., 6 Mar 2026).

Planned directions include cost-based optimization of distributed discovery plans (Saxena et al., 2019), parallelization of incremental discovery, and generalized frameworks able to accommodate logical, statistical, and causal dependencies within unified, scalable systems.

6. Summary Table: Key Dependency Discovery Techniques

Domain/Type	Core Algorithmic Principle	Reference
FDs/UCCs (Relational)	Hypergraph hitting-set enumeration, pruning	(Bläsius et al., 2021, Wan et al., 15 Jan 2026)
Incremental FDs	Partial hypergraph, MHT, selective validation	(Xu et al., 22 Jan 2026)
Approximate (Graph)	Necessary-set DFS, semantics-based error measure	(Zhou et al., 2023)
Distributed (All)	Primitives: group-by, evidence, join, set-cover	(Saxena et al., 2019)
Statistical	Hypothesis testing, FDR/Bonferroni, level-wise mining	(Hämäläinen et al., 2017)
Causal	Skeleton/edge-orientation (PC), IDA, SHAP integration	(Ng et al., 31 Aug 2025)
Process/Event logs	ILP optimal arc selection, fitness/precision	(Tavakoli-Zaniani et al., 2022)
Configuration/Software	Group-testing, dynamic trace analysis	(Basat et al., 2018, Sotiropoulos et al., 2019)
KG Reasoning	Connectivity rules (EAR, CAR, bisEAR, RofR)	(Zhang et al., 2020)
Service/OS	eBPF-based TCP tagging, file-level graph	(Landau et al., 17 Oct 2025, Prokhorenko et al., 2023)

Dependency discovery continues to evolve as a central theme in data management, with ongoing research driving advances in scalability, expressiveness, statistical robustness, incremental maintenance, and system-level integration.