
Geo: A Query Rewrite Framework for Graph Pattern Mining

Published 25 May 2026 in cs.PL and cs.DB | (2605.26291v1)

Abstract: Graph pattern mining is important for analyzing graph data. Graph mining systems typically require answering pattern matching queries, which involve solving the NP-complete subgraph isomorphism problem. To address this, domain experts often develop custom optimization strategies based on exploiting substructural similarities across different patterns. While these optimizers can be effective, their development is challenging, which limits the exploration of interactions between different optimization strategies and keeps experts from continuously improving the optimizers, for example by incorporating additional custom or general pattern-based equivalences over time. We present a programmable pattern matching query optimizer called Geo, which automatically manages the interactions between various equivalences, ensures the optimizations maintain correctness of results, and simplifies the management of substructure equivalences. Geo exposes a simple but flexible language for expressing pattern equivalences as rewrite rules. By maintaining canonical representations of generated patterns during equality saturation, Geo avoids issues arising from syntactic differences in isomorphic patterns. Additionally, we develop embedded reconstructability (EmRec), which tracks provenance across equivalences to meet the various reconstructability needs of desired outputs. Our evaluation demonstrates that Geo can discover novel query equivalences through complex composition of various rewrite rules, enabling our optimized queries to achieve a cost reduction of up to 99% compared to the queries in prior work. We further test Geo's effectiveness at speeding up practical graph mining problems by using it in two representative case studies, approximate pattern matching and quasi-clique mining, and find it is highly effective at optimizing these tasks, enabling cost reductions of up to 71%.

Summary

  • The paper introduces GEO, a framework that leverages a domain-specific language and equality saturation to optimize graph pattern queries, reducing query cost by up to 99%.
  • The paper employs canonicalization and provenance tracking (EMREC) to keep rewrites correct in batched query processing, and reports speedups of up to 320x for single pattern queries and up to 213x for batched-individual queries.
  • The paper validates GEO on large datasets: two case studies (approximate pattern matching and quasi-clique mining) see cost reductions of up to 71%, and optimizing for collective reconstructability yields up to 27% additional reduction over individual reconstructability.

GEO: A Programmable Framework for Query Rewriting in Graph Pattern Mining

Motivation and Problem Setting

The paper introduces GEO, a programmable and extensible query rewrite framework for graph pattern mining, addressing the inherent complexity of optimizing pattern matching queries in large graph datasets. The primary computational bottleneck in graph mining lies in solving the subgraph isomorphism problem, which is NP-complete. While domain-specific query optimizers (e.g., SUBGRAPH MORPHING, ESCAPE) leverage substructural pattern equivalences for improving performance, their ad hoc construction inhibits systematic exploration of equivalence interactions and incremental extensibility. The lack of a generalizable, maintainable, and extensible optimization framework limits both scalability and the ability to incorporate expert knowledge over time.

GEO Architecture and Language

GEO provides a domain-specific language (DSL) for expressing graph pattern queries and pattern-level rewrite rules. The DSL is designed to be algebraically tractable and amenable to batching and reconstructability requirements, allowing both individual and collective result aggregation. It supports pattern, union, and count constructs with provenance tracking. Through lightweight syntactic annotations, the language delineates reconstruction paths, which are crucial for maintaining correctness when rewrites impact result composition, especially in batched queries.
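To make the three constructs concrete, here is a minimal sketch of how pattern, union, and count terms might be modeled as a small term tree. The class names, fields, and the `label` annotation are hypothetical and chosen for illustration; the summary does not give GEO's actual syntax.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Pattern:
    """A pattern graph given as a tuple of undirected edges (hypothetical encoding)."""
    edges: Tuple[Tuple[int, int], ...]
    label: str = ""  # hypothetical annotation naming the result this pattern feeds


@dataclass(frozen=True)
class Union:
    """A batch of sub-terms whose results are combined."""
    terms: Tuple[object, ...]


@dataclass(frozen=True)
class Count:
    """Request the number of matches of the wrapped term."""
    term: object


def patterns_in(term):
    """Collect every Pattern leaf of a term, left to right."""
    if isinstance(term, Pattern):
        return [term]
    if isinstance(term, Count):
        return patterns_in(term.term)
    if isinstance(term, Union):
        return [p for sub in term.terms for p in patterns_in(sub)]
    raise TypeError(f"unknown term: {term!r}")


triangle = Pattern(((0, 1), (1, 2), (0, 2)), label="triangle")
wedge = Pattern(((0, 1), (1, 2)), label="wedge")
query = Count(Union((triangle, wedge)))
print([p.label for p in patterns_in(query)])
```

Frozen dataclasses are used so that terms are hashable and can be stored in sets or used as dictionary keys, which is convenient when an optimizer needs to detect repeated subterms.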

Pattern equivalences are formalized as rewrite rules. GEO applies them with equality saturation over e-graphs (using the egg library), which explores the space of candidate rewrites systematically and then extracts the lowest-cost query under a cost model. The framework supports both statically defined and dynamically generated rewrite rules, making it applicable to arbitrary patterns and facilitating the integration of expert-provided equivalences.
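One concrete example of a pattern equivalence is the standard identity relating edge-induced and vertex-induced wedge (2-path) counts: every triangle contains three edge-induced wedges, so the edge-induced wedge count equals the vertex-induced wedge count plus three times the triangle count. The sketch below checks this by brute force on a small graph. The identity is a textbook fact picked for illustration; the summary does not list the specific rules GEO uses, and the function names are hypothetical.

```python
from itertools import combinations


def build_adjacency(edges):
    """Build an undirected adjacency map from an edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def count_wedges(adj):
    """Return (edge_induced, vertex_induced) wedge counts."""
    edge_induced = 0
    vertex_induced = 0
    for center, nbrs in adj.items():
        for a, b in combinations(sorted(nbrs), 2):
            edge_induced += 1          # any two neighbors form a wedge
            if b not in adj[a]:
                vertex_induced += 1    # open wedge: endpoints not adjacent
    return edge_induced, vertex_induced


def count_triangles(adj):
    """Count triangles independently, by checking every vertex triple."""
    return sum(
        1
        for a, b, c in combinations(sorted(adj), 3)
        if b in adj[a] and c in adj[a] and c in adj[b]
    )


# A triangle 0-1-2 with a pendant edge 2-3.
adj = build_adjacency([(0, 1), (1, 2), (0, 2), (2, 3)])
wedges_e, wedges_v = count_wedges(adj)
triangles = count_triangles(adj)
print(wedges_e, wedges_v, triangles)
assert wedges_e == wedges_v + 3 * triangles
```

Read as a rewrite rule, the identity lets an optimizer answer an edge-induced wedge query from vertex-induced wedge and triangle results, or the reverse, depending on which side is cheaper to compute.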

The same pattern can be written in many syntactically different but isomorphic ways, and without intervention these variants multiply inside the e-graph. To prevent this, GEO canonicalizes pattern terms (using BLISS), so that every isomorphic variant maps to a single normal form. The optimizer then operates on equivalence classes rather than on redundant syntactic variants.
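The idea behind canonicalization can be sketched with a brute-force canonical form for small patterns: try every vertex relabeling and keep the lexicographically smallest sorted edge list. This is not how BLISS works internally (it uses a far more efficient search); it is only a minimal stand-in, practical for patterns of a handful of vertices, and `canonical_form` is a hypothetical helper name.

```python
from itertools import permutations


def canonical_form(num_vertices, edges):
    """Return the smallest sorted edge tuple over all vertex relabelings.

    Isomorphic patterns yield the same tuple. Runs in O(n!) time, so it is
    only suitable for very small patterns.
    """
    best = None
    for perm in permutations(range(num_vertices)):
        relabeled = tuple(sorted(
            tuple(sorted((perm[u], perm[v]))) for u, v in edges
        ))
        if best is None or relabeled < best:
            best = relabeled
    return best


# Two syntactically different descriptions of the same 3-vertex path.
path_a = canonical_form(3, [(0, 1), (1, 2)])   # center is vertex 1
path_b = canonical_form(3, [(2, 0), (0, 1)])   # center is vertex 0
triangle = canonical_form(3, [(0, 1), (1, 2), (0, 2)])

print(path_a == path_b)     # the two paths collapse to one form
print(path_a == triangle)   # a triangle stays distinct
```

Using the canonical tuple as a dictionary or set key is what lets two differently written but isomorphic patterns be recognized as the same term.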

Optimization Algorithm and Canonicalization

GEO's optimization is realized through iterative equality saturation on canonicalized queries using e-graphs. Rewrite rules are required to respect canonicalization, guaranteeing that equivalent results are obtained regardless of syntactic representation. The optimizer minimizes the sum of pattern costs, as estimated by a domain-specific performance model, typically informed by prior work and empirical observations on pattern mining engines.

Correctness is formally established: if the algorithm achieves saturation with rewrite rules that respect canonicalization, the extracted solution is guaranteed to be minimal-cost under the provided cost function and equivalent to the original query in terms of reconstructability.
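The saturate-then-extract loop can be illustrated with a deliberately naive sketch that enumerates whole equivalent queries instead of building an e-graph (the compact sharing that egg provides is not reproduced here). A query is modeled as a set of pattern names, a rule says a pattern can be reconstructed from a set of other patterns, and cost is the sum over distinct patterns. All pattern names, rules, and cost values below are hypothetical.

```python
def saturate(query, rules):
    """Apply rewrite rules until no new equivalent query appears."""
    seen = {query}
    frontier = [query]
    while frontier:
        current = frontier.pop()
        for pattern in current:
            for replacement in rules.get(pattern, []):
                candidate = (current - {pattern}) | replacement
                if candidate not in seen:
                    seen.add(candidate)
                    frontier.append(candidate)
    return seen


def query_cost(query, cost):
    """Sum of per-pattern costs; each distinct pattern is paid for once."""
    return sum(cost[p] for p in query)


def extract_cheapest(candidates, cost):
    """Pick the minimum-cost query, breaking ties deterministically."""
    return min(candidates, key=lambda q: (query_cost(q, cost), sorted(q)))


# Hypothetical rules: each wedge variant is recoverable from the other
# variant plus triangles.
RULES = {
    "wedge_E": [frozenset({"wedge_V", "triangle"})],
    "wedge_V": [frozenset({"wedge_E", "triangle"})],
}
# Hypothetical relative costs, e.g. normalized execution times.
COST = {"wedge_E": 10, "wedge_V": 4, "triangle": 3}

query = frozenset({"wedge_E", "triangle"})
candidates = saturate(query, RULES)
best = extract_cheapest(candidates, COST)
print(sorted(best), query_cost(best, COST))
```

In this toy batch the rewrite pays off because the triangle result is needed anyway, so replacing the expensive wedge variant adds no new pattern to the query. The loop terminates because only finitely many sets of pattern names exist; real rule sets can generate unboundedly many terms, which is one reason e-graph implementations bound saturation with iteration or size limits.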

Embedded Reconstructability (EMREC)

EMREC is GEO’s mechanism for maintaining provenance across rewrites and query composition, enabling safe optimization for batched queries with diverse reconstructability requirements. Usage annotations and reconstruction paths track how individual and collective results should be computed, preventing incorrect result aggregation when interacting rewrites could otherwise cause information loss. The approach ensures that post-optimization queries are semantically equivalent to the originals under the specified reconstruction semantics.

Experimental Evaluation

GEO is empirically evaluated on large-scale graph datasets (e.g., Friendster), with relative pattern costs derived from execution times on contemporary mining engines (e.g., Peregrine). The evaluation instantiates GEO with rewrite rules from both SUBGRAPH MORPHING and ESCAPE, demonstrating:

  • GEO discovers novel query equivalences through the combinatorial composition of diverse rewrite rules, yielding optimized queries that are up to 99% lower in cost than baseline queries.
  • For single pattern queries, GEO achieves speedups up to 320x; for batched-individual queries, GEO achieves speedups up to 213x.
  • GEO’s collective reconstructability optimization delivers up to 27% additional reduction compared to individual reconstructability.
  • Canonicalization is shown to be critical: disabling it results in a 13.5x slowdown and leaves the optimizer unable to eliminate redundant subterms.

Case studies in approximate pattern matching and quasi-clique mining show that GEO’s rewrite-based query optimization enables cost reductions of up to 71%.
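The figures above mix percentage cost reductions and multiplicative speedups. As a rough aid for comparing them, the sketch below converts between the two, assuming that speedup is simply the ratio of original cost to optimized cost. That assumption is not stated in the summary, and measured runtimes need not track modeled cost exactly.

```python
def speedup_from_reduction(reduction):
    """Speedup factor implied by a fractional cost reduction in [0, 1)."""
    if not 0 <= reduction < 1:
        raise ValueError("reduction must be in [0, 1)")
    return 1.0 / (1.0 - reduction)


def reduction_from_speedup(speedup):
    """Fractional cost reduction implied by a speedup factor >= 1."""
    if speedup < 1:
        raise ValueError("speedup must be >= 1")
    return 1.0 - 1.0 / speedup


for reduction in (0.99, 0.71, 0.27):
    factor = speedup_from_reduction(reduction)
    print(f"{reduction:.0%} reduction is about {factor:.2f}x")

for speedup in (320, 213):
    fraction = reduction_from_speedup(speedup)
    print(f"{speedup}x is about {fraction:.2%} reduction")
```

Under this assumption, a 99% reduction corresponds to roughly 100x and a 71% reduction to roughly 3.4x, while a 320x speedup corresponds to a reduction of about 99.7%.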

Practical and Theoretical Implications

GEO is architecturally orthogonal to underlying pattern matching engines and cost models. By separating the expert-driven discovery of equivalences from their application in query optimization, GEO enables continual extensibility, systematic exploration of equivalence interactions, and domain-specific optimizations. The results demonstrate the effectiveness of canonicalized e-graphs and provenance-aware rewrite algebra for complex optimization problems.

The practical implication is a substantial reduction in computational resources, which is crucial in domains with large graphs and expensive mining tasks (bioinformatics, social networks, cybersecurity). Theoretically, the formal integration of canonicalization and provenance semantics through equality saturation establishes a principled foundation for automated query optimization in domains with rich equational theories.

Future Directions

The work suggests several promising avenues:

  • Automated discovery of novel rewrite rules for graph pattern mining, leveraging machine learning or synthesis methods.
  • Application of EMREC-style provenance tracking in other domains with batched subproblems (e.g., SQL optimization, view/index reuse).
  • Exploration of canonicalized e-graphs in domains with infinite or complex equational theories (e.g., regular expressions, lens languages).

Conclusion

GEO represents a principled, extensible, and efficient query rewrite framework for graph mining optimizers, addressing the challenges of complex pattern-level equivalence management, reconstructability, and canonicalization. By empowering domain experts to focus on equivalence specification and automating correctness and efficiency guarantees, GEO achieves substantial performance gains in practical graph mining tasks and sets a foundation for broader applications of canonicalized, provenance-aware rewrite optimization (2605.26291).
