Graph-Based Equation Discovery Framework
- The paper introduces a graph-based framework that encodes equations as DAGs to enhance scalability and interpretability in equation discovery.
- It employs a blend of graph neural networks, genetic programming, and variational autoencoders to robustly search and infer symbolic equations from data.
- Applications span physics, materials science, and dynamical systems, demonstrating superior efficiency and improved error metrics compared to classical methods.
A graph-based equation discovery framework is a methodology for representing, searching, and inferring mathematical equations using graph-theoretic data structures and algorithms. In these frameworks, equations are encoded as graph objects—typically directed acyclic graphs (DAGs) or specialized knowledge graphs—enabling scalable, modular, and semantically meaningful exploration of equation space. Approaches span knowledge synthesis (mapping physical laws as conceptual graphs), symbolic regression (discovering free-form algebraic or differential equations from data), generative modeling, and causal or dynamical system identification. Central to these frameworks are mappings between symbolic or data-driven expressions and their associated computational or semantic graphs, together with graph-structured learning or search procedures for inference, optimization, and interpretation.
1. Formal Graph Representations of Equations
The foundational component of graph-based equation discovery frameworks is the data structure mapping equations to graphs:
- Equation graphs as DAGs: Mathematical equations are encoded as directed acyclic graphs where each node may represent an input variable, parameter, constant, or operator (e.g., addition, multiplication, division, or a differential operator) (Atkinson et al., 2019, Ranasinghe et al., 30 Mar 2025, Xu et al., 13 Nov 2025). Edges encode operand–operator relations, reflecting the compositional structure of expressions.
- Edge and node features: Nodes are augmented with one-hot or learned feature vectors denoting their type (operator, variable, constant), while edges may carry positional or parametric annotations (e.g., argument order, scale factors, or material-specific parameters) (Ranasinghe et al., 30 Mar 2025, Xu et al., 13 Nov 2025).
- Higher-level graphs: In knowledge graph approaches, nodes may represent entire physical equations and/or their constituent physical concepts or variables (e.g., the speed of light c), and edges encode variable sharing or conceptual bridge strength (Romiti, 7 Aug 2025).
- Equality graphs (e-graphs): To capture symbolic equivalence classes, equality graphs represent all semantically identical variants generated via a set of rewrite rules and merge them using union-find structures (Jiang et al., 8 Nov 2025). This efficiently factors out redundant exploration of equivalent expressions.
This formal graph structure enables seamless integration with graph neural networks (GNNs), graph generative models, and graph-based evolutionary search algorithms.
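To make the DAG encoding concrete, here is a minimal sketch of an expression graph and a recursive evaluator. The node layout, operator table, and function names are illustrative assumptions, not the data structures of any cited framework; a real implementation would also share common subexpressions across parents, which is what distinguishes a DAG from an expression tree.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    kind: str                                     # "op", "var", or "const"
    value: object                                 # operator name, variable name, or number
    children: list = field(default_factory=list)  # operand order matters

# Small illustrative operator library; frameworks extend this with
# division, powers, differential operators, etc.
OPS = {
    "add": lambda a, b: a + b,
    "mul": lambda a, b: a * b,
}

def evaluate(node: Node, env: dict) -> float:
    """Recursively evaluate the expression graph at assignment `env`."""
    if node.kind == "const":
        return node.value
    if node.kind == "var":
        return env[node.value]
    args = [evaluate(c, env) for c in node.children]
    return OPS[node.value](*args)

# Encode f(x, y) = (x + 2) * y.
x = Node("var", "x")
two = Node("const", 2.0)
expr = Node("op", "mul", [Node("op", "add", [x, two]), Node("var", "y")])
print(evaluate(expr, {"x": 1.0, "y": 3.0}))  # (1 + 2) * 3 = 9.0
```

One-hot node-type features and edge annotations, as described above, would be attached to `Node` instances before feeding the graph to a GNN.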
2. Core Algorithmic Techniques
Multiple algorithmic paradigms have emerged for navigation and inference within equation graphs:
- Genetic programming over graphs: Candidate equations are evolved as DAGs via a loop of mutation (local changes to nodes or edges), crossover (swapping subgraphs between graphs), and selection, guided by fitness metrics (e.g., mean squared error, evidence lower bound) (Atkinson et al., 2019, Xu et al., 13 Nov 2025). Edge features often parameterize constants or coefficients, which are fit via nonlinear optimization (e.g., L-BFGS with multi-start) (Xu et al., 13 Nov 2025).
- Graph neural networks for representation and link prediction: Graph attention networks (GATv2) are used to embed equation/concept nodes and predict link strengths or bridge scores between equations, enabling hypothesis generation and knowledge synthesis (Romiti, 7 Aug 2025). Node features include branch embeddings, variable embeddings, and bibliometric features; edge weights fuse Jaccard variable overlap, physics-informed importance, and branch similarity.
- Variational generative modeling: Conditional variational autoencoders (CVAE) are applied to equation graphs, learning a latent representation of the equation space conditioned on a dataset representation. An asynchronous GNN encoder maps the DAG and data to the latent space; a graph-structured decoder reconstructs equations from (Ranasinghe et al., 30 Mar 2025).
- Latent space optimization: Rather than searching the discrete equation space directly, Bayesian optimization is performed in the learned latent space of equations to find expressions best fitting observed data (Ranasinghe et al., 30 Mar 2025).
- Equality graph augmentation: Equality graphs are integrated into symbolic regression solvers to prune search (e.g., in Monte Carlo Tree Search), reduce gradient estimator variance (in deep RL), and enrich LLM prompts, by aggregating over equivalence classes (Jiang et al., 8 Nov 2025).
- Neural and symbolic-extraction hybrids for dynamical systems: Multi-layer perceptrons (MLP) and graph-adapted Kolmogorov–Arnold networks (KAN) model dynamical laws on graphs. Symbolic formulas are extracted post-hoc by spline-wise fitting or symbolic regression (e.g., PySR) (Cappi et al., 25 Aug 2025).
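The genetic-programming loop described above can be sketched with two of its primitives, point mutation and tournament selection. This is a hedged illustration using plain dict nodes; the helper names are assumptions, and real frameworks additionally apply subgraph crossover and calibrate edge constants by nonlinear optimization (e.g., L-BFGS with multi-start).

```python
import copy
import random

OPS = ["add", "mul", "sub"]  # illustrative operator library

def walk(node):
    """Yield every node of the expression graph."""
    yield node
    for child in node.get("children", []):
        yield from walk(child)

def point_mutate(expr, rng):
    """Return a copy of `expr` with one randomly chosen operator swapped."""
    new = copy.deepcopy(expr)
    op_nodes = [n for n in walk(new) if n["kind"] == "op"]
    target = rng.choice(op_nodes)
    target["value"] = rng.choice([o for o in OPS if o != target["value"]])
    return new

def select(population, fitness, k=2, rng=None):
    """Tournament selection: best (lowest error) of k random candidates."""
    rng = rng or random
    contenders = rng.sample(population, k)
    return min(contenders, key=fitness)

rng = random.Random(0)
expr = {"kind": "op", "value": "add",
        "children": [{"kind": "var", "value": "x"},
                     {"kind": "const", "value": 1.0}]}
mutant = point_mutate(expr, rng)
print(expr["value"], "->", mutant["value"])  # original is left intact
```

Selection pressure comes entirely from the `fitness` callable, which is where the fit-based objectives of the next section plug in.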
3. Training Objectives and Evaluation Metrics
Objectives are explicitly tied to the nature of the discovery task and the graph learning paradigm in use:
- Fit-based loss functions: For data-driven discovery, the principal objective is usually mean squared error or negative log-likelihood between the predicted outputs of a candidate equation and observed data, often after calibrating parametric coefficients via variational inference or direct optimization (Atkinson et al., 2019, Ranasinghe et al., 30 Mar 2025, Xu et al., 13 Nov 2025).
- Cross-entropy loss for link prediction: In knowledge-graph approaches, binary cross-entropy is minimized to train GNNs for predicting the presence or absence of conceptual links between equations (Romiti, 7 Aug 2025).
- Bayesian and ensemble techniques: MCMC and Bayesian uncertainty quantification are incorporated for structural inference (especially in causal modeling) (Roy et al., 2023). Active learning techniques are used to iteratively refine model hypotheses (Atkinson et al., 2019).
- Evaluation metrics: Frameworks report AUC (for link prediction), normalized mean squared error (NMSE), recovery rate of known ground-truth equations, structural complexity (number of operators), and parsimony measures (number of fitted parameters) (Romiti, 7 Aug 2025, Ranasinghe et al., 30 Mar 2025, Cappi et al., 25 Aug 2025, Xu et al., 13 Nov 2025, Jiang et al., 8 Nov 2025). For dynamical systems, long-term integration error on out-of-distribution graphs (MAE_traj) is a critical metric (Cappi et al., 25 Aug 2025).
4. Application Domains and Empirical Results
Graph-based equation discovery frameworks have demonstrated capabilities across a diverse array of scientific domains:
- Physics knowledge synthesis: Large corpora of advanced physical equations are encoded as weighted graphs, revealing macroscopic domain structure, central hub equations, and stable, computationally-derived analogies between branches (e.g., Electromagnetism vs. Statistical Mechanics). Test AUC for link prediction reaches ~0.974, outperforming classical baselines (Romiti, 7 Aug 2025).
- Free-form PDE/ODE discovery: Arbitrary compositions of algebraic and differential operators are encoded as DAGs. Automatic differentiation enables on-the-fly evaluation of candidate PDEs, leading to the successful recovery of second-order and heterogeneous elliptic PDEs from data (Atkinson et al., 2019).
- Constitutive law identification in materials science: Compact, accurate models for strain-rate effects and strain hardening in metals are discovered that surpass empirical models while using fewer parameters. Unified graph-based models for dynamic plastic-stress laws outperform classical Johnson–Cook models by roughly a factor of two in error (Xu et al., 13 Nov 2025).
- Symbolic regression and structure-aware search: CVAE-based generative models with Bayesian optimization achieve a 55% exact solution rate over diverse nonlinear equation benchmarks, offering high validity and novel equation generation (Ranasinghe et al., 30 Mar 2025). Equality-graph–based methods reduce normalized MSE by 10–30% over classical symbolic regression methods on noisy and out-of-distribution data (Jiang et al., 8 Nov 2025).
- Graph dynamical system identification: MLP and KAN-based frameworks extract governing symbolic ODEs for network-coupled dynamical systems (e.g., Kuramoto, epidemic, biochemical models), achieving superior accuracy and parsimony compared to sparse regression baselines. KAN-based spline-wise extraction yields formulas matching ground-truth with minimal excess complexity (Cappi et al., 25 Aug 2025).
- Causal structure discovery in functional data: Bayesian models over cyclic directed graphs enable identification of causal subspaces in multivariate functional data, achieving higher true positive and lower false discovery rates compared to alternatives, and revealing functional connectivity structures in brain EEG data (Roy et al., 2023).
5. Theoretical Guarantees and Computational Aspects
Graph-based frameworks provide unique algorithmic and statistical guarantees:
- Sample and search efficiency: Theoretical regret bounds for MCTS are tightened with e-graph augmentation, as merging equivalence classes reduces the effective branching factor (Jiang et al., 8 Nov 2025).
- Variance reduction in DRL: Aggregating rewards and gradients over equivalence classes with e-graphs provably reduces estimator variance via Rao–Blackwellization (Jiang et al., 8 Nov 2025).
- Identifiability in causal graphs: Under specified conditions (causal sufficiency, disjoint cycles, stability), directed cyclic graph models for functional data are identifiable from joint observations (Roy et al., 2023).
- Scalability constraints: Graph-based search scales combinatorially with the operator library and graph depth. Accurate approximate solutions require population sizes of 500+ in genetic programming (Atkinson et al., 2019), sparse or batched adjacency representations in deep learning (Romiti, 7 Aug 2025), and efficient edge-parameter embedding (Xu et al., 13 Nov 2025).
- Overheads and practical limits: EGG construction saturates efficiently for expressions of length up to 20, and e-graph memory cost grows far more slowly than naive enumeration of all equivalent forms (Jiang et al., 8 Nov 2025).
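The equivalence-class merging at the heart of e-graphs rests on union-find bookkeeping, sketched below. The rewrite rules and expression strings are assumed examples; a real e-graph additionally shares e-nodes across classes and rebuilds congruence after each merge.

```python
class UnionFind:
    """Disjoint-set structure over expression identifiers."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            # Path halving keeps lookups near-constant amortized.
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[ra] = rb

uf = UnionFind()
# A commutativity rewrite merges x+y with y+x; a multiplicative-identity
# rewrite merges x*1 with x.
uf.union("x+y", "y+x")
uf.union("x*1", "x")
roots = {uf.find(e) for e in ["x+y", "y+x", "x*1", "x"]}
print(len(roots))  # 2 equivalence classes instead of 4 distinct expressions
```

Halving the number of distinct candidates in this toy case mirrors how merging reduces the effective branching factor in MCTS and the sample space in RL reward estimation.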
6. Generalization, Limitations, and Future Directions
Frameworks are broadly generalizable but face scaling and expressivity challenges:
- Disciplinary agnosticism: Graph representations and learning modules are domain-agnostic; replacing the operator/variable set or edge priors allows application to chemistry, biology, economics, and beyond (Romiti, 7 Aug 2025, Xu et al., 13 Nov 2025).
- Structural templates and priors: Operator libraries and graph templates encode inductive biases: insufficient diversity can preclude discovery of key terms (e.g., third-order derivatives), while overly large templates impact computational tractability (Atkinson et al., 2019, Xu et al., 13 Nov 2025).
- Interpretability vs. flexibility: MLP-based and GNN-based models offer greater expressivity but at the cost of interpretability, mitigated by symbolic extraction or white-box architectures (e.g., KANs, Spline-Wise fitting) (Cappi et al., 25 Aug 2025).
- Equality-awareness and search efficiency: Domains with rich algebraic identities benefit most from e-graph integration, but overheads may be suboptimal for pure linear grammars (Jiang et al., 8 Nov 2025).
- Scaling to large corpora and dense graphs: Future work will likely prioritize scalable, sparse GNN architectures, edge-weight learning, embedding of operators and tensors, and dynamic or context-sensitive graph weighting (Romiti, 7 Aug 2025).
Graph-based equation discovery frameworks are systematically advancing the synthesis, search, and scientific interpretation of mathematical laws, offering scalable, interpretable, and generalizable alternatives to both symbolic regression and black-box deep learning paradigms (Atkinson et al., 2019, Romiti, 7 Aug 2025, Ranasinghe et al., 30 Mar 2025, Xu et al., 13 Nov 2025, Cappi et al., 25 Aug 2025, Jiang et al., 8 Nov 2025, Roy et al., 2023).