Data-Driven Equation Discovery
- Data-driven equation discovery is a methodology that infers interpretable governing equations directly from observational, experimental, or simulation data.
- It employs techniques like sparse regression, symbolic regression, deep learning, and Bayesian methods to select active terms and recover underlying dynamics.
- The approach is applied across physics, climate science, and materials science, with key challenges including noise handling, feature library design, and ensuring model interpretability.
Data-driven equation discovery is an emerging suite of methodologies that aim to infer interpretable governing equations—often in ordinary or partial differential form—directly from observational, experimental, or simulation data, rather than relying solely on first-principles physical derivation. This paradigm is particularly valuable in domains where the physical laws are unknown, only partially known, or analytically inaccessible, offering a path toward parsimonious yet expressive models that aid prediction, control, and scientific understanding.
1. Foundations and Unified Mathematical Principles
At its core, data-driven equation discovery seeks to recover operators or functional relationships that govern system evolution, starting from time series or field data collected over space and/or time. A widely accepted generic representation for this process is

$$\frac{\partial^{n} u}{\partial t^{n}} = M\left(u, \frac{\partial u}{\partial x}, \frac{\partial^{2} u}{\partial x^{2}}, \dots\right),$$

where $\partial^{n} u / \partial t^{n}$ denotes the $n$th time derivative, $M$ is the (potentially nonlinear and sparse) operator to be discovered, and the argument collects nonlinear functions or derivatives of $u$ (North et al., 2022). This form encompasses ODEs, PDEs, and more general dynamical laws.
The essential problem is to identify (i) the dictionary or library of candidate functions/operators, and (ii) the sparse set of active terms and their coefficients that constitute the true governing equation. The discovery is often cast as a regression problem—either linear (in the coefficients) or via symbolic regression (in the structure)—augmented by statistical, physical, or computational constraints.
2. Categories of Methodologies
Data-driven equation discovery approaches can be organized into several categories, each with distinct strengths and challenges.
Sparse Regression-Based Methods
Classical sparse regression methods, such as LASSO, sequentially thresholded least squares (STLSQ), and STRidge, systematically construct an overcomplete library of candidate terms (e.g., $u$, $u^2$, $u_x$, $u u_x$, $u_{xx}$, etc.) and solve for a sparse coefficient vector $\xi$ via penalized least squares (Maslyaev et al., 2019, North et al., 2022). The SINDy (Sparse Identification of Nonlinear Dynamics) approach is canonical here and often employs thresholding to promote parsimony (Levko, 11 May 2024).
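As a minimal, stdlib-only sketch (not any particular paper's implementation), the STLSQ idea can be written as alternating least-squares fits with hard thresholding of small coefficients. The toy system (dx/dt = -2x), the library, and the threshold value below are illustrative choices:

```python
import math

def solve(A, b):
    # Gauss-Jordan elimination with partial pivoting (assumes a nonsingular system).
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c and M[r][c] != 0.0:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * m for a, m in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def lstsq(X, y):
    # Least squares via the normal equations (fine for tiny, well-conditioned libraries).
    n, k = len(X), len(X[0])
    XtX = [[sum(X[r][i] * X[r][j] for r in range(n)) for j in range(k)] for i in range(k)]
    Xty = [sum(X[r][i] * y[r] for r in range(n)) for i in range(k)]
    return solve(XtX, Xty)

def stlsq(X, y, threshold=0.1, iters=10):
    # Sequentially thresholded least squares: fit, zero out small coefficients, refit.
    k = len(X[0])
    active = list(range(k))
    coef = [0.0] * k
    for _ in range(iters):
        fit = lstsq([[row[j] for j in active] for row in X], y)
        coef = [0.0] * k
        for j, c in zip(active, fit):
            coef[j] = c
        active = [j for j in range(k) if abs(coef[j]) >= threshold]
    return coef

# Toy system dx/dt = -2x, sampled from the exact solution x(t) = e^{-2t};
# derivatives are taken analytically here to isolate the regression step.
ts = [0.01 * i for i in range(200)]
xs = [math.exp(-2.0 * t) for t in ts]
dxdt = [-2.0 * x for x in xs]

# Candidate library Theta(x) = [1, x, x^2]; STLSQ should keep only the x term.
Theta = [[1.0, x, x * x] for x in xs]
xi = stlsq(Theta, dxdt)
```

On this clean data the thresholding step prunes the constant and quadratic terms, and the refit recovers the coefficient -2 on the linear term.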
Symbolic Regression and Evolutionary Approaches
Symbolic regression uses genetic programming (GP) or evolutionary algorithms to search the space of tree-structured analytic expressions, composing functions and derivatives into candidate equations (Maslyaev et al., 2019, 1908.10673, Hvatov et al., 2020). Evolutionary operators—crossover, mutation, selection—dynamically generate new expressions, which are then subject to numerical regression to determine coefficients.
EPDE (Evolutionary PDE Discovery), for example, dispenses with a fixed candidate library and instead constructs equation terms on the fly, allowing for richer or less constrained model discovery (Maslyaev et al., 2019, Hvatov et al., 2020).
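A full genetic-programming loop is lengthy, so the stdlib-only sketch below substitutes brute-force enumeration of small expression trees for the evolutionary operators, just to make the tree-structured search space concrete; the grammar and hidden law (y = x² + x) are illustrative assumptions:

```python
import itertools

# Brute-force enumeration of small expression trees; a simplified stand-in for
# the crossover/mutation/selection search used by real GP systems.
LEAVES = ["x", "1"]
OPS = {"+": lambda a, b: a + b, "-": lambda a, b: a - b, "*": lambda a, b: a * b}

def grow(depth):
    # All trees up to `depth`: a tree is a leaf string or a tuple (op, left, right).
    if depth == 0:
        return list(LEAVES)
    smaller = grow(depth - 1)
    trees = list(smaller)
    for op in OPS:
        for left, right in itertools.product(smaller, repeat=2):
            trees.append((op, left, right))
    return trees

def evaluate(tree, x):
    if tree == "x":
        return x
    if tree == "1":
        return 1.0
    op, left, right = tree
    return OPS[op](evaluate(left, x), evaluate(right, x))

# Hidden law y = x^2 + x; "discovery" = the lowest-error tree in the search space.
xs = [0.5 * i for i in range(-4, 5)]
ys = [x * x + x for x in xs]

def mse(tree):
    return sum((evaluate(tree, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

best = min(grow(2), key=mse)
```

Even this tiny space (a few hundred trees) contains several structurally different but equivalent expressions, e.g. (x*x)+x and (x+1)*x, which is why symbolic methods pair structure search with parsimony criteria.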
Deep Learning-Based Approaches
Neural network-based methods employ DNNs either to serve as function approximators for the solution field (enabling automatic differentiation), or as flexible coordinate transforms (autoencoders) to latent spaces with simpler dynamics (Champion et al., 2019, Xu et al., 2019). Physics-informed neural networks (PINNs) directly incorporate the candidate PDE as a constraint in the loss function, regularizing the network toward satisfying the governing law (Norman et al., 30 May 2024, Han et al., 18 Sep 2025).
One hybrid approach first trains a neural network surrogate on data, generates meta-data by querying the network at arbitrary points (enabling data upsampling), calculates derivatives via automatic differentiation, and then applies sparse regression for structure recovery (Xu et al., 2019, Cheng et al., 2023).
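The automatic-differentiation step of that pipeline can be sketched with forward-mode dual numbers; here a hypothetical closed-form polynomial, x(t) = 1 - t + t², stands in for the trained network surrogate, and dense querying plays the role of meta-data generation:

```python
class Dual:
    # Minimal forward-mode automatic differentiation: carries value plus derivative.
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot
    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.dot + other.dot)
    __radd__ = __add__
    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)
    __rmul__ = __mul__

def surrogate(t):
    # Hypothetical stand-in for a trained network surrogate of x(t);
    # here x(t) = 1 - t + t^2, chosen only so the AD step is checkable.
    return 1.0 + (-1.0) * t + t * t

# "Meta-data": query the surrogate densely at arbitrary points, differentiating
# each query with a dual number; the derivatives then feed sparse regression.
ts = [0.05 * i for i in range(21)]
queries = [surrogate(Dual(t, 1.0)) for t in ts]
xs = [q.val for q in queries]
dx = [q.dot for q in queries]  # exact derivative -1 + 2t at each query point
```

Because the surrogate is queried rather than the raw data, the sample locations are arbitrary (upsampling) and the derivatives carry no finite-difference noise amplification.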
Bayesian and Statistical Frameworks
Recent advances recast discovery as a Bayesian inference problem, formally accounting for uncertainty in model structure and parameters, as well as observational noise and missing data (North et al., 2022, North et al., 2022). Hierarchical models expand the latent process in smooth bases to avoid instability in numerical differentiation, and spike-and-slab or regularized horseshoe priors enforce sparsity.
LLM and Diffusion Model-Based Equation Generation
Recent symbolic regression frameworks, such as DrSR and DiffuSR, leverage LLMs and generative diffusion models, conditioned on structured data insights, to generate candidate equations. These methods integrate statistical priors, data-driven heuristics, and iterative refinement driven by the performance of previously generated equations (Wang et al., 4 Jun 2025, Han et al., 16 Sep 2025).
3. Library Construction, Physical Constraints, and Coordinate Selection
Most sparse and regression-based methods start with the generation of a comprehensive feature library $\Theta$, incorporating monomials, derivatives, or more complex motifs (such as Dirac delta functions for population balance equations (Leong et al., 19 Aug 2025)). The choice and structure of this library are critical: exhaustive combinatorial expansion provides completeness but at the expense of tractability; strategies like DMD-guided library design (Leong et al., 19 Aug 2025) or autoencoder-based coordinate discovery (Champion et al., 2019) help to focus the search on dynamically relevant subspaces or variables.
Physical constraints—symmetry, conservation, dimensional consistency—are often enforced either by constraining the candidate library, penalizing structurally inconsistent terms, or incorporating priors in Bayesian/statistical or evolutionary processes (Reinbold et al., 2019, Levko, 11 May 2024, Xiao et al., 9 Sep 2025). Dimensional analysis (e.g., the Buckingham π theorem) further restricts admissible equations to those that respect unit invariance, as in the FIND framework (Xiao et al., 9 Sep 2025).
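Dimensional filtering of a candidate library amounts to simple bookkeeping of base-dimension exponents; the pendulum-style variable names below are hypothetical, chosen only to make the check concrete:

```python
# Base-dimension exponents (mass, length, time) for illustrative variables
# of a pendulum-style problem (names are hypothetical).
DIMS = {
    "tau": (0, 0, 1),   # period
    "ell": (0, 1, 0),   # length
    "g":   (0, 1, -2),  # gravitational acceleration
    "m":   (1, 0, 0),   # mass
}

def net_dims(term):
    # term maps variable name -> exponent in a candidate product of powers.
    return tuple(sum(DIMS[v][i] * p for v, p in term.items()) for i in range(3))

def is_dimensionless(term):
    # Admit a candidate group only if every net base-dimension exponent vanishes.
    return net_dims(term) == (0, 0, 0)

# tau^2 g / ell is the classic dimensionless pendulum group; tau * g is not.
good = {"tau": 2, "g": 1, "ell": -1}
bad = {"tau": 1, "g": 1}
```

Rejecting dimensionally inconsistent products before regression shrinks the library and rules out physically meaningless equations by construction.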
Automatic coordinate selection and transformation is crucial, with some methods explicitly optimizing for low-dimensional latent representations in which the equations are sparsest and most interpretable (Champion et al., 2019).
4. Handling Noise, Missing Data, and Robustness
Noise in measurements and the challenges of numerical differentiation are fundamental obstacles. Strategies include:
- Neural network surrogates trained as denoisers and for robust AD-based derivative calculation (Xu et al., 2019, Cheng et al., 2023);
- Local polynomial interpolation (with smoothing or Savitzky–Golay filters) to compute derivatives with controlled amplification of noise (Reinbold et al., 2019);
- Bayesian hierarchical models that propagate and infer measurement/parameter uncertainty and accommodate missing data via flexible incidence matrices (North et al., 2022);
- Ensemble approaches, e.g., bagging/bragging, for model robustness when sampling is limited (Leong et al., 19 Aug 2025).
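The local-polynomial idea in the second bullet can be made concrete with the standard 5-point Savitzky–Golay first-derivative stencil, whose interior-point weights for a quadratic/cubic fit on a uniform grid are (-2, -1, 0, 1, 2)/(10h); the test signal below is illustrative:

```python
def sg_derivative(y, h):
    # 5-point Savitzky-Golay first-derivative filter (quadratic/cubic local fit,
    # uniform spacing h); interior points only, the two endpoints on each side
    # are skipped.
    c = (-2.0, -1.0, 0.0, 1.0, 2.0)
    out = []
    for i in range(2, len(y) - 2):
        out.append(sum(cj * y[i - 2 + j] for j, cj in enumerate(c)) / (10.0 * h))
    return out

# On a noise-free quadratic the stencil is exact: d/dt (t^2) = 2t.
h = 0.1
ts = [h * i for i in range(11)]
ys = [t * t for t in ts]
dy = sg_derivative(ys, h)
```

Because the stencil averages over a window, high-frequency measurement noise is attenuated rather than amplified, which is the failure mode of naive finite differences.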
Empirical benchmarking shows that such methods can, in regimes of moderate noise, consistently recover the correct structural form and accurate coefficients of canonical PDEs (e.g., wave equation, Burgers’, KdV) and more complex laws, provided sufficient data density (Maslyaev et al., 2019, Xu et al., 2019, Levko, 11 May 2024).
5. Interpretability, Model Selection, and Physical Insight
Scientific utility demands interpretable, parsimonious models. Methods enforce sparsity via penalty terms (L1, thresholding, Bayesian priors), information criteria (e.g., parsimony-vs-accuracy metrics, redundancy loss (Cheng et al., 2023)), or physics-informed metrics balancing coefficient stability and equation fidelity.
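One common parsimony-vs-accuracy metric is an AIC-style score; the fit statistics below are hypothetical, intended only to show how a slightly less accurate but sparser candidate can win model selection:

```python
import math

def aic(n, mse, k):
    # Akaike-style score trading data fit (mse over n samples) against model
    # complexity (k active terms); lower is better.
    return n * math.log(mse) + 2 * k

# Hypothetical fit statistics: a 2-term model barely less accurate than a
# 5-term one; the parsimony penalty tips selection toward the sparse model.
n = 100
score_sparse = aic(n, 1.05e-3, 2)
score_dense = aic(n, 1.00e-3, 5)
```

At equal accuracy the penalty 2k always favors the smaller library, which is the behavior the parsimony-vs-accuracy metrics cited above are designed to encode.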
Discovered models are validated by comparing predicted and observed dynamics and, when possible, by matching physical mechanisms: e.g., mapping data-driven differential terms to circuit elements or plasma processes (Levko, 11 May 2024), or expressing coefficients as functions of dimensionless parameters to generalize across scenarios (Han et al., 18 Sep 2025).
Advanced frameworks such as FIND (Xiao et al., 9 Sep 2025) decompose formula search into latent variable generation and symbolic regression, using dimensional constraints and optimization to guarantee interpretability and minimize the search space.
LLM-guided approaches (DrSR, DiffuSR) embed interpretation into a dual data-prior process, using both structured residual analysis and natural language equation reasoning, and leverage the intrinsic linguistic structural priors in LLMs for diversity and compactness (Wang et al., 4 Jun 2025, Han et al., 16 Sep 2025).
6. Applications Across Domains
Data-driven equation discovery has been successfully applied to:
- Canonical physics PDEs: recovery of the wave, Burgers’, Korteweg–de Vries, and Navier–Stokes equations from clean and noisy data (Maslyaev et al., 2019, Xu et al., 2019, Cheng et al., 2023).
- Multi-scale and homogenized materials models: linking fine-scale simulations to effective/macroscale PDEs with greatly reduced computational sampling (Arbabi et al., 2020).
- Fluid and plasma physics: deriving low-dimensional ODE models capturing oscillatory phenomena in circuits and discharges (Levko, 11 May 2024).
- Climate science and parameterization: symbolic regression for closed-form cloud cover parameterizations, physically constrained, transferable, and competitive with deep neural networks in terms of accuracy (Grundner et al., 2023).
- Population balance models: discovery of multidimensional breakage equations with sparsely sampled data (Leong et al., 19 Aug 2025).
- Granular flow and materials rheology: interpretable friction evolution laws as a function of microscopic or macroscopic dimensionless parameters (Han et al., 18 Sep 2025).
- Discovery of critical system parameters and dimensionless numbers in electronics, materials, and astrophysics (Xiao et al., 9 Sep 2025).
- Scientific symbolic regression in interdisciplinary settings via LLMs and hybrid or diffusion-based generators (Wang et al., 4 Jun 2025, Han et al., 16 Sep 2025).
7. Future Directions and Open Challenges
Major identified directions for future development include:
- Reducing reliance on hand-specified or user-tuned feature libraries through Bayesian and generative model-based approaches (North et al., 2022, Wang et al., 4 Jun 2025, Han et al., 16 Sep 2025);
- Deep integration of uncertainty quantification, hierarchical modeling, and handling of missing data (via Bayesian, bootstrapping, or statistical frameworks) (North et al., 2022, North et al., 2022);
- Efficient handling of high-dimensional, multi-scale, and nonlocal phenomena through targeted library construction (DMD, autoencoding) and scalable optimization (Leong et al., 19 Aug 2025, Arbabi et al., 2020);
- Expansion to multimodal or complex data sources, including images and sensor streams, and extending symbolic regression to coupled/implicit, stochastic, or delay-differential models;
- Fusion of deep learning with symbolic and statistical models for end-to-end learning, and application to complex, real-world datasets.
A persistent challenge is devising frameworks that are robust and interpretable, yet flexible enough to capture the rich dynamics of real systems—including those with latent variables, ill-posed or partial observability, or emergent, data-driven influence factors. The trajectory of research suggests increasing hybridization of statistical, deep learning, symbolic, and domain-specific physical reasoning, fostering next-generation scientific discovery tools that systematically mine data for governing laws across disciplines.