A machine learning framework for uncovering stochastic nonlinear dynamics from noisy data

Published 7 Apr 2026 in cs.LG, cs.CE, and math.DS | (2604.06081v1)

Abstract: Modeling real-world systems requires accounting for noise - whether it arises from unpredictable fluctuations in financial markets, irregular rhythms in biological systems, or environmental variability in ecosystems. While the behavior of such systems can often be described by stochastic differential equations, a central challenge is understanding how noise influences the inference of system parameters and dynamics from data. Traditional symbolic regression methods can uncover governing equations but typically ignore uncertainty. Conversely, Gaussian processes provide principled uncertainty quantification but offer little insight into the underlying dynamics. In this work, we bridge this gap with a hybrid symbolic regression-probabilistic machine learning framework that recovers the symbolic form of the governing equations while simultaneously inferring uncertainty in the system parameters. The framework combines deep symbolic regression with Gaussian process-based maximum likelihood estimation to separately model the deterministic dynamics and the noise structure, without requiring prior assumptions about their functional forms. We verify the approach on numerical benchmarks, including harmonic, Duffing, and van der Pol oscillators, and validate it on an experimental system of coupled biological oscillators exhibiting synchronization, where the algorithm successfully identifies both the symbolic and stochastic components. The framework is data-efficient, requiring as few as 100-1000 data points, and robust to noise - demonstrating its broad potential in domains where uncertainty is intrinsic and both the structure and variability of dynamical systems must be understood.


Authors (6)

Summary

The paper presents a hybrid framework that fuses deep symbolic regression with GP-MLE to extract interpretable drift and structured diffusion from noisy time series.
It employs GP-based denoising and a recurrent network for symbolic regression to achieve robust performance with minimal data even in high-noise regimes.
The method accurately recovers stochastic differential equations and quantifies state-dependent uncertainty, advancing the modeling of complex dynamical systems.

A Modular Machine Learning Framework for Data-Efficient Stochastic Nonlinear System Identification

Introduction and Problem Setting

The inference of stochastic nonlinear dynamics from finite, noisy time series is a persistent challenge in physics, biology, and engineering. Classical deterministic symbolic regression techniques yield interpretable governing equations but neglect uncertainty quantification and struggle in high-noise regimes. Conversely, Gaussian process (GP)-based models provide principled uncertainty estimates and robust denoising but lack interpretability, yielding “black-box” surrogates. Existing hybrid or Bayesian histogram approaches either impose restrictive functional assumptions, scale poorly with dimensionality, or require extensive datasets.

This work presents a hybrid framework fusing deep symbolic regression (DSR) with Gaussian process-based maximum likelihood estimation to recover both symbolic governing equations and their structured, state-dependent noise components from minimal, noisy measurements. The approach exploits the physical insight that stochastic fluctuations in real systems often mirror the functional structure of the underlying deterministic dynamics due to parameter variability.

Methodological Pipeline

The framework proceeds through modular stages, built on robust GP-based denoising and symbolic equation discovery, followed by explicit uncertainty modeling. An overview is shown below.

Figure 1: Pipeline for discovering SDEs with structured diffusion, illustrated on a Duffing oscillator. Raw data is denoised via GPs, then the drift is discovered with symbolic regression, and the diffusion structure is inferred by GP-MLE in the basis of discovered drift terms.

Preprocessing and Denoising:

Noisy time series are first denoised using GPs to produce smooth, differentiable state and derivative estimates. This allows the drift function to be estimated with minimal bias. Standard finite-difference differentiation is eschewed in favor of GP posterior derivatives, which provide significantly less noise amplification (see Figure 1B).
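As a rough illustration of why GP posterior derivatives amplify noise less than finite differences, the sketch below fits a zero-mean GP with a squared-exponential kernel to a noisy sine wave and differentiates the posterior mean analytically. The kernel choice, the fixed hyperparameters (`ell`, `sf`, `sn`), and the helper names are assumptions made for this example, not details taken from the paper; in practice the hyperparameters would typically be fitted by maximizing the marginal likelihood.

```python
import numpy as np

def rbf(a, b, ell, sf):
    """Squared-exponential kernel matrix k(a_i, b_j)."""
    d = a[:, None] - b[None, :]
    return sf**2 * np.exp(-0.5 * (d / ell) ** 2)

def gp_denoise(t, y, ell=1.0, sf=1.0, sn=0.1):
    """Return the GP posterior mean of x(t) and of dx/dt at the sample times.

    Hyperparameters are fixed by hand here (an assumption of this sketch).
    """
    K = rbf(t, t, ell, sf)
    alpha = np.linalg.solve(K + sn**2 * np.eye(t.size), y)
    mean = K @ alpha
    # Derivative of the kernel with respect to its first argument:
    # d/dt* k(t*, t) = -(t* - t) / ell^2 * k(t*, t)
    d = t[:, None] - t[None, :]
    dmean = (-(d / ell**2) * K) @ alpha
    return mean, dmean

# Toy data: noisy sine wave, whose true derivative is cos(t)
rng = np.random.default_rng(0)
t = np.linspace(0.0, 10.0, 200)
y = np.sin(t) + 0.1 * rng.standard_normal(t.size)

x_hat, dx_hat = gp_denoise(t, y)
fd = np.gradient(y, t)  # finite-difference derivative of the raw data

gp_err = np.sqrt(np.mean((dx_hat - np.cos(t)) ** 2))
fd_err = np.sqrt(np.mean((fd - np.cos(t)) ** 2))
print(f"derivative RMSE  GP: {gp_err:.3f}   finite differences: {fd_err:.3f}")
```

Because differentiation is a linear operation, the derivative of the posterior mean is obtained by differentiating the kernel, so no numerical differencing of the noisy samples is needed.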

Deep Symbolic Regression of the Drift:

Denoised observations are next processed by DSR, implemented here as a recurrent neural network trained with a risk-seeking reinforcement learning policy. Symbolic expressions are assembled token-by-token from an operator library and scored on fitting accuracy, parsimony, and additional domain-informed constraints.
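The token-by-token construction and scoring can be sketched in a few lines. The example below is deliberately simplified and rests on several assumptions: a fixed categorical distribution stands in for the recurrent network, the token library and the `length_penalty` are hypothetical, and no policy update is performed. It only shows how prefix-notation expressions are sampled, how a reward combining fit and parsimony might be computed, and how a risk-seeking scheme keeps only the top fraction of each batch for the update.

```python
import numpy as np

# Hypothetical token library: name -> (arity, function)
LIBRARY = {
    "add": (2, np.add),
    "mul": (2, np.multiply),
    "sin": (1, np.sin),
    "x": (0, None),  # position
    "v": (0, None),  # velocity
}
TOKENS = list(LIBRARY)

def sample_expression(rng, probs, max_len=15):
    """Sample a prefix-notation expression one token at a time."""
    tokens, open_slots = [], 1
    while open_slots > 0:
        p = probs.copy()
        if len(tokens) + open_slots >= max_len:
            # Length budget nearly spent: allow only terminals
            p = np.array([pi if LIBRARY[t][0] == 0 else 0.0
                          for pi, t in zip(p, TOKENS)])
        p = p / p.sum()
        tok = TOKENS[rng.choice(len(TOKENS), p=p)]
        tokens.append(tok)
        open_slots += LIBRARY[tok][0] - 1
    return tokens

def evaluate(tokens, data):
    """Evaluate a prefix expression on a dict of input arrays."""
    def rec(pos):
        arity, fn = LIBRARY[tokens[pos]]
        if arity == 0:
            return data[tokens[pos]], pos + 1
        args, pos = [], pos + 1
        for _ in range(arity):
            val, pos = rec(pos)
            args.append(val)
        return fn(*args), pos
    return rec(0)[0]

def reward(tokens, data, target, length_penalty=0.01):
    """Fit term 1/(1+NRMSE) minus a (hypothetical) parsimony penalty."""
    pred = evaluate(tokens, data)
    nrmse = np.sqrt(np.mean((pred - target) ** 2)) / np.std(target)
    return 1.0 / (1.0 + nrmse) - length_penalty * len(tokens)

rng = np.random.default_rng(0)
data = {"x": rng.uniform(-2, 2, 200), "v": rng.uniform(-2, 2, 200)}
target = np.sin(data["x"]) + data["v"]  # toy "drift" to recover

probs = np.full(len(TOKENS), 1.0 / len(TOKENS))  # stand-in for the RNN policy
batch = [sample_expression(rng, probs) for _ in range(500)]
rewards = np.array([reward(e, data, target) for e in batch])

# Risk-seeking selection: only the top 5% of the batch would drive the update
threshold = np.quantile(rewards, 0.95)
elite = [e for e, r in zip(batch, rewards) if r >= threshold]
print("best sampled expression:", batch[int(np.argmax(rewards))])
```

In a full implementation the elite expressions would feed a policy-gradient step on the sequence model, so that later batches concentrate on high-reward regions of expression space.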

Figure 3: DSR generates interpretable drift equations using a sequence model. Parsimony and domain constraints are enforced by reward shaping and episodic policy updates.

Diffusion Structure via Gaussian Process MLE:

The framework assumes “structured diffusion”—the stochastic differential equation (SDE) diffusion term is parameterized by the same basis functions present in the inferred drift. After subtracting the learned drift from the measured increments, the residuals correspond to the stochastic component. These are modeled as a heteroscedastic Gaussian process, where the variance at each state is a linear combination of squared drift basis function magnitudes, with independent variance parameters.
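One way to write this assumption down is given below. The notation is introduced here for illustration and is not taken from the paper: $\phi_k$ are the basis functions found in the drift, $\theta_k$ their coefficients, $\sigma_k$ the per-term noise amplitudes, and $W^{(k)}$ independent Wiener processes. An additive noise term corresponds to a constant basis function $\phi_0 = 1$.

$$
dx_t = \sum_k \theta_k\,\phi_k(x_t)\,dt + \sum_k \sigma_k\,\phi_k(x_t)\,dW_t^{(k)}
$$

Under an Euler-Maruyama discretization with step $\Delta t$, the residual left after subtracting the drift is then approximately Gaussian with a state-dependent variance:

$$
r_i = \Delta x_i - \sum_k \theta_k\,\phi_k(x_i)\,\Delta t,
\qquad
r_i \sim \mathcal{N}\!\left(0,\; s_i\right),
\qquad
s_i = \Delta t \sum_k \sigma_k^2\,\phi_k(x_i)^2 .
$$

Treating the residuals as independent across time steps (a simplification of the heteroscedastic GP described above), the negative log-likelihood to be minimized over the $\sigma_k^2$ is, up to an additive constant,

$$
-\log L(\sigma^2) = \frac{1}{2}\sum_i \left[\log s_i + \frac{r_i^2}{s_i}\right].
$$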

Figure 5: Extracted stochastic residuals (blue) are modeled by a heteroscedastic GP fitted via MLE, yielding precise, state-dependent diffusion structure.

Variance parameters are jointly optimized via MLE, with small (irrelevant) coefficients pruned post-inference for model sparsity (automatic relevance determination). Notably, the entire framework is modular: any symbolic regression or GP subroutine can be slotted into the pipeline, and domain-informed constraints or kernel architectures are easily incorporated.
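A minimal sketch of this joint optimization and pruning step is shown below, assuming independent Gaussian residuals whose variance is a weighted sum of squared basis functions. The function name `fit_structured_diffusion`, the log-parameterization, the L-BFGS-B optimizer, and the relative pruning threshold `prune_tol` are choices made for this example rather than details from the paper.

```python
import numpy as np
from scipy.optimize import minimize

def fit_structured_diffusion(residuals, basis, dt, prune_tol=1e-3):
    """MLE of per-basis noise variances sigma_k^2 under the model
    Var[r_i] = dt * sum_k sigma_k^2 * basis[i, k]**2."""
    phi2 = basis**2

    def nll_and_grad(log_var):
        w = np.exp(log_var)            # optimize log-variances to keep them positive
        s = dt * phi2 @ w + 1e-12      # predicted residual variance per sample
        f = 0.5 * np.sum(np.log(s) + residuals**2 / s)
        dfds = 0.5 * (1.0 / s - residuals**2 / s**2)
        g = dt * (phi2.T @ dfds) * w   # chain rule through exp()
        return f, g

    k = basis.shape[1]
    x0 = np.full(k, np.log(np.var(residuals) / dt))
    res = minimize(nll_and_grad, x0, jac=True, method="L-BFGS-B",
                   bounds=[(-20.0, 10.0)] * k)
    var = np.exp(res.x)
    var[var < prune_tol * var.max()] = 0.0  # crude relevance pruning
    return var

# Synthetic check: additive noise plus noise on the x-term, none on the v-term
rng = np.random.default_rng(0)
n, dt = 5000, 0.01
x, v = rng.uniform(-2, 2, n), rng.uniform(-2, 2, n)
basis = np.column_stack([np.ones(n), x, v])
residuals = np.sqrt(dt) * (0.5 * rng.standard_normal(n)
                           + 0.3 * x * rng.standard_normal(n))

est_var = fit_structured_diffusion(residuals, basis, dt)
print("estimated variances:", est_var, " true: [0.25, 0.09, 0.0]")
```

With finite data the inactive coefficient is generally estimated as small rather than exactly zero, which is why some thresholding rule is needed to obtain a sparse diffusion model.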

Empirical Verification: Benchmarks and Robustness

Recovery on Stochastic Oscillator Benchmarks

Verification is performed on synthetic data from canonical oscillatory systems (harmonic, Duffing, van der Pol), subject to a spectrum of noise regimes (additive, multiplicative, parameter fluctuations, multi-parameter noise):

Figure 6: For a linear oscillator under increasing additive noise, drift and diffusion parameters are accurately recovered; spurious activations in diffusion emerge only at elevated noise.

Key performance features demonstrated:

Symbolic regression consistently yields accurate, parsimonious drift expressions, even with limited data ($10^2$–$10^3$ points) and a low signal-to-noise ratio (SNR).
GP-MLE diffusion parameter estimates closely match ground truth, with uncertainty bars reflecting intrinsic data ambiguity.
Minor spurious terms emerge in diffusion only as noise approaches the amplitude of the underlying signal.
The entire structural form of both deterministic and stochastic model components is robustly captured (Figure 6).

Figure 7: Recovery of drift (means) and diffusion (variances) for oscillators with structured, parameter-wise noise injected. The framework resolves all active terms, with quantitatively accurate uncertainty measures.

Multiplicative Noise & Heteroscedasticity

Multiplicative noise cases, e.g., fluctuating resonance frequency or nonlinear parameter noise, are handled seamlessly by the drift-basis expansion. Estimated distributions for frequency noise or other parameter instabilities track the true underlying stochasticity both in moments and distributional shape when analyzed via the Euler-Maruyama approximation and MLE posteriors.

Figure 8: Natural frequency distribution under multiplicative noise reliably inferred—histogram of GP residuals aligns with ground truth.
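To make the multiplicative case concrete, the sketch below simulates a damped harmonic oscillator whose stiffness term fluctuates, using the Euler-Maruyama scheme, and then recovers the noise strength from the drift-subtracted increments. All parameter values are hypothetical, the drift is taken as known rather than discovered, and the simple moment estimator used here is a stand-in for the GP-MLE step.

```python
import numpy as np

rng = np.random.default_rng(1)
omega0, damping, sigma = 1.0, 0.1, 0.2  # hypothetical parameter values
dt, n_steps = 0.01, 2000

x = np.empty(n_steps + 1)
v = np.empty(n_steps + 1)
x[0], v[0] = 1.0, 0.0

# Euler-Maruyama integration of
#   dx = v dt
#   dv = (-omega0^2 x - damping v) dt - sigma x dW
# i.e. the noise enters through the stiffness (frequency) term.
for i in range(n_steps):
    dW = np.sqrt(dt) * rng.standard_normal()
    x[i + 1] = x[i] + v[i] * dt
    v[i + 1] = v[i] + (-omega0**2 * x[i] - damping * v[i]) * dt - sigma * x[i] * dW

# Subtract the drift from the velocity increments to expose the stochastic part
drift = -omega0**2 * x[:-1] - damping * v[:-1]
residuals = np.diff(v) - drift * dt

# Since E[r_i^2] = sigma^2 * x_i^2 * dt, a moment estimate of sigma^2 is
sigma2_hat = np.sum(residuals**2) / (dt * np.sum(x[:-1] ** 2))
print(f"estimated sigma^2: {sigma2_hat:.4f}   true: {sigma**2:.4f}")
```

The residual variance scales with $x^2$, which is exactly the kind of state dependence that a diffusion model built from the drift basis can represent.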

Token-Wise Diffusion Recovery

Token occurrence frequencies in the discovered diffusion models further support the framework's capacity for structural identification: the correct drift/diffusion pairings are found in every realization of every benchmark, with only minor spurious token activations at extreme noise.

Figure 4: Diffusion basis function identification is consistent across replicates and noise levels; minor spurious tokens arise only under very high noise.

Experimental Application: Coupled Bacterial Oscillator Synchronization

Applicability to real-world, sparse, noisy data is validated on an experimental system of two E. coli bacteria coupled via microcavities (from [Japaridze et al., 2025]). This setting involves sparse ($\sim 10^2$ points), irregularly sampled phase trajectories.

Figure 2: Microscope image of coupled E. coli in microfluidic cavities exhibiting synchronization; phase time series are used for SDE identification.

The framework:

Recovers the noisy Adler equation phase model and coupling parameters, matching empirical estimates.
Discovers biologically relevant model structure, predicting both additive and parameter-coupling noise and breaking the assumption of symmetric coupling between oscillators.
Outperforms histogram- and Bayesian histogram-based methods (such as BISDEs), which fail under limited data due to their requirement for state space partitioning and statistical sampling.

Figure 10: GP-DSR (red bars) accurately predicts drift and diffusion coefficients (mean and variance) and reproduces observed synchronization events in simulation of the recovered model.
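For readers unfamiliar with the Adler equation, the sketch below simulates its textbook form with additive noise, $d\varphi = (\Delta\omega - K \sin\varphi)\,dt + \sigma\,dW$, where $\varphi$ is the phase difference between the two oscillators. The parameter values are hypothetical and are not the ones inferred in the paper, and the parameter-coupling noise and asymmetric coupling reported there are omitted. When $|\Delta\omega| < K$ and the noise is weak, the phase difference locks near $\arcsin(\Delta\omega / K)$, which is the signature of synchronization.

```python
import numpy as np

rng = np.random.default_rng(2)
delta_omega, coupling, sigma = 0.5, 1.0, 0.1  # hypothetical values
dt, n_steps = 0.01, 20000

phi = np.empty(n_steps + 1)
phi[0] = 0.0
noise = sigma * np.sqrt(dt) * rng.standard_normal(n_steps)

# Euler-Maruyama integration of the noisy Adler equation
for i in range(n_steps):
    phi[i + 1] = phi[i] + (delta_omega - coupling * np.sin(phi[i])) * dt + noise[i]

locked_phase = np.arcsin(delta_omega / coupling)  # deterministic fixed point
mean_phase = phi[n_steps // 2:].mean()            # discard the transient
print(f"mean phase difference: {mean_phase:.3f}   locked value: {locked_phase:.3f}")
```

Raising `sigma` or moving `delta_omega` closer to `coupling` makes phase slips more likely, so the trajectory alternates between locked episodes and jumps of $2\pi$.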

Theoretical and Practical Implications

Data Efficiency and Dimensionality:

This framework avoids the curse of dimensionality inherent in bin-based identification and Bayesian histogram methods for SDE inference. The MLE, denoising, and symbolic search steps all scale with the trajectory length and the number of candidate basis functions, not with state-space granularity. Robust, interpretable SDE modeling therefore becomes practical for moderate-to-high dimensional dynamical systems and minimal datasets.

Interpretability and Physical Constraints:

By parameterizing diffusion via drift basis functions, the model enforces physical structure (structured diffusion), enhancing interpretability and facilitating mechanisms for model parsimony and hypothesis testing. Uncertainty estimates are associated with explicit, interpretable quantities—basis-function-wise variance, not black-box noise terms.

Modularity and Extensibility:

Alternative SR methods (e.g., transformer-based, GP-based) or GP models (kernel design, colored noise) can be swapped in because the pipeline is modular. Regularization (length penalty, domain constraints), reward shaping, and ARD can all be domain- or application-informed.

Limitations:

A core structural assumption is that diffusion shares support with the drift. Systems with noise of extrinsic origin (not reflecting parameter fluctuations) may require relaxing this constraint at the cost of interpretability or data-efficiency. Extraction of stochastic residuals depends on the Euler-Maruyama discretization and the fidelity of derivative estimates for higher-order systems.

Outlook and Conclusion

This framework offers a methodologically sound, data-efficient, and interpretable approach for learning SDEs from noisy, real-world data. It provides strong empirical evidence (benchmarks and experiment) that a modular pipeline which fuses denoising, symbolic drift reconstruction, and GP-MLE diffusion estimation can unlock accurate structured uncertainty quantification with minimal data and significant noise. Immediate applications span experimental biophysics, nanomechanics, and high-noise signal processing scenarios.

Scalable extensions include nonparametric diffusion modeling, adaptation to colored or non-Wiener noise via temporal GP kernels, and direct application to more complex or high-dimensional dynamical systems. The robust identification of symbolically structured drift and diffusion components enables deeper theoretical and mechanistic insight, supports downstream control/modeling, and facilitates reduced-model design in both simulation and experiment.

References

The methods and context discussed in this essay are detailed in "A machine learning framework for uncovering stochastic nonlinear dynamics from noisy data" (2604.06081).
