- The paper presents a hybrid framework that fuses deep symbolic regression with GP-MLE to extract interpretable drift and structured diffusion from noisy time series.
- It employs GP-based denoising and a recurrent network for symbolic regression to achieve robust performance with minimal data even in high-noise regimes.
- The method accurately recovers stochastic differential equations and quantifies state-dependent uncertainty, advancing the modeling of complex dynamical systems.
A Modular Machine Learning Framework for Data-Efficient Stochastic Nonlinear System Identification
Introduction and Problem Setting
The inference of stochastic nonlinear dynamics from finite, noisy time series is a persistent challenge in physics, biology, and engineering. Classical deterministic symbolic regression techniques yield interpretable governing equations but neglect uncertainty quantification and struggle in high-noise regimes. Conversely, Gaussian process (GP)-based models provide principled uncertainty estimates and robust denoising but lack interpretability, yielding “black-box” surrogates. Existing hybrid or Bayesian histogram approaches impose restrictive functional assumptions, scale poorly with dimensionality, or require extensive datasets.
This work presents a hybrid framework fusing deep symbolic regression (DSR) with Gaussian process-based maximum likelihood estimation to recover both symbolic governing equations and their structured, state-dependent noise components from minimal, noisy measurements. The approach exploits the physical insight that stochastic fluctuations in real systems often mirror the functional structure of the underlying deterministic dynamics due to parameter variability.
Methodological Pipeline
The framework proceeds through modular stages, built on robust GP-based denoising and symbolic equation discovery, followed by explicit uncertainty modeling. An overview is shown below.
Figure 1: Pipeline for discovering SDEs with structured diffusion, illustrated on a Duffing oscillator. Raw data is denoised via GPs, then the drift is discovered with symbolic regression, and the diffusion structure is inferred by GP-MLE in the basis of discovered drift terms.
Preprocessing and Denoising:
Noisy time series are first denoised using GPs to produce smooth, differentiable state and derivative estimates. This allows the drift function to be estimated with minimal bias. Standard finite-difference differentiation is eschewed in favor of GP posterior derivatives, which amplify measurement noise far less (see Figure 1B).
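A minimal sketch of this denoising step, assuming a squared-exponential kernel with fixed, hand-picked hyperparameters (`length_scale`, `signal_std`, `noise_std` are illustrative names and values, not taken from the paper). It computes the GP posterior mean of the signal and of its time derivative, and compares the latter with finite differences on the raw data:

```python
import numpy as np

def rbf_kernel(a, b, length_scale, signal_std):
    """Squared-exponential kernel k(a, b) on 1-D inputs."""
    diff = a[:, None] - b[None, :]
    return signal_std**2 * np.exp(-0.5 * (diff / length_scale) ** 2)

def gp_denoise(t, y, length_scale=1.0, signal_std=1.0, noise_std=0.1):
    """Posterior mean of the latent signal and of its time derivative at t.

    Hyperparameters are fixed here for brevity (an assumption); in practice
    they would be fitted, e.g. by maximizing the marginal likelihood.
    """
    K = rbf_kernel(t, t, length_scale, signal_std)
    alpha = np.linalg.solve(K + noise_std**2 * np.eye(len(t)), y)
    diff = t[:, None] - t[None, :]
    dK = -(diff / length_scale**2) * K  # derivative of k(t*, t) w.r.t. t*
    return K @ alpha, dK @ alpha

# Toy usage: noisy sine wave, whose true derivative is cos(t).
rng = np.random.default_rng(0)
t = np.linspace(0.0, 2.0 * np.pi, 80)
y = np.sin(t) + 0.1 * rng.standard_normal(t.size)
x_hat, dx_hat = gp_denoise(t, y)
dx_fd = np.gradient(y, t)  # finite differences on the raw data, for comparison
```

Because the derivative of a GP is again a GP, the derivative estimate comes from differentiating the kernel rather than the noisy samples, which is why it is smoother than `dx_fd`.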
Deep Symbolic Regression of the Drift:
Denoised observations are next processed by DSR, implemented here as a recurrent neural network trained with a risk-seeking reinforcement learning policy. Symbolic expressions are assembled token-by-token from an operator library and scored on fitting accuracy, parsimony, and additional domain-informed constraints.
Figure 3: DSR generates interpretable drift equations using a sequence model. Parsimony and domain constraints are enforced by reward shaping and episodic policy updates.
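The scoring and risk-seeking selection can be illustrated with a small sketch. It assumes a reward of the form 1/(1 + NRMSE) minus a per-token length penalty, and keeps only the candidates above the (1 - epsilon) reward quantile, which are the ones that would drive the policy update. The candidate list, `length_penalty`, and `epsilon` are hypothetical; the recurrent network that samples expressions is omitted.

```python
import numpy as np

def reward(y_true, y_pred, n_tokens, length_penalty=0.01):
    """Fit quality in (0, 1] minus a parsimony penalty per token (illustrative)."""
    nrmse = np.sqrt(np.mean((y_true - y_pred) ** 2)) / np.std(y_true)
    return 1.0 / (1.0 + nrmse) - length_penalty * n_tokens

def risk_seeking_elite(rewards, epsilon=0.25):
    """Indices of samples at or above the (1 - epsilon) reward quantile."""
    rewards = np.asarray(rewards)
    threshold = np.quantile(rewards, 1.0 - epsilon)
    return np.flatnonzero(rewards >= threshold), threshold

# Toy batch of candidate drift expressions for a Duffing-like restoring force.
x = np.linspace(-2.0, 2.0, 50)
target = -x - 0.5 * x**3
candidates = {                      # name -> (prediction, token count)
    "-x": (-x, 2),
    "-x - 0.5*x**3": (-x - 0.5 * x**3, 8),
    "sin(x)": (np.sin(x), 2),
    "x**2": (x**2, 3),
}
names = list(candidates)
rewards = [reward(target, pred, n) for pred, n in candidates.values()]
elite, threshold = risk_seeking_elite(rewards, epsilon=0.25)
best = [names[i] for i in elite]
```

Selecting on the upper reward quantile rather than the batch average focuses learning on best-case expressions, which suits a search problem where only the single best equation matters.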
Diffusion Structure via Gaussian Process MLE:
The framework assumes “structured diffusion”—the stochastic differential equation (SDE) diffusion term is parameterized by the same basis functions present in the inferred drift. After subtracting the learned drift from the measured increments, the residuals correspond to the stochastic component. These are modeled as a heteroscedastic Gaussian process, where the variance at each state is a linear combination of squared drift basis function magnitudes, with independent variance parameters.
Figure 5: Extracted stochastic residuals (blue) are modeled by a heteroscedastic GP fitted via MLE, yielding precise, state-dependent diffusion structure.
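One way to write the structured-diffusion assumption explicitly is given below. The notation (basis functions \phi_k, drift coefficients c_k, variance weights \theta_k) is introduced here for illustration, and the residuals are treated as independent across time steps under an Euler-Maruyama discretization, which is a simplification of the heteroscedastic GP described above.

```latex
\begin{aligned}
dx_t &= f(x_t)\,dt + g(x_t)\,dW_t, \qquad f(x) = \sum_{k=1}^{K} c_k\,\phi_k(x),\\
r_i &= \Delta x_i - f(x_i)\,\Delta t \;\approx\; g(x_i)\,\Delta W_i,\\
\sigma_i^2 &= \operatorname{Var}[r_i \mid x_i] = \Delta t \sum_{k=1}^{K} \theta_k\,\phi_k(x_i)^2, \qquad \theta_k \ge 0,\\
-\log L(\theta) &= \frac{1}{2}\sum_{i}\left[\log\!\big(2\pi\,\sigma_i^2\big) + \frac{r_i^2}{\sigma_i^2}\right].
\end{aligned}
```

Each \theta_k attaches a noise strength to one drift term, so a vanishing \theta_k means that term carries no detectable fluctuation.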
Variance parameters are jointly optimized via MLE, with small (irrelevant) coefficients pruned post-inference for model sparsity (automatic relevance determination). Notably, the entire framework is modular: any symbolic regression or GP subroutine can be slotted into the pipeline, and domain-informed constraints or kernel architectures are easily incorporated.
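A minimal sketch of the joint variance fit and pruning, under simplifying assumptions: residuals are treated as independent zero-mean Gaussians whose variance is `dt` times a non-negative combination of squared basis values, the weights are optimized in log space with SciPy's L-BFGS-B, and a hard relative threshold (`prune_tol`, a hypothetical parameter) stands in for the pruning step.

```python
import numpy as np
from scipy.optimize import minimize

def fit_structured_diffusion(residuals, basis, dt, prune_tol=0.05):
    """MLE of theta in Var[r_i] = dt * sum_k theta_k * basis[i, k]**2."""
    B2 = basis**2

    def nll_and_grad(log_theta):
        theta = np.exp(log_theta)            # log-parameterization keeps theta > 0
        var = dt * (B2 @ theta) + 1e-12      # small jitter avoids division by zero
        nll = 0.5 * np.sum(np.log(var) + residuals**2 / var)
        dnll_dvar = 0.5 * (1.0 / var - residuals**2 / var**2)
        grad = dt * (B2.T @ dnll_dvar) * theta   # chain rule through exp
        return nll, grad

    x0 = np.zeros(basis.shape[1])
    res = minimize(nll_and_grad, x0, jac=True, method="L-BFGS-B")
    theta = np.exp(res.x)
    theta[theta < prune_tol * theta.max()] = 0.0  # drop negligible terms
    return theta

# Synthetic check: noise enters through the x and v terms, not through x**3.
rng = np.random.default_rng(1)
n, dt = 5000, 0.01
x = rng.uniform(-2.0, 2.0, n)
v = rng.uniform(-2.0, 2.0, n)
basis = np.column_stack([x, x**3, v])
theta_true = np.array([0.5, 0.0, 0.1])
residuals = np.sqrt(dt * (basis**2 @ theta_true)) * rng.standard_normal(n)
theta_hat = fit_structured_diffusion(residuals, basis, dt)
```

In this toy setting `theta_hat` should land near `theta_true`, with the weight on the inactive `x**3` term driven toward zero and then removed by the threshold.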
Empirical Verification: Benchmarks and Robustness
Recovery on Stochastic Oscillator Benchmarks
Verification is performed on synthetic data from canonical oscillatory systems (harmonic, Duffing, van der Pol), subject to a spectrum of noise regimes (additive, multiplicative, parameter fluctuations, multi-parameter noise):
Figure 6: For a linear oscillator under increasing additive noise, drift and diffusion parameters are accurately recovered; spurious activations in diffusion emerge only at elevated noise.
Key performance features demonstrated:
- Symbolic regression consistently yields accurate, parsimonious drift expressions, even with limited data (10²–10³ points) and low signal-to-noise ratios.
- GP-MLE diffusion parameter estimates closely match ground truth, with uncertainty bars reflecting intrinsic data ambiguity.
- Minor spurious terms emerge in diffusion only as noise approaches the amplitude of the underlying signal.
- The entire structural form of both deterministic and stochastic model components is robustly captured (Figure 6).
Figure 7: Recovery of drift (means) and diffusion (variances) for oscillators with structured, parameter-wise noise injected—framework resolves all active terms, with quantitatively accurate uncertainty measures.
Multiplicative Noise & Heteroscedasticity
Multiplicative noise cases, e.g., fluctuating resonance frequency or nonlinear parameter noise, are handled seamlessly by the drift-basis expansion. Estimated distributions for frequency noise or other parameter instabilities track the true underlying stochasticity both in moments and distributional shape when analyzed via the Euler-Maruyama approximation and MLE posteriors.
Figure 8: Natural frequency distribution under multiplicative noise reliably inferred—histogram of GP residuals aligns with ground truth.
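The role of the Euler-Maruyama approximation here can be sketched on a toy case: a damped harmonic oscillator whose stiffness fluctuates, so that the noise enters multiplied by the position `x`. All parameter values and names below are illustrative assumptions. After simulating, the known drift is subtracted from the velocity increments and the residuals are rescaled by `x * sqrt(dt)`, which should leave samples whose spread equals the noise strength `sigma`.

```python
import numpy as np

def simulate_em(n_steps=5000, dt=0.01, omega0=1.0, gamma=0.1, sigma=0.2, seed=0):
    """Euler-Maruyama for dx = v dt, dv = -(omega0^2 x + gamma v) dt + sigma x dW."""
    rng = np.random.default_rng(seed)
    x = np.empty(n_steps + 1)
    v = np.empty(n_steps + 1)
    x[0], v[0] = 1.0, 0.0
    dW = np.sqrt(dt) * rng.standard_normal(n_steps)  # Wiener increments
    for i in range(n_steps):
        x[i + 1] = x[i] + v[i] * dt
        v[i + 1] = v[i] - (omega0**2 * x[i] + gamma * v[i]) * dt + sigma * x[i] * dW[i]
    return x, v

dt, omega0, gamma, sigma = 0.01, 1.0, 0.1, 0.2
x, v = simulate_em(dt=dt, omega0=omega0, gamma=gamma, sigma=sigma)

# Residual of the velocity increment after removing the (here, known) drift.
drift = -(omega0**2 * x[:-1] + gamma * v[:-1])
residual = np.diff(v) - drift * dt

# Rescale by the multiplicative factor; skip points where x is almost zero.
mask = np.abs(x[:-1]) > 1e-3
z = residual[mask] / (x[:-1][mask] * np.sqrt(dt))
sigma_hat = z.std()
```

A histogram of `z` plays the same role as the residual histogram in the figure: its shape reflects the distribution of the parameter fluctuation, and its width estimates `sigma`. In the actual pipeline the drift would come from the symbolic regression stage rather than being known.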
Token-Wise Diffusion Recovery
Token occurrence frequencies in discovered diffusion models further corroborate the structural identification capacity, as accurate drift/diffusion pairings are found in all realizations for all benchmarks, with only minor artifact “token activations” at extreme noise.
Figure 4: Diffusion basis function identification is consistent across replicates and noise levels; minor spurious tokens arise only under very high noise.
Experimental Application: Coupled Bacterial Oscillator Synchronization
Applicability to real-world, sparse, noisy data is validated on an experimental system of two E. coli bacteria coupled via microcavities (from [Japaridze et al., 2025]). This setting involves sparse (∼10² points), irregularly sampled phase trajectories.
Figure 2: Microscope image of coupled E. coli in microfluidic cavities exhibiting synchronization; phase time series are used for SDE identification.
The framework:
Theoretical and Practical Implications
Data Efficiency and Dimensionality:
This framework avoids the curse of dimensionality inherent in bin-based identification and Bayesian histogram methods for SDE inference. The MLE, denoising, and symbolic search steps all scale with the trajectory length and the number of candidate basis functions, not with state-space granularity. Robust, interpretable SDE modeling thus becomes practical for moderate-to-high dimensional dynamical systems and minimal datasets.
Interpretability and Physical Constraints:
By parameterizing diffusion via drift basis functions, the model enforces physical structure (structured diffusion), enhancing interpretability and facilitating mechanisms for model parsimony and hypothesis testing. Uncertainty estimates are associated with explicit, interpretable quantities—basis-function-wise variance, not black-box noise terms.
Modularity and Extensibility:
Alternative symbolic regression methods (e.g., transformer-based, GP-based) or GP models (kernel design, colored noise) can be substituted directly because the pipeline is modular. Regularization (length penalty, domain constraints), reward shaping, and ARD can all be domain- or application-informed.
Limitations:
A core structural assumption is that diffusion shares support with the drift. Systems with noise of extrinsic origin (not reflecting parameter fluctuations) may require relaxing this constraint at the cost of interpretability or data-efficiency. Extraction of stochastic residuals depends on the Euler-Maruyama discretization and the fidelity of derivative estimates for higher-order systems.
Outlook and Conclusion
This framework offers a data-efficient and interpretable approach for learning SDEs from noisy, real-world data. The benchmarks and the experimental application provide evidence that a modular pipeline fusing denoising, symbolic drift reconstruction, and GP-MLE diffusion estimation can deliver accurate structured uncertainty quantification from minimal data under significant noise. Immediate applications span experimental biophysics, nanomechanics, and high-noise signal processing scenarios.
Scalable extensions include nonparametric diffusion modeling, adaptation to colored or non-Wiener noise via temporal GP kernels, and direct application to more complex or high-dimensional dynamical systems. The robust identification of symbolically structured drift and diffusion components enables deeper theoretical and mechanistic insight, supports downstream control/modeling, and facilitates reduced-model design in both simulation and experiment.
References
The methods and context discussed in this essay are detailed in "A machine learning framework for uncovering stochastic nonlinear dynamics from noisy data" (2604.06081).