Ensemble-SINDy: Robust Nonlinear Model Discovery
- Ensemble-SINDy (E-SINDy) is an ensemble-based extension of SINDy that robustly identifies parsimonious nonlinear dynamics from noisy, limited data using bootstrap aggregation and consensus regularization.
- It mitigates overfitting and quantifies uncertainty by aggregating sparse regression models from bootstrapped datasets and library sub-sampling, improving noise tolerance.
- The method has been applied in domains ranging from atmospheric chemistry to industrial process modeling, and extensions such as conformal prediction add finite-sample valid prediction intervals.
Ensemble-SINDy (E-SINDy) is an ensemble-based extension of the Sparse Identification of Nonlinear Dynamics (SINDy) framework, designed to robustly discover parsimonious models of nonlinear dynamical systems from data under high noise, limited sampling, and model uncertainty. Leveraging bootstrap aggregation, library sub-sampling, and consensus-driven regularization, E-SINDy systematically mitigates overfitting, quantifies coefficient uncertainty, and enhances model interpretability and predictive reliability across a wide spectrum of applications ranging from physical and biological systems to industrial process modeling and data-driven closure for partial differential equations (Fasel et al., 2021, Babu et al., 2023, Kang et al., 13 Aug 2025, Guo et al., 2024, Fasel, 15 Jul 2025, Bekar et al., 2023, Yahagi et al., 7 Mar 2025).
1. Core Principles and Algorithmic Workflow
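The canonical SINDy method seeks a sparse, physically meaningful representation of the vector field $f$ in $\dot{x} = f(x)$ from time-series observations by expressing the dynamics in an overcomplete library of candidate functions $\Theta(X)$ and solving a penalized regression:

$$\hat{\Xi} = \arg\min_{\Xi} \left\| \dot{X} - \Theta(X)\,\Xi \right\|_2^2 + \lambda R(\Xi),$$

where $\dot{X}$ are the time derivatives of the state measurements $X$, $R$ is a sparsity-promoting penalty weighted by $\lambda$, and $\Xi$ contains the coefficients governing the influence of candidate terms (Fasel et al., 2021, Babu et al., 2023).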
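As a minimal sketch of this regression, assuming a two-state toy system, a hand-built quadratic library, and sequentially thresholded least squares as the sparsifying solver: the names `stlsq` and `library`, the threshold, and the data are illustrative choices, not taken from the cited works.

```python
import numpy as np

def stlsq(theta, dxdt, threshold=0.1, n_iter=10):
    """Sequentially thresholded least squares: fit, zero small terms, refit."""
    xi = np.linalg.lstsq(theta, dxdt, rcond=None)[0]
    for _ in range(n_iter):
        small = np.abs(xi) < threshold
        xi[small] = 0.0
        for k in range(dxdt.shape[1]):          # refit each state equation
            active = ~small[:, k]
            if active.any():
                xi[active, k] = np.linalg.lstsq(
                    theta[:, active], dxdt[:, k], rcond=None)[0]
    return xi

def library(X):
    """Candidate functions [1, x, y, x^2, xy, y^2], evaluated row-wise."""
    x, y = X[:, 0], X[:, 1]
    return np.column_stack([np.ones(len(X)), x, y, x * x, x * y, y * y])

# Toy data (assumption): x' = -0.5x + 2y, y' = -2x - 0.5y at randomly sampled states
rng = np.random.default_rng(0)
A = np.array([[-0.5, 2.0], [-2.0, -0.5]])
X = rng.uniform(-1.0, 1.0, size=(200, 2))
dX = X @ A.T + 0.01 * rng.standard_normal(X.shape)   # noisy derivatives

xi = stlsq(library(X), dX)
print(np.round(xi, 2))   # rows follow the library order, columns the states
```

Each column of `xi` holds the coefficients of one state equation. In this low-noise setting only the rows for `x` and `y` should remain nonzero.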
E-SINDy augments this framework via stochastic ensemble methods:
- Bootstrap sampling: Create resampled data sets by sampling snapshots (matching rows of the library matrix $\Theta(X)$ and the derivative matrix $\dot{X}$) with replacement ("data bagging"). Optionally, sample subsets of the library columns, i.e., the candidate features ("library bagging") (Fasel et al., 2021, Yahagi et al., 7 Mar 2025, Guo et al., 2024).
- Sparse regression per bootstrap: For each sample, solve the penalized regression using methods such as LASSO, STRidge, or STLSQ, with regularization and optional thresholding to enforce sparsity (Yahagi et al., 7 Mar 2025, Babu et al., 2023, Bekar et al., 2023).
- Aggregation: Aggregate the resulting coefficient sets ($\Xi^{(b)}$, one per bootstrap sample $b$) by consensus, for example by selecting terms with high inclusion probability (the fraction of ensemble members in which a term is nonzero), averaging coefficients, or taking medians. Low-consensus or unstable terms are pruned (Fasel et al., 2021, Bekar et al., 2023, Babu et al., 2023).
- Model selection: Stabilize the support by retaining only terms with high inclusion frequency or stability, and tune regularization parameters by cross-validation or stability criteria.
In outline, a typical E-SINDy procedure involves:
- Ensemble generation by repeated bootstrapping and library sampling.
- Sparse model identification for each ensemble member.
- Extraction of elite models based on predictive thresholds for long-horizon simulation accuracy (Yahagi et al., 7 Mar 2025).
- Clustering or further consensus selection (e.g., coefficient averaging or voting) to produce final models.
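The data-bagging and consensus steps above can be sketched as follows. This is a simplified illustration under stated assumptions: sequentially thresholded least squares as the per-member solver, median aggregation, and a hypothetical inclusion cutoff `inclusion_tol`. It omits library sampling, elite selection, and clustering, and all function names and the toy data are illustrative.

```python
import numpy as np

def stlsq(theta, dxdt, threshold=0.1, n_iter=10):
    """Sequentially thresholded least squares: fit, zero small terms, refit."""
    xi = np.linalg.lstsq(theta, dxdt, rcond=None)[0]
    for _ in range(n_iter):
        small = np.abs(xi) < threshold
        xi[small] = 0.0
        for k in range(dxdt.shape[1]):
            active = ~small[:, k]
            if active.any():
                xi[active, k] = np.linalg.lstsq(
                    theta[:, active], dxdt[:, k], rcond=None)[0]
    return xi

def esindy_data_bagging(theta, dxdt, n_models=50, threshold=0.1,
                        inclusion_tol=0.6, seed=0):
    """Bootstrap rows, fit one sparse model per sample, aggregate by consensus."""
    rng = np.random.default_rng(seed)
    n = theta.shape[0]
    members = []
    for _ in range(n_models):
        rows = rng.integers(0, n, size=n)        # snapshots drawn with replacement
        members.append(stlsq(theta[rows], dxdt[rows], threshold))
    coefs = np.stack(members)                    # shape (n_models, n_terms, n_states)
    inclusion = (coefs != 0).mean(axis=0)        # per-term inclusion probability
    xi = np.median(coefs, axis=0)                # median aggregation
    xi[inclusion < inclusion_tol] = 0.0          # prune low-consensus terms
    return xi, inclusion, coefs

# Toy data (assumption): x' = -0.5x + 2y, y' = -2x - 0.5y with noisy derivatives
rng = np.random.default_rng(0)
A = np.array([[-0.5, 2.0], [-2.0, -0.5]])
X = rng.uniform(-1.0, 1.0, size=(200, 2))
dX = X @ A.T + 0.05 * rng.standard_normal(X.shape)
x, y = X[:, 0], X[:, 1]
theta = np.column_stack([np.ones(len(X)), x, y, x * x, x * y, y * y])

xi, inclusion, coefs = esindy_data_bagging(theta, dX)
print(np.round(inclusion, 2))   # inclusion probability per library term and state
```

The array `coefs` keeps every ensemble member, so the same run can feed both the consensus model `xi` and any later uncertainty estimates.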
2. Noise Robustness, Uncertainty Quantification, and Model Stability
E-SINDy substantially improves robustness to measurement noise and sparse data. Bootstrap aggregation guards against overfitting to noise or spurious correlations: only features consistently supported across bootstraps attain high inclusion probabilities. Empirical studies show that E-SINDy increases the noise tolerance for PDE discovery by over a factor of two relative to standard SINDy (Fasel et al., 2021). In industrial diesel engine airpath modeling, even with synthetic noise added relative to the signal RMS, E-SINDy models achieved multi-step $R^2$ values above 0.95, whereas standard SINDy models diverged (Yahagi et al., 7 Mar 2025).
Uncertainty quantification exploits the ensemble: for each term, standard deviations, confidence intervals, and inclusion frequencies are computed, and the ensemble forecast distribution yields empirical prediction intervals for future trajectories (Fasel et al., 2021, Guo et al., 2024, Bekar et al., 2023, Fasel, 15 Jul 2025). In atmospheric chemistry surrogate modeling, calibration curves for the empirical confidence intervals approach the theoretical $1:1$ correspondence, indicating well-calibrated uncertainty estimates (Guo et al., 2024). Bootstrapped intervals also diagnose underfitting or model misspecification, as large ensemble spread correlates with elevated prediction error.
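A minimal sketch of how such ensemble statistics and forecast bands might be computed, assuming a one-dimensional toy system and a two-term library. Plain least squares stands in for the sparse regression step to keep the example short, and every name and number here is illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy system (assumption): x' = -1.5 x, with noisy derivative measurements
x = rng.uniform(0.5, 2.0, size=80)
dx = -1.5 * x + 0.1 * rng.standard_normal(x.size)
theta = np.column_stack([x, x**2])           # candidate library [x, x^2]

# Bootstrap ensemble; least squares stands in for the sparse solver
B = 200
coefs = np.empty((B, theta.shape[1]))
for b in range(B):
    rows = rng.integers(0, x.size, size=x.size)
    coefs[b] = np.linalg.lstsq(theta[rows], dx[rows], rcond=None)[0]

# Per-term summary statistics from the ensemble
coef_median = np.median(coefs, axis=0)
coef_std = coefs.std(axis=0, ddof=1)
coef_lo, coef_hi = np.percentile(coefs, [2.5, 97.5], axis=0)

def simulate(coefs, x0, dt, n_steps):
    """Integrate x' = a1*x + a2*x^2 with RK4 for all ensemble members at once."""
    a1, a2 = coefs[:, 0], coefs[:, 1]
    f = lambda s: a1 * s + a2 * s**2
    s = np.full(len(coefs), float(x0))
    traj = [s]
    for _ in range(n_steps):
        k1 = f(s)
        k2 = f(s + 0.5 * dt * k1)
        k3 = f(s + 0.5 * dt * k2)
        k4 = f(s + dt * k3)
        s = s + dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6.0
        traj.append(s)
    return np.array(traj)                     # shape (n_steps + 1, B)

# Empirical 95% forecast band from the ensemble of simulated trajectories
t = np.linspace(0.0, 2.0, 201)
traj = simulate(coefs, x0=1.0, dt=t[1] - t[0], n_steps=len(t) - 1)
band_lo, band_med, band_hi = np.percentile(traj, [2.5, 50.0, 97.5], axis=1)
```

Percentile intervals are used here because they make no Gaussian assumption about the coefficient distribution. Whether they are well calibrated has to be checked empirically for each application.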
Recent work has integrated conformal prediction into the E-SINDy workflow, providing finite-sample valid prediction intervals for time series, formal feature-importance (e.g., via leave-one-covariate-out statistics), and valid confidence intervals for model coefficients via jackknife and constrained resampling (Fasel, 15 Jul 2025). These intervals retain correct coverage rates even under non-Gaussian or heteroskedastic noise structures.
3. Methodological Extensions and Algorithmic Innovations
Substantial methodological diversity exists within the E-SINDy paradigm:
- Library Bagging: Random sub-sampling of library features per ensemble member mitigates deleterious effects of collinearity and over-parameterization in large candidate sets (Yahagi et al., 7 Mar 2025, Fasel et al., 2021, Guo et al., 2024).
- Elite Gathering and Clustering: Rather than simple mean aggregation, some frameworks select a set of "elite" ensemble members (e.g., those meeting an accuracy threshold in multi-step prediction), then cluster and average only within high-performing clusters for final model construction (Yahagi et al., 7 Mar 2025). This approach yields interpretable, reliable models in contexts where physically plausible basis functions are not guaranteed a priori.
- Parameter-Aware Consensus: Explicit inclusion of physical parameters and their inverses allows E-SINDy to recover unified, interpretable PDE models across parameter regimes. Dimensional similarity filters prune candidate terms to those consistent with the physical units of the target, vastly shrinking library size and preventing spurious identifications (Kang et al., 13 Aug 2025).
- Weak-Form and Nonlocal Formulations: For PDE discovery under high noise or on irregular domains, E-SINDy can operate on weak-form SINDy, or integrate nonlocal derivative estimation via the Peridynamic Differential Operator (PDDO), as demonstrated for moving boundary problems (Bekar et al., 2023). This enables simultaneous discovery of interior and boundary-evolution laws, with robust handling of geometrically complex or sparse data.
- Active Learning: Ensemble-based uncertainty metrics guide data collection (e.g., next-experiment selection) toward regions of maximal model uncertainty, accelerating convergence and increasing data efficiency (Fasel et al., 2021).
A range of solvers (LASSO, sequentially thresholded least squares/ridge [STLSQ/STRidge], ElasticNet, etc.) are compatible with E-SINDy, enabling tuning for statistical or physical prior structure (Fasel et al., 2021, Bekar et al., 2023).
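Library bagging, the first extension listed above, can be sketched as follows. The assumptions are a six-term toy library, sequentially thresholded least squares as the solver, and inclusion probabilities computed only over the members that were offered a term. The function names, subset size `n_keep`, and cutoff `inclusion_tol` are hypothetical, and in practice this step is often combined with data bagging.

```python
import numpy as np

def stlsq(theta, dxdt, threshold=0.2, n_iter=10):
    """Sequentially thresholded least squares: fit, zero small terms, refit."""
    xi = np.linalg.lstsq(theta, dxdt, rcond=None)[0]
    for _ in range(n_iter):
        small = np.abs(xi) < threshold
        xi[small] = 0.0
        for k in range(dxdt.shape[1]):
            active = ~small[:, k]
            if active.any():
                xi[active, k] = np.linalg.lstsq(
                    theta[:, active], dxdt[:, k], rcond=None)[0]
    return xi

def library_bagging(theta, dxdt, n_models=100, n_keep=5, threshold=0.2,
                    inclusion_tol=0.8, seed=0):
    """Fit each ensemble member on a random subset of the library columns."""
    rng = np.random.default_rng(seed)
    p, d = theta.shape[1], dxdt.shape[1]
    coefs = np.zeros((n_models, p, d))
    sampled = np.zeros((n_models, p), dtype=bool)
    for b in range(n_models):
        cols = rng.choice(p, size=n_keep, replace=False)
        sampled[b, cols] = True
        coefs[b, cols] = stlsq(theta[:, cols], dxdt, threshold)
    # Inclusion probability, conditional on the term having been offered
    offered = np.maximum(sampled.sum(axis=0), 1)[:, None]
    inclusion = (coefs != 0).sum(axis=0) / offered
    # Median over the members that were offered the term
    masked = np.where(sampled[:, :, None], coefs, np.nan)
    xi = np.nanmedian(masked, axis=0)
    xi[inclusion < inclusion_tol] = 0.0          # prune low-consensus terms
    return xi, inclusion

# Toy data (assumption): x' = -x + 2y, y' = -2x - y with noisy derivatives
rng = np.random.default_rng(0)
A = np.array([[-1.0, 2.0], [-2.0, -1.0]])
X = rng.uniform(-1.0, 1.0, size=(400, 2))
dX = X @ A.T + 0.01 * rng.standard_normal(X.shape)
x, y = X[:, 0], X[:, 1]
theta = np.column_stack([np.ones(len(X)), x, y, x * x, x * y, y * y])

xi, inclusion = library_bagging(theta, dX)
print(np.round(inclusion, 2))
```

When a true term is left out of a member's subset, the remaining terms may absorb part of its effect and pass the threshold spuriously. That is why this sketch uses a fairly strict inclusion cutoff rather than a simple average across members.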
4. Application Domains
E-SINDy has been successfully deployed in a variety of settings:
| Domain | Key Use Case | Reference |
|---|---|---|
| Power systems | Forced oscillation localization in wind farms | (Babu et al., 2023) |
| Atmospheric chemistry | Reduced-order box model for ozone/UQ | (Guo et al., 2024) |
| Turbulent closure modeling | Autonomous discovery of Smagorinsky-style models | (Kang et al., 13 Aug 2025) |
| Industrial process | Diesel engine airpath system (multi-step $R^2 > 0.95$) | (Yahagi et al., 7 Mar 2025) |
| Multiphysics PDEs | Moving boundary (Fisher-Stefan) law recovery | (Bekar et al., 2023) |
| Benchmark nonlinear dynamics | Lotka-Volterra and Lorenz systems | (Fasel et al., 2021, Fasel, 15 Jul 2025) |
In forced oscillation detection, E-SINDy isolated the source turbine and frequency by retaining only sinusoidal library terms with large ensemble coefficients at the correct parameter combinations, even in multi-source, multi-frequency test cases (Babu et al., 2023). In turbulent closure, E-SINDy autonomously rediscovered a Smagorinsky-type subgrid-scale closure without prior structure, outperforming classical models on the reported error metrics (Kang et al., 13 Aug 2025).
5. Uncertainty Quantification and Conformal Prediction
Uncertainty quantification is central to the E-SINDy methodology. By leveraging the distribution of coefficients and model predictions in the ensemble, statistical confidence can be attributed to candidate features and forecasts (Fasel et al., 2021, Guo et al., 2024). Integrated conformal prediction techniques provide finite-sample valid predictive intervals and robust, model-agnostic feature importance scores under minimal assumptions (Fasel, 15 Jul 2025).
Three principal conformal approaches in E-SINDy are:
- Ensemble batch prediction intervals (EnbPI): Out-of-bag residuals form the basis for coverage-calibrated time-series intervals.
- LOCO/LOCO-path feature importance: Systematic leave-one-covariate-out decomposition (and regularization path analysis) ranks or thresholds features with finite-sample false-inclusion control.
- Feature-conformal prediction: Jackknife and constrained-kernel approaches deliver valid confidence intervals on coefficients even in non-Gaussian or heteroskedastic settings.
Empirical results show that conformalized ensemble prediction intervals maintain their target coverage rates and adapt to both Gaussian and non-Gaussian noise, and that LOCO-derived feature importances reliably separate true dynamical features from artifacts (Fasel, 15 Jul 2025).
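The out-of-bag residual idea can be illustrated with a simplified sketch. It loosely follows the spirit of EnbPI but is not the EnbPI algorithm: it omits the sequential residual updates, uses independent toy samples rather than a time series, and lets plain least squares stand in for the sparse regression step. All names and values are illustrative assumptions.

```python
import math
import numpy as np

rng = np.random.default_rng(2)

def make_data(n):
    """Toy regression (assumption): derivative dx = -1.5 x plus Gaussian noise."""
    x = rng.uniform(0.5, 2.0, size=n)
    return np.column_stack([x, x**2]), -1.5 * x + 0.1 * rng.standard_normal(n)

theta, dx = make_data(400)
n, B, alpha = len(dx), 100, 0.1               # alpha = 0.1 targets 90% coverage

# Bootstrap ensemble; least squares stands in for the sparse regression step
coefs = np.empty((B, theta.shape[1]))
in_bag = np.zeros((B, n), dtype=bool)
for b in range(B):
    rows = rng.integers(0, n, size=n)
    in_bag[b, rows] = True
    coefs[b] = np.linalg.lstsq(theta[rows], dx[rows], rcond=None)[0]

# Out-of-bag prediction for point i: average only members that never saw i
preds = theta @ coefs.T                       # shape (n, B)
oob = ~in_bag.T                               # shape (n, B)
has_oob = oob.any(axis=1)
oob_mean = (preds * oob).sum(axis=1)[has_oob] / oob.sum(axis=1)[has_oob]
residuals = np.sort(np.abs(dx[has_oob] - oob_mean))

# Conformal quantile with the usual finite-sample correction
m = len(residuals)
k = min(math.ceil((m + 1) * (1 - alpha)), m)
q = residuals[k - 1]

# Interval for new points: ensemble-mean prediction plus or minus q
theta_new, dx_new = make_data(2000)
center = (theta_new @ coefs.T).mean(axis=1)
lower, upper = center - q, center + q
coverage = np.mean((dx_new >= lower) & (dx_new <= upper))
print(f"half-width {q:.3f}, empirical coverage {coverage:.3f}")
```

Because the residuals come from members that did not train on the corresponding point, no separate calibration set is needed. For real time-series data, temporal dependence weakens the coverage guarantee.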
6. Limitations, Open Challenges, and Future Directions
Despite notable advances, E-SINDy faces ongoing methodological challenges:
- Scalability: Full LOCO/LOCO-path and conservative feature-conformal UQ incur substantial computational overhead for large libraries, because they require retraining or cross-validation for each candidate term (Fasel, 15 Jul 2025, Kang et al., 13 Aug 2025).
- Coverage under temporal dependence: Marginal coverage for time-series intervals (e.g., EnbPI, CP-PID) is only asymptotic under temporal dependence; formal finite-sample results remain an open problem (Fasel, 15 Jul 2025).
- Parameter and coordinate selection: Discovery of highly-structured, physically motivated models from raw or latent variables depends on library expressivity, dimensionality reduction (e.g., PCA), and, in the case of PDEs, selection of appropriate inductive bias (e.g., local coordinates for boundary laws) (Bekar et al., 2023, Guo et al., 2024).
- Integration with Bayesian/post-selection inference: While Bayesian and ensemble methods are complementary, formal post-selection guarantees for sparse models aggregated via bootstrapping remain an area of methodological development.
- Automation and theory: Automated detection of geometric or physical boundaries, discovery of models with truly nonlocal terms, and principled selection of library terms in spatially or temporally inhomogeneous systems require further research (Bekar et al., 2023, Kang et al., 13 Aug 2025).
Ongoing and future work will address these open challenges, extend E-SINDy to high-dimensional systems, develop real-time adaptive control formulations, combine ensemble and Bayesian inference, and deliver automated, uncertainty-quantified data-driven modeling pipelines for complex scientific and engineering applications.
7. Summary and Outlook
Ensemble-SINDy enables robust, interpretable sparse model discovery from time-series data in regimes dominated by noise, limited data, and model complexity. By extending SINDy with ensemble methods, bootstrapping, library bagging, and consensus-driven aggregation, it facilitates model selection, quantifies uncertainty, and prevents overfitting, as validated across a spectrum of empirical domains. Integration with conformal prediction, parameter-aware regularization, and advanced derivative estimation broadens its applicability to stiff PDEs, closure modeling, and uncertainty-aware forecasting, positioning E-SINDy as a central tool for data-driven nonlinear model identification under real-world conditions (Fasel et al., 2021, Babu et al., 2023, Kang et al., 13 Aug 2025, Fasel, 15 Jul 2025, Guo et al., 2024, Yahagi et al., 7 Mar 2025, Bekar et al., 2023).