Data-Driven Parameterization

Updated 6 March 2026
  • Data-driven parameterization is a methodology that uses empirical data and machine learning to derive optimized schemes for representing unresolved processes in complex models.
  • It integrates techniques like neural networks, recurrent architectures, symbolic regression, and clustering to improve prediction accuracy and model efficiency across multiple domains.
  • By enforcing physical constraints and adopting scalable strategies, this approach improves model robustness and transferability and addresses challenges such as extrapolation and rare-event sampling.

Data-driven parameterization refers to the use of empirical data and machine learning techniques to construct parameterization schemes that describe unresolved or subgrid processes in large-scale models, or to define model structures/decision variables in mathematical optimization pipelines. These approaches systematically leverage observed or simulated data—rather than solely relying on first-principles or traditional analytic closures—to infer optimized schemes for flux representation, process coupling, closure relations, or design parameter groupings. Data-driven parameterization spans a wide range of scientific and engineering domains, including climate and ocean modeling, control systems, astrophysical process modeling, and structural optimization.

1. Principles and Scope of Data-Driven Parameterization

Data-driven parameterization encompasses a set of methodologies in which parameterization schemes—whether mappings between resolved model variables and subgrid tendencies or groupings of model/design parameters—are learned, refined, or validated directly from data. This domain departs from classic analytic or semi-empirical parameterizations by making the data itself central to (i) functional form discovery (symbolic regression), (ii) parameter value determination (statistical learning), and/or (iii) the identification of relevant parameter groupings (hierarchical reparameterization). The end goal is to improve prediction skill, robustness, or efficiency with respect to the modeled system or optimization objective (Wu et al., 6 Mar 2025, Falga et al., 3 Nov 2025, Perezhogin et al., 2023, Perezhogin et al., 13 May 2025, Grundner et al., 2023, Fabris et al., 2024).

In geosciences and engineering, data-driven parameterization is employed for closure of subgrid-scale turbulence and mixing, inference of momentum or heat fluxes, stochastic representations of unresolved processes, and—in structural optimization—hierarchical partitioning of design variables to enable effective surrogate modeling and optimization.

2. Machine Learning Architectures and Symbolic Techniques

Multiple architectures are utilized in data-driven parameterization:

  • Feed-forward Neural Networks (ANN/MLP): Common for regression problems (e.g., air-sea fluxes, subgrid eddy stress tensors) where inputs are features derived from model variables, and outputs are parameterized tendencies or fluxes. These are typically trained on simulation or observational data using losses corresponding to negative log-likelihood or mean squared error. Uncertainty quantification (UQ) can be performed by parallel variance-networks (Wu et al., 6 Mar 2025, Perezhogin et al., 13 May 2025).
  • Recurrent Neural Networks (RNN/GRU): Employed for learning temporal evolution of unresolved processes, especially in the context of embedded superparameterization (SP) (Chattopadhyay et al., 2020).
  • Symbolic Regression: Techniques such as genetic programming, combined with sequential feature selection, identify explicit analytic expressions that are physically interpretable and transferable, as demonstrated for cloud cover (Grundner et al., 2023) and star formation scaling (Salim et al., 7 May 2025).
  • Hierarchical Clustering & Integer Linear Programming (ILP): Used for adaptive reparameterization in engineering design, enabling automated splitting of parameter groups based on surrogate-based performance metrics (Fabris et al., 2024).
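
The mean-plus-variance pairing in the first bullet comes down to training against a Gaussian negative log-likelihood rather than plain MSE. The sketch below (plain NumPy, all names illustrative, not the cited papers' implementation) shows that loss and checks its defining property: it is minimized when the predicted mean and variance match the data.

```python
import numpy as np

def gaussian_nll(y, mu, log_var):
    """Per-sample Gaussian negative log-likelihood (constant term dropped).
    Training a mean network and a parallel variance network against this
    loss drives mu toward the conditional mean of the targets and
    exp(log_var) toward their conditional spread."""
    return 0.5 * (log_var + (y - mu) ** 2 / np.exp(log_var))

# Toy check: for fixed data, the average NLL is minimized when mu is the
# sample mean and exp(log_var) is the sample variance.
rng = np.random.default_rng(0)
y = rng.normal(loc=2.0, scale=0.5, size=10_000)

mu_hat, var_hat = y.mean(), y.var()
best = gaussian_nll(y, mu_hat, np.log(var_hat)).mean()
worse = gaussian_nll(y, mu_hat + 0.3, np.log(var_hat)).mean()
assert best < worse
```

Predicting log-variance rather than variance keeps the predicted spread positive without constrained optimization, which is the usual reason variance networks are parameterized this way.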

3. Methodological Strategies and Physical Constraints

Data-driven parameterization strategies often integrate domain-specific priors and constraints to ensure physical plausibility, stability, and generalizability:

  • Normalization and Dimensional Scaling: Feature and output transformation based on dimensional analysis or self-similarity, encoding grid-resolution or local flow characteristics, which enables generalization across model configurations (Perezhogin et al., 13 May 2025, Falga et al., 3 Nov 2025).
  • Enforcement of Symmetry and Conservation Laws: Momentum or energy conservation, tensor symmetry (angular momentum), Galilean invariance, and reflection/rotation invariance are hard-wired into NN architectures or enforced via data augmentation (Perezhogin et al., 13 May 2025, Perezhogin et al., 2023).
  • Physical Interpretation of Discovered Equations: In symbolic regression, terms with clear mechanistic roles (nonlinear dependencies, thresholding, rational forms for phase transitions) are prioritized for interpretability and diagnostics (Grundner et al., 2023, Salim et al., 7 May 2025).
  • Stochastic and Probabilistic Schemes: Non-deterministic processes (e.g., turbulence, cloud fraction, subgrid fluxes) are parameterized by learning input-conditioned output distributions, typically Gaussian or generalizable to multivariate/mixture models (Wu et al., 6 Mar 2025, Li et al., 2022).
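
The data-augmentation route to symmetry enforcement mentioned above can be made concrete with a minimal sketch. It assumes a 1-D profile-to-flux regression task and an illustrative reflection rule (spatial flip plus sign change on both fields); the correct transformation in any real scheme follows from the symmetry group of the governing equations, not from this toy convention.

```python
import numpy as np

def augment_with_reflections(x, y):
    """Augment (profile -> flux) training pairs with mirrored copies so a
    learned map respects reflection symmetry. The rule used here (spatial
    flip plus sign change on inputs and targets) is an illustrative
    convention only."""
    x_ref = -x[:, ::-1]   # reflected, sign-flipped inputs
    y_ref = -y[:, ::-1]   # correspondingly transformed targets
    return np.vstack([x, x_ref]), np.vstack([y, y_ref])

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 16))   # toy velocity profiles
y = rng.normal(size=(4, 16))   # toy subgrid-flux targets
xa, ya = augment_with_reflections(x, y)
```

Augmentation enforces the symmetry only approximately (the network sees symmetric data but is not architecturally constrained), which is why the cited works sometimes hard-wire invariances into the architecture instead.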

4. Evaluating Performance and Generalization

Empirical validation spans both offline (predictive skill on held-out data) and online (model-integrated simulation) regimes:

  • Regression Metrics: Normalized mean squared error, R², pattern correlation, RMSE, and the Hellinger distance between distributions are widely used to assess fit and bias (Wu et al., 6 Mar 2025, Perezhogin et al., 13 May 2025, Grundner et al., 2023).
  • Physical System Metrics: Energy budgets (kinetic, available potential), flux divergences, tracer or state evolution (e.g., temperature, mixed-layer depth), and critical event reproduction (jets, cloud formation).
  • Transfer and Adaptivity: Validating transferability across regimes, grid spacings, and observational datasets (e.g., DYAMOND → ERA5 for cloud cover (Grundner et al., 2023)), including few-shot fine-tuning or transfer learning to new physical regimes (Chattopadhyay et al., 2020, Salim et al., 7 May 2025).
  • Comparison to Analytical/Bulk Formulae: Data-driven parameterizations are benchmarked against classical closures or bulk algorithms, with consistent skill improvements documented in both mean prediction and uncertainty quantification (Wu et al., 6 Mar 2025, Perezhogin et al., 2023).
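
The offline regression metrics above are straightforward to compute; a minimal NumPy sketch (synthetic data, illustrative only) of R², RMSE, and the Hellinger distance between binned distributions:

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def hellinger(p, q):
    """Hellinger distance between two discrete (histogram) distributions;
    0 for identical distributions, 1 for disjoint support."""
    p = p / p.sum()
    q = q / q.sum()
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

rng = np.random.default_rng(1)
y = rng.normal(size=5000)
y_hat = y + rng.normal(scale=0.1, size=5000)   # a skilful toy surrogate

bins = np.linspace(-4, 4, 41)
h_true, _ = np.histogram(y, bins=bins)
h_pred, _ = np.histogram(y_hat, bins=bins)
```

The Hellinger distance complements pointwise metrics like R² by comparing full output distributions, which matters for the stochastic schemes of Section 3 where matching the spread is as important as matching the mean.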

5. Specialized Applications Across Scientific Domains

Geophysical Fluid Dynamics

  • Air-sea fluxes: Replacement of deterministic bulk formulas with input-conditioned ANNs providing mean and variance predictions, supporting stochastic boundary forcing in ocean models (Wu et al., 6 Mar 2025).
  • Boundary layer momentum transport: Unified parameterizations for oceanic and atmospheric turbulent fluxes leveraging self-similar normalized profiles and joint LES datasets (Falga et al., 3 Nov 2025).
  • Eddy stress and backscatter: ANN and analytic closures generalize across grid spacings by integrating non-dimensional scaling and enforcing conservation; physics-based CNNs and interpretable regressions are compared (Perezhogin et al., 13 May 2025, Perezhogin et al., 2023).
  • Cloud cover: Analytical expressions derived from symbolic regression consistently match or outperform neural networks and semi-empirical models, provide direct interpretability, and exhibit superior transferability (Grundner et al., 2023).
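
The final model-selection step of symbolic regression, as used for cloud cover above, can be caricatured in a few lines: fit each candidate analytic form and rank by a complexity-penalized error. The sketch below uses synthetic data, a two-entry candidate library, and a grid-search fit; it is a toy stand-in, not the published cloud-cover scheme or a genetic-programming search.

```python
import numpy as np

rng = np.random.default_rng(2)
rh = rng.uniform(0.0, 1.0, 2000)              # proxy "relative humidity"
cc = np.clip(2.0 * (rh - 0.5), 0.0, 1.0)      # synthetic "cloud cover": a ramp
cc += rng.normal(scale=0.02, size=rh.size)    # observation noise

# Candidate library: (analytic form, complexity score). A real symbolic
# regression search generates such candidates automatically.
candidates = {
    "linear": (lambda a, b, x: a * x + b, 2),
    "ramp":   (lambda a, b, x: np.clip(a * (x - b), 0.0, 1.0), 3),
}

a_grid = np.linspace(-1.0, 3.0, 41)
b_grid = np.linspace(-0.5, 1.0, 31)
lam = 1e-4   # complexity penalty weight (illustrative tuning knob)

scores = {}
for name, (f, k) in candidates.items():
    mse = min(np.mean((f(a, b, rh) - cc) ** 2)
              for a in a_grid for b in b_grid)
    scores[name] = mse + lam * k

best = min(scores, key=scores.get)
```

Even with its higher complexity score, the thresholded ramp wins because it captures the nonlinearity the linear form cannot; the penalty only breaks ties toward simpler, more interpretable expressions.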

Astrophysical and Structural Domains

  • Star formation scaling: Symbolic regression recovers interpretable low-scatter scaling laws directly from high-resolution simulations, outperforming classical analytic models and supporting physical interpretation (Salim et al., 7 May 2025).
  • Structural optimization: Automated, hierarchical reparameterization of ship hull design variables is performed using surrogate-based ILP clustering, enabling more efficient exploration in high-dimensional design spaces (Fabris et al., 2024).
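
The hierarchical grouping of design variables can be sketched with a greedy agglomerative merge over a similarity matrix. This is a simplified stand-in: the cited work optimizes the split globally via ILP against surrogate-based performance metrics, whereas the toy below just merges the most similar pair of groups until the target count is reached.

```python
import numpy as np

def group_parameters(similarity, n_groups):
    """Greedy single-linkage agglomeration of design variables: repeatedly
    merge the two groups whose most-similar member pair is largest, until
    n_groups remain. Illustrative only; not the ILP formulation of the
    cited work."""
    groups = [{i} for i in range(similarity.shape[0])]
    while len(groups) > n_groups:
        best_link, pair = -np.inf, None
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                link = max(similarity[a, b]
                           for a in groups[i] for b in groups[j])
                if link > best_link:
                    best_link, pair = link, (i, j)
        i, j = pair
        groups[i] |= groups.pop(j)
    return groups

# Toy similarity: variables {0,1} and {2,3} behave alike under a surrogate.
S = np.array([[1.0, 0.9, 0.1, 0.2],
              [0.9, 1.0, 0.2, 0.1],
              [0.1, 0.2, 1.0, 0.8],
              [0.2, 0.1, 0.8, 1.0]])
groups = group_parameters(S, 2)
```

In practice the similarity matrix would itself be data-driven, e.g. derived from surrogate-model sensitivities of the design objective to each variable.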

Control and System Identification

  • Output feedback LQR: Frameworks for the data-driven construction of substitute state spaces and full closed-loop system responses via trajectory libraries, circumventing explicit system identification and ensuring theoretical completeness (Xie et al., 28 Aug 2025, Xue et al., 2020).
  • Error-in-variables and robust estimation: Direct data-driven parameterization of the feasible system set via quadratic matrix inequalities (QMI) and semidefinite programming, with explicit SNR-based well-posedness diagnostics; applicable for robust observer/controller synthesis (Brändle et al., 2024, Brändle et al., 2024).
  • Derivative-free continuous-time stabilization: Embedding unknown plant dynamics into stable filter banks enables convex, derivative-free output-feedback stabilization from input–output data, ensuring noise-robustness (Possieri, 20 Jan 2026).
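
The trajectory-library idea in the first bullet rests on a behavioral fact: for a linear system driven by a sufficiently exciting input, every short trajectory lies in the column span of a Hankel matrix built from one long recorded trajectory. The sketch below checks this for a scalar toy plant; it illustrates the generic span property, not the specific constructions of the cited papers.

```python
import numpy as np

def hankel(w, L):
    """Depth-L block Hankel matrix of a signal w with shape (T, m):
    each column is one length-L window, stacked channel-wise."""
    T, m = w.shape
    cols = T - L + 1
    H = np.empty((L * m, cols))
    for i in range(cols):
        H[:, i] = w[i:i + L].reshape(-1)
    return H

def simulate(u, x0=0.0):
    """Scalar toy LTI plant: x+ = 0.5 x + u, y = x (illustrative)."""
    x, ys = x0, []
    for uk in u:
        ys.append(x)
        x = 0.5 * x + uk
    return np.array(ys)

rng = np.random.default_rng(3)
u_d = rng.normal(size=60)              # persistently exciting input data
y_d = simulate(u_d)
W = np.column_stack([u_d, y_d])        # recorded input-output trajectory

L = 5
H = hankel(W, L)

# A fresh length-L trajectory of the same plant, from a different initial
# state, is (numerically) a linear combination of the Hankel columns --
# no explicit system identification required.
u_new = rng.normal(size=L)
y_new = simulate(u_new, x0=1.0)
w_new = np.column_stack([u_new, y_new]).reshape(-1)

g, *_ = np.linalg.lstsq(H, w_new, rcond=None)
residual = np.linalg.norm(H @ g - w_new)
```

Because every admissible closed-loop trajectory can be parameterized this way by a coefficient vector g, controller synthesis can operate directly on the data matrix, which is the sense in which these frameworks circumvent explicit identification.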

6. Practical Considerations, Limitations, and Future Directions

Several recurring challenges and best practices mark the data-driven parameterization field:

  • Extrapolation Risks and Physical Plausibility: Neural architectures lacking sufficient inductive bias may fail in unsampled regimes or yield physically implausible outputs. Incorporating invariances, constraints, and dimensional scaling is shown to mitigate such risks (Perezhogin et al., 13 May 2025, Perezhogin et al., 2023, Grundner et al., 2023).
  • Sampling Rare Events and Imbalance: For phenomena dominated by extreme or infrequently sampled events (e.g., rare but important subgrid processes), resampling and importance weighting strategies must be carefully tuned to avoid overfitting or underrepresentation (Yang et al., 2024).
  • Scalability and Efficiency: Frameworks must remain tractable as data volume and model dimension grow; this is addressed via surrogate modeling, reduced-order modeling (POD + GPR), and, in robust control, formulations whose complexity is decoupled from the raw data volume (Fabris et al., 2024, Brändle et al., 2024).
  • Adaptivity and Transfer: Transfer learning, symbolic regression with few-shot updating, and hierarchical clustering enable data-driven parameterizations to remain robust as the system, physical regime, or available data evolves (Grundner et al., 2023, Chattopadhyay et al., 2020).
  • Theory-Practice Gap in Algorithmics: Data-driven parameter evaluation on real-world benchmark instances guides the selection of parameterizations in algorithmics, moving beyond worst-case theory and ensuring that practical performance aligns with parameter distributions actually observed in applications (Komusiewicz et al., 8 Sep 2025).
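
The importance-weighting strategy mentioned for rare events can be sketched as inverse-frequency weighting of the training loss. The binning, clipping threshold, and data below are illustrative tuning choices, not the scheme of the cited work; in practice these knobs must be validated against online behavior to avoid the overfitting the bullet warns about.

```python
import numpy as np

def inverse_frequency_weights(y, n_bins=20, clip=50.0):
    """Per-sample loss weights proportional to 1 / empirical bin frequency
    of the target, so rare extreme events are not drowned out by common
    ones. Weights are clipped so a handful of outliers cannot dominate
    training (clip is an illustrative tuning knob)."""
    counts, edges = np.histogram(y, bins=n_bins)
    idx = np.clip(np.digitize(y, edges[1:-1]), 0, n_bins - 1)
    w = len(y) / (n_bins * np.maximum(counts[idx], 1))
    return np.minimum(w, clip)

rng = np.random.default_rng(4)
y = rng.exponential(scale=1.0, size=10_000)   # heavy right tail = rare events
w = inverse_frequency_weights(y)

# Extreme-tail samples receive much larger weights than typical samples.
w_tail = w[y > np.quantile(y, 0.99)].mean()
w_mid = w[(y > np.quantile(y, 0.4)) & (y < np.quantile(y, 0.6))].mean()
```

These weights would multiply the per-sample loss (e.g. the Gaussian NLL or MSE) during training; an equivalent alternative is to resample the training set with probability proportional to w.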
