Data-Driven Parameterization

Updated 6 March 2026
  • Data-driven parameterization is a methodology that uses empirical data and machine learning to derive optimized schemes for representing unresolved processes in complex models.
  • It integrates techniques like neural networks, recurrent architectures, symbolic regression, and clustering to improve prediction accuracy and model efficiency across multiple domains.
  • By enforcing physical constraints and adopting scalable strategies, this approach improves model robustness and transferability and addresses challenges such as extrapolation and rare-event sampling.

Data-driven parameterization refers to the use of empirical data and machine learning techniques to construct parameterization schemes that describe unresolved or subgrid processes in large-scale models, or to define model structures/decision variables in mathematical optimization pipelines. These approaches systematically leverage observed or simulated data—rather than solely relying on first-principles or traditional analytic closures—to infer optimized schemes for flux representation, process coupling, closure relations, or design parameter groupings. Data-driven parameterization spans a wide range of scientific and engineering domains, including climate and ocean modeling, control systems, astrophysical process modeling, and structural optimization.

1. Principles and Scope of Data-Driven Parameterization

Data-driven parameterization encompasses a set of methodologies in which parameterization schemes—whether mappings between resolved model variables and subgrid tendencies or groupings of model/design parameters—are learned, refined, or validated directly from data. This domain departs from classic analytic or semi-empirical parameterizations by making the data itself central to (i) functional form discovery (symbolic regression), (ii) parameter value determination (statistical learning), and/or (iii) the identification of relevant parameter groupings (hierarchical reparameterization). The end goal is to improve prediction skill, robustness, or efficiency with respect to the modeled system or optimization objective (Wu et al., 6 Mar 2025, Falga et al., 3 Nov 2025, Perezhogin et al., 2023, Perezhogin et al., 13 May 2025, Grundner et al., 2023, Fabris et al., 2024).

In geosciences and engineering, data-driven parameterization is employed for closure of subgrid-scale turbulence and mixing, inference of momentum or heat fluxes, stochastic representations of unresolved processes, and—in structural optimization—hierarchical partitioning of design variables to enable effective surrogate modeling and optimization.

2. Machine Learning Architectures and Symbolic Techniques

Multiple architectures are utilized in data-driven parameterization:

  • Feed-forward Neural Networks (ANN/MLP): Common for regression problems (e.g., air-sea fluxes, subgrid eddy stress tensors) where inputs are features derived from model variables, and outputs are parameterized tendencies or fluxes. These are typically trained on simulation or observational data using losses corresponding to negative log-likelihood or mean squared error. Uncertainty quantification (UQ) can be performed by parallel variance-networks (Wu et al., 6 Mar 2025, Perezhogin et al., 13 May 2025).
  • Recurrent Neural Networks (RNN/GRU): Employed for learning temporal evolution of unresolved processes, especially in the context of embedded superparameterization (SP) (Chattopadhyay et al., 2020).
  • Symbolic Regression: Techniques such as genetic programming, combined with sequential feature selection, identify explicit analytic expressions that are physically interpretable and transferable, as demonstrated for cloud cover (Grundner et al., 2023) and star formation scaling (Salim et al., 7 May 2025).
  • Hierarchical Clustering & Integer Linear Programming (ILP): Used for adaptive reparameterization in engineering design, enabling automated splitting of parameter groups based on surrogate-based performance metrics (Fabris et al., 2024).
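
The mean-plus-variance pairing in the first bullet comes down to training against a Gaussian negative log-likelihood rather than plain MSE. The sketch below (plain NumPy, all names illustrative, not the cited papers' implementation) shows that loss and checks its defining property: it is minimized when the predicted mean and variance match the data.

```python
import numpy as np

def gaussian_nll(y, mu, log_var):
    """Per-sample Gaussian negative log-likelihood (constant term dropped).
    Training a mean network and a parallel variance network against this
    loss drives mu toward the conditional mean of the targets and
    exp(log_var) toward their conditional spread."""
    return 0.5 * (log_var + (y - mu) ** 2 / np.exp(log_var))

# Toy check: for fixed data, the average NLL is minimized when mu is the
# sample mean and exp(log_var) is the sample variance.
rng = np.random.default_rng(0)
y = rng.normal(loc=2.0, scale=0.5, size=10_000)

mu_hat, var_hat = y.mean(), y.var()
best = gaussian_nll(y, mu_hat, np.log(var_hat)).mean()
worse = gaussian_nll(y, mu_hat + 0.3, np.log(var_hat)).mean()
assert best < worse
```

Predicting log-variance rather than variance keeps the predicted spread positive without constrained optimization, which is the usual reason variance networks are parameterized this way.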

3. Methodological Strategies and Physical Constraints

Data-driven parameterization strategies often integrate domain-specific priors and constraints to ensure physical plausibility, stability, and generalizability:

  • Normalization and Dimensional Scaling: Feature and output transformation based on dimensional analysis or self-similarity, encoding grid-resolution or local flow characteristics, which enables generalization across model configurations (Perezhogin et al., 13 May 2025, Falga et al., 3 Nov 2025).
  • Enforcement of Symmetry and Conservation Laws: Momentum or energy conservation, tensor symmetry (angular momentum), Galilean invariance, and reflection/rotation invariance are hard-wired into NN architectures or enforced via data augmentation (Perezhogin et al., 13 May 2025, Perezhogin et al., 2023).
  • Physical Interpretation of Discovered Equations: In symbolic regression, terms with clear mechanistic roles (nonlinear dependencies, thresholding, rational forms for phase transitions) are prioritized for interpretability and diagnostics (Grundner et al., 2023, Salim et al., 7 May 2025).
  • Stochastic and Probabilistic Schemes: Non-deterministic processes (e.g., turbulence, cloud fraction, subgrid fluxes) are parameterized by learning input-conditioned output distributions, typically Gaussian or generalizable to multivariate/mixture models (Wu et al., 6 Mar 2025, Li et al., 2022).
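
The data-augmentation route to symmetry enforcement mentioned above can be made concrete with a minimal sketch. It assumes a 1-D profile-to-flux regression task and an illustrative reflection rule (spatial flip plus sign change on both fields); the correct transformation in any real scheme follows from the symmetry group of the governing equations, not from this toy convention.

```python
import numpy as np

def augment_with_reflections(x, y):
    """Augment (profile -> flux) training pairs with mirrored copies so a
    learned map respects reflection symmetry. The rule used here (spatial
    flip plus sign change on inputs and targets) is an illustrative
    convention only."""
    x_ref = -x[:, ::-1]   # reflected, sign-flipped inputs
    y_ref = -y[:, ::-1]   # correspondingly transformed targets
    return np.vstack([x, x_ref]), np.vstack([y, y_ref])

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 16))   # toy velocity profiles
y = rng.normal(size=(4, 16))   # toy subgrid-flux targets
xa, ya = augment_with_reflections(x, y)
```

Augmentation enforces the symmetry only approximately (the network sees symmetric data but is not architecturally constrained), which is why the cited works sometimes hard-wire invariances into the architecture instead.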

4. Evaluating Performance and Generalization

Empirical validation spans both offline (predictive skill on held-out data) and online (model-integrated simulation) regimes:

  • Regression Metrics: Normalized mean squared error, R², pattern correlation, RMSE, and the Hellinger distance between distributions are widely used to assess fit and bias (Wu et al., 6 Mar 2025, Perezhogin et al., 13 May 2025, Grundner et al., 2023).
  • Physical System Metrics: Energy budgets (kinetic, available potential), flux divergences, tracer or state evolution (e.g., temperature, mixed-layer depth), and critical event reproduction (jets, cloud formation).
  • Transfer and Adaptivity: Validating transferability across regimes, grid spacings, and observational datasets (e.g., DYAMOND → ERA5 for cloud cover (Grundner et al., 2023)), including few-shot fine-tuning or transfer learning to new physical regimes (Chattopadhyay et al., 2020, Salim et al., 7 May 2025).
  • Comparison to Analytical/Bulk Formulae: Data-driven parameterizations are benchmarked against classical closures or bulk algorithms, with consistent skill improvements documented in both mean prediction and uncertainty quantification (Wu et al., 6 Mar 2025, Perezhogin et al., 2023).
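
The offline regression metrics above are straightforward to compute; a minimal NumPy sketch (synthetic data, illustrative only) of R², RMSE, and the Hellinger distance between binned distributions:

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def hellinger(p, q):
    """Hellinger distance between two discrete (histogram) distributions;
    0 for identical distributions, 1 for disjoint support."""
    p = p / p.sum()
    q = q / q.sum()
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

rng = np.random.default_rng(1)
y = rng.normal(size=5000)
y_hat = y + rng.normal(scale=0.1, size=5000)   # a skilful toy surrogate

bins = np.linspace(-4, 4, 41)
h_true, _ = np.histogram(y, bins=bins)
h_pred, _ = np.histogram(y_hat, bins=bins)
```

The Hellinger distance complements pointwise metrics like R² by comparing full output distributions, which matters for the stochastic schemes of Section 3 where matching the spread is as important as matching the mean.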

5. Specialized Applications Across Scientific Domains

Geophysical Fluid Dynamics

  • Air-sea fluxes: Replacement of deterministic bulk formulas with input-conditioned ANNs providing mean and variance predictions, supporting stochastic boundary forcing in ocean models (Wu et al., 6 Mar 2025).
  • Boundary layer momentum transport: Unified parameterizations for oceanic and atmospheric turbulent fluxes leveraging self-similar normalized profiles and joint LES datasets (Falga et al., 3 Nov 2025).
  • Eddy stress and backscatter: ANN and analytic closures generalize across grid spacings by integrating non-dimensional scaling and enforcing conservation; physics-based CNNs and interpretable regressions are compared (Perezhogin et al., 13 May 2025, Perezhogin et al., 2023).
  • Cloud cover: Analytical expressions derived from symbolic regression consistently match or outperform neural networks and semi-empirical models, provide direct interpretability, and exhibit superior transferability (Grundner et al., 2023).
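
The final model-selection step of symbolic regression, as used for cloud cover above, can be caricatured in a few lines: fit each candidate analytic form and rank by a complexity-penalized error. The sketch below uses synthetic data, a two-entry candidate library, and a grid-search fit; it is a toy stand-in, not the published cloud-cover scheme or a genetic-programming search.

```python
import numpy as np

rng = np.random.default_rng(2)
rh = rng.uniform(0.0, 1.0, 2000)              # proxy "relative humidity"
cc = np.clip(2.0 * (rh - 0.5), 0.0, 1.0)      # synthetic "cloud cover": a ramp
cc += rng.normal(scale=0.02, size=rh.size)    # observation noise

# Candidate library: (analytic form, complexity score). A real symbolic
# regression search generates such candidates automatically.
candidates = {
    "linear": (lambda a, b, x: a * x + b, 2),
    "ramp":   (lambda a, b, x: np.clip(a * (x - b), 0.0, 1.0), 3),
}

a_grid = np.linspace(-1.0, 3.0, 41)
b_grid = np.linspace(-0.5, 1.0, 31)
lam = 1e-4   # complexity penalty weight (illustrative tuning knob)

scores = {}
for name, (f, k) in candidates.items():
    mse = min(np.mean((f(a, b, rh) - cc) ** 2)
              for a in a_grid for b in b_grid)
    scores[name] = mse + lam * k

best = min(scores, key=scores.get)
```

Even with its higher complexity score, the thresholded ramp wins because it captures the nonlinearity the linear form cannot; the penalty only breaks ties toward simpler, more interpretable expressions.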

Astrophysical and Structural Domains

  • Star formation scaling: Symbolic regression recovers interpretable low-scatter scaling laws directly from high-resolution simulations, outperforming classical analytic models and supporting physical interpretation (Salim et al., 7 May 2025).
  • Structural optimization: Automated, hierarchical reparameterization of ship hull design variables is performed using surrogate-based ILP clustering, enabling more efficient exploration in high-dimensional design spaces (Fabris et al., 2024).
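
The hierarchical grouping of design variables can be sketched with a greedy agglomerative merge over a similarity matrix. This is a simplified stand-in: the cited work optimizes the split globally via ILP against surrogate-based performance metrics, whereas the toy below just merges the most similar pair of groups until the target count is reached.

```python
import numpy as np

def group_parameters(similarity, n_groups):
    """Greedy single-linkage agglomeration of design variables: repeatedly
    merge the two groups whose most-similar member pair is largest, until
    n_groups remain. Illustrative only; not the ILP formulation of the
    cited work."""
    groups = [{i} for i in range(similarity.shape[0])]
    while len(groups) > n_groups:
        best_link, pair = -np.inf, None
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                link = max(similarity[a, b]
                           for a in groups[i] for b in groups[j])
                if link > best_link:
                    best_link, pair = link, (i, j)
        i, j = pair
        groups[i] |= groups.pop(j)
    return groups

# Toy similarity: variables {0,1} and {2,3} behave alike under a surrogate.
S = np.array([[1.0, 0.9, 0.1, 0.2],
              [0.9, 1.0, 0.2, 0.1],
              [0.1, 0.2, 1.0, 0.8],
              [0.2, 0.1, 0.8, 1.0]])
groups = group_parameters(S, 2)
```

In practice the similarity matrix would itself be data-driven, e.g. derived from surrogate-model sensitivities of the design objective to each variable.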

Control and System Identification

  • Output feedback LQR: Frameworks for the data-driven construction of substitute state spaces and full closed-loop system responses via trajectory libraries, circumventing explicit system identification and ensuring theoretical completeness (Xie et al., 28 Aug 2025, Xue et al., 2020).
  • Error-in-variables and robust estimation: Direct data-driven parameterization of the feasible system set via quadratic matrix inequalities (QMI) and semidefinite programming, with explicit SNR-based well-posedness diagnostics; applicable for robust observer/controller synthesis (Brändle et al., 2024, Brändle et al., 2024).
  • Derivative-free continuous-time stabilization: Embedding unknown plant dynamics into stable filter banks enables convex, derivative-free output-feedback stabilization from input–output data, ensuring noise-robustness (Possieri, 20 Jan 2026).
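
The trajectory-library idea in the first bullet rests on a behavioral fact: for a linear system driven by a sufficiently exciting input, every short trajectory lies in the column span of a Hankel matrix built from one long recorded trajectory. The sketch below checks this for a scalar toy plant; it illustrates the generic span property, not the specific constructions of the cited papers.

```python
import numpy as np

def hankel(w, L):
    """Depth-L block Hankel matrix of a signal w with shape (T, m):
    each column is one length-L window, stacked channel-wise."""
    T, m = w.shape
    cols = T - L + 1
    H = np.empty((L * m, cols))
    for i in range(cols):
        H[:, i] = w[i:i + L].reshape(-1)
    return H

def simulate(u, x0=0.0):
    """Scalar toy LTI plant: x+ = 0.5 x + u, y = x (illustrative)."""
    x, ys = x0, []
    for uk in u:
        ys.append(x)
        x = 0.5 * x + uk
    return np.array(ys)

rng = np.random.default_rng(3)
u_d = rng.normal(size=60)              # persistently exciting input data
y_d = simulate(u_d)
W = np.column_stack([u_d, y_d])        # recorded input-output trajectory

L = 5
H = hankel(W, L)

# A fresh length-L trajectory of the same plant, from a different initial
# state, is (numerically) a linear combination of the Hankel columns --
# no explicit system identification required.
u_new = rng.normal(size=L)
y_new = simulate(u_new, x0=1.0)
w_new = np.column_stack([u_new, y_new]).reshape(-1)

g, *_ = np.linalg.lstsq(H, w_new, rcond=None)
residual = np.linalg.norm(H @ g - w_new)
```

Because every admissible closed-loop trajectory can be parameterized this way by a coefficient vector g, controller synthesis can operate directly on the data matrix, which is the sense in which these frameworks circumvent explicit identification.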

6. Practical Considerations, Limitations, and Future Directions

Several recurring challenges and best practices mark the data-driven parameterization field:

  • Extrapolation Risks and Physical Plausibility: Neural architectures lacking sufficient inductive bias may fail in unsampled regimes or yield physically implausible outputs. Incorporating invariances, constraints, and dimensional scaling is shown to mitigate such risks (Perezhogin et al., 13 May 2025, Perezhogin et al., 2023, Grundner et al., 2023).
  • Sampling Rare Events and Imbalance: For phenomena dominated by extreme or infrequently sampled events (e.g., rare but important subgrid processes), resampling and importance weighting strategies must be carefully tuned to avoid overfitting or underrepresentation (Yang et al., 2024).
  • Scalability and Efficiency: Frameworks must remain tractable as data volume and model dimension grow; this is addressed via surrogate modeling, reduced-order modeling (POD + GPR), and, in robust control, formulations whose complexity is decoupled from the raw data volume (Fabris et al., 2024, Brändle et al., 2024).
  • Adaptivity and Transfer: Transfer learning, symbolic regression with few-shot updating, and hierarchical clustering enable data-driven parameterizations to remain robust as the system, physical regime, or available data evolves (Grundner et al., 2023, Chattopadhyay et al., 2020).
  • Theory-Practice Gap in Algorithmics: Data-driven parameter evaluation on real-world benchmark instances guides the selection of parameterizations in algorithmics, moving beyond worst-case theory and ensuring that practical performance aligns with parameter distributions actually observed in applications (Komusiewicz et al., 8 Sep 2025).
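
The importance-weighting strategy mentioned for rare events can be sketched as inverse-frequency weighting of the training loss. The binning, clipping threshold, and data below are illustrative tuning choices, not the scheme of the cited work; in practice these knobs must be validated against online behavior to avoid the overfitting the bullet warns about.

```python
import numpy as np

def inverse_frequency_weights(y, n_bins=20, clip=50.0):
    """Per-sample loss weights proportional to 1 / empirical bin frequency
    of the target, so rare extreme events are not drowned out by common
    ones. Weights are clipped so a handful of outliers cannot dominate
    training (clip is an illustrative tuning knob)."""
    counts, edges = np.histogram(y, bins=n_bins)
    idx = np.clip(np.digitize(y, edges[1:-1]), 0, n_bins - 1)
    w = len(y) / (n_bins * np.maximum(counts[idx], 1))
    return np.minimum(w, clip)

rng = np.random.default_rng(4)
y = rng.exponential(scale=1.0, size=10_000)   # heavy right tail = rare events
w = inverse_frequency_weights(y)

# Extreme-tail samples receive much larger weights than typical samples.
w_tail = w[y > np.quantile(y, 0.99)].mean()
w_mid = w[(y > np.quantile(y, 0.4)) & (y < np.quantile(y, 0.6))].mean()
```

These weights would multiply the per-sample loss (e.g. the Gaussian NLL or MSE) during training; an equivalent alternative is to resample the training set with probability proportional to w.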
