MSM Emulators: Surrogate Modeling for Complex Systems

Updated 1 October 2025

MSM emulators are computational surrogates that utilize MSM statistics and machine learning to simulate long-timescale dynamics in complex systems.
They employ diverse methods including MSM/RD coupling, projection-based model reduction, and ANOVA expansions to balance fidelity and efficiency.
Applications span molecular kinetics, nuclear physics, and time series analysis, enabling scalable simulation and robust parameter exploration.

A Markov State Model (MSM) emulator is a computational surrogate that leverages statistical or structural properties from MSMs to efficiently reproduce long-timescale dynamics, mechanistic features, or other functional outputs of complex systems. MSM emulators are prominent in fields ranging from molecular kinetics (where they reconstruct atomistic protein trajectories) to physics (as in nuclear simulation), machine learning (time-series and classification), multi-commodity flow optimization, and high-dimensional model approximation. Across applications, MSM emulators seek to balance fidelity, computational efficiency, support for parameter exploration, and adaptation to the intrinsic scale separation or interaction structure of the underlying model.

1. Fundamental Principles and Definitions

MSM emulators derive from the MSM formalism, which discretizes the state space into metastable regions and models transitions between them, typically with a transition probability matrix $T$ estimated from molecular dynamics (MD) or other time series data. The core construct is a Markov process in which the future state depends only on the current state:

$\rho(t + \tau) = T^\top(\tau) \rho(t)$

for a probability vector $\rho$ and lag time $\tau$, where $T$ is row-stochastic.
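As a minimal sketch of the propagation rule above, the following estimates a row-stochastic $T$ from a discrete trajectory by transition counting and then applies $\rho(t + \tau) = T^\top(\tau) \rho(t)$ repeatedly. The function names `estimate_transition_matrix` and `propagate` and the toy trajectory are illustrative assumptions, and the estimator is plain row normalization with no reversibility constraint.

```python
import numpy as np

def estimate_transition_matrix(dtraj, n_states, lag=1):
    """Row-normalized transition counts at the given lag (illustrative estimator)."""
    counts = np.zeros((n_states, n_states))
    for i, j in zip(dtraj[:-lag], dtraj[lag:]):
        counts[i, j] += 1.0
    row_sums = counts.sum(axis=1, keepdims=True)
    if np.any(row_sums == 0):
        raise ValueError("every state needs at least one outgoing transition")
    return counts / row_sums

def propagate(T, rho0, n_steps):
    """Apply rho(t + tau) = T^T rho(t) for n_steps lag intervals."""
    rho = np.asarray(rho0, dtype=float)
    for _ in range(n_steps):
        rho = T.T @ rho
    return rho

# Toy discrete trajectory over three states (made-up data).
dtraj = [0, 0, 1, 1, 1, 2, 2, 1, 0, 0, 1, 2, 2, 2, 1, 0]
T = estimate_transition_matrix(dtraj, n_states=3)
rho = propagate(T, [1.0, 0.0, 0.0], n_steps=50)
print(T)
print(rho)
```

Because each row of `T` sums to one, the propagated vector stays normalized at every step.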

In the context of multi-physics and high-dimensional simulation, “emulator” refers to a surrogate model, often constructed by regression, projection, or machine learning, that approximates the input–output behavior of the full dynamical system. MSM emulators implement this surrogate via direct use of MSM statistics, projection-based model order reduction (MOR), nonlinear approximators (e.g., neural networks, random forests), or ANOVA expansions. In some domains, “MSM” instead denotes a specific metric, Move-Split-Merge (Holznigenkemper et al., 2023), used for time series comparison and classification.

2. Construction and Coupling Methods

The construction of MSM emulators can take several forms:

MSM/RD Coupling: As in (Dibak et al., 2017), a hybrid scheme couples MSMs for the interacting region (where molecular conformational dynamics occur) to reaction–diffusion (RD) simulations for the non-interacting region. Entry and exit transfer probabilities ($p_\text{entry}$, $p_\text{exit}$) mediate switching between simulation domains, conserving macroscopic kinetics. This coupling supports computationally efficient simulation at multiple spatiotemporal scales.
Projection-Based Model Reduction: Nuclear emulators frequently employ Ritz and Galerkin projection methods (Melendez et al., 2022), spanning a low-dimensional “small space” using high-fidelity solutions (“snapshots”) and projecting the operator (e.g., Hamiltonian) onto this subspace. Eigenvector continuation is a special case, yielding reduced generalized eigenvalue problems:

$H_{\text{red}} c = E N c, \quad H_{\text{red}} = X^\dagger H X, \quad N = X^\dagger X$
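The reduced generalized eigenvalue problem above can be sketched as follows. This is a toy under stated assumptions: the Hamiltonian depends linearly on a single parameter, $H(\theta) = H_0 + \theta H_1$, with random symmetric matrices standing in for a real operator, and the helper names `hamiltonian`, `ground_state`, and `reduced_ground_energy` are hypothetical. Ground-state snapshots at a few training values of $\theta$ form the columns of $X$.

```python
import numpy as np

def hamiltonian(theta, H0, H1):
    """Assumed linear parameter dependence H(theta) = H0 + theta * H1."""
    return H0 + theta * H1

def ground_state(H):
    """Lowest eigenpair of a Hermitian matrix (the high-fidelity solve)."""
    vals, vecs = np.linalg.eigh(H)
    return vals[0], vecs[:, 0]

def reduced_ground_energy(H, X):
    """Lowest E solving H_red c = E N c with H_red = X^† H X and N = X^† X."""
    H_red = X.conj().T @ H @ X
    N = X.conj().T @ X
    # Symmetric orthogonalization: with N = U diag(s) U^†, W = U diag(s^-1/2)
    # gives W^† N W = I, which turns the problem into a standard eigenproblem.
    s, U = np.linalg.eigh(N)
    keep = s > 1e-12 * s.max()  # drop numerically dependent snapshot directions
    W = U[:, keep] / np.sqrt(s[keep])
    return np.linalg.eigvalsh(W.conj().T @ H_red @ W)[0]

rng = np.random.default_rng(0)
dim = 40
A = rng.normal(size=(dim, dim))
B = rng.normal(size=(dim, dim))
H0, H1 = (A + A.T) / 2, (B + B.T) / 2

# Snapshots: exact ground states at a handful of training parameters.
train_thetas = [0.0, 0.25, 0.5, 0.75, 1.0]
X = np.column_stack([ground_state(hamiltonian(t, H0, H1))[1] for t in train_thetas])

theta_test = 0.6
H_test = hamiltonian(theta_test, H0, H1)
E_exact, _ = ground_state(H_test)
E_emul = reduced_ground_energy(H_test, X)
print(E_exact, E_emul)
```

Since this is a Ritz projection, the emulated ground-state energy is an upper bound on the exact one, and it coincides with the exact value at any training parameter whose snapshot lies in the basis.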

Data-Informed Surrogate Modeling: Approaches using machine learning, such as random forests (RF) and neural networks (NN), fit complex system outputs (e.g., time-evolution of the distribution coefficient $K_d$) against parameterized data, with models trained on time series of simulated predictors (Lu et al., 2020). Clustering methods (e.g., k-means with dynamic time warping) stratify regimes, improving accuracy for highly nonlinear response surfaces.
Synthetic Trajectory Generation: Fine-grained MSMs can be built from atomistic MD data using feature extraction (e.g., tICA), custom clustering, and transition counting, backmapping the resulting discrete trajectories to atomistic coordinates (Russo et al., 2022). This produces “synMD” trajectories many orders of magnitude faster than direct MD.
Time Series Metric Emulation: In elastic time series analysis, the MSM metric (Move-Split-Merge) is computed via dynamic programming with move, split, and merge operations under cost constraints. Algorithmic improvements (pruning, lower/upper bounds, heuristics adapted from DTW) enable linear- or subquadratic-time surrogate computation, competitive with and often surpassing DTW (Holznigenkemper et al., 2023).
ANOVA-Based Surrogate Expansion: High-dimensional surrogate construction can utilize exact/finite ANOVA decompositions, incorporating derivatives when available (Lamboni, 15 Mar 2025). The expansion takes the form:

$f(x) = \mathbb{E}[f(X')] + \sum_{v \subseteq \{1, \dotsc, d\}, |v| > 0} \mathbb{E}_{X'}\left[ \mathcal{D}^{|v|} f(X') \prod_{k \in v} \frac{(F_k(X'_k) - 1_{[X'_k \ge x_k]})}{\rho_k(X'_k)} \right]$

Truncating at an appropriate interaction order $d_0$ yields efficient dimension-free emulators, with parametric $O(N^{-1})$ MSE rates when derivatives are usable.
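Returning to the time series metric item in the list above, the Move-Split-Merge dynamic program can be sketched in its plain quadratic form. This follows the standard recurrence for the metric; the pruning, bounds, and heuristics mentioned above are deliberately omitted, and the function names and the default cost `c=1.0` are illustrative choices rather than values from the cited work.

```python
def _split_merge_cost(new, a, b, c):
    """Cost of a split or merge that introduces or removes the value `new`.

    It is the constant c when `new` lies between its neighbours a and b,
    otherwise c plus the distance to the nearer neighbour.
    """
    if a <= new <= b or b <= new <= a:
        return c
    return c + min(abs(new - a), abs(new - b))

def msm_distance(x, y, c=1.0):
    """Move-Split-Merge distance between two sequences, O(len(x) * len(y))."""
    m, n = len(x), len(y)
    if m == 0 or n == 0:
        raise ValueError("both sequences must be non-empty")
    cost = [[0.0] * n for _ in range(m)]
    cost[0][0] = abs(x[0] - y[0])
    for i in range(1, m):
        cost[i][0] = cost[i - 1][0] + _split_merge_cost(x[i], x[i - 1], y[0], c)
    for j in range(1, n):
        cost[0][j] = cost[0][j - 1] + _split_merge_cost(y[j], x[0], y[j - 1], c)
    for i in range(1, m):
        for j in range(1, n):
            cost[i][j] = min(
                cost[i - 1][j - 1] + abs(x[i] - y[j]),                            # move
                cost[i - 1][j] + _split_merge_cost(x[i], x[i - 1], y[j], c),      # split or merge in x
                cost[i][j - 1] + _split_merge_cost(y[j], x[i], y[j - 1], c),      # split or merge in y
            )
    return cost[m - 1][n - 1]

a = [1.0, 2.0, 3.0, 4.0]
b = [1.0, 3.0, 3.0, 5.0, 4.0]
print(msm_distance(a, b, c=0.5))
```

The full cost table is kept here for readability; only the previous row is needed to fill the next one, so memory can be reduced to a single row.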

3. Performance, Validation, and Efficiency

Evaluating MSM emulator accuracy involves comparison against reference data (MD, high-fidelity simulations, or experiments):

In MSM/RD (Dibak et al., 2017), relative errors in mean first passage times (MFPTs) between metastable states are below 9%, and calculated reaction rates align with experimental constants.
Synthetic MSM emulators (Russo et al., 2022) generate Trp-cage folding trajectories at more than 200 ms of simulated time per day, compared with 100 μs/day for MD on specialized hardware.
Data-informed emulators (Lu et al., 2020) use relative $L_2$ errors to validate predictions; clustering reduces average errors by up to 50%.
ANOVA-based emulators (Lamboni, 15 Mar 2025) demonstrate parametric convergence independent of dimension for derivative-informed cases, with input selection via sensitivity indices further improving efficiency.
Time series MSM metric emulators (Holznigenkemper et al., 2023) efficiently compute elastic distances, outperforming DTW on many UCR datasets.
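To illustrate the MFPT comparison in the first item above, the sketch below computes mean first passage times from a transition matrix by solving $(I - T_{QQ})\, m_Q = \tau \mathbf{1}$ over the non-target states $Q$, then reports the relative error of an emulated matrix against a reference. Both 3-state matrices are made-up toy values, not data from Dibak et al., and `mfpt_to_target` and `relative_error` are hypothetical helper names.

```python
import numpy as np

def mfpt_to_target(T, target, lag_time=1.0):
    """Mean first passage time from every state to `target`, in units of lag_time."""
    n = T.shape[0]
    others = [i for i in range(n) if i != target]
    T_qq = T[np.ix_(others, others)]  # transitions among non-target states
    m_q = np.linalg.solve(np.eye(n - 1) - T_qq, lag_time * np.ones(n - 1))
    m = np.zeros(n)  # the MFPT from the target to itself is defined as 0
    m[others] = m_q
    return m

def relative_error(estimate, reference):
    return abs(estimate - reference) / abs(reference)

# Toy reference and "emulated" transition matrices (invented numbers).
T_ref = np.array([[0.9, 0.1, 0.0],
                  [0.1, 0.8, 0.1],
                  [0.0, 0.1, 0.9]])
T_emul = np.array([[0.9, 0.1, 0.0],
                   [0.1, 0.79, 0.11],
                   [0.0, 0.1, 0.9]])

m_ref = mfpt_to_target(T_ref, target=2)
m_emul = mfpt_to_target(T_emul, target=2)
err = relative_error(m_emul[0], m_ref[0])
print(m_ref, m_emul, err)
```

Multiplying by the lag time converts step counts into physical time, which is what a comparison against MD or experimental rates would need.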

4. Input Screening, Sensitivity, and Truncation

Input selection is pivotal, especially for high-dimensional models:

Sensitivity indices (main $S_j$ and total $S_{T_j}$) from the ANOVA decomposition (Lamboni, 15 Mar 2025) inform truncation order $d_0$ and variable inclusion/exclusion. For example, with $\sum_j (2S_j + S_{T_j}) \in (2, 3]$, truncation at $d_0 = 3$ is justified.
Random Forests provide feature importance scores during training (Lu et al., 2020), and clustering stratifies dynamic regimes.
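A minimal sketch of sensitivity-based screening follows. It estimates main and total indices with generic pick-freeze Monte Carlo estimators (a swapped-in, derivative-free technique, not the derivative-based estimators of the cited ANOVA work). The toy model, the uniform inputs on $[0, 1]$, and the 0.01 screening threshold are all assumptions chosen for illustration.

```python
import numpy as np

def sobol_indices(f, d, n_samples=50_000, seed=0):
    """Pick-freeze estimates of main (S) and total (ST) Sobol indices.

    Assumes independent inputs, each uniform on [0, 1].
    """
    rng = np.random.default_rng(seed)
    A = rng.random((n_samples, d))
    B = rng.random((n_samples, d))
    fA, fB = f(A), f(B)
    var = np.var(np.concatenate([fA, fB]))
    S, ST = np.empty(d), np.empty(d)
    for j in range(d):
        ABj = A.copy()
        ABj[:, j] = B[:, j]  # resample only input j
        fABj = f(ABj)
        S[j] = np.mean(fB * (fABj - fA)) / var          # main effect
        ST[j] = 0.5 * np.mean((fA - fABj) ** 2) / var   # total effect
    return S, ST

def toy_model(X):
    """Additive toy model; the third input is inert by construction."""
    return X[:, 0] + 2.0 * X[:, 1] + 0.0 * X[:, 2]

S, ST = sobol_indices(toy_model, d=3)
keep = [j for j in range(3) if ST[j] > 0.01]  # screening threshold is an assumption
print(S, ST, keep)
```

For this additive model the analytic values are $S_1 = 0.2$, $S_2 = 0.8$, and $S_3 = 0$, with total indices equal to the main ones, so the inert third input is screened out.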

5. Special Considerations and Open Challenges

Several unresolved or application-specific challenges are outlined:

MSM/RD coupling must carefully define metastable states and merge/split protocols to ensure conservation of kinetics (Dibak et al., 2017).
Projection-based reductions struggle with non-affine or nonlinear parameter dependencies, prompting use of hyperreduction strategies (Melendez et al., 2022).
In multi-commodity flow settings (Haeupler et al., 20 Jun 2024), representing flows implicitly rather than via explicit decomposition breaks the $O(mk)$ barrier (where $m$ is the number of edges and $k$ the number of commodities), enabling near-linear runtime algorithms.
Derivative-free ANOVA emulators require careful stochastic perturbation and bandwidth parameter selection for efficient global approximation (Lamboni, 15 Mar 2025).

6. Applications Across Domains

MSM emulators support diverse applications:

Domain	Emulator Methodology	Key Use Case
Molecular Kinetics	MSM/RD hybrid, synMD, clustering	Protein-ligand binding, trajectory generation
Nuclear Physics	Model reduction (Ritz/Galerkin/EC)	Bound-state and scattering problem emulation
Multi-Physics	ML surrogates (RF, NN), ANOVA	Nuclear waste THMC, groundwater contamination
Time Series Analysis	MSM metric, pruning/heuristics	Classification, similarity search
Optimization/Networks	Low-step flow emulators	Multi-commodity flow, routing

This suggests that MSM emulators, when constructed with domain-specific techniques, offer scalable and efficient surrogates across disparate scientific and engineering disciplines.

7. Future Directions and Outlook

The evolution of MSM emulator methodologies encompasses:

Development of improved hyperreduction algorithms, adaptive bases, and snapshot selection (Melendez et al., 2022), applicable to nonlinear and time-dependent high-dimensional problems.
Incorporation of advanced sampling (e.g., tensor products, generative models) for complex molecular interactions or higher-order reactions (Dibak et al., 2017).
Integration of sensitivity-driven input selection and interaction truncation, enabling dimension-free emulation even as model complexity increases (Lamboni, 15 Mar 2025).
Further refinement of fast time series metric algorithms, extending linear-time strategies to structured cases beyond constants (Holznigenkemper et al., 2023).
Practical deployment of parallel algorithms and implicit representations in large-scale network optimization (Haeupler et al., 20 Jun 2024).

A plausible implication is that MSM emulators, supported by projection, statistical, and machine learning frameworks, will continue to expand the scope and tractability of high-fidelity simulation and analysis across a broad range of complex systems.