On-Average Model Stability Overview

Updated 23 September 2025

On-average model stability is defined as the convergence of time- or sample-averaged outputs to an invariant structure despite underlying randomness.
It is crucial in high-dimensional statistics and machine learning for ensuring reproducible inference through techniques like cross-validation, ensemble learning, and regularization.
Stability frameworks guide practical strategies in model aggregation, dynamical control, and feature selection, balancing predictive performance with reliability.

On-average model stability refers to the property that a stochastic process, statistical estimator, machine learning model, or dynamical system exhibits predictable, regular long-term behavior after integrating or averaging over sources of randomness or perturbations. The concept is central in analyzing Markov processes, ergodic theory, high-dimensional statistics, model retraining, ensemble forecasting, control, and more. Formal definitions and criteria for on-average stability depend on the mathematical domain, but the central theme is that time-averaged, sample-averaged, or resampling-averaged output converges to a regular, interpretable, or invariant structure, even if individual trajectories or selections may be unstable.

1. Mathematical Foundations: Stochastic Processes and the e‐Property

A rigorous criterion for on-average model stability is formulated for Markov processes in terms of semigroup properties (Bessaih et al., 2010). Let $(P_t)_{t\geq0}$ denote the Markov semigroup acting on probability measures or functions. A process exhibits on-average stability if:

e‐Property (Equicontinuity): For every bounded Lipschitz function $f\in L_b(X)$ and each $x\in X$, the family $\{P_t f: t\geq0\}$ is equicontinuous at $x$: for any $\varepsilon > 0$, there exists $\delta > 0$ such that for all $y \in B(x,\delta)$ and $t \geq 0$,

$|P_t f(x) - P_t f(y)| < \varepsilon$

Average Boundedness: For any $\varepsilon > 0$ and any bounded set $A \subset X$, there exists a bounded Borel set $B \subset X$ such that for any probability measure $\mu$ supported in $A$,

$\limsup_{T \to \infty} \frac{1}{T} \int_0^T P_s \mu(B) \, ds > 1 - \varepsilon$

This ensures that, on average, most of the measure's mass remains in a bounded region.

Concentration at a Point: For any $\varepsilon > 0$ and bounded $A \subset X$, there exist $\alpha > 0$ and $z \in X$ such that for any two probability measures $\mu_1, \mu_2$ supported in $A$, there exists a time $t > 0$ with

$P_t \mu_i(B(z, \varepsilon)) \geq \alpha, \quad \text{for } i=1,2$

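The key result is that if these three properties hold, the process is asymptotically stable: for any two initial distributions $\mu_1, \mu_2$ and every bounded Lipschitz $f$, the time-marginals converge to each other (here $P_t^*\mu$ denotes the same action of the semigroup on measures that is written $P_t\mu$ above),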

$\lim_{t\to\infty} |\langle f, P_t^* \mu_1 \rangle - \langle f, P_t^* \mu_2 \rangle| = 0$

This demonstrates “on-average” (in the sense of trajectory distribution) loss of memory and convergence to a unique invariant measure. The framework directly applies to stochastic PDEs in turbulence modeling (e.g., Sabra and GOY shell models), where energy control (average boundedness) and dissipation (concentration) drive asymptotic stability.
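The loss-of-memory statement can be illustrated in a much simpler setting than the one treated in the source. The sketch below assumes a discrete-time, three-state Markov chain as a stand-in for the semigroup $(P_t)$; the transition matrix `P`, the test function `f`, and the helper names are all hypothetical choices made for illustration. It tracks $|\langle f, \mu_1 P^t \rangle - \langle f, \mu_2 P^t \rangle|$ for two different starting distributions.

```python
def step(mu, P):
    """Push a distribution one step forward: (mu P)_j = sum_i mu_i * P[i][j]."""
    n = len(P)
    return [sum(mu[i] * P[i][j] for i in range(n)) for j in range(n)]


def expectation(f, mu):
    """Pairing <f, mu> of a function with a distribution on a finite state space."""
    return sum(fi * mi for fi, mi in zip(f, mu))


def memory_gap(mu1, mu2, P, f, steps):
    """Return |<f, mu1 P^t> - <f, mu2 P^t>| for t = 0, 1, ..., steps."""
    gaps = []
    for _ in range(steps + 1):
        gaps.append(abs(expectation(f, mu1) - expectation(f, mu2)))
        mu1, mu2 = step(mu1, P), step(mu2, P)
    return gaps


# Hypothetical irreducible, aperiodic chain (each row sums to 1).
P = [
    [0.9, 0.1, 0.0],
    [0.2, 0.7, 0.1],
    [0.0, 0.3, 0.7],
]
f = [0.0, 1.0, 2.0]  # hypothetical bounded observable

# Start once in state 0 and once in state 2.
gaps = memory_gap([1.0, 0.0, 0.0], [0.0, 0.0, 1.0], P, f, steps=200)
print(gaps[0], gaps[50], gaps[200])
```

In this toy chain the gap starts at 2 and shrinks geometrically, which is the finite-state analogue of the two time-marginals forgetting their initial conditions. It does not exercise the e-property or the averaging conditions themselves, which matter only in infinite-dimensional or non-compact settings.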

2. Stability in High-Dimensional Statistics and Machine Learning

Stability is essential to reproducible inference in settings with substantial random variation or model perturbations (Yu, 2013). When estimators or model selections are perturbed—via data resampling (jackknife, bootstrap, cross-validation) or random seeds—stability manifests as insensitivity of results to such changes. Two central formalisms are:

Estimation Stability Measure (ES): For a regularization parameter $\tau$, $V$ cross-validation splits, and Lasso solutions $\hat\beta_v(\tau)$:

$ES(\tau) = \frac{\frac{1}{V} \sum_{v=1}^V \|X\hat\beta_v(\tau) - \hat{m}(\tau)\|^2}{\|\hat{m}(\tau)\|^2}, \quad \hat{m}(\tau) = \frac{1}{V} \sum_{v=1}^V X\hat\beta_v(\tau)$

Small $ES(\tau)$ indicates predictions that are robust to subsampling and perturbation. Model tuning with ES-CV can select solutions with up to 60% fewer predictors at a minimal loss in performance ($\sim$1.3%) for high-dimensional fMRI models.

Leave-One-Out (“On-Average”) Stability: The average sensitivity of a learning algorithm to exclusion of a data point. For empirical risk minimization (ERM),

$\Delta(S,W) = \frac{1}{n} \sum_{i=1}^{n} [\ell_i(w_i) - \ell_i(w)]$

where $w$ is the ERM solution on $S$, and $w_i$ is the ERM solution on $S$ with point $i$ removed. This “average stability” quantifies generalization error and is tight for exp-concave losses. Notably, it is invariant under data preconditioning (Gonen et al., 2016), which gives sharp, dimension-based risk bounds ($2\rho^2 d / (\alpha n)$) and shows that explicit regularization is not statistically required to counter ill-conditioning when preconditioning is incorporated.
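Both quantities are straightforward to compute once the perturbed fits are available. The sketch below is illustrative only and rests on two assumptions that are not in the source: the ES denominator is read as the squared norm of the mean prediction vector $\hat{m}(\tau)$, and the leave-one-out example uses the one-dimensional squared loss $\ell_i(w) = (w - z_i)^2$, whose ERM solution is simply the sample mean. The function names and toy inputs are hypothetical.

```python
def estimation_stability(predictions):
    """ES = [(1/V) * sum_v ||p_v - m||^2] / ||m||^2, with m the mean prediction vector.

    `predictions` holds one prediction vector X @ beta_v per data perturbation.
    """
    V = len(predictions)
    n = len(predictions[0])
    m = [sum(p[i] for p in predictions) / V for i in range(n)]
    spread = sum(sum((p[i] - m[i]) ** 2 for i in range(n)) for p in predictions) / V
    norm_sq = sum(mi ** 2 for mi in m)
    return spread / norm_sq


def erm_mean(sample):
    """ERM for the squared loss (w - z)^2 is the sample mean."""
    return sum(sample) / len(sample)


def average_loo_stability(sample):
    """Delta = (1/n) * sum_i [loss_i(w_i) - loss_i(w)], where w_i is fit without point i."""
    n = len(sample)
    w = erm_mean(sample)
    total = 0.0
    for i, z in enumerate(sample):
        w_i = erm_mean(sample[:i] + sample[i + 1:])
        total += (w_i - z) ** 2 - (w - z) ** 2
    return total / n


# Three hypothetical prediction vectors that disagree on the first observation.
es_value = estimation_stability([[1.0, 2.0], [3.0, 2.0], [2.0, 2.0]])

# Four hypothetical scalar observations.
delta = average_loo_stability([1.0, 2.0, 4.0, 7.0])
print(es_value, delta)
```

For the mean, removing point $i$ always moves the fit away from $z_i$, so every summand of `delta` is non-negative; larger values signal that single points carry more weight in the fit.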

3. Model Stability Under Data, Algorithmic, and Environmental Perturbations

Stability can be quantified with respect to a range of perturbations, connecting theory and applied robustness:

Random Seed/Initialization: The same architecture and data can yield significantly different predictions and counterfactual explanations depending on seed (initialization, shuffling, dropout) (Madhyastha et al., 2019). Stability is measured via standard deviation of accuracy and divergence (e.g., relative entropy, Jaccard distance) in explanations (e.g., attention weights, LIME). Aggressive Stochastic Weight Averaging (ASWA) and its norm-filtered variant (NASWA) average weights across optimization trajectories, reducing standard deviation in accuracy and explanation by up to 89%.
Continuous Data Updates: Where models are retrained as new data arrive, jitter quantifies (on average) the fraction of test predictions that flip between retrainings (Liu et al., 2022). Non-recurrent architectures (e.g., CNN, Transformer) or pre-trained fastText embeddings yield lower jitter; ensemble and incremental learning further reduce instability.
Retraining and Ensembles in Forecasting: In time series forecasting, stability encompasses both point and probabilistic output consistency over retraining cycles. Metrics such as Multi-Quantile Change (MQC) evaluate changes in predicted quantiles across retraining steps (Zanotti, 6 Jun 2025). Less frequent retraining and ensemble diversity enhance stability without performance loss.
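As a rough illustration of the jitter idea, the sketch below measures the fraction of test predictions that differ between two training runs and averages it over all pairs of runs. Averaging over pairs is an assumption made here for simplicity, and the function names and toy label lists are hypothetical; the cited work should be consulted for the exact definition.

```python
from itertools import combinations


def pairwise_disagreement(preds_a, preds_b):
    """Fraction of test examples on which two runs predict different labels."""
    return sum(a != b for a, b in zip(preds_a, preds_b)) / len(preds_a)


def jitter(runs):
    """Average pairwise disagreement over all pairs of runs (assumed aggregation)."""
    pairs = list(combinations(runs, 2))
    return sum(pairwise_disagreement(a, b) for a, b in pairs) / len(pairs)


# Hypothetical predicted labels on the same five test examples from three retrainings.
runs = [
    [0, 1, 1, 0, 1],
    [0, 1, 0, 0, 1],
    [0, 1, 1, 1, 1],
]
print(jitter(runs))
```

Note that all three runs here could have identical accuracy while still disagreeing on individual examples, which is why a flip-based measure adds information beyond the standard deviation of accuracy.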

4. Multi-Level and Dynamical Interpretations of On-Average Stability

Stability can be formalized and assessed at different levels:

Level	Description	Example Assessment Method
Mean Level	Consistency of mean prediction or risk across perturbations	Bootstrap mean, average loss
Distribution Level	Stability of the distribution or histogram of predictions	Kernel density plots, risk histograms
Subgroup Level	Consistency in specific subgroups or under stratification	Subgroup MAPE, calibration plots
Individual Level	Repeatability of individual predictions across perturbations	MAPE, prediction instability plots

Instability may arise from small samples, over-parameterization, lack of penalization, or model complexity (Riley et al., 2022). Techniques such as bootstrapped model re-estimation and explicit instability diagnostics (prediction instability plots, calibration instability plots, and the mean absolute prediction error, MAPE) allow fine-grained diagnosis.
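A minimal sketch of the bootstrap re-estimation idea at the individual level follows. It assumes a simple least-squares line as a stand-in for the prediction model (clinical risk models would typically be logistic or more complex), and it takes the individual-level MAPE to be the mean absolute difference between each bootstrap model's prediction and the original model's prediction for the same individual. The data, the function names, and the choice of 200 resamples are hypothetical.

```python
import random


def fit_line(xs, ys):
    """Closed-form ordinary least squares for y = a + b * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def individual_mape(xs, ys, n_boot=200, seed=0):
    """Per-individual mean absolute gap between bootstrap and original predictions."""
    rng = random.Random(seed)
    n = len(xs)
    a0, b0 = fit_line(xs, ys)
    original = [a0 + b0 * x for x in xs]
    abs_err = [0.0] * n
    used = 0
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        bx = [xs[i] for i in idx]
        if len(set(bx)) < 2:  # slope is undefined if all resampled x are equal
            continue
        a, b = fit_line(bx, [ys[i] for i in idx])
        for i, x in enumerate(xs):
            abs_err[i] += abs(a + b * x - original[i])
        used += 1
    return [e / used for e in abs_err]


# Hypothetical small sample: ten individuals with a roughly linear outcome.
xs = [float(i) for i in range(10)]
ys = [0.1, 0.9, 2.3, 2.8, 4.2, 4.9, 6.1, 7.2, 7.8, 9.1]
mape = individual_mape(xs, ys)
print(sum(mape) / len(mape))
```

Plotting each bootstrap prediction against the original prediction gives the corresponding prediction instability plot; averaging `mape` over individuals gives a single summary figure.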

In dynamical systems, especially hybrid or switched systems, on-average Lyapunov stability is obtained by averaging the effects of switching (e.g., via an average dwell-time constraint) rather than requiring per-interval constraints (Rossa et al., 6 May 2024). For such systems, multiple Lyapunov functions $W_i(x)$ satisfying intermode inequalities,

$W_j(x) \leq e^{\alpha \tau} W_i(x), \quad D^+_{f_i} W_i(x) \leq -(1+\alpha) W_i(x)$

ensure that on average, solutions decay, and the effect of chattering is bounded.
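A short worked bound suggests why the rate $-(1+\alpha)$ pairs naturally with the jump factor $e^{\alpha\tau}$. The following is a sketch under assumptions not stated in the source: $\tau$ is read as the average dwell time, $\sigma(t)$ denotes the active mode, and the number of switches on $[0,t]$ satisfies $N(0,t) \leq N_0 + t/\tau$ for some chatter bound $N_0$. Between switches, the flow inequality gives decay at rate $1+\alpha$; each switch multiplies the active Lyapunov function by at most $e^{\alpha\tau}$. Chaining these over $[0,t]$ yields

$W_{\sigma(t)}(x(t)) \leq e^{\alpha\tau N(0,t)} e^{-(1+\alpha)t} W_{\sigma(0)}(x(0)) \leq e^{\alpha\tau N_0} e^{-t} W_{\sigma(0)}(x(0))$

so the growth $e^{\alpha t}$ accumulated from switching is exactly absorbed by the extra $\alpha$ in the flow rate, leaving net decay $e^{-t}$ up to the constant $e^{\alpha\tau N_0}$ contributed by a bounded burst of fast switches.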

5. On-Average Stability in Ensemble Model Aggregation and Pruning

Model averaging and pruning techniques achieve stability by averaging out variance and reducing sensitivity to particular solutions:

L2-Penalty Model Averaging: Inspired by ridge regression, adding an L2 penalty to averaging weights in Mallows or jackknife model averaging prevents extreme weights and unstable predictions in the presence of correlated models, ensuring consistency and stability (Zhu et al., 2023).
Model Aggregation as Stability Mechanism: Averaging over model checkpoints (with Exponential Moving Average), or over different model architectures in ensemble forecasting, smooths prediction output, leading to more robust (less variable) results with minimal performance tradeoff (Cheng et al., 2023, Zanotti, 6 Jun 2025).
Pruning and Regularization: Model pruning eliminates weights associated with common features that are prone to overfitting, leading to more stable out-of-distribution detection (Cheng et al., 2023).
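The smoothing effect of checkpoint averaging can be seen in a few lines. The sketch below applies an exponential moving average to a sequence of weight vectors; the function name, the decay value, and the oscillating toy checkpoints are hypothetical, and real implementations typically update the average in place during training rather than over a stored list.

```python
def ema_checkpoints(checkpoints, decay=0.9):
    """Exponential moving average of weight vectors, seeded with the first checkpoint."""
    avg = list(checkpoints[0])
    for w in checkpoints[1:]:
        avg = [decay * a + (1.0 - decay) * x for a, x in zip(avg, w)]
    return avg


# Hypothetical checkpoints whose first weight oscillates between 1 and 3.
checkpoints = [[1.0, 0.0], [3.0, 0.0], [1.0, 0.0], [3.0, 0.0]]
averaged = ema_checkpoints(checkpoints, decay=0.5)
print(averaged)  # [2.25, 0.0]
```

The raw checkpoints swing by 2 between consecutive steps, while the averaged weight stays near the midpoint; a larger `decay` smooths more but reacts more slowly to genuine changes.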

6. Quantifying Stability Under Distributional Shift and Feature Selection

A unified mathematical criterion for stability in the face of distributional shift is posed as the minimal distributional perturbation (measured by optimal transport with moment constraints) required to degrade model risk to a threshold (Blanchet et al., 6 May 2024). Formally,

$\mathcal{L}(\beta, r) = \inf_{ \mathcal{Q} } \mathbb{M}_c(\mathcal{Q}, \hat{\mathcal{P}} ) \text{ s.t. } \mathbb{E}_\mathcal{Q}[W \ell(\beta, Z)] \geq r$

With suitable cost decompositions, this approach distinguishes between sensitivity to data corruptions and sub-population shifts. Strong duality and convex optimization formulations enable practical computation and differentiation of stability across models and features.

For feature selection stability in high-dimensional data with redundant features, adjusted stability measures (SMA) provide an “on-average” perspective by forgiving exchange of highly correlated/exchangeable features, avoiding artificial instability induced by identifier differences (Bommert et al., 2021).
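The idea of forgiving swaps between exchangeable features can be sketched with a simplified stand-in. The code below is not the adjusted measure from the cited work: it assumes features have already been assigned to correlation groups (the hypothetical mapping `group_of`) and compares a plain Jaccard similarity with one computed after replacing each feature by its group label.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections, treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def group_adjusted_jaccard(a, b, group_of):
    """Jaccard similarity after mapping each feature to its correlation group.

    Features missing from `group_of` are treated as their own group.
    """
    ga = {group_of.get(f, f) for f in a}
    gb = {group_of.get(f, f) for f in b}
    return jaccard(ga, gb)


# Hypothetical grouping: x1 and x2 are highly correlated and interchangeable.
group_of = {"x1": "g1", "x2": "g1", "x3": "g2"}
run_a = ["x1", "x3", "x5"]
run_b = ["x2", "x3", "x5"]

plain = jaccard(run_a, run_b)
adjusted = group_adjusted_jaccard(run_a, run_b, group_of)
print(plain, adjusted)  # 0.5 1.0
```

The two runs differ only in which member of the correlated pair was picked, so the unadjusted score reports substantial instability while the group-level score reports none.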

7. Implications, Applications, and Trade-offs

On-average model stability underpins reproducibility and interpretability across disciplines: turbulence modeling, cancer phenotypic evolution (Niu et al., 2015), fMRI decoding, clinical risk prediction, out-of-distribution detection, and ensemble and retrained forecasting. The trade-off between predictive power and stability is often concave: small sacrifices in nominal performance can yield large gains in interpretability, reliability, and practical trustworthiness (Bertsimas et al., 28 Mar 2024, Zanotti, 6 Jun 2025). Stability assessment is essential prior to validation and deployment, especially in high-stakes or continuously updating environments.

Systematic stability-aware practices—robust cross-validation, model and feature aggregation, use of regularization, and explicit instability quantification—are now recognized as critical for reliable, reproducible inference and decision-making in modern statistical and machine learning pipelines.