Variance-Based Penalties in Optimization

Updated 4 July 2026

Variance-based penalty is an approach that incorporates a dispersion term into primary objectives to balance mean performance and variability.
It is applied across reinforcement learning, distributionally robust optimization, empirical risk minimization, and portfolio selection to manage uncertainty and risk.
Despite its theoretical appeal, direct variance penalization can be unstable, prompting alternative strategies such as mean-deviation penalties and variance-informed scaling.

A variance-based penalty is an objective modification in which a primary criterion—typically expected return, empirical risk, or expected cost—is augmented by a term proportional to a variance or variance-derived dispersion functional. In the supplied literature, this pattern appears in mean–variance reinforcement learning, variance-penalized distributionally robust optimization, empirical Bernstein-type statistical learning, domain generalization via cross-domain risk dispersion, and portfolio selection with explicit covariance terms. Canonical instances include the mean–variance reinforcement learning objective

$\max_\pi \; \mathbb{E}[G_0] - \lambda\,\mathbb{V}[G_0],$

the variance-penalized DRO problem

$\min_{x\in \mathcal{X}} \sup_{Q\in\mathcal{U}} \left\{ E_Q[\rho_x] + \operatorname{Var}_Q[\phi_x] \right\},$

and sample variance penalization

$SVP_{\lambda}(\mathbf{X}) \;=\; \arg\min_{f\in\mathcal{F}} \; P_n(f,\mathbf{X}) \;+\; \lambda \sqrt{\frac{V_n(f,\mathbf{X})}{n}}.$

Across these settings, the variance term serves as a risk surrogate, a robustness regularizer, or a device for balancing approximation and estimation error (Luo et al., 15 Apr 2025, Birrell, 2020, 0907.3740).

1. Formal structure and representative objective classes

The common template is a mean–dispersion tradeoff. In reinforcement learning, the return random variable is

$G_0 \overset{\mathrm{def}}{=} \sum_{t=0}^{T-1} \gamma^t R_{t+1},$

and the variance penalty instantiates a generic mean–variability objective

$\max_\pi \; \mathbb{E}[G_0] - \lambda\, \mathbb{D}[G_0]$

by choosing $\mathbb{D}[G_0] = \mathbb{V}[G_0]$. In stochastic and robust optimization, the corresponding structure is an expected cost plus a variance penalty, for example

$H[P,x] = E_P[\rho_x] + \operatorname{Var}_P[\phi_x].$

In statistical learning, the role of the variance term is played by an empirical loss variance or sample variance, while in domain generalization it is the empirical standard deviation or variance of domainwise risks (Luo et al., 15 Apr 2025, Birrell, 2020, 0907.3740, Xie et al., 2020).

The variance itself is used in its standard form

$\mathbb{V}[X] \;=\; \mathbb{E}\big[(X-\mathbb{E}[X])^2\big] \;=\; \mathbb{E}[X^2] - (\mathbb{E}[X])^2,$

but the supplied literature also treats empirically estimated analogues and closely related cross-domain dispersion functionals. A notable distinction is whether the penalty is quadratic in deviations, as with variance, or linear in deviations, as with standard-deviation- or MAD-type objectives; that distinction is central to several of the instability and robustness results discussed later.
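The quadratic-versus-linear distinction can be made concrete with a small numerical check. The sketch below is illustrative only: the sample values and the helper names `mean_abs_deviation` and `scaling_ratios` are assumptions, not taken from the cited works. It rescales a set of returns by a constant and compares how the variance and the mean absolute deviation respond.

```python
from statistics import fmean, pvariance


def mean_abs_deviation(xs):
    """Mean absolute deviation around the mean (linear in deviations)."""
    m = fmean(xs)
    return fmean(abs(x - m) for x in xs)


def scaling_ratios(xs, c):
    """Return (variance ratio, MAD ratio) after multiplying every sample by c."""
    scaled = [c * x for x in xs]
    var_ratio = pvariance(scaled) / pvariance(xs)
    mad_ratio = mean_abs_deviation(scaled) / mean_abs_deviation(xs)
    return var_ratio, mad_ratio


# Hypothetical sampled returns, used only for illustration.
returns = [1.0, 2.0, 4.0, 7.0]
var_ratio, mad_ratio = scaling_ratios(returns, c=10.0)
print(var_ratio, mad_ratio)
```

Rescaling by $c$ multiplies the variance by $c^2$ but the mean absolute deviation only by $c$, so a fixed penalty weight changes its effective strength under reward rescaling in the variance case.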

Setting	Penalized objective	Penalized quantity
Risk-averse RL	$\mathbb{E}[G_0]-\lambda\,\mathbb{V}[G_0]$	Return variance
VP-DRO	$E_Q[\rho_x]+\operatorname{Var}_Q[\phi_x]$	Variance under ambiguity
SVP	$P_n(f,\mathbf{X})+\lambda\sqrt{V_n(f,\mathbf{X})/n}$	Sample variance of loss
RVP	Mean domain risk plus $\lambda$ times a dispersion term	Dispersion of domain risks
Extended portfolio model	Quadratic covariance form in the portfolio weights inside the objective	Covariance-based variance term

A plausible implication is that “variance-based penalty” is not a single algorithmic motif but a family of objective perturbations whose shared feature is explicit penalization of second-order dispersion. The concrete computational consequences, however, differ sharply by domain.

2. Reinforcement learning and Markovian reward optimization

In policy-gradient reinforcement learning, variance is treated as a canonical measure of variability. For a policy $\pi_\theta$, the gradient of the variance satisfies

$\nabla_\theta \mathbb{V}[G_0] = \nabla_\theta \mathbb{E}[G_0^2] - 2\,\mathbb{E}[G_0]\,\nabla_\theta \mathbb{E}[G_0],$

which yields, for $J(\theta) = \mathbb{E}[G_0] - \lambda\,\mathbb{V}[G_0]$,

$\nabla_\theta J(\theta) = \nabla_\theta \mathbb{E}[G_0] - \lambda\big(\nabla_\theta \mathbb{E}[G_0^2] - 2\,\mathbb{E}[G_0]\,\nabla_\theta \mathbb{E}[G_0]\big).$

The literature emphasizes three structural consequences of this formula: the appearance of $\mathbb{E}[G_0^2]$ and therefore squared returns, the “double sampling” requirement for unbiased estimation of $\mathbb{E}[G_0]\,\nabla_\theta \mathbb{E}[G_0]$, and the non–positive-homogeneity of variance under reward rescaling. In REINFORCE and PPO, variance is incorporated through the generic update

$\theta \leftarrow \theta + \alpha\,\big(\nabla_\theta \mathbb{E}[G_0] - \lambda\,\nabla_\theta \mathbb{D}[G_0]\big),$

but the empirical study reports that Variance and Semi_Variance exhibit unstable updates, high gradient variance, and difficulty in selecting stable hyperparameters; in several MuJoCo plots they are omitted because they fail to learn a reasonable policy (Luo et al., 15 Apr 2025).
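The variance-gradient identity $\nabla_\theta \mathbb{V}[G_0] = \nabla_\theta \mathbb{E}[G_0^2] - 2\,\mathbb{E}[G_0]\,\nabla_\theta \mathbb{E}[G_0]$, which follows from $\mathbb{V}[X] = \mathbb{E}[X^2] - (\mathbb{E}[X])^2$, can be checked on a toy problem. The following is a minimal sketch, assuming a one-step bandit with a softmax policy and deterministic per-arm rewards so that all expectations can be enumerated exactly; the setup and all function names are hypothetical and are not the experimental code of the cited study. In a sampled setting the product $\mathbb{E}[G_0]\,\nabla_\theta \mathbb{E}[G_0]$ would need two independent batches for an unbiased estimate, which exact enumeration sidesteps here.

```python
import math


def softmax(theta):
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]


def expectation(theta, values):
    return sum(p * v for p, v in zip(softmax(theta), values))


def grad_expectation(theta, values):
    """Exact gradient of E_pi[values] for a softmax policy: pi_k * (values_k - E[values])."""
    pi = softmax(theta)
    mean = sum(p * v for p, v in zip(pi, values))
    return [p * (v - mean) for p, v in zip(pi, values)]


def variance(theta, rewards):
    squared = [r * r for r in rewards]
    return expectation(theta, squared) - expectation(theta, rewards) ** 2


def grad_variance(theta, rewards):
    """grad V[G] = grad E[G^2] - 2 E[G] grad E[G]."""
    squared = [r * r for r in rewards]
    g2 = grad_expectation(theta, squared)
    g1 = grad_expectation(theta, rewards)
    mean = expectation(theta, rewards)
    return [a - 2.0 * mean * b for a, b in zip(g2, g1)]


def grad_objective(theta, rewards, lam):
    """Gradient of J = E[G] - lam * V[G]."""
    g1 = grad_expectation(theta, rewards)
    gv = grad_variance(theta, rewards)
    return [a - lam * b for a, b in zip(g1, gv)]


def finite_difference(f, theta, eps=1e-6):
    """Central finite differences, used only to sanity-check the analytic gradient."""
    grads = []
    for k in range(len(theta)):
        up = list(theta)
        dn = list(theta)
        up[k] += eps
        dn[k] -= eps
        grads.append((f(up) - f(dn)) / (2 * eps))
    return grads


theta = [0.2, -0.4]      # hypothetical policy parameters
rewards = [1.0, 3.0]     # hypothetical per-arm returns
lam = 0.5
analytic = grad_objective(theta, rewards, lam)
numeric = finite_difference(
    lambda th: expectation(th, rewards) - lam * variance(th, rewards), theta
)
print(analytic, numeric)
```

The squared rewards enter through `grad_expectation(theta, squared)`, which is where large returns get amplified in practice.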

A distinct actor–critic line of work uses a direct temporal-difference estimator of return variance rather than a second-moment surrogate. With the one-step TD error

$\delta_t = R_{t+1} + \gamma\,Q(S_{t+1},A_{t+1}) - Q(S_t,A_t),$

the variance function satisfies

$\sigma(s,a) = \mathbb{E}\big[\delta_t^2 + \gamma^2\,\sigma(S_{t+1},A_{t+1}) \,\big|\, S_t = s,\ A_t = a\big].$

This leads to a direct variance TD target

$\delta_t^2 + \gamma^2\,\sigma(S_{t+1},A_{t+1}),$

and to actor updates of the form

$\theta \leftarrow \theta + \alpha\,\nabla_\theta \log \pi_\theta(A_t \mid S_t)\,\big(Q(S_t,A_t) - \lambda\,\sigma(S_t,A_t)\big).$

The reported result is convergence to locally optimal policies for finite state–action MDPs, both on-policy and off-policy, while reducing variance of returns and maintaining competitive mean return (Jain et al., 2021).
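The direct variance recursion rests on the law of total variance: with an exact value function, the variance of the return from a state equals the expected squared TD error plus $\gamma^2$ times the variance at the successor. The following is a minimal sketch of that recursion, assuming a state-value (rather than state-action) formulation and a small hypothetical acyclic Markov reward process evaluated by exact enumeration; it is not the cited actor-critic algorithm.

```python
GAMMA = 0.9
TERMINAL = None

# Hypothetical two-state Markov reward process:
# state -> list of (probability, reward, next_state)
TRANSITIONS = {
    0: [(0.5, 1.0, 1), (0.5, -1.0, 1)],
    1: [(0.5, 0.0, TERMINAL), (0.5, 2.0, TERMINAL)],
}
ORDER = [1, 0]  # successors first; valid because the chain is acyclic


def evaluate(transitions, order, gamma):
    """Return exact value and return-variance tables via backward recursion."""
    value = {TERMINAL: 0.0}
    sigma = {TERMINAL: 0.0}
    for s in order:
        # Bellman equation for the mean return.
        value[s] = sum(p * (r + gamma * value[nxt]) for p, r, nxt in transitions[s])
        # Variance recursion: E[delta^2 + gamma^2 * sigma(next)].
        sigma[s] = sum(
            p * ((r + gamma * value[nxt] - value[s]) ** 2 + gamma ** 2 * sigma[nxt])
            for p, r, nxt in transitions[s]
        )
    return value, sigma


value, sigma = evaluate(TRANSITIONS, ORDER, GAMMA)
print(value[0], sigma[0])
```

Enumerating the four trajectories from state 0 by hand gives the same mean and variance, which is a useful check that the recursion targets the variance of the return itself and not a second-moment surrogate.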

For total-reward MDPs, the variance-penalized expectation (VPE)

$\mathbb{E}[X] - \lambda\,\mathbb{V}[X],$

where $X$ denotes the accumulated reward, induces a structural pathology: optimal schedulers can be eventually reward-minimizing, meaning that once enough reward has accumulated they minimize future expected rewards. The supplied analysis treats this as conceptually undesirable for risk aversion, because it suppresses additional gains on already favorable trajectories. Semi-variance does not remove the problem: for any penalty weight $\lambda > 0$, there exist MDPs in which every SVPE-optimal scheduler is eventually reward-minimizing (Baier et al., 2024).

3. Distributionally robust and convex-analytic formulations

Variance penalties are especially significant in DRO because $\operatorname{Var}_Q[\phi_x]$ is nonconvex in the distribution $Q$. Nevertheless, for the ambiguity sets $\mathcal{U}$ treated in that analysis, an exact finite-dimensional convex reformulation of the inner supremum is available. The reformulated objective is convex in its finite-dimensional variables, and the equality is tight rather than merely upper-bounding. This turns an infinite-dimensional robust maximization over measures into a finite-dimensional convex program and simultaneously yields tight uncertainty-quantification bounds for variance under model misspecification (Birrell, 2020).

A related but distinct construction appears in convex stochastic optimization. There, a $\chi^2$-divergence DRO neighborhood around the empirical distribution $\hat P_n$ yields the robust empirical risk

$R_n(\theta) = \sup_{P:\, D_{\chi^2}(P\,\|\,\hat P_n) \le \rho/n} \mathbb{E}_P[\ell(\theta; X)].$

For bounded loss, this objective admits a variance expansion of the form

$R_n(\theta) = \mathbb{E}_{\hat P_n}[\ell(\theta; X)] + \sqrt{\frac{2\rho}{n}\operatorname{Var}_{\hat P_n}\big(\ell(\theta; X)\big)} + \varepsilon_n(\theta),$

with $\varepsilon_n(\theta) \le 0$ and $|\varepsilon_n(\theta)| = O(1/n)$, and it is convex whenever $\theta \mapsto \ell(\theta; X)$ is convex. This supplies a convex surrogate for direct empirical variance regularization, which is generally nonconvex even when the loss itself is convex (Duchi et al., 2016).
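The link between a divergence ball and a variance term can be verified numerically in the simplest regime. The sketch below assumes the divergence $D(P\,\|\,\hat P_n) = \frac{1}{2n}\sum_i (n p_i - 1)^2$ with radius $\rho/n$ and the $1/n$ convention for the empirical variance; these conventions, the loss values, and the function names are assumptions of the sketch. Under them, and as long as the worst-case weights stay nonnegative, the supremum has a closed form equal to the empirical mean plus $\sqrt{2\rho\,\mathrm{Var}/n}$.

```python
import math


def chi2_robust_risk(losses, rho):
    """Worst-case mean over {p : (1/(2n)) * sum (n p_i - 1)^2 <= rho / n, sum p_i = 1}.

    Uses the closed-form maximizer, which is valid only while every weight
    stays nonnegative; raises ValueError otherwise.
    """
    n = len(losses)
    mean = sum(losses) / n
    centered = [z - mean for z in losses]
    var = sum(c * c for c in centered) / n  # empirical variance, 1/n convention
    if var == 0.0:
        return mean, [1.0 / n] * n
    # Tilt the uniform weights along the centered losses until the ball is tight.
    scale = math.sqrt(2.0 * rho) / (n * math.sqrt(n * var))
    weights = [1.0 / n + scale * c for c in centered]
    if min(weights) < 0.0:
        raise ValueError("radius too large for the closed form")
    return sum(w * z for w, z in zip(weights, losses)), weights


def variance_expansion(losses, rho):
    """Empirical mean plus sqrt(2 * rho * Var / n)."""
    n = len(losses)
    mean = sum(losses) / n
    var = sum((z - mean) ** 2 for z in losses) / n
    return mean + math.sqrt(2.0 * rho * var / n)


losses = [0.1, 0.4, 0.35, 0.8, 0.6]  # hypothetical bounded losses
rho = 0.5
robust, weights = chi2_robust_risk(losses, rho)
print(robust, variance_expansion(losses, rho))
```

When the radius is large enough that some weights would become negative, the closed form no longer applies and the robust risk falls below the mean-plus-standard-deviation expression, which is consistent with a nonpositive remainder term.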

These two lines are mathematically aligned in one specific sense: both replace a direct optimization over a nonconvex variance-penalized criterion by a tractable convex object whose dependence on variance is preserved exactly or asymptotically. This suggests that robust duality is one of the principal routes by which variance penalties become computationally usable in high-dimensional settings.

4. Statistical learning, empirical Bernstein control, and domain-level risk dispersion

In classical learning theory, variance-based penalties arise from variance-sensitive concentration bounds. Given a sample $\mathbf{X} = (X_1,\dots,X_n)$, the sample variance is defined symmetrically as

$V_n(\mathbf{X}) = \frac{1}{n(n-1)} \sum_{1 \le i < j \le n} (X_i - X_j)^2,$

and empirical Bernstein inequalities replace the unknown variance in Bennett-type bounds by $V_n(\mathbf{X})$. This leads directly to sample variance penalization,

$SVP_{\lambda}(\mathbf{X}) \;=\; \arg\min_{f\in\mathcal{F}} \; P_n(f,\mathbf{X}) \;+\; \lambda \sqrt{\frac{V_n(f,\mathbf{X})}{n}}.$

The corresponding excess-risk guarantee is variance-sensitive: for suitable $\lambda$, the excess risk of $SVP_\lambda$ is bounded in terms of $V(f^*)$, the variance of an optimal hypothesis. When $V(f^*) = 0$, the bound becomes essentially $O(1/n)$, and the paper gives a finite-class example in which SVP achieves $O(1/n)$ while ERM remains at $O(1/\sqrt{n})$ (0907.3740).
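The selection rule is easy to illustrate on a finite hypothesis class. The following is a minimal sketch, assuming per-sample losses are already tabulated for each hypothesis; the two hypotheses, their loss values, and the helper names are invented for illustration. It uses the pairwise form of the sample variance, $\frac{1}{n(n-1)}\sum_{i<j}(x_i - x_j)^2$, which coincides with the usual unbiased sample variance.

```python
import math
from itertools import combinations


def sample_variance(xs):
    """Symmetric form: (1 / (n (n - 1))) * sum over pairs i < j of (x_i - x_j)^2."""
    n = len(xs)
    return sum((a - b) ** 2 for a, b in combinations(xs, 2)) / (n * (n - 1))


def svp_score(losses, lam):
    """Empirical risk plus lam * sqrt(sample variance / n)."""
    n = len(losses)
    return sum(losses) / n + lam * math.sqrt(sample_variance(losses) / n)


def select(loss_table, lam):
    """Return the hypothesis name minimizing the penalized score; lam = 0 gives ERM."""
    return min(loss_table, key=lambda name: svp_score(loss_table[name], lam))


# Hypothetical per-sample losses for two hypotheses.
loss_table = {
    "steady": [0.50] * 8,                                  # zero variance
    "erratic": [0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.9],   # slightly lower mean
}
erm_choice = select(loss_table, lam=0.0)
svp_choice = select(loss_table, lam=1.0)
print(erm_choice, svp_choice)
```

ERM prefers the hypothesis with the marginally lower empirical mean, while the penalized rule prefers the zero-variance hypothesis whose empirical risk is a more reliable estimate of its true risk.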

In domain generalization, the same broad idea is transferred from samplewise loss dispersion to domainwise risk dispersion. With empirical domain-risk vector

$\hat{\mathbf{R}}(\theta) = \big(\hat R_1(\theta), \dots, \hat R_m(\theta)\big),$

V‑REx uses

$\frac{1}{m}\sum_{e=1}^{m} \hat R_e(\theta) + \lambda\,\widehat{\operatorname{Var}}\big(\hat{\mathbf{R}}(\theta)\big),$

whereas Risk Variance Penalization uses

$\frac{1}{m}\sum_{e=1}^{m} \hat R_e(\theta) + \lambda\,\sqrt{\widehat{\operatorname{Var}}\big(\hat{\mathbf{R}}(\theta)\big)}.$

The supplied analysis shows that RVP arises from a quasi-DRO problem over domain weights, establishes pointwise and uniform links between the min–max formulation and the mean-plus-dispersion objective, and gives an asymptotic tuning rule for the penalty weight $\lambda$.

The stated interpretation is that the RVP objective acts asymptotically as an upper confidence bound on the average domain risk, while empirically improving worst-domain behavior under appropriate domain diversity (Xie et al., 2020).
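The difference between penalizing the variance and penalizing the standard deviation of domain risks can be seen in a few lines. The sketch below assumes that the V-REx-style objective is the mean domain risk plus $\lambda$ times the variance of domain risks and that the RVP-style objective replaces the variance by its square root; the $1/m$ variance convention, the risk values, and the function name are assumptions of the sketch, not details confirmed by the cited paper.

```python
import math


def domain_objectives(domain_risks, lam):
    """Return (V-REx-style, RVP-style) objectives from per-domain empirical risks."""
    m = len(domain_risks)
    mean = sum(domain_risks) / m
    var = sum((r - mean) ** 2 for r in domain_risks) / m  # 1/m convention (assumed)
    return mean + lam * var, mean + lam * math.sqrt(var)


risks = [0.2, 0.4, 0.6]  # hypothetical per-domain risks
vrex, rvp = domain_objectives(risks, lam=1.0)
vrex_scaled, rvp_scaled = domain_objectives([10.0 * r for r in risks], lam=1.0)
print(vrex, rvp, vrex_scaled, rvp_scaled)
```

Under these assumptions the square-root form is positively homogeneous in the risks, so the same $\lambda$ retains its meaning when the loss scale changes, whereas the variance form does not.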

A plausible synthesis of these two literatures is that variance-based penalties in learning serve two formally different but structurally related roles: they either tighten risk upper bounds by exploiting low empirical variance, or they encode robustness against cross-domain heterogeneity by shrinking the dispersion of domainwise losses.

5. Portfolio optimization and variance-component interpretations

In portfolio optimization, variance-based penalties appear both directly as quadratic risk terms and indirectly through variance-component parameterizations. An extended mean–variance–CVaR portfolio model with short selling and cardinality constraints uses an objective that contains a covariance-based variance term, that is, a quadratic form in the portfolio weights. After splitting variables and introducing a quadratic penalty that couples the split copies, the corresponding block subproblem becomes a strictly convex quadratic program whose quadratic-term matrix is positive definite for any positive penalty parameter. This yields a closed-form update for that block inside a penalty-decomposition plus block-coordinate-descent scheme. The paper reports that Algorithm 2 is about twice as fast as PADM and achieves small gaps relative to direct CVX–MOSEK solutions on the S&P-based instances considered (Mousavi et al., 2024).

A different variance-based interpretation arises in high-dimensional ridge regression. Under the Gaussian random-effects model

$y = X\beta + \varepsilon, \qquad \beta \sim \mathcal{N}(0, \tau^2 I_p), \qquad \varepsilon \sim \mathcal{N}(0, \sigma^2 I_n),$

the ridge penalty is not an external hyperparameter but the variance ratio

$\lambda = \sigma^2 / \tau^2.$

The same variance components define the heritability index $h^2$, the share of outcome variance attributable to the random effects.

Maximum marginal likelihood is then used to estimate $\tau^2$ and $\sigma^2$, and therefore $\lambda$, directly from the marginal model. The supplied study reports good performance of MML relative to CV, and for Poisson and Binomial ridge regression it reports superior accuracy of the resulting MML estimator of $\lambda$ as compared to CV (Veerman et al., 2019).

These two examples use the word “variance” differently. In the portfolio model it is a direct penalized quadratic form in the decision variable, whereas in ridge-type models it is a latent variance-component ratio that determines the strength of an $\ell_2$ penalty. In both cases, however, the penalty magnitude is controlled by second-order structure.

6. Failure modes, critical phenomena, and alternatives to direct variance penalization

The most consistent negative result in the supplied literature is that direct variance penalties are often theoretically natural but operationally brittle. In policy-gradient RL, the quadratic return term $\mathbb{E}[G_0^2]$, the need for double sampling, and non–positive-homogeneity make variance-based penalties highly sensitive to scale and sample noise; empirically, Variance and Semi_Variance are “not recommended in practice” relative to CVaR Deviation, Gini Deviation, Mean Deviation, and Semi_STD (Luo et al., 15 Apr 2025).

For accumulated rewards in MDPs, the critique is more structural. Variance-penalized expectation and semi-variance-penalized expectation can force eventually reward-minimizing behavior, whereas MADPE and SMADPE, for penalty weights in the ranges identified in that analysis, and threshold-based penalties admit eventually reward-maximizing optimal schedulers. The threshold-based objective is especially notable because once the accumulated reward reaches the threshold, the penalty vanishes and the objective coincides with expected reward, thereby avoiding the suppression of good right-tail outcomes (Baier et al., 2024).

High-dimensional portfolio variance optimization under asymmetric $\ell_1$ regularization exhibits a separate critical phenomenon. The supplied replica analysis states that regularization extends the interval where the optimization can be carried out and suppresses large sample fluctuations, but that the performance of $\ell_1$ regularization is “rather disappointing”: if the ratio $r = N/T$ of portfolio dimension to sample size is small, the regularizer does not play any role, while where it becomes active the estimation error is already very large. The same analysis finds that $\ell_1$ regularization can eliminate at most half the assets, and that there is a critical ratio $r = 2$ beyond which the $\ell_1$-regularized variance cannot be optimized because the regularized variance becomes constant over the simplex (Kondor et al., 2017).

An adjacent response to these issues is to avoid explicit variance minimization while still using variance information indirectly. In improved excited-state VMC with deep-learning ansatzes, the main objective is energy plus overlap and spin penalties, while local-energy variance enters only through adaptive scaling of overlap penalties. The stated motivation is that direct variance minimization for neural-network ansatzes involves third derivatives and can converge to undesired states, whereas variance-informed scaling preserves adaptivity without explicit variance optimization (Szabó et al., 2024).

Taken together, these results delineate the modern status of variance-based penalties. They remain central when dispersion itself is the target quantity, when robust duality yields tractable convex surrogates, or when empirical Bernstein effects are exploitable. But the same literature repeatedly shows that raw quadratic variance penalization can induce unstable gradients, pathological policies, or poor high-dimensional conditioning, making one-sided deviations, MAD-type penalties, CVaR-style objectives, and variance-informed but not variance-minimizing penalties recurring alternatives.