Mean-Variance Plane of Harmony
- The mean-variance plane of harmony is a quantitative geometric framework that maps the trade-off between expected performance and risk across fields like finance and reinforcement learning.
- It leverages Markowitz portfolio theory, dynamic programming, and augmented-state strategies to define Pareto-efficient solutions, along whose frontier any improvement in mean necessarily increases variance.
- Dynamic extensions and hybrid models within this framework enable robust benchmark evaluations and risk-sensitive decision-making in online learning and sequential decision processes.
The mean-variance plane of harmony is a quantitative and geometric framework that formalizes the trade-off between expected performance (mean) and variability (variance) across diverse research domains, including financial portfolio selection, sequential decision processes, reinforcement learning, and benchmark evaluation. Originating in Markowitz's mean-variance portfolio theory and extended to modern reinforcement learning and machine learning evaluation, the concept describes a two-dimensional space in which each point is a (mean, variance) pair: the archetypal axes of reward versus risk in finance, or of score versus reliability in evaluation. This dual-axis perspective enables the visualization and optimization of Pareto-efficient solutions, those for which the mean cannot be improved without increasing variance and vice versa, and, in benchmarking applications, the diagnosis of whether high aggregate scores genuinely reflect uniform competence.
1. Geometric and Mathematical Foundations
The mean-variance plane is grounded in the geometrization of Markowitz theory, in which efficient portfolios are characterized as points on the boundary of feasible mean-variance pairs, typically forming a convex (often parabolic) frontier in the plane with variance on one axis and mean (expected return or score) on the other (Iliev, 2017). For a portfolio with weight vector $w$, covariance matrix $\Sigma$, and expected-return vector $\mu$, the risk is $\sigma^2 = w^\top \Sigma w$ and the return is $r = \mu^\top w$. Level sets of constant variance are the ellipsoids $\{w : w^\top \Sigma w = c\}$, and feasible portfolios under linear constraints (such as $\mathbf{1}^\top w = 1$) correspond to affine subspaces. The efficient frontier arises from tangency conditions between these ellipsoids and the constraints. The optimal risk-return relationship is reflected in the trace ellipsoid and can often be explicitly characterized as a parabola $\sigma^2 = \alpha r^2 + \beta r + \gamma$ for coefficients $\alpha, \beta, \gamma$ determined by the intersection geometry.
This geometric viewpoint enables precise parametrization of efficient trade-offs and underpins dynamic extensions, including time-varying selection and multi-period constraints, as well as generalizations to law-invariant convex risk measures for non-Gaussian return models (Sayit, 2022).
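As a concrete numerical companion to the closed-form frontier above, the sketch below (a minimal illustration with toy data and helper names of our choosing, not code from the cited works) evaluates the minimum attainable variance for a grid of target means under the budget constraint $\mathbf{1}^\top w = 1$:

```python
import numpy as np

def frontier_variance(mu, Sigma, targets):
    """Minimum variance attainable for each target mean r, solving
    min_w w' Sigma w  s.t.  mu' w = r  and  1' w = 1 (Markowitz frontier)."""
    inv = np.linalg.inv(Sigma)
    ones = np.ones(len(mu))
    a = mu @ inv @ mu      # mu' Sigma^-1 mu
    b = mu @ inv @ ones    # mu' Sigma^-1 1
    c = ones @ inv @ ones  # 1' Sigma^-1 1
    d = a * c - b**2
    # The frontier is the parabola sigma^2(r) = (c r^2 - 2 b r + a) / d
    return (c * targets**2 - 2 * b * targets + a) / d

mu = np.array([0.08, 0.12, 0.15])                 # toy expected returns
Sigma = np.array([[0.04, 0.01, 0.00],
                  [0.01, 0.09, 0.02],
                  [0.00, 0.02, 0.16]])            # toy covariance matrix
r = np.linspace(0.08, 0.15, 5)
print(np.c_[r, frontier_variance(mu, Sigma, r)])  # (mean, variance) pairs
```

Plotting these pairs traces out the parabolic frontier directly in the mean-variance plane.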
2. Policy Classes and Achievability in Sequential Decision Problems
In finite-horizon Markov decision processes (MDPs) and reinforcement learning, the mean-variance plane arises when the objective is not merely to maximize expected cumulative reward but to achieve a desirable mean while constraining or minimizing variance (Mannor et al., 2011). The feasible set of $(\mathbb{E}[W], \operatorname{Var}[W])$ pairs generated by policies, where $W$ is the cumulative reward, is typically non-convex due to the quadratic nature of variance, though the set of achievable moment pairs $(\mathbb{E}[W], \mathbb{E}[W^2])$ is polyhedral. The boundary of this set forms the efficient frontier in the mean-variance plane; a dynamic-programming sketch for the first two moments of $W$ under a fixed policy follows the list below.
Key distinctions arise between policy classes:
- Markov policies (functions of the current state alone) are insufficient for tight variance control; randomized and history-dependent (or augmented-state) policies can strictly improve the achievable set of mean-variance pairs.
- Randomization and augmentation with cumulative reward (using the augmented state $(s, c)$, where $c$ is the reward accumulated so far) allow for dynamic correction and fine-tuning, enlarging the achievable region.
- Computing optimal policies under mean-variance constraints is NP-hard or strongly NP-hard for various policy classes, reflecting the nonlinearity induced by the variance term and the coupling between mean and risk. Pseudopolynomial approximation and exact algorithms are feasible in integer-reward settings through linear programs operating over augmented state-action frequency polytopes.
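The following sketch (a minimal illustration under assumed conventions for the transition, reward, and policy arrays, not an implementation from the cited work) shows how the first and second moments of the cumulative reward propagate backward for a fixed Markov policy, from which each policy's point on the mean-variance plane follows:

```python
import numpy as np

def policy_mean_variance(P, R, pi, H):
    """Mean and variance of the cumulative reward W for a fixed Markov policy.
    P[a, s, s'] are transition probabilities, R[a, s, s'] per-step rewards,
    pi[t, s, a] the (possibly randomized) policy; H is the horizon.
    Uses the coupled recursion for J1(s) = E[W|s] and J2(s) = E[W^2|s]."""
    A, S, _ = P.shape
    J1, J2 = np.zeros(S), np.zeros(S)
    for t in reversed(range(H)):
        new1, new2 = np.zeros(S), np.zeros(S)
        for s in range(S):
            for a in range(A):
                p, r = P[a, s], R[a, s]
                new1[s] += pi[t, s, a] * (p @ (r + J1))
                # E[(r + W')^2 | s'] = r^2 + 2 r E[W'|s'] + E[W'^2|s']
                new2[s] += pi[t, s, a] * (p @ (r**2 + 2 * r * J1 + J2))
        J1, J2 = new1, new2
    return J1, J2 - J1**2   # per-start-state mean and variance of W
```

Sweeping over a family of policies and plotting the resulting pairs traces out the achievable region whose boundary is the efficient frontier.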
Thus, the mean-variance plane encodes fundamental limits of risk-sensitive control and decision-making, dictating both achievability and computational complexity.
3. Dynamic Portfolio Optimization and Extensions
Continuous-time extensions generalize the plane of harmony to dynamic settings. In the classical setup, for a deterministic terminal time $T$, the trade-off is solved by minimizing variance for a given mean; extensions include:
- Varying Terminal Time: By endogenizing the investment horizon (an exit time $\tau$, defined as the first hitting time of a moving mean target), the efficient frontier can be improved, with variance minimized by optimally selecting the exit time (Yang, 2019). This entails solving a dynamic programming problem with embedding techniques from stochastic control, yielding explicit expressions for both the optimal strategy and the optimal holding time $\tau^*$; a Monte Carlo sketch of the effect appears after this list.
- Multi-Time State Models: By enforcing variance and mean constraints at multiple time points, the efficient frontier is enriched, and dynamic risk is controlled throughout the investment trajectory. This is formalized with a system of coupled Riccati equations and jump boundary conditions, ensuring a harmonious evolution of risk and return across time (Yang, 2019).
- Hybrid Mean-Variance-Quantile Models: The mean-variance plane is extended to incorporate quantile-based (downside) risk measures, such as spectral risk measures and Value-at-Risk, through quantile optimization and martingale representation. The hybrid approach allows for state-dependent allocation, greater sensitivity to market conditions, and superior downside protection, as measured by improved Sortino ratios and reduced tail risk (Wu et al., 2023).
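To make the varying-terminal-time idea concrete, the following Monte Carlo sketch (with hypothetical market parameters and a simple exponential target chosen for illustration, not the explicit optimal solution of Yang (2019)) compares terminal-wealth mean and variance under a fixed horizon against an exit at the first hitting time of a moving target:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, T, dt = 0.08, 0.2, 1.0, 1e-3       # hypothetical GBM parameters
target = lambda t: np.exp(0.06 * t)            # moving mean target L(t)

fixed, hitting = [], []
for _ in range(5_000):
    t, w, exit_wealth = 0.0, 1.0, None
    while t < T:
        w *= np.exp((mu - 0.5 * sigma**2) * dt
                    + sigma * np.sqrt(dt) * rng.normal())
        t += dt
        if exit_wealth is None and w >= target(t):   # first hitting: lock in
            exit_wealth = w
    fixed.append(w)                                   # hold to the fixed horizon
    hitting.append(exit_wealth if exit_wealth is not None else w)

for name, x in (("fixed horizon", fixed), ("first-hitting exit", hitting)):
    x = np.asarray(x)
    print(f"{name:>18}: mean={x.mean():.4f}  var={x.var():.5f}")
```

Stopping once the target is hit truncates the upper tail of the wealth distribution, which is the mechanism by which an endogenous horizon can lower variance at a comparable mean.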
These dynamic extensions maintain the geometric intuition of the mean-variance plane while accommodating multi-period, path-dependent, and downside risk preferences.
4. Monotonicity, Time Consistency, and Refined Efficiency
The classical mean-variance criterion suffers from non-monotonicity and time inconsistency, particularly in dynamic contexts. Several frameworks address these limitations:
- Monotone Mean-Variance Utility (MMV): This utility is the minimal monotone modification of classical quadratic utility, constructed by truncating (i.e., setting aside) surplus wealth above a “bliss point”; a toy numerical illustration follows this list. MMV utility maintains rational (more-is-better) ordering and aligns with agents’ monotonic preferences (Černý et al., 11 Mar 2025). The mean-variance plane is preserved, but the efficiency frontier is now defined over truncated outcomes, often improving risk-adjusted performance as measured by the monotone Sharpe ratio.
- Time-Consistent Strategies: Time inconsistency arises because future variance is non-additive and breaks the dynamic programming principle. Reformulating the problem in the space of wealth distributions allows for deterministic transitions, enabling a Bellman equation that yields truly time-consistent mean-variance optimal strategies (Bäuerle et al., 2023). Nash equilibrium control and forward-backward SDEs also yield equilibrium solutions in strictly monotone MV settings (Wang et al., 16 Feb 2025).
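The toy illustration below (with an arbitrary risk-aversion parameter and wealth distribution chosen purely for demonstration) shows the non-monotonicity that MMV repairs: classical quadratic utility can assign lower value to strictly larger wealth, while the truncated version cannot:

```python
import numpy as np

theta = 1.0                       # risk-aversion parameter (illustrative)
bliss = 1.0 / theta               # quadratic utility peaks at the bliss point

def quadratic(x):                 # classical quadratic utility
    return x - 0.5 * theta * x**2

def monotone(x):                  # MMV-style fix: set aside surplus above bliss
    return quadratic(np.minimum(x, bliss))

rng = np.random.default_rng(1)
X = rng.normal(0.8, 0.3, 100_000)     # terminal wealth samples
Y = X + 0.5                           # strictly more wealth in every state

print("quadratic:", quadratic(X).mean(), "->", quadratic(Y).mean())  # decreases
print("monotone :", monotone(X).mean(), "->", monotone(Y).mean())    # increases
```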
These refinements ensure that the mean-variance harmony extends to robust, dynamically admissible policies, with equilibrium solutions preserving efficient trade-offs over time.
5. Applications in Online Learning and Benchmark Evaluation
The mean-variance plane of harmony is generalized beyond finance. In online learning and multi-armed bandit problems, the mean-variance criterion defines a selection rule balancing exploitation (high average return) and risk aversion (low variance), parameterized by a risk-tolerance coefficient $\rho$ (Zhu et al., 2020). Thompson Sampling–based algorithms are shown to efficiently span the mean-variance plane, with information-theoretically tight regret bounds and optimal adaptation across risk regimes.
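A minimal sketch of such a rule (a simplified Thompson-style sampler with an ad hoc Gaussian posterior and hyperparameters of our choosing, not the exact algorithm or analysis of Zhu et al. (2020)) shows how a sampled index of the form $\rho\mu - \sigma^2$ trades mean against variance:

```python
import numpy as np

rng = np.random.default_rng(2)

def mv_thompson(arms, rho=1.0, horizon=5000):
    """Simplified mean-variance Thompson sampling for Gaussian arms.
    Each round, sample a precision and mean from a rough per-arm posterior
    and pull the arm maximizing the sampled index rho*mu - sigma^2."""
    k = len(arms)
    n = np.zeros(k)        # pull counts
    m = np.zeros(k)        # running means
    ss = np.ones(k)        # running sums of squared deviations (prior-ish)
    for _ in range(horizon):
        tau = rng.gamma(shape=(n + 1) / 2, scale=2.0 / ss)   # sampled precisions
        mu = rng.normal(m, 1.0 / np.sqrt((n + 1) * tau))     # sampled means
        i = int(np.argmax(rho * mu - 1.0 / tau))             # sampled MV index
        x = arms[i]()
        n[i] += 1
        delta = x - m[i]
        m[i] += delta / n[i]
        ss[i] += delta * (x - m[i])                          # Welford update
    return n

arms = [lambda: rng.normal(1.0, 1.0),      # high mean, high variance
        lambda: rng.normal(0.9, 0.2)]      # slightly lower mean, low variance
print(mv_thompson(arms, rho=1.0))          # risk-averse regimes favor arm 2
```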
In the evaluation of scientific benchmarks, the mean-variance plane of harmony characterizes the reliability of benchmarks themselves (Uzunoglu et al., 30 Sep 2025). A harmony metric $H$ quantifies the entropy of the performance distribution across subdomains (subsets of the dataset), with $H$ computed from kernel-weighted deviations from the mean performance. Aggregating over a set of models yields the mean $\mu_H$ and variance $\sigma_H^2$ of the harmony scores, and benchmarks are placed onto a mean-variance plane where high $\mu_H$ (uniform performance) and low $\sigma_H^2$ (stability across models) indicate trustworthy aggregate scores. This approach robustly differentiates between benchmarks whose overall accuracy reflects broad competence and those that obscure subdomain weaknesses via overaggregation.
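Since the exact functional form of the metric belongs to the cited work, the sketch below substitutes an illustrative proxy of our own devising (a normalized entropy of Gaussian-kernel weights on each subdomain's deviation from the mean score) to show how a benchmark acquires its coordinates on the plane:

```python
import numpy as np

def harmony_proxy(subdomain_scores, bandwidth=0.1):
    """Illustrative harmony proxy (not the exact metric of the cited work):
    normalized entropy of kernel weights on deviations from the mean score.
    Returns a value in (0, 1]; 1 means perfectly uniform subdomain scores."""
    s = np.asarray(subdomain_scores, dtype=float)
    w = np.exp(-((s - s.mean()) / bandwidth) ** 2)   # Gaussian-kernel weights
    p = w / w.sum()
    entropy = -(p * np.log(p + 1e-12)).sum()
    return entropy / np.log(len(s))

# Coordinates on the plane: mean and variance of harmony over a model set
models = [np.array([0.82, 0.80, 0.79, 0.81]),   # uniform competence
          np.array([0.99, 0.95, 0.40, 0.90])]   # hidden subdomain weakness
h = np.array([harmony_proxy(m) for m in models])
print(f"mean harmony = {h.mean():.3f}, variance = {h.var():.4f}")
```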
6. Unified Perspectives and Frontiers
Across these domains, the mean-variance plane of harmony serves as both a theoretical construct and a tool for practical optimization and evaluation:
- In geometric and algebraic terms, it provides a locus and frontier for efficient trade-offs and points of tangency (optimality).
- In dynamic and stochastic control, it structures achievable pairs under complex policy classes and computational constraints.
- In learning and benchmarking, it reframes “average” performance as a bivariate indicator of mean and uniformity, guiding both algorithm and dataset design.
This universality positions the mean-variance plane of harmony as an enduring framework for the analysis of risk, robustness, and efficiency in diverse scientific and engineering applications.