Minimax Weighted Expected Regret (MWER)
- MWER is a decision-theoretic criterion that uses a weighted set of probabilities to minimize expected regret, blending Bayesian and minimax approaches.
- It employs rigorous axiomatic foundations and likelihood-based updating to maintain consistency in both static and dynamic decision settings.
- MWER finds practical application in reinforcement and online learning, providing tractable algorithms with strong theoretical performance guarantees.
Minimax Weighted Expected Regret (MWER) is a decision-theoretic criterion that generalizes classical minimax expected regret to settings where uncertainty is represented not by a single probability measure, nor by an unweighted set, but by a weighted set of probability measures. This framework provides a rigorous approach for robust decision-making under ambiguity, interpolating smoothly between Bayesian expected utility and traditional minimax regret, and supports a fully axiomatic characterization in both static and dynamic (updating) settings (Halpern et al., 2013, Halpern et al., 2012). MWER has been developed across decision theory, reinforcement learning, and online learning, delivering tight theoretical bounds and tractable algorithms for robust yet adaptive choice.
1. Foundations: Weighted Sets of Probabilities and Regret-Based Decision Rules
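MWER is defined in terms of a weighted set of probabilities on a finite state space $S$. Let $\mathcal{P}^+ = \{(\Pr, \alpha_{\Pr}) : \Pr \in \mathcal{P}\}$, where each weight $\alpha_{\Pr} \in [0, 1]$ quantifies the significance or credibility of $\Pr$. The normalization constraint $\sup_{\Pr \in \mathcal{P}} \alpha_{\Pr} = 1$ ensures comparability across measures.
Given a set of possible prizes $X$ and a utility function $u : X \to \mathbb{R}$, a Savage act is a function $f : S \to X$. For every feasible act $f$ and menu $M$ of available acts, the ex post optimal utility in state $s$ is $u_M(s) = \sup_{g \in M} u(g(s))$. The regret of act $f$ in $s$ is $\mathrm{reg}_M(f, s) = u_M(s) - u(f(s))$, with expected regret under $\Pr$ given by $\mathrm{reg}_{M,\Pr}(f) = \sum_{s \in S} \Pr(s)\,\mathrm{reg}_M(f, s)$. The weighted expected regret (WER) for $f$ is
$$\mathrm{reg}_M^{\mathcal{P}^+}(f) = \sup_{\Pr \in \mathcal{P}} \alpha_{\Pr}\,\mathrm{reg}_{M,\Pr}(f).$$
The MWER decision rule selects any act in $M$ minimizing this quantity.
When all weights are unity (i.e., $\alpha_{\Pr} = 1$ for all $\Pr \in \mathcal{P}$), MWER reduces to standard minimax regret. When $\mathcal{P}$ is a singleton, MWER becomes subjective expected utility maximization, making it a natural generalization capable of interpolating between fully robust and fully Bayesian behaviors (Halpern et al., 2013, Halpern et al., 2012).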
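As a minimal sketch of the rule above, assuming a finite state space with states indexed from 0 and acts given directly by their utility vectors, the weighted expected regret and the MWER choice can be computed as follows. The function names, the menu, and the two probability vectors are hypothetical illustrations, not taken from the cited papers.

```python
from typing import Dict, List, Sequence, Tuple

# A weighted set of probabilities: (probability vector over states, weight) pairs.
WeightedSet = List[Tuple[Sequence[float], float]]


def weighted_expected_regret(
    act: str, menu: Dict[str, Sequence[float]], weighted_set: WeightedSet
) -> float:
    """Largest weight-scaled expected regret of `act` relative to `menu`."""
    utilities = menu[act]
    n_states = len(utilities)
    # Ex post optimal utility in each state, taken over the whole menu.
    best = [max(u[s] for u in menu.values()) for s in range(n_states)]
    regret = [best[s] - utilities[s] for s in range(n_states)]
    return max(
        alpha * sum(p[s] * regret[s] for s in range(n_states))
        for p, alpha in weighted_set
    )


def mwer_choice(menu: Dict[str, Sequence[float]], weighted_set: WeightedSet) -> str:
    """Return an act in the menu that minimizes weighted expected regret."""
    return min(menu, key=lambda a: weighted_expected_regret(a, menu, weighted_set))


# Hypothetical two-state menu: each act is listed by its utility in each state.
menu = {"a": [4.0, 0.0], "b": [1.0, 1.0]}
p1, p2 = [0.2, 0.8], [0.8, 0.2]

print(mwer_choice(menu, [(p1, 1.0), (p2, 1.0)]))  # all weights 1 (minimax regret): a
print(mwer_choice(menu, [(p1, 1.0)]))             # singleton (expected utility): b
print(mwer_choice(menu, [(p1, 1.0), (p2, 0.2)]))  # p2 discounted: b
```

In this toy menu, lowering the weight on `p2` moves the choice from the minimax-regret act to the act that is optimal under `p1`, which is the interpolation described above.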
2. Weight Assignment and Likelihood-Based Updating
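Initial weights $\alpha_{\Pr}$ can arise from subjective confidence, expert judgment, or second-order priors. The only requirement is proper normalization ($\sup_{\Pr \in \mathcal{P}} \alpha_{\Pr} = 1$).
Upon observing new information $E \subseteq S$ (with $\sup_{\Pr \in \mathcal{P}} \alpha_{\Pr} \Pr(E) > 0$), MWER employs a likelihood updating rule. Each original pair $(\Pr, \alpha_{\Pr})$ with $\Pr(E) > 0$ is replaced by $(\Pr(\cdot \mid E), \alpha_{\Pr}^{E})$, where
$$\alpha_{\Pr}^{E} = \frac{\alpha_{\Pr} \Pr(E)}{\sup_{\Pr' \in \mathcal{P}} \alpha_{\Pr'} \Pr'(E)},$$
and all measures yielding the same conditional are merged by taking the supremum of possible weights. Dividing by the largest likelihood-scaled weight ensures that the updated set is again normalized.
This updating preserves consistency, in that sequential updates commute: writing $\mathcal{P}^+ \mid E$ for the updated weighted set, $(\mathcal{P}^+ \mid E_1) \mid E_2 = (\mathcal{P}^+ \mid E_2) \mid E_1 = \mathcal{P}^+ \mid (E_1 \cap E_2)$. Additionally, under repeated observations generated by some $\Pr^* \in \mathcal{P}$, the weights concentrate: $\alpha_{\Pr} \to 0$ for every $\Pr \neq \Pr^*$ almost surely, so MWER converges to expected utility under the true $\Pr^*$ (Halpern et al., 2013, Halpern et al., 2012).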
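The likelihood updating step can be sketched in a few lines of Python, assuming a finite state space, probability vectors indexed by state, and an event given as a set of state indices. The function name `likelihood_update` and the example measures are hypothetical; rounding conditionals to 12 digits to detect equal measures is a convenience of this sketch, not part of the rule.

```python
from typing import Dict, List, Sequence, Set, Tuple

WeightedSet = List[Tuple[Sequence[float], float]]


def likelihood_update(weighted_set: WeightedSet, event: Set[int]) -> WeightedSet:
    """Condition every measure on `event` and rescale weights by likelihood."""
    scored = []
    for p, alpha in weighted_set:
        p_event = sum(p[s] for s in event)
        if p_event > 0:  # measures that rule out the event are dropped
            conditional = tuple(
                round(p[s] / p_event, 12) if s in event else 0.0
                for s in range(len(p))
            )
            scored.append((conditional, alpha * p_event))
    if not scored:
        raise ValueError("event has zero weighted likelihood under every measure")
    # Normalize so that the largest updated weight is exactly 1.
    norm = max(score for _, score in scored)
    merged: Dict[Tuple[float, ...], float] = {}
    for conditional, score in scored:
        # Measures with the same conditional are merged, keeping the largest weight.
        merged[conditional] = max(merged.get(conditional, 0.0), score / norm)
    return [(list(conditional), weight) for conditional, weight in merged.items()]


# Hypothetical example: three states, three equally weighted measures.
prior = [
    ([0.5, 0.25, 0.25], 1.0),
    ([0.1, 0.1, 0.8], 1.0),
    ([0.05, 0.05, 0.9], 1.0),
]
for conditional, weight in likelihood_update(prior, {0, 1}):
    print(conditional, round(weight, 4))
```

The second and third measures have the same conditional on the event, so they collapse into a single entry whose weight comes from the one that assigned the event higher likelihood.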
3. Axiomatic Characterization: Static and Dynamic
MWER is fully characterized by an axiomatic system within the Anscombe–Aumann framework. For static (single-stage) choice, the following conditions must be satisfied by the menu-dependent preference relation $\succeq_M$ (for every menu $M$):
- Transitivity: If $f \succeq_M g$ and $g \succeq_M h$, then $f \succeq_M h$.
- Completeness: For any $f, g$, either $f \succeq_M g$ or $g \succeq_M f$.
- Non-triviality: There exist $f$, $g$, and $M$ with $f \succ_M g$.
- Monotonicity: If $f$ state-wise dominates $g$, then $f \succeq_M g$.
- Mixture continuity: Preferences are continuous under convex mixtures.
- Ambiguity aversion: If $f \sim_M g$, then the mixture $pf + (1-p)g$ is weakly preferred to $g$ for every $p \in (0, 1)$.
- Independence: Preference over $f, g$ is stable under independent mixing with any act $h$.
- Menu-independence for constants: For constant acts, preferences do not depend on the menu.
- INA: Adding acts never strictly optimal in any state does not change relative preferences among the rest.
- Boundedness: Every menu admits a dominating constant act.
There is a representation theorem: preferences satisfying these axioms correspond precisely to MWER, with a utility function that is unique up to affine transformation and a maximal normalized weighted set $\mathcal{P}^+$ (Halpern et al., 2013, Halpern et al., 2012).
In dynamic settings, with sequential observations, an additional axiom applies:
- Menu-Dependent Dynamic Consistency (MDC): If, after learning $E$, $f$ is preferred to $g$, then before learning $E$, the conditional act "play $f$ on $E$, $h$ otherwise" is preferred to the analogous $g$-plan.
This extension ensures that, after likelihood updating, preferences continue to admit a MWER representation with the appropriately updated weighted set $\mathcal{P}^+ \mid E$.
4. MWER in Robust Sequential Learning and Reinforcement Learning
MWER admits a natural formalization for sequential decision-making problems, notably in reinforcement learning (RL) and online learning (Bongole et al., 2024, Moroshko et al., 2013). Given an unknown Markov Decision Process (MDP) parameterized by $\theta \in \Theta$, the regret of a policy $\pi$ is
$$R(\pi, \theta) = V^*_\theta - V^\pi_\theta,$$
where $V^\pi_\theta$ is the expected cumulative reward of $\pi$ and $V^*_\theta$ the value of the optimal policy for $\theta$. Defining a weighted prior $w$ over $\Theta$, the weighted expected regret is $\mathbb{E}_{\theta \sim w}[R(\pi, \theta)]$. The minimax weighted expected regret is then
$$\inf_{\pi} \sup_{w} \mathbb{E}_{\theta \sim w}[R(\pi, \theta)].$$
A minimax duality theorem shows that, under standard regularity (convexity, compactness, continuity), MWER coincides with classical minimax regret:
$$\inf_{\pi} \sup_{w} \mathbb{E}_{\theta \sim w}[R(\pi, \theta)] = \sup_{w} \inf_{\pi} \mathbb{E}_{\theta \sim w}[R(\pi, \theta)] = \inf_{\pi} \sup_{\theta \in \Theta} R(\pi, \theta).$$
MWER enables the direct use of information-theoretic Bayesian regret bounds to obtain robust minimax rates, including for finite-horizon MDPs and for linear and contextual bandits. For example, in multi-armed bandits with $K$ arms and $T$ rounds, MWER achieves $\tilde{O}(\sqrt{KT})$ regret. This framework reduces robust sequential learning to the optimization of weighted expected regret, facilitating tractable approximation and computation via duality and game-theoretic techniques (Bongole et al., 2024).
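The duality statement can be checked numerically on a toy problem. The sketch below assumes two policies, two candidate parameters, and a hypothetical regret table; it lets the learner randomize over policies (which makes regret linear in the mixing probability) and compares the minimax value with the value under the least favorable prior by grid search. It illustrates the identity only and is not an algorithm from the cited work.

```python
from typing import Sequence, Tuple


def minimax_and_maximin(
    regret: Sequence[Sequence[float]], grid: int = 3000
) -> Tuple[float, float]:
    """regret[i][j] is the regret of policy i under parameter j (2 x 2 table)."""
    (r00, r01), (r10, r11) = regret
    points = [i / grid for i in range(grid + 1)]
    # Learner mixes policy 0 with probability q. For fixed q the expected regret
    # is linear in the prior, so the worst prior is a point mass on one parameter.
    minimax = min(
        max(q * r00 + (1 - q) * r10, q * r01 + (1 - q) * r11) for q in points
    )
    # Prior puts mass w on parameter 0. For fixed w the Bayes-optimal response
    # is a pure policy, so the inner minimum ranges over the two rows.
    maximin = max(
        min(w * r00 + (1 - w) * r01, w * r10 + (1 - w) * r11) for w in points
    )
    return minimax, maximin


# Hypothetical regret table: each policy is optimal for one parameter.
table = [[0.0, 1.0], [2.0, 0.0]]
minimax_value, maximin_value = minimax_and_maximin(table)
print(minimax_value, maximin_value)  # both are approximately 2/3
```

For this table the learner plays policy 0 with probability 2/3 and the least favorable prior puts mass 1/3 on parameter 0, so both sides equal 2/3 up to floating-point error.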
5. Algorithmic Realization: Weighted Minimax in Online Learning
In online linear regression with adversarial labels, MWER is instantiated by the Weighted Last-Step Min-Max (WEMM) algorithm (Moroshko et al., 2013). At each round $t$, the algorithm predicts using the weighted least-squares solution formed from the history, with weights $a_t$ selected so as to ensure feasibility of the min-max saddle point. Writing $x_t$, $y_t$, and $\hat{y}_t$ for the input, label, and prediction at round $t$, the weighted cumulative loss for a comparator $u$ is
$$L_T^a(u) = \sum_{t=1}^{T} a_t\,(y_t - u^\top x_t)^2,$$
and the algorithm guarantees
$$\sum_{t=1}^{T} (y_t - \hat{y}_t)^2 \le \inf_{u} \left( b\,\|u\|^2 + L_T^a(u) \right)$$
for any feasible weight sequence, where $b > 0$ is a regularization constant, so the weighted minimax regret is at most zero.
By careful design, including recursive updates and data-driven choice of $a_t$, the difference between $L_T^a(u)$ and the standard unweighted loss $L_T(u) = \sum_{t=1}^{T} (y_t - u^\top x_t)^2$ can be controlled, yielding logarithmic or sub-logarithmic regret in $T$ when the data or labels are favorable. The approach extends to weakly non-stationary environments, where regret is measured relative to slowly drifting comparators.
Compared to prior last-step min-max forecasters that required known bounds and uniform weights, WEMM achieves improved constants, relaxes the need for a-priori adversarial bounds, and is competitive in environments with mild non-stationarity (Moroshko et al., 2013).
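To make the weighted loss concrete, the following sketch computes a weighted cumulative squared loss and the regularized weighted least-squares comparator that minimizes it. It is restricted to scalar inputs for brevity, the regularization constant `b` and all data values are assumptions made for illustration, and it does not reproduce WEMM's prediction rule or its choice of weights.

```python
from typing import Sequence


def weighted_loss(
    u: float, xs: Sequence[float], ys: Sequence[float], weights: Sequence[float]
) -> float:
    """Weighted cumulative squared loss of the scalar comparator u."""
    return sum(a * (y - u * x) ** 2 for a, x, y in zip(weights, xs, ys))


def weighted_ridge_solution(
    xs: Sequence[float], ys: Sequence[float], weights: Sequence[float], b: float
) -> float:
    """Minimizer of b * u**2 + weighted_loss(u, ...) for scalar inputs."""
    numerator = sum(a * x * y for a, x, y in zip(weights, xs, ys))
    denominator = b + sum(a * x * x for a, x in zip(weights, xs))
    return numerator / denominator


# Hypothetical history of three rounds with per-round weights a_t >= 1.
xs = [1.0, 2.0, -1.0]
ys = [2.0, 3.9, -2.1]
weights = [1.0, 1.5, 2.0]
b = 0.5

u_hat = weighted_ridge_solution(xs, ys, weights, b)
print(u_hat)
print(b * u_hat ** 2 + weighted_loss(u_hat, xs, ys, weights))
```

With all weights equal to 1 the weighted loss reduces to the standard unweighted squared loss, so the gap between the two is driven entirely by how far the weights depart from 1.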
6. Relation to Classical Decision Criteria and Properties
MWER rigorously interpolates between minimax expected regret (MER) and subjective expected utility (SEU):
- When all $\alpha_{\Pr} = 1$, MWER coincides with MER, fully robust to ambiguity.
- When $|\mathcal{P}| = 1$ and $\alpha_{\Pr} = 1$, MWER coincides with SEU, fully Bayesian.
- As learning progresses and likelihood-based updating concentrates, MWER transitions smoothly from MER to SEU, capturing learning from data (Halpern et al., 2013, Halpern et al., 2012).
Distinctive features of MWER include:
- Ambiguity sensitivity: MWER handles ambiguity aversion through its axiomatic basis.
- Menu dependence: Preferences can depend on the set of available acts, reflecting the regret criterion's sensitivity to alternative actions, in contrast to maximin expected utility (MMEU), which is menu-independent.
- Dynamic consistency: Through its updating rule and dynamic axioms, MWER ensures that plans made before and after new information are revealed are mutually consistent in behavior.
- Overcoming set-model limitations: MWER remedies the inability of pure set-based probability models to learn relative likelihoods through data and avoids the collapse to SEU of second-order probability approaches.
7. Illustrative Example and Implications
A representative example is the delivery robot problem, with two possible states (“1 broken cake”, “10 broken cakes”) and three acts (“continue”, “back”, “check”). Initial symmetric weights ($\alpha_{\Pr} = 1$ for every $\Pr \in \mathcal{P}$) lead MWER to behave identically to MER. With repeated favorable observations (e.g., "first $k$ cakes are unbroken"), likelihood updating shifts relative weight toward the measure that best explains the data, and MWER shifts toward the SEU-optimal actions for that measure. This demonstrates MWER’s ability to interpolate between robust and data-driven decision-making, adapting preferences as weights evolve through learning (Halpern et al., 2013, Halpern et al., 2012).
In summary, MWER unifies and extends foundational approaches to robust choice under ambiguity, admits explicit and tractable updating, supports strong theoretical guarantees across decision theory, online learning, and reinforcement learning, and is characterized through natural and interpretable axioms. The framework enables both principled robust decision-making and smooth adaptation as information accrues.