
Robust Decision Making with Partially Calibrated Forecasts (2510.23471v1)

Published 27 Oct 2025 in stat.ML, cs.AI, and cs.LG

Abstract: Calibration has emerged as a foundational goal in "trustworthy machine learning", in part because of its strong decision theoretic semantics. Independent of the underlying distribution, and independent of the decision maker's utility function, calibration promises that amongst all policies mapping predictions to actions, the uniformly best policy is the one that "trusts the predictions" and acts as if they were correct. But this is true only of *fully calibrated* forecasts, which are tractable to guarantee only for very low dimensional prediction problems. For higher dimensional prediction problems (e.g. when outcomes are multiclass), weaker forms of calibration have been studied that lack these decision theoretic properties. In this paper we study how a conservative decision maker should map predictions endowed with these weaker ("partial") calibration guarantees to actions, in a way that is robust in a minimax sense: i.e. to maximize their expected utility in the worst case over distributions consistent with the calibration guarantees. We characterize their minimax optimal decision rule via a duality argument, and show that, surprisingly, "trusting the predictions and acting accordingly" is recovered in this minimax sense by *decision calibration* (and any strictly stronger notion of calibration), a substantially weaker and more tractable condition than full calibration. For calibration guarantees that fall short of decision calibration, the minimax optimal decision rule is still efficiently computable, and we provide an empirical evaluation of a natural one that applies to any regression model solved to optimize squared error.

Summary

  • The paper develops a duality-based minimax framework that enables robust decision making when forecasts are only partially calibrated.
  • It reveals a sharp transition where, under decision calibration, the robust policy simplifies to a plug-in best response.
  • Empirical evaluations on benchmark datasets show that the robust policy outperforms plug-in methods under adversarial shifts with minimal cost under i.i.d. conditions.

Robust Decision Making with Partially Calibrated Forecasts

Introduction and Motivation

This work addresses the challenge of robust decision making when machine learning forecasts are only partially calibrated. While full calibration ensures that the best policy is to trust the predictions directly, achieving full calibration is computationally intractable in high-dimensional or multiclass settings. The paper introduces a minimax framework for decision making under weaker, more tractable calibration guarantees, characterizing optimal policies and revealing a sharp transition: under decision calibration (a strictly weaker condition than full calibration), the optimal robust policy is again to best respond to the forecast. The analysis extends to generic partial calibration conditions, providing efficient algorithms and empirical validation.

Calibration, Partial Calibration, and Decision Making

Calibration is a statistical property ensuring that, for any predicted value $v$, the conditional expectation of the outcome given the prediction equals $v$. Formally, for a forecaster $f$, full calibration requires $\mathbb{E}[Y \mid f(X) = v] = v$ for all $v$. This property guarantees that the best policy for a downstream decision maker is to act as if the forecast is correct.

However, full calibration is infeasible in high dimensions due to exponential sample complexity and computational barriers. As a result, weaker forms of calibration, such as $\mathcal{H}$-calibration, are considered. Here, calibration is enforced only with respect to a set of test functions $h \in \mathcal{H}$, requiring $\mathbb{E}[h(f(X)) \cdot (Y - f(X))] = 0$ for all $h \in \mathcal{H}$. This generalizes to a spectrum of calibration guarantees, with full calibration as the limiting case when $\mathcal{H}$ is the set of all measurable functions.
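The $\mathcal{H}$-calibration condition can be audited empirically by averaging the residual $h(f(X)) \cdot (Y - f(X))$ over a sample. The sketch below does this for a small, hypothetical class of test functions $\{1, v, v^2\}$ and a trivially calibrated constant forecaster; all names here are illustrative, not from the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic audit of H-calibration: estimate E[h(f(X)) * (Y - f(X))] for each
# test function h in a hypothetical finite class H = {1, v, v^2}.
n = 50_000
y = rng.binomial(1, 0.3, size=n).astype(float)  # binary outcome with mean 0.3
f = np.full(n, 0.3)                             # a (trivially) calibrated forecaster
test_functions = [lambda v: np.ones_like(v), lambda v: v, lambda v: v**2]

residuals = [np.mean(h(f) * (y - f)) for h in test_functions]
# For a forecaster calibrated w.r.t. H, every residual should be near zero.
```

In practice one would compare these sample residuals against their sampling noise; here they all vanish up to $O(n^{-1/2})$ fluctuations.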

The decision maker, upon receiving a forecast $f(x)$, considers the set of all possible conditional expectations $q$ consistent with the calibration constraints. The robust policy is then defined as the one maximizing expected utility in the worst case over all such $q$.

Figure 1: Schematic of the interpolating property—robust policies interpolate between minimax safety and best-response as calibration strengthens.

Minimax-Optimal Policies under $\mathcal{H}$-Calibration

The core technical contribution is a duality-based characterization of the minimax-optimal policy under arbitrary finite-dimensional $\mathcal{H}$-calibration. The ambiguity set $\mathcal{Q}$ of admissible conditional expectations is defined by the calibration constraints. The robust policy is obtained by solving a saddle-point problem:

$$a_{\mathrm{robust}}(\cdot) = \arg\max_{a(\cdot)} \min_{q \in \mathcal{Q}} \mathbb{E}\big[u\big(a(f(X)), q(f(X))\big)\big]$$

where $u(a, y)$ is the utility function, assumed linear in $y$. The solution involves:

  • Computing dual multipliers $\lambda^*$ for the calibration constraints.
  • For each forecast $v$, finding the worst-case $q^*(v)$ by minimizing a convex function involving the dual variables.
  • Best-responding to $q^*(v)$.

This approach is computationally efficient for finite $\mathcal{H}$ and reduces the robust policy computation to low-dimensional convex programs.
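To make the convex programs concrete, here is a simplified sketch for a single forecast with two actions, utilities linear in a scalar outcome, and an ambiguity set given by interval constraints. It solves the inner minimization directly as one small LP per action rather than via the dual multipliers; the numbers and constraint set are hypothetical.

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical setup: scalar outcome, two actions, utilities linear in the outcome:
# u(a, q) = utility_slope[a] * q + utility_intercept[a]
utility_slope = np.array([0.0, 1.0])       # action 0: constant payoff; action 1: pays the outcome
utility_intercept = np.array([0.5, 0.0])

# Ambiguity set Q for the conditional mean q, encoded as A_ub @ q <= b_ub.
# Here the calibration constraints only pin q to the interval [0.4, 0.7].
A_ub = np.array([[1.0], [-1.0]])
b_ub = np.array([0.7, -0.4])

def robust_action():
    """max_a min_{q in Q} u(a, q): one small LP over Q per action."""
    worst_case_utility = []
    for a in range(2):
        res = linprog(c=[utility_slope[a]], A_ub=A_ub, b_ub=b_ub,
                      bounds=[(None, None)])          # minimize slope[a] * q over Q
        worst_case_utility.append(res.fun + utility_intercept[a])
    return int(np.argmax(worst_case_utility)), worst_case_utility

a_star, w = robust_action()
```

With a forecast of, say, $v = 0.6$ the plug-in best response would pick the risky action 1, while the robust rule prefers the safe action 0 because the worst-case mean in $\mathcal{Q}$ is only 0.4.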

Sharp Transition: Decision Calibration and Best-Response Optimality

A central result is the identification of a sharp transition in the structure of robust policies. When $\mathcal{H}$ includes the decision calibration class (indicator functions for the regions where each action is optimal), the robust policy collapses to the plug-in best response:

$$a_{\mathrm{robust}}(v) = \arg\max_{a \in \mathcal{A}} u(a, v)$$

This holds for any $\mathcal{H}$ containing the decision calibration tests, and the result extends to simultaneous calibration for multiple downstream decision problems. Thus, decision calibration is a minimal, task-specific threshold for robust trustworthiness.
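The plug-in best response itself is a one-line rule. The sketch below instantiates it for a hypothetical 3-class problem with a utility matrix $U$ where $u(a, y) = U[a] \cdot y$ is linear in the outcome distribution, as the theory assumes.

```python
import numpy as np

# Hypothetical 3-class problem: U[a, k] = utility of action a when the outcome is class k.
U = np.array([
    [1.0, 0.0, 0.0],   # action 0 pays off on class 0
    [0.0, 1.0, 0.0],   # action 1 pays off on class 1
    [0.3, 0.3, 0.3],   # action 2: a safe hedge
])

def best_response(v):
    """Plug-in policy: act as if the forecast v (a class distribution) is correct."""
    return int(np.argmax(U @ v))

# Under decision calibration, the robust policy coincides with this plug-in rule.
print(best_response(np.array([0.6, 0.3, 0.1])))  # action 0
```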

Figure 2: Schematic of the sharp transition—once decision calibration is included, the robust policy collapses to best-response.

Beyond Decision Calibration: Generic Partial Calibration

The framework accommodates generic partial calibration conditions arising from standard training pipelines:

  • Self-orthogonality from squared-loss regression: For models with a linear last layer trained by squared loss, the first-order optimality conditions induce calibration with respect to linear test functions. The robust policy is efficiently computable via a penalized dual, and the inner minimization is tractable for finite action sets and linear utilities.
  • Bin-wise calibration: Post-hoc recalibration methods (e.g., histogram binning) enforce calibration within bins. The robust policy is piecewise constant: for each bin, best-respond to the mean forecast in that bin.
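The self-orthogonality condition in the first bullet can be verified directly: for a model whose last layer is fit by least squares, the first-order optimality conditions force the residuals to be orthogonal to the predictions (and to the constant function). A minimal check with plain linear regression, using synthetic data:

```python
import numpy as np

rng = np.random.default_rng(1)

# Self-orthogonality from squared loss: least-squares normal equations imply
# the residuals are orthogonal to every column of X, hence to the predictions.
n, d = 1000, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, d))])   # intercept + features
y = X @ np.array([0.5, 1.0, -2.0, 0.3]) + rng.normal(size=n)

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
pred = X @ beta
resid = y - pred

# E[(Y - f(X))] ~ 0 and E[f(X)(Y - f(X))] ~ 0: calibration w.r.t. linear test functions.
print(abs(resid.mean()), abs((pred * resid).mean()))
```

This is exactly the calibration-with-respect-to-linear-test-functions property the robust policy in that bullet exploits.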

These results provide practical recipes for robust decision making even when only generic, non-task-specific calibration is available.
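The bin-wise rule in the second bullet above admits an especially simple sketch: partition forecasts into bins and best-respond to each bin's mean forecast. The two-action utility and the bin edges below are hypothetical.

```python
import numpy as np

def binwise_policy(forecasts, bin_edges, utility, n_actions=2):
    """Piecewise-constant robust rule: within each calibration bin,
    best-respond to the mean forecast of that bin."""
    bins = np.digitize(forecasts, bin_edges)
    actions = np.empty(len(forecasts), dtype=int)
    for b in np.unique(bins):
        mask = bins == b
        q_bar = forecasts[mask].mean()   # bin-conditional mean forecast
        actions[mask] = np.argmax([utility(a, q_bar) for a in range(n_actions)])
    return actions

# Utilities linear in the outcome q: u(0, q) = 0.5 (safe), u(1, q) = q (risky).
u = lambda a, q: 0.5 if a == 0 else q
acts = binwise_policy(np.array([0.1, 0.2, 0.8, 0.9]), np.array([0.5]), u)
```

Here the low bin (mean forecast 0.15) takes the safe action and the high bin (mean forecast 0.85) takes the risky one, matching the piecewise-constant structure described above.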

Empirical Evaluation

Experiments on the UCI Bike Sharing and California Housing datasets validate the theoretical predictions. A two-layer MLP regressor is trained with squared loss, ensuring self-orthogonality calibration. The robust policy and the plug-in best response are compared under i.i.d. and adversarially shifted test distributions (respecting the calibration constraints).

Key findings:

  • Under adversarial shifts tailored to the plug-in policy, the robust policy secures higher utility.
  • The cost of robustness under i.i.d. conditions is mild.
  • The robust policy dominates the plug-in policy under its own worst-case distribution, as predicted by the minimax theory.

Theoretical and Practical Implications

The results have several implications:

  • Theoretical: The sharp transition at decision calibration clarifies the hierarchy of calibration notions and their decision-theoretic consequences. The minimax framework provides a principled foundation for robust decision making under partial calibration.
  • Practical: Decision calibration is a tractable and minimal requirement for robust trustworthiness. When unattainable, generic calibration properties from standard training or post-hoc recalibration can still be leveraged for robust policies. The algorithms are efficient and compatible with standard ML pipelines.

Limitations and Future Directions

The analysis assumes risk-neutral decision makers (linear utility in outcomes) and finite action sets. Extending the framework to non-linear utilities or infinite action spaces is a natural direction, though some non-linearities can be handled via basis expansions. Further, the approach relies on the availability of calibration guarantees, which may be challenging to verify or enforce in some settings.

Conclusion

This work establishes a robust, minimax-optimal framework for decision making with partially calibrated forecasts. It demonstrates that decision calibration is a sufficient and minimal condition for best-response optimality, and provides efficient algorithms for robust policies under generic partial calibration. The results bridge the gap between theoretical calibration guarantees and practical, trustworthy decision making in high-dimensional and multiclass settings.
