Pareto Regret in Online Learning

Updated 5 July 2026

Pareto regret is a multi-objective performance measure that evaluates algorithm outcomes relative to a Pareto frontier, capturing trade-offs between conflicting criteria.
Techniques such as nonuniform exploration, reduction methods, and biased confidence bounds are employed to achieve Pareto-optimal guarantees in bandits, games, and structured settings.
Fundamental lower-bound results demonstrate that minimizing regret on one objective inherently inflates regret on others, highlighting unavoidable trade-offs in multi-criterion optimization.

Pareto regret is a family of regret notions in online learning, bandits, and repeated games in which performance is evaluated against a Pareto frontier rather than against a single scalar benchmark. In the literature, the term covers several non-equivalent constructions: a vector of worst-case regrets across actions in finite-armed bandits, joint objectives such as cumulative regret and estimation error in structured bandits, distance-to-front measures in multi-objective bandits, hypervolume deficit in Pareto-front exploration, and asymptotic Pareto-domination relations between learning algorithms in repeated games (Lattimore, 2015, Zuo et al., 31 Jan 2025, Xu et al., 2022, Zhang, 2023, Arunachaleswaran et al., 2024). The unifying idea is that an algorithm is assessed by whether one can improve one performance coordinate without worsening another.

1. Formal meanings of Pareto regret

The common structure is a partial order over outcomes. In the classical finite-armed setting of Lattimore, one fixes a horizon $n$, defines the pseudo-regret with respect to arm $i$ as

$R^\pi_{\mu,i}=n\mu_i-\sum_{t=1}^n \mu_{I_t},$

and then takes the worst case

$R^\pi_i=\sup_{\mu\in[0,1]^K}R^\pi_{\mu,i}.$

The regret object is therefore the vector $R^\pi=(R^\pi_1,\dots,R^\pi_K)$, and a Pareto regret guarantee is a vector $B=(B_1,\dots,B_K)$ such that $R^\pi_i\le B_i$ for all $i$ (Lattimore, 2015).
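As a minimal illustration of the regret vector, the sketch below computes the pseudo-regret $R^\pi_{\mu,i}$ with respect to every arm for one fixed mean vector and one fixed sequence of pulls. The function name `pseudo_regret_vector` and the sample numbers are illustrative assumptions; the supremum over $\mu$ that defines $R^\pi_i$ is not computed here.

```python
from typing import List, Sequence


def pseudo_regret_vector(mu: Sequence[float], pulls: Sequence[int]) -> List[float]:
    """Return n*mu_i - sum_t mu_{I_t} for each arm i, for a fixed pull sequence."""
    n = len(pulls)
    collected = sum(mu[arm] for arm in pulls)  # expected reward actually gathered
    return [n * mu_i - collected for mu_i in mu]


# Hypothetical instance: three arms, horizon n = 4.
mu = [0.5, 0.75, 0.25]
pulls = [0, 1, 1, 2]
regrets = pseudo_regret_vector(mu, pulls)
```

For a fixed $\mu$, only the entry for the best arm is guaranteed to be non-negative; entries for suboptimal arms can be negative, which is why the definition takes the worst case over $\mu$ separately for each arm.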

In structured and multi-objective settings, Pareto regret is instead defined on pairs or vectors of objectives. For the Multinomial Logit Bandit, the paper “On Pareto Optimality for the Multinomial Logistic Bandit” defines regret

$R(T)=\sum_{t=1}^T [R(S^*,v)-R(S_t,v)]$

together with estimation error, either

$E_v(T)=\max_{i<j}\mathbb E |(v_i-v_j)-(\hat v_i-\hat v_j)|$

$i$ 0

The Pareto frontier is the set of achievable (regret, estimation-error) pairs such that no other policy has both strictly smaller regret and strictly smaller error (Zuo et al., 31 Jan 2025).

In multi-objective bandits, one frequently measures distance to the Pareto front. “Pareto Regret Analyses in Multi-objective Multi-armed Bandit” defines

$i$ 2

equivalently

$i$ 3

and then defines stochastic Pareto regret by

$i$ 4

The same paper introduces adversarial-style actual and pseudo-Pareto regrets through cumulative reward vectors and their Pareto fronts (Xu et al., 2022).
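A minimal sketch of a distance-to-front regret follows. It assumes the uniform-lift gap commonly used in the multi-objective bandit literature (the smallest $\epsilon\ge 0$ such that no arm beats $\mu_i+\epsilon\mathbf{1}$ in every coordinate); whether this matches the paper's exact convention is an assumption, and all function names are illustrative.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dominates(a: Vector, b: Vector) -> bool:
    """True if a is at least as good as b everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(means: List[Vector]) -> List[int]:
    """Indices of arms whose mean vectors are not dominated by any other arm."""
    return [i for i, m in enumerate(means)
            if not any(dominates(other, m) for other in means)]


def uniform_lift_gap(means: List[Vector], i: int) -> float:
    """Smallest eps >= 0 such that no arm beats means[i] + eps in every coordinate."""
    worst = max(min(mj_d - mi_d for mj_d, mi_d in zip(mj, means[i])) for mj in means)
    return max(0.0, worst)


def pareto_pseudo_regret(means: List[Vector], pulls: Sequence[int]) -> float:
    """Sum of the gaps of the arms that were pulled."""
    return sum(uniform_lift_gap(means, arm) for arm in pulls)


# Hypothetical instance: arms 0, 1, 2 are Pareto-optimal, arm 3 is dominated.
means = [(0.75, 0.25), (0.25, 0.75), (0.5, 0.5), (0.25, 0.25)]
front = pareto_front(means)
total = pareto_pseudo_regret(means, [3, 3, 0, 2])
```

Pulling any arm on the front costs nothing under this measure, so an algorithm that settles on a single Pareto-optimal arm already has zero regret.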

A further variant is hypervolume regret. “Optimal Scalarizations for Sublinear Hypervolume Regret” defines the dominated hypervolume indicator

$i$ 5

and the regret

$i$ 6

where $i$ 7 is the set of collected objective vectors (Zhang, 2023).
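For intuition, the sketch below computes a dominated hypervolume and the resulting deficit in the two-objective case. It assumes maximization with the reference point at the origin and takes the regret to be the hypervolume of the full front minus that of the collected set; these conventions and the names `hypervolume_2d` and `hv_regret` are illustrative assumptions.

```python
from typing import Iterable, Tuple

Point = Tuple[float, float]


def hypervolume_2d(points: Iterable[Point]) -> float:
    """Area of the region dominated by `points` and bounded below by the origin."""
    area = 0.0
    best_y = 0.0
    # Sweep from the largest first coordinate to the smallest; each point adds
    # a rectangle only for the part of its height not yet covered.
    for x, y in sorted(points, key=lambda p: (-p[0], -p[1])):
        if x <= 0.0:
            break  # remaining points dominate nothing in the positive quadrant
        if y > best_y:
            area += x * (y - best_y)
            best_y = y
    return area


# Hypothetical front and a collected set that misses its middle point.
front = [(1.0, 0.25), (0.5, 0.5), (0.25, 1.0)]
collected = [(1.0, 0.25), (0.25, 1.0)]
hv_regret = hypervolume_2d(front) - hypervolume_2d(collected)
```

The deficit is positive even though both collected points are Pareto-optimal, which is the sense in which hypervolume regret rewards coverage of the front rather than optimality of individual points.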

Setting	Performance object	Representative definition
Finite-armed bandit	Vector of worst-case regrets	$R^\pi=(R^\pi_1,\dots,R^\pi_K)$ with $R^\pi_i=\sup_{\mu\in[0,1]^K}R^\pi_{\mu,i}$
MNL-Bandit	Regret-error pair	$(R(T),E_v(T))$
MO-MAB	Distance to Pareto front	Cumulative distance of the played arms to the Pareto front
Hypervolume optimization	Hypervolume deficit	Dominated hypervolume of the full front minus that of the collected set

2. Pareto frontiers over regret guarantees in finite-armed bandits

The finite-armed theory in “The Pareto Regret Frontier for Bandits” gives an exact characterization, up to constants, of which worst-case regret vectors are achievable in the stochastic case. The key set is

$R^\pi_{\mu,i}=n\mu_i-\sum_{t=1}^n \mu_{I_t},$ 2

Its boundary is the trade-off surface. If one insists on a very small regret on one favored arm, then some other arm must incur large regret; specifically, if the worst-case regret with respect to one arm is $B$, then the worst-case regret with respect to some other arm is $\Omega(nK/B)$ (Lattimore, 2015).

The lower bound is information-theoretic. Under Gaussian noise, there is a universal constant $R^\pi_{\mu,i}=n\mu_i-\sum_{t=1}^n \mu_{I_t},$ 5 such that for any policy $R^\pi_{\mu,i}=n\mu_i-\sum_{t=1}^n \mu_{I_t},$ 6,

$R^\pi_{\mu,i}=n\mu_i-\sum_{t=1}^n \mu_{I_t},$ 7

Equivalently,

$R^\pi_{\mu,i}=n\mu_i-\sum_{t=1}^n \mu_{I_t},$ 8

The proof constructs $R^\pi_{\mu,i}=n\mu_i-\sum_{t=1}^n \mu_{I_t},$ 9 Gaussian instances in which each arm is made optimal by a small shift and then uses change-of-measure arguments based on Pinsker’s inequality and KL divergence (Lattimore, 2015).

The upper bound is given by Unbalanced MOSS. With $R^\pi_i=\sup_{\mu\in[0,1]^K}R^\pi_{\mu,i}.$ 0 and a nonuniform index based on $R^\pi_i=\sup_{\mu\in[0,1]^K}R^\pi_{\mu,i}.$ 1, the algorithm achieves $R^\pi_i=\sup_{\mu\in[0,1]^K}R^\pi_{\mu,i}.$ 2 for all $R^\pi_i=\sup_{\mu\in[0,1]^K}R^\pi_{\mu,i}.$ 3 whenever $R^\pi_i=\sup_{\mu\in[0,1]^K}R^\pi_{\mu,i}.$ 4, where $R^\pi_i=\sup_{\mu\in[0,1]^K}R^\pi_{\mu,i}.$ 5. Thus the Pareto regret frontier in the stochastic case is characterized exactly up to universal constants. A related adversarial construction, using a minor modification of Exp3- $R^\pi_i=\sup_{\mu\in[0,1]^K}R^\pi_{\mu,i}.$ 6, yields $R^\pi_i=\sup_{\mu\in[0,1]^K}R^\pi_{\mu,i}.$ 7 and for $R^\pi_i=\sup_{\mu\in[0,1]^K}R^\pi_{\mu,i}.$ 8,

$R^\pi_i=\sup_{\mu\in[0,1]^K}R^\pi_{\mu,i}.$ 9

which preserves the same inverse trade-off up to an extra logarithmic factor (Lattimore, 2015).

A central implication is that asymmetry is costly. Uniform minimax guarantees of order $\sqrt{nK}$ can hold simultaneously for all arms, but privileging one arm below that scale forces inverse-proportional worst-case regret on others. This is the canonical Pareto-regret phenomenon in single-objective bandits.
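The cost of asymmetry can be made concrete with a small calculation. The sketch assumes the order-of-magnitude statement from Lattimore (2015) that a worst-case regret of $B$ on one arm forces regret of order $nK/B$ on some other arm, and it drops all constants; the function names and the numbers are illustrative.

```python
import math


def minimax_scale(n: int, k: int) -> float:
    """Balanced regret scale sqrt(n*K), constants dropped."""
    return math.sqrt(n * k)


def forced_regret_elsewhere(n: int, k: int, favored_bound: float) -> float:
    """Order of the regret forced on some other arm, n*K/B, constants dropped."""
    return n * k / favored_bound


# Hypothetical problem size: horizon 10000, four arms.
balanced = minimax_scale(10_000, 4)
price = forced_regret_elsewhere(10_000, 4, 50.0)
```

With these numbers the balanced scale is 200, while cutting the favored arm's bound to 50 raises the scale forced on another arm to 800; setting the favored bound equal to the balanced scale recovers 200 on both sides.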

3. Regret–inference Pareto frontiers

In structured bandits, the Pareto frontier often relates immediate performance to statistical fidelity. For the MNL-Bandit, the learner chooses assortments $R^\pi=(R^\pi_1,\dots,R^\pi_K)$ 1 and is evaluated both by cumulative revenue regret and by the accuracy of parameter or revenue-difference estimation. The paper proves lower bounds showing that if

$R^\pi=(R^\pi_1,\dots,R^\pi_K)$ 2

then necessarily

$R^\pi=(R^\pi_1,\dots,R^\pi_K)$ 3

and vice versa. In particular, for $R^\pi=(R^\pi_1,\dots,R^\pi_K)$ 4, one cannot simultaneously make both $R^\pi=(R^\pi_1,\dots,R^\pi_K)$ 5 and $R^\pi=(R^\pi_1,\dots,R^\pi_K)$ 6. When $R^\pi=(R^\pi_1,\dots,R^\pi_K)$ 7, the result sharpens to

$R^\pi=(R^\pi_1,\dots,R^\pi_K)$ 8

To attain the frontier, the paper proposes an epoch-based UCB algorithm with forced exploration probability $R^\pi=(R^\pi_1,\dots,R^\pi_K)$ 9. Ignoring polylogarithmic factors, it achieves

$B=(B_1,\dots,B_K)$ 0

so that $B=(B_1,\dots,B_K)$ 1. The parameter $B=(B_1,\dots,B_K)$ 2 acts as an exploration dial: small $B=(B_1,\dots,B_K)$ 3 yields heavier exploration and lower estimation error, while large $B=(B_1,\dots,B_K)$ 4 yields lighter exploration and lower regret (Zuo et al., 31 Jan 2025).
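The role of the exploration dial can be sketched in a simplified setting. The schedule $\min(1,t^{-\alpha})$, the parameter name `alpha`, and the arm-based (rather than assortment-based) selection step are all assumptions made for illustration; the paper's exact schedule and epoch structure are not reproduced here.

```python
import random
from typing import Callable


def forced_exploration_prob(t: int, alpha: float) -> float:
    """Hypothetical schedule: explore with probability min(1, t^(-alpha))."""
    return min(1.0, t ** (-alpha))


def select_arm(t: int, alpha: float, num_arms: int,
               greedy_choice: Callable[[], int], rng: random.Random) -> int:
    """Play a uniformly random arm with the forced probability, else the index rule."""
    if rng.random() < forced_exploration_prob(t, alpha):
        return rng.randrange(num_arms)
    return greedy_choice()


def expected_exploration_rounds(horizon: int, alpha: float) -> float:
    """Expected number of forced-exploration rounds up to the horizon."""
    return sum(forced_exploration_prob(t, alpha) for t in range(1, horizon + 1))


heavy = expected_exploration_rounds(1000, 0.25)  # small alpha: more exploration
light = expected_exploration_rounds(1000, 0.75)  # large alpha: less exploration
```

Under this schedule a smaller `alpha` spends more rounds on forced exploration, which is the direction described above: better estimates at the price of more regret.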

A related two-objective formulation appears in fixed-budget best-arm identification. “Achieving the Pareto Frontier of Regret Minimization and Best Arm Identification in Multi-Armed Bandits” treats the pair formed by the pseudo-regret and the BAI failure probability. The BoBW-lil’UCB$(\gamma)$ algorithm uses a single parameter $\gamma$ to scale an LIL-style confidence radius. For sufficiently large $R^\pi_i\le B_i$ 0, its regret satisfies

$R^\pi_i\le B_i$ 1

while for $R^\pi_i\le B_i$ 2 above an instance-dependent threshold,

$R^\pi_i\le B_i$ 3

The accompanying lower bound shows that no algorithm can simultaneously achieve optimal regret minimization and optimal best-arm identification; the trade-off is unavoidable (Zhong et al., 2021).

Adaptive combinatorial experimentation yields the same pattern under richer feedback structures. “Adaptive Combinatorial Experimental Design: Pareto Optimality for Decision-Making and Inference” studies regret together with estimation error for super-arm or base-arm gaps. It shows that, both for super-arm gaps and base-arm gaps,

$R^\pi_i\le B_i$ 4

The algorithms MixCombKL for full-bandit feedback and MixCombUCB for semi-bandit feedback both randomize between a best-arm strategy and uniform exploration at rate $R^\pi_i\le B_i$ 5, and both achieve finite-time Pareto-optimal guarantees. The paper also states that richer feedback shrinks the Pareto frontier, with the main gain arising from improved estimation accuracy (Xie et al., 27 Feb 2026).

Pareto-front identification in linear bandits adds a stopping-time perspective. “Learning the Pareto Front Using Bootstrapped Observation Samples” defines one-step Pareto regret through the domination amount

$R^\pi_i\le B_i$ 6

sets

$R^\pi_i\le B_i$ 7

and accumulates

$R^\pi_i\le B_i$ 8

Its algorithm reuses exploration samples multiple times and updates estimates along multiple context directions rather than only along the chosen context. The paper states that the resulting cumulative Pareto regret is within a logarithmic factor of the optimal regret among all algorithms that identify the Pareto front (Kim et al., 2023).

4. Direct Pareto-front regret in multi-objective bandits

Multi-objective bandits replace a single scalar reward by a vector reward. In this setting, the main conceptual issue is whether regret should be defined through scalarization, through distance to a Pareto front, or through a stronger coverage requirement. “Pareto Regret Analyses in Multi-objective Multi-armed Bandit” takes the distance-based route. Besides the stochastic definition $R^\pi_i\le B_i$ 9, it defines the actual regret

$i$ 0

and the pseudo-regret

$i$ 1

A key structural result is that every Pareto regret is upper bounded by the marginal regret on any single coordinate: $i$ 2 This permits reduction to a one-dimensional MAB on a fixed coordinate. The resulting algorithms MO-KS and MO-US achieve $i$ 3 or $i$ 4 depending on whether the regime is stochastic or adversarial, and the paper also establishes matching lower bounds $i$ 5 and $i$ 6. It further gives an adversarial attack showing that Pareto-UCB can suffer linear regret under adaptive perturbations (Xu et al., 2022).
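The coordinate-wise bound can be checked numerically under one concrete choice of gap. The sketch assumes the uniform-lift gap (the smallest $\epsilon\ge 0$ such that no arm beats $\mu_i+\epsilon\mathbf{1}$ in every coordinate) and compares it with the ordinary single-objective gap on each coordinate; the names and the random instances are illustrative, and this is a sanity check rather than the paper's proof.

```python
import random
from typing import List, Sequence

Vector = Sequence[float]


def uniform_lift_gap(means: List[Vector], i: int) -> float:
    """Smallest eps >= 0 such that no arm beats means[i] + eps in every coordinate."""
    worst = max(min(a - b for a, b in zip(mj, means[i])) for mj in means)
    return max(0.0, worst)


def marginal_gap(means: List[Vector], i: int, d: int) -> float:
    """Single-objective gap of arm i when only coordinate d is considered."""
    return max(m[d] for m in means) - means[i][d]


def bound_holds(means: List[Vector]) -> bool:
    """Check that the Pareto gap is at most the marginal gap on every coordinate."""
    dims = len(means[0])
    return all(uniform_lift_gap(means, i) <= marginal_gap(means, i, d)
               for i in range(len(means)) for d in range(dims))


# Random instances with five arms and three objectives (seed fixed for repeatability).
rng = random.Random(0)
instances = [[[rng.random() for _ in range(3)] for _ in range(5)] for _ in range(200)]
all_ok = all(bound_holds(inst) for inst in instances)
```

Because the per-arm gaps are ordered this way, summing them over the pulls gives the same ordering for cumulative regret, which is what makes a single-coordinate bandit algorithm a valid reduction target.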

“Stochastic Multi-Objective Multi-Armed Bandits: Regret Definition and Algorithm” argues that the standard Drugan–Nowé definition has two drawbacks: it measures only a uniform lift in all objectives, and it may declare a single-objective-optimized arm low-regret even when it underperforms badly on other objectives. The paper therefore introduces a stronger vector-wise notion of $i$ 7-regret relative to every Pareto-optimal arm, decomposed into Coverage-Regret and Cumulative Adjustment-Regret. It also defines Efficient Pareto-Optimal arms,

$i$ 8

equivalently the subset of Pareto arms lying on the convex hull of the Pareto front. Its two-phase algorithm explores each arm, discards empirically dominated arms, solves a set-cover problem over domination sets, and then repeatedly plays the selected cover set. The resulting vector-wise regret bound is

$i$ 9

$R(T)=\sum_{t=1}^T [R(S^*,v)-R(S_t,v)]$ 0

with polynomial-time approximate set cover (Davoodi et al., 16 Jun 2025).

Hypervolume regret studies Pareto-front quality at the level of dominated volume rather than pointwise distance. For a positive weight vector $R(T)=\sum_{t=1}^T [R(S^*,v)-R(S_t,v)]$ 1, the hypervolume scalarization is

$R(T)=\sum_{t=1}^T [R(S^*,v)-R(S_t,v)]$ 2

Drawing i.i.d. $R(T)=\sum_{t=1}^T [R(S^*,v)-R(S_t,v)]$ 3 uniformly on the positive orthant unit sphere and maximizing $R(T)=\sum_{t=1}^T [R(S^*,v)-R(S_t,v)]$ 4 yields a simple randomized scalarization strategy. The main theorem gives

$R(T)=\sum_{t=1}^T [R(S^*,v)-R(S_t,v)]$ 5

with high probability. The same work gives an ExploreUCB algorithm for multiobjective stochastic linear bandits with

$R(T)=\sum_{t=1}^T [R(S^*,v)-R(S_t,v)]$ 6

and a sphere-packing lower-bound argument showing that any set of $R(T)=\sum_{t=1}^T [R(S^*,v)-R(S_t,v)]$ 7 points misses hypervolume of order $R(T)=\sum_{t=1}^T [R(S^*,v)-R(S_t,v)]$ 8 on a spherical Pareto frontier (Zhang, 2023).
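The randomized scalarization strategy can be sketched as follows. The scalarization form $s_\lambda(y)=\min_i \max(0,y_i/\lambda_i)^k$ is taken from the random hypervolume scalarization literature and is an assumption about the exact formula used here; the function names and candidate points are illustrative.

```python
import math
import random
from typing import List, Sequence


def sample_positive_sphere(k: int, rng: random.Random) -> List[float]:
    """Uniform draw from the unit sphere restricted to the positive orthant."""
    g = [abs(rng.gauss(0.0, 1.0)) for _ in range(k)]
    norm = math.sqrt(sum(x * x for x in g))
    return [x / norm for x in g]


def hypervolume_scalarization(y: Sequence[float], lam: Sequence[float]) -> float:
    """Assumed form: (min_i max(0, y_i / lam_i)) ** k, with k the number of objectives."""
    k = len(y)
    return min(max(0.0, yi / li) for yi, li in zip(y, lam)) ** k


def pick_point(candidates: Sequence[Sequence[float]], lam: Sequence[float]) -> int:
    """Index of the candidate that maximizes the scalarization for this weight."""
    return max(range(len(candidates)),
               key=lambda i: hypervolume_scalarization(candidates[i], lam))


# Hypothetical candidates: the last point is dominated by (0.5, 0.5).
candidates = [(1.0, 0.25), (0.5, 0.5), (0.25, 1.0), (0.2, 0.2)]
rng = random.Random(0)
picks = [pick_point(candidates, sample_positive_sphere(2, rng)) for _ in range(200)]
```

Each random weight favors a different direction of the front, so repeated draws spread the selections across Pareto-optimal points, and a strictly dominated point is never the maximizer for a positive weight.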

These formulations are not interchangeable. Distance-to-front regret, vector-wise coverage regret, and hypervolume regret assess different properties of a learned set: local domination gap, simultaneous approximation of all Pareto-optimal arms, and global frontier coverage.

5. Pareto regret beyond standard bandits

Pareto-regret reasoning also appears when the competing objectives are computational efficiency, communication, or strategic robustness rather than multiple reward coordinates. In online portfolio selection and online learning of quantum states, the Pareto frontier is drawn with per-round time and memory on the $x$-axis and worst-case regret on the $y$-axis. “Pushing the Efficiency-Regret Pareto Frontier for Online Learning of Portfolios and Quantum States” places Cover’s Universal Portfolios, Soft-Bayes, Ada-BarrONS, and BISONS on this frontier. BISONS attains

$E_v(T)=\max_{i<j}\mathbb E |(v_i-v_j)-(\hat v_i-\hat v_j)|$ 1

with $E_v(T)=\max_{i<j}\mathbb E |(v_i-v_j)-(\hat v_i-\hat v_j)|$ 2 time and memory per round, and the paper states that it strictly dominates Ada-BarrONS because $E_v(T)=\max_{i<j}\mathbb E |(v_i-v_j)-(\hat v_i-\hat v_j)|$ 3 and $E_v(T)=\max_{i<j}\mathbb E |(v_i-v_j)-(\hat v_i-\hat v_j)|$ 4. The same work gives Schrödinger’s BISONS for quantum states, with regret $E_v(T)=\max_{i<j}\mathbb E |(v_i-v_j)-(\hat v_i-\hat v_j)|$ 5, and proves a negative result for plain log-barrier FTRL, which can suffer

$E_v(T)=\max_{i<j}\mathbb E |(v_i-v_j)-(\hat v_i-\hat v_j)|$ 6

regret (Zimmert et al., 2022).

Repeated games produce a different notion of Pareto domination, now between learning algorithms. In “Pareto-Optimal Algorithms for Learning in Games”, algorithm $E_v(T)=\max_{i<j}\mathbb E |(v_i-v_j)-(\hat v_i-\hat v_j)|$ 7 Pareto-dominates $E_v(T)=\max_{i<j}\mathbb E |(v_i-v_j)-(\hat v_i-\hat v_j)|$ 8 if for every optimizer payoff $E_v(T)=\max_{i<j}\mathbb E |(v_i-v_j)-(\hat v_i-\hat v_j)|$ 9, $i$ 00, and for some $i$ 01, $i$ 02. The paper introduces the asymptotic menu $i$ 03, the convex closure of correlated strategy profiles asymptotically implementable by an adversary, and shows that all no-swap-regret algorithms share the same menu $i$ 04. It further states that well-known no-regret algorithms such as Multiplicative Weights and Follow The Regularized Leader are Pareto-dominated, whereas no-swap-regret is a sufficient condition for Pareto-optimality (Arunachaleswaran et al., 2024).

In welfare-oriented games, the regret target can be the unique welfare-maximizing feasible joint action. “Achieving Pareto Optimality in Games via Single-bit Feedback” defines

$i$ 05

for the unique maximizer $i$ 06. Its decentralized algorithm SBC-PE uses exactly one bit $i$ 07 per agent per round and achieves logarithmic expected regret: $i$ 08 The proof sketch uses Chernoff and Hoeffding-type bounds to control the exploration phase and identify the welfare-maximizing joint action with high probability (Kiremitci et al., 30 Sep 2025).

Multi-player bandits without communication yield yet another frontier. “The Pareto Frontier of Instance-Dependent Guarantees in Multi-Player Multi-Armed Bandits with no Communication” studies

$i$ 09

where $i$ 10. The paper proves that no uniform $i$ 11 guarantee is achievable across all gap regimes. Instead, for breakpoints $i$ 12, every algorithm must satisfy

$i$ 13

for $i$ 14, while a decentralized collision-free algorithm attains the same rate up to polynomial factors in $i$ 15 and logarithms. The lower bound is based on topological obstructions at multiple scales (Liu et al., 2022).

6. Recurring principles, lower-bound methods, and interpretive points

Across these settings, Pareto regret formalizes a statement of non-simultaneous improvability. In finite-armed bandits, one cannot shrink worst-case regret on one arm without inflating it elsewhere (Lattimore, 2015). In MNL-Bandits, one cannot simultaneously drive regret and estimation error below the $i$ 16 balance point (Zuo et al., 31 Jan 2025). In fixed-budget best-arm identification, optimal regret minimization and exponentially small failure probability are incompatible within a single tuning of the algorithm (Zhong et al., 2021). In no-communication multi-player bandits, improvement at one gap scale necessarily worsens another (Liu et al., 2022).

The algorithmic mechanisms used to reach the frontier are correspondingly diverse but structurally similar. A recurrent device is explicit nonuniform exploration: forced complement exploration in MNL-Bandits, $i$ 17-mixing in adaptive combinatorial experimentation, biased confidence widths in Unbalanced MOSS, and $i$ 18-dependent confidence radii in BoBW-lil’UCB (Zuo et al., 31 Jan 2025, Xie et al., 27 Feb 2026, Lattimore, 2015, Zhong et al., 2021). Another recurrent device is reduction: MO-KS and MO-US reduce vector-valued Pareto regret to one-dimensional marginal regret on a fixed coordinate, while hypervolume scalarization turns frontier coverage into randomized scalar maximization (Xu et al., 2022, Zhang, 2023).

The proof techniques are equally characteristic. Lower bounds rely on KL- and Pinsker-based change of measure in finite-armed bandits, Le Cam/Fano arguments in MNL-Bandits, sphere packing for hypervolume regret, and topological obstructions for decentralized multi-player bandits, while the single-bit coordination analysis uses Chernoff and Hoeffding concentration to control its exploration phase (Lattimore, 2015, Zuo et al., 31 Jan 2025, Zhang, 2023, Liu et al., 2022, Kiremitci et al., 30 Sep 2025). This suggests that Pareto-regret lower bounds often arise not from a single adversarial instance but from families of nearby instances whose distinctions are precisely the distinctions an algorithm must explore to improve one objective.

A persistent misconception is that Pareto regret is merely scalarized regret under a different name. Several papers explicitly distinguish the two. Distance-based and coverage-based Pareto regrets do not depend on any scalarization function, and the multi-objective bandit literature emphasizes that scalarized regret can fail on non-convex fronts (Xu et al., 2022, Davoodi et al., 16 Jun 2025). Conversely, hypervolume regret shows that carefully chosen nonlinear scalarizations can still provide a principled route to frontier exploration (Zhang, 2023). Another misconception is that no-regret alone is sufficient for strategic optimality. The repeated-games literature states that no-regret algorithms such as Multiplicative Weights and Follow The Regularized Leader can be Pareto-dominated, while no-swap-regret is sufficient for asymptotic Pareto-optimality (Arunachaleswaran et al., 2024).

Taken together, these results make Pareto regret a unifying language for multi-criterion online learning. It does not denote one canonical scalar quantity. Rather, it denotes a class of frontier-based performance criteria that expose which trade-offs are information-theoretically unavoidable, which algorithms attain those trade-offs, and which apparently strong guarantees are dominated once additional criteria—statistical fidelity, coverage of the Pareto front, runtime, communication, or strategic robustness—are made explicit.