
Pareto Regret in Online Learning

Updated 5 July 2026
  • Pareto regret is a multi-objective performance measure that evaluates algorithm outcomes relative to a Pareto frontier, capturing trade-offs between conflicting criteria.
  • Techniques such as nonuniform exploration, reduction methods, and biased confidence bounds are employed to achieve Pareto-optimal guarantees in bandits, games, and structured settings.
  • Fundamental lower-bound results demonstrate that minimizing regret on one objective inherently inflates regret on others, highlighting unavoidable trade-offs in multi-criterion optimization.

Pareto regret is a family of regret notions in online learning, bandits, and repeated games in which performance is evaluated against a Pareto frontier rather than against a single scalar benchmark. In the literature, the term covers several non-equivalent constructions: a vector of worst-case regrets across actions in finite-armed bandits, joint objectives such as cumulative regret and estimation error in structured bandits, distance-to-front measures in multi-objective bandits, hypervolume deficit in Pareto-front exploration, and asymptotic Pareto-domination relations between learning algorithms in repeated games (Lattimore, 2015, Zuo et al., 31 Jan 2025, Xu et al., 2022, Zhang, 2023, Arunachaleswaran et al., 2024). The unifying idea is that an algorithm is assessed by whether one can improve one performance coordinate without worsening another.

1. Formal meanings of Pareto regret

The common structure is a partial order over outcomes. In the classical finite-armed setting of Lattimore, one fixes a horizon $n$, defines the pseudo-regret with respect to arm $i$ as

$$R^\pi_{\mu,i}=n\mu_i-\sum_{t=1}^n \mu_{I_t},$$

and then takes the worst case

$$R^\pi_i=\sup_{\mu\in[0,1]^K}R^\pi_{\mu,i}.$$

The regret object is therefore the vector $R^\pi=(R^\pi_1,\dots,R^\pi_K)$, and a Pareto regret guarantee is a vector $B=(B_1,\dots,B_K)$ such that $R^\pi_i\le B_i$ for all $i$ (Lattimore, 2015).
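As a small numerical illustration of the pseudo-regret definition, the following sketch evaluates it for one fixed mean vector and one realized arm sequence. The function name `pseudo_regret_vector` and the example values are hypothetical, and the sketch does not attempt the supremum over mean vectors that defines the worst-case vector.

```python
from typing import List, Sequence


def pseudo_regret_vector(mu: Sequence[float], arms_played: Sequence[int]) -> List[float]:
    """Return [R_{mu,1}, ..., R_{mu,K}] for a fixed mean vector and arm sequence.

    R_{mu,i} = n * mu_i - sum_t mu_{I_t}, with n = len(arms_played).
    """
    n = len(arms_played)
    collected = sum(mu[a] for a in arms_played)  # expected reward actually collected
    return [n * mu_i - collected for mu_i in mu]


# Hypothetical example: K = 3 arms, horizon n = 4, arms indexed from 0.
mu = [0.5, 0.25, 0.75]
arms_played = [0, 2, 2, 1]
regrets = pseudo_regret_vector(mu, arms_played)
print(regrets)  # [-0.25, -1.25, 0.75]
```

The entries for suboptimal comparator arms can be negative, which is why the guarantee is stated coordinate by coordinate rather than only against the best arm.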

In structured and multi-objective settings, Pareto regret is instead defined on pairs or vectors of objectives. For the Multinomial Logit Bandit, the paper “On Pareto Optimality for the Multinomial Logistic Bandit” defines regret

$$R(T)=\sum_{t=1}^T \bigl[R(S^*,v)-R(S_t,v)\bigr]$$

together with estimation error, either

$$E_v(T)=\max_{i<j}\mathbb E\,\bigl|(v_i-v_j)-(\hat v_i-\hat v_j)\bigr|$$

or

[…]

The Pareto frontier is the set of achievable (regret, estimation-error) pairs such that no other policy has both strictly smaller regret and strictly smaller error (Zuo et al., 31 Jan 2025).
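The frontier condition can be made concrete with a short filter over candidate pairs. This is a minimal sketch under the stated rule that a pair is excluded only if another pair is strictly smaller in both coordinates; the function name `pareto_frontier` and the five candidate pairs are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (regret, estimation_error), both to be minimized


def pareto_frontier(points: List[Point]) -> List[Point]:
    """Keep the pairs that no other pair beats strictly in both coordinates."""

    def strictly_beats(p: Point, q: Point) -> bool:
        return p[0] < q[0] and p[1] < q[1]

    return [q for q in points if not any(strictly_beats(p, q) for p in points)]


# Hypothetical (regret, error) pairs for five policies.
candidates = [(10.0, 0.9), (20.0, 0.5), (25.0, 0.6), (40.0, 0.2), (50.0, 0.3)]
frontier = pareto_frontier(candidates)
print(frontier)  # [(10.0, 0.9), (20.0, 0.5), (40.0, 0.2)]
```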

In multi-objective bandits, one frequently measures distance to the Pareto front. “Pareto Regret Analyses in Multi-objective Multi-armed Bandit” defines

[…]

equivalently

[…]

and then defines stochastic Pareto regret by

[…]

The same paper introduces adversarial-style actual and pseudo-Pareto regrets through cumulative reward vectors and their Pareto fronts (Xu et al., 2022).

A further variant is hypervolume regret. “Optimal Scalarizations for Sublinear Hypervolume Regret” defines the dominated hypervolume indicator

[…]

and the regret

[…]

where the regret is evaluated on the set of collected objective vectors (Zhang, 2023).
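For intuition, the sketch below computes a dominated hypervolume and the corresponding deficit in two dimensions. It assumes the standard notion of dominated hypervolume for maximization (the area between a reference point and the points of the set) and is not taken from the cited paper; the names `hypervolume_2d`, `hypervolume_regret`, `front`, and `collected` are hypothetical.

```python
from typing import List, Tuple

Vec2 = Tuple[float, float]


def hypervolume_2d(points: List[Vec2], ref: Vec2) -> float:
    """Area of {x : ref <= x <= y for some y in points}, for maximization in 2-D."""
    # Only points strictly better than the reference in both coordinates add area.
    pts = [p for p in points if p[0] > ref[0] and p[1] > ref[1]]
    pts.sort(key=lambda p: p[0], reverse=True)  # sweep from largest first coordinate
    area, best_y = 0.0, ref[1]
    for x, y in pts:
        if y > best_y:  # this point contributes a new horizontal strip
            area += (x - ref[0]) * (y - best_y)
            best_y = y
    return area


def hypervolume_regret(front: List[Vec2], collected: List[Vec2], ref: Vec2) -> float:
    """Hypervolume of the true front minus hypervolume of the collected set."""
    return hypervolume_2d(front, ref) - hypervolume_2d(collected, ref)


# Hypothetical front and a partial sample of it, with the origin as reference point.
front = [(1.0, 4.0), (2.0, 3.0), (4.0, 1.0)]
collected = [(1.0, 4.0), (4.0, 1.0)]
print(hypervolume_2d(front, (0.0, 0.0)))                 # 9.0
print(hypervolume_regret(front, collected, (0.0, 0.0)))  # 2.0
```

Missing the middle point of the front costs two units of area here, even though both extreme points were collected.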

Setting | Performance object | Representative definition
Finite-armed bandit | Vector of worst-case regrets | $R^\pi_i=\sup_{\mu\in[0,1]^K}R^\pi_{\mu,i}$
MNL-Bandit | Regret-error pair | […]
MO-MAB | Distance to Pareto front | […]
Hypervolume optimization | Hypervolume deficit | […]

2. Pareto frontiers over regret guarantees in finite-armed bandits

The finite-armed theory in “The Pareto Regret Frontier for Bandits” gives an exact characterization, up to constants, of which worst-case regret vectors are achievable in the stochastic case. The key set is

[…]

Its boundary is the trade-off surface. If one insists on a very small regret on one favored arm, then some other arm must incur large regret; specifically, if the worst-case regret with respect to one arm is at most $B$, then the worst-case regret with respect to some other arm is $\Omega(nK/B)$ (Lattimore, 2015).

The lower bound is information-theoretic. Under Gaussian noise, there is a universal constant such that, for any policy,

[…]

Equivalently,

[…]

The proof constructs a family of Gaussian instances in which each arm is made optimal by a small shift and then uses change-of-measure arguments based on Pinsker’s inequality and KL divergence (Lattimore, 2015).

The upper bound is given by Unbalanced MOSS, which uses a nonuniform index and achieves […] for all arms whenever […]. Thus the Pareto regret frontier in the stochastic case is characterized exactly up to universal constants. A related adversarial construction, using a minor modification of Exp3-[…], yields […] and, for […],

[…]

which preserves the same inverse trade-off up to an extra logarithmic factor (Lattimore, 2015).

A central implication is that asymmetry is costly. Uniform minimax guarantees of order $\sqrt{nK}$ can hold simultaneously for all arms, but privileging one arm below that scale forces inverse-proportional worst-case regret on others. This is the canonical Pareto-regret phenomenon in single-objective bandits.

3. Regret–inference Pareto frontiers

In structured bandits, the Pareto frontier often relates immediate performance to statistical fidelity. For the MNL-Bandit, the learner chooses assortments $S_t$ and is evaluated both by cumulative revenue regret and by the accuracy of parameter or revenue-difference estimation. The paper proves lower bounds showing that if

[…]

then necessarily

[…]

and vice versa. In particular, for […], one cannot simultaneously make both […] and […]. When […], the result sharpens to

[…]

To attain the frontier, the paper proposes an epoch-based UCB algorithm with forced exploration probability […]. Ignoring polylogarithmic factors, it achieves

[…]

so that […]. The tuning parameter acts as an exploration dial: small values yield heavier exploration and lower estimation error, while large values yield lighter exploration and lower regret (Zuo et al., 31 Jan 2025).

A related two-objective formulation appears in fixed-budget best-arm identification. “Achieving the Pareto Frontier of Regret Minimization and Best Arm Identification in Multi-Armed Bandits” treats the pair formed by the pseudo-regret and the BAI failure probability. The BoBW-lil’UCB($\gamma$) algorithm uses a single parameter $\gamma$ to scale an LIL-style confidence radius. For sufficiently large […], its regret satisfies

[…]

while for […] above an instance-dependent threshold,

[…]

The accompanying lower bound shows that no algorithm can simultaneously achieve optimal regret minimization and optimal best-arm identification; the trade-off is unavoidable (Zhong et al., 2021).

Adaptive combinatorial experimentation yields the same pattern under richer feedback structures. “Adaptive Combinatorial Experimental Design: Pareto Optimality for Decision-Making and Inference” studies regret together with estimation error for super-arm or base-arm gaps. It shows that, both for super-arm gaps and base-arm gaps,

[…]

The algorithms MixCombKL for full-bandit feedback and MixCombUCB for semi-bandit feedback both randomize between a best-arm strategy and uniform exploration at a prescribed rate, and both achieve finite-time Pareto-optimal guarantees. The paper also states that richer feedback shrinks the Pareto frontier, with the main gain arising from improved estimation accuracy (Xie et al., 27 Feb 2026).
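The randomization between a best-arm strategy and uniform exploration can be illustrated on plain arms. The sketch below is a simplification and not MixCombKL or MixCombUCB: it ignores the combinatorial structure and the feedback model, and the function name `mixed_arm_probabilities` and the parameter name `alpha` for the mixing rate are hypothetical.

```python
from typing import List


def mixed_arm_probabilities(estimates: List[float], alpha: float) -> List[float]:
    """Play the empirically best arm w.p. 1 - alpha and a uniform arm w.p. alpha."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    k = len(estimates)
    best = max(range(k), key=lambda i: estimates[i])  # first index wins ties
    probs = [alpha / k] * k     # uniform exploration share
    probs[best] += 1.0 - alpha  # exploitation share
    return probs


# Hypothetical estimates for four arms and a mixing rate of 0.25.
probs = mixed_arm_probabilities([0.2, 0.7, 0.4, 0.1], alpha=0.25)
print(probs)  # [0.0625, 0.8125, 0.0625, 0.0625]
```

A larger `alpha` keeps every arm sampled more often, which helps estimation at the expense of regret; a smaller `alpha` does the opposite.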

Pareto-front identification in linear bandits adds a stopping-time perspective. “Learning the Pareto Front Using Bootstrapped Observation Samples” defines one-step Pareto regret through the domination amount

[…]

sets

[…]

and accumulates

[…]

Its algorithm reuses exploration samples multiple times and updates estimates along multiple context directions rather than only along the chosen context. The paper states that the resulting cumulative Pareto regret is within a logarithmic factor of the optimal regret among all algorithms that identify the Pareto front (Kim et al., 2023).

4. Direct Pareto-front regret in multi-objective bandits

Multi-objective bandits replace a single scalar reward by a vector reward. In this setting, the main conceptual issue is whether regret should be defined through scalarization, through distance to a Pareto front, or through a stronger coverage requirement. “Pareto Regret Analyses in Multi-objective Multi-armed Bandit” takes the distance-based route. Besides the stochastic definition, it defines the actual regret

[…]

and the pseudo-regret

[…]

A key structural result is that every Pareto regret is upper bounded by the marginal regret on any single coordinate. This permits reduction to a one-dimensional MAB on a fixed coordinate. The resulting algorithms MO-KS and MO-US achieve […] or […] depending on whether the regime is stochastic or adversarial, and the paper also establishes matching lower bounds. It further gives an adversarial attack showing that Pareto-UCB can suffer linear regret under adaptive perturbations (Xu et al., 2022).
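The per-arm version of this domination by a single coordinate can be checked numerically. The sketch assumes the commonly used Pareto suboptimality gap (the smallest uniform lift after which no arm exceeds the lifted mean vector in every coordinate), which may differ in detail from the paper's definition; the names `pareto_gap`, `marginal_gap`, and the mean vectors are hypothetical.

```python
from typing import Sequence

Means = Sequence[Sequence[float]]


def pareto_gap(means: Means, i: int) -> float:
    """Smallest eps >= 0 such that no arm exceeds means[i] + eps in every coordinate."""
    lift = max(
        min(mj_d - mi_d for mj_d, mi_d in zip(mj, means[i])) for mj in means
    )
    return max(0.0, lift)


def marginal_gap(means: Means, i: int, coord: int) -> float:
    """Ordinary single-objective gap of arm i on one fixed coordinate."""
    return max(m[coord] for m in means) - means[i][coord]


# Hypothetical two-objective instance: arms 0-2 are Pareto-optimal, arms 3-4 are not.
means = [(0.875, 0.25), (0.5, 0.625), (0.25, 0.875), (0.375, 0.375), (0.125, 0.125)]
gaps = [pareto_gap(means, i) for i in range(len(means))]
print(gaps)  # [0.0, 0.0, 0.0, 0.125, 0.375]

# The Pareto gap never exceeds the marginal gap on either coordinate.
print(all(pareto_gap(means, i) <= marginal_gap(means, i, d)
          for i in range(len(means)) for d in range(2)))  # True
```

Because the inequality holds arm by arm, summing over the arms actually played gives the same relation between the cumulative quantities, which is what makes a single-coordinate algorithm a valid reduction in this sketch.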

“Stochastic Multi-Objective Multi-Armed Bandits: Regret Definition and Algorithm” argues that the standard Drugan–Nowé definition has two drawbacks: it measures only a uniform lift in all objectives, and it may declare a single-objective-optimized arm low-regret even when it underperforms badly on other objectives. The paper therefore introduces a stronger vector-wise notion of regret relative to every Pareto-optimal arm, decomposed into Coverage-Regret and Cumulative Adjustment-Regret. It also defines Efficient Pareto-Optimal arms,

[…]

equivalently the subset of Pareto arms lying on the convex hull of the Pareto front. Its two-phase algorithm explores each arm, discards empirically dominated arms, solves a set-cover problem over domination sets, and then repeatedly plays the selected cover set. The resulting vector-wise regret bound is

[…]

or

[…]

with polynomial-time approximate set cover (Davoodi et al., 16 Jun 2025).
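The set-cover step of such a two-phase procedure can be approximated greedily. The following is a generic greedy set-cover sketch, not the paper's algorithm: how domination sets are built from empirical means is not modeled, and the function name `greedy_set_cover` and the toy sets are hypothetical.

```python
from typing import Dict, List, Set


def greedy_set_cover(universe: Set[str], sets: Dict[str, Set[str]]) -> List[str]:
    """Greedy approximation: repeatedly pick the set covering the most uncovered items."""
    uncovered = set(universe)
    chosen: List[str] = []
    while uncovered:
        # sorted() makes tie-breaking deterministic (alphabetical by set name)
        best = max(sorted(sets), key=lambda name: len(sets[name] & uncovered))
        if not sets[best] & uncovered:
            raise ValueError("the given sets do not cover the universe")
        chosen.append(best)
        uncovered -= sets[best]
    return chosen


# Hypothetical instance: each candidate arm "covers" a subset of the surviving arms.
universe = {"a", "b", "c", "d", "e"}
cover_sets = {"a": {"a", "b"}, "b": {"b"}, "c": {"b", "c", "d"}, "e": {"d", "e"}}
cover = greedy_set_cover(universe, cover_sets)
print(cover)  # ['c', 'a', 'e']
```

The greedy rule runs in polynomial time and returns a cover that may be larger than the optimal one, which is the usual trade-off behind an approximate set-cover variant.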

Hypervolume regret studies Pareto-front quality at the level of dominated volume rather than pointwise distance. For a positive weight vector, the hypervolume scalarization is

[…]

Drawing i.i.d. weight vectors uniformly on the positive orthant of the unit sphere and maximizing the resulting scalarization yields a simple randomized scalarization strategy. The main theorem gives

[…]

with high probability. The same work gives an ExploreUCB algorithm for multiobjective stochastic linear bandits with

[…]

and a sphere-packing lower-bound argument showing that any set of […] points misses hypervolume of order […] on a spherical Pareto frontier (Zhang, 2023).
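The randomized scalarization strategy can be sketched on a finite set of candidate points. The scalarization form used here, the minimum over coordinates of $(y_i/\lambda_i)^k$ for nonnegative objectives, is an assumption of this sketch rather than a quotation of the paper's formula, and all function and variable names are hypothetical.

```python
import math
import random
from typing import List, Sequence


def random_positive_unit_vector(k: int, rng: random.Random) -> List[float]:
    """Uniform direction on the positive orthant of the unit sphere."""
    g = [abs(rng.gauss(0.0, 1.0)) for _ in range(k)]
    norm = math.sqrt(sum(x * x for x in g))
    return [x / norm for x in g]


def hv_scalarization(y: Sequence[float], lam: Sequence[float]) -> float:
    """Assumed form: min_i (y_i / lam_i)^k for nonnegative objective vectors."""
    k = len(y)
    # The tiny floor avoids division by zero in the probability-zero case lam_i == 0.
    return min((yi / max(li, 1e-12)) ** k for yi, li in zip(y, lam))


def pick_by_random_scalarization(points: List[Sequence[float]], rounds: int,
                                 seed: int = 0) -> List[int]:
    """Each round: draw a random weight vector, pick the scalarization maximizer."""
    rng = random.Random(seed)
    k = len(points[0])
    picks = []
    for _ in range(rounds):
        lam = random_positive_unit_vector(k, rng)
        picks.append(max(range(len(points)),
                         key=lambda j: hv_scalarization(points[j], lam)))
    return picks


# Hypothetical candidates: five points on a quarter circle plus two dominated points.
front = [(math.cos(math.radians(a)), math.sin(math.radians(a)))
         for a in (15, 30, 45, 60, 75)]
dominated = [(0.3, 0.3), (0.5, 0.4)]
picks = pick_by_random_scalarization(front + dominated, rounds=200)
print(sorted(set(picks)))  # indices of the points that were ever selected
```

Under this assumed scalarization, a point that is strictly worse in every coordinate than some other candidate can never be the maximizer, so the two dominated points are never selected, while different weight draws spread the picks across the front.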

These formulations are not interchangeable. Distance-to-front regret, vector-wise coverage regret, and hypervolume regret assess different properties of a learned set: local domination gap, simultaneous approximation of all Pareto-optimal arms, and global frontier coverage.

5. Pareto regret beyond standard bandits

Pareto-regret reasoning also appears when the competing objectives are computational efficiency, communication, or strategic robustness rather than multiple reward coordinates. In online portfolio selection and online learning of quantum states, the Pareto frontier is drawn with per-round time and memory on the $x$-axis and worst-case regret on the $y$-axis. “Pushing the Efficiency-Regret Pareto Frontier for Online Learning of Portfolios and Quantum States” places Cover’s Universal Portfolios, Soft-Bayes, Ada-BarrONS, and BISONS on this frontier. BISONS attains

[…]

with […] time and memory per round, and the paper states that it strictly dominates Ada-BarrONS because […] and […]. The same work gives Schrödinger’s BISONS for quantum states, with regret […], and proves a negative result for plain log-barrier FTRL, which can suffer

[…]

regret (Zimmert et al., 2022).

Repeated games produce a different notion of Pareto domination, now between learning algorithms. In “Pareto-Optimal Algorithms for Learning in Games”, one algorithm Pareto-dominates another if […] holds for every optimizer payoff and […] holds for some optimizer payoff. The paper introduces the asymptotic menu of an algorithm, the convex closure of correlated strategy profiles asymptotically implementable by an adversary, and shows that all no-swap-regret algorithms share the same menu. It further states that well-known no-regret algorithms such as Multiplicative Weights and Follow The Regularized Leader are Pareto-dominated, whereas no-swap-regret is a sufficient condition for Pareto-optimality (Arunachaleswaran et al., 2024).

In welfare-oriented games, the regret target can be the unique welfare-maximizing feasible joint action. “Achieving Pareto Optimality in Games via Single-bit Feedback” defines

[…]

for the unique maximizer. Its decentralized algorithm SBC-PE uses exactly one bit per agent per round and achieves logarithmic expected regret. The proof sketch uses Chernoff and Hoeffding-type bounds to control the exploration phase and identify the welfare-maximizing joint action with high probability (Kiremitci et al., 30 Sep 2025).

Multi-player bandits without communication yield yet another frontier. “The Pareto Frontier of Instance-Dependent Guarantees in Multi-Player Multi-Armed Bandits with no Communication” studies

[…]

where […]. The paper proves that no uniform […] guarantee is achievable across all gap regimes. Instead, for breakpoints […], every algorithm must satisfy

[…]

for […], while a decentralized collision-free algorithm attains the same rate up to polynomial factors in […] and logarithms. The lower bound is based on topological obstructions at multiple scales (Liu et al., 2022).

6. Recurring principles, lower-bound methods, and interpretive points

Across these settings, Pareto regret formalizes a statement of non-simultaneous improvability. In finite-armed bandits, one cannot shrink worst-case regret on one arm without inflating it elsewhere (Lattimore, 2015). In MNL-Bandits, one cannot simultaneously drive both regret and estimation error below the frontier's balance point (Zuo et al., 31 Jan 2025). In fixed-budget best-arm identification, optimal regret minimization and exponentially small failure probability are incompatible within a single tuning of the algorithm (Zhong et al., 2021). In no-communication multi-player bandits, improvement at one gap scale necessarily worsens another (Liu et al., 2022).

The algorithmic mechanisms used to reach the frontier are correspondingly diverse but structurally similar. A recurrent device is explicit nonuniform exploration: forced complement exploration in MNL-Bandits, randomized mixing with uniform exploration in adaptive combinatorial experimentation, biased confidence widths in Unbalanced MOSS, and $\gamma$-dependent confidence radii in BoBW-lil’UCB (Zuo et al., 31 Jan 2025, Xie et al., 27 Feb 2026, Lattimore, 2015, Zhong et al., 2021). Another recurrent device is reduction: MO-KS and MO-US reduce vector-valued Pareto regret to one-dimensional marginal regret on a fixed coordinate, while hypervolume scalarization turns frontier coverage into randomized scalar maximization (Xu et al., 2022, Zhang, 2023).

The lower-bound techniques are equally characteristic. The literature uses KL- and Pinsker-based change of measure in finite-armed bandits, Le Cam/Fano arguments in MNL-Bandits, sphere packing for hypervolume regret, and topological obstructions for decentralized multi-player bandits (Lattimore, 2015, Zuo et al., 31 Jan 2025, Zhang, 2023, Liu et al., 2022). Chernoff and Hoeffding concentration, by contrast, appears on the upper-bound side, in the analysis of the exploration phase in single-bit coordination (Kiremitci et al., 30 Sep 2025). This suggests that Pareto-regret lower bounds often arise not from a single adversarial instance but from families of nearby instances whose distinctions are precisely the distinctions an algorithm must explore to improve one objective.

A persistent misconception is that Pareto regret is merely scalarized regret under a different name. Several papers explicitly distinguish the two. Distance-based and coverage-based Pareto regrets do not depend on any scalarization function, and the multi-objective bandit literature emphasizes that scalarized regret can fail on non-convex fronts (Xu et al., 2022, Davoodi et al., 16 Jun 2025). Conversely, hypervolume regret shows that carefully chosen nonlinear scalarizations can still provide a principled route to frontier exploration (Zhang, 2023). Another misconception is that no-regret alone is sufficient for strategic optimality. The repeated-games literature states that no-regret algorithms such as MWU and FTRL can be Pareto-dominated, while no-swap-regret is sufficient for asymptotic Pareto-optimality (Arunachaleswaran et al., 2024).

Taken together, these results make Pareto regret a unifying language for multi-criterion online learning. It does not denote one canonical scalar quantity. Rather, it denotes a class of frontier-based performance criteria that expose which trade-offs are information-theoretically unavoidable, which algorithms attain those trade-offs, and which apparently strong guarantees are dominated once additional criteria—statistical fidelity, coverage of the Pareto front, runtime, communication, or strategic robustness—are made explicit.
