
Multi-Objective Linear Contextual Bandits

Updated 3 December 2025
  • Multi-objective linear contextual bandits are frameworks that optimize several conflicting linear objectives using contextual information in a sequential decision process.
  • Algorithmic approaches like MOGLB-UCB and MOL-TS utilize upper confidence bounds and Thompson sampling to estimate parameters and construct approximate Pareto fronts.
  • Theoretical guarantees and empirical results demonstrate near-optimal Pareto regret bounds and effective trade-off management in applications such as personalized recommendations and resource allocation.

The multi-objective linear contextual bandit problem extends the classical stochastic contextual bandit framework by requiring the simultaneous optimization of multiple, possibly conflicting, linear objectives based on contextual information. At each round, the learner selects an arm associated with a context vector and receives a vector-valued reward. The goal is to minimize Pareto regret, a metric quantifying proximity to the Pareto-optimal set of actions, rather than maximizing a single scalarized reward. This paradigm is central in applications such as personalized recommendations, resource allocation, and other multi-criteria decision processes, especially where explicit trade-offs among objectives must be managed.

1. Formal Problem Statement and Pareto Regret

At time $t$, the learner observes a finite or infinite arm set $\mathcal{A}_t\subset\mathbb{R}^d$, where each arm $a$ is associated with a context vector $x_{a,t}\in\mathbb{R}^d$. Upon selecting arm $a_t$, the learner receives a stochastic reward vector $r_t=(r_{t,1},\ldots,r_{t,m})\in\mathbb{R}^m$ such that, for each objective $i$, $$r_{t,i} = x_{a_t,t}^\top \theta_i + \eta_{t,i},$$ where $\eta_{t,i}$ is zero-mean subgaussian noise and the parameters $\theta_i\in\mathbb{R}^d$, $i=1,\dots,m$, are unknown.

The expected reward vector for arm $a$ at time $t$ is $\mu_t(a) = (x_{a,t}^\top\theta_1,\ldots, x_{a,t}^\top\theta_m)$. Pareto dominance is defined as follows: $b$ dominates $a$ (denoted $b\succ a$) iff $\mu_{t,i}(b)\ge\mu_{t,i}(a)$ for all $i$ and $\mu_{t,j}(b)>\mu_{t,j}(a)$ for some $j$. The Pareto front $\mathcal{P}_t$ consists of the arms not dominated by any other arm: $$\mathcal{P}_t = \{a\in\mathcal{A}_t : \nexists\, b\in\mathcal{A}_t \text{ with } b\succ a\}.$$ Pareto regret, the key performance metric, is given by

$$\mathrm{Regret}_T = \sum_{t=1}^T \Delta_t(a_t),$$

where $\Delta_t(a_t) = \min_{a^*\in\mathcal{P}_t}\ \max_{i=1,\ldots,m}\ [\mu_{t,i}(a^*)-\mu_{t,i}(a_t)]_+$ and $[x]_+ = \max\{0, x\}$. This quantity is the minimal uniform increment needed to move $a_t$ onto the Pareto front (Park et al., 30 Nov 2025, Lu et al., 2019).
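To make the definitions concrete, the following is a minimal sketch of the dominance check, the Pareto-front extraction, and the per-round Pareto regret $\Delta_t(a_t)$. The helper names and the toy four-arm instance are illustrative assumptions, not material from the cited papers.

```python
# Pareto dominance, Pareto front, and per-round Pareto regret (illustrative sketch).
import numpy as np

def dominates(mu_b: np.ndarray, mu_a: np.ndarray) -> bool:
    """True iff arm b Pareto-dominates arm a: >= in every objective, > in at least one."""
    return bool(np.all(mu_b >= mu_a) and np.any(mu_b > mu_a))

def pareto_front(mu: np.ndarray) -> np.ndarray:
    """Indices of arms not dominated by any other arm; mu has shape (K, m)."""
    K = mu.shape[0]
    return np.array([a for a in range(K)
                     if not any(dominates(mu[b], mu[a]) for b in range(K) if b != a)])

def pareto_regret(mu: np.ndarray, chosen: int) -> float:
    """Delta_t(a_t) = min over Pareto-optimal a* of max_i [mu_i(a*) - mu_i(a_t)]_+."""
    front = pareto_front(mu)
    gaps = np.clip(mu[front] - mu[chosen], 0.0, None)   # component-wise [.]_+
    return float(np.min(np.max(gaps, axis=1)))

# Toy instance: 4 arms, m = 2 objectives.
mu = np.array([[0.9, 0.1], [0.1, 0.9], [0.5, 0.5], [0.3, 0.3]])
print(pareto_front(mu))       # arms 0, 1, 2 are Pareto-optimal
print(pareto_regret(mu, 3))   # positive gap for the dominated arm 3
```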

2. Algorithmic Approaches

Several distinct methodologies, notably UCB-based and Thompson sampling–based algorithms, have been proposed to address the exploration–exploitation tradeoff in this multi-objective setting.

Upper Confidence Bound Approaches (MOGLB-UCB)

For the case where the reward follows a (possibly generalized) linear model, the MOGLB-UCB algorithm maintains, for each objective, an online Newton-type parameter estimate and a confidence ellipsoid. At each round, for each arm and objective, an upper confidence bound (UCB) is constructed: $$\mathrm{UCB}_{t,i}(x) = \hat\theta_{t,i}^\top x + \sqrt{\gamma_t\, \|x\|_{Z_t^{-1}}^2},$$ where $\hat\theta_{t,i}$ is the online estimate, $Z_t$ the regularization matrix, and $\gamma_t$ a parameter scaling with the dimension and time. An approximate Pareto front $\widehat{\mathcal{O}}_t$ is constructed via non-dominance in UCB space, and the algorithm selects an arm uniformly at random from this set. Parameter updates then follow from the observed rewards (Lu et al., 2019).
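The selection rule can be summarized by the following simplified sketch for the purely linear case. It uses ordinary ridge-regression estimates in place of the online Newton step, a single shared design matrix $Z_t$, and a constant confidence radius $\gamma$; these simplifications, and all names (`select_arm`, `update`, the toy dimensions), are illustrative assumptions rather than the exact MOGLB-UCB procedure.

```python
# Simplified UCB-in-Pareto-space selection for the linear multi-objective case.
import numpy as np

rng = np.random.default_rng(0)
d, m, lam, gamma = 5, 2, 1.0, 1.0      # dimension, objectives, ridge and confidence params

Z = lam * np.eye(d)                    # shared regularized design matrix Z_t
b = np.zeros((m, d))                   # per-objective sums of reward-weighted contexts

def ucb_matrix(X: np.ndarray) -> np.ndarray:
    """UCB_{t,i}(x) = theta_hat_i^T x + sqrt(gamma * x^T Z^{-1} x); returns shape (K, m)."""
    Z_inv = np.linalg.inv(Z)
    theta_hat = b @ Z_inv              # (m, d) ridge estimates, one per objective
    width = np.sqrt(gamma * np.einsum('kd,de,ke->k', X, Z_inv, X))
    return X @ theta_hat.T + width[:, None]

def select_arm(X: np.ndarray) -> int:
    """Pick uniformly at random from the non-dominated set in UCB space."""
    U = ucb_matrix(X)
    K = U.shape[0]
    front = [a for a in range(K)
             if not any(np.all(U[c] >= U[a]) and np.any(U[c] > U[a])
                        for c in range(K) if c != a)]
    return int(rng.choice(front))

def update(x: np.ndarray, reward: np.ndarray) -> None:
    """Rank-one update of Z_t and of each objective's regression statistics."""
    global Z
    Z = Z + np.outer(x, x)
    for i in range(m):
        b[i] += reward[i] * x

# One illustrative round with K = 6 random contexts.
X_t = rng.normal(size=(6, d))
a_t = select_arm(X_t)
update(X_t[a_t], rng.normal(size=m))   # stand-in for the observed reward vector
```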

Thompson Sampling Approaches (MOL-TS)

The MOL-TS algorithm independently samples parameter vectors $\widetilde\theta_{i,t}$ from the posterior for each objective. For each arm, the induced sampled reward vector $\widetilde{r}_t(a)$ defines a "sampled" Pareto front $\widetilde{\mathcal{P}}_t$. The algorithm selects an arm from $\widetilde{\mathcal{P}}_t$, observes the rewards, and updates the Bayesian parameter posteriors. This approach achieves a worst-case Pareto regret bound of $\widetilde{O}(d^{3/2}\sqrt{T})$, closely paralleling the single-objective randomized linear bandit rate (Park et al., 30 Nov 2025).
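A minimal sketch of the per-round sampling-and-selection step follows, assuming a Gaussian posterior $\mathcal{N}(\hat\theta_{t,i},\, v^2 Z_t^{-1})$ per objective as in standard linear Thompson sampling; the posterior scale $v$ and the reuse of the design matrix $Z_t$ from the UCB sketch above are assumptions, not details from the cited paper.

```python
# Thompson-sampling selection over the sampled Pareto front (illustrative sketch).
import numpy as np

def mol_ts_select(X: np.ndarray, theta_hat: np.ndarray, Z: np.ndarray,
                  v: float, rng: np.random.Generator) -> int:
    """X: (K, d) contexts; theta_hat: (m, d) posterior means. Returns the chosen arm."""
    Z_inv = np.linalg.inv(Z)
    # Independently sample one parameter vector per objective.
    theta_tilde = np.stack([rng.multivariate_normal(theta_hat[i], v**2 * Z_inv)
                            for i in range(theta_hat.shape[0])])
    R = X @ theta_tilde.T              # sampled reward vectors, shape (K, m)
    K = R.shape[0]
    front = [a for a in range(K)       # sampled Pareto front via non-dominance
             if not any(np.all(R[c] >= R[a]) and np.any(R[c] > R[a])
                        for c in range(K) if c != a)]
    return int(rng.choice(front))
```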

3. Theoretical Guarantees and Minimax Bounds

State-of-the-art regret bounds are summarized as follows:

| Algorithm | Regret Bound | Assumptions | Reference |
| --- | --- | --- | --- |
| MOGLB-UCB | $\widetilde{O}(d\sqrt{T})$ | Generalized linear (incl. linear) | (Lu et al., 2019) |
| MOL-TS | $\widetilde{O}(d^{3/2}\sqrt{T})$ | Linear reward, subgaussian noise | (Park et al., 30 Nov 2025) |

Scalarization-based reductions typically yield suboptimal multi-objective regret, particularly with respect to covering the true Pareto front.

Lower bounds indicate that for linear contextual bandits with $m$ objectives, the minimax rate matches the single-objective case up to factors depending on the parameter space geometry and the number of objectives (Lu et al., 2019, Park et al., 30 Nov 2025). In constrained variants (e.g., linear costs), the regret scales as $\widetilde{O}\bigl(\tfrac{d\sqrt{T}}{\tau - c_0}\bigr)$, where $\tau$ is the constraint threshold and $c_0$ is the known safe cost (Pacchiano et al., 2020, Pacchiano et al., 15 Jan 2024).

4. Extensions: Constraints, Knapsack Structures, and Beyond

Incorporating explicit constraints transforms the problem into a multi-objective control scenario. There are two main constraint models:

  • Stage-wise Linear Constraints: At each round, the selected arm must satisfy cost constraints either with high probability or in expectation. UCB-based "optimistic–pessimistic" methods are used, employing distinct scaling factors for the reward and cost confidence sets (Pacchiano et al., 15 Jan 2024); a minimal selection sketch is given after this list.
  • Global Knapsack Constraints: The total accumulated cost across $T$ rounds must not exceed a budget. Algorithms reduce the multi-objective problem to an instance of contextual bandits with a known Lipschitz concave (or linear) objective, employing importance-weighted estimates and epoch-based policy optimization using oracles over policy classes (Agrawal et al., 2015).
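The following rough sketch illustrates the optimistic–pessimistic idea for the stage-wise constrained case: rewards are treated optimistically, the cost pessimistically, and the arm is chosen from the optimistic Pareto front of those arms whose pessimistic cost estimate stays below the threshold $\tau$. The confidence radii $\beta_r$, $\beta_c$, the single linear cost, and the fallback to a known-safe arm are illustrative assumptions rather than the exact procedures of the cited papers.

```python
# Stage-wise constrained selection: optimistic rewards, pessimistic cost (illustrative sketch).
import numpy as np

def safe_optimistic_select(X, theta_hat_r, theta_hat_c, Z_inv,
                           beta_r, beta_c, tau, safe_arm, rng):
    """X: (K, d) contexts; theta_hat_r: (m, d) reward estimates;
    theta_hat_c: (d,) cost estimate. Returns the index of the chosen arm."""
    width = np.sqrt(np.einsum('kd,de,ke->k', X, Z_inv, X))
    reward_ucb = X @ theta_hat_r.T + beta_r * width[:, None]   # optimistic reward estimates
    cost_ucb = X @ theta_hat_c + beta_c * width                # pessimistic cost estimates
    feasible = np.flatnonzero(cost_ucb <= tau)
    if feasible.size == 0:
        return safe_arm                 # nothing certified safe: play the known-safe arm
    U = reward_ucb[feasible]
    K = U.shape[0]
    front = [a for a in range(K)        # optimistic Pareto front among feasible arms
             if not any(np.all(U[c] >= U[a]) and np.any(U[c] > U[a])
                        for c in range(K) if c != a)]
    return int(feasible[rng.choice(front)])
```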

Regret analysis incorporates cost slackness, with minimax lower bounds quantifying the "price of safety" in constraint satisfaction (Pacchiano et al., 2020, Pacchiano et al., 15 Jan 2024). Specific algorithms for knapsack constraints achieve computational efficiency by leveraging coordinate-descent solvers with optimization oracles, maintaining near-optimal rates.

5. Empirical Performance and Practical Considerations

Empirical results are reported across synthetic and real-world datasets:

  • MOGLB-UCB outperforms P-UCB, S-UCB (scalarized), and P-TS baselines in both cumulative Pareto regret and Jaccard similarity to the Pareto front (Lu et al., 2019).
  • MOL-TS achieves lower Pareto and hypervolume regret than scalarization-based TS and UCB variants, and empirically converges to the Pareto front more rapidly (Park et al., 30 Nov 2025).
  • Symmetry of random selection from the (approximate) Pareto front ensures fairness among optimal arms, an important property in multi-objective fairness-sensitive contexts (Lu et al., 2019).
  • In constrained scenarios, the empirical cost is maintained below the threshold, while regret grows as the cost slack shrinks (Pacchiano et al., 2020, Pacchiano et al., 15 Jan 2024).

6. Open Questions and Research Directions

Prospective research areas include:

  • Extension to generalized linear and nonlinear reward models beyond the linear parametrization (Park et al., 30 Nov 2025, Lu et al., 2019).
  • Incorporation of dominant or prioritized objectives, and development of instance-dependent (gap-dependent) Pareto regret lower bounds.
  • Efficient arm selection and regret minimization in high-dimensional, combinatorial, or neural-represented action spaces.
  • Batched or parallel multi-objective selection under multi-constraint structures.
  • Algorithmic improvements for cumulative or non-linear cost and resource constraints, including adaptive confidence scaling and efficient support-lemma-based techniques (Pacchiano et al., 15 Jan 2024, Agrawal et al., 2015).

Recent advances provide a cohesive foundation for multi-objective linear contextual bandits with provably efficient, scalable, and fair learning algorithms, but scaling, expressivity, and nuanced trade-offs among objectives remain active and challenging research areas.
