Multi-Objective Linear Contextual Bandits
- Multi-objective linear contextual bandits are frameworks that optimize several conflicting linear objectives using contextual information in a sequential decision process.
- Algorithmic approaches like MOGLB-UCB and MOL-TS utilize upper confidence bounds and Thompson sampling to estimate parameters and construct approximate Pareto fronts.
- Theoretical guarantees and empirical results demonstrate near-optimal Pareto regret bounds and effective trade-off management in applications such as personalized recommendations and resource allocation.
The multi-objective linear contextual bandit problem extends the classical stochastic contextual bandit framework by requiring the simultaneous optimization of multiple, possibly conflicting, linear objectives based on contextual information. At each round, the learner selects an arm associated with a context vector and receives a vector-valued reward. The goal is to minimize Pareto regret, a metric quantifying proximity to the Pareto-optimal set of actions, rather than maximizing a single scalarized reward. This paradigm is central in applications such as personalized recommendations, resource allocation, and other multi-criteria decision processes, especially where explicit trade-offs among objectives must be managed.
1. Formal Problem Statement and Pareto Regret
At time $t$, the learner observes a finite or infinite arm set $\mathcal{A}_t$, where each arm $a \in \mathcal{A}_t$ is associated with a context vector $x_{a,t} \in \mathbb{R}^d$. Upon selecting arm $a_t$, the learner receives a stochastic reward vector $r_t \in \mathbb{R}^m$, such that for each objective $i \in \{1,\dots,m\}$: $r_{t,i} = x_{a_t,t}^\top \theta_i + \eta_{t,i},\qquad \eta_{t,i}\ \text{is zero-mean, } \sigma\text{-subgaussian}$ with unknown parameters $\theta_1,\dots,\theta_m \in \mathbb{R}^d$.
The expected reward vector for arm $a$ at time $t$ is $\mu_{a,t} = \big(x_{a,t}^\top\theta_1,\dots,x_{a,t}^\top\theta_m\big)$. Pareto dominance is defined as: $\mu_{b,t}$ dominates $\mu_{a,t}$ (denoted $\mu_{a,t} \prec \mu_{b,t}$) iff $\mu_{b,t,i} \ge \mu_{a,t,i}$ for all $i$ and $\mu_{b,t,i} > \mu_{a,t,i}$ for some $i$. The Pareto front $\mathcal{O}_t^*$ consists of arms not dominated by any other arm. Pareto regret, the key performance metric, is given by $\mathrm{PR}(T) = \sum_{t=1}^{T} \Delta_t$,
where $\Delta_t = \inf\{\epsilon \ge 0 : \mu_{a_t,t} + \epsilon\,\mathbf{1} \text{ is not dominated by any } \mu_{a,t},\ a \in \mathcal{A}_t\}$ and $\mathbf{1}$ denotes the all-ones vector. This represents the minimal uniform increment needed to move the chosen arm's expected reward onto the Pareto front (Park et al., 30 Nov 2025, Lu et al., 2019).
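As a concrete illustration of these definitions, the following sketch computes the Pareto front of a finite set of expected reward vectors and the per-round gap $\Delta_t$; it assumes known means and finitely many arms, and all function and variable names are illustrative rather than taken from the cited papers.

```python
import numpy as np

def pareto_front(mu: np.ndarray) -> np.ndarray:
    """Indices of the non-dominated rows of mu (shape: n_arms x m objectives)."""
    n = mu.shape[0]
    keep = np.ones(n, dtype=bool)
    for a in range(n):
        for b in range(n):
            if a != b and np.all(mu[b] >= mu[a]) and np.any(mu[b] > mu[a]):
                keep[a] = False  # arm a is Pareto-dominated by arm b
                break
    return np.flatnonzero(keep)

def pareto_gap(mu: np.ndarray, chosen: int) -> float:
    """Minimal uniform increment eps >= 0 so that mu[chosen] + eps*1 is non-dominated."""
    others = np.delete(mu, chosen, axis=0)
    if others.size == 0:
        return 0.0
    # arm b stops dominating the shifted arm once eps exceeds min_i (mu[b, i] - mu[chosen, i])
    return float(max(0.0, np.max(np.min(others - mu[chosen], axis=1))))
```

The cumulative Pareto regret over $T$ rounds is then the sum of `pareto_gap` values of the arms actually played.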
2. Algorithmic Approaches
Several distinct methodologies have been proposed, notably UCB-based and Thompson sampling–based algorithms, to address the exploration-exploitation tradeoff in this multi-objective setting.
Upper Confidence Bound Approaches (MOGLB-UCB)
For the case where the reward follows a (possibly generalized) linear model, the MOGLB-UCB algorithm maintains, for each objective, an online Newton-type parameter estimate and a confidence ellipsoid. At each round, for each arm $a$ and objective $i$, an upper confidence bound (UCB) is constructed: $\mathrm{UCB}_{t,i}(a) = x_{a,t}^\top \hat\theta_{t,i} + \beta_t\,\|x_{a,t}\|_{M_t^{-1}}$, where $\hat\theta_{t,i}$ is the online estimate, $M_t$ the regularization matrix, and $\beta_t$ a parameter scaling with dimension and time. An approximate Pareto front $\widehat{\mathcal{O}}_t$ is constructed via non-dominance in UCB space, and the algorithm selects an arm uniformly at random from this set. Updates ensue based on observed rewards (Lu et al., 2019).
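A minimal sketch of this selection step is given below, assuming a plain linear model with a single shared design matrix and ridge-style estimates in place of MOGLB-UCB's online Newton updates; it reuses the `pareto_front` helper from the sketch in Section 1, and all names are illustrative.

```python
import numpy as np

def ucb_pareto_select(X, theta_hat, M_inv, beta, rng):
    """One round of UCB-based multi-objective arm selection.

    X         : (n_arms, d) context vectors for the current round
    theta_hat : (m, d) per-objective parameter estimates
    M_inv     : (d, d) inverse of the regularized design matrix
    beta      : confidence-width scaling, growing with dimension and log t
    """
    widths = np.sqrt(np.einsum('ad,dk,ak->a', X, M_inv, X))   # ||x_a||_{M^{-1}} for each arm
    ucb = X @ theta_hat.T + beta * widths[:, None]            # (n_arms, m) optimistic reward vectors
    front = pareto_front(ucb)                                 # non-dominance computed in UCB space
    return int(rng.choice(front))                             # uniform draw from the approximate Pareto front
```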
Thompson Sampling Approaches (MOL-TS)
The MOL-TS algorithm independently samples a parameter vector from the posterior for each objective. For each arm, the induced sampled reward vector defines a “sampled” Pareto front $\widetilde{\mathcal{O}}_t$. The algorithm selects an arm from $\widetilde{\mathcal{O}}_t$, observes the reward vector, and updates the per-objective Bayesian posteriors. This approach achieves a worst-case Pareto regret bound of $\tilde{O}(d^{3/2}\sqrt{T})$, closely paralleling the single-objective randomized linear bandit rate (Park et al., 30 Nov 2025).
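The sampling step can be sketched as follows, using Gaussian posteriors $\mathcal{N}(\hat\theta_{t,i},\, v^2 M_t^{-1})$ as a stand-in for the posterior maintained by MOL-TS; the posterior form, the uniform draw from the sampled front, and all names are assumptions for illustration, and `pareto_front` is the helper defined earlier.

```python
import numpy as np

def mol_ts_select(X, theta_hat, M_inv, v, rng):
    """One round of Thompson-sampling-based multi-objective arm selection.

    X         : (n_arms, d) context vectors for the current round
    theta_hat : (m, d) per-objective posterior means
    M_inv     : (d, d) shared posterior covariance shape
    v         : posterior-variance scaling parameter
    """
    m, _ = theta_hat.shape
    # independently sample one parameter vector per objective
    theta_tilde = np.stack([
        rng.multivariate_normal(theta_hat[i], (v ** 2) * M_inv) for i in range(m)
    ])
    sampled = X @ theta_tilde.T          # (n_arms, m) sampled mean reward vectors
    front = pareto_front(sampled)        # the "sampled" Pareto front
    return int(rng.choice(front))
```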
3. Theoretical Guarantees and Minimax Bounds
State-of-the-art regret bounds are summarized as follows:
| Algorithm | Regret Bound | Assumptions | Reference |
|---|---|---|---|
| MOGLB-UCB | $\tilde{O}(d\sqrt{T})$ | Generalized linear reward (incl. linear) | (Lu et al., 2019) |
| MOL-TS | $\tilde{O}(d^{3/2}\sqrt{T})$ | Linear reward, subgaussian noise | (Park et al., 30 Nov 2025) |
Scalarization-based reductions are typically suboptimal in terms of multi-objective regret, and in particular may fail to cover the full Pareto front.
Lower bounds indicate that for linear contextual bandits with $m$ objectives, the minimax rate matches the single-objective case up to factors depending on the parameter space geometry and the number of objectives (Lu et al., 2019, Park et al., 30 Nov 2025). In constrained variants (e.g., with linear costs), the regret scales as $\tilde{O}\big(\sqrt{T}/(\tau - c_0)\big)$, where $\tau$ is the constraint threshold and $c_0$ is the cost of the known safe action (Pacchiano et al., 2020, Pacchiano et al., 15 Jan 2024).
4. Extensions: Constraints, Knapsack Structures, and Beyond
Incorporating explicit constraints transforms the problem into a multi-objective control scenario. There are two main constraint models:
- Stage-wise Linear Constraints: At each round, the selected arm must satisfy cost constraints either with high probability or in expectation. UCB-based “optimistic–pessimistic” methods are used, employing distinct scaling factors for the reward and cost confidence sets (Pacchiano et al., 15 Jan 2024); a selection-step sketch is given after this list.
- Global Knapsack Constraints: The total accumulated cost across rounds must not exceed a budget. Algorithms reduce the multi-objective problem to an instance of contextual bandits with a known Lipschitz concave (or linear) objective, employing importance-weighted estimates and epoch-based policy optimization with optimization oracles over policy classes (Agrawal et al., 2015).
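As an illustration of the stage-wise model, the sketch below forms optimistic reward UCBs and pessimistic cost UCBs, filters out arms whose pessimistic cost exceeds the threshold, and applies Pareto selection to the survivors; the fallback to a known safe arm and all parameter names are assumptions for illustration, not the exact procedure of the cited algorithms, and `pareto_front` is the helper from Section 1.

```python
import numpy as np

def constrained_pareto_select(X, theta_hat, w_hat, M_inv, beta_r, beta_c, tau, safe_arm, rng):
    """One round of stage-wise constrained multi-objective selection.

    X              : (n_arms, d) context vectors
    theta_hat      : (m, d) reward parameter estimates; w_hat: (d,) cost parameter estimate
    M_inv          : (d, d) inverse of the regularized design matrix
    beta_r, beta_c : confidence widths for rewards (optimistic) and costs (pessimistic)
    tau            : stage-wise cost threshold; safe_arm: index of an arm known to satisfy it
    """
    widths = np.sqrt(np.einsum('ad,dk,ak->a', X, M_inv, X))
    ucb_reward = X @ theta_hat.T + beta_r * widths[:, None]   # optimism on rewards
    ucb_cost = X @ w_hat + beta_c * widths                    # pessimism on costs
    feasible = np.flatnonzero(ucb_cost <= tau)
    if feasible.size == 0:
        return safe_arm                                       # fall back to the known safe action
    front = pareto_front(ucb_reward[feasible])
    return int(feasible[rng.choice(front)])
```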
Regret analysis incorporates cost slackness, with minimax lower bounds quantifying the "price of safety" in constraint satisfaction (Pacchiano et al., 2020, Pacchiano et al., 15 Jan 2024). Specific algorithms for knapsack constraints achieve computational efficiency by leveraging coordinate-descent solvers with optimization oracles, maintaining near-optimal rates.
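The importance-weighted estimates underlying the knapsack-style reductions can be illustrated with a standard inverse-propensity (IPS) value estimator for a candidate policy over logged rounds; this shows only the estimation building block, not the epoch or oracle machinery of (Agrawal et al., 2015), and the data layout is an assumption.

```python
def ips_policy_value(logged, target_policy):
    """Inverse-propensity estimate of a target policy's expected reward per round.

    logged        : list of (contexts, arm, prob_of_arm, reward) tuples from past rounds
    target_policy : callable(contexts) -> array of action-selection probabilities
    """
    total = 0.0
    for contexts, arm, p_logged, reward in logged:
        pi = target_policy(contexts)
        total += reward * pi[arm] / p_logged   # reweight by target / logging probability ratio
    return total / max(len(logged), 1)
```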
5. Empirical Performance and Practical Considerations
Empirical results are reported across synthetic and real-world datasets:
- MOGLB-UCB outperforms P-UCB, S-UCB (scalarized), and P-TS baselines in both cumulative Pareto regret and Jaccard similarity to the Pareto front (Lu et al., 2019).
- MOL-TS achieves lower Pareto and hypervolume regret than scalarization-based TS and UCB variants, and empirically converges to the Pareto front more rapidly (Park et al., 30 Nov 2025).
- Selecting uniformly at random from the (approximate) Pareto front treats all Pareto-optimal arms symmetrically, ensuring fairness among them, an important property in fairness-sensitive multi-objective settings (Lu et al., 2019).
- In constrained scenarios, the empirical cost is consistently maintained below the threshold, while regret grows as the cost slack decreases (Pacchiano et al., 2020, Pacchiano et al., 15 Jan 2024).
6. Open Questions and Research Directions
Prospective research areas include:
- Extension to generalized linear and nonlinear reward models beyond the linear parametrization (Park et al., 30 Nov 2025, Lu et al., 2019).
- Incorporation of dominant or prioritized objectives, and development of instance-dependent (gap-dependent) Pareto regret lower bounds.
- Efficient arm selection and regret minimization in high-dimensional, combinatorial, or neural-represented action spaces.
- Batched or parallel multi-objective selection under multi-constraint structures.
- Algorithmic improvements for cumulative or non-linear cost and resource constraints, including adaptive confidence scaling and efficient exploitation of support lemmas (Pacchiano et al., 15 Jan 2024, Agrawal et al., 2015).
Recent advances provide a cohesive foundation for multi-objective linear contextual bandits with provably efficient, scalable, and fair learning algorithms, but scaling, expressivity, and nuanced trade-offs among objectives remain active and challenging research areas.