BanditLP: LP-Enhanced Bandit Methods
- BanditLP is a family of methods that combine bandit learning with linear programming to manage complex, multi-level constraints in online decision environments.
- It leverages techniques such as neural Thompson sampling and dual decomposition to achieve sublinear regret and strict, per-round constraint satisfaction.
- Key applications include web-scale recommendation, restless bandits, extensive-form games, and fidelity reward models, offering scalable solutions for diverse operational challenges.
BanditLP denotes a family of methodologies and models that combine multi-armed or contextual bandit learning with linear programming (LP) formulations, typically to address structural, combinatorial, or large-scale constraint requirements in online decision systems. The term is applied to several distinct but conceptually related strands of work in the bandit literature, spanning large-scale recommendation under multi-stakeholder constraints, fidelity rewards in loyalty points bandits, and linear-program-based policies for restless and combinatorial bandits.
1. Multi-Stakeholder Contextual BanditLP for Large-Scale Constrained Recommendation
BanditLP (Nguyen et al., 22 Jan 2026) formalizes the web-scale, multi-stakeholder recommendation problem as an online contextual bandit with per-round, hard combinatorial constraints. It enables simultaneous optimization for several objectives—user satisfaction, provider fairness, and platform budget—by integrating neural Thompson sampling and LP-based action selection:
- Problem Setup: Let U, I, and P denote the sets of users, items, and providers. At each round t, a context is observed for every user–item pair (u, i), with an unknown reward and unknown auxiliary costs.
- Constraints: Platform-level (global budgets), provider-level (per-provider quotas/fairness), and user-level (caps) constraints are imposed via linear inequalities over the action variables (the probabilities of recommending item i to user u at round t).
Workflow and Key Components
- Learning (Neural Thompson Sampling): Bayesian neural nets with Laplace approximations generate samples for all outcomes and costs.
- Action Selection (LP Optimization): At serving time, a massive-scale LP is solved to maximize the sampled cumulative reward subject to all constraints.
- Solver (DuaLip): A dual decomposition and partial Lagrangian approach transforms the primal LP into a strongly convex QP. The dual is maximized via efficient first-order and projection methods, exploiting separability and strong duality to scale to web-scale variable counts.
- Regret & Feasibility: The approach inherits sublinear Bayesian regret bounds for the stochastic contextual bandit, and guarantees per-round constraint satisfaction (zero violation) due to strict enforcement at each epoch.
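As a minimal sketch of this workflow (not the production system), the loop below samples rewards from a Gaussian posterior stand-in for the neural Thompson step and solves the per-round LP with scipy.optimize.linprog in place of DuaLip; the single platform budget and per-user caps are illustrative assumptions:

```python
# Sketch: Thompson sampling followed by LP-based action selection.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n_users, n_items = 4, 5
n = n_users * n_items          # one decision variable per (user, item) pair

# Posterior over rewards (Gaussian stand-in for the neural posterior).
post_mean = rng.uniform(0, 1, size=n)
post_std = 0.1 * np.ones(n)

# Thompson step: sample one reward vector from the posterior.
sampled_reward = rng.normal(post_mean, post_std)

# Known per-pair costs and a global platform budget (platform-level constraint).
cost = rng.uniform(0.5, 1.5, size=n)
budget = 3.0

# User-level caps: each user is shown at most one item in expectation.
A_user = np.zeros((n_users, n))
for u in range(n_users):
    A_user[u, u * n_items:(u + 1) * n_items] = 1.0

# LP: maximize sampled reward subject to budget + per-user caps, x in [0, 1].
res = linprog(
    c=-sampled_reward,                       # linprog minimizes
    A_ub=np.vstack([cost[None, :], A_user]),
    b_ub=np.concatenate([[budget], np.ones(n_users)]),
    bounds=[(0.0, 1.0)] * n,
)
x = res.x  # fractional serving probabilities; feasible by construction
```

Because the constraints are enforced inside the LP at every round, the served solution is feasible by construction, which is exactly the per-round zero-violation property described above.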
Empirical Outcomes: In both synthetic benchmarks and production deployments (e.g., LinkedIn email marketing), BanditLP achieves significant cumulative reward gains, tight constraint adherence (<0.5% violation on public datasets), and demonstrates exploration benefits for long-term metrics even in highly constrained production settings (Nguyen et al., 22 Jan 2026).
2. LP-Based Policies for Restless Bandits and Mean-Field Regimes
LP-based decision rules for restless Markovian bandits (Gast et al., 2021) use a linear programming relaxation to approximate the value of the finite-horizon control problem (with N bandit arms and a per-round activation budget) in the limit N → ∞:
- Finite-Horizon LP Relaxation: For a finite state space, time horizon T, and a fixed fraction α of arms activated per round, the relaxation optimizes expected total reward over state–action occupation measures, subject to mean-field dynamics and budget constraints.
- Asymptotic Optimality Hierarchy:
- Asymptotic optimality: the per-arm value gap between the LP relaxation and the actual policy vanishes as N → ∞.
- O(1/√N)-rate and exponential-rate optimality are formalized via uniform constants in the convergence rates.
- Necessary and sufficient conditions (LP-compatibility, Lipschitz policy maps, local affinity) are precisely characterized.
LP-Index and LP-Update Policies
- LP-Index Policy: Pre-solve the LP and derive dual variables, yielding per-state indices that implement Lagrange-prioritized activation; arms are ordered by these indices to fill the activation budget.
- Guarantees an O(1/√N) optimality gap for arbitrary models and an exponentially small gap for non-degenerate LP solutions.
- LP-Update Policy: Re-solve the LP at each epoch from the observed empirical state fractions, yielding much smaller gaps in practice (better constants), with the same asymptotics and improved robustness under model mismatch.
- Numerical Verification: LP-update dominates LP-index under model misspecification and achieves empirical reward closer to the theoretical LP upper bound for moderate N (Gast et al., 2021).
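A toy instance of the finite-horizon relaxation can be written directly as an LP over state–action occupation measures; the two-state chain, rewards, and budget fraction below are illustrative assumptions, not taken from the paper:

```python
# Sketch: finite-horizon LP relaxation for a restless bandit in the
# mean-field scaling. Variables y[t, s, a] are the fractions of arms in
# state s taking action a (0 = rest, 1 = activate) at time t.
import numpy as np
from scipy.optimize import linprog

S, T, alpha = 2, 3, 0.4
m0 = np.array([0.5, 0.5])                  # initial state distribution
# Transition kernels P[a][s, s'] for rest (a=0) and activate (a=1).
P = [np.array([[0.9, 0.1], [0.4, 0.6]]),
     np.array([[0.2, 0.8], [0.1, 0.9]])]
r = np.array([[0.0, 1.0], [0.0, 0.5]])     # reward r[s, a], paid on activation

def idx(t, s, a):                          # flatten (t, s, a) -> column index
    return (t * S + s) * 2 + a

n = T * S * 2
c = np.zeros(n)
for t in range(T):
    for s in range(S):
        for a in range(2):
            c[idx(t, s, a)] = -r[s, a]     # negate: linprog minimizes

A_eq, b_eq = [], []
for s in range(S):                         # initial occupation measure
    row = np.zeros(n); row[idx(0, s, 0)] = row[idx(0, s, 1)] = 1.0
    A_eq.append(row); b_eq.append(m0[s])
for t in range(T - 1):                     # mean-field flow conservation
    for s2 in range(S):
        row = np.zeros(n)
        row[idx(t + 1, s2, 0)] = row[idx(t + 1, s2, 1)] = 1.0
        for s in range(S):
            for a in range(2):
                row[idx(t, s, a)] -= P[a][s, s2]
        A_eq.append(row); b_eq.append(0.0)
for t in range(T):                         # per-round activation budget
    row = np.zeros(n)
    for s in range(S):
        row[idx(t, s, 1)] = 1.0
    A_eq.append(row); b_eq.append(alpha)

res = linprog(c, A_eq=np.array(A_eq), b_eq=b_eq, bounds=[(0, None)] * n)
lp_upper_bound = -res.fun                  # upper-bounds any feasible policy
```

The LP-index policy would read off the dual variables of this program, while the LP-update policy would re-solve it each epoch with the empirical state fractions in place of m0.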
3. Bandit Linear Optimization in Sequential Decision and Extensive-Form Games
The BanditLP framework also extends to sequential, tree-form decision problems and extensive-form games (Farina et al., 2021), where the sequence-form strategy space enforces combinatorial structure and flow conservation:
- Algorithmic Foundation: Bandit mirror descent with a dilated-entropy regularizer is employed over the sequence-form polytope, with loss estimation via unbiased estimators built from the sampled actions and observed bandit losses.
- Sampling and Regret Analysis: Each round, sampling respects the tree structure; regret against any fixed sequence-form comparator grows as O(√T) (with polynomial dependence on the game size), and all required oracles (sampling, gradients, Fenchel conjugates) are implementable in time linear in the size of the decision tree.
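The core of the loss-estimation step is the standard importance-weighted estimator: only the sampled action's loss is observed, and dividing by its sampling probability restores unbiasedness. A flat-action sketch of that building block (the sequence-form version applies the same idea along the sampled root-to-leaf path):

```python
# Sketch: unbiased bandit loss estimation via importance weighting.
import numpy as np

def estimate_loss(loss, probs, sampled):
    """Return an importance-weighted estimate of the full loss vector,
    using only the observed loss of the sampled action."""
    est = np.zeros_like(loss)
    est[sampled] = loss[sampled] / probs[sampled]
    return est

probs = np.array([0.2, 0.3, 0.5])   # sampling distribution over actions
loss = np.array([0.4, 0.1, 0.7])    # true (unobserved) loss vector

# Unbiasedness: E[est] = sum_a probs[a] * (loss[a] / probs[a]) * e_a = loss.
expectation = sum(p * estimate_loss(loss, probs, a)
                  for a, p in enumerate(probs))
```

Mirror descent then takes a step against this estimate; the dilated-entropy regularizer is what keeps the update tractable over the sequence-form polytope.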
4. Loyalty Points Bandits (Fidelity Rewards) and Regret Structure
'BanditLP' in the fidelity reward context (Lugosi et al., 2021) refers to bandits with loyalty-points or coupon-style reward augmentation:
- Model: Each arm carries an extra fidelity reward granted upon its n-th selection, determined by a fidelity function f(n); the cumulative payoff therefore depends on the full history of pull counts, making rewards path-dependent.
- Regret Characterization:
- Stochastic regimes: If the fidelity function is nondecreasing (increasing loyalty), sublinear regret remains achievable, with the optimal rate governed by the shape of the fidelity function.
- For decreasing (rotting) or coupon-style fidelity functions, classical regret rates are preserved.
- Adversarial regimes: Any nontrivial strictly increasing fidelity function forces linear regret; coupon/rotting structures still admit sublinear weak and mean regret.
Algorithmic Approaches
- Modified-UCB: For nondecreasing fidelity functions, standard UCB is applied to fidelity-adjusted rewards.
- EXP4-Style: For adversarial or decreasing fidelity functions, a finite expert class is constructed and an EXP4-style algorithm is run over it.
- Lower Bounds and Impossibility: Phase transitions in regret depend sharply on the monotonicity of the fidelity function; in the adversarial setting, strictly increasing fidelity renders sublinear regret impossible (Lugosi et al., 2021).
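A schematic of the Modified-UCB idea, assuming a capped-linear fidelity function and Gaussian base rewards (both illustrative choices, not the paper's exact construction): UCB is run on the observed rewards after the loyalty bonus has been added.

```python
# Sketch: UCB on fidelity-adjusted rewards (nondecreasing fidelity).
import numpy as np

def fidelity(n):                 # nondecreasing loyalty bonus (assumed form)
    return min(0.1 * n, 0.3)

def modified_ucb(means, T, rng):
    K = len(means)
    counts = np.zeros(K, dtype=int)
    totals = np.zeros(K)         # sum of observed fidelity-adjusted rewards
    for t in range(T):
        if t < K:
            arm = t              # play each arm once to initialize
        else:
            ucb = totals / counts + np.sqrt(2 * np.log(t + 1) / counts)
            arm = int(np.argmax(ucb))
        base = rng.normal(means[arm], 0.05)
        reward = base + fidelity(counts[arm] + 1)   # bonus on the n-th pull
        counts[arm] += 1
        totals[arm] += reward
    return counts

rng = np.random.default_rng(1)
counts = modified_ucb(np.array([0.8, 0.4, 0.3]), T=2000, rng=rng)
```

Because the bonus is nondecreasing in the pull count, concentrating pulls on the empirically best arm also accrues the largest fidelity payoff, which is why the UCB adjustment suffices in this regime.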
5. PAC Battling Bandits under Plackett-Luce: BanditLP as Combinatorial Subset Optimization
Another 'BanditLP' paradigm arises in sample-efficient best-arm identification in the battling-bandit problem under Plackett-Luce (PL) subset choice models (Saha et al., 2018):
- Setup: At each round, the learner selects a size-k subset of the n arms and receives either winner-information (WI) or top-m ranking (TR) feedback. The goal is (ε, δ)-PAC identification of a near-best arm.
- Sample Complexity:
- WI: O((n/ε²) log(1/δ)) rounds, independent of the subset size k.
- TR: O((n/(m ε²)) log(1/δ)) rounds, a 1/m multiplicative improvement from rank-ordered feedback.
- Algorithmic Techniques: The Trace-the-Best and Divide-and-Battle algorithms rely on adaptive tournaments and rank-breaking, with tight concentration bounds and minimax-optimal guarantees under PL.
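A minimal simulation of WI feedback under a Plackett-Luce model, with illustrative scores and subset size, showing the win-count statistic that subset-based identification algorithms build on:

```python
# Sketch: winner-information (WI) feedback under Plackett-Luce.
import numpy as np

rng = np.random.default_rng(2)
theta = np.array([3.0, 1.0, 1.0, 1.0])     # PL scores; arm 0 is the best
k = 3                                      # subset size played each round

def pl_winner(subset, theta, rng):
    """Sample the winner of `subset`: arm i wins w.p. theta[i] / sum."""
    w = theta[subset]
    return subset[rng.choice(len(subset), p=w / w.sum())]

wins = np.zeros(len(theta))
for _ in range(2000):
    subset = rng.choice(len(theta), size=k, replace=False)
    wins[pl_winner(subset, theta, rng)] += 1

best_arm = int(np.argmax(wins))
```

Under PL, the relative win rate of two arms within any subset depends only on their own scores (independence of irrelevant alternatives), which is what makes rank-breaking across different subsets statistically sound.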
6. Synthesis and Practical Impact
All BanditLP methodologies share key similarities:
- LP relaxations or combinatorial optimization layers are leveraged to efficiently respect system-level constraints at scale.
- Bandit learning—contextual or classical—is tightly integrated with these selection layers, often with explicit exploration/exploitation mechanisms (Thompson sampling, UCB, EXP-style algorithms).
- The frameworks are not restricted to any single application, and are compatible with arbitrary neural or probabilistic predictors, provided they can rapidly yield reward/cost estimates and credible uncertainty quantification for use in online LPs.
Empirical investigations confirm the theoretical predictions: web-scale systems demonstrate nontrivial business and fairness improvements (e.g., LinkedIn deployments (Nguyen et al., 22 Jan 2026)), alongside rigorous validation on synthetic and public benchmarks.
7. Related Models and Broader Connections
BanditLP lies at the intersection of bandit learning, large-scale constrained optimization, and combinatorial/mean-field control. It unifies and extends earlier work in:
- Constrained combinatorial bandits and their LP relaxations,
- Restless and sleeping bandit models where arms have Markovian or non-stationary dynamics,
- Sequence-form bandit optimization for games and tree-structured decisions,
- Fidelity rewards for path-dependent payoffs,
- Best-arm identification under subset choice with complex feedback (Plackett-Luce, dueling/battling bandits).
The LP-based structure allows for tractability at web and industrial scales, precise characterization of regret and efficiency, and robust adaptation to a broad range of multi-stakeholder online allocation problems. The methodology is not tied to a single algorithmic template, but rather to principled integration of online statistical learning and combinatorial optimization through tractable linear programming relaxations and their scalable solvers.
References: (Nguyen et al., 22 Jan 2026, Gast et al., 2021, Lugosi et al., 2021, Farina et al., 2021, Saha et al., 2018)