BanditLP: LP-Enhanced Bandit Methods

Updated 23 January 2026
  • BanditLP is a family of methods that combine bandit learning with linear programming to manage complex, multi-level constraints in online decision environments.
  • It leverages techniques such as neural Thompson sampling and dual decomposition to achieve sublinear regret and strict, per-round constraint satisfaction.
  • Key applications include web-scale recommendation, restless bandits, extensive-form games, and fidelity reward models, offering scalable solutions for diverse operational challenges.

BanditLP denotes a family of methodologies and models that combine multi-armed or contextual bandit learning with linear programming (LP) formulations, typically to address structural, combinatorial, or large-scale constraint requirements in online decision systems. The term is applied to several distinct but conceptually related strands of work in the bandit literature, spanning large-scale recommendation under multi-stakeholder constraints, fidelity rewards in loyalty points bandits, and linear-program-based policies for restless and combinatorial bandits.

1. Multi-Stakeholder Contextual BanditLP for Large-Scale Constrained Recommendation

BanditLP (Nguyen et al., 22 Jan 2026) formalizes the web-scale, multi-stakeholder recommendation problem as an online contextual bandit with per-round, hard combinatorial constraints. It enables simultaneous optimization for several objectives—user satisfaction, provider fairness, and platform budget—by integrating neural Thompson sampling and LP-based action selection:

  • Problem Setup: Let $U$ be the users, $I$ the items, and $L$ the providers. In each round $t$, for each user–item pair $(u, i)$, a context $z_{u,i,t}$ is observed, with an unknown reward $r_{u,i,t}$ and $K$ auxiliary costs $c_{u,i,t}^{(k)}$.
  • Constraints: Platform-level (global budgets), provider-level (per-provider quotas/fairness), and user-level (caps) constraints are imposed via linear inequalities over the action variables $x_{u,i} \in [0,1]$ (the probability of recommending item $i$ to user $u$ at round $t$).
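
As a rough sketch of how the setup and constraints combine, the per-round selection problem can be written as a single LP. The sampled values $\tilde r_{u,i,t}$ and $\tilde c^{(k)}_{u,i,t}$, the budgets $B_k$, the provider quotas $q_\ell$, the per-provider item sets $I_\ell$, and the user caps $m_u$ are notation assumed here for illustration only; the text above states only that all three constraint families are linear in $x_{u,i}$.

```latex
\begin{aligned}
\max_{x}\quad & \sum_{u \in U} \sum_{i \in I} \tilde r_{u,i,t}\, x_{u,i} \\
\text{s.t.}\quad
& \sum_{u \in U} \sum_{i \in I} \tilde c^{(k)}_{u,i,t}\, x_{u,i} \le B_k, && k = 1, \dots, K && \text{(platform budgets)} \\
& \sum_{u \in U} \sum_{i \in I_\ell} x_{u,i} \ge q_\ell, && \ell \in L && \text{(provider quotas)} \\
& \sum_{i \in I} x_{u,i} \le m_u, && u \in U && \text{(user caps)} \\
& 0 \le x_{u,i} \le 1, && u \in U,\ i \in I.
\end{aligned}
```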

Workflow and Key Components

  • Learning (Neural Thompson Sampling): Bayesian neural nets with Laplace approximations generate posterior samples for all outcomes and costs.
  • Action Selection (LP Optimization): At serving time, a massive-scale LP is solved to maximize the sampled cumulative reward subject to all constraints.
  • Solver (DuaLip): A dual decomposition and partial Lagrangian approach transforms the primal LP into a strongly convex QP. The dual is maximized via efficient first-order and projection methods, exploiting separability and strong duality to scale to very large numbers of variables.
  • Regret & Feasibility: The approach inherits sublinear Bayesian regret bounds for the stochastic contextual bandit, and guarantees per-round constraint satisfaction (zero violation) due to strict enforcement at each epoch.
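
The sample-then-optimize loop above can be illustrated with a deliberately small sketch. It makes several simplifying assumptions that are not in the source: a Beta-Bernoulli posterior stands in for the neural Thompson sampler, there is a single budget constraint with nonnegative costs, each user receives at most one item, and the multiplier is found by plain bisection rather than by DuaLip's first-order method. All function names are hypothetical.

```python
import random

def sample_rewards(successes, failures, rng):
    """Thompson step: draw one reward sample per (user, item) from a Beta posterior.
    (Assumption: Beta-Bernoulli replaces the neural sampler described in the text.)"""
    return [[rng.betavariate(1 + s, 1 + f) for s, f in zip(s_row, f_row)]
            for s_row, f_row in zip(successes, failures)]

def best_response(rewards, costs, lam):
    """For a fixed multiplier lam, each user independently picks the item
    maximizing reward - lam * cost, or nothing if no adjusted value is positive."""
    choice = []
    for r_row, c_row in zip(rewards, costs):
        i_best = max(range(len(r_row)), key=lambda i: r_row[i] - lam * c_row[i])
        choice.append(i_best if r_row[i_best] - lam * c_row[i_best] > 0 else None)
    return choice

def total_cost(choice, costs):
    """Budget consumed by an assignment (None means no recommendation)."""
    return sum(costs[u][i] for u, i in enumerate(choice) if i is not None)

def select_actions(rewards, costs, budget, iters=50):
    """Bisection on the budget multiplier; always returns a budget-feasible assignment.
    Assumes costs >= 0 and budget >= 0."""
    choice = best_response(rewards, costs, 0.0)
    if total_cost(choice, costs) <= budget:
        return choice  # the budget constraint is slack, no price needed
    lo, hi = 0.0, 1.0
    while total_cost(best_response(rewards, costs, hi), costs) > budget:
        hi *= 2.0  # raise the price until the relaxed solution fits the budget
    for _ in range(iters):
        mid = (lo + hi) / 2
        if total_cost(best_response(rewards, costs, mid), costs) > budget:
            lo = mid
        else:
            hi = mid  # invariant: hi always yields a feasible assignment
    return best_response(rewards, costs, hi)
```

Because the relaxed problem separates across users once the multiplier is fixed, each evaluation is a cheap per-user argmax, which is the property dual decomposition exploits. The assignment returned here is feasible but can fall short of the LP optimum when several users tie at the critical multiplier.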

Empirical Outcomes: In both synthetic benchmarks and production deployments (e.g., LinkedIn email marketing), BanditLP achieves significant cumulative reward gains and tight constraint adherence (<0.5% violation on public datasets), and it demonstrates exploration benefits for long-term metrics even in highly constrained production settings (Nguyen et al., 22 Jan 2026).

2. LP-Based Policies for Restless Bandits and Mean-Field Regimes

LP-based decision rules for restless Markovian bandits (Gast et al., 2021) use a linear programming relaxation to approximate the value of the finite-$N$ control problem (with $N$ bandit arms and per-round activation constraints) as $N \to \infty$:

  • Finite-Horizon LP Relaxation: Given a finite state space, a finite time horizon, and a fixed fraction of arms activated per round, the relaxation optimizes expected total reward over state-action occupation fractions, subject to mean-field dynamics and budget constraints.
  • Asymptotic Optimality Hierarchy:
    • Asymptotic optimality: the per-arm value gap between the LP relaxation and the actual policy vanishes as $N \to \infty$.
    • $O(1/\sqrt{N})$-rate and exponential-rate optimality are formalized via uniform constants in the convergence rates.
    • Necessary and sufficient conditions (LP-compatibility, Lipschitz policy maps, local affinity) are precisely characterized.
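
For concreteness, a standard way to write such a mean-field relaxation is sketched below. The notation is an assumption made here for illustration: $y_{s,a}(t)$ is the fraction of arms in state $s$ taking action $a \in \{0,1\}$ at time $t$, $r_s^a$ is the per-arm reward, $P^a_{s,s'}$ the transition kernel, $m_s(0)$ the initial state distribution, $\alpha$ the activation fraction, and $T$ the horizon. The activation constraint is written with equality; an inequality version is equally common.

```latex
\begin{aligned}
\max_{y \ge 0}\quad & \sum_{t=0}^{T-1} \sum_{s} \sum_{a \in \{0,1\}} r_s^{a}\, y_{s,a}(t) \\
\text{s.t.}\quad
& \sum_{s} y_{s,1}(t) = \alpha, && t = 0, \dots, T-1 \\
& \sum_{a} y_{s',a}(t+1) = \sum_{s} \sum_{a} P^{a}_{s,s'}\, y_{s,a}(t), && \forall s',\ t = 0, \dots, T-2 \\
& \sum_{a} y_{s,a}(0) = m_s(0), && \forall s.
\end{aligned}
```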

LP-Index and LP-Update Policies

  • LP-Index Policy: Pre-solve the LP and derive dual variables, yielding a set of per-state indices that implement Lagrange-prioritized activation: arms are ordered by these indices and activated until the activation budget is met.
    • Guarantees an $O(1/\sqrt{N})$ gap for arbitrary models and an exponentially small gap for non-degenerate LP solutions.
  • LP-Update Policy: Re-solve the LP at each epoch based on observed empirical fractions, yielding much smaller practical regret (gap constants), with the same $O(1/\sqrt{N})$ asymptotics but improved robustness under model mismatch.
  • Numerical Verification: LP-update dominates LP-index under model misspecification and achieves empirical reward closer to the theoretical LP upper bound for moderate $N$ (Gast et al., 2021).
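
The activation step of an index policy is simple enough to sketch directly. The snippet below assumes the per-state indices have already been computed (the values used in any call are placeholders, not derived from an actual LP), and it also shows the empirical state fractions that an LP-update style policy would feed back into the solver. Function names are hypothetical.

```python
def lp_index_activate(arm_states, index, alpha):
    """Activate the floor(alpha * N) arms whose current states have the highest index.

    arm_states[j] is the current state of arm j; index[s] is the priority of state s.
    """
    n_arms = len(arm_states)
    budget = int(alpha * n_arms + 1e-9)  # small epsilon guards against float round-off
    # Sort arm ids by the index of their current state, highest first; ties broken by arm id.
    ranked = sorted(range(n_arms), key=lambda j: (-index[arm_states[j]], j))
    return set(ranked[:budget])

def empirical_fractions(arm_states, n_states):
    """State occupancy vector: the observed quantity an LP-update policy re-solves from."""
    counts = [0] * n_states
    for s in arm_states:
        counts[s] += 1
    return [c / len(arm_states) for c in counts]
```

With four arms in states `[0, 2, 1, 2]`, indices `[0.1, 0.5, 0.9]`, and `alpha = 0.5`, the two arms currently in state 2 are activated.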

3. Bandit Linear Optimization in Sequential Decision and Extensive-Form Games

The BanditLP framework also extends to sequential, tree-form decision problems and extensive-form games (Farina et al., 2021), where the sequence-form strategy space enforces combinatorial structure and flow conservation:

  • Algorithmic Foundation: Bandit mirror descent with a dilated-entropy regularizer is employed over the sequence-form polytope, with loss estimation via unbiased estimators built from the sampled actions and observed bandit losses.
  • Sampling and Regret Analysis: Each round, sampling respects the structure of the tree; expected regret versus any fixed sequence-form comparator is bounded by $O(\sqrt{T})$ over $T$ rounds, with all oracles (sampling, gradients, Fenchel conjugate) implementable in time linear in the size of the decision tree.
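
To make the estimator-plus-mirror-step pattern concrete, here is a minimal sketch for the degenerate case of a single decision node, where the sequence-form polytope reduces to a probability simplex and the dilated entropy reduces to the ordinary negative entropy. This is an Exp3-style simplification assumed for illustration, not the tree-structured algorithm of the cited paper; all names are hypothetical.

```python
import math
import random

def sample_action(probs, rng):
    """Draw an action index from the current mixed strategy."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def loss_estimate(probs, action, loss):
    """Importance-weighted estimator of the full loss vector from one bandit observation.
    It is unbiased because E[loss_A * 1{A = a} / p_a] = loss_a for every action a."""
    est = [0.0] * len(probs)
    est[action] = loss / probs[action]
    return est

def mirror_step(probs, est, eta):
    """Entropic mirror descent update (multiplicative weights) on the simplex."""
    weights = [p * math.exp(-eta * g) for p, g in zip(probs, est)]
    total = sum(weights)
    return [w / total for w in weights]
```

One round then consists of `sample_action`, observing the scalar loss of the sampled action, `loss_estimate`, and `mirror_step`. In the tree-form setting the same three steps are carried out with a regularizer and sampling scheme that respect the sequence-form constraints.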

4. Loyalty Points Bandits (Fidelity Rewards) and Regret Structure

'BanditLP' in the fidelity reward context (Lugosi et al., 2021) refers to bandits with loyalty-points or coupon-style reward augmentation:

  • Model: Each arm has an extra fidelity reward, granted on each selection and determined by how many times that arm has been selected so far; the accumulated fidelity makes payoffs path-dependent.
  • Regret Characterization:
    • Stochastic regimes: If the fidelity function is nondecreasing (increasing loyalty), an optimal regret rate is established.
    • For decreasing (rotting) or coupon fidelity functions, classical regret rates are preserved.
    • Adversarial regimes: Any nontrivial strictly increasing fidelity function forces linear regret; coupon and rotting fidelity admit weak/mean regret bounds.
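
The path dependence in this model can be seen in a few lines of code. The sketch below assumes a caller-supplied fidelity function per arm, and a hypothetical coupon rule that pays a fixed bonus on every m-th selection of an arm; both the rule and the function names are illustrative assumptions rather than definitions taken from the cited paper.

```python
def coupon_fidelity(m, bonus):
    """Hypothetical coupon rule: pay `bonus` on every m-th selection of an arm."""
    return lambda n: bonus if n % m == 0 else 0.0

def total_payoff(pulls, base_rewards, fidelity):
    """Sum base rewards plus fidelity rewards along a sequence of arm pulls.

    fidelity[j](n) is the extra reward arm j grants on its n-th selection.
    """
    counts = {}
    payoff = 0.0
    for arm, base in zip(pulls, base_rewards):
        counts[arm] = counts.get(arm, 0) + 1  # n-th selection of this arm
        payoff += base + fidelity[arm](counts[arm])
    return payoff
```

With a coupon of 1.0 on every third pull and zero base rewards, the sequence `[0, 0, 0]` earns 1.0 while `[0, 1, 0]` earns nothing, even though both play arm 0 first and last. This is why a single-round view of rewards is not enough in this model.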

Algorithmic Approaches

  • Modified-UCB: For nondecreasing fidelity functions, standard UCB is applied to adjusted rewards.
  • EXP4-Style: For adversarial or decreasing fidelity functions, a finite expert class is constructed.
  • Lower Bounds and Impossibility: Phase transitions in regret depend sharply on the monotonicity of the fidelity function; in the adversarial regime, strictly increasing fidelity renders sublinear regret impossible (Lugosi et al., 2021).

5. PAC Battling Bandits under Plackett-Luce: BanditLP as Combinatorial Subset Optimization

Another 'BanditLP' paradigm arises in sample-efficient best-arm identification in the battling-bandit problem under Plackett-Luce (PL) subset choice models (Saha et al., 2018):

  • Setup: At each round, the learner selects a subset of $k$ arms from $n$ arms in total and receives either single-winner (WI) or top-$m$ ranking (TR-$m$) feedback. The goal is $(\epsilon, \delta)$-PAC identification of a near-best arm.
  • Sample Complexity:
    • WI: $\Theta\left(\frac{n}{\epsilon^2} \ln \frac{1}{\delta}\right)$, independent of $k$.
    • TR-$m$: $\Theta\left(\frac{n}{m \epsilon^2} \ln \frac{1}{\delta}\right)$, achieving a multiplicative $1/m$ improvement via rank feedback.
  • Algorithmic Techniques: Trace-Battle and Divide-Battle approaches rely on adaptive tournaments and rank-breaking, with tight concentration bounds and minimax-optimal guarantees under PL.
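
The two feedback models and the rank-breaking step can be sketched as follows. The code assumes positive Plackett-Luce scores are available for simulation (in the learning problem they are unknown), treats winner feedback as the special case of a top-1 ranking, and uses hypothetical function names.

```python
import random

def sample_top_m(subset, scores, m, rng):
    """Draw a top-m ranking from a Plackett-Luce model restricted to `subset`.

    Items are picked one at a time without replacement, each with probability
    proportional to its score among the items still remaining.
    """
    remaining = list(subset)
    ranking = []
    for _ in range(min(m, len(remaining))):
        weights = [scores[i] for i in remaining]
        pick = rng.choices(remaining, weights=weights, k=1)[0]
        ranking.append(pick)
        remaining.remove(pick)
    return ranking

def rank_break(subset, ranking):
    """Rank-breaking: turn a top-m ranking into (winner, loser) pairs.

    Each ranked item beats every item ranked below it and every unranked item.
    """
    pairs = []
    for pos, winner in enumerate(ranking):
        beaten = ranking[pos + 1:] + [i for i in subset if i not in ranking]
        pairs.extend((winner, loser) for loser in beaten)
    return pairs
```

A top-$m$ ranking over a subset of size $k$ yields many more pairwise comparisons per round than a single winner does, which gives some intuition for the $1/m$ improvement in sample complexity under TR-$m$ feedback.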

6. Synthesis and Practical Impact

All BanditLP methodologies share key similarities:

  • LP relaxations or combinatorial optimization is leveraged to efficiently respect system-level constraints at scale.
  • Bandit learning—contextual or classical—is tightly integrated with these selection layers, often with explicit exploration/exploitation mechanisms (Thompson sampling, UCB, EXP-style algorithms).
  • The frameworks are not restricted to any single application, and are compatible with arbitrary neural or probabilistic predictors, provided they can rapidly yield reward/cost estimates and credible uncertainty quantification for use in online LPs.

Empirical investigations confirm the theoretical predictions: web-scale systems demonstrate nontrivial business and fairness improvements (e.g., LinkedIn deployments (Nguyen et al., 22 Jan 2026)), and the methods are also validated on synthetic and public benchmarks.

BanditLP lies at the intersection of bandit learning, large-scale constrained optimization, and combinatorial/mean-field control. It unifies and extends earlier work in:

  • Constrained combinatorial bandits and their LP relaxations,
  • Restless and sleeping bandit models where arms have Markovian or non-stationary dynamics,
  • Sequence-form bandit optimization for games and tree-structured decisions,
  • Fidelity rewards for path-dependent payoffs,
  • Best-arm identification under subset choice with complex feedback (Plackett-Luce, dueling/battling bandits).

The LP-based structure allows for tractability at web and industrial scales, precise characterization of regret and efficiency, and robust adaptation to a broad range of multi-stakeholder online allocation problems. The methodology is not tied to a single algorithmic template, but rather to principled integration of online statistical learning and combinatorial optimization through tractable linear programming relaxations and their scalable solvers.


References: (Nguyen et al., 22 Jan 2026, Gast et al., 2021, Lugosi et al., 2021, Farina et al., 2021, Saha et al., 2018)
