BanditLP: LP-Enhanced Bandit Methods

Updated 23 January 2026
  • BanditLP is a family of methods that combine bandit learning with linear programming to manage complex, multi-level constraints in online decision environments.
  • It leverages techniques such as neural Thompson sampling and dual decomposition to achieve sublinear regret and strict, per-round constraint satisfaction.
  • Key applications include web-scale recommendation, restless bandits, extensive-form games, and fidelity reward models, offering scalable solutions for diverse operational challenges.

BanditLP denotes a family of methodologies and models that combine multi-armed or contextual bandit learning with linear programming (LP) formulations, typically to address structural, combinatorial, or large-scale constraint requirements in online decision systems. The term is applied to several distinct but conceptually related strands of work in the bandit literature, spanning large-scale recommendation under multi-stakeholder constraints, fidelity rewards in loyalty points bandits, and linear-program-based policies for restless and combinatorial bandits.

1. Multi-Stakeholder Contextual BanditLP for Large-Scale Constrained Recommendation

BanditLP (Nguyen et al., 22 Jan 2026) formalizes the web-scale, multi-stakeholder recommendation problem as an online contextual bandit with per-round, hard combinatorial constraints. It enables simultaneous optimization for several objectives—user satisfaction, provider fairness, and platform budget—by integrating neural Thompson sampling and LP-based action selection:

  • Problem Setup: Let $U$ be users, $I$ items, $L$ providers; each round $t$, for user–item pairs $(u, i)$, contexts $z_{u,i,t}$ are observed, with an unknown reward $r_{u,i,t}$ and $K$ auxiliary costs $c_{u,i,t}^{(k)}$.
  • Constraints: Platform-level (global budgets), provider-level (per-provider quotas/fairness), and user-level (caps) are imposed via linear inequalities over the action variables $x_{u,i} \in [0,1]$ (probabilities of recommending item $i$ to user $u$ at round $t$).

Workflow and Key Components

  • Learning (Neural Thompson Sampling): Bayesian neural nets with Laplace approximations generate samples $\tilde r_{u,i}, \tilde c_{u,i}^{(k)}$ for all outcomes and costs.
  • Action Selection (LP Optimization): At serving time, a massive-scale LP is solved to maximize the sampled cumulative reward subject to all constraints.
  • Solver (DuaLip): A dual decomposition and partial Lagrangian approach transforms the primal LP into a strongly convex QP. The dual is maximized via efficient first-order and projection methods, exploiting separability and strong duality to scale to $10^9$+ variables.
  • Regret & Feasibility: The approach inherits sublinear Bayesian regret bounds for the stochastic contextual bandit, and guarantees per-round constraint satisfaction (zero violation) due to strict enforcement at each epoch.
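The sample-then-optimize loop above can be sketched in a few lines. This is a toy illustration, not the production system: Gaussian posteriors stand in for the Laplace-approximated Bayesian neural network, a single global cost budget plus per-user caps stand in for the full platform/provider/user constraint hierarchy, and `scipy.optimize.linprog` replaces the DuaLip solver.

```python
# Toy sketch of BanditLP's sample-then-optimize step. Assumptions: Gaussian
# posteriors replace the Laplace-approximated BNN; one global budget and
# per-user caps replace the full constraint hierarchy; scipy replaces DuaLip.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n_users, n_items = 4, 5
n_pairs = n_users * n_items

# Posterior means for rewards and one auxiliary cost per (u, i) pair.
mu_r = rng.uniform(0, 1, n_pairs)
mu_c = rng.uniform(0, 1, n_pairs)

# Thompson step: draw one joint posterior sample of rewards and costs.
r_tilde = rng.normal(mu_r, 0.1)
c_tilde = rng.normal(mu_c, 0.1)

# LP step: maximize sampled reward subject to a global cost budget and
# per-user caps, with serving probabilities x_{u,i} in [0, 1].
budget, user_cap = 5.0, 2
A_user = np.zeros((n_users, n_pairs))
for u in range(n_users):
    A_user[u, u * n_items:(u + 1) * n_items] = 1.0  # sum_i x_{u,i} <= cap
A_ub = np.vstack([c_tilde.reshape(1, -1), A_user])
b_ub = np.concatenate([[budget], np.full(n_users, user_cap)])

res = linprog(-r_tilde, A_ub=A_ub, b_ub=b_ub, bounds=[(0, 1)] * n_pairs)
x = res.x.reshape(n_users, n_items)  # per-round serving probabilities
```

At production scale the LP step would be re-solved per epoch with fresh posterior samples, which is what yields both the exploration behavior and the per-round feasibility guarantee.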

Empirical Outcomes: In both synthetic benchmarks and production deployments (e.g., LinkedIn email marketing), BanditLP achieves significant cumulative reward gains and tight constraint adherence (<0.5% violation on public datasets), and demonstrates exploration benefits for long-term metrics even in highly constrained production settings (Nguyen et al., 22 Jan 2026).

2. LP-Based Policies for Restless Bandits and Mean-Field Regimes

LP-based decision rules for restless Markovian bandits (Gast et al., 2021) use a linear programming relaxation to approximate the value of the finite-$N$ control problem (with $N$ bandit arms and per-round activation constraints) as $N \to \infty$:

  • Finite-Horizon LP Relaxation: For $d$ states, time horizon $T$, and fraction $\alpha$ of activations per round, the relaxation optimizes expected total reward over $\{y_{s,a}(t)\}$, subject to mean-field dynamics and budget constraints.
  • Asymptotic Optimality Hierarchy:
    • Asymptotic optimality: the per-arm value gap between the LP relaxation and the actual policy vanishes as $N \to \infty$.
    • $O(1/\sqrt N)$-rate and exponential-rate optimality are formalized via uniform constants in the convergence rates.
    • Necessary and sufficient conditions (LP-compatibility, Lipschitz policy maps, local affinity) are precisely characterized.
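A minimal version of this finite-horizon relaxation can be written as a single LP over the occupation variables $y_{s,a}(t)$. The 2-state transition kernels, rewards, and initial distribution below are made up for illustration, and the budget is imposed as an inequality:

```python
# Toy finite-horizon mean-field LP relaxation for a restless bandit.
# Assumptions: made-up 2-state kernels/rewards; budget as an inequality.
import numpy as np
from scipy.optimize import linprog

d, T, alpha = 2, 3, 0.4              # states, horizon, activation fraction
m0 = np.array([0.6, 0.4])            # initial state distribution
P = np.array([                       # P[a, s, s'] transition kernels
    [[0.9, 0.1], [0.5, 0.5]],        # passive (a = 0)
    [[0.3, 0.7], [0.2, 0.8]],        # active  (a = 1)
])
R = np.array([[0.0, 1.0],            # R[s, a] per-arm expected rewards
              [0.0, 2.0]])

def idx(t, s, a):                    # position of y_{s,a}(t) in the flat vector
    return t * d * 2 + s * 2 + a

n = T * d * 2
c = -np.tile(R.ravel(), T)           # linprog minimizes, so negate rewards

A_eq, b_eq = [], []
for s in range(d):                   # initial condition: sum_a y_{s,a}(0) = m0[s]
    row = np.zeros(n)
    row[idx(0, s, 0)] = row[idx(0, s, 1)] = 1.0
    A_eq.append(row); b_eq.append(m0[s])
for t in range(T - 1):               # mean-field dynamics between rounds
    for s2 in range(d):
        row = np.zeros(n)
        row[idx(t + 1, s2, 0)] = row[idx(t + 1, s2, 1)] = 1.0
        for s in range(d):
            for a in range(2):
                row[idx(t, s, a)] -= P[a, s, s2]
        A_eq.append(row); b_eq.append(0.0)

A_ub, b_ub = [], []
for t in range(T):                   # activation budget: sum_s y_{s,1}(t) <= alpha
    row = np.zeros(n)
    row[idx(t, 0, 1)] = row[idx(t, 1, 1)] = 1.0
    A_ub.append(row); b_ub.append(alpha)

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
              bounds=[(0, 1)] * n)
lp_upper_bound = -res.fun            # per-arm value of the relaxation
```

The optimal value bounds the per-arm reward of any feasible finite-$N$ policy from above, which is exactly what the asymptotic optimality hierarchy measures policies against.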

LP-Index and LP-Update Policies

  • LP-Index Policy: Pre-solve the LP and derive dual variables, yielding a set of indices $I_s(t)$ that implement Lagrange-prioritized activation: order arms by these per-state indices to satisfy the activation budget.
    • Guarantees an $O(1/\sqrt N)$ gap for arbitrary models and an exponential gap for non-degenerate LP solutions.
  • LP-Update Policy: Re-solve the LP at each epoch based on observed empirical fractions, yielding much smaller practical regret (gap constants), with the same $O(1/\sqrt N)$ asymptotics but improved robustness under model mismatch.
  • Numerical Verification: LP-update dominates LP-index under model misspecification and achieves empirical reward closer to the theoretical LP upper bound for moderate NN (Gast et al., 2021).
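Once the indices are in hand, the LP-index activation step reduces to a ranking rule. In this sketch the per-state indices $I_s$ are made-up numbers standing in for the dual-derived values:

```python
# Sketch of the LP-index activation step. Assumptions: the per-state indices
# are hypothetical placeholders for the dual-derived I_s; arm states random.
import numpy as np

rng = np.random.default_rng(1)
N, alpha = 10, 0.4
index = np.array([0.3, 1.2, 0.7])      # hypothetical I_s for states 0, 1, 2
states = rng.integers(0, 3, size=N)    # current state of each of N arms

budget = int(alpha * N)                # number of arms to activate this round
order = np.argsort(-index[states])     # rank arms by their state's index
active = np.zeros(N, dtype=bool)
active[order[:budget]] = True          # activate only the top-index arms
```

The LP-update variant would recompute `index` each epoch from an LP solved at the current empirical state distribution, rather than reusing the pre-solved values.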

3. Bandit Linear Optimization in Sequential Decision and Extensive-Form Games

The BanditLP framework also extends to sequential, tree-form decision problems and extensive-form games (Farina et al., 2021), where the sequence-form strategy space $X$ enforces combinatorial structure and flow conservation:

  • Algorithmic Foundation: Bandit mirror descent with a dilated-entropy regularizer is employed over the sequence-form polytope, with loss estimation via unbiased estimators built from the sampled actions and observed bandit losses.
  • Sampling and Regret Analysis: Each round, sampling respects the structure of the tree; regret versus any fixed sequence-form comparator is bounded by $O(\sqrt{T})$, with all oracles (sampling, gradients, Fenchel conjugate) implementable in $O(|\Sigma|)$ time.
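A much-simplified special case illustrates the mechanism: with a single decision node the sequence-form polytope collapses to the probability simplex, the dilated-entropy regularizer reduces to negative entropy, and bandit mirror descent with importance-weighted loss estimates reduces to EXP3. The per-action mean losses below are made up:

```python
# EXP3 as the single-decision-node special case of bandit mirror descent
# with an entropy regularizer. Assumption: made-up Bernoulli mean losses.
import numpy as np

rng = np.random.default_rng(2)
K, T, eta = 3, 2000, 0.05
true_loss = np.array([0.8, 0.2, 0.5])      # per-action mean losses
w = np.zeros(K)                            # cumulative estimated losses
picks = np.zeros(K, dtype=int)

for t in range(T):
    p = np.exp(-eta * w); p /= p.sum()     # mirror step (entropy regularizer)
    a = rng.choice(K, p=p)                 # play an action from the policy
    loss = float(rng.random() < true_loss[a])  # bandit feedback for a only
    w[a] += loss / p[a]                    # unbiased importance-weighted est.
    picks[a] += 1
```

In the full tree-form setting the same three ingredients (mirror step, structured sampling, unbiased loss estimate) operate over the sequence-form polytope instead of the simplex.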

4. Loyalty Points Bandits (Fidelity Rewards) and Regret Structure

'BanditLP' in the fidelity reward context (Lugosi et al., 2021) refers to bandits with loyalty-points or coupon-style reward augmentation:

  • Model: Each arm $i$ has an extra fidelity reward $f_i(n)$ granted upon the $n$th selection; cumulative fidelity is $F_i(n) = \sum_{s=1}^n f_i(s)$, leading to path-dependent payoffs.
  • Regret Characterization:
    • Stochastic regimes: If $f_i$ is nondecreasing (increasing loyalty), the optimal regret is $O\left((K \ln T)^{1/3} T^{2/3}\right)$.
    • For decreasing (rotting) or coupon-style $f_i$, the classical $O(\sqrt{KT})$ rate is preserved.
    • Adversarial regimes: Any nontrivial strictly increasing $f_i$ forces linear regret; coupon/rotting structures allow weak/mean regret of $\widetilde O(\sqrt{KT})$.

Algorithmic Approaches

  • Modified-UCB: For nondecreasing $f_i$, standard UCB is applied to suitably adjusted rewards.
  • EXP4-Style: For adversarial/decreasing $f_i$, a finite expert class is constructed.
  • Lower Bounds and Impossibility: Phase transitions in regret depend sharply on the monotonicity of $f_i$; strictly increasing fidelity renders sublinear regret impossible (Lugosi et al., 2021).
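To make the payoff structure concrete, the sketch below runs a plain UCB learner on the observed base-plus-fidelity rewards under a coupon-style schedule (a bonus on every third pull). This is a simplification of the paper's modified-UCB, and the arm means and schedule are invented for illustration:

```python
# Plain UCB on base + coupon-style fidelity rewards. Assumptions: made-up
# arm means and fidelity schedule; a simplification of the modified-UCB.
import numpy as np

rng = np.random.default_rng(3)
means = np.array([0.4, 0.6])           # base Bernoulli reward means
T = 3000

def fidelity(n):
    return 1.0 if n % 3 == 0 else 0.0  # coupon: bonus on every 3rd pull

pulls = np.zeros(2, dtype=int)
sums = np.zeros(2)
for t in range(T):
    if t < 2:
        a = t                          # initialize: pull each arm once
    else:                              # UCB on observed base + fidelity reward
        ucb = sums / pulls + np.sqrt(2 * np.log(t) / pulls)
        a = int(ucb.argmax())
    pulls[a] += 1                      # n-th pull of arm a triggers f(n)
    sums[a] += float(rng.random() < means[a]) + fidelity(pulls[a])
```

Because this coupon schedule is identical across arms, it shifts both arms' effective means equally and plain UCB still identifies the better base arm; the hard cases in the paper arise when $f_i$ is strictly increasing.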

5. PAC Battling Bandits under Plackett-Luce: BanditLP as Combinatorial Subset Optimization

Another 'BanditLP' paradigm arises in sample-efficient best-arm identification in the battling-bandit problem under Plackett-Luce (PL) subset choice models (Saha et al., 2018):

  • Setup: At each round, the learner selects a subset of $k$ arms and receives either a single winner (WI) or top-$m$ ranking (TR$_m$) feedback. The goal is $(\epsilon,\delta)$-PAC identification of a near-best arm.
  • Sample Complexity:
    • WI: $\Theta\left((n/\epsilon^2) \ln(1/\delta)\right)$, independent of $k$.
    • TR$_m$: $\Theta\left((n/(m\epsilon^2)) \ln(1/\delta)\right)$, achieving a multiplicative improvement via rank feedback.
  • Algorithmic Techniques: Trace-Battle and Divide-Battle approaches rely on adaptive tournaments and rank-breaking, with tight concentration bounds and minimax-optimal guarantees under PL.
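Winner (WI) feedback under the Plackett-Luce model is straightforward to simulate: the winner of a queried subset $S$ is drawn with probability proportional to its PL score. The scores and subset below are made up:

```python
# Simulating winner (WI) feedback under a Plackett-Luce choice model.
# Assumptions: made-up PL scores theta and a fixed queried subset S.
import numpy as np

rng = np.random.default_rng(4)
theta = np.array([1.0, 2.0, 0.5, 4.0, 1.5])   # PL scores for n = 5 arms
S = np.array([0, 1, 3])                        # queried subset, k = 3

def winner(S):
    p = theta[S] / theta[S].sum()              # PL choice probabilities
    return int(rng.choice(S, p=p))

wins = np.bincount([winner(S) for _ in range(5000)], minlength=5)
```

Top-$m$ (TR$_m$) feedback would instead draw arms from $S$ sequentially without replacement under the same proportional rule, which is the extra information that rank-breaking exploits.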

6. Synthesis and Practical Impact

All BanditLP methodologies share key similarities:

  • LP relaxations or combinatorial optimization layers are leveraged to respect system-level constraints efficiently at scale.
  • Bandit learning—contextual or classical—is tightly integrated with these selection layers, often with explicit exploration/exploitation mechanisms (Thompson sampling, UCB, EXP-style algorithms).
  • The frameworks are not restricted to any single application, and are compatible with arbitrary neural or probabilistic predictors, provided they can rapidly yield reward/cost estimates and credible uncertainty quantification for use in online LPs.

Empirical investigations confirm the theoretical predictions, with web-scale systems demonstrating nontrivial business and fairness improvements (e.g., LinkedIn deployments (Nguyen et al., 22 Jan 2026)), alongside rigorous validation on synthetic and public benchmarks.

BanditLP lies at the intersection of bandit learning, large-scale constrained optimization, and combinatorial/mean-field control. It unifies and extends earlier work in:

  • Constrained combinatorial bandits and their LP relaxations,
  • Restless and sleeping bandit models where arms have Markovian or non-stationary dynamics,
  • Sequence-form bandit optimization for games and tree-structured decisions,
  • Fidelity rewards for path-dependent payoffs,
  • Best-arm identification under subset choice with complex feedback (Plackett-Luce, dueling/battling bandits).

The LP-based structure allows for tractability at web and industrial scales, precise characterization of regret and efficiency, and robust adaptation to a broad range of multi-stakeholder online allocation problems. The methodology is not tied to a single algorithmic template, but rather to principled integration of online statistical learning and combinatorial optimization through tractable linear programming relaxations and their scalable solvers.


References: (Nguyen et al., 22 Jan 2026, Gast et al., 2021, Lugosi et al., 2021, Farina et al., 2021, Saha et al., 2018)
