
Combinatorial Bandits

Updated 22 December 2025
  • Combinatorial bandits are sequential decision models that involve selecting structured subsets of base arms under constraints, incorporating diverse feedback (semi-bandit, cascading, full-bandit) to optimize cumulative rewards.
  • They employ techniques like arm elimination, upper-confidence bounds, and hierarchical exploration to efficiently balance explicit exploration and exploitation across interdependent actions.
  • Applications in recommendation systems, online advertising, and network routing underscore their practical impact by addressing complex, structured decision-making problems in uncertain environments.

A combinatorial bandit is a sequential decision model in which, at each round, the learner selects a set or structured subset of base arms from a ground set and receives observations and rewards determined by the chosen set and often the reward structure across arms. This paradigm generalizes classical multi-armed bandits (MAB) by posing a combinatorial decision at each round, subject to constraints (e.g., cardinality, matroid, graph), and admits diverse feedback models (semi-bandit, cascading, full-bandit, graph feedback). Combinatorial bandits are foundational in applications such as recommendation systems, online advertising, resource allocation, network routing, assortment planning, and influence maximization, where the system must simultaneously select (and learn about) multiple actions with interdependent effects.

1. Mathematical Framework and Models

A general combinatorial bandit instance is specified by:

  • A set of base arms $[K] = \{1, \ldots, K\}$.
  • At each round $t$, the learner selects a subset $V_t \subseteq [K]$ from a family of feasible actions $\mathcal{F}$, often constrained by cardinality ($|V_t| = S$), independence (matroid), matching, or other combinatorial structure.
  • Each base arm $a$ produces an i.i.d. reward $r_{t,a}$ with mean $\mu_a$ (potentially contextual: $\mathbb{E}[r_{t,a} \mid x_{t,a}] = \theta_*^\top x_{t,a}$).
  • The overall reward $R_t$ may be additive ($\sum_{a \in V_t} r_{t,a}$), non-additive (e.g., a max, product, or submodular function), or even stochastic submodular.

Feedback models include:

  • Semi-bandit: observe $r_{t,a}$ for each $a \in V_t$.
  • Full-bandit: only the aggregate reward $R_t$ is observed.
  • Cascading/Partial Feedback: only partial information about the selected arms, e.g., observations up to the first failure in a path.
  • Graph Feedback: rewards are revealed for the union of out-neighbors of $V_t$ in a feedback graph.

Performance is measured via regret, $R_T = \mathbb{E}\left[\sum_{t=1}^T \big(\mu(V_*) - \mu(V_t)\big)\right]$ in additive settings or, more generally, the gap between the achievable and optimal expected reward under the given feedback and decision constraints.
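
To make the additive semi-bandit case concrete, the following Python sketch simulates a cardinality-constrained instance and exposes per-arm feedback plus per-round regret; the class name `SemiBanditEnv` and the Bernoulli arm rewards are illustrative assumptions, not part of any cited algorithm.

```python
import numpy as np

class SemiBanditEnv:
    """Additive combinatorial semi-bandit with a cardinality constraint |V_t| = S."""
    def __init__(self, mu, S, seed=0):
        self.mu = np.asarray(mu, dtype=float)   # per-arm means mu_a
        self.S = S                              # number of arms selected per round
        self.rng = np.random.default_rng(seed)
        # Under additive rewards the optimal action V_* is the top-S arms by mean.
        self.opt_value = np.sort(self.mu)[-S:].sum()

    def step(self, V_t):
        """Play a size-S set; return semi-bandit feedback and the aggregate reward."""
        V_t = list(V_t)
        assert len(V_t) == self.S
        r = self.rng.binomial(1, self.mu[V_t])      # Bernoulli reward per chosen arm
        return dict(zip(V_t, r)), float(r.sum())    # ({a: r_{t,a}}, R_t)

    def regret_increment(self, V_t):
        """Expected per-round regret mu(V_*) - mu(V_t)."""
        return self.opt_value - self.mu[list(V_t)].sum()
```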

2. Core Algorithmic Approaches and Structural Insights

Arm Elimination and Optimistic Algorithms

A principal strategy is upper-confidence-based selection, where arms’ means are estimated while maintaining high-probability confidence bounds; combinatorial actions are selected to maximize a surrogate function using these estimates. Notable advancements include:

  • Explicit Exploration and Arm Elimination: Instead of naively playing the top-$S$ arms by UCB index, optimal methods partition the base arms into confirmed, active, and eliminated sets. At each round, confirmed arms are always selected, while a carefully chosen exploratory set of active arms ensures uniform evidence collection and explicit exploration (see the sketch after this list). After sufficient evidence, new arms are confirmed or eliminated based on empirical means and statistical confidence widths (Wen et al., 28 Oct 2025).
  • Hierarchical Exploration in Contextual Linear Settings: Structured batched elimination, staged at geometrically decreasing confidence width thresholds, balances sample complexity and aggressive elimination. This approach is crucial for attaining minimax-optimal rates in linear contextual combinatorial bandits (Wen et al., 28 Oct 2025).
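
To illustrate the confirmed/active/eliminated bookkeeping described above, here is a highly simplified Python sketch under semi-bandit feedback and a plain cardinality constraint; the confidence radius, the confirmation and elimination tests, and the choice of exploratory arms are placeholder rules for illustration, not the tuned procedures of (Wen et al., 28 Oct 2025).

```python
import numpy as np

def elimination_round(counts, sums, confirmed, active, S, delta=0.01):
    """One round: build V_t from confirmed + exploratory arms, then confirm/eliminate.

    counts, sums: per-arm pull counts and reward sums (numpy arrays of length K).
    confirmed, active: disjoint sets of arm indices (arms in neither are eliminated).
    """
    means = np.where(counts > 0, sums / np.maximum(counts, 1), 0.0)
    width = np.sqrt(np.log(1.0 / delta) / np.maximum(counts, 1))  # placeholder radius

    # Action: every confirmed arm, plus the least-sampled active arms as explorers.
    slots = max(S - len(confirmed), 0)
    explorers = sorted(active, key=lambda a: counts[a])[:slots]
    V_t = list(confirmed) + explorers

    # Confirm an active arm once its lower bound beats the slots-th best upper
    # bound among its active rivals; eliminate it on the symmetric reverse test.
    for a in list(active):
        slots = S - len(confirmed)
        if slots <= 0:
            break
        rivals = [b for b in active if b != a]
        if len(rivals) < slots:                      # all remaining active arms fit
            active.remove(a); confirmed.add(a)
            continue
        rival_ucbs = sorted((means[b] + width[b] for b in rivals), reverse=True)
        rival_lcbs = sorted((means[b] - width[b] for b in rivals), reverse=True)
        if means[a] - width[a] > rival_ucbs[slots - 1]:
            active.remove(a); confirmed.add(a)       # a is surely among the best
        elif means[a] + width[a] < rival_lcbs[slots - 1]:
            active.remove(a)                         # a is surely not; eliminate
    return V_t
```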

Feedback-Dependent Methodology

  • Semi-bandit Feedback: Standard UCB-based extensions or MLE/OFU in linear parameter regimes achieve instance-dependent $\tilde{O}(K\log T)$ or minimax $O(\sqrt{KST})$ regret, with further improvements via elimination and tight sample allocation (Kveton et al., 2014, Jourdan et al., 2021, Wen et al., 28 Oct 2025).
  • Cascading Feedback: In settings where the learner only partially observes selected arms (e.g., up to the first “failure”), UCB-style methods must adapt to nonlinear rewards and partial monitoring; a simplified sketch of this feedback structure follows this list. Algorithms like CombCascade construct item-level upper bounds and optimize the non-linear objective via a log-sum-exp reduction, producing regret rates that match semi-bandit rates up to a factor reflecting observability (Kveton et al., 2015).
  • Full-bandit Feedback: When only the aggregate sum is observed, individual arm mean estimation requires structured experimental design; Hadamard-matrix-based schemes enable unbiased estimation under perfectly confounded feedback, as in the CSAR methods for top-$k$ selection and regret minimization (Rejwan et al., 2019).
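
The sketch below illustrates the cascading (conjunctive) feedback structure: per-arm UCBs are maintained, the $S$ arms with the largest UCBs are selected (which, for values clipped to $[0,1]$, also maximizes the product of UCBs under a plain cardinality constraint), and observation stops at the first failed arm. It is a simplified stand-in for CombCascade; the exploration constant and the Bernoulli arm model are illustrative choices.

```python
import numpy as np

def cascading_round(counts, sums, S, t, mu_true, rng):
    """One round of a conjunctive cascading bandit with a cardinality constraint."""
    means = sums / np.maximum(counts, 1)
    ucb = np.clip(means + np.sqrt(1.5 * np.log(t + 1) / np.maximum(counts, 1)), 0.0, 1.0)
    # With values in [0, 1], the product of UCBs over a size-S set is maximized
    # by taking the S arms with the largest UCB indices.
    V_t = list(np.argsort(ucb)[-S:][::-1])

    reward = 1
    for a in V_t:                              # arms are examined in order
        x = rng.binomial(1, mu_true[a])        # realized Bernoulli state of arm a
        counts[a] += 1
        sums[a] += x
        if x == 0:                             # cascading feedback: stop at the first
            reward = 0                         # failure; later arms go unobserved
            break
    return V_t, reward
```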

3. Regret Theory and Optimality

A fundamental goal is matching upper and lower regret bounds across problem settings. Key results include:

  • Graph feedback, $S$ arms: $O((\alpha\log^2 K + S)\log T/\Delta_*)$ (gap-dependent), $O(\sqrt{\alpha S T} + S\sqrt{T})$ (minimax) (Wen et al., 28 Oct 2025)
  • Semi-bandit, matroid: $O\big(\sum_{e\notin A^*} \frac{\log n}{\Delta_e}\big)$ (gap-dependent), $O(\sqrt{KLn\log n})$ (gap-free) (Kveton et al., 2014)
  • Contextual linear: $O\big(\log(ST)\sqrt{\log(KT)}\,(\sqrt{dST} + dS)\big)$ (Wen et al., 28 Oct 2025)
  • Full-bandit, top-$k$: $O(nk\log T/\Delta)$, $O(k\sqrt{nT\log T})$ (Rejwan et al., 2019)
  • Cascading feedback: $O\big((K/f^*)\sum_{e\notin A^*} \log n/\Delta_{e,\min}\big)$ (Kveton et al., 2015)
  • Sleeping bandits: $O(\log T)$ (instance-dependent), $O(\sqrt{T\log T})$ (Abhishek et al., 2021)

Here, $\alpha$ is the independence number of the feedback graph, $f^*$ is the optimal all-up reward probability, and $K$ typically denotes the base-arm set size.

Matching lower bounds are established for all regimes, demonstrating optimality up to logarithmic or small polynomial factors (Wen et al., 28 Oct 2025, Combes et al., 2015, Rejwan et al., 2019).

Naive UCB-based methods can suffer strictly suboptimal rates except in the simplest settings, due to the lack of an explicit delineation between exploration and exploitation. In particular, failing to split the $S$ pulls between confirmed (greedy) arms and exploratory arms leads to minimax regret that is suboptimal by factors proportional to $\sqrt{S}$ in key settings (Wen et al., 28 Oct 2025).

4. Extensions: Contextual, Structured, and Adversarial Regimes

Beyond i.i.d. stochastic reward models, recent work has expanded combinatorial bandits to:

  • Contextual and Linear Models: When arm rewards are functions of observed feature vectors and an unknown parameter, hierarchical and variance-adaptive algorithms (e.g., OFU, regularized least squares) permit adaptive confidence intervals and batched elimination, achieving rates strictly better than naive per-arm UCBs (Wen et al., 28 Oct 2025).
  • Graph/Network Structure and Matroids: Matroid bandits harness greedy selection on independence constraints, while network flow and path constraints appear in online routing and influence maximization (Kveton et al., 2014).
  • Causal and Reinforcement Learning Connections: Combinatorial causal bandits involve interventions on observed variables in parametric causal DAGs, where rewards propagate via Markovian dynamics. OFU- and regression-based solutions, combined with do-calculus reductions for hidden variables, yield regret rates scaling as $O(n\sqrt{DT}\log T)$, where $n$ is the number of variables and $D$ the maximum parental degree (Feng et al., 2022).
  • Stochastic Submodular Functions: For non-linear, monotone submodular reward settings with only full-bandit feedback, optimized stochastic explore-then-commit methods (e.g., SGB) significantly improve the dependence on the selection budget $k$ over previous methods, constructing size-$k$ actions with reduced exploration and achieving state-of-the-art $O(n^{1/3}k^{2/3}T^{2/3}(\log T)^{2/3})$ regret (Fourati et al., 2023); a schematic explore-then-commit sketch follows this list.
  • Strategic and Non-stationary Environments: Models with adversarial or strategically manipulated reporting, such as a bounded strategic budget per arm, introduce an $O(m\log T + mB_{\max})$ regret scaling, showing the impossibility of sublinear regret if arm-side manipulation budgets are superlogarithmic (Dong et al., 2021).
  • Sleeping and Dynamic Availabilities: Volatile availability of arms (sleeping bandits) is handled by adapting CUCB to dynamically varying base-arm sets, with logarithmic regret scaling persisting under only mild smoothness and combinatorial assumptions (Abhishek et al., 2021).
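
As a concrete, simplified rendering of the explore-then-commit idea for monotone stochastic submodular rewards under full-bandit feedback, the sketch below builds a size-$k$ action greedily, estimating each candidate's marginal gain by averaging repeated plays of the tentative set. Here `play` is an assumed noisy set-value oracle and `m` a per-candidate sample budget; this is a schematic stand-in, not the exact SGB procedure.

```python
import numpy as np

def etc_greedy(play, n_arms, k, m):
    """Greedy construction of a size-k action from noisy full-bandit plays.

    play(S) is an assumed stochastic oracle returning a noisy aggregate reward
    for the set S; m is the per-candidate exploration budget.
    """
    chosen, base_value = [], 0.0
    for _ in range(k):
        candidates = [a for a in range(n_arms) if a not in chosen]
        # Estimate the marginal gain of each candidate by averaging m plays of
        # the tentative set chosen + {a} under full-bandit feedback.
        gains = [np.mean([play(chosen + [a]) for _ in range(m)]) - base_value
                 for a in candidates]
        best = int(np.argmax(gains))
        base_value += gains[best]
        chosen.append(candidates[best])
    return chosen   # commit to this action for the remaining rounds
```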

5. Pure Exploration, Sample Complexity, and Oracle-Efficient Algorithms

  • Best-Arm Identification (Pure Exploration): In combinatorial settings, the fixed-confidence best-arm or best-set identification problem requires sample-efficient strategies. Recent game-theoretic meta-algorithms, such as CombGame, are oracle-efficient and achieve asymptotically optimal $\log(1/\delta)$ sample complexity via projection-free online learners, best-response oracles, and confidence-regulated stopping (Jourdan et al., 2021).
  • Sample Complexity Under Full-Bandit Feedback: Hadamard-matrix-based estimation, as in CSAR, achieves optimal per-arm estimation rates and successively accepts or rejects arms based on precise confidence intervals (Rejwan et al., 2019); a small estimation sketch follows.
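
The following sketch shows how a Hadamard design recovers individual arm means from subset sums alone, which is the estimation primitive behind CSAR-style accept/reject loops. Here `query_sum` is an assumed oracle returning the sum of means (or an average of many noisy plays) over a queried subset, and the size constraint on actions is ignored for clarity; with noisy observations the estimate is unbiased rather than exact.

```python
import numpy as np

def sylvester_hadamard(m):
    """Symmetric Hadamard matrix of order m (m a power of two), with H @ H = m * I."""
    H = np.array([[1]])
    while H.shape[0] < m:
        H = np.block([[H, H], [H, -H]])
    return H

def estimate_means(query_sum, n_arms):
    """Recover per-arm means from subset-sum (full-bandit) observations."""
    m = 1
    while m < n_arms:
        m *= 2
    H = sylvester_hadamard(m)
    M = (H + 1) // 2                     # 0/1 design matrix; row 0 selects every arm
    y = []
    for row in M:
        idx = np.flatnonzero(row)
        y.append(query_sum(idx[idx < n_arms]))    # padded arms contribute zero
    y = np.asarray(y, dtype=float)
    # Since H = 2M - J and row 0 sums all means, H @ (2y - y[0]) = m * mu.
    mu_hat = H @ (2.0 * y - y[0]) / m
    return mu_hat[:n_arms]
```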

6. Applications and Practical Implications

Combinatorial bandits form the algorithmic backbone for:

  • Slate recommendation and ad placement: Exploration is expensive per round, and explicit splitting between exploitation and exploration is essential to maintain optimal regret (Wen et al., 28 Oct 2025).
  • Network routing with cascading failures: The combinatorial cascading bandit model captures settings where partial observation (e.g., only up to first failed link) is available and reward is a product over arm states (Kveton et al., 2015).
  • Assortment optimization in retail: The task of selecting optimal product subsets under observe-sale-only feedback, where combinatorial constraints and feedback structure critically impact algorithmic design (Wen et al., 28 Oct 2025).
  • Energy management and resource allocation: Assigning actions to many independent agents/bandits combines parallelized exploration and integer programming-based assignment, leveraging full semi-bandit feedback for tractable regret (Jacobs et al., 2020, Zuo et al., 2021).
  • Influence maximization: Combinatorial optimization over nodes and edges, under triggering-based and partial feedback, where submodular reward functions and their stochastic analogs are central (Fourati et al., 2023, Liu et al., 2023).

7. Open Problems and Future Directions

Several research directions remain open:

  • Adversarial and Non-stationary Extensions: Attaining minimax-optimal regret in adversarial or non-i.i.d. settings, especially under partial or cascading feedback.
  • Computational Efficiency for Covering Steps: The greedy dominating-set covering required for optimal explicit exploration is polynomial but may be large; scalable approximations and efficient heuristics remain active areas of study (Wen et al., 28 Oct 2025).
  • Richer Feedback Models: Extensions to pairwise, ranking, or partially observed outcomes.
  • Tighter Logarithmic Dependencies: Further tightening of regret bounds with respect to log factors and batch sizes.
  • Bridging with Reinforcement Learning: Combinatorial MAB (CMAB) frameworks are being unified with episodic reinforcement learning, with value-weighted smoothness and triggering-probability modulation enabling improved regret bounds in both fields (Liu et al., 3 Jun 2024).
  • Approximation Oracle Requirements: Broader regimes for efficient approximation oracles and best-action oracles, especially in non-modular, submodular, or structured rewards.

These directions reflect the diverse and rapidly evolving landscape of combinatorial bandit research, spanning algorithmic innovation, structural analysis, and new connections to broader areas of statistical decision theory and online learning.
