Cascading Bandits: Scalable Online Learning
- The cascading bandits model is an online learning framework in which a user sequentially examines recommended items until making a click, yielding partial feedback.
- It extends multi-armed bandits to combinatorial (list-valued) actions in which only outcomes up to and including the first click are observed.
- Linear generalization via item features enables scalable, efficient recommendations with regret independent of the number of items.
The cascading bandits model is an influential framework in online learning to rank and recommendation systems, motivated by the need to model user interactions when presented with ordered lists from a large candidate set. In this model, a user examines recommended items one-by-one and typically acts by clicking the first attractive item—after which the examination process stops. This structure introduces partial feedback, where only the outcomes for items up to and including the first click are observable. Cascading bandits generalize classical multi-armed bandits to ordered, combinatorial actions and underpin modern contextual and scalable approaches for large-scale recommendation.
1. Cascade Model: Principles and Formalization
The cascade model (Craswell et al., 2008) offers a tractable abstraction for user behavior in ranked lists, formally specifying how attractiveness is realized and examined. Given a ground set of $L$ items, the learner recommends an ordered list $A = (a_1, \dots, a_K)$ of $K$ distinct items. Each item $e$ possesses an unknown but fixed attraction probability $\bar{w}(e) \in [0,1]$.
- The probability that the $k$-th item is examined is $\prod_{i=1}^{k-1}\bigl(1 - \bar{w}(a_i)\bigr)$, i.e., the probability that none of the preceding items was attractive.
- If the examined item $a_k$ is attractive (its realized attraction indicator $w(a_k) = 1$), the user clicks and the process terminates.
- The reward at time $t$ is $r_t = 1 - \prod_{k=1}^{K}\bigl(1 - w_t(a_k^t)\bigr)$, capturing the event that at least one item is clicked.
- Feedback is partial—only indices up to the first click (or end of list) are revealed; items after the first attractive item remain unobserved.
This abstraction has proved highly amenable to both theoretical analysis and practical algorithm design, particularly in web search and recommendation contexts.
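As a concrete illustration of this feedback structure, the following minimal sketch simulates one round of the cascade model; the function and variable names are illustrative choices, not from any particular implementation.

```python
import numpy as np

def simulate_cascade(recommended, attraction_probs, rng):
    """Simulate one round of the cascade model.

    The user scans `recommended` in order; each item is attractive independently
    with its attraction probability. The user clicks the first attractive item
    and stops. Returns the click position (or None) and the list of observed
    attraction indicators (the partial, cascading feedback).
    """
    observed = []
    for k, item in enumerate(recommended):
        attracted = rng.random() < attraction_probs[item]  # Bernoulli draw
        observed.append(int(attracted))
        if attracted:
            return k, observed          # click at position k; later items unobserved
    return None, observed               # no click; all K items were examined

rng = np.random.default_rng(0)
attraction_probs = rng.uniform(0.02, 0.2, size=100)    # toy values for L = 100 items
click_pos, feedback = simulate_cascade([3, 17, 42, 8], attraction_probs, rng)
reward = int(click_pos is not None)                     # 1 if at least one click
```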
2. Cascading Bandits: Online Learning Problem
The cascading bandit problem formalizes learning-to-rank as an online process: at each round, the agent must select an ordered list of $K$ items from $L$ candidates, seeking to maximize cumulative expected reward (total clicks) across $n$ rounds. The central objective is minimization of regret—the difference between the reward accrued and that which would be obtained by always recommending the $K$ optimal items.
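Written out explicitly (with notation matching the cascade model above; the paper's exact symbols may differ slightly), the objective is:

```latex
% Cumulative regret over n rounds; A_t is the recommended list at round t,
% A^* is the list of the K items with the largest attraction probabilities,
% and f(A, w) is the round reward under attraction realization w.
\[
  R(n) \;=\; \sum_{t=1}^{n} \mathbb{E}\!\left[\, f(A^{*}, w_t) - f(A_t, w_t) \,\right],
  \qquad
  f(A, w) \;=\; 1 - \prod_{k=1}^{K} \bigl(1 - w(a_k)\bigr).
\]
```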
Previous algorithms (e.g., CascadeUCB1, CascadeKL-UCB (Kveton et al., 2015)) assigned independent learning processes to each item. This suffers from poor scalability, as regret and sample complexity grow at least linearly with $L$. Such methods rapidly become impractical as $L$ approaches or exceeds thousands—the typical scale in modern recommendation applications.
3. Linear Generalization and Large-Scale Cascading Bandits
To achieve scalability, the model is augmented with a linear generalization assumption: the attraction probability is assumed to depend (approximately) linearly on item features,

$$\bar{w}(e) \approx x_e^{\top}\theta^{*},$$

where $x_e \in \mathbb{R}^d$ is the known feature vector for item $e$ and $\theta^{*} \in \mathbb{R}^d$ is an unknown parameter shared across all items. This approach allows feedback from one item to inform predictions for many others—exploiting the structure of the feature space.
Algorithmic Frameworks
Two primary algorithms are proposed, each leveraging linear generalization:
- Cascading Linear Thompson Sampling (CLTS): samples a parameter $\theta_t$ from a multivariate Gaussian centered at the posterior mean $\bar{\theta}_t$, with covariance proportional to $M_t^{-1}$, and selects the top $K$ items greedily according to the sampled attraction probabilities $x_e^{\top}\theta_t$.
- Cascading Linear UCB (CLUCB): for each item $e$, computes a UCB estimate $U_t(e) = x_e^{\top}\bar{\theta}_t + c\,\sqrt{x_e^{\top} M_t^{-1} x_e}$ (clipped to $[0,1]$), and recommends the top $K$ items according to these bounds.
After partial feedback is observed, the sufficient statistics ($M_t$, $B_t$) of the shared linear estimate are updated using the observed features and outcomes (click indicators).
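A minimal sketch of this shared machinery is given below, assuming ridge-regression-style sufficient statistics; the class and parameter names are illustrative, not taken from the paper's pseudocode.

```python
import numpy as np

class CascadeLinUCBSketch:
    """Illustrative CLUCB-style agent: one linear model shared across all items."""

    def __init__(self, features, K, c=1.0, sigma=1.0):
        self.X = features              # (L, d) item feature matrix
        self.K = K                     # length of the recommended list
        self.c = c                     # confidence-interval width
        self.sigma = sigma             # assumed feedback noise scale
        d = features.shape[1]
        self.M = np.eye(d)             # Gram-type sufficient statistic M_t
        self.B = np.zeros(d)           # feature-weighted click statistic B_t

    def recommend(self):
        M_inv = np.linalg.inv(self.M)
        theta_bar = self.sigma ** -2 * M_inv @ self.B          # posterior-mean estimate
        means = self.X @ theta_bar
        widths = np.sqrt(np.einsum("ij,jk,ik->i", self.X, M_inv, self.X))
        ucb = np.minimum(means + self.c * widths, 1.0)          # clip to a valid probability
        return np.argsort(-ucb)[: self.K]                       # top-K items by UCB

    def update(self, recommended, click_pos):
        # Cascade feedback: only items up to and including the first click are observed.
        last = click_pos if click_pos is not None else len(recommended) - 1
        for k in range(last + 1):
            x = self.X[recommended[k]]
            clicked = 1.0 if k == click_pos else 0.0
            self.M += self.sigma ** -2 * np.outer(x, x)
            self.B += x * clicked
```

Paired with the `simulate_cascade` helper above, a learning loop is simply recommend → simulate → update. A CLTS-style variant would differ only in the recommend step: instead of adding a confidence width, it would draw $\theta_t \sim \mathcal{N}(\bar{\theta}_t, M_t^{-1})$ and rank items by $x_e^{\top}\theta_t$.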
Regret Guarantees
The CLUCB algorithm achieves regret bounded by

$$R(n) = \tilde{O}\!\left(c\,K\sqrt{d\,n}\right),$$

with the confidence-interval width $c$ set according to model and variance parameters. Crucially, the regret bound is independent of the number of items $L$, scaling only with the feature dimension $d$ and the number of recommendations $K$. This matches (up to logarithmic terms) the best known results for linear and combinatorial bandits with semi-bandit feedback—despite operating with only cascading, position-dependent feedback.
For CLTS, the paper provides comprehensive empirical evidence supporting similar scaling.
4. Feature-Based Learning and Generalization
Learning the shared predictor $\theta^{*}$ is performed online, via partial feedback and feature vectors. In practice, features are often constructed using collaborative filtering or low-rank matrix factorization (e.g., SVD decomposition of the user-item interaction matrix).
Implication: When features encode similarity among items, the learning of attraction probabilities generalizes rapidly—performance for new, rarely presented items is inferred from observed similar items. This facilitates sample efficiency: the number of samples required to estimate $\theta^{*}$ scales with the feature dimension $d$, with learning complexity largely agnostic to $L$.
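As one concrete (illustrative) recipe, item features can be obtained from a truncated SVD of a user-item interaction matrix; the snippet below sketches this preprocessing step under that assumption, and is not the paper's exact pipeline.

```python
import numpy as np

def item_features_from_interactions(interactions, d):
    """Build d-dimensional item features via truncated SVD of a user-item matrix.

    `interactions` is an (n_users, L) matrix of clicks/ratings. The top-d right
    singular vectors, scaled by their singular values, serve as item features.
    """
    U, s, Vt = np.linalg.svd(interactions, full_matrices=False)
    return (np.diag(s[:d]) @ Vt[:d]).T       # shape (L, d): one feature row per item

# Toy usage: 500 users, 100 items, 10-dimensional features.
rng = np.random.default_rng(1)
interactions = (rng.random((500, 100)) < 0.05).astype(float)
X = item_features_from_interactions(interactions, d=10)
```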
5. Empirical Evaluation and Scalability
Experiments on large recommendation datasets (Yelp, Million Song, MovieLens) reveal that linear cascading bandit algorithms dramatically outperform baselines:
- Baseline regret grows sharply with $L$; algorithms with linear generalization maintain low, stable regret even as $L$ approaches thousands.
- Performance remains robust so long as the feature dimension $d$ is sufficiently expressive and the number of recommended items $K$ is moderate.
- Linear cascade bandits learn effective recommendations orders of magnitude faster than non-generalizing multi-armed bandit baselines, particularly in truly large-scale settings.
These findings confirm that feature-based generalization is essential for practical deployment in industry-scale ranking and recommendation applications.
6. Comparative Summary
| Aspect | Classic Cascading Bandit | Linear Cascading Bandit (This Paper) |
|---|---|---|
| Learns | Attraction probability $\bar{w}(e)$ for each item | Shared parameter $\theta^{*}$ via features |
| Regret | Grows at least linearly with $L$ | Scales with $d$ and $K$, independent of $L$ |
| Scalability | Poor for large $L$ | Good for very large $L$ |
| Feedback | Cascade feedback | Cascade feedback, linear generalization |
| Algorithms | UCB, Thompson Sampling | CLUCB, CLTS (linear UCB/TS) |
7. Impact, Limitations, and Extensions
This work advances bandit learning-to-rank by enabling scalable, efficient learning for cascade-model user behaviors and large candidate sets. The theoretical independence of regret from item count—tied only to feature-space dimension—expands the practical utility of cascading bandits for modern recommender systems. However, efficacy is tied to the quality and informativeness of the feature construction; poor features can compromise generalization and performance.
Subsequent research has extended the cascading bandit framework along several dimensions: privacy constraints (Wang et al., 2021), exposure bias (Mansoury et al., 2024), robust estimation under corruption (Xie et al., 2025; Ghaffari et al., 2025), reinforcement learning with state-dependent actions (Du et al., 2024), combinatorial constraints (Kveton et al., 2015), minimax-regret analysis (Vial et al., 2022), and hybrid objectives balancing relevance and diversity (Li et al., 2019).
The cascading bandits model continues to be a core tool for interactive learning-to-rank, inspiring methodological developments in scalability, fairness, privacy, robustness, and generalization for recommendation systems and information retrieval.