
Contextual Linear Reward Model

Updated 20 January 2026
  • The model defines expected rewards as a linear function of known context–action features and an unknown parameter, enabling rigorous analysis of sequential decision-making.
  • It underpins various settings including classical bandits, hybrid models, and constrained environments, with algorithms like LinUCB achieving near-optimal regret rates.
  • Its flexibility extends to nonstationary, adversarial, and privacy-aware scenarios, offering vital insights for reinforcement learning and causal inference.

The contextual linear reward model defines a broad and rigorously analyzed class of stochastic sequential decision problems, encompassing contextual bandits and extending into reinforcement learning, constrained resource settings, and models with hybrid, adversarial, or even nonstationary components. Central to this framework is the assumption that the expected reward for each action taken in a specific context is a linear function of a known context–action feature mapping and an unknown parameter vector, with the learner’s objective being to maximize cumulative reward or minimize regret over a fixed or possibly unbounded time horizon. The structure, assumptions, and technical analysis of the contextual linear reward model underpin algorithmic approaches and theoretical guarantees throughout contemporary bandit and RL research.

1. Formal Definition and Statistical Structure

Let $t=1,2,\dots,n$ denote sequential rounds. At each round $t$, the learner observes a context $c_t$ (possibly chosen adversarially). The available action space $A$ is fixed, and each context–action pair $(c_t,a)$, for $a\in A$, is mapped into $\mathbb{R}^d$ via a known embedding

$$\phi(c_t, a)\in\mathbb{R}^d,$$

defining the set of feasible decision vectors

$$D_t = \left\{\phi(c_t, a)\mid a \in A\right\}\subset \mathbb{R}^d.$$

Upon choosing $x_t\in D_t$ (identifying arm $a_t\in A$), the learner receives a noisy stochastic reward
$$y_t = \langle\theta^*, x_t\rangle + \eta_t,$$
where $\theta^* \in \mathbb{R}^d$ is a fixed but unknown parameter vector and $\eta_t$ is a conditionally $\sigma^2$-subgaussian noise term, satisfying for all $\lambda\in\mathbb{R}$:
$$\mathbb{E}\left[e^{\lambda\,\eta_t}\mid x_{1:t},y_{1:t-1}\right] \leq \exp\Bigl(\frac{\lambda^2 \sigma^2}{2}\Bigr).$$
Bounding conditions are imposed on features and parameters:
$$\|x\| \leq L \quad \forall x\in D_t, \qquad \|\theta^*\| \leq S, \qquad |\langle\theta^*,x\rangle| \leq B \leq L S \quad \forall x\in D_t.$$

This model covers nonstationary or adversarial context generation, and its reward structure underpins both regret minimization and statistical estimation tasks (Shariff et al., 2018).
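The data-generating process above can be sketched in a few lines. The particular embedding, parameter vector, and noise scale below are illustrative assumptions for the sketch, not values from any cited paper:

```python
import numpy as np

rng = np.random.default_rng(0)

d, k = 5, 4                      # feature dimension, number of arms
theta_star = rng.normal(size=d)  # unknown parameter (hidden from the learner)
theta_star /= np.linalg.norm(theta_star)  # enforce ||theta*|| <= S = 1
sigma = 0.1                      # subgaussian noise scale

def features(context, action):
    """Illustrative embedding phi(c, a): pad the context, shift by the action."""
    phi = np.zeros(d)
    phi[: len(context)] = context
    phi = np.roll(phi, action)
    return phi / max(1.0, np.linalg.norm(phi))  # enforce ||x|| <= L = 1

def reward(x):
    """y_t = <theta*, x_t> + eta_t with Gaussian (hence subgaussian) noise."""
    return theta_star @ x + sigma * rng.normal()

context = rng.normal(size=3)
D_t = [features(context, a) for a in range(k)]  # feasible decision vectors
y = reward(D_t[0])
```

Any other embedding satisfying the norm bounds would serve equally well; only the linear-in-$\theta^*$ mean structure matters for the analysis.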

2. Regret Formulation and Performance Metrics

A principal metric is the cumulative pseudo-regret over $n$ rounds:
$$\widehat R_n = \sum_{t=1}^n \Bigl[\,\max_{x\in D_t}\langle\theta^*,x\rangle - \langle\theta^*,x_t\rangle\,\Bigr] = \sum_{t=1}^n \max_{x\in D_t}\langle\theta^*,\,x - x_t\rangle.$$
Since $\mathbb{E}[y_t\mid x_t]=\langle\theta^*,x_t\rangle$, this equals the expected regret. Regret is analyzed in several regimes:

  • Worst-case (minimax) regret: Focuses on adversarially chosen context sequences.
  • Instance-dependent or simple regret: Relevant in RLHF and dueling settings (Scheid et al., 2024), where the best-per-context arm is sought.
  • Contextual “bandits with knapsacks” regret adds constraints on cumulative resource consumption (Agrawal et al., 2015).

UCB-style algorithms exploit confidence sets for $\theta^*$, often ellipsoidal, constructed from observed data and regularized Gram matrices:
$$G_t = \sum_{s<t} x_s x_s^\top,\qquad V_t = G_t + H_t,$$
where the regularizer $H_t \succeq 0$ controls invertibility and eigenvalue spread, with the guarantees

$$\rho_{\min}I_d \preceq H_t \preceq \rho_{\max}I_d,\qquad V_t\succeq G_t + \rho_{\min}I_d.$$

Such structures enable tight regret bounds scaling as $\tilde O(d \sqrt n)$, depending on the feature/parameter dimension $d$ and horizon $n$ (Shariff et al., 2018).
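Given a record of decision sets and chosen actions, the pseudo-regret defined above is direct to compute. The toy two-armed instance below is purely illustrative:

```python
import numpy as np

def pseudo_regret(theta_star, decision_sets, chosen):
    """Cumulative pseudo-regret: sum_t max_{x in D_t} <theta*, x - x_t>."""
    total = 0.0
    for D_t, x_t in zip(decision_sets, chosen):
        best = max(theta_star @ x for x in D_t)   # per-round optimal value
        total += best - theta_star @ x_t
    return total

theta = np.array([1.0, 0.0])
D = [[np.array([1.0, 0.0]), np.array([0.0, 1.0])]] * 3  # same set each round
picks = [D[0][1], D[0][0], D[0][0]]  # one suboptimal pick, then two optimal
# pseudo_regret(theta, D, picks) == 1.0
```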

3. Algorithmic Realizations and Analytical Tools

3.1 Linear-UCB and Variants

The canonical algorithm maintains an estimator

$$\hat\theta_t = V_t^{-1} \sum_{s<t} x_s y_s,$$

and selects actions optimistically from the confidence ellipsoid

$$E_t = \{\theta:\|\theta-\hat\theta_t\|_{V_t} \leq \beta_t\}.$$

Optimistic action choice is

$$x_t = \arg\max_{x\in D_t}\,\max_{\theta\in E_t}\langle\theta, x\rangle,$$

computable as the $x$ maximizing $\langle\hat\theta_t, x\rangle + \beta_t \|x\|_{V_t^{-1}}$.
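A minimal sketch of this rule, assuming a fixed confidence radius $\beta$ and the constant regularizer $H_t = \lambda I_d$ (the theory prescribes a slowly growing $\beta_t$; a constant is used here only to keep the sketch short):

```python
import numpy as np

class LinUCB:
    """Minimal LinUCB sketch: ridge estimate plus optimistic bonus beta*||x||_{V^-1}."""
    def __init__(self, d, reg=1.0, beta=1.0):
        self.V = reg * np.eye(d)  # V_t = H_t + sum_s x_s x_s^T, with H_t = reg * I_d
        self.b = np.zeros(d)      # sum_s x_s y_s
        self.beta = beta          # fixed confidence radius (an assumption here)

    def select(self, D_t):
        V_inv = np.linalg.inv(self.V)
        theta_hat = V_inv @ self.b                # ridge estimator of theta*
        def ucb(x):
            return theta_hat @ x + self.beta * np.sqrt(x @ V_inv @ x)
        return max(D_t, key=ucb)                  # optimistic action choice

    def update(self, x, y):
        self.V += np.outer(x, x)
        self.b += y * x
```

Each round, call `select` on the current decision set, observe the reward, and call `update`; the bonus term $\beta\|x\|_{V_t^{-1}}$ shrinks along well-explored directions, which is exactly what the elliptical potential argument quantifies.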

3.2 Regret Control via Elliptical Potentials

Analysis depends on bounding the elliptical potential:
$$\sum_{t=1}^n \min\{1, \|x_t\|_{V_t^{-1}}^2\}\leq \tilde O\left(d\log\left(1+\frac{n L^2}{\rho_{\min}}\right)\right).$$
This bound is fundamental in proving $\tilde O(d\sqrt n)$ regret in the base model and generalizes to contextual and hybrid settings (Das et al., 2024).
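The potential can be checked numerically. The unit-norm random features below are an illustrative choice, and the comparison is against the deterministic $2\log(\det V_n/\det V_0)$ form of the elliptical potential lemma, which holds for any feature sequence:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, rho_min = 5, 2000, 1.0

V = rho_min * np.eye(d)          # V_1 = H = rho_min * I_d (regularizer only)
potential = 0.0
for _ in range(n):
    x = rng.normal(size=d)
    x /= np.linalg.norm(x)       # unit-norm feature, so L = 1
    potential += min(1.0, x @ np.linalg.solve(V, x))
    V += np.outer(x, x)

# Deterministic lemma: potential <= 2 log(det V_{n+1} / det V_1),
# which is O(d log(1 + n L^2 / (d rho_min))) — the rate quoted in the text.
lemma_bound = 2.0 * (np.linalg.slogdet(V)[1] - d * np.log(rho_min))
```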

3.3 Adaptations for Constraints and Privacy

  • For bandits with knapsacks, the reward model is augmented with context-linear resource consumption; resource constraints are enforced via primal-dual algorithms and adjusted reward maximization (Agrawal et al., 2015).
  • Differential privacy is addressed via joint or tree-based privacy mechanisms, with regularizer noise injected so the confidence ellipsoid analysis still holds (Shariff et al., 2018).

4. Model Variations and Generalizations

4.1 Hybrid and Disjoint Linear Models

  • Hybrid reward models allow per-arm parameter vectors (disjoint), shared global parameters, or a combination. This unifies the “shared” (only global parameter) and “disjoint” (one linear model per arm) cases.
  • Action embedding is via sparse expansion, and LinUCB/DisLinUCB are special cases. Algorithms such as HyLinUCB adapt exploration by exploiting sparsity in the parameter vector dimension (Das et al., 2024).

4.2 Long-Horizon Dependencies

  • Models where $y_t$ depends on a linear filter over $s\ll h$ past context–action pairs, yielding high-dimensional but sparse reward models. Block-sparse recovery and restricted isometry properties for block-circulant designs enable horizon-independent regret (Qin et al., 2023).

4.3 Contextual MDPs and RL

  • When transitions and rewards in Markov decision processes are context-dependent and linear in context features or weights, generalizations include models with context-varying weights and features (Deng et al., 2024).

4.4 Interference and Multi-unit Contexts

  • In interference-aware linear contextual bandits, each unit's reward may be affected by the actions of others, requiring matrix-valued weights to quantify spillover. Linear model structure persists, now embedded in larger covariate spaces tracking joint actions (Xu et al., 2024).

4.5 Representation Learning

  • The embedding $\phi$ may be unknown and learned online from a family $\Phi=\{\phi_j\}$; regret then depends on both parameter estimation and identification/discrimination between candidate representations. The asymptotic cost is characterized by an explicit optimization, and in some regimes, “representation learning is for free,” while in others, it is provably as hard as tabular learning (Tirinzoni et al., 2022).

5. Connections to Classical Bandit Frameworks

The contextual linear reward model nests several classic paradigms:

| Setting | Feature set $D_t$ | Model parameters |
|---------|-------------------|------------------|
| $k$-armed MAB | $\{e_1,\ldots,e_k\}$ | $\theta^*\in\mathbb{R}^k$ |
| Contextual linear bandit | arbitrary $D_t\subset\mathbb{R}^d$ | $\theta^*\in\mathbb{R}^d$ |
| Disjoint linear bandit | block-diagonal $\phi$ | $\{\theta^*_a\}$ |
| Bandits with knapsacks | reward plus linear resources | $\theta^*, W^*$ |

When $D_t$ is the set of canonical basis vectors, contextual bandits reduce to $k$-armed bandits with fixed means $\mu_i = \theta^*_i$.
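This reduction can be verified directly: with canonical basis features, the inner product simply reads off the arm means (the numbers below are illustrative):

```python
import numpy as np

theta_star = np.array([0.2, 0.5, 0.1])   # illustrative arm means mu_i
basis = list(np.eye(3))                  # D_t = {e_1, e_2, e_3}
means = [float(theta_star @ e) for e in basis]
# means == [0.2, 0.5, 0.1]: <theta*, e_i> recovers mu_i = theta*_i
best_arm = int(np.argmax(means))         # index of the optimal arm
```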

  • Dueling bandits / RLHF: The contextual linear model underpins ranking tasks, e.g., in RLHF pairwise comparisons, with preferences modeled by $\mathbb{P}[a_1\succ a_2] = \sigma(\langle\theta^*, a_1-a_2\rangle)$ and design-based approaches for optimal labeling batches (Scheid et al., 2024).
  • Action-centered models: Linear treatment effects are estimated in the presence of unknown and possibly adversarial baseline rewards, by randomization and centering, with regret bounds depending only on the treatment effect dimension (Greenewald et al., 2017).
  • Unknown link / single-index bandits: The linear contextual reward model is generalized via an unknown (possibly monotone) link function $f$, i.e., $y_t = f(x_t^\top \theta^*) + \eta_t$, yielding single-index models with specialized parameter estimation techniques (Stein’s estimator), achieving sublinear regret in both monotone and general link settings (Kang et al., 2025).
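The pairwise-preference link used in the dueling/RLHF setting can be sketched as follows; the parameter and action vectors are illustrative:

```python
import numpy as np

def pref_prob(theta, a1, a2):
    """P[a1 beats a2] = sigma(<theta, a1 - a2>), the linear Bradley-Terry-style link."""
    return 1.0 / (1.0 + np.exp(-(theta @ (a1 - a2))))

theta = np.array([1.0, -1.0])
a1, a2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
p = pref_prob(theta, a1, a2)   # sigma(2) ~ 0.88: a1 is strongly preferred
```

Note the model is invariant to swapping arguments in the expected way: $\mathbb{P}[a_2\succ a_1] = 1 - \mathbb{P}[a_1\succ a_2]$, since $\sigma(-u)=1-\sigma(u)$.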

6. Theoretical Guarantees and Limiting Behavior

Under mild boundedness and subgaussianity, contextual linear reward models admit

  • $\tilde O(d\sqrt n)$ regret for LinUCB and related algorithms,
  • instance-dependent (logarithmic) regret lower and upper bounds when embedding or representation are to be learned (Tirinzoni et al., 2022),
  • horizon-independent regret rates in long-memory models with $s$-sparse temporal filters (Qin et al., 2023),
  • minimax-optimal simple-regret in offline dueling RLHF designs (Scheid et al., 2024).

These guarantees and the associated analytical techniques—self-normalized martingale inequalities, elliptical potential arguments, primal-dual optimization, optimal experimental design, and block-sparse recovery—establish the contextual linear reward model as a powerful and adaptable theoretical backbone across online learning, sequential decision making, and algorithmic causal inference.
