
Contextual Linear Reward Model

Updated 20 January 2026
  • The model defines expected rewards as a linear function of known context–action features and an unknown parameter, enabling rigorous analysis of sequential decision-making.
  • It underpins various settings including classical bandits, hybrid models, and constrained environments, with algorithms like LinUCB achieving near-optimal regret rates.
  • Its flexibility extends to nonstationary, adversarial, and privacy-aware scenarios, offering vital insights for reinforcement learning and causal inference.

The contextual linear reward model defines a broad and rigorously analyzed class of stochastic sequential decision problems, encompassing contextual bandits and extending into reinforcement learning, constrained resource settings, and models with hybrid, adversarial, or even nonstationary components. Central to this framework is the assumption that the expected reward for each action taken in a specific context is a linear function of a known context–action feature mapping and an unknown parameter vector, with the learner’s objective being to maximize cumulative reward or minimize regret over a fixed or possibly unbounded time horizon. The structure, assumptions, and technical analysis of the contextual linear reward model underpin algorithmic approaches and theoretical guarantees throughout contemporary bandit and RL research.

1. Formal Definition and Statistical Structure

Let $t=1,2,\dots,n$ denote sequential rounds. At each round $t$, the learner observes a context $c_t$ (possibly chosen adversarially). The available action space $A$ is fixed, and each context–action pair $(c_t,a)$, for $a\in A$, is mapped into $\mathbb{R}^d$ via a known embedding

$$\phi(c_t, a)\in\mathbb{R}^d,$$

defining the set of feasible decision vectors

$$D_t = \left\{\phi(c_t, a)\mid a \in A\right\}\subset \mathbb{R}^d.$$

Upon choosing $x_t\in D_t$ (identifying arm $a_t\in A$), the learner receives a noisy stochastic reward
$$y_t = \langle\theta^*, x_t\rangle + \eta_t,$$
where $\theta^* \in \mathbb{R}^d$ is a fixed but unknown parameter vector and $\eta_t$ is a conditionally $\sigma^2$-subgaussian noise term, satisfying for all $\lambda\in\mathbb{R}$:
$$\mathbb{E}\left[e^{\lambda\,\eta_t}\mid x_{1:t},y_{1:t-1}\right] \leq \exp\Bigl(\frac{\lambda^2 \sigma^2}{2}\Bigr).$$
Bounding conditions are imposed on features and parameters:
$$\|x\| \leq L \quad \forall x\in D_t, \qquad \|\theta^*\| \leq S, \qquad |\langle\theta^*,x\rangle| \leq B \leq L S \quad \forall x\in D_t.$$

This model covers nonstationary or adversarial context generation, and its reward structure underpins both regret minimization and statistical estimation tasks (Shariff et al., 2018).
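The data-generating process above can be sketched in a few lines. The particular embedding, parameter vector, and noise scale below are illustrative assumptions for the sketch, not values from any cited paper:

```python
import numpy as np

rng = np.random.default_rng(0)

d, k = 5, 4                      # feature dimension, number of arms
theta_star = rng.normal(size=d)  # unknown parameter (hidden from the learner)
theta_star /= np.linalg.norm(theta_star)  # enforce ||theta*|| <= S = 1
sigma = 0.1                      # subgaussian noise scale

def features(context, action):
    """Illustrative embedding phi(c, a): pad the context, shift by the action."""
    phi = np.zeros(d)
    phi[: len(context)] = context
    phi = np.roll(phi, action)
    return phi / max(1.0, np.linalg.norm(phi))  # enforce ||x|| <= L = 1

def reward(x):
    """y_t = <theta*, x_t> + eta_t with Gaussian (hence subgaussian) noise."""
    return theta_star @ x + sigma * rng.normal()

context = rng.normal(size=3)
D_t = [features(context, a) for a in range(k)]  # feasible decision vectors
y = reward(D_t[0])
```

Any other embedding satisfying the norm bounds would serve equally well; only the linear-in-$\theta^*$ mean structure matters for the analysis.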

2. Regret Formulation and Performance Metrics

A principal metric is the cumulative pseudo-regret over $n$ rounds:
$$\widehat R_n = \sum_{t=1}^n \Bigl[\,\max_{x\in D_t}\langle\theta^*,x\rangle - \langle\theta^*,x_t\rangle\,\Bigr] = \sum_{t=1}^n \max_{x\in D_t}\langle\theta^*,\,x - x_t\rangle.$$
Since $\mathbb{E}[y_t\mid x_t]=\langle\theta^*,x_t\rangle$, this equals the expected regret. Regret is analyzed in several regimes:

  • Worst-case (minimax) regret: Focuses on adversarially chosen context sequences.
  • Instance-dependent or simple regret: Relevant in RLHF and dueling settings (Scheid et al., 2024), where the best-per-context arm is sought.
  • Contextual “bandits with knapsacks” regret adds constraints on cumulative resource consumption (Agrawal et al., 2015).

UCB-style algorithms exploit confidence sets for $\theta^*$, often ellipsoidal, constructed from observed data and regularized Gram matrices:
$$G_t = \sum_{s<t} x_s x_s^\top,\qquad V_t = G_t + H_t,$$
where the regularizer $H_t \succeq 0$ controls invertibility and eigenvalue spread, with the guarantees

$$\rho_{\min}I_d \preceq H_t \preceq \rho_{\max}I_d,\qquad V_t\succeq G_t + \rho_{\min}I_d.$$

Such structures enable tight regret bounds scaling as $\tilde O(d \sqrt n)$, depending on the feature/parameter dimension $d$ and horizon $n$ (Shariff et al., 2018).
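Given a record of decision sets and chosen actions, the pseudo-regret defined above is direct to compute. The toy two-armed instance below is purely illustrative:

```python
import numpy as np

def pseudo_regret(theta_star, decision_sets, chosen):
    """Cumulative pseudo-regret: sum_t max_{x in D_t} <theta*, x - x_t>."""
    total = 0.0
    for D_t, x_t in zip(decision_sets, chosen):
        best = max(theta_star @ x for x in D_t)   # per-round optimal value
        total += best - theta_star @ x_t
    return total

theta = np.array([1.0, 0.0])
D = [[np.array([1.0, 0.0]), np.array([0.0, 1.0])]] * 3  # same set each round
picks = [D[0][1], D[0][0], D[0][0]]  # one suboptimal pick, then two optimal
# pseudo_regret(theta, D, picks) == 1.0
```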

3. Algorithmic Realizations and Analytical Tools

3.1 Linear-UCB and Variants

The canonical algorithm maintains an estimator

$$\hat\theta_t = V_t^{-1} \sum_{s<t} x_s y_s,$$

and selects actions optimistically from the confidence ellipsoid

$$E_t = \{\theta:\|\theta-\hat\theta_t\|_{V_t} \leq \beta_t\}.$$

Optimistic action choice is

$$x_t = \arg\max_{x\in D_t}\,\max_{\theta\in E_t}\langle\theta, x\rangle,$$

computable as the $x$ maximizing $\langle\hat\theta_t, x\rangle + \beta_t \|x\|_{V_t^{-1}}$.
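A minimal sketch of this rule, assuming a fixed confidence radius $\beta$ and the constant regularizer $H_t = \lambda I_d$ (the theory prescribes a slowly growing $\beta_t$; a constant is used here only to keep the sketch short):

```python
import numpy as np

class LinUCB:
    """Minimal LinUCB sketch: ridge estimate plus optimistic bonus beta*||x||_{V^-1}."""
    def __init__(self, d, reg=1.0, beta=1.0):
        self.V = reg * np.eye(d)  # V_t = H_t + sum_s x_s x_s^T, with H_t = reg * I_d
        self.b = np.zeros(d)      # sum_s x_s y_s
        self.beta = beta          # fixed confidence radius (an assumption here)

    def select(self, D_t):
        V_inv = np.linalg.inv(self.V)
        theta_hat = V_inv @ self.b                # ridge estimator of theta*
        def ucb(x):
            return theta_hat @ x + self.beta * np.sqrt(x @ V_inv @ x)
        return max(D_t, key=ucb)                  # optimistic action choice

    def update(self, x, y):
        self.V += np.outer(x, x)
        self.b += y * x
```

Each round, call `select` on the current decision set, observe the reward, and call `update`; the bonus term $\beta\|x\|_{V_t^{-1}}$ shrinks along well-explored directions, which is exactly what the elliptical potential argument quantifies.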

3.2 Regret Control via Elliptical Potentials

Analysis depends on bounding the elliptical potential:
$$\sum_{t=1}^n \min\{1, \|x_t\|_{V_t^{-1}}^2\}\leq \tilde O\left(d\log\left(1+\frac{n L^2}{\rho_{\min}}\right)\right).$$
This bound is fundamental in proving $\tilde O(d\sqrt n)$ regret in the base model and generalizes to contextual and hybrid settings (Das et al., 2024).
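The potential can be checked numerically. The unit-norm random features below are an illustrative choice, and the comparison is against the deterministic $2\log(\det V_n/\det V_0)$ form of the elliptical potential lemma, which holds for any feature sequence:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, rho_min = 5, 2000, 1.0

V = rho_min * np.eye(d)          # V_1 = H = rho_min * I_d (regularizer only)
potential = 0.0
for _ in range(n):
    x = rng.normal(size=d)
    x /= np.linalg.norm(x)       # unit-norm feature, so L = 1
    potential += min(1.0, x @ np.linalg.solve(V, x))
    V += np.outer(x, x)

# Deterministic lemma: potential <= 2 log(det V_{n+1} / det V_1),
# which is O(d log(1 + n L^2 / (d rho_min))) — the rate quoted in the text.
lemma_bound = 2.0 * (np.linalg.slogdet(V)[1] - d * np.log(rho_min))
```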

3.3 Adaptations for Constraints and Privacy

  • For bandits with knapsacks, the reward model is augmented with context-linear resource consumption; resource constraints are enforced via primal-dual algorithms and adjusted reward maximization (Agrawal et al., 2015).
  • Differential privacy is addressed via joint or tree-based privacy mechanisms, with regularizer noise injected so the confidence ellipsoid analysis still holds (Shariff et al., 2018).

4. Model Variations and Generalizations

4.1 Hybrid and Disjoint Linear Models

  • Hybrid reward models allow per-arm parameter vectors (disjoint), shared global parameters, or a combination. This unifies the “shared” (only global parameter) and “disjoint” (one linear model per arm) cases.
  • Action embedding is via sparse expansion, and LinUCB/DisLinUCB are special cases. Algorithms such as HyLinUCB adapt exploration by exploiting sparsity in the parameter vector dimension (Das et al., 2024).

4.2 Long-Horizon Dependencies

  • Models where $y_t$ depends on a linear filter over $s\ll h$ past context–action pairs, yielding high-dimensional but sparse reward models. Block-sparse recovery and restricted isometry properties for block-circulant designs enable horizon-independent regret (Qin et al., 2023).

4.3 Contextual MDPs and RL

  • When transitions and rewards in Markov decision processes are context-dependent and linear in context features or weights, generalizations include models with context-varying weights and features (Deng et al., 2024).

4.4 Interference and Multi-unit Contexts

  • In interference-aware linear contextual bandits, each unit's reward may be affected by the actions of others, requiring matrix-valued weights to quantify spillover. Linear model structure persists, now embedded in larger covariate spaces tracking joint actions (Xu et al., 2024).

4.5 Representation Learning

  • The embedding $\phi$ may be unknown and learned online from a family $\Phi=\{\phi_j\}$; regret then depends on both parameter estimation and identification/discrimination between candidate representations. The asymptotic cost is characterized by an explicit optimization, and in some regimes, “representation learning is for free,” while in others, it is provably as hard as tabular learning (Tirinzoni et al., 2022).

5. Connections to Classical Bandit Frameworks

The contextual linear reward model nests several classic paradigms:

| Setting | Feature set $D_t$ | Model parameters |
|---------|-------------------|------------------|
| $k$-armed MAB | $\{e_1,\ldots,e_k\}$ | $\theta^*\in\mathbb{R}^k$ |
| Contextual linear bandit | arbitrary $D_t\subset\mathbb{R}^d$ | $\theta^*\in\mathbb{R}^d$ |
| Disjoint linear bandit | block-diagonal $\phi$ | $\{\theta^*_a\}$ |
| Bandits with knapsacks | reward plus linear resources | $\theta^*, W^*$ |

When $D_t$ is the set of canonical basis vectors, contextual bandits reduce to $k$-armed bandits with fixed means $\mu_i = \theta^*_i$.
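This reduction can be verified directly: with canonical basis features, the inner product simply reads off the arm means (the numbers below are illustrative):

```python
import numpy as np

theta_star = np.array([0.2, 0.5, 0.1])   # illustrative arm means mu_i
basis = list(np.eye(3))                  # D_t = {e_1, e_2, e_3}
means = [float(theta_star @ e) for e in basis]
# means == [0.2, 0.5, 0.1]: <theta*, e_i> recovers mu_i = theta*_i
best_arm = int(np.argmax(means))         # index of the optimal arm
```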

  • Dueling bandits / RLHF: The contextual linear model underpins ranking tasks, e.g., in RLHF pairwise comparisons, with preferences modeled by $\mathbb{P}[a_1\succ a_2] = \sigma(\langle\theta^*, a_1-a_2\rangle)$ and design-based approaches for optimal labeling batches (Scheid et al., 2024).
  • Action-centered models: Linear treatment effects are estimated in the presence of unknown and possibly adversarial baseline rewards, by randomization and centering, with regret bounds depending only on the treatment effect dimension (Greenewald et al., 2017).
  • Unknown link / single-index bandits: The linear contextual reward model is generalized via an unknown (possibly monotone) link function $f$, i.e., $y_t = f(x_t^\top \theta^*) + \eta_t$, yielding single-index models with specialized parameter estimation techniques (Stein’s estimator), achieving sublinear regret in both monotone and general link settings (Kang et al., 2025).
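The pairwise-preference link used in the dueling/RLHF setting can be sketched as follows; the parameter and action vectors are illustrative:

```python
import numpy as np

def pref_prob(theta, a1, a2):
    """P[a1 beats a2] = sigma(<theta, a1 - a2>), the linear Bradley-Terry-style link."""
    return 1.0 / (1.0 + np.exp(-(theta @ (a1 - a2))))

theta = np.array([1.0, -1.0])
a1, a2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
p = pref_prob(theta, a1, a2)   # sigma(2) ~ 0.88: a1 is strongly preferred
```

Note the model is invariant to swapping arguments in the expected way: $\mathbb{P}[a_2\succ a_1] = 1 - \mathbb{P}[a_1\succ a_2]$, since $\sigma(-u)=1-\sigma(u)$.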

6. Theoretical Guarantees and Limiting Behavior

Under mild boundedness and subgaussianity, contextual linear reward models admit

  • $\tilde O(d\sqrt n)$ regret for LinUCB and related algorithms,
  • instance-dependent (logarithmic) regret lower and upper bounds when embedding or representation are to be learned (Tirinzoni et al., 2022),
  • horizon-independent regret rates in long-memory models with $s$-sparse temporal filters (Qin et al., 2023),
  • minimax-optimal simple-regret in offline dueling RLHF designs (Scheid et al., 2024).

These guarantees and the associated analytical techniques—self-normalized martingale inequalities, elliptical potential arguments, primal-dual optimization, optimal experimental design, and block-sparse recovery—establish the contextual linear reward model as a powerful and adaptable theoretical backbone across online learning, sequential decision making, and algorithmic causal inference.
