
Generalized Functional UCB (GFUCB)

Updated 10 December 2025
  • GFUCB is a framework for multitask representation learning that generalizes optimism-based exploration to nonlinear, nonparametric function classes in decision processes.
  • It employs adaptive confidence sets quantified by covering and eluder dimensions to achieve significant sample efficiency gains over isolated single-task approaches.
  • The framework underpins practical algorithms for both contextual bandits and episodic MDPs, offering provable finite-sample regret guarantees and supporting transfer learning.

Generalized Functional UCB (GFUCB) is a theoretical and algorithmic framework for multitask representation learning in stochastic decision processes with general (potentially nonlinear, nonparametric) function classes. It provides the first finite-sample regret guarantees that rigorously extend beyond linear settings to multitask contextual bandits and episodic Markov Decision Processes (MDPs) with shared but unknown representation functions, such as neural networks. GFUCB achieves provably improved sample efficiency over independent single-task approaches by exploiting shared structure via adaptive confidence sets in function space, quantified by covering and eluder dimensions. The framework underpins recent advances in multitask and transfer learning, offering both practical algorithms and a unified analysis for online and transfer scenarios in RL and bandits (Lu et al., 2022, Lu et al., 1 Mar 2025).

1. Multitask Decision Process Model and Representation Structure

The GFUCB framework considers M multitask contextual bandits (and, by extension, finite-horizon MDPs) where each task i receives sequential context–action pairs and noisy rewards. Each unknown task-specific reward (or value) function $f^{(i)}$ is modeled as linear in a (typically unknown and nonlinear) shared representation $\phi \in \Phi$:

$$f^{(i)}(x) = \langle \phi(x), \theta_i \rangle, \quad \Vert \phi(x) \Vert_2 \leq 1, \quad \Vert \theta_i \Vert_2 \leq \sqrt{k}.$$

Here, $\Phi$ is a non-convex function class that may contain deep networks or kernel-based mappings; the induced class $F = L \circ \Phi$ is assumed to have finite covering number $N(\Phi, \alpha)$ and bounded eluder dimension $d_E(F, \epsilon)$ for all $\epsilon > 0$.

The agent plays all M tasks jointly, observing context sets and choosing actions for each, with the goal of minimizing total regret:

$$\mathrm{Reg}(T) = \sum_{t=1}^{T} \sum_{i=1}^M \left(f^{(i)}(C_{t,i}, A^*_{t,i}) - f^{(i)}(C_{t,i}, A_{t,i})\right),$$

where $A^*_{t,i}$ is the optimal action on task i at round t.

This multitask structure is also instantiated episodically in MDPs, where for each task MDP a shared $\phi$ gives optimal Q-functions for each horizon step as $Q_h^{(i)}(s,a) = \langle \phi(s,a), \theta_h^{(i)} \rangle$ for task-specific heads $\theta_h^{(i)}$.
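The shared-representation model above can be sketched in a few lines. This is an illustrative toy, not code from the papers: `phi` is a fixed random ReLU feature map standing in for the learned representation, and the dimensions, names, and normalization are our own choices made to satisfy the stated norm constraints.

```python
import math
import random

random.seed(0)
DIM_X, K, M = 4, 3, 5  # input dim, feature dim k, number of tasks M

# Shared representation phi: R^DIM_X -> R^K (random ReLU features, rescaled
# so that ||phi(x)||_2 <= 1 as required by the model).
W = [[random.gauss(0, 1) for _ in range(DIM_X)] for _ in range(K)]

def phi(x):
    z = [max(0.0, sum(w_j * x_j for w_j, x_j in zip(row, x))) for row in W]
    norm = math.sqrt(sum(v * v for v in z))
    return [v / max(1.0, norm) for v in z]  # enforce ||phi(x)||_2 <= 1

# Task-specific linear heads theta_i with ||theta_i||_2 <= sqrt(k).
thetas = [[random.uniform(-1, 1) for _ in range(K)] for _ in range(M)]

def f(i, x):
    """Reward of task i at context-action feature x: <phi(x), theta_i>."""
    return sum(p * t for p, t in zip(phi(x), thetas[i]))
```

All M tasks evaluate the same `phi`; only the head `thetas[i]` differs per task, which is exactly the structure GFUCB exploits.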

2. Generalized Functional UCB Principle

GFUCB generalizes optimism-based exploration (UCB) to the setting where the function class is nonparametric or highly expressive. The essential steps are:

  1. Empirical Risk Minimization: Pool all M-task experience $(x_{s,i}, R_{s,i})$ up to round t and compute the empirical risk minimizer over the function class $\mathcal{F}^{\otimes M}$:

$$\hat{f}_t = \arg\min_f \sum_{i=1}^M \sum_{s=1}^{t-1} \left(f^{(i)}(x_{s,i}) - R_{s,i}\right)^2,$$

subject to $f^{(i)}(x) = \langle \phi(x), w_i \rangle$, $\phi \in \Phi$, $\Vert w_i \Vert_2 \leq \sqrt{k}$.

  2. Functional Confidence Set Construction: Define a confidence set

$$F_t = \left\{ f \in \mathcal{F}^{\otimes M} : \Vert f - \hat{f}_t \Vert_{2,E_t}^2 \leq \beta_t, \; |f^{(i)}(x)| \leq 1 \right\},$$

where the confidence radius $\beta_t$ depends on covering numbers and the task count:

$$\beta_t \asymp M k + \log N(\Phi, \alpha, \Vert \cdot \Vert_\infty),$$

for discretization parameter $\alpha$.

  3. Optimistic Joint Action Selection: For each round, choose for all tasks jointly:

$$(f_t, \{A_{t,i}\}_{i=1}^M) = \arg\max_{f \in F_t,\, A_i \in \mathcal{A}_{t,i}} \sum_{i=1}^M f^{(i)}(C_{t,i}, A_i),$$

naturally encouraging exploration of actions with large functional uncertainty.

This functional–space extension circumvents the limitations of ellipsoidal confidence sets that are only feasible in linear or RKHS settings and can be adapted for layered or deep representations, provided the function class has controlled covering and eluder dimension (Lu et al., 2022, Lu et al., 1 Mar 2025).
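The three steps above can be made concrete in a toy finite-class instantiation where both the ERM and the optimistic maximum are exact enumerations. Everything here is illustrative: the candidate representations, heads, replay data, and the loss-gap form of the confidence set (a common simplification of the empirical-L2 ball) are our own choices, not the papers' implementation.

```python
import itertools

M = 2  # tasks

# A tiny explicit "function class": two candidate representations phi and a
# few candidate linear heads per task (stand-ins for Phi and the w_i balls).
PHIS = [lambda x: [x, 1.0], lambda x: [x * x, 1.0]]
HEADS = [[-0.5, 0.0], [0.5, 0.0], [0.0, 0.5]]

def predict(rep, w, x):
    return sum(p * wi for p, wi in zip(rep(x), w))

# Replayed experience per task: (context-action feature x, observed reward R).
history = [[(0.1, 0.05), (0.8, 0.41)], [(0.2, -0.09), (0.9, -0.44)]]

def sq_loss(rep, heads):
    return sum(sum((predict(rep, w, x) - r) ** 2 for x, r in data)
               for w, data in zip(heads, history))

# Step 1: joint ERM over the shared representation and all task heads.
candidates = [(rep, list(ws)) for rep in PHIS
              for ws in itertools.product(HEADS, repeat=M)]
f_hat = min(candidates, key=lambda c: sq_loss(*c))

# Step 2: functional confidence set F_t (here: all candidates whose pooled
# squared loss is within beta_t of the ERM loss).
beta_t = 0.05
F_t = [c for c in candidates if sq_loss(*c) <= sq_loss(*f_hat) + beta_t]

# Step 3: optimistic joint action selection over f in F_t and actions.
actions = [0.1, 0.5, 0.9]
best = max(((rep, heads, a_vec)
            for rep, heads in F_t
            for a_vec in itertools.product(actions, repeat=M)),
           key=lambda c: sum(predict(c[0], w, a)
                             for w, a in zip(c[1], c[2])))
```

In the real algorithm both maximizations are over an expressive (e.g., neural) class and are solved approximately; the enumeration here is only to show the logic of the three steps.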

3. Regret Analysis and Sample Efficiency

The central theoretical result is that GFUCB admits a regret bound that scales as:

$$\mathrm{Reg}(T) = \tilde{O}\left(\sqrt{M\, d\, T\, [M k + \log N(\Phi, \alpha)]}\right),$$

where $d = d_E(F, \epsilon)$ is the eluder dimension of the induced value class $F$, and $N(\Phi, \alpha)$ is the $\alpha$-covering number of $\Phi$. In contrast, independent treatment of each bandit (or MDP) would incur regret scaling with $M$ outside the square root, i.e., $\tilde{O}(M \sqrt{d T \log N(\Phi, \alpha)})$, which is suboptimal when M is large.
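A quick numeric comparison makes the gap concrete. The constants below are illustrative placeholders, chosen only to show that with many tasks and a large class (large $\log N$), the joint rate beats the independent one:

```python
import math

# Illustrative constants: M tasks, eluder dimension d, horizon T, feature
# dim k, and log covering number logN of the representation class.
M, d, T, k, logN = 50, 10, 10_000, 5, 1_000.0

# Joint GFUCB bound: sqrt(M * d * T * (M*k + log N))
joint = math.sqrt(M * d * T * (M * k + logN))

# M independent learners: M * sqrt(d * T * log N)
independent = M * math.sqrt(d * T * logN)
```

With these numbers `joint` is several times smaller than `independent`; the advantage grows as $\log N(\Phi, \alpha)$ dominates $Mk$, i.e., as the representation class gets richer.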

In the special case of linear function classes, this recovers prior bounds:

$$\mathrm{Reg}(T) = \tilde{O}\left(M \sqrt{d T k} + d \sqrt{M T k}\right),$$

but for arbitrary nonlinear function classes, it newly quantifies the joint benefit of shared representation learning in online, general-function, multitask settings.

In the multitask MDP setting, the corresponding result is:

$$\mathrm{Reg}(T) = \tilde{O}\left(M H \sqrt{T d k} + H \sqrt{M T d \log N(\Phi, \alpha)} + M H T\, I \sqrt{d}\right),$$

where $I$ denotes the inherent Bellman error due to model-class mismatch.

This improvement is fundamentally enabled by joint confidence-set shrinkage: learning the shared $\phi$ across all tasks concentrates uncertainty reduction in the feature space at a rate accelerated by $M$, in contrast to the $M$-fold redundant exploration of single-task methods (Lu et al., 2022, Lu et al., 1 Mar 2025).

4. Algorithmic Instantiation and Extension

The practical implementation of GFUCB encompasses both contextual bandits and episodic RL algorithms (e.g., LSVI-UCB for MDPs). The key steps are:

  • Maintain rolling experience replay across all M tasks.
  • Solve the joint empirical risk minimization (least-squares) to update the shared representation $\phi$ and task heads $w_i$ at each round.
  • Construct the confidence set $F_t$ via the empirical L2-norm in function space.
  • Optimize for the most optimistic (highest-reward) combination of actions and functions permitted by $F_t$ for all tasks in parallel.
  • In deep or kernelized instantiations, solve optimization subproblems via SGD or Adam, and use Lagrangian penalties to approximately enforce confidence-ball constraints.
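The Lagrangian-penalty trick from the last bullet can be sketched as a single objective to maximize by SGD or Adam. This is a hedged sketch of the idea, not the papers' code: `params`, `sq_loss`, `optimistic_value`, and the hinge form of the penalty are illustrative placeholders we introduce here.

```python
def penalized_objective(params, sq_loss, optimistic_value, erm_loss, beta_t,
                        lam=10.0):
    """Objective to *maximize* over network parameters (illustrative sketch).

    optimistic_value(params): sum over tasks of the best predicted reward.
    sq_loss(params):          pooled squared error on replayed experience.
    erm_loss:                 loss attained by the empirical risk minimizer.

    The hinge term softly enforces sq_loss <= erm_loss + beta_t, i.e. the
    confidence-ball constraint defining F_t, instead of a hard projection.
    """
    violation = max(0.0, sq_loss(params) - erm_loss - beta_t)
    return optimistic_value(params) - lam * violation
```

Maximizing this unconstrained surrogate lets standard first-order optimizers approximate the constrained optimistic step; `lam` trades off optimism against constraint violation.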

This flexible approach is generic for any function class whose covering number and eluder dimension can be bounded. The same principle extends to transfer learning: after pretraining on M tasks, the learned $\hat{\phi}$ can be fixed and a linear (or small) head $w_{\mathrm{target}}$ rapidly learned for an unseen target task, with regret scaling only as $O(\sqrt{T})$ rather than $O(T)$, enabled by the pre-established accuracy of $\hat{\phi}$ (Lu et al., 1 Mar 2025).
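The transfer step reduces to ordinary least squares once the representation is frozen. A minimal sketch, with a scalar feature for brevity (the quadratic `phi_hat`, the noise level, and all names are illustrative assumptions):

```python
import random

random.seed(1)

# Stand-in for a pretrained, now-frozen representation hat{phi}.
phi_hat = lambda x: x * x
w_true = 0.7  # unknown head of the target task (for generating toy data)

# Toy target-task data: y = w_true * phi_hat(x) + small noise.
xs = [random.uniform(-1, 1) for _ in range(200)]
ys = [w_true * phi_hat(x) + random.gauss(0, 0.01) for x in xs]

# Closed-form least squares for the single scalar head w_target:
# w = sum(phi(x) * y) / sum(phi(x)^2).
num = sum(phi_hat(x) * y for x, y in zip(xs, ys))
den = sum(phi_hat(x) ** 2 for x in xs)
w_target = num / den
```

Because only the low-dimensional head is estimated, the target task needs far fewer samples than learning $\phi$ from scratch, which is the source of the $O(\sqrt{T})$ transfer regret.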

5. Empirical Evaluation in Neural and Nonlinear Regimes

Theoretical findings are supported by experiments (Section 6 of Lu et al., 2022 and Lu et al., 1 Mar 2025):

  • GFUCB implemented with neural network representations significantly outpaces $\varepsilon$-greedy and parallel single-task methods in M-task contextual bandits.
  • Regret curves empirically demonstrate that adding more tasks (joint training) sharpens the confidence shrinkage and reduces overall regret.
  • UCB "bonus" terms decrease as a function of the pooled data, aligning with theoretical confidence set tightness.
  • Robustness is observed even for “moderate” deviations from a strict linear-in-$\phi$ structure, supporting the generality of the functional analysis.

This underscores the practicality of the approach in nonparametric scenarios, especially for multitask action-selection and transfer in deep RL.

6. Connections to Prior Theory and Key Novelty

GFUCB bridges and extends several distinct themes in multitask and representation learning:

  • It generalizes bandit and RL confidence-set construction and regret decomposition to arbitrary function classes.
  • Covering numbers and eluder dimensions are rigorously used to control uncertainty and inform exploration in highly expressive settings.
  • It enables provable multitask gains (regret improvement) that interpolate between purely linear and fully nonparametric cases—retaining statistical advantages as models become wider or deeper, as long as capacity measures are finite.
  • Unlike prior multitask algorithms that require convexity, known linear features, or tractable inverse-covariance computations, GFUCB's function-space approach overcomes these practical and theoretical barriers (Lu et al., 2022, Lu et al., 1 Mar 2025).

7. Limitations, Transfer Conditions, and Future Directions

The effectiveness of GFUCB depends on several key conditions:

  • Finite covering number and eluder dimension for $\Phi$ (restricting to “sufficiently regular” deep nets or kernel classes).
  • Effective update oracles for non-convex least-squares and optimistic max-action steps (this is computationally nontrivial for arbitrarily complex networks).
  • Inherent Bellman error must be suitably small for near-linear representability.
  • Transferability is strongest when the target task belongs to, or is close to, the span of the pretraining tasks (so that $\phi$ generalizes).

Open avenues include tightening computational efficiency, extending to scenarios with partial sharing of ϕ\phi, further characterizing transfer beyond the multitask regime, and integrating with self-supervised or unsupervised representation learning frameworks to handle extremely large or continuously varying function classes.


References

  • "Provable General Function Class Representation Learning in Multitask Bandits and MDPs" (Lu et al., 2022)
  • "Towards Understanding the Benefit of Multitask Representation Learning in Decision Process" (Lu et al., 1 Mar 2025)
