Randomized Q-Learning: Scalable and Efficient

Updated 21 January 2026
  • Randomized Q-Learning is a model-free RL approach that randomizes learning rates and maximization steps to drive efficient exploration in complex environments.
  • It unifies ensemble methods, stochastic subset selection, and Bayesian-inspired techniques to achieve provable regret bounds and tractable computations.
  • Empirical evaluations show RandQL converges rapidly in high-dimensional action spaces while reducing computational burden compared to classical Q-Learning.

Randomized Q-Learning (RandQL), also referred to as RandomizedQ or Stochastic Q-learning, denotes a class of model-free reinforcement learning (RL) algorithms characterized by the use of randomization as a principal mechanism for both exploration and computational efficiency. Unlike classical Q-learning, which relies on deterministic or scheduled learning rates and maximization across all actions, RandQL leverages randomized learning rates and/or stochastic maximization procedures. This approach enables efficient posterior sampling-based exploration, provable regret minimization, and tractable scaling to environments with large or structured state-action spaces (Wang et al., 30 Jun 2025, Tiapkin et al., 2023, Fourati et al., 2024). RandQL unifies several algorithmic strands, including “Thompson-style” exploration, stochastic subset selection for maximization, and learning-rate randomization, with robust theoretical and empirical properties.

1. Algorithmic Foundations and Variants

RandQL algorithms target episodic MDPs with state space $\mathcal{S}$, action space $\mathcal{A}$, finite horizon $H$, and transition kernel $P_h(s' \mid s, a)$. Standard Q-learning proceeds via

$$Q_h(s, a) \leftarrow (1 - \alpha_n)\, Q_h(s, a) + \alpha_n \big[ r_h(s, a) + \max_{a'} Q_{h+1}(s', a') \big],$$

without explicit optimism or systematic posterior sampling. RandQL introduces randomization at different algorithmic loci, yielding several families:

  • Randomized Learning Rate Q-Learning: An ensemble of $J$ Q-functions is maintained. At each visit to $(s, a, h)$, each head $j$ uses an independent random learning rate $w \sim \operatorname{Beta}(\cdot, \cdot)$. A typical update is

$$Q^j_h(s, a) \leftarrow (1 - w)\, Q^j_h(s, a) + w \big[ r_h(s, a) + V_{h+1}(s') \big],$$

with $w$ drawn according to parameters that depend on the number of prior visits $m$ and the episode horizon $H$ (Wang et al., 30 Jun 2025, Tiapkin et al., 2023).

  • Optimism via Aggregation: The policy $Q$-value is set as the maximum across heads, or as an “optimistic mixture”:

$$Q_h(s, a) \leftarrow (1 - 1/H) \max_{j \in [J]} Q_h^j(s, a) + (1/H) \max_{j \in [J]} \widetilde Q_h^j(s, a),$$

wherein one ensemble is “fast-forgetting” and the other “slow-forgetting” to preserve optimism and anti-concentration for exploration purposes (Wang et al., 30 Jun 2025).

  • Stochastic Maximization in Large Action Spaces: For $|\mathcal{A}| = n \gg 1$, RandQL may replace $\max_{a'} Q(s', a')$ with maximization over a small random action subset $\mathcal{C}_t \subset \mathcal{A}$ of size $s_t = \mathcal{O}(\log n)$:

$$Q_{t+1}(s_t, a_t) = (1 - \alpha_t)\, Q_t(s_t, a_t) + \alpha_t \big[ r_t + \gamma \max_{b \in \mathcal{C}_t \cup \mathcal{M}_t} Q_t(s_{t+1}, b) \big].$$

Here, $\mathcal{M}_t$ is a small memory buffer of top-performing actions (Fourati et al., 2024).

These principles can be combined: e.g., randomizing both learning rates and action maximization within the same update.
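
To make this concrete, the following minimal Python sketch combines both randomization mechanisms in a tabular setting: each ensemble head is updated with its own Beta-distributed step size, and the next-state value is maximized over a random action subset. The dimensions, hyperparameters, and exact Beta shape parameters are illustrative assumptions, not the parameterization used in the cited papers.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes and hyperparameters (assumptions, not values from the cited papers).
S, A, H = 10, 8, 5                              # states, actions, horizon
J = 10                                          # number of ensemble heads
kappa, n0 = 2.0, 3                              # inflation and pseudo-count (stand-ins)
subset_size = max(1, int(np.ceil(np.log(A))))   # O(log n) candidate actions

# Ensemble of Q-estimates indexed by (head, step, state, action), optimistically
# initialized at H; the terminal step H has zero value.
Q_ens = np.full((J, H + 1, S, A), float(H))
Q_ens[:, H] = 0.0
visits = np.zeros((H, S, A), dtype=int)

def randql_update(h, s, a, r, s_next):
    """One RandQL-style update: each head draws its own Beta step size, and the
    next-state value is maximized only over a random subset of actions."""
    m = visits[h, s, a]
    visits[h, s, a] += 1
    cand = rng.choice(A, size=subset_size, replace=False)   # stochastic maximization
    for j in range(J):
        v_next = Q_ens[j, h + 1, s_next, cand].max()
        # Beta-distributed learning rate whose shape depends on the visit count;
        # the exact shape parameters here are a simplified stand-in.
        w = rng.beta(H + kappa, m + n0)
        Q_ens[j, h, s, a] = (1 - w) * Q_ens[j, h, s, a] + w * (r + v_next)

def greedy_action(h, s):
    """Optimistic aggregation: act greedily w.r.t. the max over ensemble heads."""
    return int(Q_ens[:, h, s, :].max(axis=0).argmax())
```

A complete implementation would maintain both the fast- and slow-forgetting ensembles and the optimistic mixture described above; the sketch keeps a single ensemble for brevity.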

2. Theoretical Properties and Regret Guarantees

RandQL algorithms possess provable regret bounds under various MDP settings:

  • Tabular Episodic MDPs: For state space of size $S = |\mathcal{S}|$, action space of size $A = |\mathcal{A}|$, episode horizon $H$, and $T$ episodes, the best-known bound is

$$\mathrm{Regret}_T = \sum_{t=1}^T \big[ V^\star_1(s^t_1) - V^{\pi^t}_1(s^t_1) \big] \leq \widetilde O\big(\sqrt{H^5 S A T}\big),$$

holding with high probability for parameters $J = O(\log(SAHT/\delta))$, $\kappa = O(\log(SAHT/\delta) + \log T)$, and $n_0 = O(\kappa \log T)$ (Wang et al., 30 Jun 2025, Tiapkin et al., 2023).

  • Gap-Dependent Regret: Under a positive sub-optimality gap, i.e.,

$$\Delta_{\min} = \min\big\{ V_h^\star(s) - Q_h^\star(s, a) \,:\, V_h^\star(s) - Q_h^\star(s, a) > 0 \big\} > 0,$$

the expected regret satisfies $\mathbb{E}[\mathrm{Regret}_T] \leq O\!\left( \frac{H^6 S A \log^5(SAHT)}{\Delta_{\min}} \right)$ (Wang et al., 30 Jun 2025).

  • Metric/Continuous State-Action Spaces: Under Lipschitz and zooming-dimension assumptions, regret scales as

$$\widetilde O\!\left( H^{5/2}\, T^{(d_z + 1)/(d_z + 2)} \right),$$

where $d_z$ is the zooming dimension (Tiapkin et al., 2023).

  • Stochastic Maximization Convergence: For large $n$, RandQL with random subset size $s_t = O(\log n)$ converges to the fixed point of the “Rand-Bellman” operator,

$$(\Phi Q)(s, a) = \mathbb{E}_{r,\, s',\, \mathcal{C} \sim \mathbb{P}} \Big[ r + \gamma \max_{b \in \mathcal{C}} Q(s', b) \,\Big|\, s, a \Big],$$

and under standard Robbins–Monro conditions and persistent exploration, $Q_t \to Q^*$ almost surely (Fourati et al., 2024).
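
As an illustration of this operator, the sketch below applies a Monte-Carlo approximation of $\Phi$ to a small randomly generated MDP and iterates it toward its fixed point. The MDP, sample counts, and subset size are all hypothetical choices made for the example.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical MDP used only for illustration.
n_states, n_actions, gamma = 5, 20, 0.9
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a] is a dist. over s'
R = rng.uniform(size=(n_states, n_actions))
subset_size = max(1, int(np.ceil(np.log(n_actions))))
n_mc = 100   # Monte-Carlo samples per (s, a) to approximate the expectation over (s', C)

def rand_bellman(Q):
    """Monte-Carlo estimate of (Phi Q)(s, a) = E[ r + gamma * max_{b in C} Q(s', b) | s, a ]."""
    out = np.zeros_like(Q)
    for s in range(n_states):
        for a in range(n_actions):
            total = 0.0
            for _ in range(n_mc):
                s_next = rng.choice(n_states, p=P[s, a])
                cand = rng.choice(n_actions, size=subset_size, replace=False)
                total += R[s, a] + gamma * Q[s_next, cand].max()
            out[s, a] = total / n_mc
    return out

# Repeated application drives Q toward the Rand-Bellman fixed point. Because each
# candidate set C is a strict subset of the action space, the per-sample target is
# at most the full max, which is the source of the underestimation bias noted later.
Q = np.zeros((n_states, n_actions))
for _ in range(20):
    Q = rand_bellman(Q)
```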

3. Algorithmic Implementation Details

The computational pipeline of modern RandQL is defined by the following crucial elements:

  • Ensemble Q-function Architecture: Each $(h, s, a)$ is associated with $J$ pairs of “fast-forgetting” and “slow-forgetting” Q-estimates. Their learning rates $w_m$ are drawn independently from Beta distributions with shape parameters reflecting visit counts, pseudo-counts ($n_0$), and inflation ($\kappa$) (Wang et al., 30 Jun 2025).
  • Optimistic Mixture and Policy Derivation: Policy values are computed as a convex or max-mixed combination of ensemble heads, as specified in the original papers’ pseudocode (Wang et al., 30 Jun 2025, Tiapkin et al., 2023).
  • Stochastic Subset Maximization for Large Action Sets:
    • At each update, only $s_t$ actions are sampled uniformly from $\mathcal{A}$ for maximization, optionally augmented by a memory buffer $\mathcal{M}_t$ of the most recently selected or highest-value actions (Fourati et al., 2024); a minimal sketch of this subset-plus-buffer maximization appears after this list.
    • This reduces per-update complexity from $O(n)$ to $O(\log n)$, with practical subset sizes $s_t = \lceil \log n \rceil$.
  • Parameter Selection: Standard choices are $J = O(\log T)$, $\kappa \approx \log T$, $n_0 = O(\kappa)$, and Beta-distributed learning rates as above; the subset size $s_t$ is selected to balance computational cost against underestimation bias.
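
The following sketch illustrates the subset-plus-buffer maximization referenced in the list above. The buffer policy (keep the highest-valued actions seen so far) and the sizes are illustrative assumptions rather than the exact rule from the cited work.

```python
import numpy as np

rng = np.random.default_rng(2)

n_actions = 256
subset_size = int(np.ceil(np.log(n_actions)))   # O(log n) sampled candidates
buffer_size = 4                                  # small memory buffer M_t

def stochastic_max(q_row, memory):
    """Approximate max_a q_row[a] over a random subset plus a memory buffer.

    Returns the approximate max value, its argmax, and the refreshed buffer.
    In a full tabular implementation the buffer would typically be kept per state.
    """
    cand = set(rng.choice(n_actions, size=subset_size, replace=False).tolist())
    cand.update(memory)                           # always include remembered actions
    best = max(cand, key=lambda a: q_row[a])
    # Simple buffer rule (an assumption): remember the highest-valued actions so far.
    memory = sorted(set(memory) | {best}, key=lambda a: q_row[a], reverse=True)[:buffer_size]
    return q_row[best], best, memory

# Example usage on one row of a Q-table.
q_row = rng.normal(size=n_actions)
memory = []
value, action, memory = stochastic_max(q_row, memory)
```

Carrying the buffer across updates is what mitigates the underestimation bias of pure subset sampling, as discussed in the limitations section below.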

4. Comparison with Related Methods

RandQL is distinguished from prior approaches along both algorithmic and theoretical lines:

| Method | Exploration | Update Complexity | Regret Bound |
| --- | --- | --- | --- |
| UCB-Q / OptQL | Bonus-based | $O(HSA)$ | $\widetilde O(\sqrt{H^5 S A T})$ |
| PSRL | Posterior sampling | $O(S^2 A H)$ | $\widetilde O(\sqrt{H^3 S A T})$ |
| RandQL (ensemble, Beta noise) | Randomized weights | $O(HSA)$ | $\widetilde O(\sqrt{H^5 S A T})$ |
| RandQL (stoch. maximization) | Subset sampling | $O(H S \log n)$ | Converges to Rand-Bellman fixed point |

RandQL offers the sample efficiency of PSRL and OptQL while avoiding the computational bottleneck of explicit posterior inference or bonus computation. Empirical comparisons on grid-world, chain, and synthetic high-dimensional MDPs illustrate that RandQL achieves either lower or comparable regret to bonus-based and model-based methods, at significantly reduced sample or wall-clock cost (Wang et al., 30 Jun 2025, Tiapkin et al., 2023, Fourati et al., 2024).

5. Empirical Evaluation and Practical Guidelines

RandQL has been systematically evaluated in both tabular and deep RL settings:

  • Tabular Grid-world and Chain Benchmarks: RandQL demonstrates lower total regret versus UCB-Q and naïve/randomized-rate variants, and approaches the sample-efficiency of model-based PSRL and RLSVI approaches (Wang et al., 30 Jun 2025, Tiapkin et al., 2023).
  • High-Dimensional Action Spaces: In synthetic MDPs with $n = 256$ actions, RandQL matches optimal performance in roughly one-tenth of the time required by standard Q-learning (Fourati et al., 2024).
  • Deep RL (e.g., InvertedPendulum-v4, HalfCheetah-v4): RandQL-based variants (RandDQN/RandDDQN) converge more rapidly and with a 10–60× per-step speedup over standard DQN/Double DQN in discretized large-action regimes. Under the standard metric of average return versus wall-clock time, RandQL outperforms DQN and approaches model-based methods.

Practical guidelines for application:

  • Subset Size: $s = O(\ln n)$ balances computation and approximation accuracy; $s \approx \sqrt{n}$ reduces underestimation bias.
  • Ensemble Size: $J = O(\log T)$, typically 10–20 heads.
  • Memory Usage: In the tabular setting, use small per-state action buffers; in deep RL, maintain a buffer of recently selected high-value actions.
  • Randomization Distribution: Uniform over actions suffices for most settings; structure-aware sampling is possible when action-space features are available.
  • Exploration Schedule: Standard decaying $\epsilon$-greedy policies ensure persistent, unbiased exploration as long as $\epsilon_t$ remains strictly positive.
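
As a small illustration of the exploration-schedule guideline, the sketch below pairs a strictly positive, decaying $\epsilon_t$ schedule with greedy selection over a sampled candidate subset; the schedule constants are arbitrary example values.

```python
import numpy as np

rng = np.random.default_rng(3)

# Example schedule: eps_t = max(eps_min, eps0 / (1 + decay * t)) stays strictly
# positive, which is what persistent exploration requires.
eps0, eps_min, decay = 1.0, 0.05, 1e-3

def epsilon(t):
    return max(eps_min, eps0 / (1.0 + decay * t))

def select_action(q_row, t, n_actions, subset_size):
    """Epsilon-greedy choice where even the greedy branch only scans a random subset."""
    if rng.random() < epsilon(t):
        return int(rng.integers(n_actions))            # uniform exploration
    cand = rng.choice(n_actions, size=subset_size, replace=False)
    return int(cand[np.argmax(q_row[cand])])           # greedy over sampled candidates
```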

6. Limitations and Open Research Directions

RandQL’s main limitations and avenues for future work include:

  • Underestimation Bias in Stochastic Maximization: Subset-based maximization is biased downward relative to the true $\max$. While this is mitigated via memory buffers, the effect can slow convergence in highly peaked Q-landscapes (Fourati et al., 2024).
  • Finite-Sample Regret with Stochastic Maximization: Explicit sample-complexity and regret guarantees for the stochastic maximization variant are not fully characterized.
  • Function Approximation: Almost sure convergence for the tabular case is established, but rigorous extension to nonlinear function approximation (deep networks) remains an open problem.
  • Adaptive Subset Sizing: Dynamic adjustment of $s_t$ based on value uncertainty may yield improved trade-offs between approximation quality and cost.
  • Continuous and Structured Action Spaces: Adapting the random subset paradigm to continuous actions via stochastic gradient maximization or to combinatorial/embedded action sets is an active area of exploration.

7. Summary and Significance

Randomized Q-Learning provides a unified framework for model-free RL agents that achieve efficient exploration and sample efficiency via randomized learning rates and stochastic maximization. This encompasses theoretical guarantees, including provably near-optimal regret in tabular and metric state-action settings, as well as demonstrated practical gains in environments with large and/or continuous action spaces. The approach matches or outperforms classical optimism-bonus and posterior-sampling methods in both theoretical and empirical dimensions, while maintaining space and time complexity of $O(HSA)$ per step for tabular MDPs and $O(\log n)$ per step in large action spaces (Wang et al., 30 Jun 2025, Tiapkin et al., 2023, Fourati et al., 2024). RandQL thus offers a robust algorithmic recipe for tractable and principled exploration in contemporary RL.
