Randomized Q-Learning: Scalable and Efficient

Updated 21 January 2026
  • Randomized Q-Learning is a model-free RL approach that randomizes learning rates and maximization steps to drive efficient exploration in complex environments.
  • It unifies ensemble methods, stochastic subset selection, and Bayesian-inspired techniques to achieve provable regret bounds and tractable computations.
  • Empirical evaluations show RandQL converges rapidly in high-dimensional action spaces while reducing computational burden compared to classical Q-Learning.

Randomized Q-Learning (RandQL), also referred to as RandomizedQ or Stochastic Q-learning, denotes a class of model-free reinforcement learning (RL) algorithms characterized by the use of randomization as a principal mechanism for both exploration and computational efficiency. Unlike classical Q-learning, which relies on deterministic or scheduled learning rates and maximization across all actions, RandQL leverages randomized learning rates and/or stochastic maximization procedures. This approach enables efficient posterior sampling-based exploration, provable regret minimization, and tractable scaling to environments with large or structured state-action spaces (Wang et al., 30 Jun 2025, Tiapkin et al., 2023, Fourati et al., 2024). RandQL unifies several algorithmic strands, including “Thompson-style” exploration, stochastic subset selection for maximization, and learning-rate randomization, with robust theoretical and empirical properties.

1. Algorithmic Foundations and Variants

RandQL algorithms target episodic MDPs with state space $\mathcal{S}$, action space $\mathcal{A}$, finite horizon $H$, and transition kernel $P_h(s' \mid s, a)$. Standard Q-learning proceeds via

$$Q_h(s, a) \leftarrow (1 - \alpha_n)\, Q_h(s, a) + \alpha_n \big[ r_h(s, a) + \max_{a'} Q_{h+1}(s', a') \big],$$

without explicit optimism or systematic posterior sampling. RandQL introduces randomization at different algorithmic loci, yielding several families:

  • Randomized Learning Rate Q-Learning: An ensemble of $J$ Q-functions is maintained. At each visit to $(s, a, h)$, each head $j$ uses an independent random learning rate $w \sim \operatorname{Beta}(\cdot, \cdot)$. A typical update is

$$Q^j_h(s, a) \leftarrow (1 - w)\, Q^j_h(s, a) + w \big[ r_h(s, a) + V_{h+1}(s') \big],$$

with $w$ drawn according to parameters that depend on the number of prior visits $m$ and the episode horizon $H$ (Wang et al., 30 Jun 2025, Tiapkin et al., 2023).

  • Optimism via Aggregation: The policy $Q$-value is set as the maximum across heads, or as an “optimistic mixture”:

$$Q_h(s, a) \leftarrow (1 - 1/H) \max_{j \in [J]} Q_h^j(s, a) + (1/H) \max_{j \in [J]} \widetilde Q_h^j(s, a),$$

wherein one ensemble is “fast-forgetting” and the other “slow-forgetting” to preserve optimism and anti-concentration for exploration purposes (Wang et al., 30 Jun 2025).

  • Stochastic Maximization in Large Action Spaces: For $|\mathcal{A}| = n \gg 1$, RandQL may replace $\max_{a'} Q(s', a')$ with maximization over a small random action subset $\mathcal{C}_t \subset \mathcal{A}$ of size $s_t = \mathcal{O}(\log n)$:

$$Q_{t+1}(s_t, a_t) = (1 - \alpha_t)\, Q_t(s_t, a_t) + \alpha_t \big[ r_t + \gamma \max_{b \in \mathcal{C}_t \cup \mathcal{M}_t} Q_t(s_{t+1}, b) \big].$$

Here, $\mathcal{M}_t$ is a small memory buffer of top-performing actions (Fourati et al., 2024).

These principles can be combined: e.g., randomizing both learning rates and action maximization within the same update.
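
To make this concrete, the following minimal Python sketch combines both randomization mechanisms in a tabular setting: each ensemble head is updated with its own Beta-distributed step size, and the next-state value is maximized over a random action subset. The dimensions, hyperparameters, and exact Beta shape parameters are illustrative assumptions, not the parameterization used in the cited papers.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes and hyperparameters (assumptions, not values from the cited papers).
S, A, H = 10, 8, 5                              # states, actions, horizon
J = 10                                          # number of ensemble heads
kappa, n0 = 2.0, 3                              # inflation and pseudo-count (stand-ins)
subset_size = max(1, int(np.ceil(np.log(A))))   # O(log n) candidate actions

# Ensemble of Q-estimates indexed by (head, step, state, action), optimistically
# initialized at H; the terminal step H has zero value.
Q_ens = np.full((J, H + 1, S, A), float(H))
Q_ens[:, H] = 0.0
visits = np.zeros((H, S, A), dtype=int)

def randql_update(h, s, a, r, s_next):
    """One RandQL-style update: each head draws its own Beta step size, and the
    next-state value is maximized only over a random subset of actions."""
    m = visits[h, s, a]
    visits[h, s, a] += 1
    cand = rng.choice(A, size=subset_size, replace=False)   # stochastic maximization
    for j in range(J):
        v_next = Q_ens[j, h + 1, s_next, cand].max()
        # Beta-distributed learning rate whose shape depends on the visit count;
        # the exact shape parameters here are a simplified stand-in.
        w = rng.beta(H + kappa, m + n0)
        Q_ens[j, h, s, a] = (1 - w) * Q_ens[j, h, s, a] + w * (r + v_next)

def greedy_action(h, s):
    """Optimistic aggregation: act greedily w.r.t. the max over ensemble heads."""
    return int(Q_ens[:, h, s, :].max(axis=0).argmax())
```

A complete implementation would maintain both the fast- and slow-forgetting ensembles and the optimistic mixture described above; the sketch keeps a single ensemble for brevity.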

2. Theoretical Properties and Regret Guarantees

RandQL algorithms possess provable regret bounds under various MDP settings:

  • Tabular Episodic MDPs: For state space of size $S = |\mathcal{S}|$, action space of size $A = |\mathcal{A}|$, episode horizon $H$, and $T$ episodes, the best-known bound is

$$\mathrm{Regret}_T = \sum_{t=1}^T \big[ V^\star_1(s^t_1) - V^{\pi^t}_1(s^t_1) \big] \leq \widetilde O\big(\sqrt{H^5 S A T}\big),$$

holding with high probability for parameters $J = O(\log(SAHT/\delta))$, $\kappa = O(\log(SAHT/\delta) + \log T)$, and $n_0 = O(\kappa \log T)$ (Wang et al., 30 Jun 2025, Tiapkin et al., 2023).

  • Gap-Dependent Regret: Under a positive sub-optimality gap, i.e.,

$$\Delta_{\min} = \min\big\{ V_h^\star(s) - Q_h^\star(s, a) \,:\, V_h^\star(s) - Q_h^\star(s, a) > 0 \big\} > 0,$$

the expected regret satisfies $\mathbb{E}[\mathrm{Regret}_T] \leq O\!\left( \frac{H^6 S A \log^5(SAHT)}{\Delta_{\min}} \right)$ (Wang et al., 30 Jun 2025).

  • Metric/Continuous State-Action Spaces: Under Lipschitz and zooming-dimension assumptions, regret scales as

$$\widetilde O\!\left( H^{5/2}\, T^{(d_z + 1)/(d_z + 2)} \right),$$

where $d_z$ is the zooming dimension (Tiapkin et al., 2023).

  • Stochastic Maximization Convergence: For large $n$, RandQL with random subset size $s_t = O(\log n)$ converges to the fixed point of the “Rand-Bellman” operator,

$$(\Phi Q)(s, a) = \mathbb{E}_{r,\, s',\, \mathcal{C} \sim \mathbb{P}} \Big[ r + \gamma \max_{b \in \mathcal{C}} Q(s', b) \,\Big|\, s, a \Big],$$

and under standard Robbins–Monro conditions and persistent exploration, $Q_t \to Q^*$ almost surely (Fourati et al., 2024).
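
As an illustration of this operator, the sketch below applies a Monte-Carlo approximation of $\Phi$ to a small randomly generated MDP and iterates it toward its fixed point. The MDP, sample counts, and subset size are all hypothetical choices made for the example.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical MDP used only for illustration.
n_states, n_actions, gamma = 5, 20, 0.9
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a] is a dist. over s'
R = rng.uniform(size=(n_states, n_actions))
subset_size = max(1, int(np.ceil(np.log(n_actions))))
n_mc = 100   # Monte-Carlo samples per (s, a) to approximate the expectation over (s', C)

def rand_bellman(Q):
    """Monte-Carlo estimate of (Phi Q)(s, a) = E[ r + gamma * max_{b in C} Q(s', b) | s, a ]."""
    out = np.zeros_like(Q)
    for s in range(n_states):
        for a in range(n_actions):
            total = 0.0
            for _ in range(n_mc):
                s_next = rng.choice(n_states, p=P[s, a])
                cand = rng.choice(n_actions, size=subset_size, replace=False)
                total += R[s, a] + gamma * Q[s_next, cand].max()
            out[s, a] = total / n_mc
    return out

# Repeated application drives Q toward the Rand-Bellman fixed point. Because each
# candidate set C is a strict subset of the action space, the per-sample target is
# at most the full max, which is the source of the underestimation bias noted later.
Q = np.zeros((n_states, n_actions))
for _ in range(20):
    Q = rand_bellman(Q)
```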

3. Algorithmic Implementation Details

The computational pipeline of modern RandQL is defined by the following crucial elements:

  • Ensemble Q-function Architecture: Each $(h, s, a)$ is associated with $J$ pairs of “fast-forgetting” and “slow-forgetting” Q-estimates. Their learning rates $w_m$ are drawn independently from Beta distributions with shape parameters reflecting visit counts, pseudo-counts ($n_0$), and inflation ($\kappa$) (Wang et al., 30 Jun 2025).
  • Optimistic Mixture and Policy Derivation: Policy values are computed as a convex or max-mixed combination of ensemble heads, as specified in the original papers’ pseudocode (Wang et al., 30 Jun 2025, Tiapkin et al., 2023).
  • Stochastic Subset Maximization for Large Action Sets:
    • At each update, only $s_t$ actions are sampled uniformly from $\mathcal{A}$ for maximization, optionally augmented by a memory buffer $\mathcal{M}_t$ of the most recently selected or highest-value actions (Fourati et al., 2024); a minimal sketch of this subset-plus-buffer maximization appears after this list.
    • This reduces per-update complexity from $O(n)$ to $O(\log n)$, with practical subset sizes $s_t = \lceil \log n \rceil$.
  • Parameter Selection: Standard choices are $J = O(\log T)$, $\kappa \approx \log T$, $n_0 = O(\kappa)$, and Beta-distributed learning rates as above; the subset size $s_t$ is selected to balance computational cost against underestimation bias.
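
The following sketch illustrates the subset-plus-buffer maximization referenced in the list above. The buffer policy (keep the highest-valued actions seen so far) and the sizes are illustrative assumptions rather than the exact rule from the cited work.

```python
import numpy as np

rng = np.random.default_rng(2)

n_actions = 256
subset_size = int(np.ceil(np.log(n_actions)))   # O(log n) sampled candidates
buffer_size = 4                                  # small memory buffer M_t

def stochastic_max(q_row, memory):
    """Approximate max_a q_row[a] over a random subset plus a memory buffer.

    Returns the approximate max value, its argmax, and the refreshed buffer.
    In a full tabular implementation the buffer would typically be kept per state.
    """
    cand = set(rng.choice(n_actions, size=subset_size, replace=False).tolist())
    cand.update(memory)                           # always include remembered actions
    best = max(cand, key=lambda a: q_row[a])
    # Simple buffer rule (an assumption): remember the highest-valued actions so far.
    memory = sorted(set(memory) | {best}, key=lambda a: q_row[a], reverse=True)[:buffer_size]
    return q_row[best], best, memory

# Example usage on one row of a Q-table.
q_row = rng.normal(size=n_actions)
memory = []
value, action, memory = stochastic_max(q_row, memory)
```

Carrying the buffer across updates is what mitigates the underestimation bias of pure subset sampling, as discussed in the limitations section below.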

4. Comparison with Related Methods

RandQL is distinguished from prior approaches along both algorithmic and theoretical lines:

| Method | Exploration | Update Complexity | Regret Bound |
| --- | --- | --- | --- |
| UCB-Q / OptQL | Bonus-based | $O(HSA)$ | $\widetilde O(\sqrt{H^5 S A T})$ |
| PSRL | Posterior sampling | $O(S^2 A H)$ | $\widetilde O(\sqrt{H^3 S A T})$ |
| RandQL (ensemble, Beta noise) | Randomized weights | $O(HSA)$ | $\widetilde O(\sqrt{H^5 S A T})$ |
| RandQL (stoch. maximization) | Subset sampling | $O(H S \log n)$ | Converges to Rand-Bellman fixed point |

RandQL offers the sample efficiency of PSRL and OptQL while avoiding the computational bottleneck of explicit posterior inference or bonus computation. Empirical comparisons on grid-world, chain, and synthetic high-dimensional MDPs illustrate that RandQL achieves either lower or comparable regret to bonus-based and model-based methods, at significantly reduced sample or wall-clock cost (Wang et al., 30 Jun 2025, Tiapkin et al., 2023, Fourati et al., 2024).

5. Empirical Evaluation and Practical Guidelines

RandQL has been systematically evaluated in both tabular and deep RL settings:

  • Tabular Grid-world and Chain Benchmarks: RandQL demonstrates lower total regret versus UCB-Q and naïve/randomized-rate variants, and approaches the sample-efficiency of model-based PSRL and RLSVI approaches (Wang et al., 30 Jun 2025, Tiapkin et al., 2023).
  • High-Dimensional Action Spaces: In synthetic MDPs with $n = 256$ actions, RandQL matches optimal performance in roughly one-tenth of the time required by standard Q-learning (Fourati et al., 2024).
  • Deep RL (e.g., InvertedPendulum-v4, HalfCheetah-v4): RandQL-based variants (RandDQN/RandDDQN) converge more rapidly and with a 10–60× per-step speedup over standard DQN/Double DQN in discretized large-action regimes. Under the standard metric of average return versus wall-clock time, RandQL outperforms DQN and approaches model-based methods.

Practical guidelines for application:

  • Subset Size: $s = O(\ln n)$ balances computation and approximation accuracy; $s \approx \sqrt{n}$ reduces underestimation bias.
  • Ensemble Size: $J = O(\log T)$, typically 10–20 heads.
  • Memory Usage: In the tabular setting, use small per-state action buffers; in deep RL, maintain a buffer of recently selected high-value actions.
  • Randomization Distribution: Uniform over actions suffices for most settings; structure-aware sampling is possible when action-space features are available.
  • Exploration Schedule: Standard decaying $\epsilon$-greedy policies ensure persistent, unbiased exploration as long as $\epsilon_t$ remains strictly positive.
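
As a small illustration of the exploration-schedule guideline, the sketch below pairs a strictly positive, decaying $\epsilon_t$ schedule with greedy selection over a sampled candidate subset; the schedule constants are arbitrary example values.

```python
import numpy as np

rng = np.random.default_rng(3)

# Example schedule: eps_t = max(eps_min, eps0 / (1 + decay * t)) stays strictly
# positive, which is what persistent exploration requires.
eps0, eps_min, decay = 1.0, 0.05, 1e-3

def epsilon(t):
    return max(eps_min, eps0 / (1.0 + decay * t))

def select_action(q_row, t, n_actions, subset_size):
    """Epsilon-greedy choice where even the greedy branch only scans a random subset."""
    if rng.random() < epsilon(t):
        return int(rng.integers(n_actions))            # uniform exploration
    cand = rng.choice(n_actions, size=subset_size, replace=False)
    return int(cand[np.argmax(q_row[cand])])           # greedy over sampled candidates
```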

6. Limitations and Open Research Directions

RandQL’s main limitations and avenues for future work include:

  • Underestimation Bias in Stochastic Maximization: Subset-based maximization is biased downward relative to the true $\max$. While this is mitigated via memory buffers, the effect can slow convergence in highly peaked Q-landscapes (Fourati et al., 2024).
  • Finite-Sample Regret with Stochastic Maximization: Explicit sample-complexity and regret guarantees for the stochastic maximization variant are not fully characterized.
  • Function Approximation: Almost sure convergence for the tabular case is established, but rigorous extension to nonlinear function approximation (deep networks) remains an open problem.
  • Adaptive Subset Sizing: Dynamic adjustment of $s_t$ based on value uncertainty may yield improved trade-offs between approximation quality and cost.
  • Continuous and Structured Action Spaces: Adapting the random subset paradigm to continuous actions via stochastic gradient maximization or to combinatorial/embedded action sets is an active area of exploration.

7. Summary and Significance

Randomized Q-Learning provides a unified framework for model-free RL agents that achieve efficient exploration and sample efficiency via randomized learning rates and stochastic maximization. This encompasses theoretical guarantees, including provably near-optimal regret in tabular and metric state-action settings, as well as demonstrated practical gains in environments with large and/or continuous action spaces. The approach matches or outperforms classical optimism-bonus and posterior-sampling methods in both theoretical and empirical dimensions, while maintaining space and time complexity of $O(HSA)$ per step for tabular MDPs and $O(\log n)$ per step in large action spaces (Wang et al., 30 Jun 2025, Tiapkin et al., 2023, Fourati et al., 2024). RandQL thus offers a robust algorithmic recipe for tractable and principled exploration in contemporary RL.
