Randomized Ensembled Double Q-Learning (REDQ)

Updated 20 April 2026

REDQ is a model-free, off-policy deep reinforcement learning algorithm that integrates a high update-to-data ratio, an ensemble of Q-function critics, and randomized target minimization to control bias and variance.
The algorithm minimizes overestimation bias by randomly sampling a subset of critics for target evaluation, thereby stabilizing the learning process even at aggressive update rates.
Empirical results on continuous control benchmarks such as MuJoCo show that REDQ achieves sample efficiency comparable or superior to that of state-of-the-art model-based methods.

Randomized Ensembled Double Q-Learning (REDQ) is a model-free, off-policy deep reinforcement learning (DRL) algorithm that achieves state-of-the-art sample efficiency in continuous control tasks by integrating three key techniques: a high update-to-data (UTD) ratio, an ensemble of Q-function critics, and randomized in-target minimization over critic subsets. REDQ enables aggressive reuse of off-policy data without instabilities arising from overestimation bias, and its practical implementations have closed the performance gap between model-free and state-of-the-art model-based approaches across standard benchmarks.

1. Foundations and Algorithmic Structure

REDQ extends the maximum-entropy off-policy actor–critic framework by maintaining an ensemble of $N$ Q-function critics $\{Q_{\phi_i}(s,a)\}_{i=1}^N$ and performing many critic updates per environment interaction. For each environment step, REDQ typically performs $G \gg 1$ critic updates ("high UTD ratio"). Each critic update samples a minibatch from the replay buffer and then, distinctively, draws a random subset $\mathcal{M} \subset \{1, \dots, N\}$ of size $M \ll N$ for target calculation. The Bellman target for the next state is

$y = r + \gamma \left( \min_{i\in\mathcal{M}} Q_{\bar\phi_i}(s', a') - \alpha \log \pi_\theta(a'|s') \right), \quad a' \sim \pi_\theta(\cdot|s')$

where $\alpha$ is the entropy coefficient and $\pi_\theta$ is the stochastic actor. Each critic is updated with the mean-squared Bellman error. The parameters of the target networks are updated via Polyak averaging. The policy gradient step uses the average Q-value across the ensemble, further regularized by the entropy term:

$J(\theta) = \mathbb{E}_{s\sim\mathcal{D}, a\sim\pi_\theta} \left[ \frac{1}{N} \sum_{i=1}^N Q_{\phi_i}(s, a) - \alpha \log \pi_\theta(a|s) \right]$

In-target minimization over a small random subset, combined with a sufficiently large ensemble and a high UTD ratio, is central to controlling bias and variance and to learning robustly with high data efficiency (Chen et al., 2021, Wu et al., 2021).
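The target computation and Polyak averaging described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names, the callable-based critic interface, and the toy critics and policy are all assumptions made for the example.

```python
import math
import random

def redq_target(reward, next_state, target_critics, sample_action,
                gamma, alpha, subset_size, rng):
    """Bellman target with in-target minimization over a random critic subset."""
    # a' ~ pi(.|s'), returned together with log pi(a'|s')
    next_action, log_prob = sample_action(next_state, rng)
    # Draw the random index set M (without replacement) from the N target critics
    subset = rng.sample(range(len(target_critics)), subset_size)
    min_q = min(target_critics[i](next_state, next_action) for i in subset)
    return reward + gamma * (min_q - alpha * log_prob)

def polyak_update(target_params, online_params, rho):
    """Move target parameters toward online ones: rho * target + (1 - rho) * online."""
    return [rho * t + (1.0 - rho) * p for t, p in zip(target_params, online_params)]

# Toy setup (hypothetical): N = 5 "critics" that differ only by a constant offset
offsets = [0.0, 0.5, -0.3, 0.2, 0.1]
critics = [lambda s, a, b=b: s + a + b for b in offsets]

def toy_policy(state, rng):
    # Stand-in for a stochastic actor: always returns action 1.0 with log-prob log(0.5)
    return 1.0, math.log(0.5)

rng = random.Random(0)
y = redq_target(reward=1.0, next_state=2.0, target_critics=critics,
                sample_action=toy_policy, gamma=0.99, alpha=0.2,
                subset_size=2, rng=rng)
print(f"target with M=2: {y:.4f}")
```

Each call draws a fresh subset, so repeated updates on the same transition see different minima; setting `subset_size` equal to the ensemble size recovers a minimum over all critics.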

2. Bias and Variance Control via Randomized Ensembles

The motivation for REDQ arises from the limitations of standard Double Q-learning and Soft Actor-Critic (SAC), which use only two critics and become unstable at high UTD ratios due to error accumulation and overestimation bias. By maintaining a larger ensemble ($N \approx 10$ is typical) and computing targets as the minimum over a random subset of size $M$ (frequently $M = 2$), REDQ enforces a near-constant negative bias on Bellman backups, directly controllable through $M$. Empirical and tabular theoretical analyses demonstrate:

The bias induced by the min-over-random-subset estimator is a decreasing function of $M$, independent of $N$ and $G$.
Variance, by contrast, can be decoupled from bias: increasing $N$ reduces variance while $M$ adjusts bias (Chen et al., 2021).
High UTD ratios, when combined with these randomization strategies, do not result in diverging bias or value blow-up as in single- or dual-critic settings.

These properties allow stable exploitation of replay buffers at high update frequencies, yielding high sample efficiency even in non-linear function approximation regimes.
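The dependence of the bias on the subset size can be checked with a small Monte Carlo experiment. The sketch below rests on a strong simplifying assumption that is not part of the cited analyses: critic errors at a single state-action pair are modelled as i.i.d. Uniform(-1, 1) draws around a true value of zero. The function name and trial count are illustrative.

```python
import random

def mean_subset_min(num_critics, subset_size, trials, rng):
    """Average of the min over a random size-M subset of N i.i.d. Uniform(-1, 1) errors."""
    total = 0.0
    for _ in range(trials):
        errors = [rng.uniform(-1.0, 1.0) for _ in range(num_critics)]
        total += min(rng.sample(errors, subset_size))
    return total / trials

rng = random.Random(0)
for n in (5, 10):
    for m in (1, 2, 3):
        estimate = mean_subset_min(n, m, trials=5000, rng=rng)
        # Closed form for the expected min of M i.i.d. Uniform(-1, 1) draws
        closed_form = (1 - m) / (m + 1)
        print(f"N={n:2d} M={m} simulated={estimate:+.3f} closed-form={closed_form:+.3f}")
```

Under this toy model the simulated mean tracks the closed form for each subset size and does not shift when the ensemble size changes, which mirrors the qualitative claim above; it says nothing about the variance reduction from larger ensembles, which this sketch does not model.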

3. Empirical Performance and Benchmarks

Empirical evaluations on the MuJoCo suite (Hopper, Walker2d, Ant, Humanoid) and the DeepMind Control Suite show that REDQ achieves comparable or superior sample efficiency to model-based algorithms such as MBPO. On the MuJoCo tasks, REDQ reaches benchmark performance in roughly $3\times$--$8\times$ fewer environment steps than SAC and matches or exceeds MBPO's wall-clock efficiency despite using a smaller parameter budget.

A summary of sample efficiency on MuJoCo (environment steps to reach fixed return) (Wu et al., 2021):

Environment	SAC	REDQ	REDQ/SAC Speedup
Hopper@3500	933K	116K	$8.04\times$
Walker2d@3500	440K	141K	$3.12\times$
Ant@5000	771K	153K	$5.04\times$
Humanoid@5000	945K	255K	$3.71\times$
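The speedup column is simply the ratio of the SAC and REDQ step counts in each row. A short check of that arithmetic, using the step counts from the table (the dictionary layout is just for illustration):

```python
# Environment steps needed to reach the target return: (SAC, REDQ)
steps_to_target = {
    "Hopper@3500": (933_000, 116_000),
    "Walker2d@3500": (440_000, 141_000),
    "Ant@5000": (771_000, 153_000),
    "Humanoid@5000": (945_000, 255_000),
}

# Speedup = SAC steps / REDQ steps
speedups = {env: sac / redq for env, (sac, redq) in steps_to_target.items()}

for env, speedup in speedups.items():
    print(f"{env}: {speedup:.2f}x")
# Hopper@3500: 8.04x
# Walker2d@3500: 3.12x
# Ant@5000: 5.04x
# Humanoid@5000: 3.71x
```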

Ablation studies demonstrate diminishing returns as the number of critics grows, and performance is robust to moderate changes in the hyperparameters. REDQ's high sample efficiency also extends to sparse-reward goal-conditioned tasks and real-world robotics, provided the components are adapted appropriately (Hiraoka, 2023).

4. Key Variants and Extensions

4.1. Dropout Q-Functions (DroQ)

DroQ is a computationally efficient REDQ variant that substitutes a large ensemble of Q-networks with a small ensemble of dropout Q-functions, each equipped with dropout layers and layer normalization. This architectural choice enables efficient uncertainty propagation, with only a few dropout critics needed per update, resulting in faster runtime and lower memory requirements than classic REDQ. Importantly, DroQ preserves the low-bias, high-sample-efficiency properties of REDQ while matching the computational footprint of SAC (Hiraoka et al., 2021).
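To make the architectural idea concrete, the following is a rough pure-Python sketch of one hidden block of a dropout Q-function. The ordering used here (linear layer, then dropout, then layer normalization, then ReLU), the dropout rate of 0.01, and all names and weights are assumptions for illustration rather than details confirmed by the text above.

```python
import math
import random

def dropout(values, rate, rng):
    """Inverted dropout: zero each unit with probability `rate`, rescale the rest."""
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]

def layer_norm(values, eps=1e-5):
    """Normalize a vector to zero mean and (approximately) unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]

def hidden_block(inputs, weights, biases, rate, rng):
    """Linear layer -> dropout -> layer norm -> ReLU (assumed ordering)."""
    pre = [sum(w * x for w, x in zip(row, inputs)) + b
           for row, b in zip(weights, biases)]
    return [max(0.0, v) for v in layer_norm(dropout(pre, rate, rng))]

# Hypothetical 3-input, 2-unit block applied to a concatenated (state, action) vector
rng = random.Random(0)
weights = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]]
biases = [0.0, 0.1]
hidden = hidden_block([1.0, 2.0, 3.0], weights, biases, rate=0.01, rng=rng)
print(hidden)
```

Because the dropout mask is resampled on every forward pass, a single network yields a different prediction each time, which is what lets a small number of such critics stand in for a larger ensemble.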

4.2. Sparse-Reward and Goal-Conditioned Adaptations

REDQ has been successfully adapted to sparse-reward, goal-conditioned RL by integrating hindsight experience replay (HER) to generate informative transitions and bounding target Q-values to prevent value explosion. Clamping targets to theoretical support intervals and, optionally, replacing the minimum over the subset with an average further stabilize training, resulting in improved sample efficiency over previous state-of-the-art methods on robotic fetch tasks (Hiraoka, 2023).
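One way to picture target clamping is shown below. The bounds are an assumption made for this sketch: with per-step rewards confined to a known interval and discount factor `gamma`, a discounted return (ignoring any entropy bonus) lies between `r_min / (1 - gamma)` and `r_max / (1 - gamma)`. The function name and the example reward range are hypothetical and not taken from the cited work.

```python
def clamp_target(target, r_min, r_max, gamma):
    """Clip a Bellman target to the range a discounted return can attain."""
    lower = r_min / (1.0 - gamma)
    upper = r_max / (1.0 - gamma)
    return max(lower, min(upper, target))

# Hypothetical sparse-reward setting: reward is -1 until the goal is reached, then 0
gamma = 0.98
print(clamp_target(-73.2, r_min=-1.0, r_max=0.0, gamma=gamma))  # clipped to about -50
print(clamp_target(-12.5, r_min=-1.0, r_max=0.0, gamma=gamma))  # unchanged
print(clamp_target(3.0, r_min=-1.0, r_max=0.0, gamma=gamma))    # clipped to 0.0
```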

5. Theoretical Insights and Comparisons

In tabular settings, REDQ's random min-over-subset estimator is shown to have an expected bias independent of the number of critics $N$ or the update ratio $G$, uniquely controlling overestimation or underestimation via $M$. Unlike Maxmin Q-learning, which uses min over all critics (resulting in excessive underestimation for large $N$), REDQ's randomized subset maintains a uniform bias–variance profile even under high-frequency critic updates.
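A short worked calculation illustrates the contrast with Maxmin Q-learning. It uses a simplifying assumption introduced here for illustration: the critic errors $e_i$ at a fixed state-action pair are i.i.d. $\mathrm{Uniform}(-\tau, \tau)$, and only the minimization step is considered. For a subset $\mathcal{M}$ of size $M$:

$\Pr\left(\min_{i\in\mathcal{M}} e_i > x\right) = \left(\frac{\tau - x}{2\tau}\right)^{M}, \quad x \in [-\tau, \tau]$

$\mathbb{E}\left[\min_{i\in\mathcal{M}} e_i\right] = -\tau + \int_{-\tau}^{\tau} \left(\frac{\tau - x}{2\tau}\right)^{M} dx = -\tau + \frac{2\tau}{M+1} = \tau\,\frac{1-M}{M+1}$

Under this assumption the expected error is $0$ for $M = 1$ and $-\tau/3$ for $M = 2$, and the expression contains neither the ensemble size nor the number of updates. Minimizing over the whole ensemble corresponds to setting $M = N$, which gives $\tau\,\frac{1-N}{N+1}$, for example $-\frac{9}{11}\tau$ with ten critics, so the underestimation approaches $-\tau$ as the ensemble grows.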

REDQ avoids the instability at high UTD ratios that afflicts classic Q-learning and SAC by mitigating the "deadly triad" (off-policy data, function approximation, bootstrap targets). Empirical results confirm REDQ's robustness across random seeds and its state-of-the-art early-stage performance. However, the algorithm's asymptotic returns, particularly on the most challenging tasks, remain slightly below those of methods such as Truncated Quantile Critics (TQC), which uses distributional critics, and Aggressive Q-Learning with Ensembles (AQE), which employ orthogonal regularization mechanisms (Wu et al., 2021, Chen et al., 2021).

6. Practical Considerations and Implementation Details

Canonical hyperparameters for MuJoCo tasks and vehicle control benchmarks include an ensemble of $N = 10$ critics, subset size $M = 2$, UTD ratio $G = 20$, two hidden layers of 256 units per critic/actor, and batch size 256–512. The learning rates, Polyak factor, and replay buffer size are also part of the canonical configuration. Layer normalization within critics improves regularization, particularly under aggressive replay usage.

In practical deployments, REDQ's substantial computational cost (which grows with the ensemble size in both forward/backward passes and memory) is mitigated in variants like DroQ, which uses a minimal set of dropout network heads. Hyperparameter tuning for $N$, $M$, and $G$ is required for optimal bias–variance tradeoff, although the algorithm is robust near the canonical settings (Hiraoka et al., 2021).
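A back-of-the-envelope count shows where the cost comes from. The sketch below simply multiplies the number of critics by the number of updates per environment step; the SAC figures (two critics, one update per step) follow the description earlier in this article, while the UTD ratio of 20 for REDQ is an assumed, commonly used setting. The function name is illustrative.

```python
def critic_updates_per_env_step(num_critics, utd_ratio):
    """Number of individual critic gradient updates per environment step."""
    return num_critics * utd_ratio

redq_cost = critic_updates_per_env_step(num_critics=10, utd_ratio=20)  # assumed G = 20
sac_cost = critic_updates_per_env_step(num_critics=2, utd_ratio=1)

print(redq_cost)             # 200
print(sac_cost)              # 2
print(redq_cost / sac_cost)  # 100.0
```

This ignores actor updates, target computation, and any batching of the ensemble on an accelerator, so it is an upper-level estimate of relative work rather than a wall-clock prediction.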

7. Applications, Limitations, and Open Problems

REDQ has demonstrated notable efficacy in domains ranging from continuous control in simulation and real-world robotic manipulation with sparse rewards to autonomous vehicle trajectory control, matching or exceeding the performance of model-based approaches while maintaining the modularity and simplicity of model-free methods (Frauenknecht et al., 2023).

Current limitations include compute requirements for maintaining and updating large ensembles, as well as suboptimal asymptotic performance relative to advanced distributional methods in the most complex domains. Future directions include adaptive subset selection, efficient ensemble architectures (e.g., multi-head critics), formal sample-complexity analysis under function approximation, and integration with distributional Q-learning or model-based RL components (Chen et al., 2021, Wu et al., 2021).

References

"Randomized Ensembled Double Q-Learning: Learning Fast Without a Model" (Chen et al., 2021)
"Aggressive Q-Learning with Ensembles: Achieving Both High Sample Efficiency and High Asymptotic Performance" (Wu et al., 2021)
"Dropout Q-Functions for Doubly Efficient Reinforcement Learning" (Hiraoka et al., 2021)
"Efficient Sparse-Reward Goal-Conditioned Reinforcement Learning with a High Replay Ratio and Regularization" (Hiraoka, 2023)
"Data-efficient Deep Reinforcement Learning for Vehicle Trajectory Control" (Frauenknecht et al., 2023)