Random Ensemble Mixture (REM) in Offline RL
- REM is a robust value-based RL algorithm that uses random convex mixtures of multiple Q-value estimators to enforce Bellman consistency and mitigate overfitting.
- It leverages random weight sampling during each Bellman update to promote mutual regularization and diverse target estimations across the ensemble.
- Empirical results show REM achieves superior performance and stability in offline RL, often outperforming standard DQN and related ensemble baselines.
The Random Ensemble Mixture (REM) is a robust value-based reinforcement learning (RL) algorithm that leverages random convex combinations of multiple Q-value estimators to enhance generalization, stability, and policy performance, particularly in the offline RL setting. REM was introduced in the context of challenges faced by off-policy deep RL when limited to fixed datasets, where classical ensembling techniques from supervised learning are extended by enforcing Bellman optimality on mixtures of estimators rather than on each individually (Agarwal et al., 2019).
1. Motivation and Conceptual Foundation
Offline RL is characterized by the constraint of learning from a static dataset of transition tuples, with no further environment interaction available to correct extrapolation errors. Standard Q-learning, when trained on a fixed data distribution, tends to overfit: errors in target estimation propagate and amplify, leading to poorly calibrated Q-values. Ensembling, a classical method for reducing overfitting in supervised learning, is adapted in REM by forming random convex combinations of several Q-function heads at every Bellman update.
The central innovation of REM is to enforce Bellman consistency on a continuum of random mixtures rather than only on individual ensemble members or their average. For a size-$K$ ensemble $\{Q_\theta^1, \dots, Q_\theta^K\}$, a random convex weight vector $\alpha$ (with $\alpha_k \ge 0$ and $\sum_k \alpha_k = 1$) is sampled at each update, yielding a mixture $Q_\alpha(s, a) = \sum_{k=1}^{K} \alpha_k Q_\theta^k(s, a)$. Each sampled mixture is then required to satisfy the Bellman equation, exposing the ensemble to a broader target family, regularizing heads toward mutual consistency, and attenuating the variance and bias induced by fixed data. The effect is empirically similar to dropout regularization and forms a robust defense against distributional shift in the offline RL dataset (Agarwal et al., 2019).
2. Mathematical Formulation
REM operates by maintaining an ensemble of $K$ Q-functions $Q_\theta^1, \dots, Q_\theta^K$, implemented either as:
- $K$ separate networks, or
- a shared network backbone with $K$ output heads.
At each gradient step:
- Sample a random convex weight vector $\alpha$ by normalizing $K$ i.i.d. draws $\alpha'_k$ from $U(0, 1)$: $\alpha_k = \alpha'_k / \sum_{k'=1}^{K} \alpha'_{k'}$.
- Form the mixture Q-value:
  $$Q_\alpha(s, a) = \sum_{k=1}^{K} \alpha_k Q_\theta^k(s, a)$$
- For a minibatch transition $(s, a, r, s')$, the target is
  $$y = r + \gamma \max_{a'} \sum_{k=1}^{K} \alpha_k Q_{\theta'}^k(s', a'),$$
  where $\theta'$ are frozen target-network parameters.
- The REM loss is then
  $$\mathcal{L}(\theta) = \mathbb{E}_{(s, a, r, s') \sim \mathcal{D}} \left[ \mathbb{E}_{\alpha \sim P_\Delta} \left[ \ell\big( Q_\alpha(s, a) - y \big) \right] \right],$$
  with $\ell$ typically chosen as the Huber or L2 loss.
The gradient of $\mathcal{L}(\theta)$ is backpropagated through all heads, encouraging both diversity and mutual regularization. The Bellman equation is enforced on a random subspace of the ensemble at each update, systematically reducing overfitting and yielding greater target diversity than head-wise or averaged ensembles (Agarwal et al., 2019).
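The update above fits in a few lines of code. Below is a minimal PyTorch sketch, assuming a multi-head value network whose forward pass returns a tensor of shape `(batch, K, num_actions)`; the function name `rem_loss` and all tensor names are illustrative, not from the authors' released code:

```python
import torch
import torch.nn.functional as F

def rem_loss(q_net, target_net, s, a, r, s_next, done, num_heads, gamma=0.99):
    """One REM Bellman error on a minibatch (sketch, not the reference code).

    q_net(s) and target_net(s) are assumed to return Q-values of shape
    (batch, num_heads, num_actions); `a` is a long tensor of action
    indices and `done` is a float tensor in {0, 1}.
    """
    batch_size = s.shape[0]

    # Random convex weights: normalize i.i.d. U(0, 1) draws per transition.
    alpha = torch.rand(batch_size, num_heads, device=s.device)
    alpha = alpha / alpha.sum(dim=1, keepdim=True)                # (B, K)

    # Mixture Q-value of the taken actions: sum_k alpha_k Q^k(s, a).
    q_all = q_net(s)                                              # (B, K, A)
    idx = a.view(-1, 1, 1).expand(-1, num_heads, 1)
    q_mix = (alpha * q_all.gather(2, idx).squeeze(2)).sum(dim=1)  # (B,)

    # Bellman target: max over actions of the *mixture* of target heads.
    with torch.no_grad():
        tq_mix = (alpha.unsqueeze(2) * target_net(s_next)).sum(dim=1)  # (B, A)
        y = r + gamma * (1.0 - done) * tq_mix.max(dim=1).values        # (B,)

    # Huber loss, backpropagated through all heads at once.
    return F.smooth_l1_loss(q_mix, y)
```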
3. Implementation Protocol
High-level REM training proceeds as follows:
- Ensemble Initialization: Instantiate $K$ Q-functions, either as independent networks or as a multi-head network. Target networks $Q_{\theta'}^k$ are initialized as copies of $Q_\theta^k$.
- Batch Update Loop (a code skeleton appears at the end of this section):
  - Sample a minibatch of transitions $(s, a, r, s')$.
  - For each transition, sample a convex weight vector $\alpha$ as above.
  - Compute $Q_\theta^k(s, a)$ and $Q_{\theta'}^k(s', a')$ for all actions $a'$ and all heads $k$.
  - Calculate the Bellman target $y$ using the mixture at $s'$.
  - Evaluate the loss and perform a gradient step.
  - Every $T$ gradient steps, synchronize $\theta' \leftarrow \theta$ for all heads.
- Key hyperparameters and architectural details:
| Component | Setting | Context |
|---|---|---|
| Ensemble size ($K$) | $K=200$ (multi-head REM); $K=4$ (separate networks, online REM) | Recommended by ablation (Agarwal et al., 2019) |
| Optimizer | Adam | Standard in DQN-like RL |
| Loss | Huber (δ=1) or L2 | For target-calibration |
| Mini-batch size | 32 | For offline stochastic optimization |
| Target update freq | Every 2000 gradient steps | For target stability |
| Architecture | 3 conv. layers (32/64/64 filters), kernel sizes 8x8/4x4/3x3, strides 4/2/1, FC 512, $K$ heads | Standard Atari input pipeline |
| Replay buffer | Full offline dataset, random sampling, no prioritization | All experience from pre-trained DQN |
| State preprocessing | Grayscale, 84x84 resize, stack 4 frames, action repeat, sticky actions | DQN standard pipeline |
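The architecture row above corresponds to the standard Nature-DQN convolutional torso with a widened output layer. A minimal PyTorch sketch, assuming 84x84 inputs with 4 stacked frames (the class name `MultiHeadQNetwork` and constructor arguments are illustrative, not from the paper's code):

```python
import torch
import torch.nn as nn

class MultiHeadQNetwork(nn.Module):
    """Nature-DQN convolutional torso with K Q-value heads (sketch)."""

    def __init__(self, num_actions, num_heads=200):
        super().__init__()
        self.num_heads = num_heads
        self.num_actions = num_actions
        self.torso = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),  # 84x84x4 in
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
        )
        # One linear layer emits all K * |A| head outputs at once.
        self.heads = nn.Linear(512, num_heads * num_actions)

    def forward(self, x):
        # x: (batch, 4, 84, 84) stacked grayscale frames in [0, 1].
        feats = self.torso(x)
        return self.heads(feats).view(-1, self.num_heads, self.num_actions)
```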
Evaluation is performed every 1 million training frames (125,000 evaluation steps), using an $\epsilon$-greedy policy with a small evaluation $\epsilon$, with results averaged over 5 seeds. This setup supports reproducibility and enables straightforward transfer to other offline RL domains (Agarwal et al., 2019).
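Combining the pieces, here is a minimal offline training skeleton under the same assumptions (`OfflineDataset` and its `sample` method are hypothetical stand-ins for the logged data; `MultiHeadQNetwork` and `rem_loss` are the sketches above):

```python
import torch

class OfflineDataset:
    """Stand-in for the logged replay data (hypothetical API, sketch only)."""

    def __init__(self, s, a, r, s_next, done):
        self.tensors = (s, a, r, s_next, done)

    def sample(self, batch_size):
        idx = torch.randint(self.tensors[0].shape[0], (batch_size,))
        return tuple(t[idx] for t in self.tensors)

num_actions, num_heads = 18, 200
q_net = MultiHeadQNetwork(num_actions, num_heads)
target_net = MultiHeadQNetwork(num_actions, num_heads)
target_net.load_state_dict(q_net.state_dict())   # targets start as copies
optimizer = torch.optim.Adam(q_net.parameters())

dataset = ...                  # an OfflineDataset wrapping the fixed logged data
num_gradient_steps = 200_000   # illustrative value

for step in range(1, num_gradient_steps + 1):
    s, a, r, s_next, done = dataset.sample(batch_size=32)

    loss = rem_loss(q_net, target_net, s, a, r, s_next, done,
                    num_heads=num_heads)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Synchronize frozen target parameters every 2000 gradient steps.
    if step % 2000 == 0:
        target_net.load_state_dict(q_net.state_dict())
```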
4. Theoretical Properties
REM's loss functional admits the following key theoretical property: if the random mixture distribution covers the full simplex, and the class of Q-functions is rich enough to realize $Q^*$, then *every global minimizer* of the REM objective corresponds to the Bellman-optimal solution, and all $K$ heads collapse to $Q^*$. Thus, REM regularizes towards the true fixed-point solution by forcing Bellman consistency through a high-dimensional continuum of mixtures, not merely on average or per head. This mechanism imparts an "optimistic" bias in offline Q-learning: REM systematically resists the kind of under- or over-estimation that can compromise Q-learning on fixed, non-stationary datasets (Agarwal et al., 2019).
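In the notation of Section 2, this property can be restated compactly (a paraphrase, not the paper's verbatim proposition):

```latex
% Full-support mixture weights plus a realizable Q* imply that any
% global minimizer of the REM loss makes every head equal to Q*.
\[
  \operatorname{supp}(P_\Delta) = \Delta^{K-1}
  \quad\text{and}\quad Q^* \in \mathcal{Q}
  \;\Longrightarrow\;
  \Big(\theta^\star \in \arg\min_{\theta} \mathcal{L}(\theta)
  \;\Rightarrow\; Q^k_{\theta^\star} = Q^* \ \text{for all } k\Big).
\]
```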
A plausible implication is that REM's dropout-like stochasticity continually exposes the learning process to a spectrum of counterfactual Bellman errors, which in turn constrains the empirical risk minimizer much more tightly than conventional head-wise or averaged ensemble objectives.
5. Empirical Performance and Ablations
Experiments conducted on the DQN Replay Dataset (comprising approximately 50 million transition tuples per game across 60 Atari games) demonstrate the significant practical advantage of REM in offline RL:
- REM achieves a median normalized score of ~123.8% relative to the fully trained online DQN agent and outperforms baseline DQN in 49 out of 60 games.
- Competing baselines such as QR-DQN (median 118.9%, 45/60 wins), Ensemble-DQN, and DQN+Adam perform slightly worse; average ensembles do not yield comparable policy quality.
- REM remains robust with smaller datasets: using 10% of the data, REM matches online DQN performance. Performance degrades catastrophically only at <1% dataset size.
- On sub-optimal datasets (the first 20M DQN training frames), REM and QR-DQN produce policies superior to the best behavior seen in the logged data.
- Separate networks for the heads converge faster and achieve slightly better scores than a single multi-head model, indicating a measurable benefit from head diversity.
- In online RL (with $K=4$ separate Q-networks and $\epsilon$-greedy exploration using randomly selected heads), REM matches QR-DQN and outperforms Bootstrapped-DQN.
- In continuous control, offline REM with TD3 on a DDPG buffer matches BCQ, indicating transferability and robustness without explicit behavior regularization.
Ablation studies corroborate the above claims and clarify critical dependencies on dataset size, diversity, and architecture (Agarwal et al., 2019).
6. Broader Significance and Practical Implications
REM demonstrates that enforcing Bellman consistency on random convex mixtures constitutes a principled, practical means of regularizing value-based RL. The approach mitigates the compounding effect of target miscalibration in offline RL and provides state-of-the-art empirical performance with a simple, codebase-compatible extension to existing Q-learning setups.
REM also highlights the importance of data diversity and head diversity (via architecture) in ensemble-based RL. Practitioners can implement REM on top of any modern Q-function infrastructure by adding mixture sampling and modifying the Bellman update as outlined above.
By enabling high-quality offline policy learning from sufficiently large and diverse replay datasets, REM also points to optimistic directions in scalable, robust RL, challenging the traditionally pessimistic view rooted in the risks of overestimation and distributional shift (Agarwal et al., 2019).