Random Ensemble Mixture (REM) in Offline RL
- REM is a robust value-based RL algorithm that uses random convex mixtures of multiple Q-value estimators to enforce Bellman consistency and mitigate overfitting.
- It leverages random weight sampling during each Bellman update to promote mutual regularization and diverse target estimations across the ensemble.
- Empirical results show REM achieves superior performance and stability in offline RL, often outperforming standard DQN and related ensemble baselines.
The Random Ensemble Mixture (REM) is a robust value-based reinforcement learning (RL) algorithm that leverages random convex combinations of multiple Q-value estimators to enhance generalization, stability, and policy performance, particularly in the offline RL setting. REM was introduced in the context of challenges faced by off-policy deep RL when limited to fixed datasets, where classical ensembling techniques from supervised learning are extended by enforcing Bellman optimality on mixtures of estimators rather than on each individually (Agarwal et al., 2019).
1. Motivation and Conceptual Foundation
Offline RL is characterized by the constraint of learning from a static dataset of transition tuples, with no further environment interaction available to correct extrapolation errors. Standard Q-learning, when trained on a fixed data distribution, tends to overfit: errors in target estimation propagate and amplify, leading to poorly calibrated Q-values. Ensembling, a classical method for reducing overfitting in supervised learning, is adapted in REM by forming random convex combinations of several Q-function heads at every Bellman update.
The central innovation of REM is to enforce Bellman consistency on a continuum of random mixtures rather than only on individual ensemble members or their average. For a size-$K$ ensemble $\{Q_\theta^1, \dots, Q_\theta^K\}$, a random convex weight vector $\alpha$ (with $\alpha_k \ge 0$ and $\sum_k \alpha_k = 1$) is sampled at each update, yielding a mixture $Q_\alpha(s, a) = \sum_{k=1}^{K} \alpha_k Q_\theta^k(s, a)$. Each sampled mixture is then required to satisfy the Bellman equation, exposing the ensemble to a broader target family, regularizing heads toward mutual consistency, and attenuating the variance and bias induced by fixed data. The effect is empirically similar to dropout regularization and forms a robust defense against distributional shift in the offline RL dataset (Agarwal et al., 2019).
2. Mathematical Formulation
REM operates by maintaining an ensemble of $K$ Q-functions $Q_\theta^1, \dots, Q_\theta^K$, implemented either as:
- $K$ separate networks, or
- a shared network backbone with $K$ output heads.
At each gradient step:
- Sample a random convex weight vector $\alpha$ by normalizing $K$ i.i.d. draws $\alpha'_k$ from $U(0, 1)$: $\alpha_k = \alpha'_k / \sum_{k'=1}^{K} \alpha'_{k'}$.
- Form the mixture Q-value:
  $$Q_\alpha(s, a) = \sum_{k=1}^{K} \alpha_k Q_\theta^k(s, a)$$
- For a minibatch transition $(s, a, r, s')$, the target is
  $$y = r + \gamma \max_{a'} \sum_{k=1}^{K} \alpha_k Q_{\theta'}^k(s', a'),$$
  where $\theta'$ are frozen target-network parameters.
- The REM loss is then
  $$\mathcal{L}(\theta) = \mathbb{E}_{(s, a, r, s') \sim \mathcal{D}} \left[ \mathbb{E}_{\alpha \sim P_\Delta} \left[ \ell\big( Q_\alpha(s, a) - y \big) \right] \right],$$
  with $\ell$ typically chosen as the Huber or L2 loss.
The gradient of $\mathcal{L}(\theta)$ is backpropagated through all heads, encouraging both diversity and mutual regularization. The Bellman equation is enforced on a random subspace of the ensemble at each update, systematically reducing overfitting and yielding greater target diversity than head-wise or averaged ensembles (Agarwal et al., 2019).
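The update above fits in a few lines of code. Below is a minimal PyTorch sketch, assuming a multi-head value network whose forward pass returns a tensor of shape `(batch, K, num_actions)`; the function name `rem_loss` and all tensor names are illustrative, not from the authors' released code:

```python
import torch
import torch.nn.functional as F

def rem_loss(q_net, target_net, s, a, r, s_next, done, num_heads, gamma=0.99):
    """One REM Bellman error on a minibatch (sketch, not the reference code).

    q_net(s) and target_net(s) are assumed to return Q-values of shape
    (batch, num_heads, num_actions); `a` is a long tensor of action
    indices and `done` is a float tensor in {0, 1}.
    """
    batch_size = s.shape[0]

    # Random convex weights: normalize i.i.d. U(0, 1) draws per transition.
    alpha = torch.rand(batch_size, num_heads, device=s.device)
    alpha = alpha / alpha.sum(dim=1, keepdim=True)                # (B, K)

    # Mixture Q-value of the taken actions: sum_k alpha_k Q^k(s, a).
    q_all = q_net(s)                                              # (B, K, A)
    idx = a.view(-1, 1, 1).expand(-1, num_heads, 1)
    q_mix = (alpha * q_all.gather(2, idx).squeeze(2)).sum(dim=1)  # (B,)

    # Bellman target: max over actions of the *mixture* of target heads.
    with torch.no_grad():
        tq_mix = (alpha.unsqueeze(2) * target_net(s_next)).sum(dim=1)  # (B, A)
        y = r + gamma * (1.0 - done) * tq_mix.max(dim=1).values        # (B,)

    # Huber loss, backpropagated through all heads at once.
    return F.smooth_l1_loss(q_mix, y)
```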
3. Implementation Protocol
High-level REM training proceeds as follows:
- Ensemble Initialization: Instantiate $K$ Q-functions, either as independent networks or as a multi-head network. Target networks $Q_{\theta'}^k$ are initialized as copies of $Q_\theta^k$.
- Batch Update Loop (a code skeleton appears at the end of this section):
  - Sample a minibatch of transitions $(s, a, r, s')$.
  - For each transition, sample a convex weight vector $\alpha$ as above.
  - Compute $Q_\theta^k(s, a)$ and $Q_{\theta'}^k(s', a')$ for all actions $a'$ and all heads $k$.
  - Calculate the Bellman target $y$ using the mixture at $s'$.
  - Evaluate the loss and perform a gradient step.
  - Every $T$ gradient steps, synchronize $\theta' \leftarrow \theta$ for all heads.
- Key hyperparameters and architectural details:
| Component | Setting | Context |
|---|---|---|
| Ensemble size ($K$) | $K=200$ (multi-head REM); $K=4$ (separate networks, online REM) | Recommended by ablation (Agarwal et al., 2019) |
| Optimizer | Adam | Standard in DQN-like RL |
| Loss | Huber (δ=1) or L2 | For target-calibration |
| Mini-batch size | 32 | For offline stochastic optimization |
| Target update freq | Every 2000 gradient steps | For target stability |
| Architecture | 3 conv. layers (32/64/64 filters), kernel sizes 8x8/4x4/3x3, strides 4/2/1, FC 512, $K$ heads | Standard Atari input pipeline |
| Replay buffer | Full offline dataset, random sampling, no prioritization | All experience from pre-trained DQN |
| State preprocessing | Grayscale, 84x84 resize, stack 4 frames, action repeat, sticky actions | DQN standard pipeline |
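The architecture row above corresponds to the standard Nature-DQN convolutional torso with a widened output layer. A minimal PyTorch sketch, assuming 84x84 inputs with 4 stacked frames (the class name `MultiHeadQNetwork` and constructor arguments are illustrative, not from the paper's code):

```python
import torch
import torch.nn as nn

class MultiHeadQNetwork(nn.Module):
    """Nature-DQN convolutional torso with K Q-value heads (sketch)."""

    def __init__(self, num_actions, num_heads=200):
        super().__init__()
        self.num_heads = num_heads
        self.num_actions = num_actions
        self.torso = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),  # 84x84x4 in
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
        )
        # One linear layer emits all K * |A| head outputs at once.
        self.heads = nn.Linear(512, num_heads * num_actions)

    def forward(self, x):
        # x: (batch, 4, 84, 84) stacked grayscale frames in [0, 1].
        feats = self.torso(x)
        return self.heads(feats).view(-1, self.num_heads, self.num_actions)
```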
Evaluation is performed every 1 million training frames (125,000 evaluation steps), using an $\epsilon$-greedy policy with a small evaluation $\epsilon$, with results averaged over 5 seeds. This setup supports reproducibility and enables straightforward transfer to other offline RL domains (Agarwal et al., 2019).
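Combining the pieces, here is a minimal offline training skeleton under the same assumptions (`OfflineDataset` and its `sample` method are hypothetical stand-ins for the logged data; `MultiHeadQNetwork` and `rem_loss` are the sketches above):

```python
import torch

class OfflineDataset:
    """Stand-in for the logged replay data (hypothetical API, sketch only)."""

    def __init__(self, s, a, r, s_next, done):
        self.tensors = (s, a, r, s_next, done)

    def sample(self, batch_size):
        idx = torch.randint(self.tensors[0].shape[0], (batch_size,))
        return tuple(t[idx] for t in self.tensors)

num_actions, num_heads = 18, 200
q_net = MultiHeadQNetwork(num_actions, num_heads)
target_net = MultiHeadQNetwork(num_actions, num_heads)
target_net.load_state_dict(q_net.state_dict())   # targets start as copies
optimizer = torch.optim.Adam(q_net.parameters())

dataset = ...                  # an OfflineDataset wrapping the fixed logged data
num_gradient_steps = 200_000   # illustrative value

for step in range(1, num_gradient_steps + 1):
    s, a, r, s_next, done = dataset.sample(batch_size=32)

    loss = rem_loss(q_net, target_net, s, a, r, s_next, done,
                    num_heads=num_heads)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Synchronize frozen target parameters every 2000 gradient steps.
    if step % 2000 == 0:
        target_net.load_state_dict(q_net.state_dict())
```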
4. Theoretical Properties
REM's loss functional admits the following key theoretical property: if the random mixture distribution covers the full simplex, and the class of Q-functions is rich enough to realize $Q^*$, then *every global minimizer* of the REM objective corresponds to the Bellman-optimal solution, and all $K$ heads collapse to $Q^*$. Thus, REM regularizes towards the true fixed-point solution by forcing Bellman consistency through a high-dimensional continuum of mixtures, not merely on average or per head. This mechanism imparts an "optimistic" bias in offline Q-learning: REM systematically resists the kind of under- or over-estimation that can compromise Q-learning on fixed, non-stationary datasets (Agarwal et al., 2019).
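In the notation of Section 2, this property can be restated compactly (a paraphrase, not the paper's verbatim proposition):

```latex
% Full-support mixture weights plus a realizable Q* imply that any
% global minimizer of the REM loss makes every head equal to Q*.
\[
  \operatorname{supp}(P_\Delta) = \Delta^{K-1}
  \quad\text{and}\quad Q^* \in \mathcal{Q}
  \;\Longrightarrow\;
  \Big(\theta^\star \in \arg\min_{\theta} \mathcal{L}(\theta)
  \;\Rightarrow\; Q^k_{\theta^\star} = Q^* \ \text{for all } k\Big).
\]
```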
A plausible implication is that REM's dropout-like stochasticity continually exposes the learning process to a spectrum of counterfactual Bellman errors, which in turn constrains the empirical risk minimizer much more tightly than conventional head-wise or averaged ensemble objectives.
5. Empirical Performance and Ablations
Experiments conducted on the DQN Replay Dataset (comprising approximately 50 million transition tuples per game across 60 Atari games) demonstrate the significant practical advantage of REM in offline RL:
- REM achieves a median normalized score of ~123.8% relative to the fully trained online DQN agent and outperforms baseline DQN in 49 out of 60 games.
- Competing baselines such as QR-DQN (median 118.9%, 45/60 wins), Ensemble-DQN, and DQN+Adam perform slightly worse; average ensembles do not yield comparable policy quality.
- REM remains robust with smaller datasets: using 10% of the data, REM matches online DQN performance. Performance degrades catastrophically only at <1% dataset size.
- On sub-optimal datasets (the first 20M DQN training frames), REM and QR-DQN produce policies superior to the best behavior seen in the logged data.
- Separate networks for the heads converge faster and achieve slightly better scores than a single multi-head model, indicating a measurable benefit from head diversity.
- In online RL (with $K=4$ separate Q-networks and $\epsilon$-greedy exploration using randomly selected heads), REM matches QR-DQN and outperforms Bootstrapped-DQN.
- In continuous control, offline REM with TD3 on a DDPG buffer matches BCQ, indicating transferability and robustness without explicit behavior regularization.
Ablation studies corroborate the above claims and clarify critical dependencies on dataset size, diversity, and architecture (Agarwal et al., 2019).
6. Broader Significance and Practical Implications
REM demonstrates that enforcing Bellman consistency on random convex mixtures constitutes a principled, practical means of regularizing value-based RL. The approach mitigates the compounding effect of target miscalibration in offline RL and provides state-of-the-art empirical performance with a simple, codebase-compatible extension to existing Q-learning setups.
REM also highlights the importance of data diversity and head diversity (via architecture) in ensemble-based RL. Practitioners can implement REM on top of any modern Q-function infrastructure by adding mixture sampling and modifying the Bellman update as outlined above.
By enabling high-quality offline policy learning from sufficiently large and diverse replay datasets, REM also points to optimistic directions in scalable, robust RL, challenging the traditionally pessimistic view rooted in the risks of overestimation and distributional shift (Agarwal et al., 2019).