
Random Ensemble Mixture (REM) in Offline RL

Updated 1 April 2026
  • REM is a robust value-based RL algorithm that uses random convex mixtures of multiple Q-value estimators to enforce Bellman consistency and mitigate overfitting.
  • It leverages random weight sampling during each Bellman update to promote mutual regularization and diverse target estimations across the ensemble.
  • Empirical results show REM achieves superior performance and stability in offline RL, often outperforming standard DQN and related ensemble baselines.

The Random Ensemble Mixture (REM) is a robust value-based reinforcement learning (RL) algorithm that leverages random convex combinations of multiple Q-value estimators to enhance generalization, stability, and policy performance, particularly in the offline RL setting. REM was introduced in the context of challenges faced by off-policy deep RL when limited to fixed datasets, where classical ensembling techniques from supervised learning are extended by enforcing Bellman optimality on mixtures of estimators rather than on each individually (Agarwal et al., 2019).

1. Motivation and Conceptual Foundation

Offline RL is characterized by the constraint of learning from a static dataset of tuples (s, a, r, s′) without environment interaction to correct for extrapolation errors. Standard Q-learning, when exposed to a fixed data distribution, tends to overfit: errors in target estimation propagate and amplify, leading to poorly calibrated Q-values. Ensembling, a classical method to reduce overfitting in supervised learning, is adapted in REM by forming random convex combinations of several independently trained Q-functions at every Bellman update.

The central innovation of REM is to enforce Bellman consistency on a continuum of random mixtures rather than only on individual ensemble members or their average. For a size-K ensemble {Q_i(s, a; θ_i)}, a random convex weight vector w ∈ Δ^{K−1} is sampled at each update, yielding a mixture Q_w(s, a) = Σ_{i=1}^K w_i Q_i(s, a; θ_i). Each sampled mixture is then required to satisfy the Bellman equation, exposing the ensemble to a broader target family, regularizing heads toward mutual consistency, and attenuating variance and bias from fixed data. The approach is empirically similar in effect to dropout regularization and forms a robust defense against distributional shift in the offline RL dataset (Agarwal et al., 2019).
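The mixture construction above can be sketched in a few lines of NumPy; the ensemble size and per-head Q-values here are synthetic stand-ins, not network outputs:

```python
import numpy as np

rng = np.random.default_rng(0)
K = 4  # illustrative ensemble size

# Per-head Q-values for one (state, action) pair, as K estimators would produce.
q_heads = np.array([1.0, 2.0, 3.0, 4.0])

# Sample a random point on the simplex: draw K i.i.d. Uniform(0, 1)
# values and normalize them so the weights sum to 1.
alpha = rng.uniform(size=K)
w = alpha / alpha.sum()

# Random convex mixture Q_w(s, a) = sum_i w_i * Q_i(s, a).
q_mix = np.dot(w, q_heads)

assert np.isclose(w.sum(), 1.0)
assert q_heads.min() <= q_mix <= q_heads.max()  # convexity keeps the mixture in range
```

Because the weights are convex, the mixture always lies between the smallest and largest head value, so no single sampled mixture can diverge beyond the ensemble's own range.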

2. Mathematical Formulation

REM operates by maintaining an ensemble of K Q-functions Q_1, Q_2, …, Q_K, implemented either as:

  • K separate networks, or
  • a shared network backbone with K output heads.

At each gradient step:

  1. Sample a random convex weight vector w by drawing K i.i.d. values α_i ∼ Uniform(0, 1) and normalizing: w_i = α_i / Σ_j α_j.
  2. Form the mixture Q-value:

Q_w(s, a; θ) = Σ_{i=1}^K w_i Q_i(s, a; θ_i)

  3. For a minibatch transition (s, a, r, s′), the target is

y = r + γ max_{a′} Q_w(s′, a′; θ⁻)

where θ⁻ are frozen target network parameters.

  4. The REM loss is then

L(θ) = E_{(s,a,r,s′)∼D} E_{w∼P_Δ} [ ℓ( Q_w(s, a; θ) − y ) ]

with ℓ typically chosen as the Huber or L2 loss, and P_Δ the distribution over simplex weights.

The gradient of the REM loss is backpropagated through all heads, encouraging both diversity and mutual regularization. The Bellman equation is enforced on a random subspace of the ensemble at each update, systematically reducing overfitting and yielding greater target diversity than head-wise or averaged ensembles (Agarwal et al., 2019).
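The four steps above can be made concrete for one minibatch in NumPy; all arrays here are synthetic stand-ins for online and frozen target network outputs, and the batch size, action count, and γ are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
K, batch, n_actions, gamma = 4, 32, 6, 0.99

# Synthetic per-head Q-values; in practice these come from the online
# network (theta) and the frozen target network (theta^-).
q_online = rng.normal(size=(batch, K, n_actions))   # Q_i(s, .; theta)
q_target = rng.normal(size=(batch, K, n_actions))   # Q_i(s', .; theta^-)
actions  = rng.integers(n_actions, size=batch)
rewards  = rng.normal(size=batch)

# One random simplex weight vector per transition.
alpha = rng.uniform(size=(batch, K))
w = alpha / alpha.sum(axis=1, keepdims=True)

# Mixture values: Q_w(s, .) and Q_w(s', .) for every action.
q_w_online = np.einsum('bk,bka->ba', w, q_online)
q_w_target = np.einsum('bk,bka->ba', w, q_target)

q_sa = q_w_online[np.arange(batch), actions]        # Q_w(s, a; theta)
y = rewards + gamma * q_w_target.max(axis=1)        # Bellman target

# Huber loss (delta = 1) on the TD error, averaged over the batch.
td = q_sa - y
huber = np.where(np.abs(td) <= 1.0, 0.5 * td**2, np.abs(td) - 0.5)
loss = huber.mean()
```

In an actual implementation the same expression would be built in an autodiff framework so that the gradient flows into every head in proportion to its sampled weight w_i.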

3. Implementation Protocol

High-level REM training proceeds as follows:

  • Ensemble Initialization: Instantiate K Q-functions Q_i(s, a; θ_i), either as independent networks or as a multi-head network. Target networks Q_i(s, a; θ_i⁻) are initialized as copies of the online networks.
  • Batch Update Loop:
    • Sample a minibatch of transitions (s, a, r, s′) from the offline dataset.
    • For each transition, sample a convex weight vector w as above.
    • Compute Q_w(s, a) and Q_w(s′, a′) for all candidate actions a′.
    • Calculate the Bellman target y using the mixture of target networks θ⁻.
    • Evaluate the loss and perform a gradient step.
    • Every N gradient steps, synchronize θ_i⁻ ← θ_i for all heads.
  • Key hyperparameters and architectural details:

| Component | Setting | Context |
|---|---|---|
| Ensemble size (K) | 200 (multi-head REM); 4 (separate networks for online REM) | Recommended by ablation (Agarwal et al., 2019) |
| Optimizer | Adam (learning rate and ε as in Agarwal et al., 2019) | Standard in DQN-like RL |
| Loss | Huber (δ = 1) or L2 | For target calibration |
| Mini-batch size | 32 | For offline stochastic optimization |
| Target update freq. | Every 2000 gradient steps | For target stability |
| Architecture | 3 conv. layers (32/64/64 filters), kernels 8×8/4×4/3×3, strides 4/2/1, FC 512, K heads | Standard Atari input pipeline |
| Replay buffer | Full offline dataset, uniform random sampling, no prioritization | All experience from a pre-trained DQN |
| State preprocessing | Grayscale, 84×84 resize, stack of 4 frames, action repeat, sticky actions | DQN standard pipeline |

Evaluation is performed every 1 million frames (125,000 eval steps), using ε-greedy action selection with ε = 0.001, with results averaged over 5 seeds. This setup provides reproducibility and enables straightforward transfer to other offline RL domains (Agarwal et al., 2019).
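To make the protocol concrete end-to-end, here is a self-contained tabular sketch of the update loop on a toy two-state MDP. The MDP, learning rate, step counts, and target period are illustrative choices for this sketch, not the paper's Atari setup; the tabular heads stand in for networks:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy deterministic MDP: action 1 from state 0 moves to state 1 with
# reward 1; every other (s, a) stays in place with reward 0.
def step(s, a):
    if s == 0 and a == 1:
        return 1.0, 1
    return 0.0, s

gamma = 0.9
n_states, n_actions, K = 2, 2, 4
lr, target_period, n_steps = 0.3, 50, 2000

# Offline "dataset": every transition of the MDP, sampled repeatedly.
dataset = [(s, a, *step(s, a)) for s in range(n_states) for a in range(n_actions)]

# Tabular ensemble: K Q-tables plus frozen target copies.
q = rng.normal(scale=0.1, size=(K, n_states, n_actions))
q_targ = q.copy()

for t in range(n_steps):
    s, a, r, s2 = dataset[rng.integers(len(dataset))]
    alpha = rng.uniform(size=K)
    w = alpha / alpha.sum()                         # random simplex weights
    y = r + gamma * (w @ q_targ[:, s2, :]).max()    # mixture Bellman target
    td = (w @ q[:, s, a]) - y                       # mixture TD error
    q[:, s, a] -= lr * td * w                       # grad of 0.5*td^2 w.r.t. head i is td*w_i
    if (t + 1) % target_period == 0:
        q_targ = q.copy()                           # synchronize targets

# For this MDP, Q*(0,1) = 1, Q*(0,0) = gamma = 0.9, and Q*(1,.) = 0,
# so the heads should approach those values.
print(np.round(q.mean(axis=0), 2))
```

Note how every head at the visited (s, a) moves toward the same mixture target, scaled by its sampled weight; this is the mutual-regularization mechanism the section describes.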

4. Theoretical Properties

REM's loss functional admits the following key theoretical property: if the random mixture distribution covers the full simplex, and the class of Q-functions is rich enough to realize Q*, then every global minimizer of the REM objective corresponds to the Bellman-optimal solution, and all K heads collapse to Q*. Thus, REM regularizes towards the true fixed-point solution by forcing Bellman consistency through a high-dimensional continuum of mixtures, not merely on average or per-head. This mechanism imparts an "optimistic" bias in offline Q-learning: REM systematically resists the type of under- or over-estimation that can compromise Q-learning on fixed, non-stationary datasets (Agarwal et al., 2019).
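The collapse of all heads to Q* can be seen by evaluating the mixture constraint at the corners of the simplex; a sketch of the argument:

```latex
\begin{align*}
&\text{Suppose } Q_w(s,a) = \textstyle\sum_{i=1}^{K} w_i\, Q_i(s,a)
  \text{ satisfies the Bellman optimality equation for every } w \in \Delta^{K-1}.\\
&\text{Taking the corner } w = e_i \text{ yields, for each } i,\quad
  Q_i(s,a) = r(s,a) + \gamma\, \mathbb{E}_{s'}\!\left[\max_{a'} Q_i(s',a')\right].\\
&\text{Since the Bellman optimality operator is a } \gamma\text{-contraction with unique
  fixed point } Q^*,\ \text{it follows that } Q_i = Q^* \text{ for all } i.
\end{align*}
```

The corner cases are the extreme points of the continuum of constraints REM samples from, which is why full simplex coverage is required in the hypothesis.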

A plausible implication is that REM's dropout-like stochasticity continually exposes the learning process to a spectrum of counterfactual Bellman errors, which in turn constrains the empirical risk minimizer much more tightly than conventional headwise or averaged ensemble objectives.

5. Empirical Performance and Ablations

Experiments conducted on the DQN Replay Dataset, comprising approximately 50 million (s, a, r, s′) transitions per game across 60 Atari games, demonstrate the significant practical advantage of REM in offline RL:

  • REM achieves a median normalized score of ~123.8% relative to the fully trained online DQN agent and outperforms baseline DQN in 49 out of 60 games.
  • Competing baselines such as QR-DQN (median 118.9%, 45/60 wins), Ensemble-DQN, and DQN+Adam perform slightly worse; average ensembles do not yield comparable policy quality.
  • REM remains robust with smaller datasets: using 10% of the data, REM matches online DQN performance. Performance degrades catastrophically only at <1% dataset size.
  • On sub-optimal datasets (first 20M DQN frames), REM and QR-DQN produce policies superior to the best seen in the logged data.
  • Separate networks for heads converge faster and achieve slightly better scores than a single multi-head model, indicating measurable benefit from diversity.
  • In online RL (with K = 4 separate Q-networks and ε-greedy exploration using a randomly selected head), REM matches QR-DQN and outperforms Bootstrapped-DQN.
  • In continuous control, offline REM with TD3 on a DDPG buffer matches BCQ, indicating transferability and robustness without explicit behavior regularization.

Ablation studies corroborate the above claims and clarify critical dependencies on dataset size, diversity, and architecture (Agarwal et al., 2019).

6. Broader Significance and Practical Implications

REM demonstrates that enforcing Bellman consistency on random convex mixtures constitutes a principled, practical means of regularizing value-based RL. The approach mitigates the compounding effect of target miscalibration in offline RL and provides state-of-the-art empirical performance with a simple, codebase-compatible extension to existing Q-learning setups.

REM also highlights the importance of data diversity and head diversity (via architecture) in ensemble-based RL. Practitioners can implement REM on top of any modern Q-function infrastructure by adding mixture sampling and modifying the Bellman update as outlined.

By facilitating high-quality offline policy learning from sufficiently large and diverse replay datasets, REM also points to optimistic directions in scalable, robust RL—bypassing the traditionally pessimistic view espoused due to the risks of overestimation and distributional shift (Agarwal et al., 2019).
