ConserWeightive Behavioral Cloning (CWBC)
- ConserWeightive Behavioral Cloning (CWBC) refines return-conditioned behavioral cloning by reweighting training trajectories toward high returns and adding a conservative regularizer that handles out-of-distribution target returns.
- It substantially improves offline RL performance, with gains of roughly 18% on locomotion tasks and 72% on Atari benchmarks, by closing the train–test gap in return conditioning.
- By combining trajectory weighting with a conservative penalty, CWBC addresses the reliability failures of return-conditioned BC and maintains stable, near-expert performance even when conditioning beyond the returns observed in the data.
ConserWeightive Behavioral Cloning (CWBC) is a methodology for improving the reliability of Behavioral Cloning (BC) in offline reinforcement learning (RL), particularly under return-conditioning frameworks such as Decision Transformer (DT) and Reinforcement Learning via Supervised Learning (RvS). It addresses critical failure modes of return-conditioned BC by introducing principled trajectory weighting and a conservative regularization scheme, thereby enhancing both peak performance and robustness to out-of-distribution (OOD) target returns (Nguyen et al., 2022).
1. Motivation and Reliability Failures in Return-Conditioned BC
Recent advances in offline RL have demonstrated that BC methods conditioned on future returns, notably DT and RvS, can perform competitively with value-based approaches while remaining simpler and more stable to train. In practice, it is desirable to set the target return $g$ during evaluation, ideally beyond the highest return observed in the dataset, to elicit expert-level behavior. However, standard return-conditioned BC models exhibit severe reliability failures: as $g$ exceeds $g_{\max}$, the maximum return achieved in the offline data, the policy's actual performance often collapses precipitously, violating the expectation that performance increases monotonically with higher conditioning (Nguyen et al., 2022).
Two root causes underlie this unreliability:
- Data-centric factors: Offline datasets are typically suboptimal and dominated by low-return trajectories, exposing the model to a train–test gap during OOD conditioning.
- Model-centric factors: Certain architectures (e.g., MLPs operating on concatenated inputs) cannot ignore OOD return inputs and extrapolate poorly, while Transformers (DT) can partially ignore these tokens, offering greater robustness.
CWBC is introduced to close this reliability gap, ensuring robust extrapolation to unseen or high target returns during test time.
2. Core Methodological Components of CWBC
CWBC augments any return-conditioned BC framework with two synergistic ingredients: trajectory weighting and a conservative regularization penalty.
2.1 Trajectory Weighting
CWBC re-weights the empirical distribution over training trajectories so that high-return samples are favored, reducing the mismatch between the training-time and expert-level (test-time) return distributions. Define $g_\tau$ as the return of trajectory $\tau$, with $p(g)$ the observed return density and $g_{\max} = \max_\tau g_\tau$ the highest return in the dataset.
The corrected sampling density is
$$\tilde{p}(g) \;\propto\; \frac{p(g)}{\lambda + p(g)}\,\exp\!\left(\frac{g - g_{\max}}{\kappa}\right),$$
where $\lambda$ controls the variance of the reweighting and $\kappa$ its bias/smoothness. In practice, trajectories are binned into $K$ equally-populated return bins $\mathcal{B}_1, \dots, \mathcal{B}_K$, and the probability of sampling bin $\mathcal{B}_k$ is
$$P(\mathcal{B}_k) \;\propto\; \exp\!\left(\frac{\bar{g}_k - g_{\max}}{\kappa}\right),$$
with $\bar{g}_k$ the mean return in bin $\mathcal{B}_k$. Uniform sampling within the selected bin then ensures that high-return (and typically rare) trajectories are systematically upweighted.
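As an illustrative sketch only (not the authors' reference implementation), the equal-count binning and exponentially tilted bin sampling described above might look as follows; the tilt form $\exp((\bar{g}_k - g_{\max})/\kappa)$, the value of `kappa`, the bin count, and the toy return distribution are all assumptions for demonstration.

```python
import numpy as np

def make_bins(returns, num_bins):
    """Split trajectory indices into equally-populated bins by return."""
    order = np.argsort(returns)                # low -> high return
    return np.array_split(order, num_bins)     # ~equal count per bin

def bin_probabilities(returns, bins, kappa):
    """Exponential tilt toward the maximum observed return."""
    g_max = returns.max()
    mean_g = np.array([returns[b].mean() for b in bins])
    logits = (mean_g - g_max) / kappa
    p = np.exp(logits - logits.max())          # subtract max for stability
    return p / p.sum()

def sample_trajectory(returns, bins, probs, rng):
    """Pick a bin by tilted probability, then a trajectory uniformly inside it."""
    k = rng.choice(len(bins), p=probs)
    return rng.choice(bins[k])

rng = np.random.default_rng(0)
returns = rng.exponential(100.0, size=1000)    # toy, low-return-heavy dataset
bins = make_bins(returns, num_bins=20)
probs = bin_probabilities(returns, bins, kappa=50.0)
idx = sample_trajectory(returns, bins, probs, rng)
```

Because the bins are sorted by return, the sampling probabilities increase monotonically toward the highest-return bin, which is exactly the upweighting effect the method relies on.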
2.2 Conservative Regularization
For models sensitive to OOD returns (notably RvS), a conservative penalty is introduced to constrain the policy's output on artificially high return-to-go (RTG) inputs. For trajectories whose return exceeds a high $q$-th return percentile, a positive perturbation is applied to the RTG sequence, ensuring the perturbed initial RTG satisfies $\hat{g}_1 \geq g_{\max}$. At timestep $t$:
$$\hat{g}_t \;=\; g_t + (g_{\max} - g_1) + \epsilon, \qquad \epsilon \sim \left|\mathcal{N}(0, \sigma^2)\right|,$$
and the conservative regularizer for RvS is
$$\mathcal{L}_{\mathrm{conserv}}(\theta) \;=\; \mathbb{E}_{\tau,\,t}\!\left[\,\big\|\pi_\theta(s_t, \hat{g}_t) - a_t\big\|_2^2\,\right].$$
This penalty compels the model to produce in-distribution actions for OOD return queries, promoting stability under extrapolation.
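A minimal numpy sketch of this mechanism is shown below, assuming the shift-plus-positive-noise form of the perturbation (the same offset added at every timestep so RTG differences are preserved); the linear `policy`, `g_max`, and `sigma` here are toy stand-ins, not the paper's MLP or its settings.

```python
import numpy as np

def perturb_rtgs(rtgs, g_max, sigma, rng):
    """Shift a trajectory's RTG sequence so its initial RTG lands above g_max.

    rtgs: per-timestep returns-to-go; rtgs[0] is the trajectory return g_1.
    The same offset is added at every timestep, preserving RTG differences.
    """
    eps = abs(rng.normal(0.0, sigma))
    offset = (g_max - rtgs[0]) + eps           # guarantees rtgs_hat[0] >= g_max
    return rtgs + offset

def conservative_penalty(policy, states, rtgs_hat, actions):
    """Squared error between the policy's actions on perturbed (OOD) RTGs
    and the dataset actions: the policy should stay in-distribution."""
    pred = policy(states, rtgs_hat)
    return np.mean((pred - actions) ** 2)

# Toy stand-in policy: linear in [state, rtg] (assumption, not the paper's MLP).
rng = np.random.default_rng(0)
W = rng.normal(size=(3,))                      # 2 state dims + 1 RTG dim
policy = lambda s, g: np.concatenate([s, g[:, None]], axis=1) @ W

states = rng.normal(size=(10, 2))
rewards = rng.uniform(0.0, 1.0, size=10)
rtgs = np.cumsum(rewards[::-1])[::-1]          # per-timestep returns-to-go
rtgs_hat = perturb_rtgs(rtgs, g_max=20.0, sigma=1.0, rng=rng)
actions = rng.normal(size=(10,))
penalty = conservative_penalty(policy, states, rtgs_hat, actions)
```

Note that only the conditioning input is perturbed; the regression target remains the dataset action, which is what makes the penalty "conservative".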
3. Combined Objective and Training Protocol
The overall CWBC objective for a policy $\pi_\theta$ is
$$\mathcal{L}(\theta) \;=\; \mathcal{L}_{\mathrm{BC}}(\theta) + \alpha\,\mathcal{L}_{\mathrm{conserv}}(\theta),$$
where
$$\mathcal{L}_{\mathrm{BC}}(\theta) \;=\; \mathbb{E}_{\tau \sim \tilde{p},\,t}\!\left[\,\big\|\pi_\theta(s_t, g_t) - a_t\big\|_2^2\,\right]$$
is the weighted BC loss and $\mathcal{L}_{\mathrm{conserv}}(\theta)$ is the conservative regularizer (zero for DT, nonzero for RvS). The tradeoff parameter $\alpha$ balances fidelity to expert-like data against conservatism for OOD inputs.
Implementation parameters in the original study are as follows:
- RvS: two-layer MLP (hidden dim $1024$, ReLU), taking the concatenated state and RTG $(s_t, g_t)$ as input.
- DT: three-layer Transformer (hidden dim $128$, one attention head, context length $20$, dropout $0.1$).
4. Algorithmic Realization
The training loop for RvS+CWBC consists of the following steps:
- Initialize policy parameters $\theta$.
- For each iteration:
- Sample a batch of trajectories via weighted bin sampling.
- For each trajectory:
- Compute per-timestep RTGs.
- If the trajectory's return exceeds the $q$-th return percentile, perturb the RTG as described and compute $\hat{g}_t$.
- Compute the empirical BC loss $\mathcal{L}_{\mathrm{BC}}$ and conservative penalty $\mathcal{L}_{\mathrm{conserv}}$.
- Update $\theta$ by gradient descent on $\mathcal{L}_{\mathrm{BC}} + \alpha\,\mathcal{L}_{\mathrm{conserv}}$.
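Putting the pieces together, the training loop above can be sketched end-to-end as follows. This is a toy illustration under stated assumptions: a linear return-conditioned policy with hand-derived gradients stands in for the RvS MLP, and all hyperparameter values (`kappa`, `sigma`, `alpha`, `q`, learning rate, bin count) are placeholders rather than the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy offline dataset: N trajectories of length T, 2-D states, scalar actions.
N, T = 200, 10
states  = rng.normal(size=(N, T, 2))
actions = rng.normal(size=(N, T))
rewards = rng.uniform(0.0, 1.0, size=(N, T))
rtgs    = np.cumsum(rewards[:, ::-1], axis=1)[:, ::-1]   # per-timestep RTGs
g       = rtgs[:, 0]                                     # trajectory returns
g_max   = g.max()

# Trajectory weighting: equal-count return bins, exponential tilt.
kappa, num_bins = 5.0, 10
bins = np.array_split(np.argsort(g), num_bins)
logits = np.array([(g[b].mean() - g_max) / kappa for b in bins])
bin_p = np.exp(logits - logits.max()); bin_p /= bin_p.sum()

# Linear return-conditioned policy: a = W @ [s, rtg] (illustrative stand-in).
W = np.zeros(3)
alpha, sigma, q, lr = 0.5, 1.0, 0.95, 5e-3
g_q = np.quantile(g, q)

def features(s, rtg):
    return np.concatenate([s, rtg[:, None]], axis=1)     # (T, 3)

for it in range(500):
    k = rng.choice(num_bins, p=bin_p)                    # sample a bin, then
    i = rng.choice(bins[k])                              # a trajectory in it
    X = features(states[i], rtgs[i])
    grad = 2 * X.T @ (X @ W - actions[i]) / T            # BC loss gradient
    if g[i] >= g_q:                                      # conservative term
        off = (g_max - g[i]) + abs(rng.normal(0.0, sigma))
        Xh = features(states[i], rtgs[i] + off)          # perturbed OOD RTGs
        grad += alpha * 2 * Xh.T @ (Xh @ W - actions[i]) / T
    W -= lr * grad

bc_loss = np.mean((features(states[0], rtgs[0]) @ W - actions[0]) ** 2)
```

The conservative branch fires only for top-quantile trajectories, matching the described protocol; for DT the branch would simply be dropped ($\mathcal{L}_{\mathrm{conserv}} = 0$).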
Pseudocode, hyperparameters, and architectural details for DT and RvS are specified explicitly (Nguyen et al., 2022).
5. Empirical Results and Ablations
CWBC was evaluated on D4RL locomotion (hopper, walker2d, and halfcheetah with medium, med-replay, and med-expert datasets), Atari (Breakout, Qbert, Pong, and Seaquest on 500K DQN-replay transitions), and the AntMaze suite (umaze, medium, and large layouts in their v0, diverse, and play variants).
Key findings include:
- RvS+CWBC exhibits an ≈18% gain over vanilla RvS on locomotion (8/9 tasks) and ≈72% on Atari, matching or exceeding state-of-the-art value-based baselines.
- DT+CWBC improves DT by ≈8% overall, notably on low-quality datasets (med-replay).
- Ablation studies demonstrate that trajectory weighting alone (RvS+W) elevates returns near $g_{\max}$ but fails for $g > g_{\max}$; conservative regularization alone (RvS+C) stabilizes the model under OOD returns but does not fully exploit rare high-return data. The combination (RvS+W+C) yields both stable extrapolation and strong performance.
- CWBC outperforms naïve strategies such as max-return clipping (capping the conditioning return at $g_{\max}$) and hard filtering (training only on top-quantile trajectories), as these alternatives either impede extrapolation or suffer from high variance and data inefficiency.
6. Practical Guidelines, Hyperparameterization, and Limitations
Recommended practice for Gym locomotion tasks fixes the weighting parameters $\lambda$ and $\kappa$, the number of bins $K$, the return percentile $q$, the RTG noise standard deviation $\sigma$, and the regularization weight $\alpha$ to the values reported in the original study, with batch size $64$, $100$K training iterations, and per-model learning rates as reported there. At test time, the conditioning RTG should be set to the expert return $g_{\mathrm{expert}}$, with no per-task tuning required.
Limitations:
- For datasets containing few or no high-return trajectories, CWBC may not reach expert-level performance.
- Aggressive conservative regularization ($\alpha$ too large) can cause underfitting; a moderate $\alpha$ is recommended.
- CWBC does not guarantee an ideal monotonic or linear extrapolation curve beyond the expert return; surpassing expert performance necessitates richer data or online adaptation.
CWBC thus serves as a low-overhead, model-agnostic wrapper that enables reliable and robust return-conditioning in offline RL, effectively shrinking the train–test gap and maintaining strong, stable performance in extrapolation regimes (Nguyen et al., 2022).