ConserWeightive Behavioral Cloning (CWBC)

Updated 15 January 2026
  • ConserWeightive Behavioral Cloning (CWBC) is a technique that refines return-conditioned behavioral cloning by reweighting training trajectories and incorporating a conservative regularization to handle out-of-distribution returns.
  • It significantly improves performance in offline RL, achieving gains of approximately 18% in locomotion tasks and 72% in Atari benchmarks by closing the train-test gap.
  • By combining trajectory weighting with conservative penalties, CWBC overcomes reliability issues and ensures stable, expert-level performance even under extrapolation beyond observed returns.

ConserWeightive Behavioral Cloning (CWBC) is a methodology for improving the reliability of Behavioral Cloning (BC) in offline reinforcement learning (RL), particularly under return-conditioning frameworks such as Decision Transformer (DT) and Reinforcement Learning via Supervised Learning (RvS). It addresses critical failure modes of return-conditioned BC by introducing principled trajectory weighting and a conservative regularization scheme, thereby enhancing both peak performance and robustness to out-of-distribution (OOD) target returns (Nguyen et al., 2022).

1. Motivation and Reliability Failures in Return-Conditioned BC

Recent advances in offline RL have demonstrated that BC methods conditioned on future returns, notably DT and RvS, can perform competitively compared to value-based approaches while maintaining greater simplicity and stability. In practice, it is desirable to set the target return $g_{\rm target}$ during evaluation—ideally beyond the highest return observed in the dataset—to elicit expert-level behavior. However, standard return-conditioned BC models exhibit severe reliability failures: as $g_{\rm target}$ exceeds $r_{\max}$, the maximum achieved return in the offline data, the policy's actual performance often collapses precipitously, violating the expectation that performance increases monotonically with higher conditioning (Nguyen et al., 2022).

Two root causes underlie this unreliability:

  • Data-centric factors: Offline datasets are typically suboptimal and dominated by low-return trajectories, exposing the model to a train–test gap during OOD conditioning.
  • Model-centric factors: Certain architectures (e.g., MLPs operating on concatenated inputs) are forced to heed OOD return inputs and exhibit poor extrapolation, while Transformers (DT) can partially ignore these tokens, offering greater robustness.

CWBC is introduced to close this reliability gap, ensuring robust extrapolation to unseen or high target returns during test time.

2. Core Methodological Components of CWBC

CWBC augments any return-conditioned BC framework with two synergistic ingredients: trajectory weighting and a conservative regularization penalty.

2.1 Trajectory Weighting

CWBC re-weights the empirical distribution over training trajectories such that high-return samples are favored, reducing the bias between the training and expert-level (test-time) return distributions. Define $r_\tau = \sum_{t=1}^{|\tau|} r_t$ for trajectory $\tau$, with $f_T(r)$ the observed return density, and $r^\star = \max_{\tau\in T} r_\tau$.

The corrected sampling density is

$$p_{\rm CWBC}(\tau) \propto \frac{f_T(r_\tau)}{f_T(r_\tau) + \lambda}\, \exp\left(-\frac{|r_\tau - r^\star|}{\kappa}\right),$$

where $\lambda > 0$ (variance control) and $\kappa > 0$ (bias/smoothness control). In practice, trajectories are binned into $B$ equally populated return bins $\{b\}$, and the bin sampling probability is

$$p(b) \propto \frac{|b|/|T|}{|b|/|T| + \lambda}\, \exp\left(-\frac{|\bar r^b - r^\star|}{\kappa}\right),$$

with $\bar r^b$ the mean return in bin $b$. Uniform sampling within the selected bin ensures that high-return (and typically rare) trajectories are systematically upweighted.
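The bin-based weighting can be sketched in a few lines of numpy. This is a minimal illustration under my own naming; the paper's implementation details may differ:

```python
import numpy as np

def bin_sampling_probs(returns, num_bins=20, lam=0.01, kappa=1.0):
    """Per-bin sampling probabilities for CWBC-style trajectory weighting.

    returns: array of trajectory returns r_tau.
    lam:     lambda, the variance-control smoothing term.
    kappa:   bias/smoothness temperature of the exponential tilt toward r*.
    """
    returns = np.asarray(returns, dtype=float)
    r_star = returns.max()
    # Equally populated bins via quantiles of the empirical return distribution.
    edges = np.quantile(returns, np.linspace(0.0, 1.0, num_bins + 1))
    bin_ids = np.clip(np.searchsorted(edges, returns, side="right") - 1,
                      0, num_bins - 1)
    probs = np.zeros(num_bins)
    n = len(returns)
    for b in range(num_bins):
        mask = bin_ids == b
        if not mask.any():
            continue
        frac = mask.sum() / n                 # |b| / |T|
        mean_r = returns[mask].mean()         # \bar r^b
        probs[b] = frac / (frac + lam) * np.exp(-abs(mean_r - r_star) / kappa)
    return probs / probs.sum(), bin_ids
```

A training batch is then drawn by first sampling a bin index under `probs` and then sampling a trajectory uniformly from that bin.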

2.2 Conservative Regularization

For models sensitive to OOD returns (notably RvS), a conservative penalty is introduced to constrain the policy's output on artificially high return-to-go (RTG) inputs. For trajectories exceeding the $q$-th return percentile ($q=95$ in typical usage), a perturbation $\epsilon \sim \mathcal{U}[r^\star - r_\tau,\ r^\star - r_\tau + \sqrt{12}\,\sigma]$ is applied to the initial RTG, ensuring the perturbed $\tilde g_1 \geq r^\star$. At timestep $t$:

$$\tilde g_t = g_t + \epsilon, \qquad \tilde\omega_t = \frac{\tilde g_t}{H - t + 1},$$

and the conservative regularizer for RvS is

$$\mathcal{C}_{\rm RvS}(\theta) = \mathbb{E}_{\substack{\tau\sim T \\ \epsilon\sim\mathcal{E}_\tau}} \left[\mathbf{1}_{r_\tau > r_q}\, \frac{1}{|\tau|}\sum_{t=1}^{|\tau|} \|\pi_\theta(s_t, \tilde\omega_t) - a_t\|^2\right].$$

This penalty compels the model to produce in-distribution actions for OOD return queries, promoting stability under extrapolation.
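A minimal numpy sketch of the perturbation and the average-RTG conditioning defined above (function names are my own, not from the paper's codebase):

```python
import numpy as np

def perturb_rtg(rtgs, r_star, sigma, rng):
    """Shift a high-return trajectory's RTG sequence by one shared epsilon.

    rtgs: array [g_1, ..., g_T] with g_1 = r_tau (the trajectory return).
    epsilon ~ U[r* - r_tau, r* - r_tau + sqrt(12)*sigma], so the perturbed
    initial RTG satisfies g_1 + epsilon >= r*.
    """
    lo = r_star - rtgs[0]
    eps = rng.uniform(lo, lo + np.sqrt(12.0) * sigma)
    return rtgs + eps

def avg_rtg_conditioning(rtgs, H):
    """omega_t = g_t / (H - t + 1): the average-RTG input used by RvS."""
    t = np.arange(1, len(rtgs) + 1)
    return rtgs / (H - t + 1)
```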

3. Combined Objective and Training Protocol

The overall CWBC objective for a policy parameterized by $\theta$ is

$$\min_\theta\ \mathcal{L}_{\rm BC}(\theta) + \alpha\,\mathcal{C}(\theta),$$

where

$$\mathcal{L}_{\rm BC}(\theta) = \mathbb{E}_{\tau\sim p_{\rm CWBC}}\left[\frac{1}{|\tau|}\sum_{t=1}^{|\tau|} \|\pi_\theta(s_t, \omega_t) - a_t\|^2\right]$$

is the weighted BC loss and $\mathcal{C}(\theta)$ is the conservative regularizer (zero for DT, nonzero for RvS). The tradeoff parameter $\alpha \geq 0$ balances fidelity to expert-like data against conservatism for OOD inputs.

Implementation parameters in the original study are as follows:

  • RvS: two-layer MLP (hidden dim $1024$, ReLU), input $(s_t, \omega_t)$.
  • DT: three-layer Transformer (hidden dim $128$, one attention head, context length $20$, dropout $0.1$).

4. Algorithmic Realization

The training loop for RvS+CWBC consists of the following steps:

  1. Initialize parameters $\theta$.
  2. For each iteration:
    • Sample a batch of $S$ trajectories via weighted bin sampling.
    • For each trajectory:
      • Compute per-timestep RTGs.
      • If $r_\tau > r_q$, perturb the RTG as described and compute $\tilde\omega_t$.
    • Compute the empirical BC loss $\widehat{\mathcal{L}}_{\rm BC}$ and conservative penalty $\widehat{\mathcal{C}}$.
    • Update parameters by gradient descent on $\widehat{\mathcal{L}}_{\rm BC} + \alpha\,\widehat{\mathcal{C}}$.
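The loop above can be sketched end-to-end with a linear policy standing in for $\pi_\theta$ (the paper uses MLP/Transformer policies; this numpy version only illustrates the two loss terms and the gradient update, under my own naming):

```python
import numpy as np

def cwbc_step(theta, batch, r_q, r_star, sigma, alpha, lr, rng):
    """One CWBC gradient step for a linear policy pi(s, omega) = theta @ [s; omega].

    theta: (act_dim, obs_dim + 1) weight matrix.
    batch: list of dicts with 'states' (T, obs_dim), 'actions' (T, act_dim),
           'rtgs' (T,) where rtgs[0] = r_tau, and horizon 'H'.
    """
    grad = np.zeros_like(theta)
    for traj in batch:
        s, a, g, H = traj["states"], traj["actions"], traj["rtgs"], traj["H"]
        T = len(s)
        t = np.arange(1, T + 1)
        omega = g / (H - t + 1)                       # average-RTG conditioning
        x = np.concatenate([s, omega[:, None]], axis=1)
        err = x @ theta.T - a                         # BC residual
        grad += (2.0 / T) * err.T @ x                 # grad of (1/T)||err||^2
        if g[0] > r_q:                                # conservative term
            lo = r_star - g[0]
            eps = rng.uniform(lo, lo + np.sqrt(12.0) * sigma)
            omega_t = (g + eps) / (H - t + 1)         # perturbed conditioning
            xt = np.concatenate([s, omega_t[:, None]], axis=1)
            errt = xt @ theta.T - a
            grad += (2.0 * alpha / T) * errt.T @ xt
    return theta - lr * grad / len(batch)
```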

Pseudocode, hyperparameters, and architectural details for DT and RvS are specified explicitly (Nguyen et al., 2022).

5. Empirical Results and Ablations

CWBC was evaluated on D4RL locomotion (hopper, walker2d, halfcheetah with medium, med-replay, med-expert datasets), Atari (Breakout, Qbert, Pong, Seaquest on 500K DQN-replay transitions), and the AntMaze suite (umaze, medium, large in v0, diverse, play).

Key findings include:

  • RvS+CWBC exhibits an ≈18% gain over vanilla RvS on locomotion (8/9 tasks) and ≈72% on Atari, matching or exceeding state-of-the-art value-based baselines.
  • DT+CWBC improves DT by ≈8% overall, notably on low-quality datasets (med-replay).
  • Ablation studies demonstrate that trajectory weighting alone (RvS+W) elevates returns near $r_{\max}$ but fails for $g > r_{\max}$; conservative regularization alone (RvS+C) stabilizes the model under OOD returns but does not fully exploit rare high-return data. The combination (RvS+W+C) yields both stable extrapolation and strong performance.
  • CWBC outperforms naïve strategies such as max-return clipping (capping $g$) and hard filtering (training only on top-quantile trajectories), as these alternatives either impede extrapolation or suffer from high variance and data inefficiency.

6. Practical Guidelines, Hyperparameterization, and Limitations

Recommended hyperparameters for Gym locomotion tasks include: $B=20$, $\lambda=0.01$, $\kappa = r^\star - r_{90}$, $q=95$, noise std $\sigma=1000$, regularization weight $\alpha=1$, batch size $64$, and $100$K training iterations (learning rate $10^{-4}$ for DT, $10^{-3}$ for RvS). At test time, the conditioning RTG should be set to the expert return $r^\star_{\rm expert}$, with no per-task tuning required.
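For reference, these settings can be collected into a single config mapping. This is a convenience sketch; the key names are mine, not from the paper's codebase, and $\kappa$ is data-dependent so it is left as a comment:

```python
# Default CWBC hyperparameters for Gym locomotion, as reported in the text.
CWBC_GYM_DEFAULTS = {
    "num_bins": 20,          # B: equally populated return bins
    "lam": 0.01,             # lambda: variance-control smoothing
    # kappa = r_star - r_90 (data-dependent; compute from the dataset)
    "quantile_q": 95,        # percentile threshold for the conservative term
    "sigma": 1000.0,         # perturbation noise std
    "alpha": 1.0,            # conservative regularization weight
    "batch_size": 64,
    "train_iters": 100_000,
    "lr": {"dt": 1e-4, "rvs": 1e-3},
}
```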

Limitations:

  • For datasets containing few or no high-return trajectories, CWBC may not reach expert-level performance.
  • Aggressive conservative regularization ($\alpha$ too large) can underfit; $\alpha \in [0.1, 1.0]$ is recommended.
  • CWBC does not guarantee an ideal monotonic or linear extrapolation curve beyond expert; surpassing expert performance necessitates richer data or online adaptation.

CWBC thus serves as a low-overhead, model-agnostic wrapper that enables reliable and robust return-conditioning in offline RL, effectively shrinking the train-test gap and achieving strong and monotonic performance in extrapolation regimes (Nguyen et al., 2022).
