ConserWeightive Behavioral Cloning (CWBC)
- ConserWeightive Behavioral Cloning (CWBC) refines return-conditioned behavioral cloning by reweighting training trajectories toward high returns and adding a conservative regularizer that handles out-of-distribution target returns.
- It substantially improves offline RL performance, with gains of roughly 18% on locomotion tasks and 72% on Atari benchmarks, by closing the train–test gap in return conditioning.
- By combining trajectory weighting with a conservative penalty, CWBC addresses the reliability failures of return-conditioned BC and maintains stable, near-expert performance even when conditioning beyond the returns observed in the data.
ConserWeightive Behavioral Cloning (CWBC) is a methodology for improving the reliability of Behavioral Cloning (BC) in offline reinforcement learning (RL), particularly under return-conditioning frameworks such as Decision Transformer (DT) and Reinforcement Learning via Supervised Learning (RvS). It addresses critical failure modes of return-conditioned BC by introducing principled trajectory weighting and a conservative regularization scheme, thereby enhancing both peak performance and robustness to out-of-distribution (OOD) target returns (Nguyen et al., 2022).
1. Motivation and Reliability Failures in Return-Conditioned BC
Recent advances in offline RL have demonstrated that BC methods conditioned on future returns, notably DT and RvS, can perform competitively with value-based approaches while remaining simpler and more stable to train. In practice, it is desirable to set the target return $g$ during evaluation, ideally beyond the highest return observed in the dataset, to elicit expert-level behavior. However, standard return-conditioned BC models exhibit severe reliability failures: as $g$ exceeds $g_{\max}$, the maximum return achieved in the offline data, the policy's actual performance often collapses precipitously, violating the expectation that performance increases monotonically with higher conditioning (Nguyen et al., 2022).
Two root causes underlie this unreliability:
- Data-centric factors: Offline datasets are typically suboptimal and dominated by low-return trajectories, exposing the model to a train–test gap during OOD conditioning.
- Model-centric factors: Certain architectures (e.g., MLPs operating on concatenated inputs) cannot ignore OOD return inputs and extrapolate poorly, while Transformers (DT) can partially ignore these tokens, offering greater robustness.
CWBC is introduced to close this reliability gap, ensuring robust extrapolation to unseen or high target returns during test time.
2. Core Methodological Components of CWBC
CWBC augments any return-conditioned BC framework with two synergistic ingredients: trajectory weighting and a conservative regularization penalty.
2.1 Trajectory Weighting
CWBC re-weights the empirical distribution over training trajectories so that high-return samples are favored, reducing the mismatch between the training-time and expert-level (test-time) return distributions. Define $g_\tau$ as the return of trajectory $\tau$, with $p(g)$ the observed return density and $g_{\max} = \max_\tau g_\tau$ the highest return in the dataset.
The corrected sampling density is
$$\tilde{p}(g) \;\propto\; \frac{p(g)}{\lambda + p(g)}\,\exp\!\left(\frac{g - g_{\max}}{\kappa}\right),$$
where $\lambda$ controls the variance of the reweighting and $\kappa$ its bias/smoothness. In practice, trajectories are binned into $K$ equally-populated return bins $\mathcal{B}_1, \dots, \mathcal{B}_K$, and the probability of sampling bin $\mathcal{B}_k$ is
$$P(\mathcal{B}_k) \;\propto\; \exp\!\left(\frac{\bar{g}_k - g_{\max}}{\kappa}\right),$$
with $\bar{g}_k$ the mean return in bin $\mathcal{B}_k$. Uniform sampling within the selected bin then ensures that high-return (and typically rare) trajectories are systematically upweighted.
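As an illustrative sketch only (not the authors' reference implementation), the equal-count binning and exponentially tilted bin sampling described above might look as follows; the tilt form $\exp((\bar{g}_k - g_{\max})/\kappa)$, the value of `kappa`, the bin count, and the toy return distribution are all assumptions for demonstration.

```python
import numpy as np

def make_bins(returns, num_bins):
    """Split trajectory indices into equally-populated bins by return."""
    order = np.argsort(returns)                # low -> high return
    return np.array_split(order, num_bins)     # ~equal count per bin

def bin_probabilities(returns, bins, kappa):
    """Exponential tilt toward the maximum observed return."""
    g_max = returns.max()
    mean_g = np.array([returns[b].mean() for b in bins])
    logits = (mean_g - g_max) / kappa
    p = np.exp(logits - logits.max())          # subtract max for stability
    return p / p.sum()

def sample_trajectory(returns, bins, probs, rng):
    """Pick a bin by tilted probability, then a trajectory uniformly inside it."""
    k = rng.choice(len(bins), p=probs)
    return rng.choice(bins[k])

rng = np.random.default_rng(0)
returns = rng.exponential(100.0, size=1000)    # toy, low-return-heavy dataset
bins = make_bins(returns, num_bins=20)
probs = bin_probabilities(returns, bins, kappa=50.0)
idx = sample_trajectory(returns, bins, probs, rng)
```

Because the bins are sorted by return, the sampling probabilities increase monotonically toward the highest-return bin, which is exactly the upweighting effect the method relies on.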
2.2 Conservative Regularization
For models sensitive to OOD returns (notably RvS), a conservative penalty is introduced to constrain the policy's output on artificially high return-to-go (RTG) inputs. For trajectories whose return exceeds a high $q$-th return percentile, a positive perturbation is applied to the RTG sequence, ensuring the perturbed initial RTG satisfies $\hat{g}_1 \geq g_{\max}$. At timestep $t$:
$$\hat{g}_t \;=\; g_t + (g_{\max} - g_1) + \epsilon, \qquad \epsilon \sim \left|\mathcal{N}(0, \sigma^2)\right|,$$
and the conservative regularizer for RvS is
$$\mathcal{L}_{\mathrm{conserv}}(\theta) \;=\; \mathbb{E}_{\tau,\,t}\!\left[\,\big\|\pi_\theta(s_t, \hat{g}_t) - a_t\big\|_2^2\,\right].$$
This penalty compels the model to produce in-distribution actions for OOD return queries, promoting stability under extrapolation.
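A minimal numpy sketch of this mechanism is shown below, assuming the shift-plus-positive-noise form of the perturbation (the same offset added at every timestep so RTG differences are preserved); the linear `policy`, `g_max`, and `sigma` here are toy stand-ins, not the paper's MLP or its settings.

```python
import numpy as np

def perturb_rtgs(rtgs, g_max, sigma, rng):
    """Shift a trajectory's RTG sequence so its initial RTG lands above g_max.

    rtgs: per-timestep returns-to-go; rtgs[0] is the trajectory return g_1.
    The same offset is added at every timestep, preserving RTG differences.
    """
    eps = abs(rng.normal(0.0, sigma))
    offset = (g_max - rtgs[0]) + eps           # guarantees rtgs_hat[0] >= g_max
    return rtgs + offset

def conservative_penalty(policy, states, rtgs_hat, actions):
    """Squared error between the policy's actions on perturbed (OOD) RTGs
    and the dataset actions: the policy should stay in-distribution."""
    pred = policy(states, rtgs_hat)
    return np.mean((pred - actions) ** 2)

# Toy stand-in policy: linear in [state, rtg] (assumption, not the paper's MLP).
rng = np.random.default_rng(0)
W = rng.normal(size=(3,))                      # 2 state dims + 1 RTG dim
policy = lambda s, g: np.concatenate([s, g[:, None]], axis=1) @ W

states = rng.normal(size=(10, 2))
rewards = rng.uniform(0.0, 1.0, size=10)
rtgs = np.cumsum(rewards[::-1])[::-1]          # per-timestep returns-to-go
rtgs_hat = perturb_rtgs(rtgs, g_max=20.0, sigma=1.0, rng=rng)
actions = rng.normal(size=(10,))
penalty = conservative_penalty(policy, states, rtgs_hat, actions)
```

Note that only the conditioning input is perturbed; the regression target remains the dataset action, which is what makes the penalty "conservative".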
3. Combined Objective and Training Protocol
The overall CWBC objective for a policy $\pi_\theta$ is
$$\mathcal{L}(\theta) \;=\; \mathcal{L}_{\mathrm{BC}}(\theta) + \alpha\,\mathcal{L}_{\mathrm{conserv}}(\theta),$$
where
$$\mathcal{L}_{\mathrm{BC}}(\theta) \;=\; \mathbb{E}_{\tau \sim \tilde{p},\,t}\!\left[\,\big\|\pi_\theta(s_t, g_t) - a_t\big\|_2^2\,\right]$$
is the weighted BC loss and $\mathcal{L}_{\mathrm{conserv}}(\theta)$ is the conservative regularizer (zero for DT, nonzero for RvS). The tradeoff parameter $\alpha$ balances fidelity to expert-like data against conservatism for OOD inputs.
Implementation parameters in the original study are as follows:
- RvS: two-layer MLP (hidden dim $1024$, ReLU), taking the concatenated state and RTG $(s_t, g_t)$ as input.
- DT: three-layer Transformer (hidden dim $128$, one attention head, context length $20$, dropout $0.1$).
4. Algorithmic Realization
The training loop for RvS+CWBC consists of the following steps:
- Initialize policy parameters $\theta$.
- For each iteration:
- Sample a batch of trajectories via weighted bin sampling.
- For each trajectory:
- Compute per-timestep RTGs.
- If the trajectory's return exceeds the $q$-th return percentile, perturb the RTG as described and compute $\hat{g}_t$.
- Compute the empirical BC loss $\mathcal{L}_{\mathrm{BC}}$ and conservative penalty $\mathcal{L}_{\mathrm{conserv}}$.
- Update $\theta$ by gradient descent on $\mathcal{L}_{\mathrm{BC}} + \alpha\,\mathcal{L}_{\mathrm{conserv}}$.
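Putting the pieces together, the training loop above can be sketched end-to-end as follows. This is a toy illustration under stated assumptions: a linear return-conditioned policy with hand-derived gradients stands in for the RvS MLP, and all hyperparameter values (`kappa`, `sigma`, `alpha`, `q`, learning rate, bin count) are placeholders rather than the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy offline dataset: N trajectories of length T, 2-D states, scalar actions.
N, T = 200, 10
states  = rng.normal(size=(N, T, 2))
actions = rng.normal(size=(N, T))
rewards = rng.uniform(0.0, 1.0, size=(N, T))
rtgs    = np.cumsum(rewards[:, ::-1], axis=1)[:, ::-1]   # per-timestep RTGs
g       = rtgs[:, 0]                                     # trajectory returns
g_max   = g.max()

# Trajectory weighting: equal-count return bins, exponential tilt.
kappa, num_bins = 5.0, 10
bins = np.array_split(np.argsort(g), num_bins)
logits = np.array([(g[b].mean() - g_max) / kappa for b in bins])
bin_p = np.exp(logits - logits.max()); bin_p /= bin_p.sum()

# Linear return-conditioned policy: a = W @ [s, rtg] (illustrative stand-in).
W = np.zeros(3)
alpha, sigma, q, lr = 0.5, 1.0, 0.95, 5e-3
g_q = np.quantile(g, q)

def features(s, rtg):
    return np.concatenate([s, rtg[:, None]], axis=1)     # (T, 3)

for it in range(500):
    k = rng.choice(num_bins, p=bin_p)                    # sample a bin, then
    i = rng.choice(bins[k])                              # a trajectory in it
    X = features(states[i], rtgs[i])
    grad = 2 * X.T @ (X @ W - actions[i]) / T            # BC loss gradient
    if g[i] >= g_q:                                      # conservative term
        off = (g_max - g[i]) + abs(rng.normal(0.0, sigma))
        Xh = features(states[i], rtgs[i] + off)          # perturbed OOD RTGs
        grad += alpha * 2 * Xh.T @ (Xh @ W - actions[i]) / T
    W -= lr * grad

bc_loss = np.mean((features(states[0], rtgs[0]) @ W - actions[0]) ** 2)
```

The conservative branch fires only for top-quantile trajectories, matching the described protocol; for DT the branch would simply be dropped ($\mathcal{L}_{\mathrm{conserv}} = 0$).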
Pseudocode, hyperparameters, and architectural details for DT and RvS are specified explicitly (Nguyen et al., 2022).
5. Empirical Results and Ablations
CWBC was evaluated on D4RL locomotion (hopper, walker2d, and halfcheetah with medium, med-replay, and med-expert datasets), Atari (Breakout, Qbert, Pong, and Seaquest on 500K DQN-replay transitions), and the AntMaze suite (umaze, medium, and large layouts in their v0, diverse, and play variants).
Key findings include:
- RvS+CWBC exhibits an ≈18% gain over vanilla RvS on locomotion (8/9 tasks) and ≈72% on Atari, matching or exceeding state-of-the-art value-based baselines.
- DT+CWBC improves DT by ≈8% overall, notably on low-quality datasets (med-replay).
- Ablation studies demonstrate that trajectory weighting alone (RvS+W) elevates returns near $g_{\max}$ but fails for $g > g_{\max}$; conservative regularization alone (RvS+C) stabilizes the model under OOD returns but does not fully exploit rare high-return data. The combination (RvS+W+C) yields both stable extrapolation and strong performance.
- CWBC outperforms naïve strategies such as max-return clipping (capping the conditioning return at $g_{\max}$) and hard filtering (training only on top-quantile trajectories), as these alternatives either impede extrapolation or suffer from high variance and data inefficiency.
6. Practical Guidelines, Hyperparameterization, and Limitations
Recommended practice for Gym locomotion tasks fixes the weighting parameters $\lambda$ and $\kappa$, the number of bins $K$, the return percentile $q$, the RTG noise standard deviation $\sigma$, and the regularization weight $\alpha$ to the values reported in the original study, with batch size $64$, $100$K training iterations, and per-model learning rates as reported there. At test time, the conditioning RTG should be set to the expert return $g_{\mathrm{expert}}$, with no per-task tuning required.
Limitations:
- For datasets containing few or no high-return trajectories, CWBC may not reach expert-level performance.
- Aggressive conservative regularization ($\alpha$ too large) can cause underfitting; a moderate $\alpha$ is recommended.
- CWBC does not guarantee an ideal monotonic or linear extrapolation curve beyond the expert return; surpassing expert performance necessitates richer data or online adaptation.
CWBC thus serves as a low-overhead, model-agnostic wrapper that enables reliable and robust return-conditioning in offline RL, effectively shrinking the train–test gap and maintaining strong, stable performance in extrapolation regimes (Nguyen et al., 2022).