Return-Conditioned Behavior Cloning
- Return-Conditioned Behavior Cloning is an offline RL method that reframes policy learning as supervised imitation conditioned on return-to-go targets.
- It simplifies learning by directly imitating demonstrated actions, avoiding the need for value function estimation and enabling stable optimization.
- Enhancements like ConserWeightive Behavioral Cloning upweight high-return trajectories and apply conservative regularization to address out-of-distribution return challenges.
Return-Conditioned Behavioral Cloning (RCBC) is an offline reinforcement learning (RL) paradigm that recasts policy learning as supervised learning over trajectories, with the key innovation of conditioning the learned policy on a user-specified measure of future return. Rather than inferring value functions or optimizing expected returns through dynamic programming, RCBC directly trains a policy to imitate demonstrated behaviors with additional context in the form of return-to-go (RTG) targets, thereby enabling offline RL with simplified objectives and stable optimization. Extensions such as ConserWeightive Behavioral Cloning (CWBC) address the limitations of naïve return conditioning, particularly in the presence of out-of-distribution (OOD) return requests that require extrapolation beyond the coverage of offline data (Nguyen et al., 2022).
1. Formalism and Training Objective
Given an offline dataset of trajectories $\mathcal{D} = \{\tau_i\}_{i=1}^N$, where each trajectory $\tau = (s_1, a_1, r_1, \ldots, s_T, a_T, r_T)$, the return-to-go at time $t$ is defined as
$\omega_t = \sum_{t'=t}^{T} r_{t'}.$
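As a minimal sketch, the return-to-go sequence is a reversed cumulative sum over a trajectory's rewards (the helper name is illustrative):

```python
import numpy as np

def returns_to_go(rewards):
    """Return-to-go at each step: omega_t = sum of rewards from t to T.

    Computed as a cumulative sum over the reversed reward sequence,
    then reversed back into time order.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    return np.cumsum(rewards[::-1])[::-1]

# Example: rewards (1, 2, 3) give RTGs (6, 5, 3).
```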
The return-conditioned policy has the form
$\pi_\theta(a_t \mid s_t, \omega_t),$
where $\omega_t$ is a user-specified target return, reflecting the intended return level at test time.
Training is conducted via supervised learning by treating each tuple $(s_t, \omega_t, a_t)$ as a labeled training example. The typical loss function is the negative log-likelihood of demonstrated actions under the learned policy:
$\mathcal{L}(\theta) = -\,\mathbb{E}_{\tau\sim\mathcal{D}}\left[\sum_{t=1}^T \log \pi_\theta(a_t \mid s_t, \omega_t)\right].$
Alternatively, if $\pi_\theta$ is Gaussian and actions are real-valued, a mean squared error can be used:
$\mathcal{L}(\theta) = \mathbb{E}_{\tau\sim\mathcal{D}}\left[\sum_{t=1}^T \|\pi_\theta(s_t, \omega_t) - a_t\|^2\right].$
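The MSE variant of the objective can be sketched as follows, assuming a deterministic `policy(states, rtgs)` callable that maps states and RTG targets to predicted actions (names are illustrative, not from the paper):

```python
import numpy as np

def rcbc_mse_loss(policy, states, rtgs, actions):
    """Mean squared error between predicted and demonstrated actions.

    `policy(states, rtgs)` is assumed to return predicted actions with
    the same shape as `actions`; the squared error is summed over action
    dimensions and averaged over the batch.
    """
    pred = policy(states, rtgs)
    return np.mean(np.sum((pred - actions) ** 2, axis=-1))
```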
2. Challenges of OOD Return Conditioning
A key deficiency in standard RCBC arises when the conditioning return lies outside the empirical support of the offline dataset. In practice, dataset returns are bounded above by some $r_{\max}$; if the policy is queried at test time with a target return $\omega_1$ exceeding this bound, it must extrapolate. This train–test distribution shift in $(s, \omega)$ pairs often leads to abrupt performance collapse, with realized returns substantially below the requested value. This vulnerability stems from the lack of high-return supervision in training, as well as architectural limitations that hinder generalization to unseen return contexts (Nguyen et al., 2022).
3. ConserWeightive Behavioral Cloning (CWBC): Methodology
CWBC augments RCBC with two principal mechanisms: trajectory weighting and conservative regularization.
a. Trajectory Weighting
Each trajectory $\tau$ with total return $r_\tau = \sum_{t=1}^T r_t$ is assigned a nonnegative weight that increases exponentially with return:
$w(\tau) \propto \exp\!\left(\frac{r_\tau}{\kappa}\right),$
with temperature parameter $\kappa > 0$. This amplifies the influence of high-return trajectories, mitigating the bias toward suboptimal returns endemic to the original offline data distribution.
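A small sketch of this weighting, assuming the exponential form $w(\tau) \propto \exp(r_\tau/\kappa)$ described above (returns are shifted by their maximum before exponentiating, a standard numerical-stability trick):

```python
import numpy as np

def trajectory_weights(total_returns, kappa):
    """Exponential, return-increasing weights w(tau) ∝ exp(r_tau / kappa).

    Shifting by the maximum return leaves the normalized weights
    unchanged but avoids overflow; weights are normalized to sum to one.
    """
    r = np.asarray(total_returns, dtype=np.float64)
    w = np.exp((r - r.max()) / kappa)
    return w / w.sum()
```

Smaller $\kappa$ concentrates mass on the highest-return trajectories; larger $\kappa$ approaches uniform sampling.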
b. Conservative Regularization
For trajectories with returns $r_\tau$ exceeding a high-percentile threshold $r_q$ (e.g., the 95th percentile), pseudo-OOD contexts are constructed by adding nonnegative noise $\varepsilon \sim \mathcal{E}$ so that $\omega_t' > \omega_t$. Define the perturbed RTG as
$\omega_t' = \omega_t + \varepsilon.$
The conservative penalty enforced is
$R(\theta) = \mathbb{E}_{\substack{\tau\sim\mathcal{D},\; r_\tau > r_q \\ \varepsilon \sim \mathcal{E}}}\left[\sum_{t=1}^T \|\pi_\theta(s_t,\omega_t')-a_t\|^2\right].$
This regularizer constrains the policy’s behavior under OOD returns to remain close to trajectories seen in high-quality data.
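The penalty can be sketched as follows. The half-Gaussian noise ($|\varepsilon|$ with $\varepsilon \sim \mathcal{N}(0, \sigma^2)$) is an assumption about the noise distribution $\mathcal{E}$, chosen here so that perturbed RTGs are always inflated; the `policy` callable is the same illustrative interface used above:

```python
import numpy as np

def conservative_penalty(policy, states, rtgs, actions, sigma, rng):
    """Conservative regularizer for one high-return trajectory.

    Perturb RTGs upward with nonnegative half-Gaussian noise so that
    omega'_t > omega_t, then penalize the deviation of the policy's
    actions under these inflated returns from the demonstrated actions.
    """
    eps = np.abs(rng.normal(0.0, sigma, size=np.shape(rtgs)))
    ood_rtgs = rtgs + eps  # pseudo-OOD return contexts
    pred = policy(states, ood_rtgs)
    return np.mean(np.sum((pred - actions) ** 2, axis=-1))
```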
c. Combined Objective
The final objective function is:
$\mathcal{L}_{\mathrm{CWBC}}(\theta) = \mathcal{L}(\theta) + \alpha\, R(\theta),$
where $\alpha$ governs the tradeoff between maximum-likelihood imitation and conservative OOD regularization.
4. Implementation Details
The CWBC algorithm involves the following procedural steps (Nguyen et al., 2022):
- Precompute the total returns $r_\tau$ for all $\tau \in \mathcal{D}$.
- Partition trajectories into $K$ return-quantile bins $B_1, \ldots, B_K$.
- At each iteration:
  - Sample a mini-batch: first select a bin $B_k$ with probability proportional to its weight $w_k$, then sample uniformly from trajectories in $B_k$.
  - For each sampled trajectory, compute the standard BC loss. For those with $r_\tau > r_q$, inject noise into $\omega_t$ and evaluate the conservative penalty.
  - Update $\theta$ via gradient descent on the combined (weighted + regularized) loss.
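The binned sampling step above can be sketched as follows, assuming exponential bin weights over bin-mean returns (function and variable names are illustrative, not from the paper's code):

```python
import numpy as np

def sample_minibatch(trajs, returns, n_bins, kappa, batch_size, rng):
    """Sample a mini-batch via return-quantile bins.

    Trajectory indices are sorted by return and split into n_bins
    quantile bins; each bin is weighted exponentially by its mean
    return. A bin is drawn by weight, then a trajectory uniformly
    within it.
    """
    returns = np.asarray(returns, dtype=np.float64)
    order = np.argsort(returns)
    bins = np.array_split(order, n_bins)          # return-quantile bins
    mean_r = np.array([returns[b].mean() for b in bins])
    w = np.exp((mean_r - mean_r.max()) / kappa)   # exponential bin weights
    w /= w.sum()
    batch = []
    for _ in range(batch_size):
        b = bins[rng.choice(len(bins), p=w)]      # pick a bin by weight
        batch.append(trajs[rng.choice(b)])        # uniform within bin
    return batch
```

With a very small $\kappa$, nearly all samples come from the top-return bin; with a large $\kappa$, sampling approaches uniform over bins.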
Recommended hyperparameters include:
| Hyperparameter | Value | Context |
|---|---|---|
| Trajectory-weight bins ($K$) | $20$ | Robust across tasks |
| Smoothing ($\lambda$) | $0.01$ | Training stability |
| Conservative percentile ($q$) | $95$ | OOD regularization |
| Noise std ($\sigma$) | $1000$ | Return perturbation |
| Regularization weight ($\alpha$) | $1.0$ | Final loss function |
| Test-time conditioning ($\omega_1$) | Expert return (no per-task tuning) | Evaluation |
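For reference, the recommended defaults from the table can be collected into a single configuration fragment (key names are illustrative, not from the paper's code):

```python
# Recommended CWBC defaults, mirroring the hyperparameter table above.
CWBC_CONFIG = {
    "n_bins": 20,         # K: trajectory-weight bins
    "smoothing": 0.01,    # lambda: smoothing for training stability
    "percentile": 95,     # q: returns above this threshold are regularized
    "noise_std": 1000.0,  # sigma: RTG perturbation scale
    "alpha": 1.0,         # regularization weight in the final loss
}
```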
5. Theoretical Foundations
Appendix C of Nguyen et al. (2022) provides a bias–variance bound on the gradient discrepancy between the reweighted objective (as used in trajectory weighting) and an ideal expert objective. The bound decomposes into a bias term, measuring the mismatch between the reweighted return distribution $\tilde{p}$ and the expert return distribution $p^*$, and a variance term that shrinks with $n_r$, the number of trajectories with return $r$. Exponential weighting is derived as the minimizer of this upper bound, balancing bias due to underrepresentation of high returns against variance introduced by aggressive weighting. Conservative regularization acts as an additional control on extrapolation error for OOD return contexts.
6. Empirical Evaluation
CWBC was evaluated on D4RL locomotion benchmarks (hopper, walker2d, halfcheetah) using “medium,” “med-replay,” and “med-expert” datasets, as well as Atari replay data. Primary metrics included normalized return (relative to expert) and success rates on AntMaze.
Key results (average over 10 seeds):
- RvS + CWBC improved over the RvS baseline by approximately 18 normalized points.
- DT + CWBC improved over the Decision Transformer baseline by approximately 5.2 normalized points.
- CWBC maintained strong performance even when conditioned on out-of-distribution high returns, often matching or exceeding CQL and IQL.
- On low-quality “med-replay” datasets, standard RvS crashed on OOD conditioning, whereas RvS+CWBC exhibited consistent reliability.
7. Practical Considerations and Limitations
Empirical evidence supports that CWBC is a robust augmentation to any conditional BC framework, requiring minimal tuning and exhibiting strong generalization for in-distribution and modestly out-of-distribution return conditioning (Nguyen et al., 2022). Robust default hyperparameters further facilitate practical application. However, perfect linear extrapolation to arbitrarily high OOD returns (i.e., guaranteeing performance proportional to any user-specified $\omega_1$) remains elusive, and generalization beyond the support of offline data is an unresolved research challenge.
In summary, Return-Conditioned Behavioral Cloning reframes offline RL as supervised learning conditioned on a desired return. ConserWeightive Behavioral Cloning further enhances reliability by (1) upweighting high-return trajectories and (2) imposing a conservative penalty for OOD contexts, collectively yielding a robust and practical recipe for offline RL with minimal complexity (Nguyen et al., 2022).