Return-Conditioned Behavior Cloning
- Return-Conditioned Behavior Cloning is an offline RL method that reframes policy learning as supervised imitation conditioned on return-to-go targets.
- It simplifies learning by directly imitating demonstrated actions, avoiding the need for value function estimation and enabling stable optimization.
- Enhancements like ConserWeightive Behavioral Cloning upweight high-return trajectories and apply conservative regularization to address out-of-distribution return challenges.
Return-Conditioned Behavioral Cloning (RCBC) is an offline reinforcement learning (RL) paradigm that recasts policy learning as supervised learning over trajectories, with the key innovation of conditioning the learned policy on a user-specified measure of future return. Rather than inferring value functions or optimizing expected returns through dynamic programming, RCBC directly trains a policy to imitate demonstrated behaviors with additional context in the form of return-to-go (RTG) targets, thereby enabling offline RL with simplified objectives and stable optimization. Extensions such as ConserWeightive Behavioral Cloning (CWBC) address the limitations of naïve return conditioning, particularly in the presence of out-of-distribution (OOD) return requests that require extrapolation beyond the coverage of offline data (Nguyen et al., 2022).
1. Formalism and Training Objective
Given an offline dataset of trajectories $\mathcal{D} = \{\tau_i\}_{i=1}^N$, where each trajectory $\tau = (s_1, a_1, r_1, \ldots, s_T, a_T, r_T)$, the return-to-go at time $t$ is defined as
$\omega_t = \sum_{t'=t}^{T} r_{t'}.$
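As a minimal sketch, the return-to-go sequence is a reversed cumulative sum over a trajectory's rewards (the helper name is illustrative):

```python
import numpy as np

def returns_to_go(rewards):
    """Return-to-go at each step: omega_t = sum of rewards from t to T.

    Computed as a cumulative sum over the reversed reward sequence,
    then reversed back into time order.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    return np.cumsum(rewards[::-1])[::-1]

# Example: rewards (1, 2, 3) give RTGs (6, 5, 3).
```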
The return-conditioned policy has the form
$\pi_\theta(a_t \mid s_t, \omega_t),$
where $\omega_t$ is a user-specified target return, reflecting the intended return level at test time.
Training is conducted via supervised learning by treating each tuple $(s_t, \omega_t, a_t)$ as a labeled training example. The typical loss function is the negative log-likelihood of demonstrated actions under the learned policy:
$\mathcal{L}(\theta) = -\,\mathbb{E}_{\tau\sim\mathcal{D}}\left[\sum_{t=1}^T \log \pi_\theta(a_t \mid s_t, \omega_t)\right].$
Alternatively, if $\pi_\theta$ is Gaussian and actions are real-valued, a mean squared error can be used:
$\mathcal{L}(\theta) = \mathbb{E}_{\tau\sim\mathcal{D}}\left[\sum_{t=1}^T \|\pi_\theta(s_t, \omega_t) - a_t\|^2\right].$
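The MSE variant of the objective can be sketched as follows, assuming a deterministic `policy(states, rtgs)` callable that maps states and RTG targets to predicted actions (names are illustrative, not from the paper):

```python
import numpy as np

def rcbc_mse_loss(policy, states, rtgs, actions):
    """Mean squared error between predicted and demonstrated actions.

    `policy(states, rtgs)` is assumed to return predicted actions with
    the same shape as `actions`; the squared error is summed over action
    dimensions and averaged over the batch.
    """
    pred = policy(states, rtgs)
    return np.mean(np.sum((pred - actions) ** 2, axis=-1))
```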
2. Challenges of OOD Return Conditioning
A key deficiency in standard RCBC arises when the conditioning return lies outside the empirical support of the offline dataset. In practice, dataset returns are bounded above by some $r_{\max}$; if the policy is queried at test time with a target return $\omega_1$ exceeding this bound, it must extrapolate. This train–test distribution shift in $(s, \omega)$ pairs often leads to abrupt performance collapse, with realized returns substantially below the requested value. This vulnerability stems from the lack of high-return supervision in training, as well as architectural limitations that hinder generalization to unseen return contexts (Nguyen et al., 2022).
3. ConserWeightive Behavioral Cloning (CWBC): Methodology
CWBC augments RCBC with two principal mechanisms: trajectory weighting and conservative regularization.
a. Trajectory Weighting
Each trajectory $\tau$ with total return $r_\tau = \sum_{t=1}^T r_t$ is assigned a nonnegative weight that increases exponentially with return:
$w(\tau) \propto \exp\!\left(\frac{r_\tau}{\kappa}\right),$
with temperature parameter $\kappa > 0$. This amplifies the influence of high-return trajectories, mitigating the bias toward suboptimal returns endemic to the original offline data distribution.
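A small sketch of this weighting, assuming the exponential form $w(\tau) \propto \exp(r_\tau/\kappa)$ described above (returns are shifted by their maximum before exponentiating, a standard numerical-stability trick):

```python
import numpy as np

def trajectory_weights(total_returns, kappa):
    """Exponential, return-increasing weights w(tau) ∝ exp(r_tau / kappa).

    Shifting by the maximum return leaves the normalized weights
    unchanged but avoids overflow; weights are normalized to sum to one.
    """
    r = np.asarray(total_returns, dtype=np.float64)
    w = np.exp((r - r.max()) / kappa)
    return w / w.sum()
```

Smaller $\kappa$ concentrates mass on the highest-return trajectories; larger $\kappa$ approaches uniform sampling.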
b. Conservative Regularization
For trajectories with returns $r_\tau$ exceeding a high-percentile threshold $r_q$ (e.g., the 95th percentile), pseudo-OOD contexts are constructed by adding nonnegative noise $\varepsilon \sim \mathcal{E}$ so that $\omega_t' > \omega_t$. Define the perturbed RTG as
$\omega_t' = \omega_t + \varepsilon.$
The conservative penalty enforced is
$R(\theta) = \mathbb{E}_{\substack{\tau\sim\mathcal{D},\; r_\tau > r_q \\ \varepsilon \sim \mathcal{E}}}\left[\sum_{t=1}^T \|\pi_\theta(s_t,\omega_t')-a_t\|^2\right].$
This regularizer constrains the policy’s behavior under OOD returns to remain close to trajectories seen in high-quality data.
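The penalty can be sketched as follows. The half-Gaussian noise ($|\varepsilon|$ with $\varepsilon \sim \mathcal{N}(0, \sigma^2)$) is an assumption about the noise distribution $\mathcal{E}$, chosen here so that perturbed RTGs are always inflated; the `policy` callable is the same illustrative interface used above:

```python
import numpy as np

def conservative_penalty(policy, states, rtgs, actions, sigma, rng):
    """Conservative regularizer for one high-return trajectory.

    Perturb RTGs upward with nonnegative half-Gaussian noise so that
    omega'_t > omega_t, then penalize the deviation of the policy's
    actions under these inflated returns from the demonstrated actions.
    """
    eps = np.abs(rng.normal(0.0, sigma, size=np.shape(rtgs)))
    ood_rtgs = rtgs + eps  # pseudo-OOD return contexts
    pred = policy(states, ood_rtgs)
    return np.mean(np.sum((pred - actions) ** 2, axis=-1))
```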
c. Combined Objective
The final objective function is:
$\mathcal{L}_{\mathrm{CWBC}}(\theta) = \mathcal{L}(\theta) + \alpha\, R(\theta),$
where $\alpha$ governs the tradeoff between maximum-likelihood imitation and conservative OOD regularization.
4. Implementation Details
The CWBC algorithm involves the following procedural steps (Nguyen et al., 2022):
- Precompute the total returns $r_\tau$ for all $\tau \in \mathcal{D}$.
- Partition trajectories into $K$ return-quantile bins $B_1, \ldots, B_K$.
- At each iteration:
  - Sample a mini-batch: first select a bin $B_k$ with probability proportional to its weight $w_k$, then sample uniformly from trajectories in $B_k$.
  - For each sampled trajectory, compute the standard BC loss. For those with $r_\tau > r_q$, inject noise into $\omega_t$ and evaluate the conservative penalty.
  - Update $\theta$ via gradient descent on the combined (weighted + regularized) loss.
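The binned sampling step above can be sketched as follows, assuming exponential bin weights over bin-mean returns (function and variable names are illustrative, not from the paper's code):

```python
import numpy as np

def sample_minibatch(trajs, returns, n_bins, kappa, batch_size, rng):
    """Sample a mini-batch via return-quantile bins.

    Trajectory indices are sorted by return and split into n_bins
    quantile bins; each bin is weighted exponentially by its mean
    return. A bin is drawn by weight, then a trajectory uniformly
    within it.
    """
    returns = np.asarray(returns, dtype=np.float64)
    order = np.argsort(returns)
    bins = np.array_split(order, n_bins)          # return-quantile bins
    mean_r = np.array([returns[b].mean() for b in bins])
    w = np.exp((mean_r - mean_r.max()) / kappa)   # exponential bin weights
    w /= w.sum()
    batch = []
    for _ in range(batch_size):
        b = bins[rng.choice(len(bins), p=w)]      # pick a bin by weight
        batch.append(trajs[rng.choice(b)])        # uniform within bin
    return batch
```

With a very small $\kappa$, nearly all samples come from the top-return bin; with a large $\kappa$, sampling approaches uniform over bins.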
Recommended hyperparameters include:
| Hyperparameter | Value | Context |
|---|---|---|
| Trajectory-weight bins ($K$) | $20$ | Robust across tasks |
| Smoothing ($\lambda$) | $0.01$ | Training stability |
| Conservative percentile ($q$) | $95$ | OOD regularization |
| Noise std ($\sigma$) | $1000$ | Return perturbation |
| Regularization weight ($\alpha$) | $1.0$ | Final loss function |
| Test-time conditioning ($\omega_1$) | Expert return (no per-task tuning) | Evaluation |
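For reference, the recommended defaults from the table can be collected into a single configuration fragment (key names are illustrative, not from the paper's code):

```python
# Recommended CWBC defaults, mirroring the hyperparameter table above.
CWBC_CONFIG = {
    "n_bins": 20,         # K: trajectory-weight bins
    "smoothing": 0.01,    # lambda: smoothing for training stability
    "percentile": 95,     # q: returns above this threshold are regularized
    "noise_std": 1000.0,  # sigma: RTG perturbation scale
    "alpha": 1.0,         # regularization weight in the final loss
}
```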
5. Theoretical Foundations
Appendix C of Nguyen et al. (2022) provides a bias–variance bound on the gradient discrepancy between the reweighted objective (as used in trajectory weighting) and an ideal expert objective. The bound decomposes into a bias term, measuring the mismatch between the reweighted return distribution $\tilde{p}$ and the expert return distribution $p^*$, and a variance term that shrinks with $n_r$, the number of trajectories with return $r$. Exponential weighting is derived as the minimizer of this upper bound, balancing bias due to underrepresentation of high returns against variance introduced by aggressive weighting. Conservative regularization acts as an additional control on extrapolation error for OOD return contexts.
6. Empirical Evaluation
CWBC was evaluated on D4RL locomotion benchmarks (hopper, walker2d, halfcheetah) using “medium,” “med-replay,” and “med-expert” datasets, as well as Atari replay data. Primary metrics included normalized return (relative to expert) and success rates on AntMaze.
Key results (average over 10 seeds):
- RvS + CWBC improved over the RvS baseline by approximately 18 normalized points.
- DT + CWBC improved over the Decision Transformer baseline by approximately 5.2 normalized points.
- CWBC maintained strong performance even when conditioned on out-of-distribution high returns, often matching or exceeding CQL and IQL.
- On low-quality “med-replay” datasets, standard RvS crashed on OOD conditioning, whereas RvS+CWBC exhibited consistent reliability.
7. Practical Considerations and Limitations
Empirical evidence supports that CWBC is a robust augmentation to any conditional BC framework, requiring minimal tuning and exhibiting strong generalization for in-distribution and modestly out-of-distribution return conditioning (Nguyen et al., 2022). Robust default hyperparameters further facilitate practical application. However, perfect linear extrapolation to arbitrarily high OOD returns (i.e., guaranteeing performance proportional to any user-specified $\omega_1$) remains elusive, and generalization beyond the support of offline data is an unresolved research challenge.
In summary, Return-Conditioned Behavioral Cloning reframes offline RL as supervised learning conditioned on a desired return. ConserWeightive Behavioral Cloning further enhances reliability by (1) upweighting high-return trajectories and (2) imposing a conservative penalty for OOD contexts, collectively yielding a robust and practical recipe for offline RL with minimal complexity (Nguyen et al., 2022).