
Return-Conditioned Behavior Cloning

Updated 7 February 2026
  • Return-Conditioned Behavior Cloning is an offline RL method that reframes policy learning as supervised imitation conditioned on return-to-go targets.
  • It simplifies learning by directly imitating demonstrated actions, avoiding the need for value function estimation and enabling stable optimization.
  • Enhancements like ConserWeightive Behavioral Cloning upweight high-return trajectories and apply conservative regularization to address out-of-distribution return challenges.

Return-Conditioned Behavioral Cloning (RCBC) is an offline reinforcement learning (RL) paradigm that recasts policy learning as supervised learning over trajectories, with the key innovation of conditioning the learned policy on a user-specified measure of future return. Rather than inferring value functions or optimizing expected returns through dynamic programming, RCBC directly trains a policy to imitate demonstrated behaviors with additional context in the form of return-to-go (RTG) targets, thereby enabling offline RL with simplified objectives and stable optimization. Extensions such as ConserWeightive Behavioral Cloning (CWBC) address the limitations of naïve return conditioning, particularly in the presence of out-of-distribution (OOD) return requests that require extrapolation beyond the coverage of offline data (Nguyen et al., 2022).

1. Formalism and Training Objective

Given an offline dataset of trajectories $\mathcal{D} = \{\tau\}$, with each trajectory $\tau = (s_1, a_1, r_1, \dots, s_T, a_T, r_T)$, the return-to-go at time $t$ is defined as

$$g_t = \sum_{t'=t}^{T} r_{t'}.$$
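Computationally, $g_t$ is just the suffix sum of the reward sequence; a minimal sketch (the use of NumPy is an illustrative assumption, not part of the method):

```python
import numpy as np

def returns_to_go(rewards):
    """g_t = sum_{t'=t}^{T} r_{t'}: the suffix sum of the reward sequence."""
    # Reverse, take the cumulative sum, then reverse back.
    return np.cumsum(np.asarray(rewards, dtype=float)[::-1])[::-1]

print(returns_to_go([1.0, 0.0, 2.0, 1.0]))  # [4. 3. 3. 1.]
```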

The return-conditioned policy has the form

$$\pi_\theta(a \mid s_t, g_t) \quad\text{or more generally}\quad \pi_\theta(a \mid s_t, G),$$

where $G$ is a user-specified target return, reflecting the intended return level at test time.

Training is conducted via supervised learning by treating each $(s_t, g_t) \rightarrow a_t$ pair as a labeled training example. The typical loss function is the negative log-likelihood of demonstrated actions under the learned policy:

$$\mathcal{L}_{\rm BC}(\theta) = -\,\mathbb{E}_{\tau\sim\mathcal{D}}\left[\sum_{t=1}^{T} \log \pi_\theta(a_t \mid s_t, g_t)\right].$$

Alternatively, if $\pi_\theta$ is Gaussian with fixed variance and actions are real-valued, the negative log-likelihood reduces (up to constants) to a mean squared error:

$$\mathcal{L}_{\rm BC}(\theta) = \mathbb{E}_{\tau\sim\mathcal{D}}\left[\frac{1}{T}\sum_{t=1}^{T} \|a_t - \pi_\theta(s_t, g_t)\|^2\right].$$
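As a concrete instance of the MSE objective, the sketch below fits a linear deterministic policy $\pi_\theta(s_t, g_t)$ by gradient descent on a toy batch; the dimensions, learning rate, and linear parameterization are illustrative assumptions, not choices from the source.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy offline batch: 64 transitions with 4-d states, scalar RTGs, 2-d actions.
S = rng.normal(size=(64, 4))
G = rng.normal(size=(64, 1))
A = rng.normal(size=(64, 2))
X = np.hstack([S, G])            # policy input: (s_t, g_t)
W = np.zeros((5, 2))             # linear policy pi(s, g) = [s, g] @ W

def bc_loss(W):
    # Mean squared error between predicted and demonstrated actions.
    return float(np.mean(np.sum((X @ W - A) ** 2, axis=1)))

init = bc_loss(W)
for _ in range(200):             # plain gradient descent on L_BC
    W -= 0.05 * (2.0 / len(X)) * (X.T @ (X @ W - A))
print(bc_loss(W) < init)         # True: the imitation loss decreased
```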

2. Challenges of OOD Return Conditioning

A key deficiency in standard RCBC arises under high-value return conditioning that lies outside the empirical support of the offline dataset. In practice, dataset returns are bounded above by $\max_\tau r_\tau$; if the policy is queried at test time with $G$ exceeding this bound, it must extrapolate. This train–test distribution shift in $(s, G)$ pairs often leads to abrupt performance collapse, with realized returns substantially below the requested value. This vulnerability stems from the lack of high-return supervision in training, as well as architectural limitations that hinder generalization to unseen return contexts (Nguyen et al., 2022).

3. ConserWeightive Behavioral Cloning (CWBC): Methodology

CWBC augments RCBC with two principal mechanisms: trajectory weighting and conservative regularization.

a. Trajectory Weighting

Each trajectory $\tau$ is assigned a nonnegative weight that increases with total return:

$$w(\tau) \propto \exp(\beta \cdot r_\tau), \qquad r_\tau = \sum_{t=1}^{T} r_t,$$

with temperature parameter $\beta > 0$. This amplifies the influence of high-return trajectories, mitigating the bias toward suboptimal returns endemic to the original offline data distribution.
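A numerically stable sketch of the exponential weighting; normalizing the weights into a distribution and subtracting the maximum before exponentiating are implementation choices, not prescribed by the source:

```python
import numpy as np

def trajectory_weights(returns, beta=0.05):
    """w(tau) ∝ exp(beta * r_tau), normalized to sum to one."""
    r = np.asarray(returns, dtype=float)
    z = np.exp(beta * (r - r.max()))   # subtract max before exp for stability
    return z / z.sum()

w = trajectory_weights([10.0, 50.0, 100.0])
print(w.argmax())  # 2: the highest-return trajectory receives the largest weight
```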

b. Conservative Regularization

For trajectories with returns exceeding a high percentile $r_q$ (e.g., the 95th percentile), pseudo-OOD contexts are constructed by adding noise $\varepsilon$ so that $g_1 + \varepsilon \geq r^*$. The perturbed RTG is defined as

$$\omega_t' = \frac{g_t + \varepsilon}{T - t + 1}.$$

The conservative penalty enforced is

$$R(\theta) = \mathbb{E}_{\substack{\tau\sim\mathcal{D},\; r_\tau > r_q \\ \varepsilon \sim \mathcal{E}}}\left[\sum_{t=1}^{T} \|\pi_\theta(s_t, \omega_t') - a_t\|^2\right].$$

This regularizer constrains the policy’s behavior under OOD returns to remain close to trajectories seen in high-quality data.
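The two pieces above can be sketched as follows; the noise distribution (the absolute value of a zero-mean Gaussian, so the shift points upward) and the per-trajectory interface are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

def perturbed_rtg(g, sigma=1000.0):
    # omega'_t = (g_t + eps) / (T - t + 1), following the expression above;
    # drawing eps = |N(0, sigma^2)| so the shift is upward is an assumption.
    g = np.asarray(g, dtype=float)
    eps = abs(rng.normal(scale=sigma))
    t = np.arange(1, len(g) + 1)
    return (g + eps) / (len(g) - t + 1)

def conservative_penalty(policy, states, g, actions, r_q):
    # R(theta): squared action error under pseudo-OOD return contexts,
    # applied only above the return percentile r_q (g[0] = r_tau, the
    # trajectory's total return).
    if g[0] <= r_q:
        return 0.0
    omega = perturbed_rtg(g)
    preds = np.array([policy(s, w) for s, w in zip(states, omega)])
    return float(np.mean(np.sum((preds - np.asarray(actions)) ** 2, axis=1)))
```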

c. Combined Objective

The final objective function is:

$$\mathcal{L}_{\rm CWBC}(\theta) = \mathbb{E}_{\tau\sim\mathcal{D}}\left[w(\tau)\,\mathcal{L}_{\rm BC}(\tau;\theta)\right] + \lambda\, R(\theta),$$

where $\lambda \geq 0$ governs the tradeoff between maximum-likelihood imitation and conservative OOD regularization.
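A worked one-line instance of the combined objective; the numbers are arbitrary illustrations:

```python
import numpy as np

def cwbc_loss(bc_losses, weights, penalty, lam=1.0):
    """L_CWBC = E[w(tau) * L_BC(tau; theta)] + lam * R(theta)."""
    return float(np.dot(weights, bc_losses) + lam * penalty)

print(cwbc_loss([1.0, 2.0], [0.25, 0.75], penalty=0.5))  # 0.25 + 1.5 + 0.5 = 2.25
```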

4. Implementation Details

The CWBC algorithm involves the following procedural steps (Nguyen et al., 2022):

  • Precompute the return $r_\tau$ for every $\tau \in \mathcal{D}$.
  • Partition trajectories into $B$ return-quantile bins.
  • At each iteration:

    1. Sample a mini-batch: first select a bin $b$ with probability proportional to $\exp(\beta \cdot \text{mean\_return}(b))$, then sample uniformly from the trajectories in $b$.
    2. For each sampled trajectory, compute the standard BC loss; for those with $r_\tau > r_q$, inject noise into $g_t$ and evaluate the conservative penalty.
    3. Update $\theta$ via gradient descent on the combined (weighted + regularized) loss.
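Steps 1–3 above can be sketched as follows; the quantile-bin construction via `np.searchsorted` and the default batch size are implementation assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)

def make_bins(returns, B=20):
    """Partition trajectory indices into B return-quantile bins."""
    r = np.asarray(returns, dtype=float)
    edges = np.quantile(r, np.linspace(0.0, 1.0, B + 1))
    idx = np.clip(np.searchsorted(edges, r, side="right") - 1, 0, B - 1)
    return [np.flatnonzero(idx == b) for b in range(B)]

def sample_batch(bins, returns, beta=0.1, batch_size=64):
    """Pick a bin with probability ∝ exp(beta * mean_return(bin)),
    then sample trajectories uniformly within it."""
    r = np.asarray(returns, dtype=float)
    nonempty = [b for b in bins if len(b)]
    means = np.array([r[b].mean() for b in nonempty])
    logits = beta * (means - means.max())      # stabilized softmax over bins
    p = np.exp(logits) / np.exp(logits).sum()
    chosen = nonempty[rng.choice(len(nonempty), p=p)]
    return rng.choice(chosen, size=batch_size, replace=True)
```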

Recommended hyperparameters include:

| Hyperparameter | Value | Context |
|---|---|---|
| Trajectory-weight bins ($B$) | $20$ | Robust across tasks |
| Smoothing ($\lambda$) | $0.01$ | Training stability |
| Conservative percentile ($q$) | $95$ | OOD regularization |
| Noise std ($\sigma$) | $1000$ | Return perturbation |
| Regularization weight | $1.0$ | Final loss function |
| Test-time conditioning ($G$) | Expert return (no per-task tuning) | Evaluation |

5. Theoretical Foundations

Appendix C of (Nguyen et al., 2022) provides a bias–variance bound for the gradient discrepancy between the reweighted objective (as used in trajectory weighting) and an ideal expert distribution:

$$\mathbb{E}\left\|\nabla L_q - \nabla L_{p^*}\right\|^2 \leq C_1\, \mathbb{E}_{r\sim q}[1/N_r] + C_2\, D_2(q \| p_D)/|\mathcal{D}| + C_3\, D_{\rm TV}(p^*, q)^2,$$

where $q$ is the reweighted return distribution, $N_r$ is the number of trajectories with return $r$, $p_D$ is the dataset's empirical return distribution, and $p^*$ is the expert distribution. Exponential weighting is derived as the minimizer of this upper bound, balancing the bias due to underrepresentation of high returns against the variance introduced by aggressive weighting. Conservative regularization acts as an additional control on extrapolation error for OOD return contexts.

6. Empirical Evaluation

CWBC was evaluated on D4RL locomotion benchmarks (hopper, walker2d, halfcheetah) using “medium,” “med-replay,” and “med-expert” datasets, as well as Atari replay data. Primary metrics included normalized return (relative to expert) and success rates on AntMaze.

Key results (average over 10 seeds):

  • RvS baseline: ~64.6% → RvS+CWBC: ~76.5% (an ~18% relative improvement)
  • Decision Transformer baseline: ~66.7% → DT+CWBC: ~71.9% (+5.2 points)
  • CWBC maintained strong performance even when conditioned on out-of-distribution high returns, often matching or exceeding CQL and IQL.
  • On low-quality “med-replay” datasets, standard RvS collapsed under OOD conditioning, whereas RvS+CWBC remained consistently reliable.

7. Practical Considerations and Limitations

Empirical evidence indicates that CWBC is a robust augmentation to any conditional BC framework, requiring minimal tuning and generalizing well for in-distribution and modestly out-of-distribution return conditioning (Nguyen et al., 2022). Robust default hyperparameters further facilitate practical application. However, perfect linear extrapolation to arbitrarily high OOD returns—i.e., guaranteeing proportional performance for any user-specified $G$—remains elusive, and generalization beyond the support of offline data is an unresolved research challenge.

In summary, Return-Conditioned Behavioral Cloning reframes offline RL as supervised learning conditioned on a desired return. ConserWeightive Behavioral Cloning further enhances reliability by (1) upweighting high-return trajectories and (2) imposing a conservative penalty for OOD contexts, collectively yielding a robust and practical recipe for offline RL with minimal complexity (Nguyen et al., 2022).
