Conditional Behavior Transformers (C-BeT)

Updated 23 March 2026

Conditional Behavior Transformers (C-BeT) are transformer-based models that condition on both past observations and future goals to generate adaptive, goal-directed behaviors.
They employ a GPT-style architecture with causal masking to fuse visual and proprioceptive data, achieving state-of-the-art performance in imitation learning and meta-reinforcement learning scenarios.
C-BeT enables efficient latent intent inference and rapid adaptation to non-stationary dynamics in collaborative robotics through multi-modal output and hindsight relabeling.

Conditional Behavior Transformers (C-BeT) are transformer-based sequence models that leverage conditional and contextual information to generate goal-directed, multi-modal, and adaptive behaviors in robot learning and interactive sequential decision-making. They have been introduced and evaluated across both imitation learning from uncurated play data and meta-reinforcement learning in collaborative robotics. The architecture underpins state-of-the-art approaches for inferring latent human or agent intent, handling non-stationarity, and synthesizing complex, task-centric behaviors in both simulated and real-world settings (Cui et al., 2022, Mon-Williams et al., 2023, Carroll et al., 2022).

1. Core Model Architecture

Conditional Behavior Transformers (C-BeT) extend the Behavioral Transformer (BeT) by conditioning action prediction on both past context and explicit future goals, as well as by inferring latent factors representing non-stationary human behaviors in meta-RL pipelines.

For play data and offline imitation learning, C-BeT is a GPT-style, sequence-to-sequence transformer. Its input comprises two primary blocks:

Current observation window: $\bar o_c = (o_t, o_{t+1}, \ldots, o_{t+N-1})$
Future (goal) observation window: $\bar o_g = (o_{t'}, o_{t'+1}, \ldots, o_{t'+N'-1}), \, t' > t$

Each observation $o$ is a concatenation of visual (e.g., ResNet/BYOL embeddings) and proprioceptive (joint angles) vectors, projected to a shared embedding space with positional encodings. Tokens from $\bar o_g$ and $\bar o_c$ are concatenated and fed into a transformer with $L$ layers of multi-head self-attention. Causal (unidirectional) masking ensures autoregressive prediction: future tokens can act as goals, but output actions cannot access future actions or rewards during training (Cui et al., 2022).
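The token layout and masking described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the goal tokens are placed ahead of the current-observation tokens, and the helper names `build_input_sequence` and `causal_mask` and the string placeholders standing in for embeddings are hypothetical.

```python
def build_input_sequence(goal_window, current_window):
    """Place the goal tokens ahead of the current-observation tokens (assumed order)."""
    return list(goal_window) + list(current_window)


def causal_mask(seq_len):
    """mask[i][j] is True when position i may attend to position j, i.e. j <= i."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


# Placeholder tokens; in practice each would be an embedded observation vector.
goal = ["g0", "g1"]            # N' = 2 goal frames
current = ["o0", "o1", "o2"]   # N = 3 current frames
tokens = build_input_sequence(goal, current)
mask = causal_mask(len(tokens))

# Row of the first current-observation token: it sees every goal token and itself.
first_current = len(goal)
print(mask[first_current])
```

Under this assumed ordering, every current-observation position can attend to the full goal block, while no position can attend to tokens that come after it.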

In meta-RL collaborative settings, C-BeT (also termed BeTrans) comprises:

A causal, autoregressive Transformer (the Behavioral Transformer) that processes a history of $(s_t,a_t,r_t,d_t)$ tuples to produce a latent vector $v_t$ characterizing current human behavior.
A Dynamics Network (MLP) that receives $(s_t, v_t)$ and predicts $(\hat s_{t+1}, \hat r_t)$.
A Soft Actor-Critic (SAC) agent, with both the policy $\pi_\theta(a_t | s_t, v_t)$ and the critic $Q_\phi(s_t, v_t, a_t)$ conditioned on $v_t$ (Mon-Williams et al., 2023).

Latents $v_t$ can be continuous (sampled via the Gaussian reparameterization trick) or discrete (sampled via Gumbel-Softmax), and are inferred from the transformer's final hidden state over the context window.
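The two latent parameterizations can be illustrated with a short standard-library sketch. The function names, the log-variance parameterization, and the temperature value are assumptions made for illustration; the source does not specify them.

```python
import math
import random


def sample_gaussian_latent(mu, log_var, rng):
    """Reparameterization trick: v = mu + sigma * eps with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def sample_gumbel_softmax(logits, temperature, rng):
    """Relaxed one-hot sample: softmax((logits + Gumbel noise) / temperature)."""
    noisy = []
    for logit in logits:
        # Clamp the uniform draw away from 0 and 1 so both logs are defined.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))
        noisy.append((logit + gumbel) / temperature)
    peak = max(noisy)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in noisy]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
v_continuous = sample_gaussian_latent([0.0, 1.0], [-2.0, -2.0], rng)
v_discrete = sample_gumbel_softmax([2.0, 0.5, 0.1], temperature=0.5, rng=rng)
```

In a differentiable framework, both samplers keep the randomness outside the learned parameters, which is what lets gradients flow back into the transformer that produces `mu`, `log_var`, or `logits`.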

2. Conditioning, Inference Mechanisms, and Loss Functions

C-BeT enables flexible behavior by feeding future information directly into the input stream. To specify a goal-conditioned policy, one concatenates the future goal window with the current observation window, allowing self-attention to bind history and desired outcome. Rather than relying on manually inserted special tokens, the architecture uses the transformer's global attention to relate past observations to desired future outcomes.

The action prediction head operates with multi-modal output:

Discrete bin logits $p_d \in \mathbb{R}^K$: The action space is clustered into $K$ bins by $k$-means during preprocessing; each action is represented as a bin index $i$ plus a continuous residual $r = a - \mu_i$, where $\mu_i$ is the center of bin $i$.
Continuous offset predictions $p_c \in \mathbb{R}^{K \times |A|}$: One residual vector is regressed per bin, refining the coarse bin center into a precise action.
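The bin-plus-residual representation can be sketched as an encode/decode pair. This is an illustrative sketch only: the bin centers are assumed to come from a prior $k$-means fit, and `encode_action` and `decode_action` are hypothetical helper names.

```python
import random


def encode_action(action, centers):
    """Return (index of the nearest center, residual = action - center)."""
    def sq_dist(center):
        return sum((a - c) ** 2 for a, c in zip(action, center))

    idx = min(range(len(centers)), key=lambda i: sq_dist(centers[i]))
    residual = [a - c for a, c in zip(action, centers[idx])]
    return idx, residual


def decode_action(bin_probs, offsets, centers, rng):
    """Sample a bin from the categorical head, then add that bin's predicted offset."""
    idx = rng.choices(range(len(centers)), weights=bin_probs, k=1)[0]
    return [c + o for c, o in zip(centers[idx], offsets[idx])]


centers = [[0.0, 0.0], [1.0, 1.0]]         # K = 2 bin centers, assumed precomputed
idx, residual = encode_action([0.9, 1.2], centers)

offsets = [[0.0, 0.0], [-0.1, 0.2]]        # per-bin offset predictions (K x |A|)
action = decode_action([0.0, 1.0], offsets, centers, random.Random(0))
```

Sampling the bin rather than taking a mean over bins is what preserves multi-modality: two distinct ways of performing a task map to different bins instead of being averaged into an action that matches neither.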

The loss function for each example is:

$L_{C\text{-}BeT} = L_{focal}(p_d, i^*) + \lambda L_{MT}(p_c, i^*, r^*)$

where $i^*$ and $r^*$ are the ground-truth bin index and residual, $L_{focal}$ is the focal loss (with $\gamma=2$), $L_{MT}$ is the mean squared error on the residual, and $\lambda=1.0$ in practice (Cui et al., 2022).
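A per-example version of this loss can be sketched in plain Python. It assumes that the residual error is measured only on the offset row of the ground-truth bin, which matches the arguments of $L_{MT}(p_c, i^*, r^*)$ but is not spelled out in the text; all function names are hypothetical.

```python
import math


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def focal_loss(logits, target_bin, gamma=2.0):
    """Cross-entropy on the true bin, down-weighted by (1 - p_t)^gamma."""
    p_t = softmax(logits)[target_bin]
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


def offset_mse(offsets, target_bin, target_residual):
    """Mean squared error between the true bin's predicted offset and the residual."""
    pred = offsets[target_bin]
    return sum((p - r) ** 2 for p, r in zip(pred, target_residual)) / len(pred)


def cbet_loss(logits, offsets, target_bin, target_residual, gamma=2.0, lam=1.0):
    return (focal_loss(logits, target_bin, gamma)
            + lam * offset_mse(offsets, target_bin, target_residual))


loss = cbet_loss(
    logits=[0.0, 0.0],
    offsets=[[0.0, 0.0], [1.0, 1.0]],
    target_bin=0,
    target_residual=[0.0, 0.0],
)
```

With uniform logits over two bins and a perfect offset, the loss reduces to the focal term $(1 - 0.5)^2 \cdot \ln 2$.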

In meta-RL, the joint optimization comprises:

SAC actor and critic objectives:

$J_Q(\theta_Q) = \mathbb{E}_{(s,v,a,r,s')} \left[\frac{1}{2}\left(Q_{\theta_Q}(s,v,a) - (r + \gamma \mathbb{E}_{a' \sim \pi}[Q_{\theta_Q'}(s',v,a') - \alpha \log \pi_\theta(a'|s',v)])\right)^2\right]$

$J_\pi(\theta_\pi) = \mathbb{E}_{s,v,\,a \sim \pi_\theta}\left[\alpha \log \pi_\theta(a|s,v) - Q_{\theta_Q}(s,v,a)\right]$
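For a single transition and a single sampled next action, the critic objective above reduces to a short computation. The sketch below is illustrative: the function names are hypothetical, and the default values of `gamma` and `alpha` are assumptions rather than settings reported in the source.

```python
def soft_td_target(reward, next_q, next_log_prob, gamma=0.99, alpha=0.2):
    """r + gamma * (Q_target(s', v, a') - alpha * log pi(a' | s', v))."""
    return reward + gamma * (next_q - alpha * next_log_prob)


def critic_loss(q_pred, target):
    """Half squared error between the critic's estimate and the soft TD target."""
    return 0.5 * (q_pred - target) ** 2


def actor_loss(log_prob, q_value, alpha=0.2):
    """alpha * log pi(a | s, v) - Q(s, v, a) for an action sampled from the policy."""
    return alpha * log_prob - q_value


target = soft_td_target(reward=1.0, next_q=2.0, next_log_prob=-1.0, gamma=0.9, alpha=0.5)
j_q = critic_loss(q_pred=3.0, target=target)
j_pi = actor_loss(log_prob=-1.0, q_value=3.0, alpha=0.5)
```

The latent $v$ enters only through the inputs to the policy and critic networks, so these scalar formulas are the same as in standard SAC.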

Dynamics and KL penalties:

$J_{sr} = \mathbb{E}\left[\|s' - \hat s'\|^2 + (r - \hat r)^2\right]$

$J_{KL} = D_{KL}(\mathcal{N}(\mu(H),\sigma^2(H)) \| \mathcal{N}(0,I))$

Total world-model loss: $J_{DN} = J_{sr} + \beta J_{KL}$ (Mon-Williams et al., 2023).
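The world-model loss can be evaluated for one transition as sketched below. The sketch assumes a diagonal Gaussian posterior, for which the KL divergence to $\mathcal{N}(0,I)$ has the closed form $\frac{1}{2}\sum_j(\mu_j^2 + \sigma_j^2 - 1 - \ln\sigma_j^2)$; the function names, the log-variance parameterization, and the value of `beta` are assumptions.

```python
import math


def gaussian_kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) in closed form."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv
                     for m, lv in zip(mu, log_var))


def world_model_loss(s_next, s_pred, r, r_pred, mu, log_var, beta=1.0):
    """Squared next-state and reward prediction errors plus a beta-weighted KL penalty."""
    j_sr = sum((a - b) ** 2 for a, b in zip(s_next, s_pred)) + (r - r_pred) ** 2
    j_kl = gaussian_kl_to_standard_normal(mu, log_var)
    return j_sr + beta * j_kl


j_dn = world_model_loss(
    s_next=[1.0, 0.0], s_pred=[0.0, 0.0],
    r=1.0, r_pred=0.0,
    mu=[1.0], log_var=[0.0],
    beta=0.1,
)
```

The KL term vanishes exactly when the posterior equals the standard normal prior, so `beta` trades prediction accuracy against how far the inferred latent may drift from the prior.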

3. Data Preparation and Meta-Learning Protocols

For imitation learning from play, the training data comprise uncurated robot trajectories $\tau = (o_t, a_t)_t$ from non-expert teleoperation, without task or reward labels. Action vectors are discretized via $k$-means clustering on the dataset, which separates the modes of a multi-modal action distribution. Observation and goal segments are sampled with hindsight relabeling: within any trajectory, context and goal windows may be paired arbitrarily as long as $t' > t$. Inputs are embedded and tokenized with standard transformer machinery, using visual features (BYOL, ResNet-18) and augmented proprioception (Cui et al., 2022).
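Hindsight relabeling as described here amounts to drawing two window start indices from the same trajectory. The following sketch is one possible sampler, assuming uniform sampling and allowing the two windows to overlap (the text only requires $t' > t$); the function name and the toy trajectory are hypothetical.

```python
import random


def sample_context_and_goal(trajectory, n_ctx, n_goal, rng):
    """Pick start indices t < t_goal so that both windows fit inside the trajectory."""
    traj_len = len(trajectory)
    max_t = min(traj_len - n_ctx, traj_len - n_goal - 1)
    if max_t < 0:
        raise ValueError("trajectory too short for the requested windows")
    t = rng.randint(0, max_t)                        # inclusive bounds
    t_goal = rng.randint(t + 1, traj_len - n_goal)   # enforces t_goal > t
    context = trajectory[t:t + n_ctx]
    context_obs = [obs for obs, _ in context]
    target_actions = [act for _, act in context]     # supervision for the context steps
    goal_obs = [obs for obs, _ in trajectory[t_goal:t_goal + n_goal]]
    return t, t_goal, context_obs, goal_obs, target_actions


# Toy trajectory of (observation, action) pairs standing in for play data.
trajectory = [(f"o{k}", f"a{k}") for k in range(20)]
rng = random.Random(0)
t, t_goal, context_obs, goal_obs, target_actions = sample_context_and_goal(
    trajectory, n_ctx=4, n_goal=2, rng=rng
)
```

Because any later segment of a trajectory is a valid goal for an earlier one, a single unlabeled trajectory yields many training pairs without task or reward annotations.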

In meta-RL settings, each episode instantiates a simulated human with time-invariant, low-frequency, and high-frequency biases sampled from a continuous-Bernoulli prior. Within-episode human goals may shift, producing diverse human-robot interaction histories; C-BeT continually re-infers $v_t$ from sliding context windows ($L=125$ tokens $\approx 25$ timesteps). Training executes a single joint loop (no inner-loop adaptation): after trajectory collection, batched SAC and world-model samples update all parameters (with learning rates $\alpha_Q=\alpha_\pi=3\times10^{-4}$, $\alpha_{BT}=4\times10^{-4}$, and batch size $B=32$) (Mon-Williams et al., 2023).
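The sliding context window can be sketched as a fixed-length buffer of recent transitions. This is an assumption-laden illustration: the window is measured in timesteps (25, following the approximate figure above), the class name `ContextWindow` is hypothetical, and how each transition maps to tokens is left unspecified.

```python
from collections import deque


class ContextWindow:
    """Keeps only the most recent transitions for latent re-inference."""

    def __init__(self, max_steps=25):
        # A deque with maxlen discards the oldest entry once it is full.
        self.buffer = deque(maxlen=max_steps)

    def append(self, state, action, reward, done):
        self.buffer.append((state, action, reward, done))

    def history(self):
        """Return the retained (s, a, r, d) tuples, oldest first."""
        return list(self.buffer)


window = ContextWindow(max_steps=25)
for step in range(30):
    window.append(state=step, action=0, reward=0.0, done=False)

recent = window.history()
```

At each step the transformer would be run on `window.history()` to refresh $v_t$, so adaptation happens through the changing context alone rather than through gradient updates.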

4. Experimental Results and Benchmarks

C-BeT has been evaluated across simulated and real-world benchmarks:

Imitation and Play Data Benchmarks

CARLA self-driving (visual RL): C-BeT (multimodal) reaches 0.98 success rate vs. 0.74 for next-best (GTI, CVAE+autoregressive).
Simulated Kitchen tasks (Franka): C-BeT achieves 2.80/4 mean task success, slightly ahead of unimodal C-BeT.
BlockPush: C-BeT reaches 0.90 success vs. 0.35 for unimodal.
On average, C-BeT outperforms prior SOTA approaches by approximately 45.7% (Cui et al., 2022).
Real-world Robot (Franka, toy kitchen): Multimodal C-BeT attains 24/50 success (single-task, aggregated over oven, microwave, pot, and knob) vs. 13/50 (unimodal), 12/50 (unconditioned), and 0/25 (GoFAR).
Multi-task rollouts complete 1.1 tasks/run vs. $<$0.5 for all baselines. Conditioning on unseen demonstration snippets maintains $\approx$67% of single-task performance.

Meta-RL: Human-Robot Collaboration

Co-pass (ball-passing): C-BeT adapts to non-stationary simulated human agents, converging 30–50% faster than RNN-LILI and LILI (VAE). Under noise, performance drops by $<$10% for C-BeT vs. 20–30% for VAE/RNN. For within-episode goal switches and long-term dependency scenarios, only C-BeT regains or maintains near-optimal rewards.
Discrete Gumbel-Softmax latents expedite SAC convergence by 24–46%, acting as an implicit regularizer. C-BeT closely tracks the Oracle (true latent) upper bound (Mon-Williams et al., 2023).

5. Architectural Insights, Strengths, and Limitations

C-BeT's effectiveness is attributed to the transformer's self-attention, which enables simultaneous modeling of recent and long-range dependencies within trajectories, as well as goal conditioning via global input context. In meta-learning, causal masking ensures zero-shot latent inference using only available past information, with no need for online gradient updates (Mon-Williams et al., 2023).

Key strengths:

Multi-modal goal conditioning: Supports multi-modal action distributions and is robust to noisy, unlabeled, and diverse interaction data.
Zero-shot meta-adaptation: Rapid policy adjustment to novel agent/human types without further adaptation steps.
Scalability: Handles high-dimensional visual data and real-world robot observations by leveraging powerful visual embedding techniques.
Hindsight relabeling: Enables efficient data reuse for arbitrary goal conditioning.
Improved sample efficiency: Learned behavior transfers quickly in the presence of task switches, non-stationarity, and long-horizon credit assignment.

Limitations:

Dependence on context window: Performance degrades on tasks whose dependencies exceed the context length ($L$) unless the context is extended.
Markovian latent assumption: The latent $v_t$ must satisfy a first-order Markov property; truly non-Markovian behavior requires explicit sequence modeling.
Visual embedding quality: Failures may arise if the visual encoder is not discriminative enough (e.g., similar knob angles in kitchen tasks).
Distractor sensitivity: Generalization performance degrades significantly beyond two distractors in the environment.
Deterministic transition modeling: The world model is trained with a squared-error loss, so inherently stochastic agent switches are not explicitly modeled (Mon-Williams et al., 2023, Cui et al., 2022).

6. Related Approaches and Theoretical Context

Conditional Behavior Transformers synthesize concepts from supervised behavioral cloning, goal-conditioned imitation, meta-reinforcement learning, and multi-modal sequence modeling. Related frameworks include:

FlexiBiT: Bidirectional transformers trained with random and structured token masking over sequences of $(s_t, a_t, \hat R_t)$, enabling unified inference over BC, RL, planning, and inverse/forward dynamics; performs competitively with task-specific models and enables task composition via inference-time masking (Carroll et al., 2022).
VAE/RNN-based latent RL: Latent variable models (e.g., LILI, RNN-LILI) for meta-RL, less adaptive in non-stationary or noisy settings compared to transformer inference.
Decision Transformers: Sequence modeling of (state, action, return) triplets, but generally without explicit multi-modal or goal-block conditioning as in C-BeT.

A distinguishing aspect of C-BeT is the treatment of arbitrary goal blocks as first-class input, supporting flexible behavior specification without task labels or explicit reward signals and achieving on-par or superior performance in both supervised and reinforcement learning contexts (Cui et al., 2022, Mon-Williams et al., 2023, Carroll et al., 2022).


References (3)

Cui et al. (2022). From Play to Policy: Conditional Behavior Generation from Uncurated Robot Data.

Mon-Williams et al. (2023). A behavioural transformer for effective collaboration between a robot and a non-stationary human.

Carroll et al. (2022). Towards Flexible Inference in Sequential Decision Problems via Bidirectional Transformers.
