Conditional Behavior Transformers (C-BeT)
- Conditional Behavior Transformers (C-BeT) are transformer-based models that condition on both past observations and future goals to generate adaptive, goal-directed behaviors.
- They employ a GPT-style, causally masked architecture to fuse visual and proprioceptive data, achieving state-of-the-art performance in imitation learning and meta-reinforcement learning scenarios.
- C-BeT enables efficient latent intent inference and rapid adaptation to non-stationary dynamics in collaborative robotics through multi-modal output and hindsight relabeling.
Conditional Behavior Transformers (C-BeT) are transformer-based sequence models that leverage conditional and contextual information to generate goal-directed, multi-modal, and adaptive behaviors in robot learning and interactive sequential decision-making. They have been introduced and evaluated across both imitation learning from uncurated play data and meta-reinforcement learning in collaborative robotics. The architecture underpins state-of-the-art approaches for inferring latent human or agent intent, handling non-stationarity, and synthesizing complex, task-centric behaviors in both simulated and real-world settings (Cui et al., 2022, Mon-Williams et al., 2023, Carroll et al., 2022).
1. Core Model Architecture
Conditional Behavior Transformers (C-BeT) extend the Behavioral Transformer (BeT) by conditioning action prediction on both past context and explicit future goals, as well as by inferring latent factors representing non-stationary human behaviors in meta-RL pipelines.
C-BeT (for play data and offline imitation learning) is architected as a GPT-style, sequence-to-sequence transformer. The input comprises two primary blocks:
- Current observation window
- Future (goal) observation window
Each observation is a concatenation of visual (e.g., ResNet/BYOL embeddings) and proprioceptive (joint angles) vectors, projected to a shared embedding space with positional encodings. Tokens from both windows are concatenated and fed into a transformer built from stacked layers of multi-head self-attention. Causal (unidirectional) masking ensures autoregressive prediction: future observation tokens can act as goals, but output actions cannot access future actions or rewards during training (Cui et al., 2022).
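The input layout described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the helper names (`make_token`, `build_sequence`, `causal_mask`) and the toy dimensions are hypothetical, the learned projection and positional encodings are omitted, and placing goal tokens before observation tokens is an assumption made so that causally masked observation positions can attend to the whole goal window.

```python
from typing import List, Tuple

Vector = List[float]
Step = Tuple[Vector, Vector]  # (visual embedding, proprioceptive vector)


def make_token(visual: Vector, proprio: Vector) -> Vector:
    """One observation token: visual embedding concatenated with proprioception."""
    return list(visual) + list(proprio)


def build_sequence(obs_window: List[Step], goal_window: List[Step]) -> List[Vector]:
    """Concatenate goal tokens and observation tokens into one input sequence.

    Assumption: goal tokens come first, so every observation position
    can see them under a causal mask.
    """
    goal_tokens = [make_token(v, p) for v, p in goal_window]
    obs_tokens = [make_token(v, p) for v, p in obs_window]
    return goal_tokens + obs_tokens


def causal_mask(n: int) -> List[List[bool]]:
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


# Toy data: 2 goal steps and 3 observation steps, 2-d visual + 1-d proprio.
goal_window = [([0.9, 0.1], [0.5]), ([1.0, 0.0], [0.6])]
obs_window = [([0.1, 0.2], [0.0]), ([0.2, 0.3], [0.1]), ([0.3, 0.4], [0.2])]

tokens = build_sequence(obs_window, goal_window)
mask = causal_mask(len(tokens))
```

With this ordering, the first observation token (index 2) attends to both goal tokens and to itself, but not to later observations.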
In meta-RL collaborative settings, C-BeT (also termed BeTrans) comprises:
- A causal, autoregressive Transformer (the Behavioral Transformer) that processes a history of tuples to produce a latent vector characterizing current human behavior.
- A Dynamics Network (MLP) receiving , predicting .
- A Soft Actor-Critic (SAC) agent, with both the policy and the critic conditioned on the inferred latent (Mon-Williams et al., 2023).
Latents can be continuous (Gaussian reparameterization) or discrete (Gumbel-Softmax) and are inferred from the transformer's final hidden state over the context window.
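The two latent parameterizations can be illustrated with the standard sampling tricks they refer to. The sketch below uses only the Python standard library; the function names, the temperature value, and the example inputs are illustrative assumptions, and in the actual model the means, log standard deviations, and logits would be produced from the transformer's hidden state.

```python
import math
import random


def reparameterize(mean, log_std, rng):
    """Continuous latent: z = mean + std * eps, with eps ~ N(0, 1)."""
    return [m + math.exp(s) * rng.gauss(0.0, 1.0) for m, s in zip(mean, log_std)]


def gumbel_softmax(logits, temperature, rng):
    """Relaxed discrete latent: softmax((logits + Gumbel noise) / temperature)."""
    noisy = []
    for logit in logits:
        # Gumbel(0, 1) noise is -log(-log(u)) for u ~ Uniform(0, 1).
        u = max(rng.random(), 1e-12)  # keep u strictly positive
        noisy.append((logit - math.log(-math.log(u))) / temperature)
    peak = max(noisy)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in noisy]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
z_continuous = reparameterize([0.0, 1.0], [-1.0, -1.0], rng)
z_discrete = gumbel_softmax([1.0, 2.0, 3.0], temperature=0.5, rng=rng)
```

Lower temperatures push the Gumbel-Softmax sample closer to a one-hot vector, which is what makes it usable as a differentiable stand-in for a categorical latent.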
2. Conditioning, Inference Mechanisms, and Loss Functions
C-BeT enables flexible behavior by incorporating future information directly into the input stream. To specify a goal-conditioned policy, one concatenates the future goal window with the current trajectory, allowing self-attention to bind history and desired outcome. Rather than relying on manually inserted special tokens, the architecture uses the transformer's global attention to relate past observations to desired future outcomes.
The action prediction head operates with multi-modal output:
- Discrete bin logits: The action space is clustered by $k$-means during preprocessing; each action is represented as a bin index plus a continuous residual.
- Continuous offset predictions: Residual regression refines the action within the selected mode.
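The bin-plus-residual representation can be made concrete with a small sketch. It assumes the cluster centers have already been computed by $k$-means; the three 2-d centers and the example action below are made-up values, and the helper names are hypothetical.

```python
import math


def nearest_bin(action, centers):
    """Index of the cluster center closest to the action (Euclidean distance)."""
    return min(range(len(centers)), key=lambda k: math.dist(action, centers[k]))


def encode_action(action, centers):
    """Split an action into (bin index, continuous residual from that bin's center)."""
    k = nearest_bin(action, centers)
    residual = [a - c for a, c in zip(action, centers[k])]
    return k, residual


def decode_action(k, residual, centers):
    """Reconstruct the continuous action from a bin index and an offset."""
    return [c + r for c, r in zip(centers[k], residual)]


# Hypothetical cluster centers, standing in for the output of k-means.
centers = [[0.0, 0.0], [1.0, 1.0], [-1.0, 0.5]]

bin_index, residual = encode_action([0.9, 1.2], centers)
recovered = decode_action(bin_index, residual, centers)
```

At prediction time the same decoding applies: a bin is chosen from the logits, and the predicted offset for that bin is added to its center.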
The loss for each example is a weighted sum of two terms: the focal loss on the bin logits and the mean squared error on the residual (Cui et al., 2022).
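A per-example loss of this shape can be sketched as follows. The focal-loss exponent `gamma=2.0` and the weight `offset_weight=1.0` are placeholder defaults chosen for illustration, not the values used by the authors. Regressing only the offset that belongs to the ground-truth bin is likewise an assumption of this sketch.

```python
import math


def softmax(logits):
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def focal_loss(logits, target_bin, gamma=2.0):
    """Focal loss -(1 - p_t)^gamma * log(p_t) for the true bin."""
    p_t = max(softmax(logits)[target_bin], 1e-12)  # avoid log(0)
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


def residual_mse(predicted, target):
    """Mean squared error between predicted and true residuals."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)


def binned_action_loss(logits, offsets, target_bin, target_residual,
                       gamma=2.0, offset_weight=1.0):
    """Focal loss on the bin logits plus a weighted MSE on the true bin's offset."""
    classification = focal_loss(logits, target_bin, gamma)
    regression = residual_mse(offsets[target_bin], target_residual)
    return classification + offset_weight * regression


# Two bins, 2-d actions; `offsets` holds one predicted offset per bin.
example_loss = binned_action_loss(
    logits=[2.0, 0.0],
    offsets=[[0.1, 0.0], [0.0, 0.0]],
    target_bin=0,
    target_residual=[0.15, -0.05],
)
```

Setting `gamma=0.0` reduces the focal term to ordinary cross-entropy, which is a convenient sanity check.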
In meta-RL, the joint optimization comprises:
- SAC actor and critic objectives
- Dynamics and KL penalties
- Total world-model loss (Mon-Williams et al., 2023).
3. Data Preparation and Meta-Learning Protocols
For imitation learning from play, training data comprise uncurated robot trajectories from non-expert teleoperation, without task or reward labels. Action vectors are discretized via $k$-means clustering on the dataset, which separates multi-modal action structure for greater expressivity. Observation and goal segments are sampled with hindsight relabeling: for any trajectory, context and goal windows are paired arbitrarily as long as the goal window lies later in the trajectory than the context window. Inputs are embedded and tokenized with standard transformer machinery, using visual features (BYOL, ResNet-18) and augmented proprioception (Cui et al., 2022).
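Hindsight relabeling of this kind amounts to sampling two windows from the same trajectory. The sketch below assumes the only constraint is that the goal window starts at or after the end of the context window; the function name, window lengths, and integer "trajectory" are illustrative.

```python
import random


def sample_relabeled_pair(trajectory, context_len, goal_len, rng):
    """Pick a context window and a later goal window from one trajectory."""
    n = len(trajectory)
    if n < context_len + goal_len:
        raise ValueError("trajectory too short for the requested windows")
    # Context window covers [c, c + context_len).
    c = rng.randrange(0, n - context_len - goal_len + 1)
    # Goal window covers [g, g + goal_len) and starts after the context ends.
    g = rng.randrange(c + context_len, n - goal_len + 1)
    return trajectory[c:c + context_len], trajectory[g:g + goal_len]


# A toy trajectory whose "observations" are just timestep indices.
trajectory = list(range(20))
context, goal = sample_relabeled_pair(trajectory, context_len=3, goal_len=2,
                                      rng=random.Random(0))
```

Because any later segment can serve as a goal, a single unlabeled trajectory yields many (context, goal) training pairs without task or reward annotations.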
In meta-RL settings, each episode instantiates a simulated human with time-invariant, low-frequency, and high-frequency biases sampled from a continuous-Bernoulli prior. Human goals may shift within an episode, producing diverse human-robot interaction histories; C-BeT continually re-infers the latent from sliding context windows. Training runs a single joint loop (no inner-loop adaptation): after trajectory collection, batched SAC and world-model samples update all parameters (Mon-Williams et al., 2023).
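Re-inference from a sliding context window can be illustrated with a bounded buffer. In this sketch the `encoder` is a hypothetical stand-in for the transformer (here just a running mean over scalar observations of the human's bias), and the window length of 3 is arbitrary; the point is only that the latent is recomputed from recent history at every step, with no gradient updates.

```python
from collections import deque


class SlidingLatentInference:
    """Recompute a latent from the most recent `window` transitions at each step."""

    def __init__(self, encoder, window):
        self.encoder = encoder
        self.history = deque(maxlen=window)  # oldest entries drop out automatically

    def step(self, transition):
        self.history.append(transition)
        return self.encoder(list(self.history))


def mean_encoder(history):
    """Stand-in for the transformer: average of the values in the window."""
    return sum(history) / len(history)


tracker = SlidingLatentInference(mean_encoder, window=3)

# The simulated human switches behavior from 0.0 to 1.0 mid-episode.
observations = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
latents = [tracker.step(x) for x in observations]
```

The inferred value moves from 0.0 to 1.0 within three steps of the switch, once the old behavior has left the window.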
4. Experimental Results and Benchmarks
C-BeT has been evaluated across simulated and real-world benchmarks:
Imitation and Play Data Benchmarks
- CARLA self-driving (visual RL): C-BeT (multimodal) reaches 0.98 success rate vs. 0.74 for next-best (GTI, CVAE+autoregressive).
- Simulated Kitchen tasks (Franka): C-BeT achieves 2.80/4 mean task success, slightly ahead of unimodal C-BeT.
- BlockPush: C-BeT reaches 0.90 success vs. 0.35 for unimodal.
- On average, C-BeT outperforms prior SOTA approaches by approximately 45.7% (Cui et al., 2022).
- Real-world Robot (Franka, toy kitchen): Multimodal C-BeT attains 24/50 successes on single tasks (aggregated over oven, microwave, pot, and knob) vs. 13/50 (unimodal), 12/50 (unconditioned), and 0/25 (GoFAR).
- Multi-task rollouts complete 1.1 tasks/run vs. 0.5 for all baselines. Conditioning on unseen demonstration snippets maintains 67% of single-task performance.
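For easier comparison, the real-world success counts quoted above can be converted to rates. The snippet only restates the reported counts; the dictionary keys are shorthand labels chosen here.

```python
# (successes, trials) as reported for the real-world toy-kitchen evaluation.
real_world_counts = {
    "C-BeT (multimodal)": (24, 50),
    "C-BeT (unimodal)": (13, 50),
    "Unconditioned": (12, 50),
    "GoFAR": (0, 25),
}

success_rates = {
    name: successes / trials
    for name, (successes, trials) in real_world_counts.items()
}

for name, rate in success_rates.items():
    print(f"{name}: {rate:.2f}")
```

This gives 0.48 for multimodal C-BeT against 0.26, 0.24, and 0.00 for the unimodal, unconditioned, and GoFAR baselines respectively.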
Meta-RL: Human-Robot Collaboration
- Co-pass (ball-passing): C-BeT adapts to non-stationary simulated human agents, converging 30–50% faster than RNN-LILI and LILI (VAE). Under noise, performance drops by 10% for C-BeT vs. 20–30% for VAE/RNN. In scenarios with within-episode goal switches and long-term dependencies, only C-BeT regains or maintains near-optimal rewards.
- Discrete Gumbel-Softmax latents expedite SAC convergence by 24–46%, acting as an implicit regularizer. C-BeT closely tracks the Oracle (true latent) upper bound (Mon-Williams et al., 2023).
5. Architectural Insights, Strengths, and Limitations
C-BeT's effectiveness is attributed to the transformer's self-attention, which enables simultaneous modeling of recent and long-range dependencies within trajectories, as well as goal conditioning via global input context. In meta-learning, causal masking ensures zero-shot latent inference using only available past information, with no need for online gradient updates (Mon-Williams et al., 2023).
Key strengths:
- Multi-modal goal conditioning: Supports multi-modal action distributions and is robust to noisy, unlabeled, and diverse interaction data.
- Zero-shot meta-adaptation: Rapid policy adjustment to novel agent/human types without further adaptation steps.
- Scalability: Handles high-dimensional visual data and real-world robot observations by leveraging powerful visual embedding techniques.
- Hindsight relabeling: Enables efficient data reuse for arbitrary goal conditioning.
- Improved sample efficiency: Learned behavior transfers quickly in the presence of task switches, non-stationarity, and long-horizon credit assignment.
Limitations:
- Dependence on context window: Performance degrades on tasks whose dependencies exceed the context length unless the context is extended.
- Markovian latent assumption: The latent must satisfy a first-order Markov property; truly non-Markovian behavior requires explicit sequence modeling.
- Visual embedding quality: Failures may arise if the visual encoder is not discriminative enough (e.g., similar knob angles in kitchen tasks).
- Distractor sensitivity: Generalization performance degrades significantly beyond two distractors in the environment.
- Deterministic transition modeling: The world model is trained with a squared loss, so inherently stochastic agent switches are not explicitly modeled (Mon-Williams et al., 2023, Cui et al., 2022).
6. Related Approaches and Theoretical Context
Conditional Behavior Transformers synthesize concepts from supervised behavioral cloning, goal-conditioned imitation, meta-reinforcement learning, and multi-modal sequence modeling. Related frameworks include:
- FlexiBiT: Bidirectional transformers trained with random and structured token masking over trajectory sequences, enabling unified inference over BC, RL, planning, and inverse/forward dynamics; performs competitively with task-specific models and enables task composition via inference-time masking (Carroll et al., 2022).
- VAE/RNN-based latent RL: Latent variable models (e.g., LILI, RNN-LILI) for meta-RL, which are less adaptive in non-stationary or noisy settings than transformer-based inference.
- Decision Transformers: Sequence modeling of (state, action, return) triplets, but generally without explicit multi-modal or goal-block conditioning as in C-BeT.
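To illustrate what inference-time masking means for a bidirectional model such as FlexiBiT, the sketch below labels each token of a short state/action sequence as visible, to be predicted, or masked for three of the tasks named above. The token layout, the label names, and the exact mask patterns are simplifying assumptions made here for illustration, not the published scheme.

```python
def token_names(horizon):
    """Interleaved state/action token names: s0, a0, s1, a1, ..."""
    names = []
    for t in range(horizon):
        names += [f"s{t}", f"a{t}"]
    return names


def task_mask(task, horizon, t):
    """Label each token as 'visible', 'predict', or 'masked' for a query at step t."""
    if task == "behavior_cloning":
        visible = {f"s{i}" for i in range(t + 1)} | {f"a{i}" for i in range(t)}
        predict = {f"a{t}"}
    elif task == "forward_dynamics":
        visible = {f"s{t}", f"a{t}"}
        predict = {f"s{t + 1}"}
    elif task == "inverse_dynamics":
        visible = {f"s{t}", f"s{t + 1}"}
        predict = {f"a{t}"}
    else:
        raise ValueError(f"unknown task: {task}")
    labels = {}
    for name in token_names(horizon):
        if name in visible:
            labels[name] = "visible"
        elif name in predict:
            labels[name] = "predict"
        else:
            labels[name] = "masked"
    return labels


bc_labels = task_mask("behavior_cloning", horizon=3, t=1)
fd_labels = task_mask("forward_dynamics", horizon=3, t=0)
```

The same trained network serves each task; only the pattern of visible and hidden tokens changes, in contrast to C-BeT's fixed causal layout with a dedicated goal block.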
A distinguishing aspect of C-BeT is the treatment of arbitrary goal blocks as first-class input, supporting flexible behavior specification without task labels or explicit reward signals and achieving on-par or superior performance in both supervised and reinforcement learning contexts (Cui et al., 2022, Mon-Williams et al., 2023, Carroll et al., 2022).