Behavioral Cloning from Observation (BCO)
- Behavioral Cloning from Observation (BCO) is an imitation learning approach that recovers control policies by inferring actions from state-only expert trajectories using an inverse dynamics model.
- It employs a two-phase self-supervised pipeline that first learns an inverse dynamics model from exploration data and then applies behavioral cloning using the inferred pseudo-actions.
- Recent advances include goal-aware sampling, self-attention enhancements, adversarial frameworks, and domain transfer techniques to improve sample efficiency, robustness, and scalability.
Behavioral Cloning from Observation (BCO) is a class of imitation learning methods for recovering control policies by observing only expert state trajectories, without access to their action sequences. BCO leverages self-supervision to infer actions via an inverse dynamics model, enabling behavioral cloning in settings where (state, action) expert pairs are unavailable. This framework underpins a corpus of methods that address sample efficiency, local minima, and domain transfer when learning from observation-only demonstrations.
1. Formal Problem Statement and Original Methodology
BCO is formulated in the Markov Decision Process (MDP) setting: the goal is to learn a policy $\pi$ from a collection of state-only expert trajectories $\mathcal{D} = \{\zeta_1, \ldots, \zeta_N\}$, where each $\zeta_i = (s_0, s_1, \ldots, s_T)$ contains states only and no expert actions are observed. The key assumption is that learner and expert share the same state and action spaces, but the environment dynamics are unknown. BCO introduces a self-supervised two-phase pipeline (Torabi et al., 2018):
Phase 1: Inverse Dynamics Model Learning
- The agent collects transition tuples $(s_t, a_t, s_{t+1})$ by rolling out an exploration policy (typically random) during a pre-demonstration phase.
- An inverse dynamics model $\mathcal{M}_\theta(a_t \mid s_t, s_{t+1})$ is trained via maximum likelihood to predict the action taken between adjacent states.
Phase 2: Imitation from Observation
- For each state transition $(s_t, s_{t+1})$ in the expert’s demonstrations, the learned model $\mathcal{M}_\theta$ infers a pseudo-action $\tilde{a}_t$.
- Standard behavioral cloning is then performed: train the policy $\pi_\phi$ to predict the pseudo-labels $\tilde{a}_t$ from the corresponding states $s_t$.
Optionally, BCO may be iterated with a small post-demonstration interaction budget, improving $\mathcal{M}_\theta$ and $\pi_\phi$ in alternation (Torabi et al., 2018).
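The pipeline is small enough to sketch end to end. The following is a minimal illustration for discrete actions, assuming exploration transitions and expert state sequences are already available as tensors; the network sizes, helper names, and training loop are illustrative rather than the authors' reference implementation.

```python
import torch
import torch.nn as nn

class MLP(nn.Module):
    """Small MLP used for both the inverse dynamics model and the policy."""
    def __init__(self, in_dim, out_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, out_dim))
    def forward(self, x):
        return self.net(x)

def bco(s, a, s_next, expert_states, state_dim, n_actions, epochs=200):
    """Two-phase BCO sketch.
    s, s_next: float tensors (N, state_dim) from random exploration
    a:         long tensor (N,) of exploration actions
    expert_states: float tensor (T+1, state_dim), one expert trajectory."""
    inv = MLP(2 * state_dim, n_actions)   # models p(a_t | s_t, s_{t+1})
    pi = MLP(state_dim, n_actions)        # policy pi(a | s)
    ce = nn.CrossEntropyLoss()

    # Phase 1: fit the inverse dynamics model by maximum likelihood.
    opt = torch.optim.Adam(inv.parameters(), lr=1e-3)
    for _ in range(epochs):
        opt.zero_grad()
        ce(inv(torch.cat([s, s_next], dim=-1)), a).backward()
        opt.step()

    # Infer pseudo-actions for adjacent expert states.
    with torch.no_grad():
        pairs = torch.cat([expert_states[:-1], expert_states[1:]], dim=-1)
        pseudo_a = inv(pairs).argmax(dim=-1)

    # Phase 2: behavioral cloning on (state, pseudo-action) pairs.
    opt = torch.optim.Adam(pi.parameters(), lr=1e-3)
    for _ in range(epochs):
        opt.zero_grad()
        ce(pi(expert_states[:-1]), pseudo_a).backward()
        opt.step()
    return pi
```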
2. Advances in Model Design and Sampling Strategies
The original BCO is susceptible to sub-optimal solutions due to poor action inference when dynamics are underexplored. Recent extensions address these shortcomings:
Augmented Behavioral Cloning from Observation (ABCO) implements two main advances (Monteiro et al., 2020):
- Goal-aware, Win/Loss Sampling: At each iteration, data for retraining the inverse dynamics model are sampled according to how successfully the current policy reproduces the demonstrated goal, producing a curriculum that shifts from random exploratory states toward increasingly expert-like states (see the sampling sketch after this list).
- Self-Attention Enhancements: Incorporate self-attention modules into both inverse dynamics and policy networks, improving the modeling of long-range dependencies in both vector and image states.
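A minimal sketch of the goal-aware mixing idea, assuming pre-demonstration (random exploration) and post-demonstration (on-policy) transitions are kept in separate buffers and that the mixing weight is the learner's current empirical success rate; the function name and interface are illustrative:

```python
import numpy as np

def sample_inverse_model_batch(pre_buf, post_buf, success_rate, batch_size, rng=None):
    """Win/loss sampling sketch: draw more post-demonstration (policy) transitions
    as the learner's success rate rises, so the inverse-dynamics training data
    drifts from random exploratory states toward expert-like states."""
    rng = rng or np.random.default_rng()
    n_post = min(int(round(success_rate * batch_size)), len(post_buf))
    n_pre = batch_size - n_post
    pre_idx = rng.choice(len(pre_buf), size=n_pre, replace=True)
    post_idx = rng.choice(len(post_buf), size=n_post, replace=True) if n_post else []
    return [pre_buf[i] for i in pre_idx] + [post_buf[i] for i in post_idx]
```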
Imitating Unknown Policies via Exploration (IUPE) combines balanced sampling (maintaining a mix of pre- and post-demonstration data), softmax-based stochastic exploration, and self-attention layers to encourage policy exploration and prevent collapse onto trivial policies (Gavenski et al., 2020). Instead of always selecting maximum likelihood actions, actions are sampled from the inferred distribution, increasing exploration and, empirically, performance on high-dimensional control and visual domains.
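The stochastic-labeling component is easy to isolate. The sketch below, with illustrative names, samples pseudo-actions from the inverse model's softmax distribution instead of taking the argmax:

```python
import torch

def infer_pseudo_actions(inv_model, expert_states, stochastic=True):
    """IUPE-style pseudo-labeling sketch: sample actions from the inverse model's
    softmax distribution instead of taking the argmax, so the cloned policy sees
    more diverse labels and is less prone to collapsing onto trivial behavior."""
    pairs = torch.cat([expert_states[:-1], expert_states[1:]], dim=-1)
    with torch.no_grad():
        probs = torch.softmax(inv_model(pairs), dim=-1)
    if stochastic:
        return torch.multinomial(probs, num_samples=1).squeeze(-1)  # sampled labels
    return probs.argmax(dim=-1)                                     # greedy fallback
```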
Concurrent Training Approaches (BCO*) further eliminate the inefficiency of accumulating large random exploration buffers by interleaving policy improvement and inverse dynamics model retraining. Transitions for inverse model learning are continually collected on-policy, leading to a tighter feedback loop and dramatic reductions in sample complexity (Robertson et al., 2020).
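A schematic of the concurrent loop, with all helpers passed in as assumptions rather than library APIs, looks as follows:

```python
def bco_concurrent(pi, inv, collect_on_policy, update_inverse_model,
                   relabel_and_clone, expert_states, iterations=200):
    """Concurrent-training sketch in the spirit of BCO*: instead of one large
    random-exploration phase, short on-policy rollouts refresh the inverse
    dynamics model, which immediately re-labels the expert states for the next
    policy update. Assumed helper interfaces (not library APIs):
      collect_on_policy(pi)              -> list of (s, a, s_next) transitions
      update_inverse_model(inv, data)    -> fits inv on the fresh transitions
      relabel_and_clone(pi, inv, states) -> behavioral cloning on pseudo-actions
    """
    for _ in range(iterations):
        transitions = collect_on_policy(pi)        # small on-policy batch
        update_inverse_model(inv, transitions)     # tighter feedback loop
        relabel_and_clone(pi, inv, expert_states)  # BC on refreshed pseudo-labels
    return pi
```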
3. Adversarial and Self-Supervised Extensions
Standard BCO can fall into local minima where the policy performs “no action” loops due to imperfect action inference. To address this, adversarial frameworks have been proposed:
Self-Supervised Adversarial Imitation Learning (SAIL) augments the BCO pipeline with two additional components: a discriminator $D$ that distinguishes expert from learner trajectories based solely on states, and a forward dynamics model. The imitation policy, inverse dynamics model, and forward dynamics model are trained jointly in a minimax game in which $D$ provides both an automated goal criterion and adversarial pressure to avoid trivial (no-action) local minima (Monteiro et al., 2023). The combined objective encourages the policy to generate state sequences that the discriminator cannot distinguish from the expert’s, thereby avoiding collapse to non-progressive behavior.
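A sketch of the state-only adversarial signal used in this family of methods (GAIfO/SAIL-style), assuming the discriminator maps a consecutive-state pair to a single logit; how the resulting reward is fed back into the policy update is left abstract here:

```python
import torch
import torch.nn as nn

def discriminator_step(D, opt_D, expert_pairs, learner_pairs):
    """State-only adversarial sketch: D scores consecutive-state pairs
    (s_t, s_{t+1}) with one logit and is trained to separate expert pairs
    from the learner's. The learner is then rewarded for producing pairs
    that D classifies as expert-like (returned as a per-pair reward rather
    than a directly differentiable loss)."""
    bce = nn.BCEWithLogitsLoss()
    opt_D.zero_grad()
    loss = bce(D(expert_pairs), torch.ones(len(expert_pairs), 1)) + \
           bce(D(learner_pairs), torch.zeros(len(learner_pairs), 1))
    loss.backward()
    opt_D.step()
    with torch.no_grad():
        # Higher sigmoid(D) => more expert-like => higher reward for the learner.
        reward = torch.log(torch.sigmoid(D(learner_pairs)) + 1e-8).squeeze(-1)
    return reward
```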
Experimental results demonstrate that SAIL matches or exceeds earlier BCO-style and adversarial approaches (GAIL, GAIfO, IUPE) on a suite of Gym tasks, particularly by eliminating the need for manual goal selection and user intervention.
4. Robustness, Partial Observability, and Causal Confounds
BCO and its relaxations are sensitive to dataset bias and nuisance correlates, especially under partial observability. In such settings, behavioral cloning from observation histories can yield “copycat” agents that simply repeat the expert’s previous action rather than inferring the correct next action, exploiting action autocorrelation in the expert data (Wen et al., 2020). This causal confusion leads to poor generalization under distribution shift.
To combat this, adversarial feature learning is employed: an encoder-decoder policy is trained alongside a target-conditioned adversary (TCA) that ensures the learned representation erases excess information about previous actions (nuisance variable) while retaining information necessary to predict the next action. An information bottleneck (KL penalty) is also imposed on the latent representation. This method achieves state-of-the-art offline performance on partially observed MuJoCo domains, demonstrating that careful regularization is essential for robust BCO in generalized settings.
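A sketch of how such a regularized objective can be assembled, assuming a Gaussian latent z ~ N(mu, sigma^2), a cloning head for the next action, and an adversary head predicting the previous action from z; the weights `lam` and `beta` are illustrative, and the adversary itself is trained separately to minimize its own prediction loss:

```python
import torch
import torch.nn.functional as F

def regularized_bc_loss(z_mu, z_logvar, next_action_logits, next_actions,
                        adversary_prev_logits, prev_actions, lam=1.0, beta=1e-3):
    """Encoder/policy objective sketch: clone the next action, push the latent to
    be uninformative about the previous action (by *maximizing* the adversary's
    loss), and apply a KL information bottleneck on z ~ N(mu, sigma^2)."""
    bc = F.cross_entropy(next_action_logits, next_actions)
    adv = F.cross_entropy(adversary_prev_logits, prev_actions)
    kl = -0.5 * torch.mean(1 + z_logvar - z_mu.pow(2) - z_logvar.exp())
    return bc - lam * adv + beta * kl
```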
5. Domain Transfer and Observation Mapping
Advancing beyond shared-environment imitation, BCO principles have been adapted for cross-domain policy transfer with disjoint observation spaces (Shukla et al., 2023). Here, unpaired CycleGANs learn mappings between source and target observations, with cycle-consistency losses ensuring the mappings remain approximately invertible. The mapped observations allow cloning a source-domain policy into the target domain, even under severe visual and semantic mismatch (including sim-to-real transfer). Only a small static set of target-domain images is needed, and no action labels or additional real-world rollouts are required for behavioral cloning. Reported success rates of 90–94% are achieved with as few as 4,000 real images.
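The mapping objective can be sketched as follows, assuming generator networks `G_st` (source to target) and `G_ts` (target to source); the adversarial GAN terms that accompany the cycle losses are omitted for brevity:

```python
import torch
import torch.nn.functional as F

def cycle_consistency_loss(G_st, G_ts, source_obs, target_obs, lam=10.0):
    """Cycle-consistency sketch for observation mapping: reconstructing each
    observation after a round trip through both generators keeps the unpaired
    source<->target mappings approximately invertible."""
    cyc_source = F.l1_loss(G_ts(G_st(source_obs)), source_obs)
    cyc_target = F.l1_loss(G_st(G_ts(target_obs)), target_obs)
    return lam * (cyc_source + cyc_target)
```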
6. Empirical Evaluation and Comparative Results
BCO and its variants have been systematically evaluated on classic control (CartPole, Acrobot, MountainCar), maze navigation (Gym-Maze), and deep continuous domains (MuJoCo locomotion, TurtleBot real-world sim-to-real). Key metrics include Average Episodic Reward (AER) and Normalized Performance (scaled between random and expert returns). Across these benchmarks:
- Vanilla BCO, ABCO, and IUPE typically reach or exceed the performance of baseline Behavioral Cloning (which uses ground-truth actions), GAIL, and related methods, while using only expert states and minimal environment interaction (Monteiro et al., 2020, Gavenski et al., 2020).
- SAIL outperforms BCO and adversarial IfO baselines, particularly in avoiding high-variance loops and action-inactivity collapse (Monteiro et al., 2023).
- Methods introducing attention, balanced sampling, and joint optimization display substantially superior learning curves, especially on large maze and visually complex tasks.
| Method | CartPole (P) | MountainCar (P) | Acrobot (P) | Maze 10x10 (P) |
|---|---|---|---|---|
| BC | 1.000 | 1.000 | 1.000 | 1.000 |
| BCO | 1.000 | 0.948 | 0.980 | -0.416 |
| ABCO | 1.000 | 1.289 | 1.071 | 0.860 |
| IUPE | 1.135 | 1.314 | 1.086 | 1.000 |
| SAIL | 1.000 | 0.990 | 0.990 | -- |
All reported values and further details are reproduced directly from the referenced experimental tables (Monteiro et al., 2020, Gavenski et al., 2020, Monteiro et al., 2023).
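For reference, the Normalized Performance (P) values above follow the usual rescaling between random and expert returns; a minimal sketch (the metric is as defined in the cited papers, the function itself is illustrative):

```python
def normalized_performance(aer, random_return, expert_return):
    """Rescale average episodic reward so the random policy scores 0 and the
    expert scores 1; values above 1 (or below 0) indicate the learner beats
    the expert (or underperforms the random policy)."""
    return (aer - random_return) / (expert_return - random_return)
```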
7. Limitations and Future Directions
Current BCO methodologies face several constraints:
- Inverse model quality depends on coverage and diversity of pre/post-demo exploration.
- Partitioning states into agent-specific and task-specific components often requires domain knowledge (Torabi et al., 2018).
- Adversarial extensions add system complexity and impose capacity constraints on the discriminator to avoid overfitting on small demo sets (Monteiro et al., 2023).
- Iterative labeling loops may accumulate approximation errors if the inverse model fails to generalize.
Active research directions include:
- Scaling BCO to pixel-based observations and learning directly from raw video (Monteiro et al., 2023).
- Employing more advanced discriminator architectures (e.g., transformers) and data augmentation to increase robustness (Monteiro et al., 2023).
- Extending BCO frameworks to multi-modal actions, partial observability, and large-scale domain adaptation for broader applicability (Robertson et al., 2020, Shukla et al., 2023).
A plausible implication is that future advances in semi-supervised learning, domain adaptation, and representation robustness will further close the remaining performance gap between observation-only and action-augmented imitation learning, as well as enable robust policy transfer from unstructured demonstration data at scale.