Augmented Behavioral Cloning from Observation (ABCO)
- Augmented Behavioral Cloning from Observation (ABCO) is a robust imitation-from-observation method that integrates self-supervised action inference, regulated sampling, and self-attention to address BCO limitations.
- It strategically blends pre-demonstration and post-demonstration data with softmax-based exploratory action sampling to prevent action class collapse and improve generalization.
- Empirical results demonstrate that ABCO outperforms traditional methods in high-dimensional, vector, and image-based tasks, achieving near-optimal performance across environments.
Augmented Behavioral Cloning from Observation (ABCO) is a family of imitation-from-observation (IfO) algorithms that enhance standard behavioral cloning from observation (BCO) by coupling self-supervised action inference with advanced dataset regulation, exploration, and attention-based representation learning. ABCO addresses failures of basic BCO arising from local minima, distributional collapse, and weak generalization in high-dimensional and partially observed settings, reliably outperforming prior IfO methods across vector and image-based tasks (Gavenski et al., 2020, Monteiro et al., 2020, Ozcan et al., 2024).
1. Problem Setting and Baseline: BCO
Behavioral Cloning from Observation (BCO) operates in a Markov Decision Process $(\mathcal{S}, \mathcal{A}, T, r, \gamma)$, where only state trajectories of expert policies are observable; action labels and reward signals are unavailable. The central learning objective is to produce a policy $\pi_\theta$ such that the agent's induced state-visitation distribution matches that of the expert, using as a bootstrap a self-learned inverse dynamics model (IDM) $M_\phi(a \mid s_t, s_{t+1})$ trained on agent-collected experience under a random or exploratory policy (Torabi et al., 2018).
The canonical BCO pipeline is:
- Collect agent data $\mathcal{I}^{pre} = \{(s_t, a_t, s_{t+1})\}$ using a random policy.
- Train an inverse dynamics model $M_\phi$ to predict $a_t$ from $(s_t, s_{t+1})$.
- Infer actions for expert transitions: assign labels $\hat{a}_t = \arg\max_a M_\phi(a \mid s_t^e, s_{t+1}^e)$.
- Clone policy: behaviorally clone $\pi_\theta$ on $\{(s_t^e, \hat{a}_t)\}$.
- (Optional) Iteration: alternate policy rollout and model retraining.
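The pipeline above can be sketched end-to-end on a toy tabular problem. The five-state chain environment, count-based IDM, and lookup-table policy below are illustrative assumptions for exposition, not the networks used in the papers:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy discrete setting: states are integers 0..4, actions 0/1 move the
# state down/up (clipped). All names here are illustrative, not from BCO.
N_STATES, N_ACTIONS = 5, 2

def step(s, a):
    return int(np.clip(s + (1 if a == 1 else -1), 0, N_STATES - 1))

# 1) Collect agent data under a random policy.
pre_demo = []
s = 2
for _ in range(500):
    a = int(rng.integers(N_ACTIONS))
    s2 = step(s, a)
    pre_demo.append((s, a, s2))
    s = s2

# 2) "Train" a tabular inverse dynamics model: count-based P(a | s, s').
idm = np.ones((N_STATES, N_STATES, N_ACTIONS))  # Laplace smoothing
for (s, a, s2) in pre_demo:
    idm[s, s2, a] += 1
idm /= idm.sum(axis=-1, keepdims=True)

# 3) Infer pseudo-labels for state-only expert pairs (expert always moves up).
expert_states = [(i, i + 1) for i in range(N_STATES - 1)]
labels = [int(np.argmax(idm[s, s2])) for (s, s2) in expert_states]

# 4) Behavioral cloning degenerates here to a lookup from state to action.
policy = {s: a for (s, s2), a in zip(expert_states, labels)}
```

In this tabular toy the IDM recovers the "up" action for every expert transition, so the cloned policy reproduces expert behavior; the feedback trap discussed next arises when the IDM's early errors are instead fed back into training.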
Limiting factors of BCO include action class collapse, poor coverage of rare transitions, and a “feedback trap” where the iterative label generation process amplifies early prediction errors, entrenching suboptimal policies (Gavenski et al., 2020, Monteiro et al., 2020, Robertson et al., 2020).
2. ABCO Algorithmic Structure and Innovations
ABCO (termed IUPE in Gavenski et al., 2020) introduces:
- Sampling regulation, to stabilize the support of action labels and maintain distributional diversity by blending “pre-demonstration” (random) and “post-demonstration” (policy-generated) experience according to success-driven weights.
- Exploratory action sampling, using stochastic softmax sampling for both IDM decoding and agent rollouts, replacing deterministic argmax to induce self-tuned exploration.
- Self-attention modules, within IDM and policy networks, to capture global state features and nonlocal dependencies.
The iterative loop is:
- IDM Decoding: Train $M_\phi$ on the blended buffer $\mathcal{I}^{pre} \cup \mathcal{I}^{pos}$. Generate pseudo-labels $\hat{a}_t$ for all expert pairs $(s_t^e, s_{t+1}^e)$ by sampling from the softmax over logits, not MAP decoding.
- Behavioral Cloning: Train policy $\pi_\theta$ via cross-entropy to predict labels $\hat{a}_t$ from expert states.
- Exploration / Rollout: Execute $\pi_\theta$ stochastically in the environment; record only those trajectories that succeed at the task.
- Sampling Mechanism: Synthesize a new $\mathcal{I}$ by combining $\mathcal{I}^{pre}$ and successful $\mathcal{I}^{pos}$ with action-wise proportions determined by empirical "win-rate".
This cycle is repeated for a fixed number of improvement iterations. The process is formalized in Algorithm 1 of (Gavenski et al., 2020), which closely mirrors the structure described above.
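A minimal sketch of one such cycle on toy data, using random stand-ins for the IDM logits; the buffer names and the success-rate blending rule below are expository assumptions, not the exact formalism of the papers:

```python
import numpy as np

rng = np.random.default_rng(0)

N_ACTIONS = 3

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Step 1 (IDM decoding): pseudo-labels sampled from the softmax, not argmax.
idm_logits = rng.normal(size=(20, N_ACTIONS))  # stand-in for network outputs
probs = softmax(idm_logits)
pseudo_labels = np.array([rng.choice(N_ACTIONS, p=p) for p in probs])

# Step 3 (rollout): keep only trajectories flagged as successful.
rollouts = [{"success": i % 2 == 0,
             "actions": rng.integers(N_ACTIONS, size=10)} for i in range(6)]
post = np.concatenate([r["actions"] for r in rollouts if r["success"]])

# Step 4 (sampling mechanism): blend pre- and post-demonstration samples,
# letting the post-demonstration share grow with the empirical success rate.
pre = rng.integers(N_ACTIONS, size=60)
win_rate = sum(r["success"] for r in rollouts) / len(rollouts)
n_post = int(win_rate * len(pre))
blended = np.concatenate([pre[: len(pre) - n_post], post[:n_post]])
```

As the policy's success rate rises across cycles, the blend shifts from random pre-demonstration experience toward policy-generated experience, which is the stabilizing mechanism described above.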
3. Mathematical Objectives and Sampling Formalism
The core ABCO objectives are:
- Inverse Dynamics Loss: $\mathcal{L}_{IDM}(\phi) = -\mathbb{E}_{(s_t, a_t, s_{t+1}) \sim \mathcal{I}}\left[\log M_\phi(a_t \mid s_t, s_{t+1})\right]$
- Policy Model Loss: $\mathcal{L}_{PM}(\theta) = -\mathbb{E}_{(s_t^e, \hat{a}_t)}\left[\log \pi_\theta(\hat{a}_t \mid s_t^e)\right]$
- Sampling Distribution: $w_a \propto \sum_{e} \mathbb{1}_{\mathrm{succ}}(e)\, f_a^{(e)}$

where $\mathbb{1}_{\mathrm{succ}}(e)$ signals successful rollouts, and $f_a^{(e)}$ is the frequency of action $a$ in run $e$.
The sampling mechanism produces $\mathcal{I} = \mathcal{I}^{pre} \cup \mathcal{I}^{pos}$, in proportions determined by action-wise success rates, maintaining coverage over all action classes to avoid premature pruning (Gavenski et al., 2020, Monteiro et al., 2020).
Both decoding and rollout phases leverage categorical sampling from softmax outputs:

$a \sim \mathrm{Cat}\!\left(\mathrm{softmax}(z)\right), \qquad P(a = i) = \frac{\exp(z_i)}{\sum_j \exp(z_j)},$

where $z$ are the logits produced by the network. This yields a decaying exploration process: more uncertain models induce higher-entropy distributions, boosting exploration and recovery from early policy collapse.
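This sampling rule is easy to demonstrate directly; the logit values below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = np.asarray(z, dtype=float)
    z = z - z.max()                      # numerical stability
    e = np.exp(z)
    return e / e.sum()

def sample_action(logits):
    """Draw from Cat(softmax(z)) instead of taking the argmax."""
    return int(rng.choice(len(logits), p=softmax(logits)))

# A confident model (peaked logits) almost always emits its MAP action,
# while an uncertain model (near-flat logits) explores almost uniformly.
confident = [8.0, 0.0, 0.0]
uncertain = [0.1, 0.0, 0.1]
draws_conf = [sample_action(confident) for _ in range(1000)]
draws_unc = [sample_action(uncertain) for _ in range(1000)]
```

No temperature schedule is needed: as training sharpens the logits, the sampled policy converges toward argmax decoding on its own, which is the "emergent exploration decay" the papers describe.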
4. Self-Attention Augmentation
ABCO networks integrate multi-head self-attention modules based on the scaled dot-product paradigm. For a feature tensor $x$, the attention map is constructed as $\mathrm{Attn}(x) = \mathrm{softmax}\!\left(QK^\top / \sqrt{d_k}\right)V$, with $Q = W_Q x$, $K = W_K x$, $V = W_V x$.
These modules are inserted after particular ResBlocks in convolutional encoders (image inputs) or into hidden layers of multilayer perceptrons (vector inputs). The overall network is trained end-to-end using the same cross-entropy objectives; no extra attention-specific regularizer or loss is needed (Gavenski et al., 2020, Monteiro et al., 2020).
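A single-head version of this scaled dot-product attention can be written in a few lines of NumPy; the projection matrices and dimensions below are arbitrary assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def self_attention(x, wq, wk, wv):
    """Single-head scaled dot-product self-attention over a set of feature
    vectors x with shape [n, d_in]; wq/wk/wv are learned projections."""
    q, k, v = x @ wq, x @ wk, x @ wv
    dk = q.shape[-1]
    scores = q @ k.T / np.sqrt(dk)                 # [n, n] pairwise affinities
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)       # softmax over keys
    return attn @ v, attn                          # weighted values, weights

n, d_in, d_k = 6, 8, 4
x = rng.normal(size=(n, d_in))
wq, wk, wv = (rng.normal(size=(d_in, d_k)) for _ in range(3))
out, attn = self_attention(x, wq, wk, wv)
```

Because every output row is a convex combination of all value rows, each position can attend to distant features in one step, which is what "global state features and nonlocal dependencies" refers to above.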
Empirical ablations demonstrate that while attention alone enhances representational power, only the combination of attention with regulated sampling ensures that rare action classes persist and are not pruned from the policy's support during training (Monteiro et al., 2020).
5. Training Regimen and Hyperparameterization
Standard ABCO training protocol, as established in (Gavenski et al., 2020), is as follows:
- Collect random transitions.
- Gather expert dataset of 100 state-only trajectories per domain.
- Iterate the ABCO loop for a fixed number of cycles; per cycle:
- IDM: 3 epochs (Adam optimizer, batch size 64).
- PM: 3 epochs (same optimizer).
- Stochastic policy rollouts: a number of episodes capped per environment.
- No explicit $\epsilon$-decay; exploration is emergent via softmax sampling.
- Self-attention: 4 heads, key/query/value dim 64 (image inputs), or 1 head of dim 32 (vector inputs).
- All experiments run for 100 total epochs (Gavenski et al., 2020, Monteiro et al., 2020).
6. Empirical Results and Ablation
Performance benchmarks span both vector and image-based OpenAI Gym tasks:
| Environment | BC (P / AER) | BCO (P / AER) | ILPO (P / AER) | ABCO (P / AER) |
|---|---|---|---|---|
| CartPole-v1 | 1.00 / 500 | 1.00 / 500 | 1.00 / 500 | 1.00 / 500 |
| Acrobot-v1 | 1.00 / -83.6 | 0.98 / -85.3 | 1.07 / -85.3 | 1.29 / -77.9 |
| MountainCar-v0 | 1.00 / -117.7 | 0.95 / -167.0 | 0.63 / -167.0 | 1.29 / -132.3 |
| Maze 3x3 | -1.21 / 0.18 | 0.88 / 0.93 | -1.71 / -0.03 | 0.91 / 0.93 |
| Maze 5x5 | -0.92 / -0.51 | -0.11 / -0.06 | -0.40 / -0.06 | 0.96 / 0.93 |
| Maze 10x10 | -1.00 / -0.47 | -0.42 / -0.02 | 0.26 / -0.02 | 0.86 / 0.86 |

Here $P$ is performance normalized relative to random (0) and expert (1) policies, and AER is the average episodic reward.
In high-dimensional mazes, naive BCO and ILPO policies collapse, evidenced by negative or near-zero normalized performance $P$, while ABCO stably maintains all action classes and achieves near-optimal returns (Gavenski et al., 2020, Monteiro et al., 2020).
Ablation studies reveal:
- Self-attention alone is insufficient to prevent class collapse.
- Sampling alone retards collapse but offers slow learning.
- The complete ABCO (attention + sampling + stochastic exploration) prevents error feedback loops and yields maximal performance; e.g., in the 10×10 maze, ABCO achieves $P = 0.86$ (Gavenski et al., 2020, Monteiro et al., 2020).
7. Extensions and Connections
Subsequent work (Ozcan et al., 2024) generalized ABCO to continuous control and model-based RL. In these frameworks, an augmented policy loss is adopted:

$\mathcal{L}_\pi = \mathcal{L}_{\mathrm{SAC}} + \lambda\, \mathcal{L}_{\mathrm{match}},$

where $\mathcal{L}_{\mathrm{SAC}}$ is the SAC objective and $\mathcal{L}_{\mathrm{match}}$ is a model-based state-matching loss penalizing the deviation between agent-hallucinated next states and expert state transitions. The weight $\lambda$ is adaptively tuned using model-ensemble disagreement on expert states.
Empirical results demonstrate that ABCO-style augmentation achieves substantially higher sample efficiency than baseline RL or pure BCO/BC, matching expert performance in under a million interactions across DeepMind Control Suite tasks (Ozcan et al., 2024).
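A hedged sketch of the adaptive weighting idea, with linear stand-in dynamics models and an assumed disagreement-to-weight schedule (all names and shapes below are illustrative, not the implementation of Ozcan et al., 2024):

```python
import numpy as np

rng = np.random.default_rng(0)

def ensemble_predictions(models, s, a):
    """Each 'model' here is just a weight matrix predicting the next state
    from the concatenated (state, action) input."""
    sa = np.concatenate([s, a], axis=-1)
    return np.stack([sa @ w for w in models])      # [n_models, batch, d_s]

def disagreement(preds):
    """Std. deviation across ensemble members, averaged over batch and dims."""
    return float(preds.std(axis=0).mean())

def augmented_loss(l_rl, l_match, lam):
    # RL objective plus lambda-weighted state-matching imitation term.
    return l_rl + lam * l_match

d_s, d_a, batch = 4, 2, 16
models = [rng.normal(size=(d_s + d_a, d_s)) for _ in range(5)]
s = rng.normal(size=(batch, d_s))
a = rng.normal(size=(batch, d_a))

preds = ensemble_predictions(models, s, a)
# Assumed schedule: high disagreement on expert states -> trust the model
# less -> lower imitation weight.
lam = 1.0 / (1.0 + disagreement(preds))
total = augmented_loss(l_rl=0.8, l_match=0.3, lam=lam)
```

The key design choice is that the imitation term is only trusted where the learned dynamics ensemble agrees, so the agent falls back on the plain RL objective out of distribution.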
8. Significance and Open Questions
ABCO establishes a robust paradigm for imitation-from-observation, particularly in settings where action annotation is unavailable—e.g., real-world video or trajectory archives. Its design addresses notorious pitfalls of earlier BCO methods by actively balancing exploration, action-class coverage, and global feature representation.
Key open directions include further automating sample-efficient blending of experience buffers, improving robustness in highly nonstationary environments, and extending ABCO components (especially attention-based IDMs) to multimodal and partially observed domains (Gavenski et al., 2020, Monteiro et al., 2020, Ozcan et al., 2024).
References:
- (Gavenski et al., 2020)
- (Monteiro et al., 2020)
- (Ozcan et al., 2024)
- (Torabi et al., 2018)
- (Robertson et al., 2020)