
BCV-LR: Behavior Cloning from Videos via Latent Representations

Updated 2 January 2026
  • The paper's main contribution is introducing a method that uses video-derived latent embeddings for efficient imitation learning without direct expert action supervision.
  • It employs pretrained models such as vision transformers and VAEs to extract invariant, task-agnostic features, enabling robust action selection via latent space search and decoding.
  • Experimental results show BCV-LR achieves superior sample efficiency and zero-shot adaptability in both discrete and continuous control tasks across diverse benchmarks.

Behavior Cloning from Videos via Latent Representations (BCV-LR) is a family of imitation learning methods that leverage video-derived, task-agnostic latent representations to enable highly sample-efficient policy learning in both discrete and continuous control domains. By eschewing direct supervision from expert actions or rewards, BCV-LR operates by building, searching, or manipulating state/action embeddings derived from large-scale video corpora, yielding robust, generalizable, and in some cases zero-shot capable imitation policies.

1. Latent Representation Models and Pretraining

BCV-LR fundamentally relies on compact, information-rich latent state/action encodings constructed from raw video observations. The core module is an embedding function $f: S \to Z$, where $S$ is the raw observation space (such as RGB frames or proprioceptive readings) and $Z \subset \mathbb{R}^L$ is a latent space designed to be invariant to visual nuisance factors yet maximally predictive of imminent transitions/actions.

State and/or action embeddings are typically produced via large pretrained models:

  • Video PreTraining (VPT) Vision Transformers: As in MineRL studies, VPT models (e.g., with ∼307M parameters) are pretrained on human gameplay videos and used to embed each temporal observation $s_t$ into $z_t = f(s_t) \in \mathbb{R}^L$; the CLS or pooled token from the transformer output serves as the latent descriptor (Malato et al., 2023, Malato et al., 2022).
  • Task-Agnostic Human Motion VAEs: Trajectories of human hand motion are encoded via sequence-wise variational autoencoders, mapping subtrajectories (e.g., 15-frame windows of a 23-D hand+arm state) to $D$-dimensional latents that are then decoded back to joint/pose sequences (Liconti et al., 2024).
  • Contrastive Self-Supervised Encoders with Dynamics Objectives: In policy learning for general environments, encoders are trained with contrastive or prototype-based temporal association losses, optionally combined with reconstruction, so that frame augmentations collapse to the same latent, and temporally-neighboring frames are grouped by consistency (Liu et al., 25 Dec 2025).

This stage yields a fixed encoder $f$ used throughout downstream policy learning.
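For concreteness, a minimal sketch of the self-supervised pretraining stage is shown below, assuming PyTorch, a small CNN encoder, and an InfoNCE-style temporal association loss; the architecture, hyperparameters, and class names are illustrative and not taken from any of the cited implementations.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrameEncoder(nn.Module):
    """Maps raw frames s_t (3xHxW) to latent codes z_t in R^L (hypothetical architecture)."""
    def __init__(self, latent_dim: int = 128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, 3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.proj = nn.Linear(64, latent_dim)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # Unit-normalized latents so that dot products behave as cosine similarities.
        return F.normalize(self.proj(self.conv(frames)), dim=-1)

def temporal_infonce(z_t: torch.Tensor, z_tp1: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """InfoNCE loss treating temporally adjacent frames as positives within a batch."""
    logits = z_t @ z_tp1.T / temperature              # (B, B) pairwise similarity matrix
    targets = torch.arange(z_t.size(0), device=z_t.device)
    return F.cross_entropy(logits, targets)

# Usage: sample (s_t, s_{t+1}) frame pairs from unlabeled video and minimize the loss.
encoder = FrameEncoder()
opt = torch.optim.Adam(encoder.parameters(), lr=3e-4)
s_t, s_tp1 = torch.randn(32, 3, 64, 64), torch.randn(32, 3, 64, 64)  # stand-in video frames
loss = temporal_infonce(encoder(s_t), encoder(s_tp1))
loss.backward()
opt.step()
```

In practice this loss would be combined with augmentation-invariance, clustering, or reconstruction terms as described above, and the resulting encoder is then frozen.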

2. Policy Extraction and Control via Latent Spaces

Several BCV-LR frameworks cast the policy extraction problem as a search, decision, or prediction problem in latent space, sidestepping direct pixel-level regression or action labeling:

  • Nearest-Neighbor Search and Copy: The agent's current observation $s_t$ is embedded as $z_t = f(s_t)$. The system queries an offline index of expert latent trajectories $\{z^i_k\}$ for the closest match under a specified metric (typically Euclidean or L1 distance). The corresponding expert action $a^i_k$ from the matched demonstration is executed, and the agent continues to follow along that demonstration until its latent diverges past a threshold $\tau$, prompting a new search (Malato et al., 2023, Malato et al., 2022). A minimal sketch appears after this list.
  • Latent Sequence Modeling and Decoding: Rather than producing explicit joint commands, the policy (often a Transformer) predicts the next latent code $z$ on the pretrained manifold, which is then decoded into valid, temporally coherent action sequences through the VAE decoder, ensuring that robot actions stay within human-like or expert-consistent space (Liconti et al., 2024).
  • Latent Action Distillation and Fine-Tuning: In robotics settings, BCV-LR can learn disentangled scene/motion token representations from multi-modal video (human, robot, language instruction), then distill these latents into a vision-language-action policy by aligning representations and subsequently fine-tuning an action-output head, enabling cross-embodiment generalization (Li et al., 28 Nov 2025).
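A minimal sketch of the search-and-copy controller, assuming a frozen encoder and a flat NumPy index of expert latents, is shown below; the thresholding and re-search details in (Malato et al., 2023) may differ, and the class and argument names are hypothetical.

```python
import numpy as np

class LatentSearchPolicy:
    """Nearest-neighbor search-and-copy over expert latent trajectories (illustrative sketch)."""
    def __init__(self, demo_latents, demo_actions, encoder, tau=0.5, k_max=20):
        # Demonstrations are concatenated for simplicity; a real implementation
        # would also track per-demonstration boundaries.
        self.z_demo = np.concatenate(demo_latents)      # (N, L) stacked expert latents
        self.a_demo = np.concatenate(demo_actions)      # (N, ...) aligned expert actions
        self.encoder = encoder                          # frozen f: observation -> R^L
        self.tau, self.k_max = tau, k_max               # divergence threshold, max copy length
        self.idx, self.steps_copied = None, 0

    def _search(self, z):
        # Euclidean nearest neighbor over the flat index (an approximate index could be used at scale).
        return int(np.argmin(np.linalg.norm(self.z_demo - z, axis=1)))

    def act(self, observation):
        z = self.encoder(observation)                   # embed the current observation
        must_search = (
            self.idx is None
            or self.idx + 1 >= len(self.a_demo)
            or self.steps_copied >= self.k_max
            or np.linalg.norm(self.z_demo[self.idx] - z) > self.tau
        )
        if must_search:                                 # re-anchor on the closest expert latent
            self.idx, self.steps_copied = self._search(z), 0
        action = self.a_demo[self.idx]                  # copy the matched expert action
        self.idx += 1                                   # follow along the matched demonstration
        self.steps_copied += 1
        return action
```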

This latent-centric paradigm eliminates, or greatly attenuates, the need for direct action/reward supervision and produces inherently robust, explainable control policies.
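As an illustration of the latent sequence modeling and decoding variant, a policy of this kind might predict the next latent with a small Transformer and decode it through a frozen pretrained VAE decoder; the module shapes below are hypothetical and do not reproduce the architecture of (Liconti et al., 2024).

```python
import torch
import torch.nn as nn

class LatentDecodePolicy(nn.Module):
    """Predicts the next latent code from a context of past latents, then decodes it into
    an action/pose chunk with a frozen pretrained VAE decoder (illustrative sketch)."""
    def __init__(self, vae_decoder: nn.Module, latent_dim: int = 64, context: int = 8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=latent_dim, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(latent_dim, latent_dim)   # predicts z_{t+1}
        self.decoder = vae_decoder.eval()               # frozen; maps a latent to a pose/action sequence
        self.context = context

    @torch.no_grad()
    def act(self, z_history: torch.Tensor) -> torch.Tensor:
        # z_history: (1, T, latent_dim) latents of the most recent observations
        h = self.backbone(z_history[:, -self.context:])
        z_next = self.head(h[:, -1])                    # next latent on the pretrained manifold
        return self.decoder(z_next)                     # decoded, temporally coherent action chunk
```

Keeping the decoder frozen is what constrains the output to the expert-consistent region of the latent manifold.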

3. Training Protocols and Algorithms

BCV-LR implementations share a multi-stage training design. A canonical example:

  1. Offline Encoder Pretraining: Learn $f$ via self-supervised or supervised objectives on large video datasets, including contrastive losses, temporal clustering, and (optionally) global reconstruction or dynamics prediction penalties (Liu et al., 25 Dec 2025).
  2. Latent Action Disentanglement and Dynamics Prediction: Jointly train models to extract action-like transitions from pairs of successive latents, using either supervised alignment to trajectory data or unsupervised auto-regressive or world-model losses. Quantization (VQ) may enforce discreteness of the latent action set (Liu et al., 25 Dec 2025).
  3. Policy Search/Distillation: Use nearest-neighbor search to select actions from demonstration embeddings, or train a latent policy (via behavior cloning or distillation) to reproduce expert latent transitions, sometimes with differentiable imitation losses (Malato et al., 2023, Liu et al., 25 Dec 2025, Li et al., 28 Nov 2025).
  4. Online Fine-Tuning/Alignment: Once the agent can interact with the environment, align latent transitions to real environment actions using collected experience, optimizing both action reconstruction and dynamics consistency losses. This may include iterative cycles in which the improved policy refreshes the behavioral dataset for further refinement (Liu et al., 25 Dec 2025).

A representative pseudocode for nearest-neighbor latent search and action copy is provided in (Malato et al., 2023), while detailed pipeline outlines for latent action distillation, chunked latent decoding, and iterative self-improvement are provided in (Li et al., 28 Nov 2025, Liconti et al., 2024, Liu et al., 25 Dec 2025).
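To make the latent-action stage (step 2 above) concrete, the following is a rough sketch of a vector-quantized inverse/forward model over successive latents; the codebook size, network shapes, and loss weighting are assumptions for illustration, not the architecture of (Liu et al., 25 Dec 2025).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VQLatentAction(nn.Module):
    """Infers a discrete latent action from (z_t, z_{t+1}) and predicts z_{t+1} back from
    (z_t, action). Illustrative VQ inverse/forward model, not a specific paper's design."""
    def __init__(self, latent_dim=128, num_codes=64, code_dim=32):
        super().__init__()
        self.inverse = nn.Sequential(nn.Linear(2 * latent_dim, 256), nn.ReLU(),
                                     nn.Linear(256, code_dim))
        self.codebook = nn.Embedding(num_codes, code_dim)
        self.forward_model = nn.Sequential(nn.Linear(latent_dim + code_dim, 256), nn.ReLU(),
                                           nn.Linear(256, latent_dim))

    def forward(self, z_t, z_tp1):
        a_cont = self.inverse(torch.cat([z_t, z_tp1], dim=-1))       # continuous latent action
        dists = torch.cdist(a_cont, self.codebook.weight)            # distances to all codebook entries
        a_q = self.codebook(dists.argmin(dim=-1))                    # nearest discrete code
        a_st = a_cont + (a_q - a_cont).detach()                      # straight-through estimator
        z_pred = self.forward_model(torch.cat([z_t, a_st], dim=-1))  # predict the next latent
        loss = (F.mse_loss(z_pred, z_tp1)                            # dynamics consistency
                + F.mse_loss(a_q, a_cont.detach())                   # codebook loss
                + 0.25 * F.mse_loss(a_cont, a_q.detach()))           # commitment loss
        return loss, a_q
```

During online alignment (step 4), the discrete latent actions produced by such a module would be mapped onto real environment actions using collected experience.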

4. Empirical Evaluation and Results

BCV-LR methods have been benchmarked across diverse settings:

| Environment | Measurement | BCV-LR | Baselines | Remark |
| --- | --- | --- | --- | --- |
| MineRL BASALT | Success rate | 75% of RL | Pixel BC (lower), PPO | Zero environment interactions for training |
| SIMPLER/Google | Success rate | 78.0% | π₀: +25.3% | Outperforms state-of-the-art open models |
| LIBERO | Avg. success | 98% | π₀.₅: +1.1%, UniVLA: +2.8% | Strong on long-horizon manipulation |
| Procgen (discrete) | Normalized return | 0.79 | UPESV: 9.0, PPO: 2.3 | Video-only, few-shot generalization |
| Franka real robot | Success (few-shot) | 48% (10d) | <1% (π₀/π₀.₅), 63.3% (50d) | Robust transfer, no domain fine-tuning |
| MuJoCo hand sim | Positional error | 0.95 cm | Raw BC: 5.6 cm | 83% reduction under noise |

In all reported domains, BCV-LR achieves comparable or superior performance to standard behavioral cloning, reward-based RL, and other imitation-from-observation methods, with orders-of-magnitude greater sample efficiency. For instance, in the Metaworld suite, BCV-LR achieved a 0.84 success rate (50k steps), while BCO and DrQv2 managed 0.07 and 0.16, respectively (Liu et al., 25 Dec 2025).

A notable property is strong zero-shot adaptation: swapping demonstration data or latent encodings at test-time enables immediate policy adaptation to new behaviors, a direct consequence of the search or latent structure (Malato et al., 2023).
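Continuing the hypothetical LatentSearchPolicy sketch from Section 2, such test-time adaptation amounts to rebuilding the demonstration index with newly encoded expert latents; the data below is a stand-in purely for illustration.

```python
import numpy as np

# Zero-shot behavior swap: encode demonstrations of the *new* task with the same frozen
# encoder and rebuild the index; no gradient updates are needed.
new_demo_latents = [np.random.randn(200, 128)]       # stand-in latents of one new-task trajectory
new_demo_actions = [np.random.randint(0, 10, 200)]   # its aligned (discrete) expert actions
adapted_policy = LatentSearchPolicy(new_demo_latents, new_demo_actions,
                                    encoder=lambda obs: np.asarray(obs, dtype=np.float32))
```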

5. Ablations, Limitations, and Theoretical Analysis

Ablation studies demonstrate:

  • The necessity of both self-supervised latent pretraining and latent action refinement for effective low-sample policy learning; omitting either sharply degrades performance, especially in continuous control (Liu et al., 25 Dec 2025).
  • Disentanglement of scene and motion tokens, and bidirectional decoding between these, are critical for universal latent action learning (Li et al., 28 Nov 2025).
  • Hyperparameters such as the divergence threshold $\tau$, the copy segment length $K_{\mathrm{max}}$, and the weighting of dynamics consistency losses have a significant impact on imitation stability and robustness (Malato et al., 2023, Liu et al., 25 Dec 2025).

Limitations include:

  • Coverage: BCV-LR is fundamentally limited by the manifold of expert demonstrations. If the environment induces states not present in the demos, performance and stability degrade due to “latent drift” or frequent re-search (Malato et al., 2023).
  • Pretraining Cost: Some instantiations (e.g., LatBot-style) require substantial computational investment (multiple GPUs, 10+ days) and extensive video/action telemetry (Li et al., 28 Nov 2025).
  • Covariate Shift: As with all behavior cloning, policies remain vulnerable to distributional shift in long-horizon or highly stochastic settings. Hybrid integration with inverse RL or reward modeling in latent space is required to overcome this (Liu et al., 25 Dec 2025, Giammarino et al., 2023).
  • Domain Assumptions: Most approaches assume expert and agent share identical POMDP domains (state, transition, and observation mapping). Out-of-domain adaptation (e.g. third-person, novel environments) remains a challenge (Giammarino et al., 2023).

On the theoretical front, total-variation-based suboptimality bounds demonstrate that matching the distribution of latent transitions between expert and imitator is sufficient for performance convergence under certain reward structures (Giammarino et al., 2023).
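A generic bound of this kind, written here purely for illustration (the exact statement, assumptions, and constants in (Giammarino et al., 2023) differ), takes the familiar occupancy-matching form:

```latex
% Assume rewards depend only on latent transitions and are bounded: |r(z, z')| \le R_{\max}.
\left| J(\pi_E) - J(\pi) \right|
  \;\le\; \frac{2 R_{\max}}{1 - \gamma}\,
  D_{\mathrm{TV}}\!\left( \rho^{\pi_E}(z, z'),\; \rho^{\pi}(z, z') \right)
```

Here $\rho^{\pi}(z, z')$ denotes the discounted occupancy measure over latent transitions induced by policy $\pi$, so driving the total-variation distance between expert and imitator latent-transition occupancies to zero closes the performance gap.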

6. Extensions, Generalization, and Impact

BCV-LR exhibits scalable generalization properties:

  • Cross-Embodiment and Cross-Dataset Transfer: Distilled latent action representations enable seamless transfer across robot types (e.g., Franka, UR5, Kuka) and domains (human, video game, physical robot), provided that supervisory signals such as physical telemetry are unified during pretraining (Li et al., 28 Nov 2025).
  • Few-Shot and Zero-Shot Imitation: By leveraging dense latent indexing and plug-and-play demonstration replacement, BCV-LR routinely demonstrates strong performance in low-data regimes, including “10-shot” real-world robotic settings or adaptation to novel video-game levels (Li et al., 28 Nov 2025, Liu et al., 25 Dec 2025).
  • Sample Efficiency: BCV-LR achieves near-expert performance within roughly $10^4$–$10^5$ environment interactions, far fewer than traditional RL or pixel-based imitation typically requires (Liu et al., 25 Dec 2025).

Emerging research focuses on multi-task pretraining, lifelong adaptation, offline-to-online pipeline improvements, and addressing covariate shift via latent reward inference or hybrid learning. BCV-LR currently sets a performance and sample efficiency benchmark for imitation learning from videos without expert action or reward access (Malato et al., 2023, Liu et al., 25 Dec 2025, Liconti et al., 2024, Li et al., 28 Nov 2025).
