Unsupervised Imitation Learning from Observation
- UfO is the problem of learning control policies from expert state sequences, without access to expert actions or reward signals.
- UfO methods combine distribution matching, adversarial optimization, and model-based planning to improve sample efficiency and stability.
- UfO methods have demonstrated expert-level performance in continuous control and robotics with significantly fewer rollouts.
Unsupervised Imitation Learning from Observation (UfO) is the problem of learning a control policy solely from sequences of expert states (or high-dimensional observations) without access to expert actions or environment rewards. The goal is for the learner's behavior to match the distributional properties of the expert's trajectories, typically in terms of state or state-transition distributions. The UfO paradigm unifies algorithmic and theoretical insights from distribution-matching, model-based learning, adversarial optimization, and reinforcement learning (RL) to enable agents to acquire skills purely “by watching,” analogous to many aspects of human and animal learning.
1. Formal Problem Definition and Foundations
In UfO, the agent is provided with expert trajectories, each trajectory a sequence of states $\tau = (s_0, s_1, \ldots, s_T)$ in a Markov Decision Process (MDP) $\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, r, \gamma)$. The learner never observes expert actions or the environment's reward function. The key objective is to learn a policy $\pi$ such that the resulting distribution over state transitions $\rho_\pi(s, s')$ matches the empirical expert transition distribution $\rho_E(s, s')$. This is formalized as minimizing a discrepancy $D(\rho_\pi, \rho_E)$, where $D$ is typically a divergence or metric on probability distributions, such as the Kullback-Leibler (KL) divergence, Jensen-Shannon divergence, or Wasserstein distance (Burnwal et al., 20 Sep 2025, Torabi et al., 2019, Chang et al., 2023).
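Concretely, a standard way to write the transition occupancy and the resulting objective (the discounted-occupancy convention below is one common choice; other normalizations appear in the literature):

```latex
% Discounted state-transition occupancy induced by policy \pi
\rho_\pi(s, s') = (1-\gamma) \sum_{t=0}^{\infty} \gamma^{t}\,
    \Pr\left(s_t = s,\ s_{t+1} = s' \mid \pi, P\right)

% UfO objective: match the expert's transition occupancy
\min_{\pi} \; D\left(\rho_\pi, \rho_E\right)
```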
Extensions to visual observation—where the "state" is a sequence of images, possibly in a partially observable setting—require learning representations or sufficient statistics from observation histories, but the core distribution-matching principle remains (Giammarino et al., 2023, Liu et al., 2017).
2. Taxonomy of Algorithmic Approaches
UfO algorithms can be broadly categorized according to their treatment of missing actions, use of model-based components, and choice of distribution-matching objectives:
- Model-Free Adversarial Approaches: Directly fit a discriminator to distinguish expert from learner state-transition pairs (or raw observations) and use the discriminator output as a surrogate reward for policy optimization. Examples include Generative Adversarial Imitation from Observation (GAIfO) and its variants, which optimize
$$\min_{\pi} \max_{D} \; \mathbb{E}_{(s,s') \sim \rho_\pi}\big[\log D(s, s')\big] + \mathbb{E}_{(s,s') \sim \rho_E}\big[\log\big(1 - D(s, s')\big)\big]$$
(Torabi et al., 2021, Torabi et al., 2019, Burnwal et al., 20 Sep 2025)
- Model-Based Inverse Dynamics and Action Recovery: Learn an inverse dynamics model from self-collected (state, action, next-state) tuples, then apply it to expert state transitions to infer pseudo-actions, enabling behavioral cloning from observation (Monteiro et al., 2020, Torabi et al., 2019); a minimal sketch of this pipeline follows the list.
- Forward Model or Latent Dynamics Approaches: Fit a forward or latent dynamics model (state or observation space), sometimes with goal-conditioned skills or latent action spaces, to reconstruct or match expert transitions (Pathak et al., 2018, Liu et al., 2017).
- Optimal Transport and Divergence-Minimization: Use the Wasserstein distance between expert and learner empirical trajectory (or transition) distributions as an imitation metric, deriving an explicit reward signal from the transport plan (Chang et al., 2023).
- KL-Minimization via Density Models: Fit conditional density models for expert and policy-induced transition distributions and minimize the KL divergence between them in a non-adversarial way, yielding a stable and interpretable objective (Boborzi et al., 2022).
- Energy-Based and Score-Matching: Develop energy models of expert transitions and use score-matching with denoising diffusion models to produce smooth rewards for RL (Diwan et al., 24 Jan 2025).
- Trajectory-Centric and Planning-Based: Integrate trajectory optimization (LQR, iLQR, MPPI, MPC) into the imitation loop, using adversarial or likelihood-based costs, and exploiting local dynamics models for sample efficiency (Han et al., 29 Jul 2025, Torabi et al., 2021, Torabi et al., 2019, Wang et al., 2024).
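A minimal sketch of the inverse-dynamics route above, assuming a Gym-style (gymnasium API) continuous-control environment and expert states stacked in a tensor; the network sizes, sample budgets, and helper names are illustrative rather than taken from any cited implementation:

```python
# BCO-style pipeline: fit an inverse dynamics model on self-collected data,
# label expert transitions with pseudo-actions, then behavior-clone.
import numpy as np
import torch
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=256):
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                         nn.Linear(hidden, hidden), nn.ReLU(),
                         nn.Linear(hidden, out_dim))

def bco(env, expert_states, n_random_steps=10_000, epochs=200):
    s_dim = env.observation_space.shape[0]
    a_dim = env.action_space.shape[0]
    inv_model = mlp(2 * s_dim, a_dim)   # inverse dynamics: (s, s') -> a
    policy = mlp(s_dim, a_dim)          # cloned policy: s -> a

    # 1) Collect (s, a, s') tuples with a random exploration policy.
    S, A, S2 = [], [], []
    s, _ = env.reset()
    for _ in range(n_random_steps):
        a = env.action_space.sample()
        s2, _, terminated, truncated, _ = env.step(a)
        S.append(s); A.append(a); S2.append(s2)
        s = env.reset()[0] if (terminated or truncated) else s2
    to_t = lambda x: torch.as_tensor(np.array(x), dtype=torch.float32)
    S, A, S2 = to_t(S), to_t(A), to_t(S2)

    # 2) Fit the inverse dynamics model by simple regression.
    opt = torch.optim.Adam(inv_model.parameters(), lr=3e-4)
    for _ in range(epochs):
        loss = nn.functional.mse_loss(inv_model(torch.cat([S, S2], -1)), A)
        opt.zero_grad(); loss.backward(); opt.step()

    # 3) Infer pseudo-actions on expert transitions, then clone them.
    es, es2 = expert_states[:-1], expert_states[1:]
    with torch.no_grad():
        pseudo_a = inv_model(torch.cat([es, es2], -1))
    opt = torch.optim.Adam(policy.parameters(), lr=3e-4)
    for _ in range(epochs):
        loss = nn.functional.mse_loss(policy(es), pseudo_a)
        opt.zero_grad(); loss.backward(); opt.step()
    return policy
```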
An overview taxonomy is provided in Table 1:
| Approach | Key Mechanism | Reward/Objective |
|---|---|---|
| GAIfO, LAIfO, DEALIO | Adversarial matching | Discriminator-based |
| OT/Score/Flow matching | Metric divergence | Wasserstein/KL/energy |
| Model-based BC via IDM | Action recovery | BC on pseudo-actions |
| MPC/Planning-based | Model-predictive RL | Adversarial or flow-based |
3. Representative Algorithms and Technical Workflows
Adversarial Imitation from Observation (GAIfO and DEALIO)
In GAIfO (Torabi et al., 2021, Torabi et al., 2019), a discriminator $D$ is trained to distinguish expert from policy-induced state transitions $(s, s')$. The agent's policy maximizes the expected surrogate reward $r(s, s') = -\log D(s, s')$ (or another monotone function of the discriminator output), driving the policy's transition measure towards the expert's occupancy.
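One discriminator-and-reward update consistent with the min-max objective in Section 2 might look as follows; this is a sketch, and the labeling convention, network, and batch handling are assumptions rather than the published implementation:

```python
# GAIfO-style update sketch: train a discriminator over (s, s') pairs and
# convert its output into a surrogate reward for any RL algorithm.
import torch
import torch.nn as nn

s_dim = 17  # example state dimension (assumption)
disc = nn.Sequential(nn.Linear(2 * s_dim, 256), nn.ReLU(),
                     nn.Linear(256, 1))  # logits over (s, s') pairs
disc_opt = torch.optim.Adam(disc.parameters(), lr=3e-4)
bce = nn.BCEWithLogitsLoss()

def gaifo_step(expert_pairs, policy_pairs):
    # Discriminator step: policy transitions labeled 1, expert labeled 0,
    # matching min_pi max_D E_pi[log D] + E_E[log(1 - D)].
    logits_pi, logits_e = disc(policy_pairs), disc(expert_pairs)
    d_loss = bce(logits_pi, torch.ones_like(logits_pi)) + \
             bce(logits_e, torch.zeros_like(logits_e))
    disc_opt.zero_grad(); d_loss.backward(); disc_opt.step()

    # Surrogate reward r(s, s') = -log D(s, s') for the policy update.
    with torch.no_grad():
        reward = -torch.log(torch.sigmoid(disc(policy_pairs)) + 1e-8)
    return reward.squeeze(-1)  # pass to the RL optimizer of choice
```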
DEALIO (Torabi et al., 2021) integrates a quadratic-structured discriminator and fitted local linear dynamics with trajectory optimization (PILQR), yielding substantial improvements in sample efficiency. By formulating the adversarial cost in a form compatible with LQR and path-integral updates, DEALIO achieves 4× faster convergence than purely model-free adversarial approaches.
Optimal Transport–Based Imitation
The OT-UfO method (Chang et al., 2023) computes the 1-Wasserstein distance between empirical measures of state transitions, deriving a per-transition reward assignment by decomposing the optimal transport cost. This reward can be seamlessly used with any off-policy RL algorithm (e.g., TD3, DDPG), yielding state-of-the-art performance even with a single expert demonstration.
The core steps are:
- Construct empirical transition sets $\{(s_i, s'_i)\}_{i=1}^{N}$ for the learner and $\{(s^E_j, s'^E_j)\}_{j=1}^{M}$ for the expert.
- Solve for an optimal coupling $W^*$ between the two empirical measures using the Sinkhorn algorithm.
- Assign reward $r(s_i, s'_i) = -\sum_j W^*_{ij}\, c\big((s_i, s'_i), (s^E_j, s'^E_j)\big)$ to each learner transition, where $c$ is the ground cost.
- Train the policy with this reward in RL.
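A compact numpy sketch of these steps, assuming uniform empirical weights, a Euclidean ground cost on stacked $(s, s')$ vectors, and entropic regularization; this is illustrative, not the released OT-UfO code:

```python
# Per-transition rewards from an entropic OT coupling between learner
# and expert transition sets (rows are stacked (s, s') vectors).
import numpy as np

def sinkhorn_rewards(pi_pairs, exp_pairs, eps=0.05, iters=200):
    n, m = len(pi_pairs), len(exp_pairs)
    # Ground cost: Euclidean distance between transition vectors.
    C = np.linalg.norm(pi_pairs[:, None, :] - exp_pairs[None, :, :], axis=-1)
    K = np.exp(-C / eps)                    # Gibbs kernel
    a, b = np.ones(n) / n, np.ones(m) / m   # uniform empirical marginals
    v = np.ones(m)
    for _ in range(iters):                  # Sinkhorn fixed-point iterations
        u = a / (K @ v)
        v = b / (K.T @ u)
    W = u[:, None] * K * v[None, :]         # approximate optimal coupling
    # Decompose the transport cost into one reward per learner transition.
    return -(W * C).sum(axis=1)
```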
KL and Energy-Based Divergence Minimization
SOIL-TDM (Boborzi et al., 2022) fits conditional density models for transition dynamics using normalizing flows and minimizes KL divergence between policy-induced and expert transition distributions, providing an analytic stopping criterion and invariant reward assignment.
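Schematically, the resulting non-adversarial reward is a log-density ratio of the two fitted transition models; the sketch below assumes flow objects exposing an `nflows`-style `log_prob(inputs, context)` interface and omits the entropy-regularization terms of the full method:

```python
# SOIL-TDM-flavored reward: log-density ratio between the expert and
# policy-induced conditional transition models (both fitted separately).
import torch

def kl_matching_reward(expert_flow, policy_flow, s, s_next):
    # Minimizing KL(rho_pi || rho_E) yields, per transition, a reward of
    #   r(s, s') = log p_E(s' | s) - log p_pi(s' | s).
    with torch.no_grad():
        return (expert_flow.log_prob(s_next, context=s)
                - policy_flow.log_prob(s_next, context=s))
```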
NEAR (Diwan et al., 24 Jan 2025) builds noise-conditioned energy models of expert state transitions via denoising score matching, anneals the level of smoothing during RL, and avoids adversarial optimization pathologies.
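The denoising-score-matching step can be sketched as follows; the architecture, noise schedule, and exact form of the annealed reward are illustrative assumptions in the spirit of NEAR rather than its published code:

```python
# Train an energy model E(x, sigma) over expert transitions x = (s, s')
# with denoising score matching; at RL time the reward is -E(x, sigma_k).
import torch
import torch.nn as nn

class EnergyNet(nn.Module):
    def __init__(self, x_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(x_dim + 1, 256), nn.ReLU(),
                                 nn.Linear(256, 256), nn.ReLU(),
                                 nn.Linear(256, 1))
    def forward(self, x, sigma):            # sigma: (batch, 1) noise levels
        return self.net(torch.cat([x, sigma], dim=-1)).squeeze(-1)

def dsm_loss(energy, x, sigma):
    noise = torch.randn_like(x) * sigma     # perturb expert transitions
    x_tilde = (x + noise).requires_grad_(True)
    e = energy(x_tilde, sigma).sum()
    # Model score is the negative energy gradient; match it to the score
    # of the Gaussian perturbation kernel, -(x_tilde - x) / sigma^2.
    score = -torch.autograd.grad(e, x_tilde, create_graph=True)[0]
    target = -noise / sigma**2
    return ((score - target) ** 2).sum(dim=-1).mean()

# Annealing: train over a decreasing ladder of sigma values, and use the
# smooth reward r(s, s') = -energy(x, sigma_k) at the current level.
```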
Planning and Model-Predictive Control (MPAIL)
MPAIL (Han et al., 29 Jul 2025) replaces the policy in the adversarial imitation loop with a Model Predictive Path Integral (MPPI) planner, jointly learning cost and value functions via adversarial transitions and solving a KL-regularized trajectory optimization at each episode. The method demonstrates robust out-of-distribution generalization and interpretable, constraint-compatible planning.
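A bare-bones MPPI loop with a learned per-step cost conveys the planner-in-the-loop idea; this is a sketch under simplifying assumptions (zero nominal plan, no value function, no KL regularization), not MPAIL's full update:

```python
# MPPI planning sketch: sample action sequences, roll them through a
# (learned) dynamics model, and softmin-weight them by accumulated cost.
import numpy as np

def mppi_plan(s0, dynamics, cost, horizon=20, n_samples=256,
              noise_std=0.3, temperature=1.0, a_dim=2):
    noise = np.random.randn(n_samples, horizon, a_dim) * noise_std
    total_cost = np.zeros(n_samples)
    for k in range(n_samples):
        s = s0
        for t in range(horizon):
            a = noise[k, t]
            s = dynamics(s, a)            # learned forward model (assumption)
            total_cost[k] += cost(s, a)   # adversarially learned cost
    # Path-integral weighting: exponentiated negative cost, normalized.
    w = np.exp(-(total_cost - total_cost.min()) / temperature)
    w /= w.sum()
    plan = (w[:, None, None] * noise).sum(axis=0)
    return plan[0]                        # execute first action, then replan
```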
4. Theoretical Guarantees and Practical Comparison
Distribution-Matching Guarantees
Under sufficient model and policy expressivity, adversarial state-occupancy matching converges to the expert's transition distribution (Torabi et al., 2019). In the state-only setting, the saddle point of the min-max objective ensures that the occupancies or transition statistics are matched, up to the capacity of the discriminator and the support of demonstration data.
Optimal transport–based approaches provide well-defined metrics even when the learner’s and expert’s supports are disjoint, unlike KL-based objectives, which are not defined for non-overlapping supports (Chang et al., 2023).
LAIfO (Giammarino et al., 2023) shows theoretical bounds for partially observable settings, demonstrating that matching the latent transition distribution in belief-space tightly bounds the learner’s suboptimality with respect to the expert.
Sample Efficiency
Model-based and planning-based methods consistently demonstrate superior sample efficiency compared to purely model-free adversarial RL, reducing required environment interactions by 2–4× (Torabi et al., 2021, Han et al., 29 Jul 2025, Chang et al., 2023). Off-policy updates, non-adversarial divergence minimization, and analytic stopping criteria further reduce learning time and increase practical reliability (Boborzi et al., 2022, Diwan et al., 24 Jan 2025).
Generalization
Stagewise algorithms that decouple proxy transition modeling from behavioral alignment (e.g., Gavenski et al., 24 Jan 2026) exhibit improved generalization to unseen initial states, as measured by reduced variance of episodic returns and normalized performance exceeding the teacher in multiple MuJoCo continuous control domains.
5. Empirical Benchmarks and Performance
Modern UfO algorithms have been systematically evaluated on continuous control suites (MuJoCo, PyBullet, DeepMind Control Suite), simulated robotics (Ant, Minitaur, BipedalWalker), visual navigation, and complex high-dimensional tasks such as humanoid locomotion and martial arts motion imitation.
- OT-UfO achieves expert-level returns (normalized return ≈ 1.0) from as few as a single state-only demonstration, consistently beating prior baselines including adversarial and inverse-dynamics-based methods (Chang et al., 2023).
- DEALIO and LQR+GAIfO converge in 3–4× fewer rollouts than model-free adversarial methods at equal asymptotic performance (Torabi et al., 2021, Torabi et al., 2019).
- MPAIL and NEAR demonstrate strong out-of-distribution robustness and competitive or superior quantitative metrics on navigation and locomotion, even in model mismatch or sim-to-real settings (Han et al., 29 Jul 2025, Diwan et al., 24 Jan 2025).
- Stagewise UfO (Gavenski et al., 24 Jan 2026) not only matches but surpasses teacher-level performance (normalized performance above 1.0), while consistently displaying the lowest coefficient of variation among state-only imitation learning baselines.
Empirical evaluation consistently uses normalized return, average episodic reward, and coefficient of variation, supplemented by distance/energy alignment metrics, and in some cases, dynamic time-warping pose error or spectral arc-length for motion imitation tasks.
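For reference, the two headline metrics reduce to a few lines (a sketch; the random- and expert-policy baseline returns used for normalization are benchmark-specific):

```python
# Standard UfO evaluation metrics over a set of episodic returns.
import numpy as np

def normalized_return(returns, random_return, expert_return):
    # 0.0 = random-policy level, 1.0 = expert level; values above 1.0
    # indicate the learner surpasses the teacher.
    return (np.mean(returns) - random_return) / (expert_return - random_return)

def coefficient_of_variation(returns):
    # Relative spread of episodic returns; lower means more consistent.
    return np.std(returns) / np.mean(returns)
```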
6. Open Challenges and Research Frontiers
Despite considerable advances in algorithmic and empirical performance, the field of UfO faces several significant open problems (Burnwal et al., 20 Sep 2025, Torabi et al., 2019):
- Partial Observability and Visual Domain Shift: Learning robust invariants for third-person demonstrations, varying viewpoints, and different embodiments requires architectural innovation (e.g., context translation, domain adaptation).
- Scalability and Optimization: Scaling non-adversarial density models (flows/score-based) to high-dimensional and visual observation spaces remains a challenge.
- Demonstration Sparsity and Diversity: Reliable performance with few, noisy, or multimodal demonstrations is essential for practical deployment, motivating hybrid and hierarchical policy representations.
- Safe and Constrained Imitation: Safety guarantees under incomplete demonstration coverage and integration of constraint satisfaction into imitation planning are only beginning to be addressed.
- Performance Measurement and Theoretical Guarantees: Beyond episode return, more nuanced metrics (e.g., Wasserstein trajectory distance, behavioral alignment scores) are needed for evaluating imitation quality and generalization.
- Sample-Efficient Real-World Deployment: Reducing sample complexity to within practical limits for real robot learning, especially without simulator assistance, is an ongoing focus, as is sim-to-real transfer under dynamics mismatch.
Addressing these limitations will require further integration of model-based planning, foundation generative models, robust representation learning, and principled off-policy or offline RL optimization.
7. Connections to Related Fields and Future Directions
UfO is deeply interwoven with advances in offline RL, model-based planning, generative representation learning, energy-based modeling, and hierarchical control. Many algorithmic motifs—such as DICE-based divergence optimization, goal-conditioned imitation, and self-supervised skill acquisition—are being adapted between these communities (Burnwal et al., 20 Sep 2025, Torabi et al., 2019).
Future research directions include:
- Foundation models for universal imitation from large, diverse state-only datasets,
- Planning-based frameworks incorporating learned constraints and diverse expert behaviors,
- More interpretable and stable energy- or divergence-based imitation objectives,
- Advanced metrics and benchmarking for imitation quality, robustness, and safety,
- Transfer and adaptation across differing environments, morphologies, or multi-agent scenarios.
The field continues to move towards enabling truly general-purpose, robust, and data-efficient policy learning from rich, imperfect observations, without reliance on reward engineering or expert action access.