State-Only Imitation Learning (SOIL)
- State-Only Imitation Learning (SOIL) is a method that learns task policies solely from expert state trajectories, bypassing the need for explicit action data.
- It leverages model-based, model-free, and hierarchical techniques—such as inverse dynamics, adversarial training, and sub-goal decomposition—to infer hidden actions and align state distributions.
- SOIL is applied in robotics, dexterous manipulation, and continuous control, demonstrating competitive performance even in environments where collecting expert action annotations is impractical.
State-Only Imitation Learning (SOIL), also termed Learning from Observation (LfO), is a paradigm in which an agent acquires task-solving policies by imitating expert-generated state trajectories, while explicitly lacking access to the underlying expert actions. This approach relaxes the conventional imitation learning constraint of acquiring both state and action information, thus enabling the utilization of demonstration data that exists solely as state trajectories (e.g., from videos or raw logs). This shift broadens the applicability of imitation learning to domains where expert action annotations are impractical or unavailable.
1. Foundational Frameworks and Problem Formalization
SOIL is addressed within the overall context of reinforcement learning (RL) and imitation learning (IL), both typically defined over a Markov Decision Process (MDP) $(\mathcal{S}, \mathcal{A}, P, r, \gamma, \rho_0)$, where $\mathcal{S}$ is the set of states, $\mathcal{A}$ the set of actions, $P$ the transition dynamics, $r$ the reward function, $\gamma$ the discount factor, and $\rho_0$ the initial state distribution (Burnwal et al., 20 Sep 2025). A stationary policy $\pi$ induces an occupancy measure $\rho_\pi(s,a)$ that satisfies the Bellman flow constraint:
$$\sum_{a} \rho_\pi(s,a) \;=\; (1-\gamma)\,\rho_0(s) \;+\; \gamma \sum_{s',a'} P(s \mid s', a')\, \rho_\pi(s', a').$$
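As a quick numerical illustration (not drawn from any of the cited papers), the numpy sketch below builds a small random MDP, computes the discounted occupancy measure of an arbitrary stationary policy, and checks the Bellman flow constraint above; all variable names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_s, n_a, gamma = 4, 2, 0.9

# Random MDP: P[a, s, x] = P(x | s, a), initial distribution rho0, policy pi[s, a].
P = rng.random((n_a, n_s, n_s))
P /= P.sum(axis=-1, keepdims=True)
rho0 = rng.random(n_s)
rho0 /= rho0.sum()
pi = rng.random((n_s, n_a))
pi /= pi.sum(axis=-1, keepdims=True)

# State-to-state transition matrix under pi: P_pi[s, x] = sum_a pi(a | s) P(x | s, a).
P_pi = np.einsum("sa,asx->sx", pi, P)

# Discounted state visitation d solves d = (1 - gamma) * rho0 + gamma * P_pi^T d.
d = np.linalg.solve(np.eye(n_s) - gamma * P_pi.T, (1 - gamma) * rho0)

# Occupancy measure rho(s, a) = d(s) * pi(a | s).
rho = d[:, None] * pi

# Flow check: sum_a rho(s,a) == (1-gamma) rho0(s) + gamma * sum_{s',a'} P(s|s',a') rho(s',a').
lhs = rho.sum(axis=1)
rhs = (1 - gamma) * rho0 + gamma * np.einsum("xa,axs->s", rho, P)
print("Bellman flow constraint satisfied:", np.allclose(lhs, rhs))
```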
Traditional IL uses state-action pairs for behavioral cloning, inverse reinforcement learning (IRL), or adversarial IL such as GAIL. In SOIL, policies must be learned—using only state (or, in some cases, state transition) visitation information—by inferring or reconstructing action-level feedback or by aligning state(-transition) distributions.
SOIL methods generally fall into the following categories (Torabi et al., 2019, Burnwal et al., 20 Sep 2025):
- Model-based approaches: Use learned system dynamics (inverse or forward models) to infer hidden action information from state transitions.
- Model-free approaches: Learn directly from state(-transition) distributions via adversarial or reward-engineering objectives without explicit action inference.
- Hierarchical and Goal-directed decomposition: Decompose imitation into sub-goal prediction and low-level control, often leveraging hierarchical RL techniques.
2. Core Methodologies and Algorithmic Taxonomy
The taxonomy of SOIL algorithms can be systematically organized as follows, reflecting both the composition of the expert dataset and the learning strategy (Burnwal et al., 20 Sep 2025):
Methodological Class | Key Mechanism | Example Techniques |
---|---|---|
Inverse/Forward Dynamics | Estimate hidden actions or latent controls from state-only sequences | Inverse dynamics models, latent variable policies (Torabi et al., 2019, Radosavovic et al., 2020) |
Reward Engineering | Compute surrogate rewards from state similarity in embedding space | Embedding-based rewards, temporal/contrastive losses (Torabi et al., 2019, Burnwal et al., 20 Sep 2025) |
Adversarial Distribution-Matching | Minimax divergence minimization using discriminators over states or transitions | State-only GAIL, WAIfO, occupancy-based matching (Torabi et al., 2019, Boborzi et al., 2022, Wang et al., 2023, Huang et al., 7 Oct 2024) |
Goal-from-Observation (Hierarchical) | Meta-policy selects achievable demonstration goals, low-level policy solves reaching | Hierarchical frameworks (e.g., SILO) (Lee et al., 2019) |
Model-based Approaches:
Inverse dynamics models seek a function $g_\phi : \mathcal{S} \times \mathcal{S} \to \mathcal{A}$, trained from environment samples to minimize the action prediction error, and then used to generate surrogate actions for expert sequences (Torabi et al., 2019, Radosavovic et al., 2020). Forward dynamics models with latent (“action-like”) variables can first train a latent policy $\pi(z \mid s)$ such that the predicted next state $\hat{s}_{t+1} = f(s_t, z_t)$ matches observed expert transitions, with a mapping later established between the latent actions $z$ and real actions $a$.
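The following is a minimal numpy sketch of this model-based pipeline in the style of BCO-type methods, assuming toy linear dynamics and a least-squares inverse dynamics model in place of the neural networks used in practice; all names, shapes, and data are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: linear dynamics s' = A s + B a + noise.
dim_s, dim_a, n = 4, 2, 5000
A = 0.1 * rng.normal(size=(dim_s, dim_s))
B = rng.normal(size=(dim_s, dim_a))

# Exploration data collected by the imitator's own (random) policy.
s = rng.normal(size=(n, dim_s))
a = rng.normal(size=(n, dim_a))
s_next = s @ A.T + a @ B.T + 0.01 * rng.normal(size=(n, dim_s))

# Step 1: fit an inverse dynamics model a_hat = g(s, s') by least squares.
X = np.hstack([s, s_next])
W, *_ = np.linalg.lstsq(X, a, rcond=None)

# Step 2: label an action-free expert trajectory with inferred surrogate actions.
expert_states = rng.normal(size=(100, dim_s))        # stand-in for expert state data
expert_pairs = np.hstack([expert_states[:-1], expert_states[1:]])
inferred_actions = expert_pairs @ W

# Step 3: behavioral cloning on (expert state, inferred action) pairs.
policy_W, *_ = np.linalg.lstsq(expert_states[:-1], inferred_actions, rcond=None)
print("inferred action for first expert transition:", inferred_actions[0])
```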
Model-Free Approaches:
In adversarial variants, the minimax objective becomes
$$\min_{\pi} \max_{D} \;\; \mathbb{E}_{(s,s') \sim \rho_E}\!\left[\log D(s, s')\right] + \mathbb{E}_{(s,s') \sim \rho_\pi}\!\left[\log\big(1 - D(s, s')\big)\right],$$
with the goal of aligning the state-transition occupancy measure between imitator and expert (Torabi et al., 2019, Wang et al., 2023, Huang et al., 7 Oct 2024).
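A minimal sketch of the discriminator objective and the resulting surrogate reward, written in one common convention where the discriminator outputs the probability that a state-transition pair comes from the expert; the function names are illustrative, and `D` is assumed to be any callable scoring batches of (s, s') pairs.

```python
import numpy as np

def discriminator_objective(D, expert_pairs, agent_pairs):
    """GAN-style objective over (s, s') pairs; D outputs P(pair is expert data).

    The discriminator ascends this value, while the imitating policy descends it
    indirectly by maximizing the surrogate reward below with any RL algorithm.
    """
    d_exp = np.clip(D(expert_pairs), 1e-6, 1.0 - 1e-6)
    d_agt = np.clip(D(agent_pairs), 1e-6, 1.0 - 1e-6)
    return np.mean(np.log(d_exp)) + np.mean(np.log(1.0 - d_agt))

def imitation_reward(D, agent_pairs):
    """Surrogate reward r(s, s') = -log(1 - D(s, s')) for the policy update."""
    d_agt = np.clip(D(agent_pairs), 1e-6, 1.0 - 1e-6)
    return -np.log(1.0 - d_agt)
```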
Reward engineering can involve defining
$$r_t \;=\; -\big\| \phi(s_t) - \phi(s_t^{E}) \big\|^2,$$
where $\phi$ is a learned or fixed state embedding, potentially leveraging contrastive or temporal constraints for robustness (Torabi et al., 2019).
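A minimal sketch of such an embedding-based reward, assuming the time alignment between agent and expert states is handled by the caller; `phi` is a stand-in for whatever learned or fixed encoder is used.

```python
import numpy as np

def embedding_reward(phi, agent_state, expert_state):
    """Surrogate reward: negative squared distance in a (learned) embedding.

    phi maps a raw state vector to a feature vector; matching agent and
    expert time steps (or nearest neighbours) is assumed to happen elsewhere.
    """
    diff = phi(agent_state) - phi(expert_state)
    return -float(np.dot(diff, diff))

# Toy usage with an identity embedding.
r = embedding_reward(lambda x: x, np.array([0.1, 0.2]), np.array([0.0, 0.25]))
```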
Hierarchical and Goal-based Decomposition:
Meta-policies select temporally feasible or reachable demonstration states (“sub-goals”) given the agent’s current capabilities, while low-level controllers attempt to reach those sub-goals (Lee et al., 2019). This enables effective handling of differences in dynamics, embodiment, or workspace relative to the demonstrator.
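A sketch of the meta-policy's goal-selection step in this spirit; `reachability` is a stand-in for a learned estimate of whether the low-level controller can reach a candidate demonstration state, and the function is illustrative rather than the exact SILO procedure.

```python
def select_subgoal(current_state, demo_states, reachability):
    """Pick the furthest-along demonstration state judged reachable.

    reachability(s, g) is assumed to return a score in [0, 1] estimating
    whether the low-level controller can reach g from s; demonstration
    states the agent cannot reproduce (different dynamics or embodiment)
    are simply skipped.
    """
    goal = None
    for candidate in demo_states:
        if reachability(current_state, candidate) > 0.5:
            goal = candidate          # keep the latest reachable state
    # If nothing is judged reachable, fall back to the earliest demonstration state.
    return goal if goal is not None else demo_states[0]
```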
3. Theoretical Properties and Divergence-minimization Objectives
A central theme in SOIL is the minimization of divergence between the expert’s and the imitator’s visitation measures:
$$\min_{\pi} \; D\!\left(\mu_\pi \,\|\, \mu_E\right),$$
where $\mu_\pi$ and $\mu_E$ are, e.g., state or state–transition distributions under the imitator and expert, respectively (Boborzi et al., 2022, Freund et al., 2023, Burnwal et al., 20 Sep 2025).
- Adversarial methods implement this by training a state(-transition)-level discriminator via GAN or diffusion-driven losses (Wang et al., 2023, Huang et al., 7 Oct 2024).
- Non-adversarial (density-matching) methods optimize, for example, the KL divergence between $\mu_\pi$ and $\mu_E$, rewriting the objective into maximum-entropy RL or SAC-style updates with rewards computed from conditional state-transition densities (Boborzi et al., 2022); a minimal sketch of such a reward follows this list.
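The sketch below illustrates a density-based reward of this kind, assuming access to learned estimators of the expert's and the imitator's conditional transition log-densities (e.g., normalizing flows or kernel density estimates); all names are illustrative.

```python
def density_ratio_reward(log_q_expert, log_q_agent, s, s_next):
    """Reward from conditional state-transition log-densities.

    log_q_expert and log_q_agent are stand-ins for learned estimators of
    log q_E(s' | s) and log q_pi(s' | s).  Maximizing this reward with a
    max-entropy RL algorithm (e.g., SAC) pushes the imitator's transition
    distribution toward the expert's.
    """
    return log_q_expert(s_next, s) - log_q_agent(s_next, s)
```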
Several works propose further refinements: robust approaches handle transition dynamics mismatch by optimizing over families of plausible transition models or integrating worst-case action perturbations (Gangwani et al., 2020, Viano et al., 2022); hierarchical and decoupled schemes construct planners operating at the state or sub-goal level, abstracting policy optimization away from direct state-to-action mappings (Liu et al., 2022).
4. Practical Applications, Empirical Results, and Performance Characteristics
SOIL has demonstrated efficacy in a range of domains, with notable performance in:
- Robotics: real-world robot arms performing reaching or manipulation tasks using only state-based demonstrations or third-person videos, benefitting from sample-efficient adversarial methods and hierarchical goal decomposition (Torabi et al., 2019, Lee et al., 2019).
- Dexterous Manipulation: state-only demonstrations enable learning in high-dimensional control settings where collecting action data is prohibitive; joint inverse-model/policy training yields performance competitive with state-action baselines (Radosavovic et al., 2020).
- Continuous Control: methods such as conditional density estimation, diffusion-based adversarial discriminators, or coupled normalizing flows achieve superior or state-of-the-art performance across benchmarks, even with as little as a single demonstration trajectory (Freund et al., 2023, Wang et al., 2023, Huang et al., 7 Oct 2024, Agrawal et al., 17 Aug 2024, Agrawal et al., 2023).
- Zero-shot Transfer and Domain Adaptation: robust SOIL methods generalize favorably across transition dynamics and morphology mismatches, supporting sim-to-real transfer and heterogeneous embodiment scenarios (Liu et al., 2019, Gangwani et al., 2020, Viano et al., 2022, Liu et al., 2022).
- Interactive and Human-guided Learning: frameworks such as TIPS allow non-expert users to teach policies via state-space corrections, with lower cognitive burden and empirically improved learning curves relative to classic IL techniques (Jauhri et al., 2020).
Empirical results across these settings consistently show that, while SOIL policies may converge more slowly or suffer modest performance loss outside the demonstrated distribution, appropriately designed algorithms (e.g., those incorporating optimistic exploration, robust loss formulations, or diffusion-model discriminators) can achieve performance competitive with conventional action-aware methods.
5. Connections to Related Learning Paradigms
SOIL methodologies overlap with and draw from several adjacent fields:
- Offline RL: Algorithms such as DICE-based methods exploit Bellman flow constraints and leverage offline imitator data to reduce distribution shift and enable safe deployment (Burnwal et al., 20 Sep 2025).
- Model-based RL: The use of learned forward and inverse models for prediction, planning, and reward estimation in SOIL closely relates to model-based exploration and planning in RL.
- Hierarchical RL: Meta/controller decomposition in goal-driven SOIL methods parallels the structure seen in many hierarchical RL architectures, enhancing temporal abstraction and generalization across environments (Lee et al., 2019, Liu et al., 2022).
- Computer Vision and Perception: Advances in pose estimation, keypoint detection, and domain adaptation directly support robust SOIL, especially for visual demonstration data (Torabi et al., 2019).
- Diffusion and Flow-based Generative Modeling: Modern discriminators in SOIL now employ diffusion models or coupled normalizing flows for improved stability and more accurate density estimation, which in turn improves reward-signal fidelity and policy-training robustness (Wang et al., 2023, Huang et al., 7 Oct 2024, Freund et al., 2023).
6. Open Challenges and Future Research Directions
Ongoing challenges and open avenues for research in SOIL, as articulated in recent surveys and position papers (Burnwal et al., 20 Sep 2025), include:
- Robust Perceptual Front-ends: Addressing challenges in extracting precise state information from observational data, particularly in the context of noise, occlusion, and viewpoint/embodiment mismatch. Potential avenues involve advanced CV techniques and representation learning (e.g., using CycleGAN for domain adaptation).
- Sample Efficiency and Data Utilization: Although significant progress has been made (e.g., LQR+GAIfO, diffusion-based reward estimation), further improvements in sample efficiency—especially in real-world or safety-critical applications—require continued development of control-theoretic and model-based approaches.
- Integration with Reinforcement Learning: Combining SOIL techniques with RL (e.g., via bootstrapping, mixing imitation-based reward signals with sparse true rewards) may enhance learning robustness and policy generalization.
- Third-person and Unaligned Demonstrations: Leveraging demonstrations from different viewpoints, domains, or embodiments (e.g., human-to-robot transfer, YouTube videos) requires sophisticated domain alignment and viewpoint-invariant representation learning, as well as hierarchical or sub-goal decomposition.
- Automatic Discovery of Sub-goals and Hierarchies: Enabling agents to automatically identify and exploit subgoal structures within unannotated demonstration data remains an open problem for scaling SOIL to complex, long-horizon tasks.
- Safe Imitation: Developing evaluation metrics beyond accumulated return that emphasize safety and alignment with demonstrator intent; this is particularly important for SOIL in safety-critical domains.
7. Summary Table: Classes of SOIL Algorithms
Algorithm Class | Summary Description | Cited Works |
---|---|---|
Inverse/Forward Dynamics | Infer actions or latent controls from state transitions; joint policy/model learning | (Torabi et al., 2019, Radosavovic et al., 2020) |
Adversarial Distribution | Minimax discrimination over state(-transition) distributions between agent and expert | (Torabi et al., 2019, Wang et al., 2023, Huang et al., 7 Oct 2024) |
Reward Engineering | Construct reward signals from state (or embedding) similarity metrics | (Torabi et al., 2019, Burnwal et al., 20 Sep 2025) |
Goal-from-Observation | Hierarchical selection of achievable expert goals with low-level control | (Lee et al., 2019) |
Conditional Density Estimation | Markov or balance-equation satisfaction loss with (kernel or flow-based) estimators | (Agrawal et al., 2023, Agrawal et al., 17 Aug 2024) |
Diffusion/Flow-based | Diffusion/model-based discriminators, robust density matching | (Wang et al., 2023, Huang et al., 7 Oct 2024, Freund et al., 2023) |
References to Specific Papers
- (Torabi et al., 2019): Literature review and methodological summary of SOIL, introducing the distinction between model-based and model-free approaches and outlining open problems (perception, sample efficiency, viewpoint mismatch).
- (Lee et al., 2019): SILO—Meta-policy framework for selective imitation of reachable states, supporting hierarchical skill transfer and robust adaptation.
- (Liu et al., 2019): State alignment methods that leverage both local (β-VAE next-state prediction + inverse dynamics) and global (Wasserstein distribution distances) perspectives for improved stability under dynamics mismatch.
- (Radosavovic et al., 2020): Joint training of inverse dynamics and policy for dexterous manipulation from pure state observations in high-dimensional action spaces.
- (Boborzi et al., 2022): KL divergence minimization for non-adversarial SOIL, achieving robust policy learning with interpretable convergence metrics.
- (Liu et al., 2022): Decoupled policy optimization, learning transferable high-level planners separately from low-level skill modules.
- (Freund et al., 2023): Coupled flow-based density estimation for flexible state-only or state-transition occupancy matching.
- (Wang et al., 2023, Huang et al., 7 Oct 2024): Diffusion-based discriminators in SOIL, providing enhanced distribution modeling capacity and more reliable reward signals.
- (Agrawal et al., 17 Aug 2024, Agrawal et al., 2023): Markov balance and conditional density estimation frameworks for fully batch, offline SOIL settings, with theoretical consistency guarantees and competitive empirical results.
- (Burnwal et al., 20 Sep 2025): Comprehensive survey, taxonomy, and roadmap for SOIL research, emphasizing connections to hierarchical, model-based, and offline RL, and outstanding challenges: third-person LfO, subgoal discovery, safety, and evaluation metrics.
In summary, SOIL encompasses a broad class of methods—spanning model-based inference, adversarial distribution alignment, reward engineering, and hierarchical decomposition—that enable effective imitation from state-only trajectories. The field continues to advance with innovations in generative modeling, robust transfer, and integration with vision and planning, while addressing practical constraints in demonstration collection, environmental mismatch, and evaluation.