Sinkhorn Imitation Learning (SIL)
- Sinkhorn Imitation Learning is an imitation learning framework that minimizes the entropic optimal transport distance between learner and expert occupancy measures.
- It computes a transport plan over batches of state–action samples, with the ground cost defined in a feature space learned by an adversarial critic, yielding stable policy updates.
- Empirical evaluations demonstrate robust sample efficiency and competitive performance with methods like GAIL and AIRL in continuous control tasks.
Sinkhorn Imitation Learning (SIL) is an imitation learning framework in which the learner policy minimizes the Sinkhorn (entropic optimal transport) distance between its occupancy measure and that of an expert. Instead of classical f-divergences or adversarial discriminators, SIL employs a transport plan over batchwise state–action samples, with the cost defined in a learned feature space by an adversarial critic. SIL offers a principled, tractable minimax approach for aligning learner and expert behaviors, enhanced by the theoretical and algorithmic properties of entropic optimal transport (Papagiannis et al., 2020).
1. Occupancy Measures and Problem Setup
Let $\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, r, \gamma)$ be a $\gamma$-discounted, infinite-horizon Markov Decision Process (MDP). For a stochastic policy $\pi$, the induced occupancy measure over state–action pairs is given by
$$\rho_\pi(s, a) = \pi(a \mid s) \sum_{t=0}^{\infty} \gamma^{t} \Pr(s_t = s \mid \pi).$$
$\rho_E$ denotes the expert’s occupancy measure; $\rho_\pi$ that of the learner (Papagiannis et al., 2020). The fundamental objective in imitation learning is to drive $\rho_\pi$ close to $\rho_E$, typically measured by a divergence or metric on distributions. In SIL, this comparison is performed using the entropic OT (Sinkhorn) distance.
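As a concrete illustration of the setup above, the discount-weighted empirical occupancy measure can be estimated from sampled rollouts. This is a minimal sketch assuming hashable discrete states and actions; `empirical_occupancy` and the toy trajectories are illustrative, not part of the SIL implementation:

```python
def empirical_occupancy(trajectories, gamma=0.99):
    """Discount-weighted empirical occupancy over state-action pairs.

    `trajectories` is a list of [(s, a), ...] rollouts; each visit at
    time t contributes weight gamma**t, and the result is normalized
    to a probability distribution over (s, a) pairs.
    """
    weights = {}
    total = 0.0
    for traj in trajectories:
        for t, (s, a) in enumerate(traj):
            w = gamma ** t
            weights[(s, a)] = weights.get((s, a), 0.0) + w
            total += w
    return {sa: w / total for sa, w in weights.items()}

# Toy rollouts over discrete states (ints) and actions (chars).
rho = empirical_occupancy([[(0, 'l'), (1, 'r')], [(0, 'l')]], gamma=0.5)
```

With continuous control tasks one works instead with batches of sampled $(s, a)$ pairs, which is exactly what the batchwise Sinkhorn computation below consumes.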
2. Sinkhorn Distance: Definitions and Properties
Given two discrete measures $\mu = \sum_{i=1}^{n} a_i \delta_{x_i}$ and $\nu = \sum_{j=1}^{m} b_j \delta_{y_j}$ and ground cost $c(x, y)$, the $\epsilon$-Sinkhorn distance is defined as
$$\mathcal{W}_\epsilon(\mu, \nu) = \min_{P \in \Pi(a, b)} \langle P, C \rangle + \epsilon \sum_{i,j} P_{ij} \log P_{ij},$$
where $C_{ij} = c(x_i, y_j)$ and $\Pi(a, b) = \{ P \in \mathbb{R}_{+}^{n \times m} : P \mathbf{1} = a,\; P^{\top} \mathbf{1} = b \}$ (Papagiannis et al., 2020; Luise et al., 2018). As $\epsilon \to 0$, $\mathcal{W}_\epsilon$ converges to the (unregularized) Wasserstein distance.
Both the regularized ($\mathcal{W}_\epsilon$) and the sharp ($\overline{\mathcal{W}}_\epsilon$, which evaluates the unregularized cost $\langle P^{*}, C \rangle$ at the entropic-optimal plan) Sinkhorn distances are smooth on the product of probability simplices, supporting stable, unbiased gradient-based learning (Luise et al., 2018). This smoothness is essential for backpropagation in imitation learning.
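The entropic OT problem above can be solved by Sinkhorn scaling, sketched here in numpy. This is a minimal dense implementation with a fixed iteration count; `sinkhorn_distance` and the toy cost matrix are illustrative, not the paper's code:

```python
import numpy as np

def sinkhorn_distance(C, a, b, eps=0.1, n_iters=200):
    """Entropic OT via Sinkhorn scaling iterations.

    C: (n, m) ground-cost matrix; a, b: marginal weight vectors.
    Returns the transport plan P and the transport cost <P, C>.
    """
    K = np.exp(-C / eps)              # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)             # enforce column marginals
        u = a / (K @ v)               # enforce row marginals
    P = u[:, None] * K * v[None, :]   # plan: diag(u) K diag(v)
    return P, float((P * C).sum())

# Two point clouds where each learner sample has an exact expert
# match: the plan concentrates on the zero-cost pairings.
C = np.array([[0.0, 1.0], [1.0, 0.0]])
a = b = np.array([0.5, 0.5])
P, cost = sinkhorn_distance(C, a, b, eps=0.01)
```

With small `eps` the plan approaches the unregularized optimal coupling; larger `eps` blurs it toward the independent coupling, trading accuracy for faster, more stable convergence.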
3. SIL Minimax Formulation and Adversarial Critic
SIL learns a policy $\pi_\theta$ and critic $f_w$, optimizing a minimax objective:
$$\min_{\theta} \max_{w} \; \mathcal{W}_\epsilon\!\big(f_{w\#}\rho_{\pi_\theta},\; f_{w\#}\rho_E\big),$$
where $f_{w\#}\rho$ denotes the pushforward of occupancy $\rho$ through embedding $f_w$. The ground cost is defined via the cosine distance in feature space:
$$c_w(x, y) = 1 - \frac{\langle f_w(x), f_w(y) \rangle}{\|f_w(x)\|_2 \, \|f_w(y)\|_2}.$$
The critic parameterizes this feature space using a 2-layer MLP with 128 ReLU units per layer (Papagiannis et al., 2020). This adversarial learning of the cost function guides both the transport plan and the policy update.
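The cosine ground cost above can be sketched directly on batches of critic features. This is a minimal numpy version; `cosine_cost_matrix` is an illustrative name, and the feature matrices stand in for the output of the critic MLP:

```python
import numpy as np

def cosine_cost_matrix(fx, fy):
    """Pairwise cosine-distance cost c(x, y) = 1 - cos(f(x), f(y)).

    fx: (n, d) learner features, fy: (m, d) expert features, e.g.
    critic embeddings of state-action pairs. Entries lie in [0, 2].
    """
    fx = fx / np.linalg.norm(fx, axis=1, keepdims=True)
    fy = fy / np.linalg.norm(fy, axis=1, keepdims=True)
    return 1.0 - fx @ fy.T

# Aligned features cost 0; antipodal features cost 2.
fx = np.array([[1.0, 0.0]])
fy = np.array([[1.0, 0.0], [-1.0, 0.0]])
C = cosine_cost_matrix(fx, fy)
```

Because the cost depends only on feature directions, the critic can reshape which learner–expert pairs look "cheap" without being able to shrink the cost unboundedly, which helps keep the inner maximization well behaved.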
4. Algorithmic Implementation and Computational Aspects
At each iteration, batches of learner and expert trajectories are paired. For each batch pair, the cost matrix $C_{ij} = c_w(x_i, y_j)$ is computed, and the Sinkhorn plan is obtained by iterative scaling:
- Initialize $K = e^{-C/\epsilon}$, $u^{(0)} = \mathbf{1}_n$,
- Iterate $v \leftarrow b \oslash (K^{\top} u)$, $u \leftarrow a \oslash (K v)$ for $L$ steps,
- Final plan: $P = \mathrm{diag}(u)\, K\, \mathrm{diag}(v)$.
For each learner sample $x_i = (s_i, a_i)$, the reward proxy is
$$r_i = -\sum_{j} P_{ij}\, C_{ij}.$$
The policy is updated via standard policy-gradient or TRPO methods using $r_i$ as the reward, while the critic is updated by ascent on the Sinkhorn distance. Backpropagation is performed through the Sinkhorn computation, whose scaling iterations are differentiable in the cost matrix (Papagiannis et al., 2020; Luise et al., 2018). The per-iteration computational complexity is $O(L\,nm)$ for batch sizes $n, m$ and $L$ Sinkhorn iterations; further efficiencies are possible for gradient computation (Luise et al., 2018).
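The per-sample reward computation can be sketched end-to-end: solve for the plan on a batch cost matrix, then credit each learner sample with its (negated) row of transport cost. This is a hedged sketch, assuming uniform batch marginals; `sil_reward_proxy` and its fixed iteration count are illustrative, not the paper's exact implementation:

```python
import numpy as np

def sil_reward_proxy(C, eps=0.1, n_iters=200):
    """Per-sample reward proxy from the entropic transport plan.

    C: (n, m) cost matrix between n learner and m expert samples.
    Learner sample i receives r_i = -sum_j P_ij * C_ij, so cheap
    matches to expert samples yield high reward.
    """
    n, m = C.shape
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
    K = np.exp(-C / eps)
    u = np.ones(n)
    for _ in range(n_iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    P = u[:, None] * K * v[None, :]
    return -(P * C).sum(axis=1)       # one reward per learner sample

# Sample 0 sits near the expert batch; sample 1 is far from it.
C = np.array([[0.0, 0.1], [1.9, 2.0]])
r = sil_reward_proxy(C)
```

In the full algorithm this reward vector is fed to the policy-gradient update, while the critic ascends the same batchwise Sinkhorn objective by adjusting the features that generate `C`.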
5. Theoretical Analysis and Connections
SIL’s objective can be viewed as a causal-entropy-regularized IRL problem in which the distribution-matching term is the Sinkhorn distance between occupancy measures. As $\epsilon$ decreases, the entropic regularization vanishes and $\mathcal{W}_\epsilon$ recovers the Wasserstein metric (Papagiannis et al., 2020). The smoothness of the Sinkhorn distance underpins stable gradient-based policy optimization and ensures universal consistency: as the sample size grows, the empirical Sinkhorn risk converges to the population-level optimum (Luise et al., 2018). Under RKHS assumptions, the excess risk decays at an explicit polynomial rate in the sample size.
6. Empirical Evaluation and Performance
Experiments were conducted on MuJoCo environments (Hopper-v2, HalfCheetah-v2, Walker2d-v2, Ant-v2, Humanoid-v2), comparing SIL to Behavioral Cloning, GAIL, and AIRL. Metrics include true cumulative reward and Sinkhorn distance (cosine cost) between learner and expert. SIL demonstrates:
- Consistently minimal Sinkhorn distance between learner and expert occupancies, especially with few expert demonstrations,
- Reward performance on par with GAIL and AIRL, with superior sample efficiency in Ant and Humanoid,
- Notable effectiveness of adversarial feature learning: ablations with fixed (non-learned) cosine cost yield markedly inferior results (Papagiannis et al., 2020).
| Method | Performance Metric | Notable Findings |
|---|---|---|
| SIL | Reward, Sinkhorn dist. | Consistently matches expert, few-shot robust |
| GAIL, AIRL | Reward | Comparable reward, less stable on few demos |
| Behavioral Cloning | Reward | Inferior with limited expert data |
7. Role of Sinkhorn Gradients and Statistical Guarantees
In SIL and related learning tasks, the sharp Sinkhorn distance (as opposed to its regularized version) admits a closed-form, efficient gradient via backpropagation through the dual variables. This enables stable, unbiased updates for both critic and policy:
- The gradient with respect to the input measures requires inverting a structured (diagonal plus low-rank) matrix and scales quadratically in the support size with appropriate solvers (Luise et al., 2018).
- Gradients are smooth and free of bias terms associated with regularization, supporting stable structured prediction and variance reduction in policy gradients.
- Statistical guarantees include universal consistency and explicit learning rates, subject to standard regularity conditions (Luise et al., 2018).
A plausible implication is that SIL’s gradient structure, enabled by entropic smoothing, addresses both the optimization and variance bottlenecks typically associated with Wasserstein or adversarial critics in imitation learning.
References:
- "Imitation Learning with Sinkhorn Distances" (Papagiannis et al., 2020)
- "Differential Properties of Sinkhorn Approximation for Learning with Wasserstein Distance" (Luise et al., 2018)