
Sinkhorn Imitation Learning (SIL)

Updated 24 March 2026
  • Sinkhorn Imitation Learning is an imitation learning framework that minimizes the entropic optimal transport distance between learner and expert occupancy measures.
  • It employs a transport plan over batches of state–action samples, with a cosine cost defined in a feature space learned by an adversarial critic, yielding stable policy updates.
  • Empirical evaluations demonstrate robust sample efficiency and competitive performance with methods like GAIL and AIRL in continuous control tasks.

Sinkhorn Imitation Learning (SIL) is an imitation learning framework in which the learner policy minimizes the Sinkhorn (entropic optimal transport) distance between its occupancy measure and that of an expert. Instead of classical f-divergences or adversarial discriminators, SIL employs a transport plan over batchwise state–action samples, with the cost defined in a learned feature space by an adversarial critic. SIL offers a principled, tractable minimax approach for aligning learner and expert behaviors, enhanced by the theoretical and algorithmic properties of entropic optimal transport (Papagiannis et al., 2020).

1. Occupancy Measures and Problem Setup

Let $(S, A, P, r, \gamma)$ be a $\gamma$-discounted, infinite-horizon Markov Decision Process (MDP). For a stochastic policy $\pi$, the induced occupancy measure $\rho_\pi$ over state–action pairs $(s, a)$ is given by

$$\rho_\pi(s, a) = \sum_{t=0}^{\infty} \gamma^t P(s_t = s, a_t = a \mid \pi).$$

$\rho_E$ denotes the expert’s occupancy measure and $\rho_\pi$ that of the learner (Papagiannis et al., 2020). The fundamental objective in imitation learning is to drive $\rho_\pi$ close to $\rho_E$, typically measured by a divergence or metric on distributions. In SIL, this comparison is performed using the entropic OT (Sinkhorn) distance.
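In practice, occupancy measures are estimated from sampled rollouts. A minimal sketch for a small tabular MDP, with the discounted visit counts normalized by $(1 - \gamma)$ so the infinite-horizon sum forms a distribution (the function name and tabular setting are illustrative, not from the paper):

```python
import numpy as np

def empirical_occupancy(states, actions, n_states, n_actions, gamma=0.99):
    """Discounted visit counts: rho[s, a] = (1 - gamma) * sum_t gamma^t * 1{s_t = s, a_t = a}."""
    rho = np.zeros((n_states, n_actions))
    for t, (s, a) in enumerate(zip(states, actions)):
        rho[s, a] += gamma ** t
    # The (1 - gamma) factor normalizes the infinite-horizon geometric sum.
    return (1.0 - gamma) * rho

# A 3-step trajectory in a 2-state, 2-action MDP.
rho = empirical_occupancy([0, 1, 1], [1, 0, 0], n_states=2, n_actions=2, gamma=0.5)
```

With a finite trajectory the entries sum to slightly less than one; the truncation error vanishes as the rollout grows.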

2. Sinkhorn Distance: Definitions and Properties

Given two discrete measures $\mu = \{(x_i, p_i)\}_{i=1}^N$ and $\nu = \{(y_j, q_j)\}_{j=1}^M$ and a ground cost $C_{ij} = c(x_i, y_j)$, the $\varepsilon$-Sinkhorn distance is defined as

$$W_\varepsilon(\mu, \nu) = \min_{\Gamma \in U(\mu, \nu)} \langle \Gamma, C \rangle - \varepsilon H(\Gamma),$$

where $U(\mu, \nu) = \{\Gamma \geq 0 : \Gamma 1_M = p,\ \Gamma^T 1_N = q\}$ and $H(\Gamma) = -\sum_{i,j} \Gamma_{ij} \log \Gamma_{ij}$ (Papagiannis et al., 2020; Luise et al., 2018). As $\varepsilon \to 0$, $W_\varepsilon$ converges to the (unregularized) Wasserstein distance.

Both the regularized ($\widetilde{W}_\varepsilon$) and the sharp ($S_\varepsilon$) Sinkhorn distances are $\mathcal{C}^\infty$ on the product of probability simplices, supporting stable, unbiased gradient-based learning (Luise et al., 2018). This smoothness is essential for backpropagation in imitation learning.

3. SIL Minimax Formulation and Adversarial Critic

SIL learns a policy $\pi_\theta$ and a critic $f_\phi$, optimizing a minimax objective:

$$\min_{\pi_\theta} \max_{\phi} W_\varepsilon\big( f_{\phi\#} \rho_{\pi_\theta},\, f_{\phi\#} \rho_E \big),$$

where $f_{\phi\#}\rho$ denotes the pushforward of the occupancy $\rho$ through the embedding $f_\phi : S \times A \to \mathbb{R}^d$. The ground cost is defined via the cosine distance in feature space:

$$c_\phi((s, a), (s', a')) = 1 - \frac{\langle f_\phi(s, a), f_\phi(s', a') \rangle}{\|f_\phi(s, a)\|_2 \, \|f_\phi(s', a')\|_2}.$$

The critic parameterizes this feature space using a 2-layer MLP with 128 ReLU units per layer (Papagiannis et al., 2020). This adversarial learning of the cost function guides both the transport plan and the policy update.
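The cosine cost between batches of critic features can be computed in a fully vectorized way. A NumPy sketch, where the feature matrices stand in for outputs of the critic $f_\phi$:

```python
import numpy as np

def cosine_cost_matrix(F, G):
    """C[i, j] = 1 - <F[i], G[j]> / (||F[i]|| * ||G[j]||).

    F: (N, d) learner features, G: (M, d) expert features
    (stand-ins for outputs of the critic f_phi).
    """
    Fn = F / np.linalg.norm(F, axis=1, keepdims=True)
    Gn = G / np.linalg.norm(G, axis=1, keepdims=True)
    return 1.0 - Fn @ Gn.T

C = cosine_cost_matrix(np.array([[1.0, 0.0]]),
                       np.array([[2.0, 0.0], [0.0, 3.0]]))
# aligned features -> cost 0; orthogonal features -> cost 1
```

Because the cost depends only on feature directions, it is bounded in $[0, 2]$, which keeps the Sinkhorn kernel $\exp(-C/\varepsilon)$ numerically well behaved.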

4. Algorithmic Implementation and Computational Aspects

At each iteration, batches of learner and expert trajectories are paired. For each batch pair $(i, j)$, the cost matrix $C_\phi^{ij}$ is computed, and the Sinkhorn plan $\Gamma^{ij}$ is obtained by iterative scaling:

  • Initialize $K = \exp(-C_\phi / \varepsilon)$ (elementwise),
  • Iterate $u \leftarrow p \,./\, (K v)$, $v \leftarrow q \,./\, (K^T u)$ for $T$ steps,
  • Final plan: $\Gamma = \operatorname{diag}(u)\, K \operatorname{diag}(v)$.
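The scaling iterations above translate directly to NumPy; this is a textbook Sinkhorn sketch, not the authors' implementation:

```python
import numpy as np

def sinkhorn_plan(C, p, q, eps=0.1, n_iters=200):
    """Iterative scaling: u <- p ./ (K v), v <- q ./ (K^T u), Gamma = diag(u) K diag(v)."""
    K = np.exp(-C / eps)
    u = np.ones_like(p)
    v = np.ones_like(q)
    for _ in range(n_iters):
        u = p / (K @ v)
        v = q / (K.T @ u)
    return u[:, None] * K * v[None, :]   # same as diag(u) @ K @ diag(v)

C = np.array([[0.0, 1.0],
              [1.0, 0.0]])
p = q = np.array([0.5, 0.5])
Gamma = sinkhorn_plan(C, p, q, eps=0.1)
# marginals of Gamma recover p and q; mass concentrates on the cheap entries
```

For small $\varepsilon$ the kernel $K$ can underflow; practical implementations (e.g. log-domain Sinkhorn) stabilize this, which is omitted here for clarity.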

For each learner sample $(s, a)$, the reward proxy is

$$r_\phi(s, a) = -\sum_{(s', a') \in \tau_E^j} \Gamma_{(s, a), (s', a')} \; c_\phi\big((s, a), (s', a')\big),$$

where $\tau_E^j$ is the paired expert trajectory.

The policy is updated via standard policy-gradient or TRPO methods using $r_\phi(s, a)$ as the reward, while the critic $f_\phi$ is updated by ascent on $W_\varepsilon$. Backpropagation is performed through the Sinkhorn computation, where $\partial W_\varepsilon / \partial C = \Gamma$ (Papagiannis et al., 2020; Luise et al., 2018). The per-iteration computational complexity is $O(T N^2)$ for batch size $N$ and $T$ Sinkhorn iterations; further efficiencies are possible for gradient computation (Luise et al., 2018).
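Given a plan and a cost matrix, the reward proxy is just a row-wise negative inner product; a toy example (the matrices are illustrative stand-ins for $\Gamma$ and $C_\phi$):

```python
import numpy as np

# Illustrative transport plan and cost matrix for 2 learner x 2 expert samples.
Gamma = np.array([[0.4, 0.1],
                  [0.1, 0.4]])
C = np.array([[0.2, 0.9],
              [0.8, 0.1]])

# r_phi(s_i, a_i) = -sum_j Gamma[i, j] * C[i, j]: learner samples that are
# cheap to transport onto expert samples receive rewards close to zero.
rewards = -(Gamma * C).sum(axis=1)
```

Here the second sample is better matched to the expert batch (more mass on a low-cost pairing), so it receives the less negative reward.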

5. Theoretical Analysis and Connections

SIL’s objective is equivalent to a causal-entropy-regularized IRL problem in which the regularizer is $\mathcal{R}(\rho_\pi) = -W_\varepsilon(\rho_\pi, \rho_E)$. As $\varepsilon$ decreases, the regularization vanishes and $W_\varepsilon$ recovers the Wasserstein metric (Papagiannis et al., 2020). The $\mathcal{C}^\infty$ property of the Sinkhorn distance underpins stable gradient-based policy optimization and ensures universal consistency: as the sample size grows, the empirical Sinkhorn risk converges to the population-level optimum (Luise et al., 2018). Under RKHS assumptions, the excess risk decays at rate $O(\ell^{-1/4})$ in the number of samples $\ell$.

6. Empirical Evaluation and Performance

Experiments were conducted on MuJoCo environments (Hopper-v2, HalfCheetah-v2, Walker2d-v2, Ant-v2, Humanoid-v2), comparing SIL to Behavioral Cloning, GAIL, and AIRL. Metrics include true cumulative reward and Sinkhorn distance (cosine cost) between learner and expert. SIL demonstrates:

  • Consistently minimal Sinkhorn distance between learner and expert occupancies, especially with few expert demonstrations,
  • Reward performance on par with GAIL and AIRL, with superior sample efficiency in Ant and Humanoid,
  • Notable effectiveness of adversarial feature learning: ablations with fixed (non-learned) cosine cost yield markedly inferior results (Papagiannis et al., 2020).
| Method | Performance metric | Notable findings |
| --- | --- | --- |
| SIL | Reward, Sinkhorn dist. | Consistently matches expert; robust with few demonstrations |
| GAIL, AIRL | Reward | Comparable reward; less stable with few demonstrations |
| Behavioral Cloning | Reward | Inferior with limited expert data |

7. Role of Sinkhorn Gradients and Statistical Guarantees

In SIL and related learning tasks, the sharp Sinkhorn distance $S_\varepsilon$ (as opposed to its regularized version) admits a closed-form, efficient gradient via backpropagation through the dual variables. This enables stable, unbiased updates for both critic and policy:

  • The gradient with respect to the input measures requires inverting a structured (diagonal plus low-rank) matrix and scales as $O(n m^2)$ with appropriate solvers (Luise et al., 2018).
  • Gradients are smooth and free of bias terms associated with regularization, supporting stable structured prediction and variance reduction in policy gradients.
  • Statistical guarantees include universal consistency and explicit learning rates, subject to standard regularity conditions (Luise et al., 2018).
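The envelope identity $\partial W_\varepsilon / \partial C = \Gamma$ at the converged plan can be checked numerically with central differences. A self-contained sketch (textbook Sinkhorn, illustrative problem sizes):

```python
import numpy as np

def sinkhorn_value_and_plan(C, p, q, eps, n_iters=500):
    """Entropic OT: returns W_eps = <Gamma, C> - eps * H(Gamma) and the plan Gamma."""
    K = np.exp(-C / eps)
    u, v = np.ones_like(p), np.ones_like(q)
    for _ in range(n_iters):
        u = p / (K @ v)
        v = q / (K.T @ u)
    G = u[:, None] * K * v[None, :]
    # H(G) = -sum G log G, so W = <G, C> + eps * sum G log G.
    W = (G * C).sum() + eps * (G * np.log(G)).sum()
    return W, G

rng = np.random.default_rng(0)
C = rng.random((3, 3))
p = q = np.full(3, 1.0 / 3.0)
eps = 0.5
_, G = sinkhorn_value_and_plan(C, p, q, eps)

# Envelope theorem: perturbing one cost entry changes W_eps at rate Gamma[i, j].
h = 1e-5
Cp, Cm = C.copy(), C.copy()
Cp[0, 0] += h
Cm[0, 0] -= h
fd = (sinkhorn_value_and_plan(Cp, p, q, eps)[0]
      - sinkhorn_value_and_plan(Cm, p, q, eps)[0]) / (2 * h)
# fd should closely match G[0, 0]
```

This is why backpropagating a reward signal through the Sinkhorn layer is cheap: at convergence the plan itself is the gradient of the value with respect to the cost matrix.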

A plausible implication is that SIL’s gradient structure, enabled by entropic smoothing, addresses both the optimization and variance bottlenecks typically associated with Wasserstein- or adversarial-critics in imitation learning.

