
Sinkhorn Imitation Learning (SIL)

Updated 24 March 2026
  • Sinkhorn Imitation Learning is an imitation learning framework that minimizes the entropic optimal transport distance between learner and expert occupancy measures.
  • It employs a transport plan over batches of state–action samples, with a cosine cost defined in a feature space learned by an adversarial critic, yielding stable policy updates.
  • Empirical evaluations demonstrate robust sample efficiency and competitive performance with methods like GAIL and AIRL in continuous control tasks.

Sinkhorn Imitation Learning (SIL) is an imitation learning framework in which the learner policy minimizes the Sinkhorn (entropic optimal transport) distance between its occupancy measure and that of an expert. Instead of classical f-divergences or adversarial discriminators, SIL employs a transport plan over batchwise state–action samples, with the cost defined in a learned feature space by an adversarial critic. SIL offers a principled, tractable minimax approach for aligning learner and expert behaviors, enhanced by the theoretical and algorithmic properties of entropic optimal transport (Papagiannis et al., 2020).

1. Occupancy Measures and Problem Setup

Let $(S, A, P, r, \gamma)$ be a $\gamma$-discounted, infinite-horizon Markov Decision Process (MDP). For a stochastic policy $\pi$, the induced occupancy measure $\rho_\pi$ over state–action pairs $(s, a)$ is given by

$$\rho_\pi(s, a) = \sum_{t=0}^{\infty} \gamma^t P(s_t = s, a_t = a \mid \pi).$$

$\rho_E$ denotes the expert’s occupancy measure and $\rho_\pi$ that of the learner (Papagiannis et al., 2020). The fundamental objective in imitation learning is to drive $\rho_\pi$ close to $\rho_E$, typically measured by a divergence or metric on distributions. In SIL, this comparison is performed using the entropic OT (Sinkhorn) distance.
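In practice, occupancy measures are estimated from sampled rollouts. A minimal sketch for a small tabular MDP, with the discounted visit counts normalized by $(1 - \gamma)$ so the infinite-horizon sum forms a distribution (the function name and tabular setting are illustrative, not from the paper):

```python
import numpy as np

def empirical_occupancy(states, actions, n_states, n_actions, gamma=0.99):
    """Discounted visit counts: rho[s, a] = (1 - gamma) * sum_t gamma^t * 1{s_t = s, a_t = a}."""
    rho = np.zeros((n_states, n_actions))
    for t, (s, a) in enumerate(zip(states, actions)):
        rho[s, a] += gamma ** t
    # The (1 - gamma) factor normalizes the infinite-horizon geometric sum.
    return (1.0 - gamma) * rho

# A 3-step trajectory in a 2-state, 2-action MDP.
rho = empirical_occupancy([0, 1, 1], [1, 0, 0], n_states=2, n_actions=2, gamma=0.5)
```

With a finite trajectory the entries sum to slightly less than one; the truncation error vanishes as the rollout grows.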

2. Sinkhorn Distance: Definitions and Properties

Given two discrete measures $\mu = \{(x_i, p_i)\}_{i=1}^N$ and $\nu = \{(y_j, q_j)\}_{j=1}^M$ and a ground cost $C_{ij} = c(x_i, y_j)$, the $\varepsilon$-Sinkhorn distance is defined as

$$W_\varepsilon(\mu, \nu) = \min_{\Gamma \in U(\mu, \nu)} \langle \Gamma, C \rangle - \varepsilon H(\Gamma),$$

where $U(\mu, \nu) = \{\Gamma \geq 0 : \Gamma 1_M = p,\ \Gamma^T 1_N = q\}$ and $H(\Gamma) = -\sum_{i,j} \Gamma_{ij} \log \Gamma_{ij}$ (Papagiannis et al., 2020; Luise et al., 2018). As $\varepsilon \to 0$, $W_\varepsilon$ converges to the (unregularized) Wasserstein distance.

Both the regularized ($\widetilde{W}_\varepsilon$) and the sharp ($S_\varepsilon$) Sinkhorn distances are $\mathcal{C}^\infty$ on the product of probability simplices, supporting stable, unbiased gradient-based learning (Luise et al., 2018). This smoothness is essential for backpropagation in imitation learning.

3. SIL Minimax Formulation and Adversarial Critic

SIL learns a policy $\pi_\theta$ and a critic $f_\phi$, optimizing a minimax objective:

$$\min_{\pi_\theta} \max_{\phi} W_\varepsilon\big( f_{\phi\#} \rho_{\pi_\theta},\, f_{\phi\#} \rho_E \big),$$

where $f_{\phi\#}\rho$ denotes the pushforward of the occupancy $\rho$ through the embedding $f_\phi : S \times A \to \mathbb{R}^d$. The ground cost is defined via the cosine distance in feature space:

$$c_\phi((s, a), (s', a')) = 1 - \frac{\langle f_\phi(s, a), f_\phi(s', a') \rangle}{\|f_\phi(s, a)\|_2 \, \|f_\phi(s', a')\|_2}.$$

The critic parameterizes this feature space using a 2-layer MLP with 128 ReLU units per layer (Papagiannis et al., 2020). This adversarial learning of the cost function guides both the transport plan and the policy update.
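The cosine cost between batches of critic features can be computed in a fully vectorized way. A NumPy sketch, where the feature matrices stand in for outputs of the critic $f_\phi$:

```python
import numpy as np

def cosine_cost_matrix(F, G):
    """C[i, j] = 1 - <F[i], G[j]> / (||F[i]|| * ||G[j]||).

    F: (N, d) learner features, G: (M, d) expert features
    (stand-ins for outputs of the critic f_phi).
    """
    Fn = F / np.linalg.norm(F, axis=1, keepdims=True)
    Gn = G / np.linalg.norm(G, axis=1, keepdims=True)
    return 1.0 - Fn @ Gn.T

C = cosine_cost_matrix(np.array([[1.0, 0.0]]),
                       np.array([[2.0, 0.0], [0.0, 3.0]]))
# aligned features -> cost 0; orthogonal features -> cost 1
```

Because the cost depends only on feature directions, it is bounded in $[0, 2]$, which keeps the Sinkhorn kernel $\exp(-C/\varepsilon)$ numerically well behaved.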

4. Algorithmic Implementation and Computational Aspects

At each iteration, batches of learner and expert trajectories are paired. For each batch pair $(i, j)$, the cost matrix $C_\phi^{ij}$ is computed, and the Sinkhorn plan $\Gamma^{ij}$ is obtained by iterative scaling:

  • Initialize $K = \exp(-C_\phi / \varepsilon)$ (elementwise),
  • Iterate $u \leftarrow p \,./\, (K v)$, $v \leftarrow q \,./\, (K^T u)$ for $T$ steps,
  • Final plan: $\Gamma = \operatorname{diag}(u)\, K \operatorname{diag}(v)$.
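The scaling iterations above translate directly to NumPy; this is a textbook Sinkhorn sketch, not the authors' implementation:

```python
import numpy as np

def sinkhorn_plan(C, p, q, eps=0.1, n_iters=200):
    """Iterative scaling: u <- p ./ (K v), v <- q ./ (K^T u), Gamma = diag(u) K diag(v)."""
    K = np.exp(-C / eps)
    u = np.ones_like(p)
    v = np.ones_like(q)
    for _ in range(n_iters):
        u = p / (K @ v)
        v = q / (K.T @ u)
    return u[:, None] * K * v[None, :]   # same as diag(u) @ K @ diag(v)

C = np.array([[0.0, 1.0],
              [1.0, 0.0]])
p = q = np.array([0.5, 0.5])
Gamma = sinkhorn_plan(C, p, q, eps=0.1)
# marginals of Gamma recover p and q; mass concentrates on the cheap entries
```

For small $\varepsilon$ the kernel $K$ can underflow; practical implementations (e.g. log-domain Sinkhorn) stabilize this, which is omitted here for clarity.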

For each learner sample $(s, a)$, the reward proxy is

$$r_\phi(s, a) = -\sum_{(s', a') \in \tau_E^j} \Gamma_{(s, a), (s', a')} \; c_\phi\big((s, a), (s', a')\big),$$

where $\tau_E^j$ is the paired expert trajectory.

The policy is updated via standard policy-gradient or TRPO methods using $r_\phi(s, a)$ as the reward, while the critic $f_\phi$ is updated by ascent on $W_\varepsilon$. Backpropagation is performed through the Sinkhorn computation, where $\partial W_\varepsilon / \partial C = \Gamma$ (Papagiannis et al., 2020; Luise et al., 2018). The per-iteration computational complexity is $O(T N^2)$ for batch size $N$ and $T$ Sinkhorn iterations; further efficiencies are possible for gradient computation (Luise et al., 2018).
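Given a plan and a cost matrix, the reward proxy is just a row-wise negative inner product; a toy example (the matrices are illustrative stand-ins for $\Gamma$ and $C_\phi$):

```python
import numpy as np

# Illustrative transport plan and cost matrix for 2 learner x 2 expert samples.
Gamma = np.array([[0.4, 0.1],
                  [0.1, 0.4]])
C = np.array([[0.2, 0.9],
              [0.8, 0.1]])

# r_phi(s_i, a_i) = -sum_j Gamma[i, j] * C[i, j]: learner samples that are
# cheap to transport onto expert samples receive rewards close to zero.
rewards = -(Gamma * C).sum(axis=1)
```

Here the second sample is better matched to the expert batch (more mass on a low-cost pairing), so it receives the less negative reward.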

5. Theoretical Analysis and Connections

SIL’s objective is equivalent to a causal-entropy-regularized IRL problem in which the regularizer is $\mathcal{R}(\rho_\pi) = -W_\varepsilon(\rho_\pi, \rho_E)$. As $\varepsilon$ decreases, the regularization vanishes and $W_\varepsilon$ recovers the Wasserstein metric (Papagiannis et al., 2020). The $\mathcal{C}^\infty$ property of the Sinkhorn distance underpins stable gradient-based policy optimization and ensures universal consistency: as the sample size grows, the empirical Sinkhorn risk converges to the population-level optimum (Luise et al., 2018). Under RKHS assumptions, the excess risk decays at rate $O(\ell^{-1/4})$ in the number of samples $\ell$.

6. Empirical Evaluation and Performance

Experiments were conducted on MuJoCo environments (Hopper-v2, HalfCheetah-v2, Walker2d-v2, Ant-v2, Humanoid-v2), comparing SIL to Behavioral Cloning, GAIL, and AIRL. Metrics include true cumulative reward and Sinkhorn distance (cosine cost) between learner and expert. SIL demonstrates:

  • Consistently minimal Sinkhorn distance between learner and expert occupancies, especially with few expert demonstrations,
  • Reward performance on par with GAIL and AIRL, with superior sample efficiency in Ant and Humanoid,
  • Notable effectiveness of adversarial feature learning: ablations with fixed (non-learned) cosine cost yield markedly inferior results (Papagiannis et al., 2020).
| Method | Performance metric | Notable findings |
| --- | --- | --- |
| SIL | Reward, Sinkhorn dist. | Consistently matches expert; robust with few demonstrations |
| GAIL, AIRL | Reward | Comparable reward; less stable with few demonstrations |
| Behavioral Cloning | Reward | Inferior with limited expert data |

7. Role of Sinkhorn Gradients and Statistical Guarantees

In SIL and related learning tasks, the sharp Sinkhorn distance $S_\varepsilon$ (as opposed to its regularized version) admits a closed-form, efficient gradient via backpropagation through the dual variables. This enables stable, unbiased updates for both critic and policy:

  • The gradient with respect to the input measures requires inverting a structured (diagonal plus low-rank) matrix and scales as $O(n m^2)$ with appropriate solvers (Luise et al., 2018).
  • Gradients are smooth and free of bias terms associated with regularization, supporting stable structured prediction and variance reduction in policy gradients.
  • Statistical guarantees include universal consistency and explicit learning rates, subject to standard regularity conditions (Luise et al., 2018).
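The envelope identity $\partial W_\varepsilon / \partial C = \Gamma$ at the converged plan can be checked numerically with central differences. A self-contained sketch (textbook Sinkhorn, illustrative problem sizes):

```python
import numpy as np

def sinkhorn_value_and_plan(C, p, q, eps, n_iters=500):
    """Entropic OT: returns W_eps = <Gamma, C> - eps * H(Gamma) and the plan Gamma."""
    K = np.exp(-C / eps)
    u, v = np.ones_like(p), np.ones_like(q)
    for _ in range(n_iters):
        u = p / (K @ v)
        v = q / (K.T @ u)
    G = u[:, None] * K * v[None, :]
    # H(G) = -sum G log G, so W = <G, C> + eps * sum G log G.
    W = (G * C).sum() + eps * (G * np.log(G)).sum()
    return W, G

rng = np.random.default_rng(0)
C = rng.random((3, 3))
p = q = np.full(3, 1.0 / 3.0)
eps = 0.5
_, G = sinkhorn_value_and_plan(C, p, q, eps)

# Envelope theorem: perturbing one cost entry changes W_eps at rate Gamma[i, j].
h = 1e-5
Cp, Cm = C.copy(), C.copy()
Cp[0, 0] += h
Cm[0, 0] -= h
fd = (sinkhorn_value_and_plan(Cp, p, q, eps)[0]
      - sinkhorn_value_and_plan(Cm, p, q, eps)[0]) / (2 * h)
# fd should closely match G[0, 0]
```

This is why backpropagating a reward signal through the Sinkhorn layer is cheap: at convergence the plan itself is the gradient of the value with respect to the cost matrix.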

A plausible implication is that SIL’s gradient structure, enabled by entropic smoothing, addresses both the optimization and variance bottlenecks typically associated with Wasserstein- or adversarial-critics in imitation learning.

