ASAG: Adversarial Sinkhorn Attention Guidance

Updated 17 November 2025
  • ASAG is a technique that integrates entropic optimal transport with adversarial training to enhance attention mechanisms in diffusion and super-resolution networks.
  • It replaces traditional similarity-based attention with Sinkhorn-regularized transport costs, flattening spurious correlations and improving output fidelity.
  • Empirical results show significant improvements in FID, SSIM, and convergence speed across text-to-image diffusion and digital elevation map super-resolution tasks.

Adversarial Sinkhorn Attention Guidance (ASAG) refers to a family of techniques that inject entropic optimal transport (OT)–based perturbations into attention mechanisms or adversarial training objectives, across both diffusion-based generative modeling and adversarial super-resolution networks. The central theme is the deliberate introduction of an adversarial Sinkhorn-regularized transport cost to attention or generator-discriminator architectures, flattening or destabilizing spurious alignments and improving both fidelity and controllability in output distributions. ASAG has been established in the context of text-to-image diffusion sampling (Kim, 10 Nov 2025) and in Sinkhorn-regularized adversarial networks for guided digital elevation map super-resolution (Paul et al., 21 Sep 2024).

1. Mathematical Foundations: Optimal Transport and Sinkhorn Regularization

The theoretical foundation of ASAG is entropic optimal transport. Given a cost matrix $M \in \mathbb{R}^{n \times n}$ and two marginal distributions $\mu, \nu \in \Sigma_n = \{p \in \mathbb{R}^n_{\ge 0} : \sum_i p_i = 1\}$, the classical OT plan seeks a coupling $P \in U(\mu, \nu)$ minimizing $\langle P, M \rangle$, where $U(\mu,\nu) = \{P \in \mathbb{R}^{n \times n}_{\ge 0} : P \mathbf{1} = \mu,\; P^\top \mathbf{1} = \nu\}$. The entropic variant introduces a regularizer $H(P) = -\sum_{ij} P_{ij} \log P_{ij}$ with weight $\epsilon > 0$, leading to the problem

$$\min_{P \in U(\mu,\nu)} \langle P, M \rangle - \epsilon H(P).$$

The minimizer can be written as $P^* = \mathrm{diag}(u)\, \exp(-\lambda M)\, \mathrm{diag}(v)$, found iteratively via Sinkhorn–Knopp scaling:

$$\begin{aligned} K &= \exp(-\lambda M), \\ u^{(k+1)} &= \frac{\mu}{K v^{(k)}}, \\ v^{(k+1)} &= \frac{\nu}{K^\top u^{(k+1)}}, \end{aligned}$$

with convergence after a small number of iterations (often $T = 2$ or $T = 10$), depending on the application and computational constraints (Kim, 10 Nov 2025; Paul et al., 21 Sep 2024).
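
This scaling can be written compactly in a few lines. The following is a minimal PyTorch sketch with uniform marginals and illustrative defaults; the function name and parameter values are not taken from either paper.

```python
import torch

def sinkhorn_plan(M, mu, nu, lam=1.0, n_iters=10, tol=1e-9):
    """Entropic OT plan P* = diag(u) exp(-lam * M) diag(v) via Sinkhorn-Knopp scaling."""
    K = torch.exp(-lam * M)            # Gibbs kernel
    u = torch.ones_like(mu)
    v = torch.ones_like(nu)
    for _ in range(n_iters):           # alternate projections onto the two marginals
        u = mu / (K @ v + tol)
        v = nu / (K.T @ u + tol)
    return u.unsqueeze(1) * K * v.unsqueeze(0)

# Toy example with uniform marginals over n points.
n = 8
M = torch.rand(n, n)
mu = torch.full((n,), 1.0 / n)
nu = torch.full((n,), 1.0 / n)
P = sinkhorn_plan(M, mu, nu, lam=5.0)
print(P.sum(dim=0))                    # column sums match nu after the final v-update
```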

2. Adversarial Cost Injection in Self-Attention and GANs

ASAG modifies standard attention or adversarial networks by replacing similarity-focused dynamics with adversarial transport costs.

In Diffusion Models

Let $Q, K, V \in \mathbb{R}^{n \times d}$ be the query, key, and value matrices. Standard self-attention computes $A = \mathrm{softmax}(QK^\top/\sqrt{d})$ and $\mathrm{SA}(Q,K,V) = AV$. ASAG defines an adversarial cost $M^{\downarrow} = QK^\top$, so that high-similarity entries become expensive to transport, deliberately flattening strong correlations. The Sinkhorn plan $P_{\mathrm{ASA}} = \mathrm{Sinkhorn}(\lambda M^{\downarrow})$ replaces the attention map, yielding

$$\widetilde{A}_{ij} \approx \mathrm{softmax}\!\left(-\frac{1}{\epsilon} (QK^\top)_{ij}\right),$$

which, for small $\lambda$, approximates a uniform plan and uniformly redistributes attention mass (Kim, 10 Nov 2025).
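
As a sketch of how this substitution might look in code, the following replaces a single-head softmax attention with a row-normalized Sinkhorn plan over the adversarial cost $QK^\top$. The uniform marginals, the $T = 2$ default (matching the diffusion-sampling setting), and the final row renormalization are illustrative assumptions, not the paper's implementation.

```python
import torch

def adversarial_sinkhorn_attention(Q, K, V, lam=1.0, n_iters=2, tol=1e-9):
    """Single-head attention with the softmax map replaced by a Sinkhorn plan
    over the adversarial cost M = Q K^T (high similarity = expensive transport)."""
    n = Q.shape[0]
    M = Q @ K.T                                   # adversarial cost M^down
    K_mat = torch.exp(-lam * M)                   # Gibbs kernel
    mu = torch.full((n,), 1.0 / n, device=Q.device)
    nu = torch.full((n,), 1.0 / n, device=Q.device)
    u = torch.ones_like(mu)
    v = torch.ones_like(nu)
    for _ in range(n_iters):                      # T = 2 iterations during sampling
        u = mu / (K_mat @ v + tol)
        v = nu / (K_mat.T @ u + tol)
    P = u.unsqueeze(1) * K_mat * v.unsqueeze(0)   # transport plan with ~uniform marginals
    A = P / (P.sum(dim=-1, keepdim=True) + tol)   # renormalize rows to act as attention
    return A @ V
```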

In Adversarial Super-Resolution

In guided DEM super-resolution, ASAG employs a composite loss:

  • A pixelwise reconstruction loss,
  • An SSIM structural similarity term,
  • A classic GAN adversarial loss,
  • A Sinkhorn divergence regularizer.

Given generated and ground-truth distributions $\mu_\theta$ and $\nu$, with cost $C(\hat y, y) = \|\hat y - y\|_2^2$, the Sinkhorn divergence is

$$\mathcal{S}_{C,\varepsilon}(\mu_\theta, \nu) = \mathcal{W}_{C,\varepsilon}(\mu_\theta, \nu) - \frac{1}{2}\mathcal{W}_{C,\varepsilon}(\mu_\theta, \mu_\theta) - \frac{1}{2}\mathcal{W}_{C,\varepsilon}(\nu, \nu).$$

This regularization mitigates vanishing gradients and promotes stable adversarial learning (Paul et al., 21 Sep 2024).
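
The debiased divergence above can be sketched as follows, using the entropic transport value $\langle P, C \rangle$ as the $\mathcal{W}_{C,\varepsilon}$ term (one common convention; the paper's exact estimator may differ), squared-Euclidean cost between flattened samples, and the $\varepsilon = 0.1$, $T = 10$ defaults quoted later.

```python
import torch

def entropic_ot_cost(x, y, eps=0.1, n_iters=10, tol=1e-9):
    """Entropic OT value <P, C> between two point clouds with
    squared-Euclidean cost and uniform weights."""
    C = torch.cdist(x, y) ** 2                    # C(x_i, y_j) = ||x_i - y_j||^2
    n, m = C.shape
    mu = torch.full((n,), 1.0 / n, device=x.device)
    nu = torch.full((m,), 1.0 / m, device=x.device)
    K = torch.exp(-C / eps)
    u, v = torch.ones_like(mu), torch.ones_like(nu)
    for _ in range(n_iters):
        u = mu / (K @ v + tol)
        v = nu / (K.T @ u + tol)
    P = u.unsqueeze(1) * K * v.unsqueeze(0)
    return (P * C).sum()

def sinkhorn_divergence(x, y, eps=0.1, n_iters=10):
    """Debiased divergence S = W(x,y) - 0.5*W(x,x) - 0.5*W(y,y)."""
    return (entropic_ot_cost(x, y, eps, n_iters)
            - 0.5 * entropic_ot_cost(x, x, eps, n_iters)
            - 0.5 * entropic_ot_cost(y, y, eps, n_iters))
```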

3. Algorithmic Implementation

Diffusion Guidance Loop

At each sampling step $t$ (from $T$ down to $1$):

  • Compute the standard conditional score $\epsilon_\theta(x_t, c)$ and the adversarial score $\tilde\epsilon_\theta(x_t, c)$, the latter obtained by replacing all self-attention softmax operations with the adversarial Sinkhorn attention operator.
  • Compose the guided score $\epsilon^{\mathrm{ASAG}} = \epsilon_\theta(x_t, c) + s\,[\epsilon_\theta(x_t, c) - \tilde\epsilon_\theta(x_t, c)]$, where $s$ is a guidance-scale hyperparameter (typically $s \approx 1.5$).
  • Perform the standard DDIM/PLMS denoising step using $\epsilon^{\mathrm{ASAG}}$.

Integration is plug-and-play: no retraining or model modification is needed, as the only intervention is at the attention operator level during sampling (Kim, 10 Nov 2025).
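
A compact sketch of one guided step is shown below, assuming a hypothetical `model` interface that exposes a flag swapping the self-attention softmax for the Sinkhorn operator and a generic `ddim_step` update; neither name comes from the paper's code.

```python
def asag_guided_step(model, ddim_step, x_t, t, cond, s=1.5):
    """One ASAG-guided denoising step (sketch with placeholder interfaces)."""
    eps_cond = model(x_t, t, cond, sinkhorn_attention=False)   # standard conditional score
    eps_adv  = model(x_t, t, cond, sinkhorn_attention=True)    # adversarially perturbed score
    eps_asag = eps_cond + s * (eps_cond - eps_adv)             # composed guided score
    return ddim_step(x_t, eps_asag, t)                         # usual DDIM/PLMS update
```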

Adversarial Network Training

The generator $G_\theta$ is trained using the loss

$$\mathscr{L}_G = \lambda_P \mathscr{L}_P + \lambda_{\mathrm{SSIM}} \mathscr{L}_{\mathrm{SSIM}} + \lambda_{\mathrm{ADV}}\mathscr{L}_{\mathrm{ADV}}^G + \lambda_{\mathrm{OT}} \mathscr{L}_{\mathrm{OT}}.$$

Discriminator feature maps are aggregated to form spatial attention maps and passed through a Polarized Self-Attention (PSA) block. These maps spatially modulate the multispectral guide $z$ during generation, enforcing spatial coherence between the guide and the generated output. The Sinkhorn regularizer is computed via the matrix scaling described above, with $T=10$ iterations and $\varepsilon=0.1$ in practice (Paul et al., 21 Sep 2024).
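
The composite loss can be assembled as in the following sketch; the choice of L1 for the pixelwise term, the non-saturating adversarial formulation, and the weight values are illustrative assumptions rather than the paper's exact settings.

```python
import torch
import torch.nn.functional as F

def generator_loss(sr, hr, d_fake_logits, ssim_fn, sinkhorn_div,
                   w_pix=1.0, w_ssim=1.0, w_adv=1e-3, w_ot=1e-2):
    """Composite loss L_G = w_P*L_P + w_SSIM*L_SSIM + w_ADV*L_ADV^G + w_OT*L_OT (sketch).

    sr, hr        : generated and ground-truth DEM tiles
    d_fake_logits : discriminator output on sr
    ssim_fn       : callable returning SSIM in [0, 1]
    sinkhorn_div  : callable implementing S_{C,eps}; weights are illustrative.
    """
    l_pix  = F.l1_loss(sr, hr)                                   # pixelwise reconstruction
    l_ssim = 1.0 - ssim_fn(sr, hr)                               # structural similarity term
    l_adv  = F.binary_cross_entropy_with_logits(                 # fool the discriminator
        d_fake_logits, torch.ones_like(d_fake_logits))
    l_ot   = sinkhorn_div(sr.flatten(1), hr.flatten(1))          # Sinkhorn regularizer
    return w_pix * l_pix + w_ssim * l_ssim + w_adv * l_adv + w_ot * l_ot
```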

4. Empirical Results and Evaluation

Diffusion Applications

On MS-COCO (30K images, SDXL backbone), ASAG achieves the best unconditional and conditional performance:

  • Unconditional: FID $= 92.01$, IS $= 10.54$;
  • Conditional: FID $= 23.30$, CLIPScore $= 25.85$ with ASAG+CFG. Comparable improvements are obtained with the SD3 backbone.

On DrawBench and HPD benchmarks, ASAG+CFG yields the highest CLIPScore, PickScore, ImageReward, and Human-Preference scores. In downstream tasks, such as ControlNet under Canny/depth or IP-Adapter setups, ASAG preserves structural detail and enhances controllability without retraining (Kim, 10 Nov 2025).

Super-Resolution Applications

Across four DEM datasets, ASAG attains:

  • “Inside India”: RMSE $= 9.28$ m, SSIM $= 90.6\%$, PSNR $= 35.06$ dB;
  • “Outside India”: RMSE $= 15.74$ m, SSIM $= 83.9\%$, PSNR $= 31.56$ dB.

On ASTER/AW3D30, similar 10–18% RMSE gains are observed. Ablations confirm that including FSGA, PSA, and the Sinkhorn regularizer incrementally improves sharpness, fidelity, and convergence speed (Paul et al., 21 Sep 2024).

5. Architectural Variants and Integrations

Hybrid Transformer Generator and M-FSGA

The super-resolution variant employs a generator with an initial convolutional encoder, six hybrid transformer blocks (HTBs), and a decoder. Each HTB includes:

  • A Densely connected Multi-Residual Block (DMRB) for local context propagation,
  • A multi-headed Frequency-Selective Graph Attention (M-FSGA) module, implementing graph Fourier filtering for global context and high-frequency enhancement.

M-FSGA constructs a patch-graph and applies spectral filtering via graph Laplacian eigenvectors. For each attention head, the high-pass filter isolates salient frequencies, and multi-headed aggregation offers robust feature integration. The effective complexity scales as $\mathcal{O}((N-k)hwC)$ per head, a marked reduction relative to full MSA (Paul et al., 21 Sep 2024).
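
A minimal sketch of the high-pass graph filtering step is shown below, assuming a precomputed patch adjacency and a simple cutoff that discards the $k$ lowest-frequency Laplacian modes; both the graph construction and the cutoff rule are assumptions, as the paper's exact filter design may differ. Keeping $N-k$ modes is consistent with the stated per-head complexity.

```python
import torch

def graph_highpass_filter(X, A, k=8):
    """High-pass graph Fourier filtering of patch features (sketch).

    X : (N, C) patch features, A : (N, N) symmetric adjacency, k : modes to drop.
    """
    deg = A.sum(dim=1)
    L = torch.diag(deg) - A                      # combinatorial graph Laplacian
    _, U = torch.linalg.eigh(L)                  # eigenvectors, frequencies ascending
    U_high = U[:, k:]                            # discard the k lowest-frequency modes
    X_hat = U_high.T @ X                         # graph Fourier transform (high band)
    return U_high @ X_hat                        # back-project: high-frequency detail
```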

Discriminator-derived Conditional Attention

The discriminator provides intermediate feature maps, aggregated and passed through PSA to remove redundancy. During training and inference, these spatial attention maps guide the generator by modulating the multispectral input, ensuring topographically and spectrally aligned reconstructions (Paul et al., 21 Sep 2024).
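
A schematic of this guide modulation, with the PSA block abstracted as a placeholder callable and sigmoid gating as an illustrative choice, is sketched below.

```python
import torch

def modulate_guide(guide, disc_features, psa_block):
    """Modulate the multispectral guide with discriminator-derived attention (sketch).

    guide         : (B, C_g, H, W) multispectral guide z
    disc_features : list of (B, C_i, H, W) intermediate discriminator maps
    psa_block     : Polarized Self-Attention module returning a (B, 1, H, W)
                    spatial attention map (placeholder interface).
    """
    stacked = torch.cat(disc_features, dim=1)     # aggregate intermediate feature maps
    attn = torch.sigmoid(psa_block(stacked))      # spatial attention in [0, 1]
    return guide * attn                           # spatially re-weight the guide
```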

6. Theoretical Properties: Gradient Stability and Convergence

ASAG-equipped adversarial networks benefit from gradient stabilization and accelerated convergence:

  • The smoothness of the Sinkhorn loss implies bounded gradient variation (dependent on the entropic regularization $\varepsilon$ and smoothness constants of the cost).
  • Upper bounds on the expected gradient norm guarantee that, with $\varepsilon$ appropriately selected, gradients do not collapse in the adversarial regime.
  • Iteration complexity bounds improve over pure-GAN configurations, with empirical evidence of faster training convergence and mitigation of the vanishing gradient phenomenon (Paul et al., 21 Sep 2024).

7. Practical Considerations and Deployment

ASAG is designed for minimal integration and overhead. In diffusion sampling, the only additional computational cost is $2$ Sinkhorn iterations per attention-modified layer, adding roughly $+0.35$ s/prompt (H100 GPU) and $+0.2$ GB of memory. In super-resolution pipelines, $T=10$ Sinkhorn iterations per batch suffice. For both families, guidance scales, regularization strengths, and iteration counts can be left at their default values and still yield robust results. Stability is linked to the balance between entropy maximization and preservation of residual structure; excessively uniform transport plans may reduce output diversity, hence a moderate $\lambda$ is preferred (Kim, 10 Nov 2025; Paul et al., 21 Sep 2024).
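
For reference, the default settings quoted above can be collected into a small configuration sketch; the dictionary names are illustrative.

```python
# Defaults quoted in the text, gathered for reference (names are illustrative).
ASAG_DIFFUSION_DEFAULTS = {
    "guidance_scale_s": 1.5,          # score-composition weight
    "sinkhorn_iters": 2,              # per attention-modified layer
    "overhead_sec_per_prompt": 0.35,  # reported on an H100 GPU
    "overhead_memory_gb": 0.2,
}

ASAG_SUPER_RESOLUTION_DEFAULTS = {
    "sinkhorn_iters": 10,             # per batch
    "entropic_eps": 0.1,
}
```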

A plausible implication is that the ASAG paradigm may generalize to other structured generative models and adversarial settings that utilize differentiable attention or transport plans, provided entropic regularization is tuned to maintain gradient flow and sample diversity.
