Marginal Data Transport Distillation
- Marginal Data Transport Distillation (MDT-dist) is a framework that uses optimal transport to align data marginals in a teacher-student paradigm across different modalities.
- It employs entropic optimal transport and surrogate losses, such as Velocity Matching and Velocity Distillation, to distill knowledge for contrastive language–vision tasks and flow-based 3D generation.
- Empirical results demonstrate that MDT-dist improves zero-shot recognition and dramatically speeds up 3D generation, yielding significant gains in accuracy and reduced inference costs.
Marginal Data Transport Distillation (MDT-dist) is a class of distillation frameworks that use optimal transport principles to align the marginals of data distributions in a teacher-student paradigm. MDT-dist has been formulated for both contrastive language–vision training and for compressing multi-step flow models in 3D generation. Across both contexts, MDT-dist enforces strict marginal constraints and matches joint couplings between distributions, enabling efficient learning from weaker or fewer data pairs and yielding student models with substantially reduced inference cost while maintaining fidelity.
1. Entropic Optimal Transport Formulation in MDT-dist
In the OTTER framework for language-supervised zero-shot recognition, MDT-dist is instantiated as an entropic optimal transport (OT) problem over minibatches of $B$ paired examples with $\ell_2$-normalized teacher model embeddings $v_i$ (image) and $t_j$ (text). A similarity matrix $S \in \mathbb{R}^{B \times B}$ is constructed,
$$S_{ij} = \beta_v\, v_i^{\top} v_j + \beta_t\, t_i^{\top} t_j, \qquad S_{ii} = 0,$$
where $\beta_v, \beta_t$ control the within-modality similarities and setting $S_{ii} = 0$ zeros the diagonal.
The goal is to find a coupling $Q$ in the transport polytope with uniform marginals $\mathbf{a} = \mathbf{b} = \tfrac{1}{B}\mathbf{1}_B$, i.e.
$$\Pi(\mathbf{a}, \mathbf{b}) = \Big\{\, Q \in \mathbb{R}_{+}^{B \times B} \;:\; Q\,\mathbf{1}_B = \mathbf{a},\;\; Q^{\top}\mathbf{1}_B = \mathbf{b} \,\Big\},$$
that minimizes the entropic-regularized OT objective
$$\min_{Q \in \Pi(\mathbf{a},\mathbf{b})} \;\langle Q, -S \rangle \;-\; \varepsilon\, H(Q), \qquad H(Q) = -\sum_{i,j} Q_{ij} \log Q_{ij}.$$
The entropic regularizer $\varepsilon H(Q)$ promotes "soft" assignments. This OT problem is solved via repeated Sinkhorn–Knopp normalization,
$$Q = \operatorname{diag}(\mathbf{r})\, K\, \operatorname{diag}(\mathbf{c}), \qquad K = \exp(S / \varepsilon),$$
with scaling vectors $\mathbf{r}, \mathbf{c}$ updated by alternating normalization against the marginals: $\mathbf{r} \leftarrow \mathbf{a} \oslash (K\mathbf{c})$, $\mathbf{c} \leftarrow \mathbf{b} \oslash (K^{\top}\mathbf{r})$.
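A minimal PyTorch sketch of this batch-wise coupling computation follows; the function name, the value of $\varepsilon$, and the iteration count are illustrative rather than taken from OTTER, and the row-max subtraction and entry floor anticipate the stabilization discussed in Section 3.

```python
import torch

def sinkhorn_coupling(sim, eps=0.05, n_iters=50):
    """Entropic-OT coupling with uniform marginals via Sinkhorn-Knopp.

    sim: (B, B) similarity matrix built from teacher embeddings.
    eps, n_iters: illustrative hyperparameters, not the paper's settings.
    """
    B = sim.shape[0]
    a = sim.new_full((B,), 1.0 / B)  # uniform row marginal
    b = sim.new_full((B,), 1.0 / B)  # uniform column marginal
    # Gibbs kernel K = exp(S / eps); subtracting the row-wise max before
    # exponentiating prevents overflow, and the floor prevents division by zero.
    K = torch.exp((sim - sim.max(dim=1, keepdim=True).values) / eps).clamp_min(1e-8)
    r = torch.ones_like(a)
    c = torch.ones_like(b)
    for _ in range(n_iters):
        r = a / (K @ c)        # rescale rows toward marginal a
        c = b / (K.t() @ r)    # rescale columns toward marginal b
    return torch.diag(r) @ K @ torch.diag(c)  # coupling Q with the desired marginals
```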
2. MDT-dist in Flow-based 3D Generation: Surrogate Objectives
In the context of flow-based 3D generation, as in TRELLIS, MDT-dist aims to distill a multi-step pretrained teacher velocity field $v_\phi$ into a few-step or even single-step student $v_\theta$ (equivalently, a student transport map $g_\theta$). The distillation goal is to match the marginal data transport realized by the teacher's probability-flow ODE, i.e. the map $x_t \mapsto x_0$ obtained by integrating $\mathrm{d}x_s/\mathrm{d}s = v_\phi(x_s, s)$ from time $t$ down to $0$, for all $t \in [0, 1]$. Directly matching this transport is intractable, since it requires unrolling the full multi-step teacher ODE, so two surrogate losses are proposed:
- Velocity Matching (VM): penalizes the discrepancy between the student's instantaneous velocity along its own transport, $\hat{v}_\theta(x_t, t)$, and the frozen teacher's velocity field $v_\phi(x_t, t)$, where $\hat{v}_\theta$ is computed by finite difference and stop-gradient is applied through the backward (earlier-time) terms.
- Velocity Distillation (VD): a density-matching surrogate that aligns the student and teacher marginal densities along the probability-flow ODE. A schematic sketch of both surrogate losses follows this list.
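The exact surrogate expressions are defined in the source paper; the sketch below only illustrates the general structure under explicit assumptions: a linear (rectified-flow) interpolation with $t=0$ as data and $t=1$ as noise, a student exposed as a transport map `student_transport(x, t, s)` from time $t$ to an earlier time $s$, a frozen teacher velocity field `teacher_v(x, t)`, and illustrative values for the finite-difference step and the loss weight.

```python
import torch

def mdt_surrogates(student_transport, teacher_v, xt, t, dt=1e-2, lam_vd=1.0):
    """Schematic VM/VD surrogates (assumed interfaces, not the paper's exact form).

    student_transport(x, t, s): trainable few-/one-step map from state x at time t
                                to the state at an earlier time s < t.
    teacher_v(x, t):            frozen multi-step teacher velocity field (PF-ODE drift).
    xt, t:                      batch of intermediate states and their times.
    """
    t = t.view(-1, *([1] * (xt.dim() - 1)))  # make times broadcastable over xt

    # --- Velocity Matching (VM) --------------------------------------------
    # Finite-difference estimate of the student's instantaneous velocity at
    # (xt, t); the backward (earlier-time) input is detached as a stop-gradient.
    x_prev = student_transport(xt.detach(), t, t - dt)
    v_student = (xt.detach() - x_prev) / dt
    loss_vm = ((v_student - teacher_v(xt, t).detach()) ** 2).mean()

    # --- Velocity Distillation (VD) ------------------------------------------
    # Re-noise a student-generated sample to time t along a linear path and
    # penalize the gap between its implied velocity and the frozen teacher
    # velocity there; shrinking this gap pulls the student marginal toward the
    # teacher marginal along the probability-flow ODE.
    x0_hat = student_transport(xt.detach(), t, torch.zeros_like(t))
    noise = torch.randn_like(x0_hat)
    xt_student = (1.0 - t) * x0_hat + t * noise
    v_implied = noise - x0_hat               # linear-path velocity of that sample
    loss_vd = ((v_implied - teacher_v(xt_student, t).detach()) ** 2).mean()

    return loss_vm + lam_vd * loss_vd
```

Whether the two branches are alternated or summed, and with what weights, follows the training schedule described in Section 3; the combination above is purely illustrative.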
3. Loss Construction and Implementation Details
In OTTER, the optimal OT coupling $Q^\star$ provides the soft labels for student training. Student probabilities are defined via a softmax over temperature-scaled similarity logits,
$$P^{v \to t}_{ij} = \frac{\exp(\tilde{v}_i^{\top} \tilde{t}_j / \tau)}{\sum_{k}\exp(\tilde{v}_i^{\top} \tilde{t}_k / \tau)},$$
where $\tilde{v}_i, \tilde{t}_j$ are the student's image and text embeddings and $\tau$ is a temperature, and the OT-distillation loss for the image-to-text direction is the cross-entropy between the coupling and the student probabilities,
$$\mathcal{L}_{v \to t} = -\sum_{i,j} Q^{\star}_{ij}\, \log P^{v \to t}_{ij}.$$
A symmetric loss $\mathcal{L}_{t \to v}$ is computed for the text-to-image direction, and the final objective interpolates with the InfoNCE loss,
$$\mathcal{L} = (1 - \lambda)\, \mathcal{L}_{\mathrm{InfoNCE}} + \lambda\,\big(\mathcal{L}_{v \to t} + \mathcal{L}_{t \to v}\big).$$
In flow-based 3D MDT-dist, the VM and VD objectives are combined during training: the algorithm alternates between the VM and VD branches, updating the student parameters $\theta$ by descending the joint gradient.
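As a concrete illustration of the OTTER-style loss construction, the sketch below reuses the `sinkhorn_coupling` sketch from Section 1; the temperature, interpolation weight, and the row/column renormalization of the coupling are illustrative choices rather than the paper's exact settings.

```python
import torch
import torch.nn.functional as F

def otter_style_loss(img_emb, txt_emb, sim_teacher, tau=0.07, lam=0.5):
    """OT-distillation sketch: cross-entropy between the Sinkhorn coupling and
    the student's softmax probabilities, interpolated with InfoNCE.

    img_emb, txt_emb: L2-normalized student embeddings, shape (B, d).
    sim_teacher:      (B, B) teacher similarity matrix used for the OT targets.
    tau, lam:         illustrative temperature and interpolation weight.
    """
    B = img_emb.shape[0]
    logits = img_emb @ txt_emb.t() / tau                  # student logits, (B, B)

    # Soft targets from batch-wise entropic OT (sinkhorn_coupling as sketched
    # in Section 1); rows/columns renormalized so each target sums to one.
    with torch.no_grad():
        Q = sinkhorn_coupling(sim_teacher)                # (B, B), marginals 1/B
        Q_i2t = Q / Q.sum(dim=1, keepdim=True)            # targets for image -> text
        Q_t2i = (Q / Q.sum(dim=0, keepdim=True)).t()      # targets for text -> image

    loss_ot = -(Q_i2t * F.log_softmax(logits, dim=1)).sum(dim=1).mean() \
              - (Q_t2i * F.log_softmax(logits.t(), dim=1)).sum(dim=1).mean()

    # Standard InfoNCE against the original (diagonal) pairing.
    labels = torch.arange(B, device=logits.device)
    loss_nce = F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)

    return (1.0 - lam) * loss_nce + lam * loss_ot
```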
Sinkhorn iterations are implemented batch-wise with a fixed number of iterations per minibatch, stabilized by subtracting the row-wise maximum from each row before exponentiating and by applying a small floor to the matrix entries to avoid numerical underflow. Each iteration costs $O(B^2)$ for a batch of size $B$, so memory bounds the usable batch size. For MDT-dist in 3D, finite differences are used for the time derivatives, and the gradient flow through the density-matching objective is restricted (via stop-gradient) to control estimator bias.
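When $\varepsilon$ is small, even the row-shifted kernel can underflow; a standard log-domain Sinkhorn variant (shown below as an alternative, not necessarily what OTTER uses) performs all updates with `logsumexp` and only exponentiates at the end.

```python
import math
import torch

def sinkhorn_coupling_logdomain(sim, eps=0.05, n_iters=50):
    """Log-domain Sinkhorn with uniform marginals; defaults are illustrative."""
    B = sim.shape[0]
    log_a = torch.full((B,), -math.log(B), device=sim.device)  # log(1/B)
    log_b = log_a.clone()
    log_K = sim / eps                       # log of the Gibbs kernel exp(S / eps)
    f = torch.zeros(B, device=sim.device)   # log row scalings
    g = torch.zeros(B, device=sim.device)   # log column scalings
    for _ in range(n_iters):
        f = log_a - torch.logsumexp(log_K + g.unsqueeze(0), dim=1)
        g = log_b - torch.logsumexp(log_K + f.unsqueeze(1), dim=0)
    return torch.exp(f.unsqueeze(1) + log_K + g.unsqueeze(0))
```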
4. Empirical Results and Theoretical Guarantees
Experiments in OTTER use data-efficient setups: e.g., 3M Conceptual Captions pairs, orders of magnitude fewer than the 400M pairs used by CLIP. Models are evaluated zero-shot on Google Open Images (GOI; 19,958 classes) and ImageNet10K (IN10K; 10,032 classes). OTTER-trained models achieve FH@1 of 29.1% (vs. 26.8% for InfoNCE) on GOI and 12.0% (vs. 10.9%) on IN10K, surpassing even CLIP-RN50 on GOI in this data regime. Across 42 comparisons (architectures, datasets, metrics), OTTER outperforms or ties all baselines in 34/42 cases (Wu et al., 2021).
In TRELLIS-based 3D flow distillation, MDT-dist reduces sampling from two 25-step transformers (50 steps total) to 1–2 steps per transformer, yielding a 6.5x–9.0x speedup (latency reduced to 0.94–0.68 s) with strong retention of visual and geometric fidelity (e.g., a reported visual-quality score of 18.09 vs. 11.80 for the baseline, and geometric scores near parity). Compared to state-of-the-art consistency-model distillation methods (CM, PCM, sCM), MDT-dist achieves better scores across the reported visual and geometric metrics (Zhou et al., 4 Sep 2025).
Theoretical analysis formalizes the connection: bounding the velocity-matching error ensures that the primary (marginal data) transport error is likewise bounded, with an explicit error bound in terms of the velocity discrepancy, and the VD gradient is shown to recover the score-matching gradient of the KL divergence between the student and teacher marginals.
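The precise statements and constants are given in the source paper; the underlying identity behind the VD result, stated here only in generic form, is that for samples $x = g_\theta(\epsilon)$ from a reparameterized student with marginal $p_\theta^{t}$ and a fixed teacher marginal $q^{t}$,
$$\nabla_\theta\, \mathrm{KL}\!\left(p_\theta^{t} \,\|\, q^{t}\right) = \mathbb{E}_{x = g_\theta(\epsilon)}\!\left[\big(\nabla_x \log p_\theta^{t}(x) - \nabla_x \log q^{t}(x)\big)^{\top} \frac{\partial x}{\partial \theta}\right],$$
since the explicit $\nabla_\theta \log p_\theta^{t}$ term vanishes in expectation; for Gaussian probability paths the two scores can be expressed through the corresponding velocity fields, which is how a velocity-based objective can recover this gradient.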
5. Advantages, Limitations, and Extensions
MDT-dist differs from classic knowledge distillation in enforcing strict marginal constraints on batch-level couplings, accommodating many-to-many assignments and denoising noisy or weakly paired supervision. In OTTER, this allows nonzero mass on plausible image–text pairs that were not originally matched, unlike InfoNCE or teacher-softmax targets with unconstrained marginals.
Compared to label smoothing or classical distillation, MDT-dist incorporates modality-specific similarities and entropic smoothing, directly addressing the noise in large weakly paired datasets and yielding substantial data and sampling efficiency improvements.
Limitations include the computational cost of the OT computation (which limits minibatch size), the requirement for paired data in 3D generation, and the bias introduced by the finite-difference and stop-gradient approximations in the VM loss. Batch-local OT also does not resolve inter-batch alignments.
Potential extensions under active exploration include:
- Nonuniform marginals for variable sample confidence.
- Cross-batch/global-memory OT for larger context.
- Adaptive entropic regularization $\varepsilon$ for dynamic sharpening of couplings.
- Relaxing explicit geometry supervision by incorporating image-only score distillation.
- Generalization to other flow-based or cross-modal generation tasks (e.g., video–text, cross-modal latent flows), and improved numerical schemes (higher-order finite difference, reversible integration) to reduce estimator bias.
6. Summary Table: Key Elements of MDT-dist Across Contexts
| Context | Core Coupling/Objective | Distillation Loss | Practical Benefit |
|---|---|---|---|
| Language–Vision (OTTER) | Batch-wise entropic OT, uniform marginals | Cross-entropy between $Q^\star$ and student probabilities $P$ | Soft denoising, data efficiency |
| 3D Flow Generation (TRELLIS) | Transport via VM/VD | Velocity and density matching | Step reduction, inference speed |
Both instantiations of MDT-dist rely on matching not just output logits but the marginal transport structure connecting input and output (or data and teacher-transform path), using principled, theoretically supported losses.
7. Significance and Impact
MDT-dist provides a principled, theoretically grounded mechanism to replace hard or uniform labels in contrastive and flow distillation settings with joint couplings that enforce marginal alignment. This enables significant sampling reduction in generative models and dramatic data efficiency improvements in zero-shot recognition, outperforming standard knowledge distillation, label smoothing, and consistency-model-based alternatives for fine-grained alignment between data modalities (Wu et al., 2021, Zhou et al., 4 Sep 2025).