Distribution-Conditioned Transport

Published 5 Mar 2026 in cs.LG | (2603.04736v1)

Abstract: Learning a transport model that maps a source distribution to a target distribution is a canonical problem in machine learning, but scientific applications increasingly require models that can generalize to source and target distributions unseen during training. We introduce distribution-conditioned transport (DCT), a framework that conditions transport maps on learned embeddings of source and target distributions, enabling generalization to unseen distribution pairs. DCT also allows semi-supervised learning for distributional forecasting problems: because it learns from arbitrary distribution pairs, it can leverage distributions observed at only one condition to improve transport prediction. DCT is agnostic to the underlying transport mechanism, supporting models ranging from flow matching to distributional divergence-based models (e.g. Wasserstein, MMD). We demonstrate the practical performance benefits of DCT on synthetic benchmarks and four applications in biology: batch effect transfer in single-cell genomics, perturbation prediction from mass cytometry data, learning clonal transcriptional dynamics in hematopoiesis, and modeling T-cell receptor sequence evolution.

Summary

  • The paper introduces a novel framework that leverages distribution encoders with conditional generative models to achieve robust transport between probability distributions.
  • The methodology decouples embedding learning from the transport mapping, enabling efficient scaling, unbiased minibatch training with convergence guarantees, and strong OOD generalization.
  • Empirical results demonstrate lower out-of-distribution errors and improved batch effect correction, with applications in scRNA-seq, mass cytometry, and TCR evolution.

Distribution-Conditioned Transport: Framework, Numerical Results, and Implications

Formal Framework and Motivation

"Distribution-Conditioned Transport" (2603.04736) establishes a unifying framework for transport modeling between probability distributions that is explicitly designed for generalization to unseen source and target distributions. The approach leverages distribution encoders that generate latent representations of empirical sample sets, ensuring invariance with respect to both sample permutations and sizes. By auguring these embeddings into conditional generative models (including flow matching, divergence-based objectives, normalizing flows, and adversarial generators), the proposed framework enables three major operating modes: supervised (one-to-one), unsupervised (any-to-any), and semi-supervised transport, the latter incorporating orphan marginals—marginal distributions observed without paired counterparts.

The hierarchical structure of modern scientific datasets—such as those found in single-cell genomics, mass cytometry, lineage tracing, and TCR repertoire sequencing—often presents sparsity and incomplete pairing, making traditional paired distribution transport insufficient. The DCT framework is agnostic to the chosen transport mechanism and decouples embedding learning from the generative mapping, thus permitting broad applicability and efficient exploitation of available data.

Methodology

Distribution encoders $\mathcal{E}: S_i \to z_i \in \mathbb{R}^d$ are trained so that $z_i$ captures the distributional structure of the sample set $S_i$ without being influenced by sampling noise. Permutation and proportional invariance guarantee that the embedding depends only on the empirical measure, not on sample order or multiplicity. For downstream transport objectives with sufficient smoothness, the encoder admits a central limit theorem, enabling unbiased minibatch training with convergence guarantees—critical for scalability to large datasets.
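
The experiments reportedly use mean-pooled DeepSets aggregation for the encoder (see the glossary). A minimal PyTorch sketch of such a permutation-invariant encoder is shown below; the class name, layer widths, and embedding dimension are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class DistributionEncoder(nn.Module):
    """Permutation-invariant set encoder: z = rho(mean_j phi(x_j)).

    Mean pooling makes the embedding invariant to sample order and duplication;
    sizes are illustrative placeholders, not the paper's architecture."""

    def __init__(self, x_dim: int, hidden: int = 128, z_dim: int = 32):
        super().__init__()
        self.phi = nn.Sequential(
            nn.Linear(x_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden))
        self.rho = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, z_dim))

    def forward(self, S: torch.Tensor) -> torch.Tensor:
        # S: (n_samples, x_dim), one empirical sample set S_i
        h = self.phi(S).mean(dim=0)   # pool over samples -> (hidden,)
        return self.rho(h)            # distribution embedding z_i in R^d
```

Because the pooling step is a mean over samples, the embedding is unchanged by reordering and stabilizes as more samples are drawn, which is the property the CLT argument relies on.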

Two main forms of conditional transport maps are considered:

  • Source-conditioned: $T(x \mid z_{\text{src}})$, used for supervised tasks with explicit source-target pairs.
  • Source-target-conditioned: $T(x \mid z_{\text{src}}, z_{\text{tgt}})$, used for any-to-any and semi-supervised tasks, capitalizing on the full space of learned distributional embeddings (see the flow-matching sketch below).
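
In the flow-matching instantiation, conditioning simply means feeding the embeddings to the velocity field as extra inputs. The sketch below illustrates a source-target-conditioned flow-matching training loss; the straight interpolation path, the independent within-minibatch pairing of source and target samples, and all layer sizes are assumptions for illustration, not the paper's exact parameterization.

```python
import torch
import torch.nn as nn

class CondVelocityField(nn.Module):
    """Velocity field v(x, t | z_src, z_tgt) for conditional flow matching."""

    def __init__(self, x_dim: int, z_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(x_dim + 2 * z_dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, x_dim))

    def forward(self, x, t, z_src, z_tgt):
        return self.net(torch.cat([x, t, z_src, z_tgt], dim=-1))

def fm_loss(v, x_src, x_tgt, z_src, z_tgt):
    """Flow-matching loss along a straight interpolation path (illustrative).

    x_src, x_tgt: (m, x_dim) minibatches; z_src, z_tgt: (1, z_dim) embeddings."""
    t = torch.rand(x_src.shape[0], 1)
    x_t = (1 - t) * x_src + t * x_tgt           # point on the path
    target_v = x_tgt - x_src                    # straight-path velocity
    z_s = z_src.expand(x_src.shape[0], -1)      # broadcast embedding per sample
    z_t = z_tgt.expand(x_src.shape[0], -1)
    return ((v(x_t, t, z_s, z_t) - target_v) ** 2).mean()
```

Dropping the z_tgt input recovers the source-conditioned variant used for supervised one-to-one tasks.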

Semi-supervised learning leverages regression models to predict target embeddings from source embeddings within the set of available paired distributions, supporting generalization beyond training pairs.
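
The paper reports using ridge regression for this step. The following minimal scikit-learn sketch, with placeholder arrays and dimensions, shows how a predicted target embedding could then be fed back into the any-to-any transport map.

```python
import numpy as np
from sklearn.linear_model import Ridge

d = 32                                     # embedding dimension (placeholder)
Z_src_paired = np.random.randn(40, d)      # embeddings of paired source distributions
Z_tgt_paired = np.random.randn(40, d)      # embeddings of their paired targets

reg = Ridge(alpha=1.0).fit(Z_src_paired, Z_tgt_paired)

# For a distribution observed only at the source condition (e.g., an orphan
# marginal or a new test case), predict its target embedding and condition the
# any-to-any transport map T(x | z_src, z_hat_tgt) on the predicted pair.
z_src_new = np.random.randn(1, d)
z_hat_tgt = reg.predict(z_src_new)         # shape (1, d)
```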

Empirical Results and Numerical Evidence

The paper provides comprehensive numerical evaluation across synthetic and real datasets. Notable findings include:

  • Synthetic Gaussian and GMM Benchmarks: When scaling the number of distributions $K$, source-target-conditioned models consistently achieve lower out-of-distribution (OOD) error compared to K-to-K baselines, especially as $K$ increases. These baselines, which memorize discrete distribution identities, perform adequately on in-distribution targets for small $K$, but fail to interpolate for unseen targets. DCTs retain low error for both in-distribution and OOD targets, demonstrating superior generalization.
  • scRNA-seq Batch Effect Transfer: On murine pancreas scRNA-seq data, DCT models outperform both traditional batch correction methods (e.g., scVI, Harmony) and K-to-K models in transferring cells across held-out experimental batches, achieving substantially lower distances on held-out test distributions (e.g., 0.1164 vs. 0.3160 in SWD).
  • Drug Perturbation Prediction (Mass Cytometry): For patient-specific prediction of chemotherapy response in colorectal organoids, DCT semi-supervised models show improved generalization to unseen patients. OOD patient-level MMD distance is reduced by more than 35% relative to source-conditioned baselines.
  • Modeling Clonal Population Dynamics (scRNA-seq Lineage Tracing): Incorporation of orphan marginals via any-to-any pairing in DCT leads to significantly lower forecasting errors; e.g., semi-supervised models achieve MMD distances of 8.67 compared to 9.87 for supervised competitors.
  • TCR Sequence Evolution Forecasting: Using sequence embeddings from pretrained ESM2, the discrete flow matching DCT substantially improves energy distance and MMD-RBF on longitudinal immune repertoire forecasting; semi-supervised models halve error rates relative to supervised approaches.

The numerical results consistently document that DCT's generalization properties yield robust performance across application domains, particularly in OOD settings where conventional methods struggle.
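
Most of these comparisons are reported in maximum mean discrepancy (MMD). A standard RBF-kernel estimator is sketched below; the median-heuristic bandwidth is an assumption and may differ from the paper's evaluation protocol.

```python
import torch

def mmd_rbf(X: torch.Tensor, Y: torch.Tensor, bandwidth=None) -> torch.Tensor:
    """Squared MMD (V-statistic) between samples X: (n, d) and Y: (m, d)."""
    Z = torch.cat([X, Y], dim=0)
    d2 = torch.cdist(Z, Z) ** 2                    # pairwise squared distances
    if bandwidth is None:
        bandwidth = d2[d2 > 0].median().sqrt()     # median heuristic (assumption)
    K = torch.exp(-d2 / (2 * bandwidth ** 2))
    n = X.shape[0]
    k_xx, k_yy, k_xy = K[:n, :n], K[n:, n:], K[:n, n:]
    return k_xx.mean() + k_yy.mean() - 2 * k_xy.mean()
```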

Theoretical and Practical Implications

On the theoretical front, DCT extends meta-learning concepts to the space of probability measures, providing convergence proofs, mean consistency, and unbiased plug-in loss estimates—all validated through the functional delta method and empirical measure factorization. The use of permutation/proportional invariance for encoders aligns with recent efforts in kernel mean embeddings, deep set architectures, and generative distribution embeddings.
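
For context, the functional delta method invoked here is the standard one: if the empirical measure satisfies a CLT and the downstream functional is Hadamard differentiable, its plug-in estimate inherits asymptotic normality. Schematically (a textbook statement, not a reproduction of the paper's theorem):

$$\sqrt{m}\,\bigl(\hat{P}_m - P\bigr) \rightsquigarrow \mathbb{G}_P \quad\Longrightarrow\quad \sqrt{m}\,\bigl(\phi(\hat{P}_m) - \phi(P)\bigr) \rightsquigarrow \phi'_P(\mathbb{G}_P),$$

where $\hat{P}_m$ is the empirical measure over $m$ samples, $\phi$ is the encoder or loss functional, and $\phi'_P$ its Hadamard derivative. Applying this with $\phi = \mathcal{E}$ yields the CLT for distribution embeddings that underpins the minibatch loss estimates.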

Practically, the flexibility to condition transport on arbitrary distribution embeddings enables:

  • Zero-shot generalization to unseen conditions, donors, or experimental batches
  • Utilization of unpaired marginals for enhanced predictive modeling, reducing data waste
  • Modular combination of transport mechanisms (flow matching, sliced Wasserstein regression, MMD, etc.) with learned distributional context
  • Efficient scaling without quadratic computational overhead even when training on all pairwise combinations (see the training-loop sketch below)
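
The efficiency claim in the last bullet follows from sampling one source-target pair per optimization step rather than enumerating all pairs. A minimal training loop under that reading, reusing the hypothetical encoder and fm_loss sketched earlier (minibatch size, optimizer, and learning rate are placeholders):

```python
import random
import torch

def train_dct(encoder, velocity, sample_sets, n_steps=10_000, batch=256, lr=1e-3):
    """Any-to-any training sketch: per-step cost is independent of the number of
    pairwise combinations, since each step draws a single (source, target) pair.
    encoder, velocity, fm_loss refer to the earlier illustrative sketches."""
    params = list(encoder.parameters()) + list(velocity.parameters())
    opt = torch.optim.Adam(params, lr=lr)
    for _ in range(n_steps):
        S_src, S_tgt = random.sample(sample_sets, 2)     # one pair per step
        z_src = encoder(S_src).unsqueeze(0)              # (1, z_dim)
        z_tgt = encoder(S_tgt).unsqueeze(0)
        m = min(len(S_src), len(S_tgt), batch)           # common minibatch size
        x_src = S_src[torch.randperm(len(S_src))[:m]]
        x_tgt = S_tgt[torch.randperm(len(S_tgt))[:m]]
        loss = fm_loss(velocity, x_src, x_tgt, z_src, z_tgt)
        opt.zero_grad()
        loss.backward()
        opt.step()
```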

For biological applications, these properties provide notable improvements in batch integration, perturbation forecasting, dynamic inference, and evolutionary modeling, often with direct transferability to other domains with hierarchical, sparsely paired distributional datasets.

Limitations and Prospects

While DCT achieves strong OOD performance, it may underperform on in-distribution targets relative to memorization-based K-to-K models under limited training regimes. The scaling properties and optimization landscape of source-target-conditioned models could require further adaptation (e.g., additional training iterations or regularization) for maximum efficacy.

Future directions include:

  • Extension to higher-order multi-marginal transport and barycentric interpolation
  • Incorporation of additional structure or regularization in the generative models to enforce desirable coupling and avoid degenerate solutions (x-ignoring modes)
  • Application to trajectory inference, evolutionary modeling, and time-series data with complex dependence structures
  • Exploration of alternative distribution embeddings (e.g., kernel methods, transformer-based embeddings) for improved representation learning

Conclusion

Distribution-Conditioned Transport (2603.04736) formalizes a robust, theoretically grounded framework for learning transport maps between probability distributions with strong out-of-distribution generalization. Through the integration of distributional encoders and conditional generative models, DCT supports supervised, unsupervised, and semi-supervised transport, efficiently utilizing all available data, including orphan marginals. The empirical results demonstrate substantial performance gains on synthetic and real biological datasets, particularly in tasks where generalization to unseen contexts is critical. This approach represents a significant expansion of meta-learning principles in generative modeling, with broad implications for both theoretical and applied machine learning.

Explain it Like I'm 14

Distribution-Conditioned Transport (DCT): A simple guide

What’s this paper about?

This paper is about learning how to “translate” whole groups of data from one situation to another. Imagine each group is a cloud of points (like a big scatter plot of cells, people, images, or sequences). A “transport model” learns how to move points from a source cloud so they end up looking like a target cloud. The new idea here is to make that translation work even for new, never-seen-before groups. The authors call this approach Distribution-Conditioned Transport (DCT).

What big questions are they trying to answer?

Here are the main goals, in everyday terms:

  • Can we learn one flexible translator that works for many different groups, not just the pairs we saw during training?
  • Can we translate between any two groups (any-to-any), even ones we never saw together before?
  • Can we improve predictions by using “orphan” groups that only have one side of the pair (for example, only a “before” or only an “after”), instead of throwing that data away?

How does their approach work?

Think of this in three parts: making fingerprints, using a translator, and training wisely.

  1. Making fingerprints of whole groups
  • A “distribution” here means a whole set of samples (like all cells from one donor, all sequences from one patient, etc.).
  • The model first makes a compact “fingerprint” (a short vector) that summarizes the whole group.
  • Important: this fingerprint doesn’t change if you reorder the samples or duplicate some—just like a class average doesn’t depend on the order people are listed.
  • As you see more samples from a group, the fingerprint becomes more accurate and stable (like your estimate of average height becomes better as you measure more students).
  2. A translator that uses fingerprints
  • The “transport model” is the engine that moves individual points from the source cloud so the moved points look like the target cloud.
  • DCT “conditions” this engine on the fingerprints:
    • Source-conditioned: the engine sees the source group’s fingerprint (useful for one-to-one tasks where each source has a known target).
    • Source-and-target-conditioned: the engine sees both fingerprints (useful for any-to-any translation between new pairs at test time).
  • The engine itself can be any strong generator or flow model (for example, GANs, normalizing flows, or “flow matching” models). DCT is a wrapper that gives the engine better context.
  3. Training wisely, including partial data
  • During training, the model repeatedly:
    • Picks one or two groups,
    • Computes their fingerprints,
    • Tries to move samples from the source to match the target,
    • Updates itself to do better next time.
  • Semi-supervised trick: if you only have one side (say, just the “before” for some groups), you can still train the any-to-any translator using many possible pairs. Later, you can predict the target fingerprint from the source fingerprint with a small helper model and then translate.

An analogy:

  • Fingerprints = travel guides for each city (group).
  • Transport model = your GPS that tells you how to get from one city’s streets (source) to another’s (target).
  • Conditioning on fingerprints means the GPS gets both travel guides, so it can plan the route even for cities it hasn’t visited before.

What did they find?

The authors tested DCT on both simple simulations and several real biology problems. Here’s what they observed and why it matters.

  • Simple synthetic tests (Gaussian clouds of points):
    • Models trained to handle only a fixed set of groups (“K-to-K”) do well on the same groups but struggle on new ones.
    • DCT (any-to-any) learns a smooth space of fingerprints and translates well even to new, unseen targets.
    • Why it matters: It shows DCT actually generalizes, rather than memorizing.
  • Single-cell RNA sequencing (batch effect transfer):
    • Goal: predict how cells would look under a different lab batch or technical setting.
    • DCT made better predictions for new batches than several baselines (including widely used tools like scVI and Harmony and the “K-to-K” version).
    • Why it matters: Scientists can compare data across experiments more reliably.
  • Drug response in organoids (mass cytometry):
    • Goal: predict how cells from a cancer patient respond to different drugs.
    • Within known patients, supervised models did fine. But for new patients, DCT (any-to-any) generalized better.
    • Why it matters: Better predictions for new patients could guide experiments or treatments.
  • Clonal dynamics in blood cell development (single-cell lineage tracing):
    • Many clones are only measured at one timepoint (“orphans”).
    • Using DCT with any-to-any training plus these orphan groups led to better forecasting of future distributions than supervised models that used only paired data.
    • Why it matters: DCT uses more of the data you already have, improving predictions.
  • T-cell receptor (TCR) sequence evolution (discrete sequences):
    • With a discrete flow-based translator, semi-supervised DCT improved forecasting across patients, even when longitudinal data were scarce.
    • Why it matters: The approach works beyond continuous data, helping with sequences too.

A key pattern across tasks:

  • Supervised models (trained only on matched pairs) often do best on the exact kind of pairs they saw (in-distribution).
  • DCT’s source-and-target conditioning tends to do better when translating to new, unseen conditions (out-of-distribution), and it can exploit “orphan” data that others ignore.

Why is this important?

  • It’s flexible: DCT is not tied to one specific “engine.” You can plug in different transport models.
  • It’s data-efficient: It uses partial and unpaired data to learn better maps.
  • It generalizes: It aims to work on new situations at test time, not just the ones it memorized.
  • It’s practical: The authors showed gains on real biological problems where datasets are messy, multi-scale, and often incomplete.

What are the takeaways and future impact?

  • DCT gives researchers a “universal translator” for distributions: one system that can move between many conditions—even new ones—by reading each group’s fingerprint.
  • This can help in biology (harmonizing datasets, predicting treatment effects, forecasting cell or immune system changes) and can also apply to other fields (e.g., aligning measurements across hospitals, translating styles in imaging, or adapting models across cities or time in climate and economics).
  • Limitation to keep in mind: sometimes the DCT version may underperform on groups that look exactly like the training data, possibly because it needs more training or capacity to match supervised models’ specialization. Also, some transport engines can “cheat” by ignoring the source sample; the authors discuss diagnostics to catch this.

In short: DCT learns to summarize whole groups and uses those summaries to guide powerful translation engines. That lets it translate between new situations, learn from partial data, and often outperform traditional setups when stepping outside the training comfort zone.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a single, consolidated list of concrete gaps and unresolved questions that future work could address:

  • Theoretical generalization guarantees: What conditions on the metadistribution Q, encoder capacity, and transport class ensure that a model trained on finite pairs generalizes to unseen source–target distributions with provable bounds (e.g., sample complexity in number of distributions n vs per-distribution samples m, covering numbers over P(X))?
  • Encoder identifiability and injectivity: Under what assumptions does the distribution encoder produce embeddings that are (approximately) injective over relevant families of distributions, avoiding collisions where distinct distributions map to similar embeddings and degrade transport?
  • Sensitivity to encoder architecture and pooling: How do alternative invariant architectures and pooling operators (beyond mean-pooled DeepSets) affect CLT behavior, bias/variance, robustness to heterogeneous sample sizes, and downstream transport accuracy?
  • Embedding dimensionality selection: How to choose the embedding dimension d to balance expressivity vs overfitting across tasks, and how does d impact any-to-any generalization and degeneracy risks?
  • Finite-sample CLT accuracy: What are finite-sample error constants and Berry–Esseen-type rates for practical encoders in high dimensions, and how do these propagate through specific transport objectives (FM, MMD, SWD) to affect training bias and variance?
  • Coupling vs marginal matching ambiguity: Because T can match target marginals while ignoring individual source samples, what objectives or regularizers guarantee “alignment” (i.e., source-dependent couplings) and under what conditions can such degeneracy be ruled out?
  • Alignment diagnostics as training constraints: The paper proposes a diagnostic to detect source-ignorance; how can this be embedded into the loss (e.g., contrastive or mutual-information terms) to actively prevent degenerate conditional generation?
  • Choice of metadistribution Q_joint: How should one design or learn the source–target pair sampling policy to optimize downstream performance (e.g., curriculum strategies, importance sampling, task-aligned pair weighting), and what is the theoretical effect of Q_joint on learned maps?
  • Semi-supervised target embedding prediction: Beyond linear ridge, what models (e.g., probabilistic regressors with uncertainty, sequence models for time) better predict z_tgt from z_src, and how does uncertainty propagation into T(.|z_src,z_tgt) affect predictions?
  • Orphan marginal utility and negative transfer: Under what data regimes do unpaired distributions help vs hurt, and can criteria be developed to detect when semi-supervised training introduces bias or negative transfer?
  • Robustness to support mismatch: How does DCT behave when source and target have disjoint or weakly overlapping supports, singular measures, heavy tails, or manifold structures, and what mechanisms ensure stable transport in these regimes?
  • Transport mechanism choice: Which families (FM, divergence-minimization, normalizing flows, adversarial) most reliably leverage distribution conditioning across domains, and can we derive guidance or criteria for mechanism selection?
  • Discrepancy between synthetic and real performance: Flow matching generalized strongly OOD in Gaussians but not in real tasks; what dataset or objective characteristics explain this gap, and how can mechanisms be adapted to close it?
  • Discrete sequence transport limitations: The ProGen-based bridge collapsed to near-identical embeddings; why does this degeneracy occur, and what modifications (e.g., DFM variants, conditioning strategies, token-level alignment losses) prevent it for sequence data?
  • Evaluation beyond distributional distances: Current evaluation emphasizes MMD/SWD/Energy; how to assess coupling fidelity (e.g., pseudo-pairs, ground-truth correspondences), biological endpoint preservation, and cell-level identity consistency after transport?
  • Preservation of biological signal vs technical effects: In batch effect transfer, how to ensure biological variations are preserved while technical artifacts are corrected, and can task-specific constraints or disentangling objectives be incorporated?
  • Interpretability of embeddings: How do embedding dimensions relate to interpretable experimental factors (e.g., donor, time, treatment), and can methods be developed to probe, manipulate, or disentangle factors in embedding space?
  • Uncertainty quantification and calibration: How to quantify and report uncertainty over transported distributions and embeddings, especially in semi-supervised settings where z_tgt is predicted?
  • Scalability and compute: Although per-step costs avoid quadratic pair enumeration, what are the scaling limits in n (number of distributions), m (samples per distribution), and dimensionality, and are there caching, amortization, or streaming strategies to keep training tractable?
  • Impact of heterogeneous sample sizes: How does variability in per-distribution sample counts affect encoder variance, training stability, and fairness across tasks, and can reweighting or variance-aware objectives mitigate this?
  • Domain and modality generality: To what extent does DCT transfer to images, texts, and multi-modal settings; what adaptations are needed for cross-modality transport and for domains with complex discrete structures?
  • Path properties and physical constraints: For dynamic forecasting tasks, can path constraints (e.g., monotonicity, conservation laws, mechanistic priors) be enforced in T to ensure physically plausible transports?
  • Benchmarks and standardized splits: The paper evaluates on select biological tasks; establishing public benchmarks with standardized IID/OOD splits and known couplings would enable rigorous, reproducible comparison of DCT variants.
  • Effect of encoder–transport co-training: How does joint vs staged training (pretrain encoder, then transport) influence stability and generalization, and are there benefits to meta-learning or bilevel optimization over encoder parameters?
  • Regularization and hyperparameter scaling: Source–target-conditioned models sometimes underperform in-distribution; what training regimes, capacity scaling laws, or regularizers reduce underfitting without harming OOD generalization?
  • Handling variable sequence lengths and insertions/deletions: For TCR and other sequences, how should transport handle length changes and structural edits while maintaining semantic alignment across time or conditions?
  • Safety and bias considerations: Conditioning on distribution embeddings may encode sensitive cohort attributes; how can privacy-preserving or fairness-aware versions of DCT be formulated without degrading transport quality?

Practical Applications

Immediate Applications

Below are practical uses that can be deployed now by leveraging the paper’s findings and released codebase, often by wrapping existing transport mechanisms (flow matching, MMD/SWD, GANs) with distribution encoders to enable generalization across unseen distributions.

  • Any-to-any batch effect transfer for single-cell data
    • Sectors: healthcare/biotech, academia (genomics core facilities, lab bioinformatics)
    • Tools/products/workflows:
    • Add a “Batch Transfer API” to Scanpy/Seurat pipelines that computes distribution embeddings per batch and applies a source–target-conditioned transport map for unseen batch pairs.
    • QC dashboards that track alignment metrics (MMD/SWD) on held-out donors/batches and alert when transport degeneracy occurs (alignment diagnostic to detect source-ignoring behavior).
    • Assumptions/dependencies:
    • Sufficient cells per batch to learn stable distribution embeddings (CLT assumptions).
    • Batch labels per sample set; realistic coverage of batch variability seen at training to enable encoder generalization.
    • Compute budgets for training source–target-conditioned models, which can require more capacity than K-to-K baselines.
  • Semi-supervised perturbation prediction in high-throughput drug screens
    • Sectors: healthcare/biotech (preclinical), pharma R&D, academia
    • Tools/products/workflows:
    • “Perturbation Forecaster” that trains any-to-any DCT on all treatment/control marginals, then fits a lightweight predictor (e.g., ridge regression) from source embedding plus drug metadata to target embedding; generate counterfactual post-treatment distributions for new patients or compounds.
    • Experimental triage: prioritize compound–patient pairs for wet-lab validation based on predicted effect sizes and uncertainty.
    • Assumptions/dependencies:
    • Accurate metadata for drug identity/dose; enough unpaired marginals to leverage semi-supervised gains.
    • Out-of-distribution (OOD) generalization depends on encoder capacity and the diversity of training distributions.
  • Cross-site harmonization in multi-center omics and imaging studies
    • Sectors: healthcare/public health, policy, academia
    • Tools/products/workflows:
    • “Site Harmonizer” service that embeds site-specific distributions (labs, instruments, protocols) and performs any-to-any transport to harmonize data across cohorts, enabling pooled analyses and meta-studies.
    • Integrates into study data-lakes with periodic recalibration via Qjoint sampling focused on site-to-site pairings of interest.
    • Assumptions/dependencies:
    • Access to representative sample sets per site; consistent preprocessing.
    • Governance for tracking provenance and harmonization parameters for auditability.
  • Clonal fate forecasting and lineage-tracing analysis
    • Sectors: academia (developmental biology, hematopoiesis), biotech
    • Tools/products/workflows:
    • “Clone Dynamics Modeler” that uses semi-supervised DCT with orphan marginals to forecast late-time distributions from early-time clones (t1 → t2), aiding hypothesis generation on differentiation trajectories.
    • Suggests which clones/timepoints to sample next to reduce uncertainty (active-learning loop using transport errors).
    • Assumptions/dependencies:
    • Enough clones measured at single timepoints to benefit from orphan marginals; careful Qjoint design to respect time direction (forward transport only).
  • T-cell receptor (TCR) repertoire evolution forecasting (research use)
    • Sectors: academia, translational immunology
    • Tools/products/workflows:
    • Discrete flow matching (DFM) within DCT to forecast next-time repertoires from current samples, leveraging cross-patient marginal pairs for semi-supervised gains.
    • Embedding backbone integration (e.g., ESM) with mean-pooled DeepSets encoder for population summarization.
    • Assumptions/dependencies:
    • Reliable sequence embeddings; adequate longitudinal or cross-sectional breadth; watch for degenerate generators (add alignment diagnostics to ensure conditioning on source distribution is used).
  • Domain adaptation for ML teams (tabular, vision, audio)
    • Sectors: software, e-commerce, marketing analytics, content moderation
    • Tools/products/workflows:
    • “Distribution Adapter” microservice: compute embeddings per client/domain/camera and deploy any-to-any transport to normalize data across domains (e.g., camera color spaces, store-level sales distributions, microphone/room acoustics).
    • Drop-in wrappers around existing models (GANs, flows, SWD) to convert K-to-K domain-transfer pipelines to generalizable DCT without retraining for each new domain.
    • Assumptions/dependencies:
    • Stable estimation of distribution embeddings from available samples; proper pairing policy Qjoint to emphasize relevant domain transitions.
  • Simulation data translation for experimentation and A/B test augmentation
    • Sectors: finance (scenario analysis), retail (promo/weather shifts), energy (load forecasting)
    • Tools/products/workflows:
    • “Scenario Translator” to push observed distributions into counterfactual conditions (e.g., macroeconomic scenarios, weather regimes), aiding sensitivity analyses and synthetic augmentation.
    • Assumptions/dependencies:
    • Scenario descriptors or covariates to help predict target embeddings; rigorous validation to prevent overtrust in counterfactuals.
  • Education and methodology teaching aids
    • Sectors: academia, education technology
    • Tools/products/workflows:
    • Interactive notebooks demonstrating one-to-one vs any-to-any transport, K-to-K vs DCT, and semi-supervised benefits with orphan marginals on Gaussian/GMM datasets.
    • Assumptions/dependencies:
    • None beyond standard ML compute; ideal for coursework on OT, generative modeling, and meta-learning.

Long-Term Applications

These applications are feasible but require additional research, scaling, validation, or regulatory pathways before routine deployment.

  • Personalized clinical decision support (CDS) for therapy selection
    • Sectors: healthcare, pharma
    • Tools/products/workflows:
    • Patient-specific perturbation prediction from pre-treatment biopsies (scRNA-seq/mass cytometry) using semi-supervised DCT to forecast treatment distributions; integrate with response biomarkers and trial design.
    • Assumptions/dependencies:
    • Extensive clinical validation, calibration, and fairness auditing; regulatory approval; robust uncertainty quantification and interpretability to avoid source-ignoring degeneracy.
  • Vaccine response and immune monitoring from repertoire dynamics
    • Sectors: healthcare, public health
    • Tools/products/workflows:
    • Longitudinal DCT for TCR/BCR repertoires to forecast response trajectories to vaccines or infections; early-warning scores for atypical evolution.
    • Assumptions/dependencies:
    • Large longitudinal cohorts; improved discrete transport mechanisms and embeddings tailored to immunogenomics; clear clinical utility.
  • General sim-to-real and real-to-real transfer in robotics and autonomy
    • Sectors: robotics, autonomous driving
    • Tools/products/workflows:
    • Any-to-any domain transfer of sensor distributions across weather, time of day, sensor modalities, or environments; sim data transported to match target deployment distributions.
    • Assumptions/dependencies:
    • Tight coupling between sample and target distribution to avoid mode-matching without alignment; integration with control stacks; safety certification.
  • Unsupervised machine translation and cross-domain NLP with unseen domains/languages
    • Sectors: software, localization
    • Tools/products/workflows:
    • Discrete DCT (with DFM) to learn any-to-any mappings between text corpora/domains, generalizing beyond fixed domain labels (e.g., styles, outlets) and potentially low-resource or unseen languages.
    • Assumptions/dependencies:
    • Scaling discrete transport to large vocabularies; robust distribution encoders for text; evaluation beyond BLEU on content preservation/style transfer.
  • Financial stress testing and regulatory scenario harmonization
    • Sectors: finance, policy/regulators
    • Tools/products/workflows:
    • “Stress Transport” to push borrower/portfolio distributions into regulatory scenarios, harmonize cross-institution datasets, and impute missing timepoints for systemic risk models.
    • Assumptions/dependencies:
    • Transparent, auditable pipelines; conservative uncertainty bounds; acceptance by regulators; robust OOD generalization.
  • Grid planning and demand-response forecasting under novel regimes
    • Sectors: energy/utilities
    • Tools/products/workflows:
    • Transporting consumption or generation distributions across new tariff structures, DER penetration, or climate regimes to plan capacity and DR programs.
    • Assumptions/dependencies:
    • High-fidelity covariates; careful pairing policies Qjoint; validation on exogenous shocks.
  • Federated and privacy-preserving distribution embeddings for data sharing
    • Sectors: healthcare, finance, public sector
    • Tools/products/workflows:
    • On-premise computation of distribution embeddings with CLT properties; share embeddings only to enable cross-institution transport without sharing raw data.
    • Assumptions/dependencies:
    • Privacy analysis on embedding leakage; secure aggregation; standardization of embedding schemas across sites.
  • Population-level causal and counterfactual inference at distribution scale
    • Sectors: academia, policy
    • Tools/products/workflows:
    • Use DCT to estimate counterfactual population distributions under interventions when only orphan marginals are available; support policy evaluation (e.g., program rollout effects).
    • Assumptions/dependencies:
    • Strong identifiability assumptions; careful design of Qjoint to reflect causal structure; integration with causal inference frameworks.
  • Universal domain harmonization platforms
    • Sectors: software/SaaS platforms
    • Tools/products/workflows:
    • “Distribution Harmonizer” platform offering plug-in encoders and transport backends (FM, SWD, GANs) for customers to generalize across their evolving domains without retraining per domain.
    • Assumptions/dependencies:
    • Efficient, scalable training/inference; automatic degeneracy diagnostics; user-friendly pairing policy design and monitoring.

Cross-cutting assumptions and dependencies

  • Data requirements: Each distribution needs enough samples for stable embeddings; encoder invariances (permutation and proportional invariance) and smoothness assumptions underpin CLT-based training.
  • Model choice: Some transport mechanisms can match targets while ignoring the source sample; include alignment diagnostics and, where needed, mechanisms that encourage sample-level coupling (e.g., flow-matching with coupling-aware objectives); a simple heuristic check is sketched after this list.
  • Pairing policy: Define Qjoint to reflect domain constraints (e.g., time directionality, clinically relevant pairs) to avoid learning uninformative transports.
  • Compute and scaling: Source–target-conditioned models may need more capacity/training than K-to-K; monitor underfitting and adjust scaling.
  • Governance and safety: For clinical/regulated uses, require rigorous validation, uncertainty quantification, interpretability, and compliance.
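
One simple heuristic in the spirit of such an alignment diagnostic (an illustrative check, not the paper's proposed procedure) is to test whether transported outputs preserve any of the source samples' neighborhood structure: if the map ignores individual source points, pairwise distances among outputs should be uncorrelated with pairwise distances among inputs.

```python
import torch

def alignment_score(x_src: torch.Tensor, y_out: torch.Tensor) -> torch.Tensor:
    """Heuristic degeneracy check (an assumption, not the paper's diagnostic).

    x_src: (n, d_src) source samples; y_out: (n, d_tgt) transported outputs,
    in the same order. Returns the Pearson correlation between the two vectors
    of pairwise distances; values near zero suggest the transport matches the
    target marginal while ignoring individual source samples."""
    d_x = torch.pdist(x_src)
    d_y = torch.pdist(y_out)
    d_x = (d_x - d_x.mean()) / d_x.std()
    d_y = (d_y - d_y.mean()) / d_y.std()
    return (d_x * d_y).mean()
```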

Glossary

  • Any-to-any transport: Transport between arbitrary source–target distribution pairs, not fixed at training time. "Unsupervised (any-to-any) transport between any pair of distributions."
  • Barycenter: A central distribution minimizing distances to a set of distributions. "MMSI enforces that transport paths go through all K distributions to learn a barycenter"
  • Batch effect transfer: Predicting how data would appear under a different technical batch condition. "This problem, which we refer to as batch effect transfer, is closely related to batch integration."
  • Central limit theorem (CLT): Asymptotic normality result used here for distribution embeddings and losses. "A key property of such encoders is that they admit a central limit theorem (CLT)."
  • Change-of-variable models: Invertible mappings used by normalizing flows to transform densities. "Normalizing flows provide invertible transport parameterizations through continuous change-of-variable models (Rezende & Mohamed, 2015)."
  • DeepSets: A permutation-invariant neural architecture for set inputs and aggregation. "with mean-pooled DeepSets aggregation."
  • Dirichlet prior: A prior over probability vectors, used here for mixture weights. "For GMM, we additionally sample mixture weights from a Dirichlet prior"
  • Discrete Flow Matching (DFM): A flow-matching framework for discrete domains. "a discrete flow matching (DFM) bridge (Gat & Lipman, 2024)"
  • Distribution encoder: A permutation/proportion-invariant function mapping a sample set to a distribution embedding. "defining a distribution encoder $\mathcal{E}: S_i \to z_i \in \mathbb{R}^d$"
  • Distribution-conditioned transport (DCT): Conditioning transport maps on learned embeddings of source/target distributions. "The distribution-conditioned transport framework."
  • Distributional divergence: A measure between distributions used to train generative transport (e.g., Sinkhorn, MMD). "a distributional divergence (e.g. Sinkhorn/MMD)"
  • Energy distance: A statistical distance between distributions used for evaluation. "Figure 1 shows energy distance as a function of Lo distance"
  • Energy score: A proper scoring rule related to energy distance for multivariate distributions. "Energy score models"
  • ESM backbone: A pretrained protein LLM used to embed sequences. "We embed sequences using a pre-trained ESM backbone (Lin et al., 2023) with mean-pooled DeepSets aggregation."
  • Flow matching: Learning a velocity field to define continuous-time transport between distributions. "flow matching (FM) models"
  • Generative distribution embeddings (GDE): Learned embeddings summarizing distributions for generative tasks. "We show how generative distribution embeddings (GDE) developed in Fishman et al. (2025) can be coupled with a broad class of transport models"
  • Hadamard differentiability: A functional smoothness notion enabling CLTs for functionals of measures. "Hadamard differentiability of $\phi$, we have"
  • Inverse-Wishart prior: A prior over covariance matrices used for Gaussian sampling. "covariances $\Sigma_i \in \mathbb{R}^{2 \times 2}$ drawn from an inverse-Wishart prior."
  • Kernel mean embeddings: Representing probability measures as elements in an RKHS. "Kernel methods, including kernel mean embeddings, provide a toolkit for representing probability measures as points in a reproducing kernel Hilbert space"
  • K-simplex: The simplex with K corners; used to index K discrete distributions. "embedding each distribution as a corner of the K-simplex and conditioning on the source and target corners."
  • K-to-K transport: Transport restricted to a fixed finite set of K source and K target distributions. "learn dynamics between any pair of a fixed set of K distributions, solving a K-to-K transport problem"
  • Maximum mean discrepancy (MMD): A kernel-based two-sample metric used for evaluation and training. "and evaluate using MMD in the main text"
  • Metadistribution: A distribution over distributions (tasks) from which population-level distributions are drawn. "drawn from a shared metadistribution Q over the space of probability measures P(X)"
  • Multimarginal stochastic interpolants (MMSI): A framework to learn flows connecting multiple distributions. "Multimarginal stochastic interpolants learn dynamics between any pair of a fixed set of K distributions"
  • Normalizing flows: Invertible neural transformations enabling exact likelihoods and transport. "Normalizing flows provide invertible transport parameterizations"
  • Orphan marginals: Unpaired distributions observed only at a single condition or timepoint. "as well as allowing us to make use of unstructured, partial observations such as orphan marginals."
  • Permutation invariant: A property of encoders that are unchanged by reordering samples. "permutation invariant, so that reordering samples does not change $\mathcal{E}(S_i)$"
  • Probability measures P(X): The space of probability distributions on domain X. "over the space of probability measures P(X)"
  • ProGen: A protein LLM used here as a sequence generator/bridge. "The ProGen model appears degenerate; it learns identical embeddings for all distributions"
  • Product coupling: Independent sampling of minibatches, yielding block-diagonal covariance in the loss CLT. "for independent minibatches (product coupling)"
  • Reproducing kernel Hilbert space (RKHS): A Hilbert space associated with a kernel where mean embeddings live. "as points in a reproducing kernel Hilbert space"
  • Ridge regression: L2-regularized linear regression used to predict target embeddings. "All semi-supervised approaches use ridge regression fit on the same data"
  • Sinkhorn (divergence): An entropically regularized optimal transport divergence. "e.g. Sinkhorn/MMD"
  • Sliced Wasserstein distance (SWD): An OT-based distance computed via 1D projections. "sliced Wasserstein distance (SWD) models"
  • Source-conditioned transport: Transport conditioned only on the source distribution’s embedding. "We refer to these as source-conditioned transport models."
  • Source-target-conditioned transport: Transport conditioned on both source and target distribution embeddings. "we will introduce source-target-conditioned transport models"
  • Stochastic interpolants: Stochastic paths (or bridges) used to connect source and target distributions. "stochastic interpolants (Lipman et al., 2023; Liu et al., 2022; Albergo et al., 2023b)"
  • Style-transfer: Unpaired image-to-image translation across domains, used as an analogy to distribution transport. "Style-transfer and unpaired image-translation methods provide a complementary line of work."
  • T-cell receptor (TCR) repertoire: The set of TCR sequences in a sample, treated here as a distribution over sequences. "We evaluate DCT on longitudinal T-cell receptor (TCR) repertoire sequencing"
  • Transport map: A mapping that transforms samples from a source distribution to match a target distribution. "The learned transport map is universal in the sense that any distribution can in principle be pushed to any distribution"
  • TRB CDR3: The complementarity-determining region 3 of the TCR beta chain; a key variable-length sequence segment. "Each repertoire is an empirical distribution over TRB CDR3 amino acid sequences."
  • Velocity field: A time-dependent vector field whose flow transports one distribution to another. "by learning a velocity field along a path interpolating between source and target."
  • Wasserstein GANs: GAN variants trained using the Wasserstein distance as a critic objective. "Wasserstein GANs (Arjovsky et al., 2017)."
  • Within-distribution coupling: Sample-wise alignment notion within a transport; may be weak in conditional generators. "does not impose a within-distribution coupling structure"

