Conditioned Generative Models
- Conditioned generative models are probabilistic frameworks that generate samples from p(x|c) by incorporating auxiliary information like class labels or text.
- They utilize diverse conditioning mechanisms—such as direct concatenation, cross-modal fusion, and conditional normalization—to achieve controllable and coherent outputs.
- Empirical results in domains like text-to-image, medical imaging, and 3D synthesis demonstrate improved metrics and stability by partitioning complex joint distributions.
A conditioned generative model is a probabilistic framework that synthesizes samples from a target distribution $p(x \mid c)$, where $x$ denotes the synthetic output (e.g., image, sequence, field) and $c$ is auxiliary conditioning information such as class labels, text, measurements, context, or structured side information. Conditioning sharpens the generative task by partitioning the joint distribution into simpler conditional components, typically enabling more controllable, coherent, and diverse sample generation than unconditional approaches.
1. Mathematical Foundations and Theoretical Principles
Conditioned generative models generalize the standard marginal generative paradigm by explicitly modeling $p(x \mid c)$ rather than $p(x)$ alone. This is justified both expressively and statistically: for fixed capacity, conditional mixtures can represent a broader class of marginal distributions and simplify fitting by separating data modes (Bao et al., 2022). The learning objective is typically to minimize a divergence $D$ (e.g., the KL divergence) between the empirical and model conditional distributions:

$$\min_{\theta} \; \mathbb{E}_{c \sim p(c)} \Big[ D\big( p_{\mathrm{data}}(x \mid c) \,\|\, p_{\theta}(x \mid c) \big) \Big],$$

where $c$ is the conditioning embedding and $\theta$ the generative model parameters.
Sufficient conditions for conditional dominance are formalized: if, for any model backbone, fitting each conditional $p(x \mid c)$ by tuning on the condition is easier than fitting the full marginal $p(x)$, then the conditional optimum yields a strictly tighter marginal approximation [(Bao et al., 2022), Proposition 1]. Conditioning thus acts as a variational partition that can reduce generalization error, lower sample complexity, and stabilize optimization.
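The partition argument can be made concrete with a toy numpy experiment (a sketch, not any paper's construction): fitting one Gaussian per condition slice attains a lower negative log-likelihood than fitting a single Gaussian to the bimodal marginal.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: two condition classes with different means. The marginal p(x)
# is bimodal, but each conditional p(x|c) is a simple unimodal Gaussian.
c = rng.integers(0, 2, size=2000)
x = np.where(c == 0, -2.0, 2.0) + rng.normal(0.0, 0.5, size=2000)

def gauss_nll(x, mu, var):
    """Average negative log-likelihood of x under N(mu, var)."""
    return 0.5 * np.mean(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)

# Marginal fit: one Gaussian for all of x (must cover both modes).
nll_marg = gauss_nll(x, x.mean(), x.var())

# Conditional fit: one Gaussian per condition slice (a "variational partition").
nll_cond = np.mean(
    [gauss_nll(x[c == k], x[c == k].mean(), x[c == k].var()) for k in (0, 1)]
)

# Conditioning on c separates the modes, so the conditional NLL is lower.
assert nll_cond < nll_marg
```

The same effect drives the conditional-dominance proposition: each slice is easier to fit than the mixture, so the implied marginal approximation tightens.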
2. Conditioning Mechanisms, Network Architectures, and Representations
Conditioning is implemented through architectural mechanisms that ingest $c$ into the generator and, often, the discriminator or density model:
- Direct concatenation: $G([z; e(c)])$, where $z$ is noise and $e(c)$ is a learned embedding of the condition (canonical in cGAN, cVAE, and conditional flows) (Dash et al., 2017, Ren et al., 2016).
- Cross-modal fusion: Projecting text via encoders (e.g., Skip-Thought (Dash et al., 2017), CLIP, BLIP (Sun et al., 2024), or LLMs), then injecting the latent into generator or diffusion backbone via cross-attention or FiLM-modulation (Sun et al., 2024, Blinn et al., 2021).
- Spatial fusion/bilinear pooling: Multiplicative or tensor-product interaction at each spatial location to strengthen joint feature–condition dependence (Kwak et al., 2016).
- Conditional normalization: SPADE, AdaIN, and feature-wise linear modulation (FiLM), applying per-condition scale and shift to features (Blinn et al., 2021, Voleti, 2023).
- Auxiliary classifier: The discriminator jointly outputs real/fake and class/condition predictions, enforcing separability and semantic fidelity (Dash et al., 2017, Han et al., 2021).
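Two of the mechanisms above, direct concatenation and FiLM, can be sketched together in a minimal numpy generator block. All dimensions, weight matrices, and the embedding table are hypothetical placeholders, not from any cited architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (hypothetical): 8-d noise, 4-d condition embedding, 16-d features.
z_dim, e_dim, h_dim = 8, 4, 16
embed_table = rng.normal(size=(3, e_dim))        # learned embeddings e(c) for 3 classes
W_in = rng.normal(size=(z_dim + e_dim, h_dim))   # generator input layer
W_film = rng.normal(size=(e_dim, 2 * h_dim))     # maps e(c) -> (gamma, beta)

def generator_block(z, c):
    e = embed_table[c]
    # (1) Direct concatenation: the canonical cGAN/cVAE input [z; e(c)].
    h = np.tanh(np.concatenate([z, e]) @ W_in)
    # (2) FiLM: condition-dependent affine modulation of the features.
    gamma, beta = np.split(e @ W_film, 2)
    return gamma * h + beta

out = generator_block(rng.normal(size=z_dim), c=1)
assert out.shape == (h_dim,)
```

In practice both paths appear in trained networks (e.g., concatenation at the input, FiLM or SPADE inside residual blocks); the sketch only shows where the condition enters the computation.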
For non-categorical, mixed-type, or partial conditioning, embedding mechanisms include numerical and categorical slot-specific embeddings with mask tokens for missing values (Mueller et al., 22 May 2025), learned feature extractors for partial or sparse attributes (Ibarrola et al., 2020), and latent encoders mapping sentences, scenes, or continuous auxiliary data to conditioning vectors (Jacobsen et al., 2023, Xie et al., 21 Jan 2026).
Diffusion and flow-based models extend conditioning by injecting $c$ (or measurement/scene latents) into each U-Net, Transformer, or ODE block via concatenation, cross-attention, or control branches (Jacobsen et al., 2023, Huang, 2024, Xie et al., 21 Jan 2026).
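The cross-attention injection used inside such blocks can be sketched in a few lines of numpy. This is a single-head, unbatched toy (all weight matrices and dimensions are hypothetical), showing only how feature tokens attend to condition tokens and receive a residual update:

```python
import numpy as np

rng = np.random.default_rng(0)

def cross_attention(h, cond_tokens, Wq, Wk, Wv):
    """Feature tokens h attend to condition tokens (e.g., text or scene latents)."""
    q = h @ Wq                      # queries from the features, shape (n, d)
    k = cond_tokens @ Wk            # keys from the condition, shape (m, d)
    v = cond_tokens @ Wv            # values from the condition, shape (m, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)    # row-wise softmax over cond tokens
    return h + attn @ v             # residual injection, as in U-Net/DiT blocks

d = 8
h = rng.normal(size=(5, d))        # 5 spatial/feature tokens
cond = rng.normal(size=(3, d))     # 3 condition tokens (e.g., text embeddings)
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out = cross_attention(h, cond, Wq, Wk, Wv)
assert out.shape == h.shape
```

Real backbones add multiple heads, layer normalization, and output projections, but the conditioning pathway (queries from features, keys/values from $c$) is the same.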
3. Algorithms, Objective Functions, and Training Protocols
Loss functions and training routines align with the model class and desired conditional fidelity:
- GANs: Employ adversarial losses, with the generator and discriminator both receiving condition embeddings. Auxiliary losses (e.g., cross-entropy on the class) encourage label–output alignment. TAC-GAN (Dash et al., 2017), following the auxiliary-classifier formulation, trains a discriminator with a source head $D_S$ (real/fake) and a class head $D_C$:

$$\mathcal{L}_S = \mathbb{E}\big[\log D_S(x_{\mathrm{real}})\big] + \mathbb{E}\big[\log\big(1 - D_S(G(z, \psi(t)))\big)\big], \quad \mathcal{L}_C = \mathbb{E}\big[\log D_C(y \mid x_{\mathrm{real}})\big] + \mathbb{E}\big[\log D_C(y \mid G(z, \psi(t)))\big],$$

with $\psi(t)$ the text embedding, $y$ the class, $D_S$ the source head, and $D_C$ the class head; the discriminator is trained to maximize $\mathcal{L}_S + \mathcal{L}_C$, the generator to maximize $\mathcal{L}_C - \mathcal{L}_S$.
- Moment-matching: Conditional generative moment-matching networks (CGMMN) minimize Hilbert–Schmidt norm between RKHS embeddings of true/model conditionals (conditional MMD) (Ren et al., 2016).
- Variational: Conditional VAE (cVAE) and hybrid VAE-flow use ELBOs, with condition-injected priors and predictors; e.g., VAE-cFlow (Gu et al., 2020) combines VAE-encoded attribute distributions with conditional flow likelihood maximization.
- Diffusion/Score-based: Conditional denoising score matching, either via explicit conditions (class, text, measurement) or self-conditioned pseudo-labels (features, clusters) (Bao et al., 2022, Jacobsen et al., 2023, Huang, 2024). E.g., Schrödinger bridge learning fits a time-dependent drift network with squared-error regression (Huang, 2024).
- ODE/Flow: Conditional normalizing flows, neural ODEs, or Poisson flow models inject $c$ at each transformation step; losses reflect negative log-likelihood or score matching on conditionals (Fang et al., 17 Nov 2025, Voleti, 2023).
- Partially observed/sparse conditioning: Random masking of condition slots during training simulates test-time partiality; decoders are trained to consume any subset of available condition slots (Ibarrola et al., 2020, Mueller et al., 22 May 2025).
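The conditional moment-matching idea above can be illustrated with a slice-wise numpy surrogate: compare true and model samples within each condition value using a squared MMD with an RBF kernel. This is a simplified stand-in, not the operator-valued conditional embedding estimator of CGMMN:

```python
import numpy as np

rng = np.random.default_rng(0)

def rbf(a, b, gamma=1.0):
    """RBF kernel matrix between two 1-d sample vectors."""
    return np.exp(-gamma * (a[:, None] - b[None, :]) ** 2)

def mmd2(x, y, gamma=1.0):
    """Biased squared-MMD estimate between samples x and y."""
    return rbf(x, x, gamma).mean() + rbf(y, y, gamma).mean() - 2 * rbf(x, y, gamma).mean()

def cond_mmd2(x_true, x_model, c, values):
    """Slice-wise surrogate for conditional MMD: average MMD^2 per condition value."""
    return np.mean([mmd2(x_true[c == v], x_model[c == v]) for v in values])

c = rng.integers(0, 2, size=1000)
x_true = np.where(c == 0, -1.0, 1.0) + 0.3 * rng.normal(size=1000)
x_good = np.where(c == 0, -1.0, 1.0) + 0.3 * rng.normal(size=1000)  # matches conditionals
x_bad = rng.normal(size=1000)                                       # ignores c entirely

# A model that respects the conditionals scores a lower conditional MMD.
assert cond_mmd2(x_true, x_good, c, (0, 1)) < cond_mmd2(x_true, x_bad, c, (0, 1))
```

A trained CGMMN would backpropagate this kind of discrepancy through the generator; here the "models" are fixed sample sets so the estimator itself stays in focus.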
Optimization is almost always performed with Adam or AdamW, with learning-rate schedule and batch size following the efficiency/overfitting constraints of the domain (typically batch sizes of 64–128 for GANs, 8–64 for high-resolution 3D/medical settings, and a small constant masking rate for heavily masked conditioning).
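The random condition-masking protocol described above can be sketched in numpy. The mask-token value and slot layout are hypothetical; the key idea is appending a binary availability indicator so the model can distinguish "missing" from a genuine zero:

```python
import numpy as np

rng = np.random.default_rng(0)

MASK = 0.0  # hypothetical mask-token value for a missing condition slot

def mask_conditions(c, p_drop, rng):
    """Randomly drop condition slots during training; append a binary mask
    so the decoder can tell a missing slot apart from a true zero value."""
    keep = rng.random(c.shape) >= p_drop
    c_masked = np.where(keep, c, MASK)
    return np.concatenate([c_masked, keep.astype(float)], axis=-1)

c = rng.normal(size=(4, 3))                # batch of 4, three condition slots
c_in = mask_conditions(c, p_drop=0.5, rng=rng)
assert c_in.shape == (4, 6)                # slot values + availability indicators
# At inference, any subset of available slots is encoded the same way.
```

Training under varied `p_drop` schedules is what lets a single model serve arbitrary partial conditions at test time.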
4. Applications and Empirical Performance
Conditional generative models have demonstrated state-of-the-art results in a wide spectrum of domains:
- Text-to-image: TAC-GAN (Dash et al., 2017) achieves Inception Score 3.45 (7.8% better than StackGAN) and mean MS-SSIM 0.13 (≈real) on Oxford-102 Flowers, evidencing high semantic accuracy and diversity.
- Attribute/semantic feature synthesis: VAE-cFlow (Gu et al., 2020) enables generalized zero-shot learning with GZSL-H = 52.8% (CUB), outperforming earlier VAE/GAN-based feature generators.
- Medical imaging: BrainSynth (Peng et al., 2023) produces age- and sex-conditioned 3D MRI achieving MS-SSIM = 0.933, with 51% of brain regions matching real data by a Cohen's $d$ criterion. PoCGM (Fang et al., 17 Nov 2025) yields CT PSNR = 45.64 dB, SSIM = 0.979.
- 3D/physical reasoning: GenCA (Sun et al., 2024) generates text-conditioned photorealistic, drivable 3D avatars using LDM conditioning in latent geometry/texture space, surpassing earlier avatar generators in aesthetic and preference ratings. Conditioning on measurements, parameters, or spatial/frequency-domain information enables high-fidelity physical field synthesis (Jacobsen et al., 2023, Xie et al., 21 Jan 2026).
- Handling missing data/sparse conditioning: Masked conditioning enables robust generative synthesis as the sparsity of $c$ increases, with mild MSE degradation and generalization even for missing slots (Mueller et al., 22 May 2025). PCGAN (Ibarrola et al., 2020) significantly outperforms the classical cGAN as condition missingness rises, maintaining lower FID and better qualitative sample quality.
- Data augmentation: Guidance-based conditional sampling from pre-trained diffusion yields synthetic sets boosting classifier test accuracy by 8% (Graikos et al., 2023).
Quantitative results are typically reported on FID, IS, MS-SSIM, LPIPS, CLIP-similarity, functional metric error (PSNR/SSIM), and domain-specific task accuracy.
5. Advanced Techniques, Partial/Uncertain Conditioning, and Masking
Recent advances include:
- Partial and masked conditioning: Simulating missing or uncertain condition slots at training time enables single models to generalize over arbitrary available subsets of $c$ at inference, as in the masked-conditioned VAE/LDM (Mueller et al., 22 May 2025) and partially-conditioned GANs, which use a feature extractor to map masked inputs (Ibarrola et al., 2020).
- Latent optimization and proxy learning: For functionally targeted generation (e.g., body-aware chairs (Blinn et al., 2021)), small learnable warping networks perturb pretrained generative latent codes, optimized via differentiable surrogate metrics approximating physical or perceptual losses.
- Self-supervised and cluster-based conditioning: Self-conditioned diffusion (SCDM) clusters pre-trained self-supervised features for pseudo-label-based conditional modeling, improving unconditional FID and approaching supervised conditional performance (Bao et al., 2022).
- Scene and geometry latent conditioning: Video/scene models inject latent codes from 4D/scene encoders (e.g., CUT3R tokens in LaVR (Xie et al., 21 Jan 2026)) into diffusion U-Nets, yielding geometric consistency and improved cycle PSNR.
- Schrödinger bridge SDEs: Learning conditional samplers via entropy-regularized optimal transport SDEs allows bridging from a fixed source to a target conditional via neural drift estimation, with superior mean/variance estimation MSE (Huang, 2024).
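The cluster-based self-conditioning idea above reduces to: run k-means on pretrained features, then treat cluster ids as pseudo-conditions. A minimal deterministic numpy sketch (toy 2-cluster init chosen for reproducibility, not the initialization any paper uses):

```python
import numpy as np

rng = np.random.default_rng(0)

def kmeans_pseudo_labels(feats, iters=10):
    """Toy 2-cluster k-means; the cluster ids serve as pseudo-conditions c.
    Init uses the two points extremal along the first axis (deterministic toy)."""
    centers = feats[[feats[:, 0].argmin(), feats[:, 0].argmax()]].astype(float)
    labels = np.zeros(len(feats), dtype=int)
    for _ in range(iters):
        d2 = ((feats[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d2.argmin(axis=1)
        for j in range(2):
            if np.any(labels == j):
                centers[j] = feats[labels == j].mean(axis=0)
    return labels

# Two well-separated blobs stand in for pretrained self-supervised features.
feats = np.concatenate([rng.normal(-3, 0.5, (50, 2)), rng.normal(3, 0.5, (50, 2))])
labels = kmeans_pseudo_labels(feats)
# Each blob receives one pseudo-label, which then conditions the diffusion model.
assert (labels[:50] == labels[0]).all() and (labels[50:] == labels[50]).all()
assert labels[0] != labels[50]
```

In SCDM the features come from a frozen self-supervised encoder and k is much larger; the point here is only that pseudo-labels can replace human labels as the conditioning signal.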
6. Limitations, Trade-offs, and Open Questions
Conditional generative modeling faces several open challenges:
- Mode expressivity and label supervision: Simple conditional slices may not suffice for complex multimodal conditionals; over-conditioning may paradoxically yield under-diverse outputs unless embeddings are structured (Gu et al., 2020, Han et al., 2021).
- Embedding and fusion capacity: High-dimensional or structured conditions $c$ (sentences, graphs, partial layouts) require nontrivial cross-attention or tensor-fusion modules, with memory–compute trade-offs (e.g., SBP (Kwak et al., 2016)).
- Sparse or missing condition generalization: Masked or partial techniques require careful schedule selection; robust generalization demands embedding learning under diverse masking (Mueller et al., 22 May 2025, Ibarrola et al., 2020).
- Posterior evaluation/density estimation: Many frameworks (notably score-based, Poisson flow, and Schrödinger bridge) directly sample but do not render an explicit normalized conditional density. Kernel-based or sample-based estimates are necessary for downstream statistics/uncertainty quantification (Huang, 2024, Fang et al., 17 Nov 2025).
- Stability and sample efficiency: Adversarial objectives for conditional GANs are sensitive to training dynamics and regularization choices, especially as the condition dimension increases (Han et al., 2021). Deterministic ODE approaches (Poisson flow, physically-consistent PF-ODE (Jacobsen et al., 2023)) offer increased stability but may lose stochastic diversity.
Potential extensions include classifier-free guidance, more powerful/flexible embeddings (e.g., higher-order text for semantic control), physically-consistent sampling for scientific domains, and multi-task/masked training to maximize cross-task generalization (Jacobsen et al., 2023, Sun et al., 2024, Bao et al., 2022).
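Classifier-free guidance, mentioned as an extension above, amounts to a one-line combination of conditional and unconditional score (or noise) predictions. A sketch, noting that sign and scale conventions vary across papers (an equivalent form is $\hat\varepsilon = \varepsilon_u + w'(\varepsilon_c - \varepsilon_u)$ with $w' = 1 + w$):

```python
import numpy as np

def cfg_combine(eps_cond, eps_uncond, w):
    """Classifier-free guidance: extrapolate from the unconditional prediction
    toward the conditional one; w = 0 recovers plain conditional sampling."""
    return (1.0 + w) * eps_cond - w * eps_uncond

# Toy 2-d noise predictions from the same network run with and without c.
eps_c = np.array([1.0, 0.0])
eps_u = np.array([0.5, 0.0])
out = cfg_combine(eps_c, eps_u, w=2.0)
assert np.allclose(out, [2.0, 0.0])  # 3*1.0 - 2*0.5 = 2.0
```

Training-time condition dropout (as in the masked-conditioning protocols of Section 3) is what makes the same network usable for both the conditional and unconditional branch.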
7. Summary and Outlook
Conditioned generative models constitute the current state of the art across image, video, 3D, scientific, and language-driven synthesis. By incorporating diverse auxiliary information—textual, semantic, numerical, geometric, or partially observed—these models enable controlled, high-diversity, and domain-faithful sample generation. Emerging lines of research include advanced scene and measurement-level conditioning, partial/masked inference, physically-consistent decoding, and sample-efficient adaptation to new conditions and data domains (Mueller et al., 22 May 2025, Jacobsen et al., 2023, Xie et al., 21 Jan 2026, Huang, 2024). Limitations remain in the robustness of conditioning in the presence of complex or noisy inputs, the tractability of high-dimensional embeddings, and the interpretability of the conditional latent structure. Nonetheless, empirical results from diverse domains indicate significant gains in both quantitative metrics and qualitative fidelity over unconditional and naively concatenated frameworks.