- The paper introduces DLC, a discrete latent code derived from Simplicial Embeddings that enhances the fidelity and compositionality of image diffusion models.
- It demonstrates that conditioning on DLCs significantly improves unconditional ImageNet generation, achieving a state-of-the-art FID of 1.59 and enabling novel out-of-distribution samples.
- The work also proposes a text-to-image pipeline that leverages pretrained language models to generate DLC tokens, offering a unified approach to image and text generation.
Compositional Discrete Latent Code for High Fidelity, Productive Diffusion Models
This paper introduces Discrete Latent Code (DLC), a novel image representation designed to improve the fidelity, ease of generation, and compositionality of diffusion models. DLCs are derived from Simplicial Embeddings (SEMs) trained with a self-supervised learning objective, resulting in sequences of discrete tokens that offer advantages over standard continuous image embeddings. The paper demonstrates that diffusion models trained with DLCs achieve improved generation fidelity, establish a new state-of-the-art for unconditional image generation on ImageNet, and enable the generation of novel out-of-distribution (OOD) samples through DLC composition. Additionally, the paper showcases a text-to-image generation pipeline leveraging large-scale pretrained LLMs to generate DLCs from text prompts.
Addressing Limitations of Continuous Embeddings
The paper argues that the success of diffusion models is largely attributable to input conditioning. It attributes limitations of current diffusion models, such as low sample diversity and unfaithful renderings of complex input prompts, to their inability to fully model the data distribution. The authors posit that conditioning diffusion models on improved representations can alleviate these issues.

Figure 1: Selected samples generated from a DiT-XL/2 with DLC512 for both in-distribution and out-of-distribution (OOD) generation. The model is trained on ImageNet 256×256 conditioned on a Discrete Latent Code of 512 tokens. Left: samples from unconditional generation. Right: OOD samples of semantic compositional generation obtained by conditioning on diverse compositions of two DLCs corresponding to (1) jellyfish and mushroom, (2) komondor and carbonara, and (3) tabby cat and golden retriever.
Natural language is recognized as a flexible and compositional representation, yet text captions often fall short as image descriptors, capturing only a few concepts while excluding crucial details. While text-to-image models have advanced, they often struggle with semantic consistency. Self-supervised learning (SSL) image embeddings offer a structured and expressive alternative, but their continuous nature poses challenges in learning and sampling distributions, as well as in achieving flexible compositionality. DLC aims to bridge the gap between image and text representations by providing a sequence of discrete image tokens that are easy to generate and composable.
Discrete Latent Code (DLC) Methodology
The DLC framework leverages SEMs, which are sequences of distributions over a vocabulary of image tokens learned with an SSL method. The process involves inferring DLCs from SEM encoders trained via a distillation objective. Specifically, an encoded representation $e_\theta(x)$ is projected onto a $V$-dimensional simplex using a learnable linear projection $W_i$, followed by a temperature-scaled softmax $\sigma_\tau$, yielding simplicial embeddings $S_i = \sigma_\tau(e_\theta(x) \cdot W_i)$. A discrete latent code $c$ is then obtained by taking the argmax of each SEM: $T_i = \arg\max S_i$ for $i \in [L]$, and $c = (T_1, T_2, \ldots, T_L)$, where each token $T_i$ takes one of $V$ discrete values.
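To make this mapping concrete, the sketch below infers a DLC from encoder features following the definitions above. It is a minimal PyTorch illustration, not the paper's implementation: the feature dimension, code length L, vocabulary size V, and temperature are placeholder values, and the single stacked projection stands in for the per-position projections $W_i$.

```python
import torch
import torch.nn.functional as F

D, L, V, TAU = 768, 32, 4096, 0.1   # feature dim, code length, vocab size, temperature (placeholders)

# One learnable projection per code position, stacked into a single linear layer.
proj = torch.nn.Linear(D, L * V, bias=False)

def simplicial_embedding(features: torch.Tensor) -> torch.Tensor:
    """SEM: project features onto L simplices of dimension V, S_i = softmax_tau(e(x) . W_i)."""
    logits = proj(features).view(-1, L, V)                  # (batch, L, V)
    return F.softmax(logits / TAU, dim=-1)

def discrete_latent_code(features: torch.Tensor) -> torch.Tensor:
    """DLC: the argmax token of each simplicial embedding, c = (T_1, ..., T_L)."""
    return simplicial_embedding(features).argmax(dim=-1)    # (batch, L) tokens in [0, V)

# e.g. features = dinov2_encoder(images)   # hypothetical finetuned DINOv2 encoder
#      code = discrete_latent_code(features)
```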

Figure 2: Training: 441 mixtures
To improve unconditional generation, the paper proposes modeling the data distribution $p(x)$ as the product of two generative models that are easier to learn: $p(x) = \sum_c p(x \mid c)\, p(c)$. Sampling from $p(x)$ is achieved through ancestral sampling: first sample a code from $p(c)$, then sample an image from $p(x \mid c)$ conditioned on the sampled code $c$. To model $p(c)$, the paper employs a discrete diffusion model, SEDD-Absorb, which samples a discrete code by iteratively unmasking a fully masked sequence. The token to be unmasked is determined via a learned concrete score $s_{\theta'} : \mathcal{C} \times \mathbb{R} \to \mathbb{R}^V$, which estimates a diffusion matrix controlling the mass transition from the mask token to DLC tokens. A remasking strategy is also introduced to improve sampling: during the reverse diffusion process, already-decoded tokens are remasked with probability $\eta$.
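As a rough illustration of this two-stage sampler, the sketch below first draws a DLC from a mask-based discrete diffusion model and then conditions an image diffusion model on it. It is a simplified sketch under stated assumptions: `score_model` (standing in for the trained SEDD-Absorb network), `image_diffusion` (standing in for the DLC-conditioned DiT), the unmasking schedule, and the default `steps` and `eta` values are all placeholders rather than the paper's exact procedure; `eta` plays the role of the remasking probability $\eta$.

```python
import torch

@torch.no_grad()
def sample_dlc(score_model, L, V, steps=256, eta=0.05, device="cpu"):
    """Ancestral sampling of a DLC c ~ p(c) with an absorbing-state discrete diffusion model."""
    MASK = V                                               # extra absorbing "mask" token id
    c = torch.full((1, L), MASK, dtype=torch.long, device=device)   # start fully masked
    for t in reversed(range(1, steps + 1)):
        probs = score_model(c, t)                          # (1, L, V): mass moving mask -> token
        masked_pos = (c[0] == MASK).nonzero(as_tuple=True)[0]
        if len(masked_pos) > 0:
            # Unmask a fraction of the remaining masked positions at this step (simplified schedule).
            n_unmask = max(1, len(masked_pos) // t)
            chosen = masked_pos[torch.randperm(len(masked_pos), device=device)[:n_unmask]]
            c[0, chosen] = torch.multinomial(probs[0, chosen], 1).squeeze(-1)
        if t > 1:
            # Remasking strategy: re-absorb already-decoded tokens with probability eta.
            decoded = c[0] != MASK
            remask = decoded & (torch.rand(L, device=device) < eta)
            c[0, remask] = MASK
    return c                                               # all positions decoded at the final step

@torch.no_grad()
def sample_image(image_diffusion, c):
    """Second stage: x ~ p(x | c) from a hypothetical DLC-conditioned image diffusion model."""
    return image_diffusion.sample(condition=c)
```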
Experimental Results and Analysis
The paper presents a series of experiments evaluating DLCs across image generation tasks. Diffusion models conditioned on DLCs advance the state of the art in unconditional ImageNet generation, outperform generative models conditioned on continuous SSL embeddings, and exhibit compositional generation capabilities. The experiments also show that increasing the DLC sequence length improves generation fidelity: a DiT-XL/2 conditioned on DLCs considerably improves FID over the same DiT-XL/2 with label conditioning, reaching a state-of-the-art FID of 1.59 for unconditional generation. The paper further investigates the trade-off between sequence length and vocabulary size, finding that longer sequences yield better fidelity at higher computational cost.
Figure 3: Discrete Latent Codes (DLCs). Top Left: DLCs are the output of a finetuned DINOv2 with SEM, followed by an argmax over the vocabulary. Top Right: we can generate semantically compositional images from a composition of two DLCs by selecting tokens from either code. Bottom Left: we enable text-to-image generation by finetuning a text diffusion model for text-to-DLC sampling. Bottom Right: we sample unconditionally by first sampling a DLC with SEDD and then conditionally sampling an image with DiT.
In compositional generation experiments, DLC-based compositions successfully integrate visual features from multiple reference images, exhibiting greater sample diversity compared to continuous embeddings. The Vendi Score is used to quantify this diversity, with DLC compositions consistently outperforming continuous embeddings in generating diverse and semantically blended samples.
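A minimal sketch of this composition operation, assuming two DLCs inferred from two reference images, is shown below: each position of the composed code is taken from one parent or the other. The mixing ratio is an illustrative knob, not a value from the paper.

```python
import torch

def compose_dlcs(c_a: torch.Tensor, c_b: torch.Tensor, ratio: float = 0.5) -> torch.Tensor:
    """Compose two DLCs (shape (L,)) by taking each token from c_a with prob `ratio`, else from c_b."""
    take_a = torch.rand(c_a.shape) < ratio
    return torch.where(take_a, c_a, c_b)

# e.g. condition the image diffusion model on compose_dlcs(dlc_jellyfish, dlc_mushroom)
# to sample a semantically blended image.
```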
Finally, the paper explores text-conditioned DLC generation for image synthesis, leveraging large-scale pretrained LLMs. By treating DLC tokens as part of an LLM's vocabulary, the paper proposes a text-to-image pipeline that first samples a DLC from a text prompt using the LLM and then generates an image from that DLC with the pre-trained image diffusion model. This pipeline can generate novel images from text prompts that are out-of-distribution relative to the image diffusion model's ImageNet training.
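The sketch below illustrates one way such a pipeline could be wired up with Hugging Face `transformers`, assuming an autoregressive LM as the text-to-DLC sampler. The base checkpoint (`gpt2`), the `<dlc_i>` token format, the placeholder vocabulary size, and the (omitted) finetuning on caption/DLC pairs are all assumptions; the paper's actual text-to-DLC model may differ.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

V, L = 4096, 32                                   # DLC vocabulary size and code length (placeholders)

tokenizer = AutoTokenizer.from_pretrained("gpt2") # placeholder base LM
lm = AutoModelForCausalLM.from_pretrained("gpt2")

# Add one new token per DLC vocabulary entry and resize the LM's embedding table.
tokenizer.add_tokens([f"<dlc_{i}>" for i in range(V)])
lm.resize_token_embeddings(len(tokenizer))

def text_to_dlc(prompt: str) -> torch.Tensor:
    """After finetuning on (caption, DLC) pairs, sample L DLC tokens from the LM."""
    inputs = tokenizer(prompt, return_tensors="pt")
    out = lm.generate(**inputs, max_new_tokens=L, do_sample=True)
    new_tokens = out[0, inputs["input_ids"].shape[1]:]
    # Map token ids back to DLC indices in [0, V). A real pipeline would constrain
    # decoding to DLC tokens; the clamp below is only a rough safeguard.
    offset = tokenizer.convert_tokens_to_ids("<dlc_0>")
    return (new_tokens - offset).clamp(0, V - 1)

# The resulting code can then condition the pretrained image diffusion model,
# e.g. image_diffusion.sample(condition=text_to_dlc("a komondor eating carbonara")).
```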
Implications and Future Directions
The research presented in this paper has significant implications for the field of generative modeling, particularly in the context of diffusion models. The introduction of DLC as a compositional discrete representation offers a promising avenue for improving image generation fidelity and enabling productive generation capabilities. The finding that DLC improves on the state-of-the-art for unconditional image generation on ImageNet challenges the conventional wisdom that label or CLIP embeddings are necessary for high-quality results. Furthermore, the paper's demonstration of a novel text-to-image paradigm leveraging large-scale pretrained LLMs opens up new possibilities for unified text-image generation interfaces.
Figure 4: DLC greatly improves training efficiency for FID without CFG on ImageNet. Evaluating FID without CFG at intermediate training steps, DLC already surpasses vanilla DiT performance at 1/4 of the steps. Baseline numbers taken from Yu et al. (2025).
Future research directions could explore the application of DLCs to other generative modeling tasks, such as video generation and 3D shape generation. Additionally, investigating methods for learning even more expressive and controllable DLCs could further enhance the capabilities of diffusion models. The development of more efficient algorithms for sampling from discrete diffusion models could also improve the scalability of the DLC framework.
Conclusion
This paper presents a compelling case for the use of Discrete Latent Codes as a means of improving the performance and capabilities of diffusion models. By leveraging a compositional discrete representation learned solely from images, the paper achieves state-of-the-art results on unconditional image generation, enables productive generation capabilities, and introduces a novel text-to-image paradigm. These findings highlight the importance of representation learning in the context of generative modeling and suggest promising avenues for future research.