Learning Diffusion Models with Flexible Representation Guidance (2507.08980v1)

Published 11 Jul 2025 in cs.LG, cs.AI, and cs.CV

Abstract: Diffusion models can be improved with additional guidance towards more effective representations of input. Indeed, prior empirical work has already shown that aligning internal representations of the diffusion model with those of pre-trained models improves generation quality. In this paper, we present a systematic framework for incorporating representation guidance into diffusion models. We provide alternative decompositions of denoising models along with their associated training criteria, where the decompositions determine when and how the auxiliary representations are incorporated. Guided by our theoretical insights, we introduce two new strategies for enhancing representation alignment in diffusion models. First, we pair examples with target representations either derived from themselves or arisen from different synthetic modalities, and subsequently learn a joint model over the multimodal pairs. Second, we design an optimal training curriculum that balances representation learning and data generation. Our experiments across image, protein sequence, and molecule generation tasks demonstrate superior performance as well as accelerated training. In particular, on the class-conditional ImageNet $256\times 256$ benchmark, our guidance results in $23.3$ times faster training than the original SiT-XL as well as four times speedup over the state-of-the-art method REPA. The code is available at https://github.com/ChenyuWang-Monica/REED.

Summary

  • The paper introduces a unified variational framework that flexibly incorporates pretrained semantic features into the reverse diffusion process.
  • It presents REED, which leverages multimodal representation alignment and curriculum strategies to significantly boost training efficiency and model performance.
  • Empirical results across image, protein, and molecule generation demonstrate dramatic speedups and improved quality compared to traditional diffusion approaches.

Learning Diffusion Models with Flexible Representation Guidance: An Expert Overview

This work presents a comprehensive theoretical and practical framework for enhancing diffusion models through flexible integration of pretrained representations. The authors systematically analyze and generalize prior empirical approaches, introduce new strategies for multimodal and curriculum-based guidance, and demonstrate strong empirical results across image, protein, and molecule generation tasks.

Theoretical Framework

The core contribution is a unified variational framework for incorporating auxiliary representations into diffusion models. The authors extend the standard DDPM formulation by introducing a latent variable $z$ (or a hierarchy $\{z_l\}$) representing pretrained semantic features. The generative process is parameterized to allow $z$ to be injected at arbitrary points in the reverse diffusion chain, controlled by a weighting schedule $\{\alpha_t\}$. This leads to a hybrid conditional distribution that interpolates between unconditional and representation-guided reverse steps.
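One schematic way to write this family of models (our notation; the paper's exact parameterization may differ) is as a mixture over the timestep at which $z$ enters the reverse chain:

$$p_\theta^{\{\alpha_t\}}(x_{0:T}, z) \;=\; \sum_{t=1}^{T} \alpha_t \left[\, p(x_T) \prod_{s=t+1}^{T} p_\theta(x_{s-1} \mid x_s) \;\cdot\; p_\theta(z \mid x_t) \prod_{s=1}^{t} p_\theta(x_{s-1} \mid x_s, z) \right], \qquad \alpha_t \ge 0,\; \sum_{t=1}^{T} \alpha_t = 1.$$

Each term runs the reverse chain unconditionally down to timestep $t$, predicts $z$ from $x_t$, and conditions all remaining steps on $z$; the schedule $\{\alpha_t\}$ controls how much weight each injection point receives.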

Key theoretical insights include:

  • Flexible Decomposition: The joint model $p_\theta(x_{0:T}, z)$ can be decomposed at any timestep $t$, allowing $z$ to be introduced at different stages. This flexibility is formalized via a convex combination of decompositions, parameterized by $\{\alpha_t\}$ (a sampling sketch follows this list).
  • Multi-Latent Hierarchies: The framework naturally extends to multiple representations at different abstraction levels, enabling integration of diverse modalities and hierarchical features.
  • Unification of Prior Methods: Existing approaches such as RCG and REPA are shown to be special cases within this framework, corresponding to specific choices of $\{\alpha_t\}$ and latent structure.
  • Provable Distributional Benefits: Theoretical bounds are provided on the total variation distance between the model and data distributions, showing that representation alignment can provably reduce score estimation error and improve sample quality.
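To make the mixture concrete, here is a minimal Python sketch of ancestral sampling under such a schedule. It assumes a hypothetical denoiser exposing a `reverse_step(x, t, z)` method (with `z=None` meaning an unconditional step) and a hypothetical `rep_predictor` that infers $z$ from a noisy state; neither name comes from the paper or its codebase.

```python
import torch

def sample_with_flexible_guidance(model, rep_predictor, alphas, T, shape):
    """Ancestral sampling from a convex combination of decompositions.

    alphas: length-T tensor of nonnegative weights summing to 1 (assumed
    schedule); the representation z is injected at a timestep drawn from it.
    """
    x = torch.randn(shape)                        # x_T ~ N(0, I)
    t_inject = int(torch.multinomial(alphas, 1))  # when z enters the chain
    z = None
    for t in reversed(range(T)):
        if z is None and t == t_inject:
            z = rep_predictor(x, t)               # infer z from x_t (hypothetical)
        x = model.reverse_step(x, t, z)           # z=None -> unconditional step
    return x
```

Setting all mass of `alphas` on a single timestep recovers a fixed injection point; spreading it out interpolates between unconditional and representation-guided sampling, matching the hybrid distribution described above.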

Practical Strategies: REED

Building on the theoretical foundation, the authors introduce REED (Representation-Enhanced Elucidation of Diffusion), which operationalizes two main strategies:

  1. Multimodal Representation Alignment: By pairing data with synthetic or cross-modal representations (e.g., image-text, sequence-structure), the model leverages complementary information. Synthetic data is generated using auxiliary models (e.g., VLMs for images, AlphaFold3 for proteins), and alignment is enforced via similarity losses between model features and pretrained representations.
  2. Curriculum Learning: Training is scheduled such that representation alignment is emphasized early, with the diffusion loss weight increasing over time. This phase-in protocol ensures that the model first learns to extract and align semantic features before focusing on data generation, improving both convergence and generalization. A combined sketch of the alignment loss and curriculum weighting follows this list.
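The following is a minimal PyTorch-style sketch of a single training step combining both strategies. It is illustrative only: the `add_noise`/`return_hidden` interface, the cosine-similarity alignment loss (in the spirit of REPA-style alignment), and the linear warmup schedule are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def curriculum_weight(step, warmup_steps=50_000):
    # Diffusion-loss weight ramps from 0 to 1, so alignment dominates early on.
    return min(1.0, step / warmup_steps)

def training_step(denoiser, proj_head, frozen_encoder, x0, t, step):
    noise = torch.randn_like(x0)
    x_t = denoiser.add_noise(x0, noise, t)             # forward diffusion (assumed API)
    pred_noise, hidden = denoiser(x_t, t, return_hidden=True)
    diffusion_loss = F.mse_loss(pred_noise, noise)

    with torch.no_grad():
        target = frozen_encoder(x0)                    # pretrained representation of clean data
    # Alignment: maximize cosine similarity between projected features and target.
    align_loss = 1.0 - F.cosine_similarity(proj_head(hidden), target, dim=-1).mean()

    lam = curriculum_weight(step)
    return lam * diffusion_loss + align_loss
```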

Empirical Results

The framework is instantiated and evaluated in three domains:

Image Generation

  • Setup: Class-conditional ImageNet $256\times256$ with SiT architectures, aligning with DINOv2 and Qwen2-VL representations.
  • Results: REED achieves a 23.3× training speedup over vanilla SiT-XL and 4× over REPA, reaching FID=8.2 in 300K iterations (vs. 7M for SiT-XL). With classifier-free guidance, REED matches REPA's FID=1.80 at 200 epochs (vs. 800 for REPA).
  • Ablations: Optimal alignment is achieved by matching shallow model layers to low-level image features and deeper layers to high-level VLM embeddings, confirming the theoretical predictions about hierarchical representation utility (schematically illustrated below).
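Schematically, the ablation suggests a depth-matched pairing along these lines (layer indices and target names are illustrative assumptions, not the paper's configuration):

```python
# Hypothetical depth-matched alignment targets for a SiT-style backbone.
alignment_targets = {
    "block_4":  "dinov2_patch_features",  # shallow layers <-> low-level visual features
    "block_20": "qwen2_vl_embedding",     # deep layers    <-> high-level semantic features
}
```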

Protein Inverse Folding

  • Setup: Discrete diffusion models (ProteinMPNN backbone) trained on PDB, with alignment to AlphaFold3 structure and sequence representations.
  • Results: REED accelerates training by 3.6× and improves sequence recovery, RMSD, and pLDDT metrics. For example, 41.5% sequence recovery is achieved in 70 epochs (vs. 250 for baseline).
  • Ablations: Pairwise residue representations contribute most to performance, but all representation types (single, pair, structure) are beneficial.

Molecule Generation

  • Setup: 3D molecule generation on GEOM-DRUG with SemlaFlow (E(3)-equivariant flow matching), aligned to Unimol representations.
  • Results: REED improves atom/molecule stability, validity, and energy/strain metrics, outperforming state-of-the-art models with significantly fewer epochs and lower sampling cost.

Implementation Considerations

  • Computational Efficiency: The curriculum and representation alignment strategies yield substantial reductions in training time and resource requirements.
  • Modularity: The framework is agnostic to the choice of pretrained representations and can be adapted to various modalities and architectures.
  • Scalability: The approach is demonstrated at scale (e.g., large transformer backbones, high-resolution images) and is compatible with both continuous and discrete diffusion/flow models.
  • Limitations: The effectiveness depends on the quality and relevance of the pretrained representations. Synthetic pairing for multimodal alignment may introduce biases if auxiliary models are not well-calibrated.

Implications and Future Directions

This work provides a principled and extensible approach for leveraging external representations in generative modeling. The demonstrated gains in efficiency and sample quality suggest that representation-guided diffusion will be a key paradigm, especially as high-quality pretrained models proliferate across domains.

Potential future developments include:

  • Adaptive Weighting Schedules: Learning or dynamically adjusting $\{\alpha_t\}$ based on training progress or data characteristics.
  • Broader Modalities: Extending to video, audio, and other structured data, leveraging domain-specific pretrained encoders.
  • Joint Representation and Generation Pretraining: Co-training encoders and diffusion models for end-to-end optimization.
  • Theoretical Analysis of Multimodal Alignment: Further quantifying the benefits and potential pitfalls of synthetic pairing and cross-modal guidance.

In summary, this paper advances both the theoretical understanding and practical methodology for integrating flexible representation guidance into diffusion models, with strong empirical validation across diverse and challenging generative tasks.
