Learning Diffusion Models with Flexible Representation Guidance (2507.08980v1)

Published 11 Jul 2025 in cs.LG, cs.AI, and cs.CV

Abstract: Diffusion models can be improved with additional guidance towards more effective representations of input. Indeed, prior empirical work has already shown that aligning internal representations of the diffusion model with those of pre-trained models improves generation quality. In this paper, we present a systematic framework for incorporating representation guidance into diffusion models. We provide alternative decompositions of denoising models along with their associated training criteria, where the decompositions determine when and how the auxiliary representations are incorporated. Guided by our theoretical insights, we introduce two new strategies for enhancing representation alignment in diffusion models. First, we pair examples with target representations either derived from themselves or arisen from different synthetic modalities, and subsequently learn a joint model over the multimodal pairs. Second, we design an optimal training curriculum that balances representation learning and data generation. Our experiments across image, protein sequence, and molecule generation tasks demonstrate superior performance as well as accelerated training. In particular, on the class-conditional ImageNet $256\times 256$ benchmark, our guidance results in $23.3$ times faster training than the original SiT-XL as well as four times speedup over the state-of-the-art method REPA. The code is available at https://github.com/ChenyuWang-Monica/REED.

Summary

  • The paper introduces a unified variational framework that flexibly incorporates pretrained semantic features into the reverse diffusion process.
  • It presents REED, which leverages multimodal representation alignment and curriculum strategies to significantly boost training efficiency and model performance.
  • Empirical results across image, protein, and molecule generation demonstrate dramatic speedups and improved quality compared to traditional diffusion approaches.

Learning Diffusion Models with Flexible Representation Guidance: An Expert Overview

This work presents a comprehensive theoretical and practical framework for enhancing diffusion models through flexible integration of pretrained representations. The authors systematically analyze and generalize prior empirical approaches, introduce new strategies for multimodal and curriculum-based guidance, and demonstrate strong empirical results across image, protein, and molecule generation tasks.

Theoretical Framework

The core contribution is a unified variational framework for incorporating auxiliary representations into diffusion models. The authors extend the standard DDPM formulation by introducing a latent variable $z$ (or a hierarchy $\{z_l\}$) representing pretrained semantic features. The generative process is parameterized to allow $z$ to be injected at arbitrary points in the reverse diffusion chain, controlled by a weighting schedule $\{\alpha_t\}$. This leads to a hybrid conditional distribution that interpolates between unconditional and representation-guided reverse steps.
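One schematic way to write this family of models (our notation; the paper's exact parameterization may differ) is as a mixture over the timestep at which $z$ enters the reverse chain:

$$p_\theta^{\{\alpha_t\}}(x_{0:T}, z) \;=\; \sum_{t=1}^{T} \alpha_t \left[\, p(x_T) \prod_{s=t+1}^{T} p_\theta(x_{s-1} \mid x_s) \;\cdot\; p_\theta(z \mid x_t) \prod_{s=1}^{t} p_\theta(x_{s-1} \mid x_s, z) \right], \qquad \alpha_t \ge 0,\; \sum_{t=1}^{T} \alpha_t = 1.$$

Each term runs the reverse chain unconditionally down to timestep $t$, predicts $z$ from $x_t$, and conditions all remaining steps on $z$; the schedule $\{\alpha_t\}$ controls how much weight each injection point receives.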

Key theoretical insights include:

  • Flexible Decomposition: The joint model $p_\theta(x_{0:T}, z)$ can be decomposed at any timestep $t$, allowing $z$ to be introduced at different stages. This flexibility is formalized via a convex combination of decompositions, parameterized by $\{\alpha_t\}$ (a sampling sketch follows this list).
  • Multi-Latent Hierarchies: The framework naturally extends to multiple representations at different abstraction levels, enabling integration of diverse modalities and hierarchical features.
  • Unification of Prior Methods: Existing approaches such as RCG and REPA are shown to be special cases within this framework, corresponding to specific choices of $\{\alpha_t\}$ and latent structure.
  • Provable Distributional Benefits: Theoretical bounds are provided on the total variation distance between the model and data distributions, showing that representation alignment can provably reduce score estimation error and improve sample quality.
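To make the mixture concrete, here is a minimal Python sketch of ancestral sampling under such a schedule. It assumes a hypothetical denoiser exposing a `reverse_step(x, t, z)` method (with `z=None` meaning an unconditional step) and a hypothetical `rep_predictor` that infers $z$ from a noisy state; neither name comes from the paper or its codebase.

```python
import torch

def sample_with_flexible_guidance(model, rep_predictor, alphas, T, shape):
    """Ancestral sampling from a convex combination of decompositions.

    alphas: length-T tensor of nonnegative weights summing to 1 (assumed
    schedule); the representation z is injected at a timestep drawn from it.
    """
    x = torch.randn(shape)                        # x_T ~ N(0, I)
    t_inject = int(torch.multinomial(alphas, 1))  # when z enters the chain
    z = None
    for t in reversed(range(T)):
        if z is None and t == t_inject:
            z = rep_predictor(x, t)               # infer z from x_t (hypothetical)
        x = model.reverse_step(x, t, z)           # z=None -> unconditional step
    return x
```

Setting all mass of `alphas` on a single timestep recovers a fixed injection point; spreading it out interpolates between unconditional and representation-guided sampling, matching the hybrid distribution described above.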

Practical Strategies: REED

Building on the theoretical foundation, the authors introduce REED (Representation-Enhanced Elucidation of Diffusion), which operationalizes two main strategies:

  1. Multimodal Representation Alignment: By pairing data with synthetic or cross-modal representations (e.g., image-text, sequence-structure), the model leverages complementary information. Synthetic data is generated using auxiliary models (e.g., VLMs for images, AlphaFold3 for proteins), and alignment is enforced via similarity losses between model features and pretrained representations.
  2. Curriculum Learning: Training is scheduled such that representation alignment is emphasized early, with the diffusion loss weight increasing over time. This phase-in protocol ensures that the model first learns to extract and align semantic features before focusing on data generation, improving both convergence and generalization. A combined sketch of the alignment loss and curriculum weighting follows this list.
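The following is a minimal PyTorch-style sketch of a single training step combining both strategies. It is illustrative only: the `add_noise`/`return_hidden` interface, the cosine-similarity alignment loss (in the spirit of REPA-style alignment), and the linear warmup schedule are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def curriculum_weight(step, warmup_steps=50_000):
    # Diffusion-loss weight ramps from 0 to 1, so alignment dominates early on.
    return min(1.0, step / warmup_steps)

def training_step(denoiser, proj_head, frozen_encoder, x0, t, step):
    noise = torch.randn_like(x0)
    x_t = denoiser.add_noise(x0, noise, t)             # forward diffusion (assumed API)
    pred_noise, hidden = denoiser(x_t, t, return_hidden=True)
    diffusion_loss = F.mse_loss(pred_noise, noise)

    with torch.no_grad():
        target = frozen_encoder(x0)                    # pretrained representation of clean data
    # Alignment: maximize cosine similarity between projected features and target.
    align_loss = 1.0 - F.cosine_similarity(proj_head(hidden), target, dim=-1).mean()

    lam = curriculum_weight(step)
    return lam * diffusion_loss + align_loss
```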

Empirical Results

The framework is instantiated and evaluated in three domains:

Image Generation

  • Setup: Class-conditional ImageNet $256\times256$ with SiT architectures, aligning with DINOv2 and Qwen2-VL representations.
  • Results: REED achieves a 23.3× training speedup over vanilla SiT-XL and 4× over REPA, reaching FID=8.2 in 300K iterations (vs. 7M for SiT-XL). With classifier-free guidance, REED matches REPA's FID=1.80 at 200 epochs (vs. 800 for REPA).
  • Ablations: Optimal alignment is achieved by matching shallow model layers to low-level image features and deeper layers to high-level VLM embeddings, confirming the theoretical predictions about hierarchical representation utility (schematically illustrated below).
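Schematically, the ablation suggests a depth-matched pairing along these lines (layer indices and target names are illustrative assumptions, not the paper's configuration):

```python
# Hypothetical depth-matched alignment targets for a SiT-style backbone.
alignment_targets = {
    "block_4":  "dinov2_patch_features",  # shallow layers <-> low-level visual features
    "block_20": "qwen2_vl_embedding",     # deep layers    <-> high-level semantic features
}
```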

Protein Inverse Folding

  • Setup: Discrete diffusion models (ProteinMPNN backbone) trained on PDB, with alignment to AlphaFold3 structure and sequence representations.
  • Results: REED accelerates training by 3.6× and improves sequence recovery, RMSD, and pLDDT metrics. For example, 41.5% sequence recovery is achieved in 70 epochs (vs. 250 for baseline).
  • Ablations: Pairwise residue representations contribute most to performance, but all representation types (single, pair, structure) are beneficial.

Molecule Generation

  • Setup: 3D molecule generation on GEOM-DRUG with SemlaFlow (E(3)-equivariant flow matching), aligned to Unimol representations.
  • Results: REED improves atom/molecule stability, validity, and energy/strain metrics, outperforming state-of-the-art models with significantly fewer epochs and lower sampling cost.

Implementation Considerations

  • Computational Efficiency: The curriculum and representation alignment strategies yield substantial reductions in training time and resource requirements.
  • Modularity: The framework is agnostic to the choice of pretrained representations and can be adapted to various modalities and architectures.
  • Scalability: The approach is demonstrated at scale (e.g., large transformer backbones, high-resolution images) and is compatible with both continuous and discrete diffusion/flow models.
  • Limitations: The effectiveness depends on the quality and relevance of the pretrained representations. Synthetic pairing for multimodal alignment may introduce biases if auxiliary models are not well-calibrated.

Implications and Future Directions

This work provides a principled and extensible approach for leveraging external representations in generative modeling. The demonstrated gains in efficiency and sample quality suggest that representation-guided diffusion will be a key paradigm, especially as high-quality pretrained models proliferate across domains.

Potential future developments include:

  • Adaptive Weighting Schedules: Learning or dynamically adjusting $\{\alpha_t\}$ based on training progress or data characteristics.
  • Broader Modalities: Extending to video, audio, and other structured data, leveraging domain-specific pretrained encoders.
  • Joint Representation and Generation Pretraining: Co-training encoders and diffusion models for end-to-end optimization.
  • Theoretical Analysis of Multimodal Alignment: Further quantifying the benefits and potential pitfalls of synthetic pairing and cross-modal guidance.

In summary, this paper advances both the theoretical understanding and practical methodology for integrating flexible representation guidance into diffusion models, with strong empirical validation across diverse and challenging generative tasks.
