Transferability of noise scheduling and loss re-weighting to high-dimensional semantic latents

Determine whether previously proposed noise scheduling and loss re-weighting strategies for diffusion models, designed for pixel-space data or variational autoencoder (VAE) latents, transfer effectively to the high-dimensional semantic token latents produced by pretrained representation encoders in Representation Autoencoders (RAEs), and identify the conditions under which such strategies require modification.

Background

In adapting diffusion transformers to operate in the high-dimensional latent spaces produced by frozen representation encoders, the paper identifies several challenges responsible for initial training failures. Among these is the fact that prior noise scheduling and loss re-weighting methods were originally designed for pixel-space data or low-dimensional VAE latents, not for semantic token latents.

Because RAEs use high-dimensional semantic tokens, the applicability of these established scheduling and re-weighting techniques is uncertain. The authors highlight this uncertainty explicitly before introducing their dimension-dependent schedule shift as a solution, noting that the general transferability of earlier approaches to semantic latents has not been established.
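The kind of dimension-dependent schedule shift alluded to above can be sketched as follows. This is an illustrative implementation, not the paper's exact method: the shift factor `alpha = sqrt(latent_dim / base_dim)` and the timestep warp `t' = alpha*t / (1 + (alpha - 1)*t)` are assumptions modeled on resolution-dependent schedule shifts used in prior flow-matching work, and the function and parameter names are hypothetical.

```python
import math

def shift_timestep(t: float, latent_dim: int, base_dim: int) -> float:
    """Warp a flow-matching timestep t in [0, 1] so that training spends
    more time at high noise levels when the latent dimension grows.

    Assumed parameterization (not confirmed by the source): the shift
    factor scales with the square root of the dimension ratio, and is
    applied through the standard timestep warp t' = a*t / (1 + (a-1)*t),
    which fixes t=0 and t=1 and pushes intermediate t toward 1 when a > 1.
    """
    alpha = math.sqrt(latent_dim / base_dim)
    return alpha * t / (1.0 + (alpha - 1.0) * t)

# Illustrative dimensions: a high-dimensional semantic token latent
# (e.g. 768 channels) versus a low-dimensional VAE latent baseline.
print(shift_timestep(0.5, 768, 16))  # shifted above 0.5: more high-noise steps
print(shift_timestep(0.5, 16, 16))   # 0.5: no shift at the base dimension
```

The warp leaves the endpoints of the schedule untouched while reallocating intermediate timesteps, which is why it can be applied on top of an existing schedule without retuning its range.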

References

Prior noise scheduling and loss re-weighting tricks were derived for pixel-space or VAE-based inputs, and it remains unclear whether they transfer well to high-dimensional semantic tokens.

Diffusion Transformers with Representation Autoencoders (2510.11690 - Zheng et al., 13 Oct 2025) in Section 4, "Taming Diffusion Transformers for RAEs" (hypotheses list under "DiT does not work out of the box")