
Cause of training instability in naïve autoencoder replacement

Determine whether the training instability observed when adapting a pre-trained video diffusion model to a new autoencoder, by retaining the pre-trained Diffusion Transformer (DiT) blocks while randomly initializing the patch embedder and output head, is caused by the substantial embedding-space mismatch that the new latent space and the randomly initialized patch embedder introduce, which would prevent the backbone from effectively retaining knowledge encoded in the pre-trained DiT weights.


Background

To adapt a pre-trained video diffusion model to a different video autoencoder, a straightforward method keeps the pre-trained DiT backbone intact while randomly reinitializing the patch embedder and output head tied to the new latent space. Empirically, this approach yielded poor semantic performance and training instability, with outputs degrading to noise after 20K steps.
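To make the setup concrete, here is a minimal PyTorch sketch of this naïve adaptation. It is an illustration under assumptions, not the paper's implementation: the channel counts, hidden size, patch size, and the `NaiveAdaptedDiT` class are all hypothetical, and the unpatchify step after the output head is omitted.

```python
import torch
import torch.nn as nn

# Hypothetical dimensions: the new autoencoder produces latents with a
# different channel count than the one the DiT was pre-trained on.
OLD_LATENT_CH, NEW_LATENT_CH = 16, 64   # assumed values, for illustration
HIDDEN_DIM, PATCH = 1152, 2             # assumed DiT hidden size / patch size

class NaiveAdaptedDiT(nn.Module):
    """Keep the pre-trained DiT blocks; randomly re-initialize only the
    I/O layers tied to the latent space (patch embedder, output head)."""

    def __init__(self, pretrained_blocks: nn.ModuleList):
        super().__init__()
        # Pre-trained transformer blocks are reused as-is.
        self.blocks = pretrained_blocks
        # Patch embedder and output head are freshly initialized for the
        # new latent space -- the suspected source of the embedding gap.
        self.patch_embed = nn.Conv3d(
            NEW_LATENT_CH, HIDDEN_DIM,
            kernel_size=PATCH, stride=PATCH,
        )
        self.head = nn.Linear(HIDDEN_DIM, NEW_LATENT_CH * PATCH**3)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (B, C, T, H, W) latents from the new autoencoder.
        x = self.patch_embed(z).flatten(2).transpose(1, 2)  # (B, N, D)
        for blk in self.blocks:
            x = blk(x)
        return self.head(x)  # unpatchify back to latents omitted here
```

Because only `patch_embed` and `head` carry fresh random weights, the frozen-in-distribution DiT blocks immediately receive token embeddings drawn from an unfamiliar distribution, which is the situation the conjecture below addresses.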

The authors explicitly conjecture that this instability is due to an embedding space gap: the new latent space and a randomly initialized patch embedder create a mismatch that hinders the pre-trained backbone from retaining and leveraging its learned knowledge. Establishing the root cause would guide robust adaptation strategies for autoencoder replacement.
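One way to probe this conjecture empirically is to compare the statistics of the token embeddings entering the DiT blocks under the old versus the new patch embedder. The following diagnostic is a sketch of one such check, not a procedure from the paper; the function name and the choice of first/second-moment statistics are assumptions.

```python
import torch

@torch.no_grad()
def embedding_gap_stats(old_embed, new_embed, old_latents, new_latents):
    """Compare per-dimension moments of the token embeddings fed to the
    DiT blocks. A large mean shift or std ratio far from 1 is one
    concrete signature of the conjectured embedding-space gap."""
    def token_stats(embedder, latents):
        tokens = embedder(latents).flatten(2).transpose(1, 2)  # (B, N, D)
        return tokens.mean(dim=(0, 1)), tokens.std(dim=(0, 1))

    mu_old, sd_old = token_stats(old_embed, old_latents)
    mu_new, sd_new = token_stats(new_embed, new_latents)
    return {
        "mean_shift": (mu_new - mu_old).norm().item(),
        "std_ratio": (sd_new / sd_old.clamp_min(1e-6)).mean().item(),
    }
```

If such statistics diverge sharply at initialization and the divergence correlates with the onset of collapse, that would support the gap hypothesis; if they match yet instability persists, the root cause lies elsewhere.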

References

We conjecture that this instability arises from the substantial embedding space gap introduced by the new latent space and the randomly initialized patch embedder, which prevents the model from effectively retaining knowledge from the pre-trained DiT weights.

DC-VideoGen: Efficient Video Generation with Deep Compression Video Autoencoder (Chen et al., arXiv:2509.25182, 29 Sep 2025), Section 3.3.1, Naïve Approach, Challenge and Analysis.