Cause of training instability in naïve autoencoder replacement
Determine the cause of the training instability observed when adapting a pre-trained video diffusion model to a new autoencoder. In the naïve approach, the pre-trained Diffusion Transformer (DiT) blocks are retained while the patch embedder and output head are randomly initialized. The question is whether the instability is caused by the substantial embedding space mismatch that the new latent space and the randomly initialized patch embedder introduce, which would prevent the model from effectively retaining the knowledge in the pre-trained DiT weights.
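One way to make the conjectured mismatch concrete is to compare the tokens the retained DiT blocks originally received with the tokens they receive after the naïve replacement. The following is a minimal toy sketch, not the paper's procedure: the dictionary layout, the names naive_replace and mean_token_cosine, the dimensions, and the Gaussian stand-ins for latent patches and "pre-trained" weights are all assumptions made for illustration. In a real diagnostic, the two patch sets would come from the same video clips encoded by the old and new autoencoders.

```python
import math
import random


def random_matrix(rows, cols, rng):
    """Dense weight matrix with Gaussian entries scaled by 1/sqrt(cols)."""
    scale = 1.0 / math.sqrt(cols)
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def linear(weights, x):
    """Apply a bias-free dense layer: one output value per weight row."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def naive_replace(pretrained, new_latent_dim, rng):
    """Hypothetical helper: keep the DiT blocks, re-initialize embedder and head."""
    hidden = len(pretrained["patch_embedder"])
    return {
        "patch_embedder": random_matrix(hidden, new_latent_dim, rng),
        "blocks": pretrained["blocks"],  # retained unchanged
        "output_head": random_matrix(new_latent_dim, hidden, rng),
    }


def mean_token_cosine(embed_a, patches_a, embed_b, patches_b):
    """Average cosine similarity between paired tokens from two embedders."""
    sims = [
        cosine(linear(embed_a, p), linear(embed_b, q))
        for p, q in zip(patches_a, patches_b)
    ]
    return sum(sims) / len(sims)


rng = random.Random(0)
OLD_DIM, NEW_DIM, HIDDEN, NUM_PATCHES = 16, 64, 32, 64  # assumed toy sizes

# Stand-in for a pre-trained model; real weights would be loaded from a checkpoint.
pretrained = {
    "patch_embedder": random_matrix(HIDDEN, OLD_DIM, rng),
    "blocks": [random_matrix(HIDDEN, HIDDEN, rng) for _ in range(2)],
    "output_head": random_matrix(OLD_DIM, HIDDEN, rng),
}
adapted = naive_replace(pretrained, NEW_DIM, rng)

# Stand-ins for the same patches encoded by the old and the new autoencoder.
old_patches = [[rng.gauss(0.0, 1.0) for _ in range(OLD_DIM)] for _ in range(NUM_PATCHES)]
new_patches = [[rng.gauss(0.0, 1.0) for _ in range(NEW_DIM)] for _ in range(NUM_PATCHES)]

# Reference: tokens compared with themselves, so the similarity is 1 up to rounding.
baseline = mean_token_cosine(
    pretrained["patch_embedder"], old_patches,
    pretrained["patch_embedder"], old_patches,
)
# After naive replacement: tokens from the new latent and the random embedder.
gap = mean_token_cosine(
    pretrained["patch_embedder"], old_patches,
    adapted["patch_embedder"], new_patches,
)
print(f"baseline similarity: {baseline:.3f}, after naive replacement: {gap:.3f}")
```

In this toy setting the similarity after replacement is close to zero by construction, because the inputs and embedders are independent. It therefore only illustrates what the measurement would look like and provides no evidence for or against the conjecture. A low similarity on real paired latents would be consistent with the embedding space gap, but establishing it as the cause of the instability would still require a controlled training comparison.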
References
We conjecture that this instability arises from the substantial embedding space gap introduced by the new latent space and the randomly initialized patch embedder, which prevents the model from effectively retaining knowledge from the pre-trained DiT weights.
— DC-VideoGen: Efficient Video Generation with Deep Compression Video Autoencoder
(2509.25182 - Chen et al., 29 Sep 2025) in Section 3.3.1, Naïve Approach, Challenge and Analysis