- The paper introduces DC-VideoGen, a post-training acceleration framework built on a deep compression video autoencoder (DC-AE-V), delivering up to 14.8× inference speedup with minimal quality degradation.
- It employs a novel chunk-causal temporal modeling strategy that combines bidirectional modeling within chunks with causality across chunks, preserving reconstruction quality while generalizing to longer videos.
- The framework uses AE-Adapt-V with LoRA fine-tuning to align embedding spaces and reduce adaptation costs by 230× compared to full pre-training.
DC-VideoGen: Efficient Video Generation with Deep Compression Video Autoencoder
Introduction
DC-VideoGen presents a post-training acceleration framework for video diffusion models, targeting the prohibitive computational costs associated with high-resolution and long-duration video synthesis. The framework is designed to be model-agnostic, enabling efficient adaptation of any pre-trained video diffusion model to a highly compressed latent space with minimal fine-tuning. The two principal innovations are: (1) a Deep Compression Video Autoencoder (DC-AE-V) with a novel chunk-causal temporal modeling strategy, and (2) AE-Adapt-V, a robust adaptation protocol for transferring pre-trained models into the new latent space. The result is a system that achieves up to 14.8× inference speedup and supports 2160×3840 video generation on a single NVIDIA H100 GPU, with negligible quality degradation and drastically reduced training costs.
Figure 1: DC-VideoGen can generate high-quality videos on a single NVIDIA H100 GPU with resolutions ranging from 480px to 2160px, achieving up to 14.8× acceleration over the Wan-2.1-T2V-1.3B model.
Deep Compression Video Autoencoder (DC-AE-V)
Motivation and Design
Conventional video autoencoders in latent diffusion models typically achieve moderate compression ratios (e.g., 8× spatial, 4× temporal), which are insufficient for efficient high-resolution or long video generation. DC-AE-V addresses this by achieving 32×/64× spatial and 4× temporal compression, while maintaining high reconstruction fidelity and generalization to longer videos.
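To make the impact of deeper compression concrete, the rough count below compares how many latent tokens a diffusion backbone must process under a conventional 8× spatial autoencoder versus a 32× spatial DC-AE-V configuration. The clip shape, patch size, and simple floor-division frame count are illustrative assumptions, not values taken from the paper.

```python
# Rough latent-token count after the autoencoder and patch embedder.
def num_latent_tokens(frames, height, width, t_down, s_down, patch=1):
    t = frames // t_down
    h = height // (s_down * patch)
    w = width // (s_down * patch)
    return t * h * w

video = dict(frames=80, height=480, width=832)  # illustrative clip size

conventional = num_latent_tokens(**video, t_down=4, s_down=8)   # 8x spatial, 4x temporal
deep         = num_latent_tokens(**video, t_down=4, s_down=32)  # DC-AE-V: 32x spatial, 4x temporal

print(conventional, deep, conventional / deep)
# 124800 7800 16.0 -> ~16x fewer tokens, and attention cost scales
# roughly quadratically with token count.
```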
The core innovation is the chunk-causal temporal modeling paradigm. Unlike strictly causal or non-causal designs, chunk-causal modeling divides the video into fixed-size chunks, applies bidirectional temporal modeling within each chunk, and enforces causality across chunks. This enables the model to exploit intra-chunk redundancy for reconstruction quality, while maintaining the ability to generalize to longer sequences at inference.
Figure 2: Illustration of chunk-causal temporal modeling in DC-AE-V, which enables bidirectional flow within chunks and causal flow across chunks, balancing reconstruction quality and generalization.
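A minimal sketch of how a chunk-causal temporal attention mask could be constructed is shown below, assuming temporal attention over latent frames with a boolean mask; the chunk size and masking mechanism are illustrative rather than the paper's exact implementation.

```python
import torch

def chunk_causal_mask(num_frames: int, chunk_size: int) -> torch.Tensor:
    """True where latent frame i may attend to latent frame j."""
    chunk_id = torch.arange(num_frames) // chunk_size
    # Equal chunk ids -> bidirectional attention within a chunk;
    # strictly smaller chunk id -> causal attention across chunks.
    return chunk_id[:, None] >= chunk_id[None, :]

mask = chunk_causal_mask(num_frames=8, chunk_size=4)
# Frames 0-3 (chunk 0) attend only to each other; frames 4-7 (chunk 1)
# attend to all of chunk 0 plus all of chunk 1, but never the reverse.
```

Because each chunk only ever looks backward, the same mask pattern extends to arbitrarily many chunks at inference time, which is the property underlying generalization to longer videos.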
Empirical Analysis
Empirical results demonstrate that DC-AE-V outperforms both causal and non-causal autoencoders in PSNR, SSIM, LPIPS, and FVD at equivalent or higher compression ratios. Notably, non-causal autoencoders achieve strong reconstruction on short clips but fail to generalize to longer videos, since their temporal dependencies span the entire training-length clip, often resulting in flickering and boundary artifacts.
Figure 3: Video autoencoder reconstruction visualization. Causal autoencoders degrade under deep compression, while non-causal models generalize poorly to longer videos.
Ablation studies on chunk size reveal that increasing chunk size improves reconstruction up to a saturation point, beyond which computational cost outweighs marginal gains.
AE-Adapt-V: Post-Training Adaptation Protocol
Naïve Adaptation Pitfalls
Directly replacing the autoencoder in a pre-trained video diffusion model and randomly initializing the patch embedder and output head leads to severe training instability and suboptimal quality. The embedding space mismatch disrupts the transfer of semantic knowledge from the pre-trained backbone, often resulting in catastrophic forgetting or output collapse.
Video Embedding Space Alignment
AE-Adapt-V introduces a two-stage alignment protocol:
- Patch Embedder Alignment: Freeze the base model's patch embedder, train a new patch embedder to minimize the MSE between the new and base embeddings (after spatial downsampling), ensuring the new latent space is mapped compatibly into the pre-trained embedding space.
- Output Head Alignment: With the patch embedder aligned, jointly fine-tune the output head and patch embedder (keeping the backbone frozen) using the diffusion loss until convergence. A minimal sketch of both stages follows this list.
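The sketch below illustrates the two stages with generic PyTorch modules. All module and variable names (new_embedder, base_embedder, downsample, backbone, output_head) are hypothetical placeholders, and the loss targets follow the description above rather than the authors' released code.

```python
import torch
import torch.nn.functional as F

# Stage 1: patch embedder alignment.
# The frozen base embedder (fed base-autoencoder latents) provides target tokens;
# the new embedder (fed DC-AE-V latents) is regressed onto them with MSE.
def stage1_step(new_embedder, base_embedder, downsample,
                dcae_latents, base_latents, optimizer):
    with torch.no_grad():
        target = downsample(base_embedder(base_latents))  # match the coarser token grid
    loss = F.mse_loss(new_embedder(dcae_latents), target)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()

# Stage 2: output head alignment.
# Backbone parameters stay frozen (requires_grad=False) but gradients still
# flow through it to the embedder; the embedder and head are tuned jointly
# with the ordinary diffusion objective.
def stage2_step(new_embedder, backbone, output_head,
                noisy_latents, timesteps, cond, target, optimizer):
    hidden = backbone(new_embedder(noisy_latents), timesteps, cond)
    loss = F.mse_loss(output_head(hidden), target)  # e.g. noise- or v-prediction target
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```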
Figure 4: Illustration of video embedding space alignment, showing the necessity of aligning both the patch embedder and output head for effective adaptation.
This alignment protocol enables rapid recovery of the base model's quality and semantics in the new latent space, as evidenced by both quantitative and qualitative metrics.
End-to-End Fine-Tuning with LoRA
After alignment, lightweight end-to-end fine-tuning is performed using LoRA. LoRA tuning not only reduces the number of trainable parameters but also better preserves the pre-trained model's knowledge compared to full fine-tuning, resulting in higher VBench scores and improved visual quality.
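As a concrete illustration, the sketch below shows a minimal LoRA wrapper for linear layers, along with a helper that wraps every nn.Linear in a module; the rank, scaling, and choice of which projections to target are assumptions for illustration, not the paper's configuration.

```python
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pre-trained linear layer plus a trainable low-rank update:
    y = W x + (alpha / r) * B(A x)."""
    def __init__(self, base: nn.Linear, rank: int = 32, alpha: float = 32.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                      # keep pre-trained weights frozen
        self.down = nn.Linear(base.in_features, rank, bias=False)   # A
        self.up = nn.Linear(rank, base.out_features, bias=False)    # B
        nn.init.zeros_(self.up.weight)                   # zero init: starts identical to base
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.up(self.down(x))

def add_lora(module: nn.Module, rank: int = 32):
    """Recursively replace every nn.Linear in `module` with a LoRA-wrapped copy."""
    for name, child in module.named_children():
        if isinstance(child, nn.Linear):
            setattr(module, name, LoRALinear(child, rank))
        else:
            add_lora(child, rank)
```

Only the low-rank down/up matrices (and, per the protocol above, the already-aligned patch embedder and output head) receive gradients, which keeps the number of trainable parameters small and leaves the frozen backbone's pre-trained knowledge intact.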
Figure 5: Direct fine-tuning without AE-Adapt-V is unstable and yields poor quality. AE-Adapt-V enables rapid, stable adaptation and quality recovery.
Efficiency and Quality Benchmarks
DC-VideoGen achieves substantial acceleration across resolutions and video lengths. For example, at 2160×3840 resolution, DC-VideoGen delivers a 14.8× speedup over the Wan-2.1-T2V-1.3B baseline, with comparable or superior VBench scores. The framework also reduces adaptation cost to 10 H100 GPU days, a 230× reduction compared to full model pre-training.
Figure 6: DC-VideoGen overview, illustrating the post-training acceleration pipeline and efficiency gains.
Qualitative comparisons confirm that the accelerated models retain the visual fidelity and semantic consistency of the base models, even at high resolutions and long durations.
Figure 7: Visual comparison of DC-VideoGen-Wan2.1-I2V-14B and the base model, demonstrating preserved generation quality.
Figure 8: Visual comparison of DC-VideoGen-Wan2.1-T2V-14B and the base model, confirming quality retention post-acceleration.
Practical Implications and Limitations
DC-VideoGen enables practical deployment of high-fidelity video diffusion models in resource-constrained environments, democratizing access to large-scale video synthesis. The framework is model-agnostic and can be applied to any pre-trained video diffusion model, facilitating rapid iteration and innovation in the video generation community.
However, the efficacy of DC-VideoGen is contingent on the quality of the pre-trained model. As a post-training framework, it cannot compensate for deficiencies in the base model's generative capacity. Future work should explore extending the framework for ultra-long video generation and further reducing adaptation costs, potentially by integrating with sparse attention or quantization techniques.
Conclusion
DC-VideoGen establishes a new paradigm for efficient video generation by combining a deep compression autoencoder with a robust adaptation protocol. The framework achieves significant acceleration and cost reduction without sacrificing quality, making high-resolution, long-duration video synthesis feasible on commodity hardware. This work demonstrates that efficiency and fidelity in video generation are not mutually exclusive, and sets the stage for further advances in scalable, accessible generative video modeling.