- The paper presents a novel transformer-based diffusion framework using a highly compressed video autoencoder achieving 384× compression and strong fidelity.
- It introduces a diffusion transformer with a layer memory mechanism that aggregates previous layer states to accelerate convergence and reduce training loss.
- The system employs a multi-resolution upsampling pipeline combining a latent upsampler and a high-resolution refiner, yielding 42.3× faster inference with competitive video quality.
FSVideo: Fast Speed Video Diffusion Model in a Highly-Compressed Latent Space
Overview
The FSVideo framework introduces a fast and efficient transformer-based diffusion model for image-to-video (I2V) generation. The design of FSVideo centers on three principal innovations: a highly-compressed video autoencoder (FSAE) with a 64×64×4 spatiotemporal downsampling ratio, an improved diffusion transformer (DIT) architecture with a novel layer memory mechanism, and a multi-resolution generation pipeline featuring a latent upsampler and a lightweight high-resolution refiner. Together, these components yield a system that delivers competitive video quality at significantly reduced inference cost compared to contemporary large-scale open-source video diffusion systems.
Highly-Compressed Video Autoencoder
FSVideo’s FSAE autoencoder achieves an unprecedented compression rate (a 384× reduction via 64×64×4 downsampling into 128 latent channels) without a significant loss of reconstruction fidelity. The architecture is derived from deep-compressed autoencoder (DC-AE) designs but extends them by:
- Introducing additional transformer blocks to both encoder and decoder to reach the aggressive 64×64 spatial compression.
- Employing causal 3D convolutions enabling joint image-video training and compatibility with temporal splitting and tiling techniques.
- Making significant architectural optimizations in the decoder, including switching from causal to non-causal convolutions to eliminate flickering artifacts, and integrating cross-attention on first-frame features for the image-to-video setting.
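The headline 384× figure follows directly from the stated downsampling factors and channel counts; as a quick back-of-the-envelope check (an arithmetic sanity check, not code from the paper):

```python
# Sanity check of FSAE's stated 384x compression rate.
spatial_down = 64 * 64    # 64x64 spatial downsampling
temporal_down = 4         # 4x temporal downsampling
in_channels = 3           # RGB input channels
latent_channels = 128     # FSAE latent channels

# Each 128-channel latent vector summarizes a 64x64x4 block of RGB voxels,
# so the data-volume reduction is (64*64*4*3) / 128.
compression = spatial_down * temporal_down * in_channels / latent_channels
print(compression)  # 384.0
```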
Semantic alignment in the latent space is enforced through the Video Vision Foundation loss (Video VF Loss), which extends VA-VAE’s feature-matching paradigm to the video domain by aligning FSAE latent representations with DINOv2-extracted features at each spatiotemporal location. Intrinsic dimensionality analysis confirms that this regularization yields a lower-complexity, more generation-friendly latent manifold.
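The per-location alignment can be sketched as follows; this is an illustrative NumPy rendering of the idea (the function name, tensor shapes, and the linear projection `proj_w` are assumptions, not the paper's implementation):

```python
import numpy as np

def video_vf_loss(latents, dino_feats, proj_w):
    """Schematic per-location feature-alignment loss in the spirit of the
    Video VF Loss (names and shapes are illustrative, not from the paper).

    latents:    (T, H, W, C_lat)  FSAE latent video (batch axis omitted)
    dino_feats: (T, H, W, C_dino) DINOv2 features resampled to the same grid
    proj_w:     (C_lat, C_dino)   learned projection into the DINOv2 space
    """
    pred = latents @ proj_w  # project latents into the DINOv2 feature space
    # Cosine similarity at every spatiotemporal location (last axis = channels).
    num = (pred * dino_feats).sum(axis=-1)
    denom = (np.linalg.norm(pred, axis=-1)
             * np.linalg.norm(dino_feats, axis=-1) + 1e-8)
    cos = num / denom
    return float((1.0 - cos).mean())  # 0 when features are perfectly aligned
```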
Two FSAE variants are described: the full FSAE-Standard prioritizing quality, and FSAE-Lite which substantially decreases memory and inference demands via reduced channel counts and group-causal convolutions.
FSVideo’s FSAE demonstrates strong numerical performance: for compression rates of $1:384$, FSAE-Standard achieves SSIMs of 0.806/0.872 (Inter-4K/WebVid-10M) and FVDs of 256.62/203.19, outperforming LTX-Video and matching or surpassing other state-of-the-art autoencoders at comparable or lower compression [(2602.02092), Table 2].
Diffusion Transformer with Layer Memory
The FSVideo base model employs an enhanced DIT backbone (inspired by Wan2.1-14B), modified to operate in the aggressively compressed FSAE latent space. Its principal architectural contribution is a layer memory self-attention mechanism: each transformer layer constructs its keys and values not only from its immediate predecessor but from a dynamically routed, learned aggregation of the hidden states of all previous layers. This design, inspired by cross-layer information routing techniques from the language modeling literature (e.g., LIMe), mitigates representation collapse and facilitates depth-wise information reuse.
Analysis of router weights reveals that, while most attention goes to neighboring layers (as in standard transformers), deeper layers occasionally access earlier or even first-layer representations, especially for tokens corresponding to the initial frame, reflecting the hybrid image-to-video conditioning setup. This mechanism demonstrably accelerates convergence and consistently achieves lower training losses, both when training from scratch and in fine-tuning scenarios.
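A minimal sketch of the routed aggregation, assuming a per-token softmax router over stored layer states (the function names and exact parameterization are illustrative, not taken from the paper):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_memory_kv_source(layer_states, router_logits):
    """Schematic layer-memory aggregation: mix the hidden states of all
    previous layers with learned per-token router weights, producing the
    sequence from which the current layer's keys and values are computed.

    layer_states:  list of L arrays, each (N, D) - states of layers 0..L-1
    router_logits: (N, L) - per-token routing scores over the L stored layers
    """
    stack = np.stack(layer_states, axis=1)       # (N, L, D)
    weights = softmax(router_logits, axis=-1)    # (N, L), sums to 1 per token
    mixed = (weights[..., None] * stack).sum(1)  # (N, D) routed aggregation
    return mixed  # keys/values are then projected from this mixture
```

A standard transformer layer corresponds to placing all routing weight on `layer_states[-1]`; the learned router lets deeper layers revisit early representations instead, e.g., those carrying first-frame conditioning.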
The DIT training process follows a staged curriculum: image pretraining for text-visual alignment, low-resolution video learning, and final high-resolution video generation, complemented by RL-based refinement using open-source reward models (VideoAlign, MPS) and resource-efficient strategies (e.g., selective frame evaluation, dynamic reward modeling).
Multiresolution Video Upsampler and Refiner
High spatial compression in FSAE inevitably reduces output fidelity. FSVideo addresses this with a two-step multiresolution upsampling pipeline:
- Latent Upsampler: A convolutional network (projection layer, pixel-shuffle, residual blocks) upsamples the low-resolution FSAE latents to the target resolution, using losses targeting both latent and image domains.
- High-Resolution Refiner DIT: A lightweight DIT-based refiner (with distillation-reduced NFEs) is trained to enhance fidelity. Training incorporates dynamic masking (where the mask value for each frame reflects the estimation error between real and upsampled first-frame latents), deviation-based latent estimation (to discourage overfitting to low-resolution inputs), and extensive data augmentation (conditional dropout, frame shuffling).
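The channel-to-space rearrangement at the heart of the latent upsampler can be illustrated with a plain NumPy pixel shuffle (a schematic of that one operation only; the actual upsampler also includes a projection layer and residual blocks):

```python
import numpy as np

def pixel_shuffle(x, r):
    """Rearrange channels into space, the pixel-shuffle step of a latent
    upsampler: (C*r*r, H, W) -> (C, H*r, W*r). Shown frame-by-frame;
    batch and time axes are omitted for clarity."""
    c_r2, h, w = x.shape
    c = c_r2 // (r * r)
    x = x.reshape(c, r, r, h, w)
    x = x.transpose(0, 3, 1, 4, 2)   # interleave: (c, h, r, w, r)
    return x.reshape(c, h * r, w * r)
```

For example, with `r=2` a single spatial position holding 4 channels becomes a 2×2 pixel block in one output channel, which is how the upsampler trades latent depth for spatial resolution.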
Distillation techniques (CFG, progressive, SiDA) and efficient RL (GRPO, MixGRPO) are applied to minimize inference cost while maximizing restoration capability.
Experimental Results
FSVideo is evaluated against leading models on VBench 2.0 and through human preference studies. Key results:
- Video quality: FSVideo achieves a Total Score of 88.12% on VBench-2.0 (720×1280), closely matching the best public models (Step-Video-TI2V: 88.36%) and surpassing all Wan 2.1-14B-based models despite a much higher compression ratio [(2602.02092), Table 4].
- Human preferences: FSVideo is strongly preferred over HunyuanVideo and LTX-Video and matches Wan 2.1-14B in subjective quality.
- Inference speed: In a dual-GPU setting (2× H100s), FSVideo is 42.3× faster than Wan2.1-14B for 5 s 720×1280 video generation at a similar NFE. Prospects for further speed-ups via quantization and step reduction are demonstrated [(2602.02092), Table 5].
FSVideo’s design allows combining with orthogonal acceleration methods (e.g., caching, step distillation) for multiplicative gains, and its compression-centric approach sidesteps the common tradeoff of reducing model capacity at the expense of realism.
Implications and Future Directions
FSVideo demonstrates that aggressive latent space compression, if balanced by architectural advances (layer memory attention, dynamic upsampling/refinement), can achieve substantial reductions in inference cost without corresponding losses in generation quality. This approach points toward a new axis of efficiency: minimizing token count per model step, rather than shrinking model width/depth.
Implications for future research include:
- Extending the latent compression paradigm for even higher spatial/temporal resolutions and longer video sequences.
- Integrating more advanced, possibly modality-agnostic, information routing mechanisms throughout transformer blocks to further counter over-smoothing and enhance compositionality.
- Exploring joint video-audio or multimodal generative models using the FSVideo pipeline.
- Investigating training protocols that maximize “token efficiency”, ensuring each compressed latent carries maximal information content, potentially leveraging representation-learning advances from both the vision and language domains.
- Broadening RL-based reward-model training pipelines, customizing feedback to application-specific attributes (temporal coherence, cross-frame consistency, etc.).
Conclusion
FSVideo presents a high-throughput, high-compression pipeline for image-to-video generation that retains strong perceptual quality. Its combination of a 384× downsampling autoencoder, a DIT with differentiable layer memory, and a staged high-res refinement process yields unprecedented speedups (over 40×) at scale without significant quality regression. The architecture and training innovations of FSVideo provide a blueprint for further research on efficient, scalable, and high-fidelity generative video models (2602.02092).