- The paper introduces a hybrid model, Show-1, that marries pixel-based and latent diffusion techniques to enhance text-to-video generation.
- It achieves a significant reduction in GPU memory usage (15 GB vs. 72 GB for pixel-based VDMs) while maintaining high-quality output through a coarse-to-fine generation pipeline.
- Benchmarks on UCF-101 and MSR-VTT show superior or comparable Inception Score (IS) and Fréchet Video Distance (FVD), underscoring its efficacy.
Insights into "Show-1: Marrying Pixel and Latent Diffusion Models for Text-to-Video Generation"
This paper presents "Show-1," an innovative approach to text-to-video generation that effectively combines pixel-based and latent-based Video Diffusion Models (VDMs). The research focuses on addressing the limitations inherent in each type of VDM: the high computational cost of pixel-based VDMs and the challenge of precise text-video alignment in latent-based VDMs.
Key Contributions
- Hybrid Model Introduction: The authors introduce Show-1, a hybrid model that marries pixel-based and latent-based VDMs. Pixel-based VDMs first generate low-resolution videos with strong text-video alignment; a novel latent-based expert translation method then upscales these videos, delivering both efficiency and quality (see the sketch after this list).
- Computational Efficiency: Show-1 substantially reduces GPU memory usage compared to pure pixel-based VDMs (15 GB vs. 72 GB) while maintaining high-quality output, striking a balance between resource usage and output fidelity.
- Benchmark Performance: The model's efficacy is validated against standard video generation benchmarks, such as UCF-101 and MSR-VTT, where it achieves superior or comparable performance in metrics like inception score (IS) and Fréchet Video Distance (FVD).
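To make the division of labor concrete, the sketch below shows how such a two-stage hand-off might be wired in PyTorch. The module names (`PixelVDM`, `LatentUpscalerVDM`), the resolutions, and the single-pass forward calls are illustrative placeholders, not the paper's actual interfaces; real VDMs would run iterative denoising inside each stage.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PixelVDM(nn.Module):
    """Stand-in for a pixel-space VDM: text embedding -> low-res video (B, C, T, H, W)."""
    def __init__(self, frames: int = 8, height: int = 40, width: int = 64):
        super().__init__()
        self.frames, self.height, self.width = frames, height, width

    def forward(self, text_emb: torch.Tensor) -> torch.Tensor:
        # A real model would run iterative denoising conditioned on the text here.
        batch = text_emb.shape[0]
        return torch.rand(batch, 3, self.frames, self.height, self.width)


class LatentUpscalerVDM(nn.Module):
    """Stand-in for a latent-space VDM that upscales a low-res video."""
    def __init__(self, scale: int = 4):
        super().__init__()
        self.scale = scale

    def forward(self, low_res: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # A real model would encode to latents, denoise, and decode; here we
        # only upsample spatially to show the tensor flow.
        return F.interpolate(low_res, scale_factor=(1, self.scale, self.scale),
                             mode="trilinear", align_corners=False)


def generate(text_emb: torch.Tensor) -> torch.Tensor:
    """Coarse-to-fine hand-off: pixel VDM for alignment, latent VDM for resolution."""
    keyframes = PixelVDM()(text_emb)                 # cheap, well-aligned, low-res
    return LatentUpscalerVDM()(keyframes, text_emb)  # efficient upscaling


video = generate(torch.randn(1, 77, 768))  # e.g. CLIP-style text features
print(video.shape)                         # torch.Size([1, 3, 8, 160, 256])
```

The point the sketch captures is that the expensive, text-conditioned generation happens only at low resolution, while the resolution increase is delegated to a cheaper latent-space stage.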
Technical Approach
The proposed Show-1 model follows a coarse-to-fine video generation pipeline:
- Keyframe Generation: Initial keyframes are produced using pixel-based VDMs, resulting in low-resolution sequences that prioritize accurate text-video alignment.
- Temporal Interpolation: A pixel-based temporal interpolation module increases temporal resolution by interpolating between keyframes, improving motion coherence (a minimal sketch follows this list).
- Super-Resolution: The core innovation lies in the super-resolution phase, where latent-based VDMs perform expert translation to upscale the video from low to high resolution. This hierarchical structure keeps computational cost low while preserving text alignment and visual fidelity.
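Below is a minimal sketch of the frame-doubling idea behind the interpolation step. The `midframe_model` argument is a hypothetical stand-in for the paper's pixel-based interpolation module; a plain average is used as a placeholder so the snippet runs on its own.

```python
import torch


def interpolate_frames(keyframes: torch.Tensor, midframe_model=None) -> torch.Tensor:
    """Roughly double the temporal resolution of a (B, C, T, H, W) clip.

    `midframe_model` stands in for a learned module that synthesises the frame
    between two keyframes; without it, a simple blend is used as a placeholder.
    """
    b, c, t, h, w = keyframes.shape
    frames = []
    for i in range(t - 1):
        left, right = keyframes[:, :, i], keyframes[:, :, i + 1]
        if midframe_model is not None:
            mid = midframe_model(left, right)  # learned in-between frame
        else:
            mid = 0.5 * (left + right)         # placeholder blend
        frames.extend([left, mid])
    frames.append(keyframes[:, :, -1])
    return torch.stack(frames, dim=2)          # (B, C, 2T - 1, H, W)


clip = torch.rand(1, 3, 8, 40, 64)
print(interpolate_frames(clip).shape)  # torch.Size([1, 3, 15, 40, 64])
```

In Show-1 this step is itself a pixel-based module; the placeholder blend here only illustrates how the temporal axis grows before super-resolution is applied.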
Employing latent VDMs for the final super-resolution step is what sets Show-1 apart, offering a computationally lightweight stage that maintains the visual and semantic integrity of the low-resolution input; one way to read this step is sketched below.
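As a hedged sketch of the expert-translation idea: naively upsample the low-resolution output, encode it into latent space, perturb it with a moderate amount of noise, and let the latent denoiser re-synthesise high-frequency detail while the overall content is preserved. The `encoder`, `decoder`, and `denoiser` arguments, the noise schedule, and the starting step are hypothetical stand-ins, not the paper's actual components.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LatentExpertTranslator(nn.Module):
    """Sketch of latent-space super-resolution as a partial noise-then-denoise pass."""
    def __init__(self, encoder, decoder, denoiser, start_step: int = 400, total_steps: int = 1000):
        super().__init__()
        self.encoder, self.decoder, self.denoiser = encoder, decoder, denoiser
        self.start_step, self.total_steps = start_step, total_steps

    @torch.no_grad()
    def forward(self, low_res: torch.Tensor, text_emb: torch.Tensor, scale: int = 4) -> torch.Tensor:
        # 1. Naive spatial upsampling to the target resolution.
        up = F.interpolate(low_res, scale_factor=(1, scale, scale),
                           mode="trilinear", align_corners=False)
        # 2. Encode to latent space, where diffusion is far cheaper than in pixel space.
        z = self.encoder(up)
        # 3. Inject moderate noise: enough to re-synthesise detail, not enough to lose content.
        alpha = 1.0 - self.start_step / self.total_steps
        z_t = alpha ** 0.5 * z + (1.0 - alpha) ** 0.5 * torch.randn_like(z)
        # 4. Denoise from that intermediate step, conditioned on the text.
        for t in reversed(range(self.start_step)):
            z_t = self.denoiser(z_t, t, text_emb)
        # 5. Decode back to pixel space.
        return self.decoder(z_t)


# Minimal stubs so the sketch executes; a real system would plug in a video VAE and a denoising UNet.
translator = LatentExpertTranslator(encoder=nn.Identity(), decoder=nn.Identity(),
                                    denoiser=lambda z, t, emb: z, start_step=10)
hi_res = translator(torch.rand(1, 3, 8, 40, 64), torch.randn(1, 77, 768))
print(hi_res.shape)  # torch.Size([1, 3, 8, 160, 256])
```

The design point the sketch captures is that the heavy denoising operates on compact latents of an already content-correct video, which is where the memory savings over an end-to-end pixel VDM come from.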
Implications and Future Developments
The research demonstrates a promising direction for enhancing text-to-video generation models, particularly in balancing computational efficiency with output quality. By integrating strong text-video alignment capabilities with efficient super-resolution techniques, Show-1 could be adapted for real-time applications and larger-scale deployments.
Future developments in AI could explore further optimization of latent-based VDMs for more intricate video details and investigate potential biases inherent in datasets to improve the ethical deployment of such models. Additionally, expanding the training datasets and diversifying input scenarios could enhance the model's generalizability across various use cases.
In conclusion, the Show-1 model exemplifies an effective synthesis of different VDM strategies, offering a robust framework for the evolving domain of text-to-video generation. Researchers and practitioners should find this approach beneficial in advancing the capabilities of generative models and applying them to complex, real-world scenarios.