Inflation with Diffusion: Efficient Temporal Adaptation for Text-to-Video Super-Resolution
Abstract: We propose an efficient diffusion-based tuning approach for text-to-video super-resolution (SR) that leverages the capacity a pixel-level image diffusion model has already learned for capturing spatial information. To accomplish this, we design an efficient architecture by inflating the weights of the text-to-image SR model into our video generation framework. Additionally, we incorporate a temporal adapter to ensure temporal coherence across video frames. We investigate different tuning approaches based on our inflated architecture and report the trade-offs between computational cost and super-resolution quality. Quantitative and qualitative evaluation on the Shutterstock video dataset demonstrates that our approach performs text-to-video SR generation with good visual quality and temporal consistency. To evaluate temporal coherence, we also provide visualizations in video format at https://drive.google.com/drive/folders/1YVc-KMSJqOrEUdQWVaI-Yfu8Vsfu_1aO?usp=sharing .
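The two ideas named in the abstract, weight inflation and a temporal adapter, can be illustrated with a minimal NumPy sketch. This is not the paper's architecture: the abstract does not specify the layers, the adapter's form, or its initialization. The names `inflate_image_layer` and `temporal_adapter`, the 1x1 channel-mixing stand-in for a pretrained image layer, and the zero-initialized frame-mixing matrix are all assumptions made for illustration.

```python
import numpy as np

def inflate_image_layer(image_layer):
    """Lift a per-image layer to video by folding time into the batch axis.

    The same (pretrained) image weights are reused for every frame.
    """
    def video_layer(x):  # x: (batch, frames, height, width, channels)
        b, t, h, w, c = x.shape
        y = image_layer(x.reshape(b * t, h, w, c))
        return y.reshape(b, t, *y.shape[1:])
    return video_layer

def temporal_adapter(x, w_time):
    """Residual mixing across frames (hypothetical adapter form).

    w_time has shape (frames, frames):
    out[b, s] = x[b, s] + sum_t w_time[s, t] * x[b, t]
    """
    return x + np.einsum("st,bthwc->bshwc", w_time, x)

# Hypothetical stand-in for a pretrained image SR layer: 1x1 channel mixing.
rng = np.random.default_rng(0)
w_spatial = rng.normal(size=(3, 3))

def image_layer(imgs):  # imgs: (n, height, width, channels)
    return imgs @ w_spatial

video_layer = inflate_image_layer(image_layer)
video = rng.normal(size=(2, 4, 8, 8, 3))  # 2 clips, 4 frames each

# Assumption: zero initialization, so the adapter starts as the identity
# and the inflated model initially matches the image model frame by frame.
w_time = np.zeros((4, 4))
out = temporal_adapter(video_layer(video), w_time)
```

With `w_time` at zero, each output frame equals what the image layer alone would produce for that frame, so tuning starts from the image model's behavior. Once `w_time` becomes nonzero, information flows between frames, which is the kind of mechanism a temporal adapter relies on.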