Inflation with Diffusion: Efficient Temporal Adaptation for Text-to-Video Super-Resolution

Published 18 Jan 2024 in cs.CV (arXiv:2401.10404v1)

Abstract: We propose an efficient diffusion-based text-to-video super-resolution (SR) tuning approach that leverages the readily learned capacity of pixel-level image diffusion models to capture spatial information for video generation. To accomplish this goal, we design an efficient architecture by inflating the weights of the text-to-image SR model into our video generation framework. Additionally, we incorporate a temporal adapter to ensure temporal coherence across video frames. We investigate different tuning approaches based on our inflated architecture and report trade-offs between computational costs and super-resolution quality. Empirical evaluation, both quantitative and qualitative, on the Shutterstock video dataset demonstrates that our approach is able to perform text-to-video SR generation with good visual quality and temporal consistency. To evaluate temporal coherence, we also present visualizations in video format at https://drive.google.com/drive/folders/1YVc-KMSJqOrEUdQWVaI-Yfu8Vsfu_1aO?usp=sharing .


Summary

  • The paper introduces an innovative approach that inflates text-to-image models with a temporal adapter to generate coherent, high-quality video from text.
  • It adapts a U-Net diffusion framework to share weights across frames, balancing fine image detail and smooth temporal transitions.
  • Empirical tests on the Shutterstock video dataset reveal favorable trade-offs between video quality and computational efficiency, measured by PSNR and TCC.

Overview of the Paper

The paper introduces a novel architecture designed to harness existing text-to-image super-resolution models for text-to-video super-resolution. By "inflating" the weights of these image models, the researchers have developed a system that can generate high-quality video from text descriptions without extensive training on video data. To ensure smooth transitions between video frames, a specialized temporal adapter is integrated into the model.
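The core of weight inflation is that a 2-D image model can process a video by folding the time axis into the batch axis, so every frame passes through the same shared image weights. A minimal sketch of this idea, with a stand-in `image_sr_step` (here just nearest-neighbour upsampling; the actual model is a pixel-space diffusion U-Net):

```python
import numpy as np

def image_sr_step(frames_2d, scale=4):
    """Stand-in for a pretrained text-to-image SR model's forward pass.
    Here it is plain nearest-neighbour upsampling; the real model would be
    a pixel-space diffusion U-Net whose weights are reused unchanged."""
    return frames_2d.repeat(scale, axis=-2).repeat(scale, axis=-1)

def inflated_video_sr(video, scale=4):
    """Run an image SR model on a video by folding time into the batch axis.
    video: (B, T, C, H, W) -> (B, T, C, H*scale, W*scale)."""
    b, t, c, h, w = video.shape
    frames = video.reshape(b * t, c, h, w)   # each frame becomes an image
    up = image_sr_step(frames, scale)        # shared image-model weights
    return up.reshape(b, t, c, h * scale, w * scale)

video = np.random.rand(2, 8, 3, 16, 16).astype(np.float32)
out = inflated_video_sr(video)
print(out.shape)  # (2, 8, 3, 64, 64)
```

Since the per-frame weights are untouched, the image model's spatial detail carries over directly; temporal coherence then has to come from a separate component.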

Inflation Technique and Temporal Consistency

Central to this work is the strategy of inflating a text-to-image super-resolution model for video generation. The architecture, based on a U-Net framework commonly used within image diffusion models, is adapted to handle sequences of images—that is, video frames—by sharing weights across the temporal dimension. To preserve temporal coherence, an adapter is introduced to manage the transitions between frames. This design choice represents a balance between image detail and temporal flow, preventing jarring transitions that could detract from the viewer's experience.
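Because each frame is processed independently by the inflated weights, the adapter's job is to mix information across neighbouring frames. The paper does not spell out the adapter's internals here, so the following is only a hypothetical sketch: a depthwise 1-D convolution over the frame axis with a small residual update, so the module starts near the identity and only gently smooths transitions.

```python
import numpy as np

def temporal_adapter(features, kernel=(0.25, 0.5, 0.25), alpha=0.1):
    """Hypothetical temporal adapter: a fixed depthwise 1-D convolution over
    the frame axis plus a small residual update, so the inflated image
    model's per-frame output is only gently nudged toward its neighbours.
    features: (B, T, C, H, W) -> same shape."""
    k = np.asarray(kernel, dtype=features.dtype)
    pad = len(k) // 2
    # replicate edge frames so the frame count is preserved
    padded = np.pad(features, ((0, 0), (pad, pad), (0, 0), (0, 0), (0, 0)),
                    mode="edge")
    t = features.shape[1]
    mixed = sum(k[i] * padded[:, i:i + t] for i in range(len(k)))
    return features + alpha * (mixed - features)

feats = np.random.rand(1, 8, 4, 8, 8).astype(np.float32)
out = temporal_adapter(feats)
print(out.shape)  # (1, 8, 4, 8, 8)
```

The `alpha` scale illustrates the trade-off the section describes: a larger value enforces smoother motion at the cost of blurring per-frame detail.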

Empirical Validation

The authors validated their approach through rigorous testing on the Shutterstock video dataset. A text-to-image super-resolution model pre-trained on a large dataset was repurposed for video generation and then fine-tuned using their method. Various tuning strategies were tested, revealing trade-offs between video quality and computational efficiency. The evaluation metrics were peak signal-to-noise ratio (PSNR) for visual quality and temporal change consistency (TCC) for the smoothness of motion in videos.
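PSNR has a standard definition; TCC is less universally fixed, so the version below is one plausible formulation (the PSNR between the frame-to-frame difference maps of the reference and generated videos), offered only to make the metrics concrete:

```python
import numpy as np

def psnr(ref, gen, max_val=1.0):
    """Peak signal-to-noise ratio (dB) between arrays scaled to [0, max_val]."""
    mse = np.mean((ref - gen) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(max_val ** 2 / mse)

def tcc(ref, gen):
    """Temporal change consistency, one plausible formulation: PSNR between
    the frame-to-frame difference maps of the two videos, so it rewards
    matching *motion* rather than matching pixels. Videos: (T, C, H, W)."""
    ref_diff = ref[1:] - ref[:-1]
    gen_diff = gen[1:] - gen[:-1]
    # shift differences from [-1, 1] into [0, 1] before computing PSNR
    return psnr((ref_diff + 1.0) / 2.0, (gen_diff + 1.0) / 2.0)

ref = np.random.rand(8, 3, 16, 16)
print(round(psnr(ref, ref + 0.01), 1))  # 40.0 dB for a uniform 0.01 error
```

A video can score high PSNR frame-by-frame yet flicker badly; TCC catches this because flicker changes the difference maps even when each individual frame looks fine.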

Advancements and Future Directions

This research contributes to the field by demonstrating an efficient and practical way to repurpose image diffusion models for video super-resolution, and is the first of its kind to operate directly in pixel space rather than in a latent space. The findings show a favorable compromise between visual quality, temporal coherence, and computational demands. The foundation set by this research opens up future possibilities such as scaling to higher resolutions and longer video durations, which would further probe the balance between resource consumption and output quality.
