
Inflation with Diffusion: Efficient Temporal Adaptation for Text-to-Video Super-Resolution (2401.10404v1)

Published 18 Jan 2024 in cs.CV

Abstract: We propose an efficient diffusion-based text-to-video super-resolution (SR) tuning approach that leverages the readily learned capacity of pixel level image diffusion model to capture spatial information for video generation. To accomplish this goal, we design an efficient architecture by inflating the weightings of the text-to-image SR model into our video generation framework. Additionally, we incorporate a temporal adapter to ensure temporal coherence across video frames. We investigate different tuning approaches based on our inflated architecture and report trade-offs between computational costs and super-resolution quality. Empirical evaluation, both quantitative and qualitative, on the Shutterstock video dataset, demonstrates that our approach is able to perform text-to-video SR generation with good visual quality and temporal consistency. To evaluate temporal coherence, we also present visualizations in video format in https://drive.google.com/drive/folders/1YVc-KMSJqOrEUdQWVaI-Yfu8Vsfu_1aO?usp=sharing .

Overview of the Paper

The paper introduces an architecture that repurposes existing text-to-image super-resolution models for text-to-video super-resolution. By "inflating" the weights of an image model into a video generation framework, the authors obtain high-quality video super-resolution without extensive training on video data. To ensure smooth transitions between video frames, a specialized temporal adapter is integrated into the model.

Inflation Technique and Temporal Consistency

Central to this work is the strategy of inflating a text-to-image super-resolution model for video generation. The architecture, based on a U-Net framework commonly used within image diffusion models, is adapted to handle sequences of images—that is, video frames—by sharing weights across the temporal dimension. To preserve temporal coherence, an adapter is introduced to manage the transitions between frames. This design choice represents a balance between image detail and temporal flow, preventing jarring transitions that could detract from the viewer's experience.
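To make the inflation step concrete, the sketch below shows one way this mechanism could look in PyTorch: the time axis of a video tensor is folded into the batch axis so a pretrained 2D image-SR U-Net processes every frame with shared weights, and a lightweight temporal module then mixes information across frames. The module names (`InflatedVideoSR`, `TemporalAdapter`) and the 1D-convolution adapter are illustrative assumptions, not the authors' exact design.

```python
# Minimal sketch of weight "inflation" plus a temporal adapter.
# Assumption: the pretrained image-SR U-Net takes (frames, text_emb);
# the adapter design (1D conv over time) is illustrative only.
import torch
import torch.nn as nn

class TemporalAdapter(nn.Module):
    """Hypothetical adapter: a 1D convolution over the time axis at each pixel."""
    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size,
                              padding=kernel_size // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, C, H, W) -> mix information along T at each spatial location
        b, t, c, h, w = x.shape
        y = x.permute(0, 3, 4, 2, 1).reshape(b * h * w, c, t)  # (B*H*W, C, T)
        y = self.conv(y)
        return y.reshape(b, h, w, c, t).permute(0, 4, 3, 1, 2)

class InflatedVideoSR(nn.Module):
    def __init__(self, image_sr_unet: nn.Module, channels: int):
        super().__init__()
        self.unet = image_sr_unet          # pretrained 2D image-SR model, weights shared across frames
        self.adapter = TemporalAdapter(channels)

    def forward(self, video: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # video: (B, T, C, H, W); fold time into batch so the 2D weights apply per frame
        b, t, c, h, w = video.shape
        frames = video.reshape(b * t, c, h, w)
        frames = self.unet(frames, text_emb.repeat_interleave(t, dim=0))
        frames = frames.reshape(b, t, *frames.shape[1:])
        return self.adapter(frames)        # restore coherence across frames
```

In this view, only the adapter (and optionally a subset of the U-Net) needs tuning on video data, which is what makes the different tuning strategies discussed below comparatively cheap.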

Empirical Validation

The authors validated their approach on the Shutterstock video dataset. A text-to-image super-resolution model pre-trained on a large image dataset was inflated into the video framework and then fine-tuned with their method. Several tuning strategies were compared, revealing trade-offs between video quality and computational cost. Evaluation metrics included peak signal-to-noise ratio (PSNR) for visual quality and temporal change consistency (TCC) for the smoothness of motion across frames.
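As a rough illustration of these metrics, the snippet below computes PSNR and a simple temporal-change-consistency score for NumPy video arrays in [0, 1]. The TCC formula shown, which correlates frame-to-frame changes of the prediction with those of the reference, is an assumed common variant and may differ in detail from the paper's exact definition.

```python
# Illustrative metric sketches; arrays are float videos in [0, 1].
import numpy as np

def psnr(pred: np.ndarray, ref: np.ndarray, peak: float = 1.0) -> float:
    """Standard peak signal-to-noise ratio over all pixels."""
    mse = np.mean((pred - ref) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)

def temporal_change_consistency(pred: np.ndarray, ref: np.ndarray) -> float:
    """pred, ref: (T, H, W, C) videos; higher means frame-to-frame changes agree."""
    dp = np.abs(np.diff(pred, axis=0))   # per-frame change in the prediction
    dr = np.abs(np.diff(ref, axis=0))    # per-frame change in the reference
    num = np.sum(dp * dr)                # correlation-style agreement of change maps
    den = np.sqrt(np.sum(dp ** 2) * np.sum(dr ** 2)) + 1e-8
    return float(num / den)
```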

Advancements and Future Directions

This research contributes an efficient and practical way to repurpose image diffusion models for video super-resolution, and it is presented as the first to operate directly in pixel space rather than in a latent space. The findings show a favorable compromise between visual quality, temporal coherence, and computational demands. This foundation opens up future directions such as scaling to higher resolutions and longer videos, which would further probe the balance between resource consumption and output quality.

Authors (5)
  1. Xin Yuan
  2. Jinoo Baek
  3. Keyang Xu
  4. Omer Tov
  5. Hongliang Fei