Overview of the Paper
The paper introduces a novel architecture that adapts existing text-to-image super-resolution models to text-to-video super-resolution. By "inflating" the weights of these image models, the researchers develop a system that can generate high-quality video from text descriptions without extensive training on video data. To ensure smooth transitions between video frames, a specialized temporal adapter is integrated into the model.
Inflation Technique and Temporal Consistency
Central to this work is the strategy of inflating a text-to-image super-resolution model for video generation. The architecture, based on the U-Net framework commonly used in image diffusion models, is adapted to handle sequences of frames by sharing the image model's weights across the temporal dimension. To preserve temporal coherence, an adapter is introduced to manage transitions between frames. This design balances per-frame image detail against temporal smoothness, preventing jarring transitions that would detract from the viewer's experience.
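To make the inflation idea concrete, here is a minimal PyTorch sketch of one way a frame-wise spatial block and a temporal adapter could be wired together. The module names (TemporalAdapter, InflatedBlock), the zero-initialized temporal convolution, and the tensor layout are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class TemporalAdapter(nn.Module):
    """Illustrative temporal adapter: a 1D convolution over the frame axis,
    zero-initialized so the inflated model initially matches the frame-wise
    image model (the adapter acts as an identity before tuning)."""
    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size,
                              padding=kernel_size // 2)
        nn.init.zeros_(self.conv.weight)
        nn.init.zeros_(self.conv.bias)

    def forward(self, x: torch.Tensor, num_frames: int) -> torch.Tensor:
        # x: (B*T, C, H, W); mix information along the frame axis only.
        bt, c, h, w = x.shape
        b = bt // num_frames
        y = x.reshape(b, num_frames, c, h, w).permute(0, 3, 4, 2, 1)  # (B,H,W,C,T)
        y = y.reshape(b * h * w, c, num_frames)
        y = self.conv(y)
        y = y.reshape(b, h, w, c, num_frames).permute(0, 4, 3, 1, 2)  # (B,T,C,H,W)
        return x + y.reshape(bt, c, h, w)                             # residual

class InflatedBlock(nn.Module):
    """Wraps a shape-preserving 2D (spatial) block and shares its weights
    across frames by folding the time axis into the batch dimension."""
    def __init__(self, spatial_block: nn.Module, channels: int):
        super().__init__()
        self.spatial = spatial_block      # pretrained image-model weights
        self.temporal = TemporalAdapter(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, C, H, W) -> apply the image block to every frame.
        b, t, c, h, w = x.shape
        x = x.reshape(b * t, c, h, w)
        x = self.spatial(x)
        x = self.temporal(x, num_frames=t)
        return x.reshape(b, t, c, h, w)

# Usage: wrap a stand-in 2D block and run a 4-frame clip through it.
spatial = nn.Sequential(nn.Conv2d(64, 64, 3, padding=1), nn.SiLU())
block = InflatedBlock(spatial, channels=64)
clip = torch.randn(2, 4, 64, 32, 32)   # (batch, frames, channels, height, width)
print(block(clip).shape)                # torch.Size([2, 4, 64, 32, 32])
```

Because the adapter starts as an identity, the inflated network initially reproduces the pretrained image model's frame-wise output, and fine-tuning only has to learn the temporal mixing.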
Empirical Validation
The authors validated their approach through rigorous testing on a diverse video dataset. A text-to-image super-resolution model pre-trained on a large dataset was repurposed for video generation and then fine-tuned using their method. Several tuning strategies were compared, revealing trade-offs between video quality and computational efficiency. Evaluation used peak signal-to-noise ratio (PSNR) for visual quality and temporal change consistency (TCC) for measuring smoothness of motion.
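As a rough illustration of these metrics, the NumPy sketch below computes per-frame PSNR and a simple frame-difference proxy for temporal consistency. The proxy is an assumption for illustration and is not the authors' exact TCC definition; the array shapes and random stand-in data are likewise hypothetical.

```python
import numpy as np

def psnr(pred: np.ndarray, target: np.ndarray, max_val: float = 1.0) -> float:
    """Peak signal-to-noise ratio between two frames with values in [0, max_val]."""
    mse = np.mean((pred.astype(np.float64) - target.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / mse)

def temporal_consistency_proxy(pred: np.ndarray, target: np.ndarray) -> float:
    """Illustrative proxy (not the paper's TCC): compare frame-to-frame
    differences of the prediction with those of the ground truth.
    pred/target: (T, H, W, C) arrays; lower is smoother/more consistent."""
    pred_diff = np.diff(pred.astype(np.float64), axis=0)
    target_diff = np.diff(target.astype(np.float64), axis=0)
    return float(np.mean(np.abs(pred_diff - target_diff)))

# Example: evaluate a generated clip against ground truth (random stand-ins).
T, H, W, C = 16, 256, 256, 3
gt = np.random.rand(T, H, W, C)
sr = np.clip(gt + 0.01 * np.random.randn(T, H, W, C), 0.0, 1.0)
mean_psnr = np.mean([psnr(sr[t], gt[t]) for t in range(T)])
print(f"mean PSNR: {mean_psnr:.2f} dB, "
      f"temporal diff error: {temporal_consistency_proxy(sr, gt):.4f}")
```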
Advancements and Future Directions
This research contributes to the field by demonstrating an efficient and practical way to repurpose image diffusion models for video super-resolution. It is the first of its kind to perform diffusion directly at the pixel level rather than in a latent space. The findings show a favorable compromise between visual quality, temporal coherence, and computational cost. The foundation set by this research opens up future directions such as scaling to higher resolutions and longer sequences, which would further explore the balance between resource consumption and output quality.