T2V-Turbo: Breaking the Quality Bottleneck of Video Consistency Model with Mixed Reward Feedback (2405.18750v2)
Abstract: Diffusion-based text-to-video (T2V) models have achieved significant success but continue to be hampered by their slow, iterative sampling processes. To address this challenge, consistency models have been proposed to facilitate fast inference, albeit at the cost of sample quality. In this work, we aim to break the quality bottleneck of a video consistency model (VCM) to achieve $\textbf{both fast and high-quality video generation}$. We introduce T2V-Turbo, which integrates feedback from a mixture of differentiable reward models into the consistency distillation (CD) process of a pre-trained T2V model. Notably, we directly optimize rewards associated with the single-step generations that arise naturally from computing the CD loss, effectively bypassing the memory constraints imposed by backpropagating gradients through an iterative sampling process. Remarkably, the 4-step generations from our T2V-Turbo achieve the highest total score on VBench, even surpassing Gen-2 and Pika. We further conduct human evaluations to corroborate these results, validating that the 4-step generations from our T2V-Turbo are preferred over the 50-step DDIM samples from their teacher models, representing more than a tenfold acceleration while improving video generation quality.
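To make the training objective concrete, below is a minimal PyTorch-style sketch of how the mixed-reward consistency distillation loss described above might be assembled. All module names (`student`, `ema_target`, `teacher_solver`, `decode`, `reward_img`, `reward_vid`) and the reward weights are placeholder assumptions for illustration, not the authors' actual implementation. The key idea it demonstrates is that the student's single-step clean-latent prediction, already computed for the CD loss, is reused to score differentiable image-text and video-text rewards without unrolling the sampler.

```python
import torch
import torch.nn.functional as F

def t2v_turbo_loss(student, ema_target, teacher_solver, decode,
                   reward_img, reward_vid, x_tn1, t_n1, t_n, text_emb,
                   beta_img=0.1, beta_vid=0.1):
    """One training step of consistency distillation (CD) with mixed reward
    feedback. Every module here is a hypothetical stand-in for this sketch."""
    # The student consistency model maps the noisier latent at t_{n+1}
    # directly to a clean-latent estimate. This single-step generation is
    # needed for the CD loss anyway, so the reward gradients come "for free"
    # without backpropagating through an iterative sampling trajectory.
    x0_student = student(x_tn1, t_n1, text_emb)

    # The teacher's ODE solver steps from t_{n+1} to t_n; the EMA target
    # network then produces the distillation target (no gradients through
    # either network).
    with torch.no_grad():
        x_tn = teacher_solver(x_tn1, t_n1, t_n, text_emb)
        x0_target = ema_target(x_tn, t_n, text_emb)

    # Standard consistency distillation loss between the two predictions.
    loss_cd = F.huber_loss(x0_student, x0_target)

    # Decode the single-step latent into pixel frames and score them with
    # differentiable reward models: an image-text reward on individual
    # frames and a video-text reward on the whole clip.
    frames = decode(x0_student)                      # (B, T, C, H, W)
    b, t = frames.shape[:2]
    r_img = reward_img(frames.flatten(0, 1),
                       text_emb.repeat_interleave(t, dim=0)).mean()
    r_vid = reward_vid(frames, text_emb).mean()

    # Minimize the CD loss while maximizing the mixture of rewards.
    return loss_cd - beta_img * r_img - beta_vid * r_vid
```

Because the rewards attach to a single forward pass of the student rather than a full sampling loop, activation memory scales with one denoising step, which is what makes reward optimization tractable here.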
- VideoCrafter2: Overcoming data limitations for high-quality video diffusion models. arXiv preprint arXiv:2401.09047, 2024.
- VideoLCM: Video latent consistency model. arXiv preprint arXiv:2312.09109, 2023.
- VBench: Comprehensive benchmark suite for video generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.
- Structure and content-guided video synthesis with diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7346–7356, 2023.
- Pika Labs. Accessed September 25, 2023. URL https://www.pika.art/.
- Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pages 2256–2265. PMLR, 2015.
- Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.
- Improving image generation with better captions. 2023. URL https://api.semanticscholar.org/CorpusID:264403242.
- High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.
- Scaling rectified flow transformers for high-resolution image synthesis. arXiv preprint arXiv:2403.03206, 2024.
- Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 35:36479–36494, 2022.
- Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:2209.14792, 2022.
- Video diffusion models. Advances in Neural Information Processing Systems, 35:8633–8646, 2022.
- Latent video diffusion models for high-fidelity video generation with arbitrary lengths. arXiv preprint arXiv:2211.13221, 2022.
- LaVie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:2309.15103, 2023.
- Show-1: Marrying pixel and latent diffusion models for text-to-video generation. arXiv preprint arXiv:2309.15818, 2023.
- Video generation models as world simulators. 2024. URL https://openai.com/research/video-generation-models-as-world-simulators.
- VideoCrafter1: Open diffusion models for high-quality video generation. arXiv preprint arXiv:2310.19512, 2023.
- ModelScope text-to-video technical report. arXiv preprint arXiv:2308.06571, 2023.
- Frozen in time: A joint video and image encoder for end-to-end retrieval. In IEEE International Conference on Computer Vision, 2021.
- Consistency models. In International Conference on Machine Learning, 2023.
- Improved techniques for training consistency models. In The Twelfth International Conference on Learning Representations, 2024.
- Latent consistency models: Synthesizing high-resolution images with few-step inference. arXiv preprint arXiv:2310.04378, 2023.
- InstructVideo: Instructing video diffusion models with human feedback. arXiv preprint arXiv:2312.12490, 2023.
- RAFT: Reward ranked finetuning for generative foundation model alignment. Transactions on Machine Learning Research, 2023.
- Directly fine-tuning diffusion models on differentiable rewards. In The Twelfth International Conference on Learning Representations, 2024.
- Aligning text-to-image diffusion models with reward backpropagation. arXiv preprint arXiv:2310.03739, 2023.
- ImageReward: Learning and evaluating human preferences for text-to-image generation. Advances in Neural Information Processing Systems, 36, 2024.
- EvalCrafter: Benchmarking and evaluating large video generation models. arXiv preprint arXiv:2310.11440, 2023.
- Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021.
- Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
- Classifier-free diffusion guidance. In NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications, 2021.
- Denoising diffusion implicit models. In International Conference on Learning Representations, 2021.
- Human preference score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis. arXiv preprint arXiv:2306.09341, 2023.
- Pick-a-Pic: An open dataset of user preferences for text-to-image generation. Advances in Neural Information Processing Systems, 36, 2024.
- LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022.
- LCM-LoRA: A universal Stable-Diffusion acceleration module. arXiv preprint arXiv:2311.05556, 2023.
- InternVideo2: Scaling video foundation models for multimodal video understanding. arXiv preprint arXiv:2403.15377, 2024.
- InternVid: A large-scale video-text dataset for multimodal understanding and generation. arXiv preprint arXiv:2307.06942, 2023.
- Open-Sora: Democratizing efficient video production for all, 2024. URL https://github.com/hpcaitech/Open-Sora.
- CogVideo: Large-scale pretraining for text-to-video generation via transformers. In The Eleventh International Conference on Learning Representations, 2023.
- Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
- BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT, pages 4171–4186, 2019.
- Align your latents: High-resolution video synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22563–22575, 2023.
- Text2Video-Zero: Text-to-image diffusion models are zero-shot video generators. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15954–15964, 2023.
- LAION-5B: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems, 35:25278–25294, 2022.
- Imagen Video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022.
- DPM-Solver: A fast ODE solver for diffusion probabilistic model sampling in around 10 steps. Advances in Neural Information Processing Systems, 35:5775–5787, 2022.
- DPM-Solver++: Fast solver for guided sampling of diffusion probabilistic models. arXiv preprint arXiv:2211.01095, 2022.
- Truncated diffusion probabilistic models and diffusion-based adversarial auto-encoders. arXiv preprint arXiv:2202.09671, 2022.
- GENIE: Higher-order denoising diffusion solvers. Advances in Neural Information Processing Systems, 35:30150–30166, 2022.
- Gotta go fast when generating data with score-based models. arXiv preprint arXiv:2105.14080, 2021.
- Knowledge distillation in iterative generative models for improved sampling speed. arXiv preprint arXiv:2101.02388, 2021.
- Progressive distillation for fast sampling of diffusion models. In International Conference on Learning Representations, 2022.
- On distillation of guided diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14297–14306, 2023.
- Fast sampling of diffusion models via operator learning. In International Conference on Machine Learning, pages 42390–42402. PMLR, 2023.
- Reward guided latent consistency distillation. arXiv preprint arXiv:2403.11027, 2024.
- Human preference score: Better aligning text-to-image models with human preference. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2096–2105, 2023.
- BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning, 2022.
- Towards a better metric for text-to-video generation. arXiv preprint arXiv:2401.07781, 2024.
- Scalable agent alignment via reward modeling: a research direction. arXiv preprint arXiv:1811.07871, 2018.
- Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593, 2019.
- Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
- Learning to summarize with human feedback. Advances in Neural Information Processing Systems, 33:3008–3021, 2020.
- Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36, 2024.
- Reinforcement learning for fine-tuning text-to-image diffusion models. Advances in Neural Information Processing Systems, 36, 2024.
- Large-scale reinforcement learning for diffusion models. arXiv preprint arXiv:2401.12244, 2024.
- Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9650–9660, 2021.
- AMT: All-pairs multi-field transforms for efficient frame interpolation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9801–9810, 2023.
- RAFT: Recurrent all-pairs field transforms for optical flow. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part II 16, pages 402–419. Springer, 2020.
- MUSIQ: Multi-scale image quality transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5148–5157, 2021.
- GRiT: A generative region-to-text transformer for object understanding. arXiv preprint arXiv:2212.00280, 2022.
- Unmasked teacher: Towards training-efficient video foundation models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 19948–19960, 2023.
- T2I-CompBench: A comprehensive benchmark for open-world compositional text-to-image generation. Advances in Neural Information Processing Systems, 36:78723–78747, 2023.
- Tag2Text: Guiding vision-language model via image tagging. In The Twelfth International Conference on Learning Representations, 2024.