
T2V-Turbo: Breaking the Quality Bottleneck of Video Consistency Model with Mixed Reward Feedback (2405.18750v2)

Published 29 May 2024 in cs.CV

Abstract: Diffusion-based text-to-video (T2V) models have achieved significant success but continue to be hampered by the slow sampling speed of their iterative sampling processes. To address the challenge, consistency models have been proposed to facilitate fast inference, albeit at the cost of sample quality. In this work, we aim to break the quality bottleneck of a video consistency model (VCM) to achieve $\textbf{both fast and high-quality video generation}$. We introduce T2V-Turbo, which integrates feedback from a mixture of differentiable reward models into the consistency distillation (CD) process of a pre-trained T2V model. Notably, we directly optimize rewards associated with single-step generations that arise naturally from computing the CD loss, effectively bypassing the memory constraints imposed by backpropagating gradients through an iterative sampling process. Remarkably, the 4-step generations from our T2V-Turbo achieve the highest total score on VBench, even surpassing Gen-2 and Pika. We further conduct human evaluations to corroborate the results, validating that the 4-step generations from our T2V-Turbo are preferred over the 50-step DDIM samples from their teacher models, representing more than a tenfold acceleration while improving video generation quality.


Summary

  • The paper introduces T2V-Turbo, which integrates image-text and video-text reward models into the consistency distillation process of a pre-trained T2V model, directly optimizing rewards on the single-step generations that arise during distillation.
  • The method outperforms proprietary models on VBench, achieving superior quality with just 4 inference steps compared to longer, computationally intensive baselines.
  • Human evaluations confirm T2V-Turbo delivers more appealing videos at over tenfold acceleration, highlighting its potential for real-time text-to-video applications.

Fast and High-Quality Text-to-Video Generation via T2V-Turbo

The paper presents "T2V-Turbo," a novel approach designed to accelerate the sampling process of diffusion-based text-to-video (T2V) models while preserving, and even enhancing, the quality of the generated videos. The method addresses a significant challenge in contemporary T2V models: the need to reconcile the high computational demands of iterative sampling processes with the desire for real-time video generation.

Background and Motivation

Diffusion-based models owe much of their success in generating high-quality images and videos to an iterative denoising process, but that same iterative sampling is computationally intensive and impedes real-time applications. Consistency models (CMs) and related distillation strategies promise fast, few-step inference, yet video consistency models distilled this way exhibit a marked drop in output quality as the number of inference steps shrinks.
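
For context, consistency distillation (following Song et al.'s consistency models) trains a student $f_\theta$ to map any point on the teacher's probability-flow ODE trajectory directly to the trajectory's clean endpoint. A standard conditional form of the loss, written here in the generic notation of the consistency-model literature rather than copied from this paper, is

$$\mathcal{L}_{\mathrm{CD}}(\theta, \theta^{-}; \phi) = \mathbb{E}\Big[\, d\big( f_{\theta}(x_{t_{n+1}}, c, t_{n+1}),\; f_{\theta^{-}}(\hat{x}^{\phi}_{t_n}, c, t_n) \big) \Big],$$

where $\hat{x}^{\phi}_{t_n}$ is produced from $x_{t_{n+1}}$ by one ODE-solver step under the frozen teacher $\phi$, $\theta^{-}$ is an exponential moving average of the student weights, $c$ is the text condition, and $d(\cdot,\cdot)$ is a distance metric.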

Methodology

T2V-Turbo integrates reward feedback from multiple reward models (RMs) into the consistency distillation (CD) of a pre-trained T2V model, aiming to break the quality bottleneck observed in prior video consistency models (VCMs) and to achieve both fast and high-quality video generation.
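
Schematically, the resulting objective augments the CD loss with reward terms evaluated on the student's single-step prediction $\hat{x}_0 = f_{\theta}(x_{t_{n+1}}, c, t_{n+1})$; the notation and weights below are illustrative rather than the paper's exact formulation:

$$\mathcal{L}(\theta) = \mathcal{L}_{\mathrm{CD}}(\theta) \;-\; \beta_{\mathrm{img}}\, \mathbb{E}\big[ R_{\mathrm{img}}(\hat{x}_0, c) \big] \;-\; \beta_{\mathrm{vid}}\, \mathbb{E}\big[ R_{\mathrm{vid}}(\hat{x}_0, c) \big],$$

where $R_{\mathrm{img}}$ is an image-text RM averaged over sampled frames, $R_{\mathrm{vid}}$ is a video-text RM scoring the full clip, and $\beta_{\mathrm{img}}, \beta_{\mathrm{vid}} \ge 0$ trade distillation fidelity against reward maximization.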

Consistency Distillation with Reward Feedback

T2V-Turbo integrates feedback from an image-text RM and a video-text RM into the distillation process. This setup includes the following elements:

  • Utilization of Single-Step Generations: Instead of backpropagating gradients through the entire iterative sampling process, T2V-Turbo directly optimizes rewards on the single-step generations that arise naturally when computing the CD loss, sidestepping the memory and compute cost of differentiating through a sampling chain (see the sketch after this list).
  • Image-Text and Video-Text RMs: The image-text RM aligns individual video frames with human preferences, while the video-text RM evaluates temporal dynamics and transitions across the clip. Together, the two RMs provide feedback from both spatial and temporal perspectives.
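
To make the recipe concrete, below is a minimal sketch of one training step in this style. It is an illustrative reconstruction, not the authors' implementation: every callable passed in (student, EMA student, teacher ODE step, latent decoder, reward models) and the beta weights are assumed placeholders.

```python
# Minimal, illustrative sketch of a T2V-Turbo-style training step (not the
# authors' code): the consistency distillation (CD) loss is combined with
# reward terms computed on the student's single-step prediction.
import torch
import torch.nn.functional as F


def t2v_turbo_loss(student, ema_student, teacher_ode_step, decode,
                   image_text_rm, video_text_rm,
                   x_noisy, cond, t, t_prev,
                   beta_img=1.0, beta_vid=1.0):
    """Return the mixed CD + reward loss for one batch of noisy video latents.

    x_noisy   : latents x_{t_{n+1}}, shape (B, C, T, H, W)
    cond      : text conditioning (e.g., text-encoder embeddings)
    t, t_prev : consecutive timesteps t_{n+1} and t_n on the solver schedule
    beta_*    : illustrative reward weights (assumed, not the paper's values)
    """
    # Single-step clean-latent prediction by the online student. This sample
    # arises naturally from computing the CD loss and is the only point
    # through which reward gradients flow; no backpropagation through an
    # iterative sampling chain is needed.
    x0_student = student(x_noisy, t, cond)

    # CD target: one ODE-solver step under the frozen teacher from t to
    # t_prev, then a clean-latent estimate from the EMA copy of the student.
    # Both are treated as constants.
    with torch.no_grad():
        x_prev = teacher_ode_step(x_noisy, t, t_prev, cond)
        x0_target = ema_student(x_prev, t_prev, cond)

    # MSE stands in for the distance metric d(.,.) of the CD loss.
    cd_loss = F.mse_loss(x0_student, x0_target)

    # Decode the single-step prediction to pixel frames, then score it with
    # the two differentiable reward models: the image-text RM on individual
    # frames (spatial alignment with the prompt), the video-text RM on the
    # clip as a whole (temporal dynamics).
    frames = decode(x0_student)          # (B, T, 3, H', W') in pixel space
    r_img = image_text_rm(frames, cond).mean()
    r_vid = video_text_rm(frames, cond).mean()

    # Maximizing rewards amounts to subtracting them from the CD loss.
    return cd_loss - beta_img * r_img - beta_vid * r_vid
```

The key design choice, per the paper, is that rewards are attached to the single-step prediction already produced while computing the CD loss, so no additional sampling passes are required during training.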

Empirical Evaluation

The empirical assessment combines automatic evaluation on the VBench benchmark with human evaluations on prompts from the EvalCrafter dataset.

Automatic Evaluation

VBench scores generated videos across multiple dimensions of visual quality and semantic alignment with the prompt. The 4-step generations of T2V-Turbo (VC2) and T2V-Turbo (MS) outperformed all other evaluated models in Total Score, Quality Score, and Semantic Score. Notably, T2V-Turbo surpassed proprietary systems such as Gen-2 and Pika while requiring only 4 inference steps.

Human Evaluation

Human evaluations on 700 prompts from the EvalCrafter dataset showed that the 4-step generations from T2V-Turbo were preferred over the 50-step DDIM samples from the corresponding teacher models, representing more than a tenfold acceleration in inference alongside an improvement in perceived quality. The 8-step generations were preferred even more strongly, underscoring T2V-Turbo's ability to balance quality and speed.

Contributions and Implications

The paper's contributions can be summarized as follows:

  • Introduction of a T2V model integrating feedback from a mixture of RMs, including a video-text model.
  • Establishment of a new state of the art on VBench with only 4 inference steps, surpassing several proprietary models.
  • Human evaluations validating the preference for T2V-Turbo-generated videos over those from the original, more computationally intensive teacher models.

Future Developments

The findings suggest a promising direction for enhancing real-time text-to-video applications. Incorporating additional video-text RMs tailored for specific domains, further optimizing RMs through direct human feedback, and exploring advanced integration techniques may offer exciting avenues for future research.

Conclusion

T2V-Turbo represents a significant advancement in the pursuit of fast and high-quality text-to-video generation. By effectively integrating reward feedback from image-text and video-text RMs, T2V-Turbo not only accelerates the inference process but also improves the video generation quality. These achievements underscore the potential for adopting similar strategies across various generative model applications beyond video synthesis.