POS: A Prompts Optimization Suite for Augmenting Text-to-Video Generation (2311.00949v3)

Published 2 Nov 2023 in cs.CV

Abstract: This paper aims to enhance diffusion-based text-to-video generation by improving both of its input prompts: the noise and the text. To this end, we propose POS, a training-free Prompt Optimization Suite that boosts text-to-video models. POS is motivated by two observations: (1) Video generation is unstable with respect to noise. Given the same text, different noises lead to videos that differ significantly in both frame quality and temporal consistency. This implies that an optimal noise exists for each textual input; to capture it, we propose an optimal noise approximator. Specifically, the approximator first retrieves a video that closely matches the text prompt and then inverts it into the noise space to serve as an improved noise prompt for that text. (2) Improving the text prompt via LLMs often causes semantic deviation. Many existing text-to-vision works have used LLMs to improve text prompts for better generation, but they often neglect the semantic alignment between the original text and the rewritten one. In response, we design a semantic-preserving rewriter that imposes constraints in both the rewriting and denoising phases to preserve semantic consistency. Extensive experiments on popular benchmarks show that POS improves text-to-video models by a clear margin. The code will be open-sourced.
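
A rough sketch may help make the two components concrete. The snippet below illustrates, under stated assumptions, (1) retrieving the cached video whose embedding best matches the text and DDIM-inverting its latent to obtain a text-matched initial noise, and (2) a cosine-similarity gate of the kind a semantic-preserving rewriter would need to reject drifting rewrites. Everything here is hypothetical scaffolding (the dummy `eps_model`, the random embeddings, the helper names); it is a minimal illustration of the data flow, not the paper's released implementation.

```python
import torch
import torch.nn.functional as F


def retrieve_closest(text_emb: torch.Tensor, bank_embs: torch.Tensor) -> int:
    """Index of the bank video whose precomputed embedding is most
    similar to the text embedding (CLIP-style retrieval)."""
    sims = F.cosine_similarity(text_emb.unsqueeze(0), bank_embs, dim=-1)
    return int(sims.argmax())


@torch.no_grad()
def ddim_invert(latent: torch.Tensor, eps_model,
                alphas_cumprod: torch.Tensor) -> torch.Tensor:
    """Deterministically map a clean latent x_0 toward the noise x_T by
    running the DDIM update in reverse; the result can replace i.i.d.
    Gaussian noise as the sampler's starting point."""
    x = latent
    for t in range(len(alphas_cumprod) - 1):
        a_t, a_next = alphas_cumprod[t], alphas_cumprod[t + 1]
        eps = eps_model(x, t)                                   # predicted noise
        x0_pred = (x - (1.0 - a_t).sqrt() * eps) / a_t.sqrt()   # implied clean latent
        x = a_next.sqrt() * x0_pred + (1.0 - a_next).sqrt() * eps
    return x


def semantically_close(orig_emb: torch.Tensor, rewrite_emb: torch.Tensor,
                       threshold: float = 0.85) -> bool:
    """Accept an LLM rewrite only if its sentence embedding stays close
    to the original prompt's, guarding against semantic deviation."""
    return F.cosine_similarity(orig_emb, rewrite_emb, dim=-1).item() >= threshold


# Toy end-to-end usage with stand-in tensors and a dummy denoiser.
eps_model = lambda x, t: torch.zeros_like(x)       # placeholder for the video U-Net
alphas_cumprod = torch.linspace(0.999, 0.01, 50)   # toy decreasing noise schedule
bank_latents = torch.randn(8, 4, 8, 32, 32)        # 8 cached video latents: (C, frames, H, W)
bank_embs, text_emb = torch.randn(8, 512), torch.randn(512)

idx = retrieve_closest(text_emb, bank_embs)
noise_prompt = ddim_invert(bank_latents[idx:idx + 1], eps_model, alphas_cumprod)
ok = semantically_close(text_emb, text_emb + 0.01 * torch.randn(512))
```

In a real pipeline the retrieved video would first be encoded by the model's VAE, and the inversion would run under the same text conditioning later used for sampling; the sketch only traces the flow from text, to retrieved video, to inverted noise.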

