
VidToMe: Video Token Merging for Zero-Shot Video Editing (2312.10656v2)

Published 17 Dec 2023 in cs.CV

Abstract: Diffusion models have made significant advances in generating high-quality images, but their application to video generation has remained challenging due to the complexity of temporal motion. Zero-shot video editing offers a solution by utilizing pre-trained image diffusion models to translate source videos into new ones. Nevertheless, existing methods struggle to maintain strict temporal consistency and efficient memory consumption. In this work, we propose a novel approach to enhance temporal consistency in generated videos by merging self-attention tokens across frames. By aligning and compressing temporally redundant tokens across frames, our method improves temporal coherence and reduces memory consumption in self-attention computations. The merging strategy matches and aligns tokens according to the temporal correspondence between frames, facilitating natural temporal consistency in generated video frames. To manage the complexity of video processing, we divide videos into chunks and develop intra-chunk local token merging and inter-chunk global token merging, ensuring both short-term video continuity and long-term content consistency. Our video editing approach seamlessly extends the advancements in image editing to video editing, rendering favorable results in temporal consistency over state-of-the-art methods.

Overview

Artificial intelligence research has long sought to improve how machines interpret and manipulate visual media. While diffusion models have made significant strides in image generation, video generation remains challenging because of the intricacies of temporal motion. The paper introduces "VidToMe," a method that improves temporal consistency in video editing without requiring training on large video datasets. The technique targets zero-shot video editing, in which a pre-trained image diffusion model translates a source video into a new one while retaining the original motion.
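To make the zero-shot setting concrete, the sketch below edits a video frame by frame with a pre-trained image diffusion model through the diffusers img2img pipeline. The model id, prompt, strength, and frame paths are placeholder choices, and this per-frame baseline is precisely the setting that flickers without cross-frame information; VidToMe's token merging is applied inside such a model's self-attention layers to restore consistency.

```python
# A minimal per-frame zero-shot editing baseline (a sketch, not VidToMe itself).
# The model id, prompt, strength, and frame paths are placeholders.
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

frames = [Image.open(f"frames/{i:04d}.png").convert("RGB") for i in range(16)]
edited = [
    pipe(prompt="a watercolor painting of the scene",
         image=frame, strength=0.6, guidance_scale=7.5).images[0]
    for frame in frames
]
# Edited independently, these frames tend to flicker; VidToMe merges self-attention
# tokens across frames inside the diffusion model to keep them consistent.
```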

Temporal Coherence

One of the main issues with current video generation techniques is ensuring strict temporal consistency: existing models often produce frames whose details drift over time, degrading perceived quality. VidToMe addresses this directly by aligning and compressing tokens (the units processed by the self-attention layers of diffusion models) across frames, enhancing temporal coherence. Tokens are matched according to the temporal correspondence between frames, so content that persists across frames is generated consistently.
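As a rough illustration of this merging step, the PyTorch sketch below matches each token of one frame to its most similar token in a reference frame by cosine similarity and averages the most redundant matches into the reference. The function name, the top-k selection rule, and the merge ratio are assumptions made for this sketch, not the paper's exact algorithm.

```python
import torch
import torch.nn.functional as F

def merge_tokens_across_frames(ref_tokens, tgt_tokens, merge_ratio=0.5):
    """Illustrative cross-frame token merging (a sketch, not VidToMe's exact algorithm).

    ref_tokens: (N, C) self-attention tokens of a reference frame
    tgt_tokens: (M, C) tokens of another frame, partially merged into the reference
    """
    # Proxy for temporal correspondence: cosine similarity between all token pairs.
    sim = F.normalize(tgt_tokens, dim=-1) @ F.normalize(ref_tokens, dim=-1).T  # (M, N)
    best_sim, best_ref = sim.max(dim=-1)          # best reference match per target token

    # Merge the most redundant target tokens (highest similarity to the reference).
    num_merge = int(merge_ratio * tgt_tokens.shape[0])
    merge_idx = best_sim.topk(num_merge).indices
    keep_mask = torch.ones(tgt_tokens.shape[0], dtype=torch.bool, device=tgt_tokens.device)
    keep_mask[merge_idx] = False

    # Average each merged token into its matched reference token.
    merged = ref_tokens.clone()
    counts = torch.ones(ref_tokens.shape[0], 1, device=ref_tokens.device, dtype=ref_tokens.dtype)
    merged.index_add_(0, best_ref[merge_idx], tgt_tokens[merge_idx])
    counts.index_add_(0, best_ref[merge_idx], torch.ones_like(tgt_tokens[merge_idx][:, :1]))
    merged = merged / counts

    # Unmerged tokens are kept; (merge_idx, best_ref) lets a caller scatter results back.
    return merged, tgt_tokens[keep_mask], (merge_idx, best_ref[merge_idx])
```

Because self-attention then runs over the smaller merged token set, memory use in the attention layers drops alongside the gain in consistency.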

Computational Efficiency

Processing video involves a tremendous amount of data, making computational efficiency a key challenge. VidToMe addresses this by dividing the video into chunks and introducing intra-chunk local token merging and inter-chunk global token merging. Local merging ensures short-term continuity within each chunk, while global merging maintains long-term content consistency across the whole video. Operating on chunks also keeps memory consumption and the cost of self-attention manageable, as sketched below.
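The chunked scheme can be sketched on top of the previous helper. The chunk size, the choice of each chunk's first frame as the local reference, and the running global token bank below are assumptions made for illustration rather than the paper's exact design; merge_tokens_across_frames is the illustrative function from the previous snippet.

```python
import torch

def merge_video_tokens(frame_tokens, chunk_size=4, local_ratio=0.5, global_ratio=0.25):
    """Two-level merging over a list of per-frame token tensors, each of shape (N, C).
    A sketch of intra-chunk local merging plus inter-chunk global merging, reusing
    merge_tokens_across_frames from the previous snippet."""
    global_bank = None            # long-term tokens shared across chunks
    merged_chunks = []

    for start in range(0, len(frame_tokens), chunk_size):
        chunk = frame_tokens[start:start + chunk_size]

        # Intra-chunk local merging: fold every frame of the chunk into its first
        # frame, keeping only tokens without a redundant counterpart.
        local = chunk[0]
        for tokens in chunk[1:]:
            local, kept, _ = merge_tokens_across_frames(local, tokens, local_ratio)
            local = torch.cat([local, kept], dim=0)
        merged_chunks.append(local)

        # Inter-chunk global merging: align the chunk with a running global token
        # bank so content stays consistent across distant parts of the video.
        if global_bank is None:
            global_bank = local
        else:
            global_bank, kept, _ = merge_tokens_across_frames(global_bank, local, global_ratio)
            global_bank = torch.cat([global_bank, kept], dim=0)

    return merged_chunks, global_bank
```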

Integration and Performance

The proposed approach carries advances in image-editing diffusion models over to video: VidToMe can be combined with existing image editing methods to produce text-aligned, temporally consistent video edits. In comprehensive experiments, it outperforms state-of-the-art methods in temporal consistency while remaining faithful to the editing prompts.
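Because the merging operates purely on self-attention tokens, it can in principle be attached to an existing image-editing pipeline by wrapping its attention layers. The wrapper below (reusing the merge_tokens_across_frames sketch from above) is an illustrative assumption rather than VidToMe's actual module: it expects a self-attention block that maps a (batch, tokens, channels) tensor to the same shape, merges a chunk's frames before attending, and, for simplicity, broadcasts the shared output back to every frame instead of performing the exact unmerge bookkeeping.

```python
import torch
import torch.nn as nn

class MergedFrameAttention(nn.Module):
    """Illustrative wrapper (not VidToMe's actual module): merge tokens across the
    frames of a chunk, attend once over the merged set, and share the result."""

    def __init__(self, attn: nn.Module, merge_ratio: float = 0.5):
        super().__init__()
        self.attn = attn              # assumed: maps (B, T, C) -> (B, T, C)
        self.merge_ratio = merge_ratio

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (frames, tokens, channels) -- the frames of one chunk.
        frames, tokens, _ = x.shape
        merged = x[0]                 # the first frame acts as the reference
        for frame_tokens in x[1:]:
            merged, kept, _ = merge_tokens_across_frames(merged, frame_tokens, self.merge_ratio)
            merged = torch.cat([merged, kept], dim=0)

        # One attention pass over the compressed, cross-frame token set.
        out = self.attn(merged.unsqueeze(0)).squeeze(0)

        # Simplified "unmerge": every frame receives the shared reference tokens.
        return out[:tokens].unsqueeze(0).expand(frames, -1, -1)
```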

Contributions

The paper outlines three main contributions of VidToMe to the field of AI-based video editing:

  • A novel method for enhancing temporal consistency in video generation by merging self-attention tokens across frames.
  • A dual strategy for local and global token merging, facilitating both short-term and long-term consistency in videos.
  • Demonstrated superiority in maintaining temporal consistency and computational efficiency compared to state-of-the-art zero-shot video editing methods.

In conclusion, "VidToMe: Video Token Merging for Zero-Shot Video Editing" presents a substantial advance in zero-shot video editing, extending the capabilities of AI in understanding and manipulating temporal media. With its improved consistency and efficiency, the method sets a strong reference point for future research and applications in video generation and editing.

Authors (4)
  1. Xirui Li
  2. Chao Ma
  3. Xiaokang Yang
  4. Ming-Hsuan Yang