VidEdit: Zero-Shot and Spatially Aware Text-Driven Video Editing (2306.08707v4)

Published 14 Jun 2023 in cs.CV

Abstract: Recently, diffusion-based generative models have achieved remarkable success in image generation and editing. However, existing diffusion-based video editing approaches cannot offer precise control over generated content while maintaining temporal consistency in long videos. Atlas-based methods, on the other hand, provide strong temporal consistency but make editing a video costly and lack spatial control. In this work, we introduce VidEdit, a novel method for zero-shot text-based video editing that guarantees robust temporal and spatial consistency. In particular, we combine an atlas-based video representation with a pre-trained text-to-image diffusion model to provide a training-free and efficient video editing method which, by design, fulfills temporal smoothness. To grant precise user control over generated content, we utilize conditional information extracted from off-the-shelf panoptic segmenters and edge detectors to guide the diffusion sampling process. This ensures fine spatial control over targeted regions while strictly preserving the structure of the original video. Our quantitative and qualitative experiments show that VidEdit outperforms state-of-the-art methods on the DAVIS dataset with respect to semantic faithfulness, image preservation, and temporal consistency metrics. With this framework, processing a single video takes only about one minute, and multiple compatible edits can be generated from a single text prompt. Project web page: https://videdit.github.io
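
As a concrete illustration of the pipeline the abstract outlines, the following is a minimal sketch, not the authors' implementation: it assumes a precomputed layered neural atlas (the file foreground_atlas.png is hypothetical), swaps in the public diffusers and controlnet_aux libraries with an HED-edge-conditioned ControlNet standing in for the paper's own structure conditioning, and omits the panoptic-segmentation masking and the atlas-to-frame remapping. Model identifiers and the prompt are illustrative.

import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from controlnet_aux import HEDdetector

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

# Structure conditioning: holistically-nested edge detection (HED),
# analogous to the edge maps the method extracts from the atlas.
hed = HEDdetector.from_pretrained("lllyasviel/Annotators")

# An HED-conditioned ControlNet plugged into Stable Diffusion stands in
# for the paper's conditioned diffusion sampling (an assumption, not the
# authors' exact models).
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-hed", torch_dtype=dtype
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=dtype
).to(device)

# Hypothetical precomputed foreground atlas of the input video.
atlas = Image.open("foreground_atlas.png").convert("RGB")
edges = hed(atlas)  # edge map that pins the edit to the original structure

# Edit the atlas once; every frame later samples from this single edited
# atlas, so temporal consistency holds by construction.
edited_atlas = pipe(
    prompt="a swan made of ice",  # illustrative prompt
    image=edges,
    num_inference_steps=30,
).images[0]
edited_atlas.save("edited_foreground_atlas.png")

# In the full method, a panoptic-segmentation mask would restrict the edit
# to the target region, and the edited atlas would be warped back to each
# frame with the original per-frame mappings (not shown here).

Because the diffusion edit touches the shared atlas exactly once and all frames are reconstructed from it, temporal smoothness is obtained "by design," matching the abstract's claim; the edge conditioning is what preserves the structure of the original video.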

Authors (4)
  1. Paul Couairon (4 papers)
  2. Clément Rambour (13 papers)
  3. Jean-Emmanuel Haugeard (3 papers)
  4. Nicolas Thome (53 papers)
Citations (25)