UniEdit: A Unified Tuning-Free Framework for Video Motion and Appearance Editing (2402.13185v4)

Published 20 Feb 2024 in cs.CV

Abstract: Recent advances in text-guided video editing have showcased promising results in appearance editing (e.g., stylization). However, video motion editing in the temporal dimension (e.g., from eating to waving), which distinguishes video editing from image editing, remains underexplored. In this work, we present UniEdit, a tuning-free framework that supports both video motion and appearance editing by harnessing the power of a pre-trained text-to-video generator within an inversion-then-generation framework. To realize motion editing while preserving source video content, we build on the insight that temporal and spatial self-attention layers encode inter-frame and intra-frame dependencies, respectively, and introduce auxiliary motion-reference and reconstruction branches that produce text-guided motion features and source features. The obtained features are then injected into the main editing path via the temporal and spatial self-attention layers. Extensive experiments demonstrate that UniEdit covers video motion editing and various appearance editing scenarios, and surpasses state-of-the-art methods. Our code will be publicly available.
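
The three-branch design described above can be made concrete with a toy example. The following is a minimal, hypothetical PyTorch sketch of the injection idea, not the authors' released implementation: a block with separate spatial and temporal self-attention, where the main editing branch takes spatial keys/values from a reconstruction branch (to preserve source content) and temporal keys/values from a motion-reference branch (to adopt the target motion). All module names, tensor shapes, and the query-from-editing / key-value-from-auxiliary scheme are illustrative assumptions.

```python
# Hypothetical sketch of UniEdit-style attention injection (illustrative only).
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    """Toy single-head self-attention; `kv` lets another branch inject keys/values."""
    def __init__(self, dim):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)

    def forward(self, x, kv=None):
        kv = x if kv is None else kv          # default: ordinary self-attention
        q, k, v = self.to_q(x), self.to_k(kv), self.to_v(kv)
        attn = torch.softmax(q @ k.transpose(-1, -2) / q.shape[-1] ** 0.5, dim=-1)
        return attn @ v

class VideoBlock(nn.Module):
    """One toy block: spatial attention within frames, temporal attention across frames."""
    def __init__(self, dim):
        super().__init__()
        self.spatial = SelfAttention(dim)
        self.temporal = SelfAttention(dim)

    def forward(self, x, spatial_kv=None, temporal_kv=None):
        b, f, n, d = x.shape                   # (batch, frames, tokens, channels)
        # Spatial self-attention: intra-frame dependencies.
        xs = x.reshape(b * f, n, d)
        ks = spatial_kv.reshape(b * f, n, d) if spatial_kv is not None else None
        x = (xs + self.spatial(xs, ks)).reshape(b, f, n, d)
        # Temporal self-attention: inter-frame dependencies.
        xt = x.permute(0, 2, 1, 3).reshape(b * n, f, d)
        kt = (temporal_kv.permute(0, 2, 1, 3).reshape(b * n, f, d)
              if temporal_kv is not None else None)
        xt = xt + self.temporal(xt, kt)
        return xt.reshape(b, n, f, d).permute(0, 2, 1, 3)

block = VideoBlock(dim=64)
src = torch.randn(1, 8, 16, 64)                # stand-in for inverted source latents
recon_hidden = src                             # reconstruction branch (source prompt)
motion_hidden = src + 0.1 * torch.randn_like(src)  # motion-reference branch (target prompt)

# Main editing branch: source appearance via spatial K/V from the reconstruction
# branch, edited motion via temporal K/V from the motion-reference branch.
edited = block(src, spatial_kv=recon_hidden, temporal_kv=motion_hidden)
print(edited.shape)                            # torch.Size([1, 8, 16, 64])
```

In the full method, the auxiliary branches would run the pre-trained text-to-video generator at every denoising step, with the reconstruction branch conditioned on the source prompt and the motion-reference branch on the target prompt; the sketch collapses that loop to a single block on random latents.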

Authors (7)
  1. Jianhong Bai
  2. Tianyu He
  3. Yuchi Wang
  4. Junliang Guo
  5. Haoji Hu
  6. Zuozhu Liu
  7. Jiang Bian
Citations (14)
