Edit-Your-Motion: Space-Time Diffusion Decoupling Learning for Video Motion Editing (2405.04496v3)

Published 7 May 2024 in cs.CV

Abstract: Existing diffusion-based methods have achieved impressive results in human motion editing, but they often exhibit significant ghosting and body distortion on unseen, in-the-wild cases. In this paper, we introduce Edit-Your-Motion, a video motion editing method that tackles these challenges through one-shot fine-tuning on unseen cases. First, we use DDIM inversion to initialize the noise, preserving the appearance of the source video, and design a lightweight motion attention adapter module to enhance motion fidelity. DDIM inversion obtains an implicit representation of the source video by estimating its prediction noise; this representation serves as the starting point of the sampling process and ensures appearance consistency between the source and edited videos. The motion attention (MA) module enhances the model's motion editing ability by resolving the conflict between skeleton features and appearance features. Second, to effectively decouple the motion and appearance of the source video, we design a spatio-temporal two-stage learning strategy (STL). In the first stage, we focus on learning the temporal features of human motion and propose recurrent causal attention (RCA) to ensure consistency between video frames. In the second stage, we shift focus to learning the appearance features of the source video. With Edit-Your-Motion, users can edit the motion of humans in a source video to create more engaging and diverse content. Extensive qualitative and quantitative experiments, along with user preference studies, show that Edit-Your-Motion outperforms other methods.
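
The abstract's key initialization step, DDIM inversion, can be illustrated concretely. The sketch below is a minimal, generic implementation of deterministic (eta = 0) DDIM inversion in PyTorch, not the authors' code: it runs the DDIM update in reverse, mapping the source video's clean latent to a noise latent that can seed the editing-time sampling process. The `eps_model` interface and the `alpha_bars` schedule are illustrative assumptions.

```python
import torch

@torch.no_grad()
def ddim_invert(eps_model, x0, alpha_bars, num_steps=50):
    """Map a clean latent x0 to a noise latent x_T by running the
    deterministic (eta = 0) DDIM update in reverse order."""
    # Evenly spaced timesteps over the training schedule (e.g. 1000 steps).
    ts = torch.linspace(0, len(alpha_bars) - 1, num_steps).long()
    x = x0
    for i in range(len(ts) - 1):
        t_cur, t_next = ts[i], ts[i + 1]        # t_next > t_cur: noise is added
        a_cur, a_next = alpha_bars[t_cur], alpha_bars[t_next]
        eps = eps_model(x, t_cur)               # predicted noise at this step
        # Recover the model's estimate of the clean latent ...
        x0_pred = (x - (1 - a_cur).sqrt() * eps) / a_cur.sqrt()
        # ... then re-noise it one step further along the schedule.
        x = a_next.sqrt() * x0_pred + (1 - a_next).sqrt() * eps
    return x  # starting latent for appearance-preserving sampling
```

Similarly, the spatio-temporal two-stage learning strategy (STL) amounts to alternating which parameter groups are trainable during one-shot fine-tuning. The module names below are hypothetical; the sketch only shows the freeze/unfreeze pattern the abstract describes, with temporal (motion) layers tuned first and spatial (appearance) layers second.

```python
def two_stage_finetune(model, video, make_optimizer, steps=(300, 300)):
    stage_params = [
        ["temporal_attn", "motion_adapter"],    # stage 1: temporal motion features
        ["spatial_attn", "appearance_layers"],  # stage 2: spatial appearance features
    ]
    for trainable, n_steps in zip(stage_params, steps):
        # Freeze everything except the current stage's parameter group.
        for name, p in model.named_parameters():
            p.requires_grad = any(key in name for key in trainable)
        opt = make_optimizer(p for p in model.parameters() if p.requires_grad)
        for _ in range(n_steps):
            loss = model.diffusion_loss(video)  # standard denoising objective (assumed API)
            loss.backward()
            opt.step()
            opt.zero_grad()
```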

Authors (8)
  1. Yi Zuo (12 papers)
  2. Lingling Li (34 papers)
  3. Licheng Jiao (109 papers)
  4. Fang Liu (801 papers)
  5. Xu Liu (213 papers)
  6. Wenping Ma (25 papers)
  7. Shuyuan Yang (36 papers)
  8. Yuwei Guo (20 papers)
Citations (1)
