
Space-Time Diffusion Features for Zero-Shot Text-Driven Motion Transfer (2311.17009v2)

Published 28 Nov 2023 in cs.CV

Abstract: We present a new method for text-driven motion transfer - synthesizing a video that complies with an input text prompt describing the target objects and scene while maintaining the input video's motion and scene layout. Prior methods are confined to transferring motion between two subjects within the same or closely related object categories and apply only to limited domains (e.g., humans). In this work, we consider a significantly more challenging setting in which the target and source objects differ drastically in shape and fine-grained motion characteristics (e.g., translating a jumping dog into a dolphin). To this end, we leverage a pre-trained and fixed text-to-video diffusion model, which provides us with generative and motion priors. The pillar of our method is a new space-time feature loss derived directly from the model. This loss guides the generation process to preserve the overall motion of the input video while complying with the target object in terms of shape and fine-grained motion traits.
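
The abstract describes guiding a frozen text-to-video diffusion model with a feature-space loss during sampling. The sketch below illustrates that general pattern in PyTorch: extract space-time features from the denoiser, compare them against features of the source video, and nudge the noisy latent along the loss gradient. The backbone, the feature statistic, and every name here are simplified stand-ins for illustration, not the authors' actual loss or API.

```python
import torch
import torch.nn as nn

# Stand-in for a frozen text-to-video diffusion backbone. In practice this would
# be a pre-trained model; here it is a toy module so the sketch runs end to end.
class ToyVideoDenoiser(nn.Module):
    def __init__(self, channels: int = 4):
        super().__init__()
        self.conv = nn.Conv3d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x_t, t, text_emb):
        feats = self.conv(x_t)        # toy "space-time features"
        noise_pred = feats - x_t      # toy noise prediction
        return noise_pred, feats


def feature_guided_step(model, x_t, t, text_emb, src_feats, guidance_scale=1.0):
    """One denoising step guided by a space-time feature loss (illustrative only)."""
    x_t = x_t.detach().requires_grad_(True)
    noise_pred, gen_feats = model(x_t, t, text_emb)

    # Illustrative loss: match a coarse per-frame statistic of the generated
    # features to the source-video features at the same timestep. The paper's
    # actual loss is derived differently from the model's features.
    loss = ((gen_feats.mean(dim=(-1, -2)) - src_feats.mean(dim=(-1, -2))) ** 2).mean()

    # Classifier-guidance-style update: push the noisy latent down the loss gradient,
    # then continue with the usual DDIM/DDPM update using noise_pred.
    grad = torch.autograd.grad(loss, x_t)[0]
    return (x_t - guidance_scale * grad).detach(), noise_pred.detach()


if __name__ == "__main__":
    model = ToyVideoDenoiser()
    x_t = torch.randn(1, 4, 8, 32, 32)        # (batch, channels, frames, H, W)
    src_feats = torch.randn(1, 4, 8, 32, 32)  # features extracted from the source video
    x_t, _ = feature_guided_step(model, x_t, t=500, text_emb=None, src_feats=src_feats)
    print(x_t.shape)
```

In the actual method the guidance loss is computed from the pre-trained model's own intermediate features across space and time; the toy statistic above only stands in for that signal.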
