LivePhoto: Real Image Animation with Text-guided Motion Control (2312.02928v1)

Published 5 Dec 2023 in cs.CV

Abstract: Despite recent progress in text-to-video generation, existing studies usually overlook the fact that text controls only the spatial contents, not the temporal motions, of synthesized videos. To address this challenge, this work presents a practical system, named LivePhoto, which allows users to animate an image of their interest with text descriptions. We first establish a strong baseline that helps a well-learned text-to-image generator (i.e., Stable Diffusion) take an image as a further input. We then equip the improved generator with a motion module for temporal modeling and propose a carefully designed training pipeline to better link texts and motions. In particular, considering that (1) text can only describe motions roughly (e.g., regardless of moving speed) and (2) text may include both content and motion descriptions, we introduce a motion intensity estimation module and a text re-weighting module to reduce the ambiguity of the text-to-motion mapping. Empirical evidence suggests that our approach decodes motion-related textual instructions into videos well, covering actions, camera movements, and even the conjuring of new content from thin air (e.g., pouring water into an empty glass). Moreover, thanks to the proposed intensity learning mechanism, our system offers users an additional control signal (i.e., motion intensity) besides text for video customization.
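The text re-weighting idea from the abstract can be sketched as follows: score each text-token embedding with a small learned head, normalize the scores, and scale the tokens so motion-describing words dominate the conditioning signal. This is a minimal illustrative sketch only; the linear scoring head, softmax normalization, and toy dimensions are assumptions, not the paper's actual implementation.

```python
import numpy as np

def reweight_text_tokens(token_embeddings, w, b=0.0):
    """Hypothetical text re-weighting sketch.

    token_embeddings: (num_tokens, dim) array of per-token text embeddings.
    w: (dim,) weights of an assumed learned linear scoring head.
    Returns the tokens rescaled by a softmax over their scores, so that
    highly scored (e.g., motion-related) tokens contribute more to the
    conditioning signal fed to the video generator.
    """
    scores = token_embeddings @ w + b            # one score per token
    weights = np.exp(scores - scores.max())      # numerically stable softmax
    weights /= weights.sum()
    return token_embeddings * weights[:, None]   # emphasized embeddings

# Toy usage with random stand-ins for real CLIP-style embeddings.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(5, 8))   # 5 tokens, 8-dim toy embeddings
w = rng.normal(size=8)
out = reweight_text_tokens(tokens, w)
```

In the actual system the weights would be learned jointly with the motion module, so that tokens describing content (already captured by the input image) are down-weighted and tokens describing motion are up-weighted.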

Authors (7)
  1. Xi Chen (1036 papers)
  2. Zhiheng Liu (22 papers)
  3. Mengting Chen (10 papers)
  4. Yutong Feng (33 papers)
  5. Yu Liu (786 papers)
  6. Yujun Shen (111 papers)
  7. Hengshuang Zhao (118 papers)
Citations (20)