ActAnywhere: Subject-Aware Video Background Generation (2401.10822v1)

Published 19 Jan 2024 in cs.CV

Abstract: Generating video background that tailors to foreground subject motion is an important problem for the movie industry and visual effects community. This task involves synthesizing background that aligns with the motion and appearance of the foreground subject, while also complying with the artist's creative intention. We introduce ActAnywhere, a generative model that automates this process, which traditionally requires tedious manual efforts. Our model leverages the power of large-scale video diffusion models, and is specifically tailored for this task. ActAnywhere takes a sequence of foreground subject segmentation as input and an image that describes the desired scene as condition, to produce a coherent video with realistic foreground-background interactions while adhering to the condition frame. We train our model on a large-scale dataset of human-scene interaction videos. Extensive evaluations demonstrate the superior performance of our model, significantly outperforming baselines. Moreover, we show that ActAnywhere generalizes to diverse out-of-distribution samples, including non-human subjects. Please visit our project webpage at https://actanywhere.github.io.

Summary

  • The paper introduces a diffusion-based model that automatically generates video backgrounds tailored to a foreground subject's motion.
  • It conditions on a segmented foreground subject sequence and an image of the desired scene to produce coherent videos with realistic foreground-background interactions.
  • Extensive evaluations show that it significantly outperforms baselines and generalizes zero-shot to non-human subjects, streamlining creative video production.

Introduction to ActAnywhere Technology

Video production, particularly in filmmaking and visual effects, frequently requires integrating a foreground subject realistically into a new background environment. Traditionally, this has been a labor-intensive process involving 3D scene creation or specialized setups such as LED-wall virtual production studios. ActAnywhere, a recently proposed generative model, streamlines this workflow by automating subject-aware video background generation.

Core Mechanism of ActAnywhere

ActAnywhere is a video diffusion model that takes a sequence of foreground subject segmentations as input and a single condition image depicting the desired scene. The model composes these elements into a coherent video in which the generated background adheres to the condition frame and interacts realistically with the foreground subject. It can produce detailed backgrounds, such as varied landscapes or moving objects, in sync with the subject's motion, reflecting learned priors about human-scene interaction and the ability to complete the visual context beyond the visible foreground segments.
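
To make this interface concrete, the sketch below shows one plausible way such a conditioning pipeline could be wired up: latents of the segmented foreground frames and their masks are concatenated with the noisy video latents, while an embedding of the condition image steers the denoiser. Every module name, tensor shape, and the simplified update rule here is an illustrative assumption, not the authors' released implementation.

```python
# Minimal, illustrative sketch of an ActAnywhere-style conditioning interface.
# All module names, shapes, and the sampler below are assumptions for clarity,
# not the authors' released code.
import torch
import torch.nn as nn


class DummyDenoiser(nn.Module):
    """Stand-in for a video diffusion U-Net that predicts noise in latent space."""

    def __init__(self, in_channels: int, cond_dim: int, latent_channels: int = 4):
        super().__init__()
        self.net = nn.Conv3d(in_channels, latent_channels, kernel_size=3, padding=1)
        self.cond_proj = nn.Linear(cond_dim, latent_channels)

    def forward(self, x, t, cond_embed):
        # x: noisy latents plus conditioning channels (B, C, T, H, W)
        # cond_embed: condition-image embedding (B, D), broadcast over time and space
        h = self.net(x)
        return h + self.cond_proj(cond_embed)[:, :, None, None, None]


@torch.no_grad()
def generate_background(fg_latents, fg_masks, cond_embed, denoiser, num_steps=25):
    """Denoise video latents conditioned on the segmented foreground and a scene image.

    fg_latents: (B, 4, T, h, w) latents of the masked foreground frames
    fg_masks:   (B, 1, T, h, w) per-frame subject masks
    cond_embed: (B, D) embedding of the condition (background) image
    """
    b, _, t, h, w = fg_latents.shape
    x = torch.randn(b, 4, t, h, w)                   # start from Gaussian noise
    for step in reversed(range(num_steps)):
        timestep = torch.full((b,), step)
        # Per-frame conditioning: concatenate noisy latents, foreground latents, masks
        model_in = torch.cat([x, fg_latents, fg_masks], dim=1)
        eps = denoiser(model_in, timestep, cond_embed)
        x = x - eps / num_steps                      # toy update, not a real DDPM/DDIM step
    return x                                         # a full pipeline would decode with the VAE


# Toy usage with random tensors standing in for real inputs.
denoiser = DummyDenoiser(in_channels=4 + 4 + 1, cond_dim=512)
fg_latents = torch.randn(1, 4, 8, 32, 32)
fg_masks = torch.rand(1, 1, 8, 32, 32)
cond_embed = torch.randn(1, 512)                     # e.g. an image embedding of the scene
video_latents = generate_background(fg_latents, fg_masks, cond_embed, denoiser)
print(video_latents.shape)                           # torch.Size([1, 4, 8, 32, 32])
```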

Model Capabilities and Evaluations

Trained on a large-scale dataset of human-scene interaction videos, ActAnywhere produces realistic videos that respect the subject's motion and adhere to the condition image. Notably, although it was trained primarily on human subjects, it also shows zero-shot generalization to a wide range of non-human subjects, such as animals and inanimate objects. This generalization makes the model practical for compositing diverse subjects into varied backgrounds convincingly.
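
For readers who want to see what training might look like, the sketch below shows one step under the standard noise-prediction objective used by latent diffusion models, reusing the DummyDenoiser from the previous sketch. The toy noising schedule and the absence of condition dropout or other tricks are simplifying assumptions, not the paper's exact training recipe.

```python
# Hedged sketch of one training step under the standard noise-prediction
# (epsilon-MSE) objective; schedules and hyperparameters are placeholders.
import torch
import torch.nn.functional as F


def training_step(denoiser, optimizer, video_latents, fg_latents, fg_masks,
                  cond_embed, num_steps=1000):
    """video_latents: (B, 4, T, h, w) clean latents of the ground-truth video."""
    b = video_latents.shape[0]
    t = torch.randint(0, num_steps, (b,))            # random diffusion timestep per sample
    noise = torch.randn_like(video_latents)
    # Toy linear schedule; real models use a proper alpha-bar (e.g. cosine) schedule.
    alpha = (1.0 - t.float() / num_steps).view(b, 1, 1, 1, 1)
    noisy = alpha.sqrt() * video_latents + (1 - alpha).sqrt() * noise
    model_in = torch.cat([noisy, fg_latents, fg_masks], dim=1)
    pred = denoiser(model_in, t, cond_embed)
    loss = F.mse_loss(pred, noise)                   # learn to predict the injected noise
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```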

ActAnywhere's Contributions and Potential Impact

ActAnywhere illustrates how generative AI can streamline creative production. The paper's key contributions are the formulation of the subject-aware video background generation problem, a video diffusion-based solution tailored to it, and extensive evaluations showing that it significantly outperforms baselines. The model offers the film and visual effects industries a practical tool for crafting scenes quickly, while also opening new possibilities for hobbyists and the broader public to realize a wide range of visual scenarios.
