ActAnywhere: Subject-Aware Video Background Generation (2401.10822v1)
Abstract: Generating video backgrounds that tailor to foreground subject motion is an important problem for the movie industry and the visual effects community. This task involves synthesizing a background that aligns with the motion and appearance of the foreground subject while also complying with the artist's creative intention. We introduce ActAnywhere, a generative model that automates this process, which traditionally requires tedious manual effort. Our model leverages the power of large-scale video diffusion models and is specifically tailored for this task. ActAnywhere takes a sequence of foreground subject segmentations as input and an image describing the desired scene as the condition, and produces a coherent video with realistic foreground-background interactions that adheres to the condition frame. We train our model on a large-scale dataset of human-scene interaction videos. Extensive evaluations demonstrate the superior performance of our model, which significantly outperforms baselines. Moreover, we show that ActAnywhere generalizes to diverse out-of-distribution samples, including non-human subjects. Please visit our project webpage at https://actanywhere.github.io.
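The abstract specifies the model's interface: a per-frame foreground segmentation sequence as input and a single condition image describing the desired scene. A minimal sketch of that input contract is shown below; the function name and structure are hypothetical illustrations, since the actual ActAnywhere diffusion network and its API are not part of this excerpt.

```python
import numpy as np

def compose_model_inputs(frames, masks, condition_image):
    """Hypothetical illustration of ActAnywhere-style conditioning.

    frames          : (T, H, W, 3) RGB video frames
    masks           : (T, H, W, 1) per-frame foreground subject segmentations
    condition_image : (H, W, 3)    image describing the desired scene

    Returns the tensors a background-generation diffusion model would
    be conditioned on: the masked-out subject sequence, the masks, and
    the condition frame. The diffusion model itself is not shown.
    """
    frames = np.asarray(frames, dtype=np.float32)
    masks = np.asarray(masks, dtype=np.float32)
    assert frames.shape[:3] == masks.shape[:3], "frame/mask sequences must align"

    # Zero out the background so only the subject's motion and
    # appearance are visible to the generator.
    foreground = frames * masks

    return {
        "foreground": foreground,
        "masks": masks,
        "condition": np.asarray(condition_image, dtype=np.float32),
    }
```

The key design point the abstract implies is that the model never sees the original background, only the segmented subject plus the condition frame, so the generated scene is free to differ entirely from the source footage.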