VASE: Object-Centric Appearance and Shape Manipulation of Real Videos (2401.02473v1)
Abstract: Several recent works have tackled video editing, fostered by the success of large-scale text-to-image generative models. However, most of these methods edit each frame holistically from a text prompt, exploiting the prior of foundation diffusion models and focusing on improving temporal consistency across frames. In this work, we introduce an object-centric framework designed to control both the object's appearance and, notably, to execute precise and explicit structural modifications on the object. We build our framework on a pre-trained image-conditioned diffusion model, integrate layers to handle the temporal dimension, and propose training strategies and architectural modifications to enable shape control. We evaluate our method on the image-driven video editing task, showing performance comparable to the state of the art while showcasing novel shape-editing capabilities. Further details, code, and examples are available on our project page: https://helia95.github.io/vase-website/
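The abstract mentions integrating layers into a pre-trained image diffusion model to handle the temporal dimension. A common way to do this (a minimal sketch, not the paper's actual implementation; the module name and hyperparameters are assumptions) is to insert residual self-attention blocks that attend only across frames, zero-initialized so that at the start of training the inflated model reproduces the pretrained image model exactly:

```python
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Self-attention over the time axis, inserted after each spatial block.

    The output projection is zero-initialized so the residual branch is a
    no-op at the start of training, preserving the pretrained image prior
    (a common "inflation" trick; details here are illustrative).
    """

    def __init__(self, channels: int, heads: int = 4):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        # Zero-init so the module initially outputs exactly its input.
        nn.init.zeros_(self.attn.out_proj.weight)
        nn.init.zeros_(self.attn.out_proj.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, channels, height, width)
        b, f, c, h, w = x.shape
        # Fold spatial positions into the batch; attend across frames only.
        tokens = x.permute(0, 3, 4, 1, 2).reshape(b * h * w, f, c)
        normed = self.norm(tokens)
        out, _ = self.attn(normed, normed, normed)
        out = out.reshape(b, h, w, f, c).permute(0, 3, 4, 1, 2)
        return x + out  # residual keeps the image model's behaviour reachable
```

Because each spatial location attends only to its own history across frames, such layers add temporal consistency while leaving the per-frame spatial layers of the pretrained model untouched.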