GenVideo: One-shot Target-image and Shape Aware Video Editing using T2I Diffusion Models (2404.12541v1)
Abstract: Video editing methods based on diffusion models that rely solely on a text prompt are hindered by the limited expressive power of text. Incorporating a reference target image as a visual guide therefore becomes desirable for precise control over the edit. Moreover, most existing methods struggle to edit a video accurately when the shape and size of the object in the target image differ from those of the source object. To address these challenges, we propose "GenVideo", which edits videos by leveraging target-image-aware T2I models. Our approach handles edits with target objects of varying shapes and sizes while maintaining the temporal consistency of the edit using our novel target- and shape-aware InvEdit masks. Furthermore, we propose a novel target-image-aware latent noise correction strategy during inference to improve the temporal consistency of the edits. Experimental analyses indicate that GenVideo can effectively handle edits with objects of varying shapes where existing approaches fail.
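The abstract names two inference-time components, InvEdit masks and a target-image-aware latent noise correction, without giving their exact formulation. The sketch below illustrates the general idea such a correction builds on: DiffEdit-style mask-blended denoising, where latents outside the edit mask are replaced at each step by inverted source-frame latents so unedited regions stay consistent across frames. All names here (`unet`, `scheduler`, `edit_mask`, `source_latents`) are illustrative assumptions following common diffusers conventions, not the paper's actual API.

```python
# Minimal sketch of mask-guided latent blending during diffusion denoising
# (assumed mechanism; the paper's actual correction may differ).
import torch

@torch.no_grad()
def masked_denoise(unet, scheduler, timesteps, z_t, source_latents, edit_mask, cond):
    """
    z_t:            noisy latent of the edited frame, shape (B, C, H, W)
    source_latents: dict mapping timestep -> inverted source-frame latent
    edit_mask:      (B, 1, H, W) float mask, 1 inside the edit region
    cond:           conditioning embedding (e.g., text / target-image features)
    """
    for t in timesteps:
        # Predict noise and take one denoising step on the edited latent.
        eps = unet(z_t, t, encoder_hidden_states=cond).sample
        z_t = scheduler.step(eps, t, z_t).prev_sample

        # Latent correction: outside the mask, copy back the source latent
        # at the same timestep so the background is preserved, which in
        # turn keeps unedited regions temporally consistent across frames.
        z_src = source_latents[int(t)]
        z_t = edit_mask * z_t + (1.0 - edit_mask) * z_src
    return z_t
```

In this formulation, the quality of the edit hinges on the mask: if it is shape-aware (covering the target object's silhouette rather than the source object's), the blend can accommodate target objects whose shape and size differ from the source, which is the failure mode the abstract highlights in prior work.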
Authors: Sai Sree Harsha, Ambareesh Revanur, Dhwanit Agarwal, Shradha Agrawal