Videoshop: Localized Semantic Video Editing with Noise-Extrapolated Diffusion Inversion (2403.14617v3)
Abstract: We introduce Videoshop, a training-free video editing algorithm for localized semantic edits. Videoshop lets users modify the first frame with any image editing tool, including Photoshop and generative inpainting, and automatically propagates those changes to the remaining frames with semantically, spatially, and temporally consistent motion. Unlike existing methods that enable edits only through imprecise textual instructions, Videoshop lets users add or remove objects, semantically alter objects, insert stock photos into videos, and more, with fine-grained control over location and appearance. We achieve this through image-based video editing: we invert the video latents with noise extrapolation and then generate the video conditioned on the edited image. Videoshop produces higher-quality edits than 6 baselines on 2 editing benchmarks across 10 evaluation metrics.
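To make the invert-then-regenerate pipeline concrete, below is a minimal illustrative sketch in Python. It assumes an EDM-style sigma schedule and a placeholder `denoise` function standing in for a pretrained image-conditioned video diffusion model; every name, signature, and constant is an assumption made for illustration, not the authors' implementation.

```python
import torch

def denoise(x, sigma, cond=None):
    # Stand-in for a pretrained denoiser D(x; sigma, cond); returns its input
    # so the sketch runs end to end. A real model (e.g. an image-conditioned
    # video diffusion model) would go here.
    return x

def invert(latents, sigmas):
    """Map clean latents toward noise by stepping the sampling ODE in reverse,
    extrapolating the noise direction estimated at the current level up to the
    next (higher) noise level."""
    x = latents.clone()
    for s_cur, s_next in zip(sigmas[:-1], sigmas[1:]):   # increasing sigma
        d = (x - denoise(x, s_cur)) / max(s_cur, 1e-8)   # estimated noise direction
        x = x + (s_next - s_cur) * d                     # Euler step "up"
    return x

def sample(x, sigmas, cond):
    """Ordinary Euler sampling (noisy -> clean), conditioned on the edit."""
    rev = sigmas[::-1]                                   # decreasing sigma
    for s_cur, s_next in zip(rev[:-1], rev[1:]):
        d = (x - denoise(x, s_cur, cond)) / s_cur
        x = x + (s_next - s_cur) * d                     # Euler step "down"
    return x

# Toy usage on random (batch, frames, channels, h, w) latents. In practice the
# edited first-frame latent would come from the user's image edit.
sigmas = torch.linspace(0.02, 14.6, 25).tolist()         # assumed EDM-like schedule
video = torch.randn(1, 14, 4, 8, 8)
edited_first_frame = video[:, 0]
result = sample(invert(video, sigmas), sigmas, cond=edited_first_frame)
```

In the actual method the edited first frame would enter through the video model's image-conditioning pathway during generation; passing it as a `cond` argument here is simply the most compact stand-in for that mechanism.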