RACCooN: A Versatile Instructional Video Editing Framework with Auto-Generated Narratives (2405.18406v3)
Abstract: Recent video generative models rely primarily on carefully written text prompts for specific tasks, such as inpainting or style editing. Because they require labor-intensive textual descriptions of input videos, they are ill-suited to adapting personal or raw videos to user specifications. This paper proposes RACCooN, a versatile and user-friendly video-to-paragraph-to-video generative framework that supports multiple video editing capabilities, including removal, addition, and modification, through a unified pipeline. RACCooN consists of two principal stages: Video-to-Paragraph (V2P) and Paragraph-to-Video (P2V). In the V2P stage, we automatically describe video scenes in well-structured natural language, capturing both the holistic context and fine-grained object details. In the P2V stage, users can optionally refine these descriptions to guide a video diffusion model, enabling various modifications to the input video, such as removing or changing subjects and/or adding new objects. The proposed approach stands out from other methods through several significant contributions: (1) RACCooN introduces a multi-granular spatiotemporal pooling strategy that generates well-structured video descriptions, capturing both the broad context and object details without requiring complex human annotations, which simplifies precise text-based video content editing for users. (2) Our video generative model incorporates auto-generated narratives or instructions to enhance the quality and accuracy of the generated content. (3) RACCooN can also plan the insertion of new objects into a given video, so users simply prompt the model to receive a detailed editing plan for complex video edits. The proposed framework demonstrates impressive and versatile capabilities in video-to-paragraph generation and video content editing, and can be incorporated into other SoTA video generative models for further enhancement.
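The two-stage pipeline described above can be sketched in a few lines of Python. This is a minimal, hypothetical illustration of the V2P-then-P2V control flow only; all class, function, and field names below are assumptions for exposition, not the authors' actual API, and the model calls are stubbed out.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical sketch of RACCooN's two-stage pipeline. Names are
# illustrative; a real system would call a multimodal LLM (V2P) and
# a video diffusion model (P2V) where the stubs below return strings.

@dataclass
class VideoDescription:
    holistic_context: str       # scene-level summary produced by V2P
    object_details: List[str]   # per-object details from multi-granular
                                # spatiotemporal pooling (stubbed here)

def video_to_paragraph(video_frames: List[str]) -> VideoDescription:
    """V2P stage: auto-generate a structured description of the video."""
    return VideoDescription(
        holistic_context=f"A scene spanning {len(video_frames)} frames.",
        object_details=[f"object observed in {f}" for f in video_frames[:2]],
    )

def paragraph_to_video(desc: VideoDescription, user_edit: str) -> str:
    """P2V stage: condition video generation on the (user-edited) text."""
    prompt = " ".join([desc.holistic_context, *desc.object_details, user_edit])
    return f"<video generated from: {prompt}>"

# Usage: describe the video, let the user refine the text, regenerate.
desc = video_to_paragraph(["frame0.png", "frame1.png", "frame2.png"])
edited = paragraph_to_video(desc, user_edit="remove the dog; add a red kite")
```

The key design point the sketch captures is that the natural-language description is the editing interface: users modify text between the two stages rather than drawing masks or writing prompts from scratch.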