Looking Backward: Streaming Video-to-Video Translation with Feature Banks (2405.15757v3)
Abstract: This paper introduces StreamV2V, a diffusion model that achieves real-time streaming video-to-video (V2V) translation with user prompts. Unlike prior V2V methods, which process a limited number of frames in batches, we process frames in a streaming fashion to support unlimited frames. At the heart of StreamV2V lies a backward-looking principle that relates the present to the past. This is realized by maintaining a feature bank, which archives information from past frames. For incoming frames, StreamV2V extends self-attention to include banked keys and values and directly fuses similar past features into the output. The feature bank is continually updated by merging stored and new features, making it compact yet informative. StreamV2V stands out for its adaptability and efficiency, integrating seamlessly with image diffusion models without fine-tuning. It runs at 20 FPS on a single A100 GPU, 15x, 46x, 108x, and 158x faster than FlowVid, CoDeF, Rerender, and TokenFlow, respectively. Quantitative metrics and user studies confirm StreamV2V's exceptional ability to maintain temporal consistency.
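The abstract describes the mechanism only in prose, so a brief illustration may help. Below is a minimal PyTorch sketch of the backward-looking idea: self-attention whose keys and values are augmented with a bank of past-frame features, followed by a bank refresh. This is not the paper's released implementation; `FeatureBank`, `max_tokens`, and `extended_self_attention` are hypothetical names, the random subsampling merely stands in for the paper's token-merging of similar stored and new features, and the direct fusion of similar past features into the output is omitted.

```python
# Minimal sketch (not the authors' code) of streaming V2V with a feature bank.
import torch
import torch.nn.functional as F


class FeatureBank:
    """Hypothetical compact archive of keys/values from past frames."""

    def __init__(self, max_tokens=4096):
        self.max_tokens = max_tokens  # budget keeping the bank compact
        self.keys = None              # (T, d) banked keys
        self.values = None            # (T, d) banked values

    def update(self, k, v):
        # The paper merges similar stored/new features (token-merging style);
        # random subsampling is used here only to keep the sketch short.
        if self.keys is None:
            self.keys, self.values = k, v
        else:
            self.keys = torch.cat([self.keys, k], dim=0)
            self.values = torch.cat([self.values, v], dim=0)
        if self.keys.shape[0] > self.max_tokens:
            idx = torch.randperm(self.keys.shape[0])[: self.max_tokens]
            self.keys, self.values = self.keys[idx], self.values[idx]


def extended_self_attention(q, k, v, bank):
    """Self-attention whose keys/values also cover the banked past frames."""
    k_ext, v_ext = k, v
    if bank.keys is not None:
        k_ext = torch.cat([k, bank.keys], dim=0)    # (N + T, d)
        v_ext = torch.cat([v, bank.values], dim=0)  # (N + T, d)
    attn = F.softmax(q @ k_ext.t() / q.shape[-1] ** 0.5, dim=-1)
    out = attn @ v_ext  # current frame attends to itself and to the past
    bank.update(k, v)   # archive this frame's features for later frames
    return out


# Usage on a stream of per-frame token tensors of shape (N, d); the q/k/v
# projections of a real attention layer are omitted for brevity.
bank = FeatureBank(max_tokens=2048)
for _ in range(5):  # stand-in for an unbounded frame stream
    tokens = torch.randn(256, 64)
    out = extended_self_attention(tokens, tokens, tokens, bank)
```

Because the bank is bounded, per-frame cost stays constant regardless of video length, which is what makes the streaming (rather than batched) formulation viable.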
- Token merging: Your ViT but faster. arXiv preprint arXiv:2210.09461.
- Daniel Bolya and Judy Hoffman. 2023. Token Merging for Fast Stable Diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4598–4602.
- InstructPix2Pix: Learning to follow image editing instructions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18392–18402.
- Pix2Video: Video editing using image diffusion. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 23206–23217.
- Control-A-Video: Controllable Text-to-Video Generation with Diffusion Models. arXiv preprint arXiv:2305.13840.
- Structure and content-guided video synthesis with diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7346–7356.
- TokenFlow: Consistent diffusion features for consistent video editing. arXiv preprint arXiv:2307.10373.
- AnimateDiff: Animate your personalized text-to-image diffusion models without specific tuning. arXiv preprint arXiv:2307.04725.
- Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626.
- Style aligned image generation via shared attention. arXiv preprint arXiv:2312.02133.
- Jonathan Ho and Tim Salimans. 2022. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598.
- Video diffusion models. arXiv preprint arXiv:2204.03458.
- Real-Time Intermediate Flow Estimation for Video Frame Interpolation. In Proceedings of the European Conference on Computer Vision (ECCV).
- Ondrej Jamriska. 2018. Ebsynth: Fast Example-based Image Synthesis and Style Transfer. https://github.com/jamriska/ebsynth.
- Text2Video-Zero: Text-to-image diffusion models are zero-shot video generators. arXiv preprint arXiv:2303.13439.
- StreamDiffusion: A Pipeline-level Solution for Real-time Interactive Generation. arXiv preprint arXiv:2312.12491.
- AnyV2V: A Plug-and-Play Framework For Any Video-to-Video Editing Tasks. arXiv preprint arXiv:2403.14468.
- Learning blind video temporal consistency. In Proceedings of the European Conference on Computer Vision (ECCV), pages 170–185.
- xFormers: A modular and hackable Transformer modelling library. https://github.com/facebookresearch/xformers.
- FlowVid: Taming Imperfect Optical Flows for Consistent Video-to-Video Synthesis. arXiv preprint arXiv:2312.17681.
- Shanchuan Lin and Xiao Yang. 2024. AnimateDiff-Lightning: Cross-Model Diffusion Distillation. arXiv preprint arXiv:2403.12706.
- Diffusion Hyperfeatures: Searching Through Time and Space for Semantic Correspondence. In Advances in Neural Information Processing Systems.
- Latent consistency models: Synthesizing high-resolution images with few-step inference. arXiv preprint arXiv:2310.04378.
- DeepCache: Accelerating diffusion models for free. arXiv preprint arXiv:2312.00858.
- SDEdit: Guided image synthesis and editing with stochastic differential equations. arXiv preprint arXiv:2108.01073.
- On distillation of guided diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14297–14306.
- T2I-Adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. arXiv preprint arXiv:2302.08453.
- CoDeF: Content deformation fields for temporally consistent video processing. arXiv preprint arXiv:2308.07926.
- Actor-context-actor relation network for spatio-temporal action localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 464–474.
- The 2017 DAVIS challenge on video object segmentation. arXiv preprint arXiv:1704.00675.
- FateZero: Fusing attentions for zero-shot text-based video editing. arXiv preprint arXiv:2303.09535.
- Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR.
- Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125.
- High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695.
- Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 35:36479–36494.
- Adversarial diffusion distillation. arXiv preprint arXiv:2311.17042.
- Video Editing via Factorized Diffusion Distillation. arXiv preprint arXiv:2403.09334.
- Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502.
- Consistency models. arXiv preprint arXiv:2303.01469.
- Emergent correspondence from image diffusion. Advances in Neural Information Processing Systems, 36.
- Zachary Teed and Jia Deng. 2020. RAFT: Recurrent all-pairs field transforms for optical flow. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part II, pages 402–419. Springer.
- Training-Free Consistent Text-to-Image Generation. arXiv preprint arXiv:2402.03286.
- Plug-and-play diffusion features for text-driven image-to-image translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1921–1930.
- Zero-shot video editing using off-the-shelf image diffusion models. arXiv preprint arXiv:2303.17599.
- VideoComposer: Compositional Video Synthesis with Motion Controllability. arXiv preprint arXiv:2306.02018.
- Cache Me if You Can: Accelerating Diffusion Models through Block Caching. arXiv preprint arXiv:2312.03209.
- Fairy: Fast Parallelized Instruction-Guided Video-to-Video Synthesis. arXiv preprint arXiv:2312.13834.
- Long-term feature banks for detailed video understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 284–293.
- Tune-A-Video: One-shot tuning of image diffusion models for text-to-video generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7623–7633.
- Rerender A Video: Zero-Shot Text-Guided Video-to-Video Translation. arXiv preprint arXiv:2306.07954.
- One-step Diffusion with Distribution Matching Distillation. arXiv preprint arXiv:2311.18828.
- ControlVideo: Training-free Controllable Text-to-Video Generation. arXiv preprint arXiv:2305.13077.
- ControlVideo: Adding Conditional Control for One Shot Text-to-Video Editing. arXiv preprint arXiv:2305.17098.
- StoryDiffusion: Consistent Self-Attention for Long-Range Image and Video Generation. arXiv preprint arXiv:2405.01434.