Object-Centric Diffusion for Efficient Video Editing (2401.05735v3)

Published 11 Jan 2024 in cs.CV and cs.LG

Abstract: Diffusion-based video editing has reached impressive quality and can transform the global style, local structure, and attributes of given video inputs, following textual edit prompts. However, such solutions typically incur heavy memory and computational costs to generate temporally-coherent frames, either in the form of diffusion inversion and/or cross-frame attention. In this paper, we conduct an analysis of such inefficiencies, and suggest simple yet effective modifications that allow significant speed-ups whilst maintaining quality. Moreover, we introduce Object-Centric Diffusion, to fix generation artifacts and further reduce latency by allocating more computations towards foreground edited regions, arguably more important for perceptual quality. We achieve this by two novel proposals: i) Object-Centric Sampling, decoupling the diffusion steps spent on salient or background regions and spending most on the former, and ii) Object-Centric Token Merging, which reduces cost of cross-frame attention by fusing redundant tokens in unimportant background regions. Both techniques are readily applicable to a given video editing model without retraining, and can drastically reduce its memory and computational cost. We evaluate our proposals on inversion-based and control-signal-based editing pipelines, and show a latency reduction up to 10x for a comparable synthesis quality. Project page: qualcomm-ai-research.github.io/object-centric-diffusion.

Efficiency Improvements in Video Editing with Object-Centric Diffusion

Introduction to the Efficiency Challenge

Video editing powered by diffusion models has made great strides in quality and capability. Such models can now follow textual edit prompts to modify the global style, local structure, and attributes of video footage. Nonetheless, these advances come with a significant computational load: techniques such as diffusion inversion and cross-frame attention ensure temporal coherence but are computationally intensive. This paper addresses these inefficiencies by proposing modifications that preserve quality while drastically speeding up the editing process.

Breaking Down Inefficiencies

Investigations into current video editing frameworks identify major sources of inefficiency, particularly in memory and computational demands. These problems largely stem from attention-based guidance and the high number of diffusion steps used during video generation. The research shows how existing optimizations, such as efficient samplers and token reduction in attention layers, can be leveraged to increase speed without degrading the quality of the edited content; the brief sketch below illustrates one such off-the-shelf optimization.
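
As a minimal illustration (not from the paper itself), the snippet below assumes the Hugging Face diffusers library and a Stable Diffusion checkpoint, and shows the kind of drop-in speed-up the analysis refers to: swapping the default scheduler for a faster ODE solver and cutting the number of denoising steps.

```python
# Minimal sketch of an off-the-shelf speed-up: swap in a faster ODE solver
# (DPM-Solver++) and reduce the number of denoising steps.
# Assumes the Hugging Face `diffusers` library and a Stable Diffusion checkpoint.
import torch
from diffusers import StableDiffusionPipeline, DPMSolverMultistepScheduler

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# DPM-Solver++ reaches comparable quality in far fewer steps than the default sampler.
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)

# 20 steps instead of the usual 50 already yields a large latency reduction.
image = pipe("a jeep driving on a mountain road, van gogh style",
             num_inference_steps=20).images[0]
```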

Object-Centric Techniques

The core innovation of the paper is Object-Centric Diffusion (OCD). This method concentrates computation on the foreground, based on the observation that edits are usually most consequential in the regions where the action happens. Two novel techniques are introduced:

  1. Object-Centric Sampling: This method decouples the diffusion process for edited and background regions, concentrating most of the denoising steps on the regions of interest (a minimal sketch follows this list).
  2. Object-Centric 3D Token Merging: This approach streamlines cross-frame attention by merging redundant tokens in the less important background regions, reducing the workload (a simplified sketch appears further below).
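
The first idea can be sketched as follows. This is an illustration of the principle, not the authors' implementation: the denoise_step callable and fg_mask are hypothetical placeholders for a real reverse-diffusion update and an object/saliency mask, and the schedules are simple linear ones.

```python
# Illustrative sketch of object-centric sampling (hypothetical interface, not the
# paper's code): run few denoising steps on the background, many on the foreground,
# then fuse the two latents with a saliency mask.
import torch

def object_centric_sampling(latent, fg_mask, denoise_step, fg_steps=50, bg_steps=10):
    """latent: (B, C, H, W) noisy latent; fg_mask: (1, 1, H, W) in [0, 1];
    denoise_step(latent, t) performs one reverse-diffusion update."""
    fg_timesteps = torch.linspace(999, 0, fg_steps).long()  # dense schedule
    bg_timesteps = torch.linspace(999, 0, bg_steps).long()  # sparse schedule

    # Background branch: cheap, coarse denoising.
    bg_latent = latent.clone()
    for t in bg_timesteps:
        bg_latent = denoise_step(bg_latent, t)

    # Foreground branch: spend most of the compute where the edit happens.
    fg_latent = latent.clone()
    for t in fg_timesteps:
        fg_latent = denoise_step(fg_latent, t)

    # Fuse: carefully denoised foreground, cheaply denoised background.
    return fg_mask * fg_latent + (1 - fg_mask) * bg_latent

# Toy usage with a dummy denoiser, just to show the interface.
out = object_centric_sampling(torch.randn(1, 4, 64, 64),
                              torch.zeros(1, 1, 64, 64),
                              lambda x, t: 0.98 * x)
```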

Both techniques can be applied to existing video editing models without retraining, while significantly lowering memory usage and computational cost.
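
The second idea, merging tokens in background regions, admits a similarly compact sketch. The function below uses ToMe-style bipartite matching over background tokens only; it illustrates the principle rather than reproducing the paper's 3D variant, and its handling of merge collisions is deliberately naive.

```python
# Simplified sketch of object-centric token merging (illustrative, not the paper's
# exact algorithm): foreground tokens are kept intact, background tokens are paired
# by cosine similarity and the most redundant pairs are averaged, shrinking the
# sequence that cross-frame attention has to process.
import torch
import torch.nn.functional as F

def merge_background_tokens(tokens, fg_token_mask, merge_ratio=0.5):
    """tokens: (N, D) attention tokens; fg_token_mask: (N,) bool, True = foreground."""
    fg = tokens[fg_token_mask]          # never merged: perceptual quality matters here
    bg = tokens[~fg_token_mask]         # candidates for merging
    if bg.shape[0] < 2:
        return tokens

    # Bipartite matching: split background tokens into two alternating sets.
    src, dst = bg[0::2], bg[1::2]
    sim = F.normalize(src, dim=-1) @ F.normalize(dst, dim=-1).T
    best_sim, best_dst = sim.max(dim=-1)          # most similar dst for each src

    # Merge the most redundant src tokens into their matched dst token
    # (collisions simply overwrite; good enough for a sketch).
    n_merge = int(merge_ratio * src.shape[0])
    order = best_sim.argsort(descending=True)
    merge_ids, keep_ids = order[:n_merge], order[n_merge:]

    merged_dst = dst.clone()
    merged_dst[best_dst[merge_ids]] = 0.5 * (dst[best_dst[merge_ids]] + src[merge_ids])

    # Output: all foreground tokens + unmerged src tokens + (partially merged) dst tokens.
    return torch.cat([fg, src[keep_ids], merged_dst], dim=0)

# Example: 512 tokens of dimension 320, with roughly 25% marked as foreground.
toks = torch.randn(512, 320)
fg_mask = torch.rand(512) < 0.25
reduced = merge_background_tokens(toks, fg_mask)   # fewer tokens than the input
```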

Demonstrated Results and Contributions

Applying the proposed techniques to existing inversion-based and ControlNet-based video editing frameworks, the researchers obtained strong results: a latency reduction of up to 10x for inversion-based models and 6x for ControlNet-based models, with memory savings of up to 17x, while maintaining comparable synthesis quality.

The contributions can be summarized as follows:

  • An analysis and suggestions for acceleration in current video editing models.
  • Introduction of Object-Centric Sampling for focused diffusion processing.
  • Introduction of Object-Centric 3D Token Merging, which reduces the number of tokens involved in cross-frame attention.
  • Optimization of two recent video editing models, showcasing rapid editing speeds without sacrificing quality.

Through extensive experiments, the paper shows that focusing computational resources on salient regions using object-centric techniques improves both the efficiency and the quality of video editing. This work is a meaningful step toward more efficient, high-quality video editing across a range of applications.

Authors (6)
  1. Kumara Kahatapitiya
  2. Adil Karjauv
  3. Davide Abati
  4. Fatih Porikli
  5. Amirhossein Habibian
  6. Yuki M. Asano