Mobile Video Diffusion: A Synopsis
The paper "Mobile Video Diffusion" addresses the heavy computational demands that have kept video diffusion models off mobile devices. These models offer high realism and controllability in video generation, but they traditionally require so much memory and compute that they are confined to high-end GPUs or cloud servers. The authors introduce MobileVD, a mobile-optimized video diffusion model that applies a series of optimizations to the base Stable Video Diffusion (SVD) architecture to make on-device deployment feasible.
Core Innovations and Methodology
The primary contribution of this work lies in reducing the computational demands of video diffusion models to enable their operation on consumer-grade mobile devices. The authors achieve this by:
- Memory and Computational Cost Reduction: The model lowers the frame resolution (targeting 14-frame clips at 512×256 pixels) and introduces multi-scale temporal representations, cutting both memory usage and per-frame compute.
- Pruning Techniques: Two novel pruning techniques are introduced: channel funneling and temporal block pruning. Channel funneling narrows the network's linear layers by routing activations through a lower-dimensional intermediate "funnel", reducing compute without a significant loss in quality. Temporal block pruning removes the temporal blocks that contribute least to output quality, further shrinking the computational footprint (see the sketches after this list).
- Adversarial Finetuning: To accelerate generation, adversarial training reduces the number of denoising steps to one, so a single UNet forward pass replaces the usual multi-step sampling loop while largely preserving video quality (compare the two samplers sketched below).
- Efficient UNet Architecture: The paper adapts SVD's spatio-temporal UNet with further architectural optimizations, such as streamlined cross-attention layers, yielding significant efficiency gains on mobile hardware.
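To make the funneling idea concrete, the sketch below factors a pretrained linear layer through a narrower intermediate width. This is a minimal PyTorch illustration under assumed names (`FunneledLinear`, `ratio`); the paper's exact funnel placement, initialization, and finetuning recipe are not reproduced here. Note that a two-matrix funnel only saves compute when the intermediate width is below half the original width (for square layers).

```python
import torch
import torch.nn as nn

class FunneledLinear(nn.Module):
    """Replace a dense d_in -> d_out linear layer with a two-matrix
    'funnel' d_in -> d_mid -> d_out.

    Hypothetical sketch of channel funneling: cost drops from
    d_in * d_out to d_mid * (d_in + d_out) multiply-adds, so savings
    require d_mid < d_in * d_out / (d_in + d_out), i.e. d_mid < d/2
    for square layers. The funnels are then finetuned to recover quality.
    """

    def __init__(self, original: nn.Linear, ratio: int = 4):
        super().__init__()
        d_in, d_out = original.in_features, original.out_features
        d_mid = max(1, d_out // ratio)
        self.down = nn.Linear(d_in, d_mid, bias=False)
        self.up = nn.Linear(d_mid, d_out, bias=original.bias is not None)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.up(self.down(x))

# Usage: a 1024 -> 1024 projection at ratio 4 costs
# 2 * 1024 * 256 = 0.5M MACs instead of 1024 * 1024 = 1.05M.
layer = nn.Linear(1024, 1024)
funneled = FunneledLinear(layer, ratio=4)
out = funneled(torch.randn(2, 77, 1024))
assert out.shape == (2, 77, 1024)
```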
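Temporal block pruning can be sketched in the same spirit: estimate each temporal block's importance, keep the top scorers, and swap the rest for identity mappings. The scoring input here is a hypothetical stand-in for the paper's selection criterion, and the sketch assumes each block is called as `block(x)` with a single argument.

```python
import torch.nn as nn

def prune_temporal_blocks(blocks: nn.ModuleList,
                          scores: list[float],
                          keep: int) -> nn.ModuleList:
    """Keep the `keep` highest-scoring temporal blocks and replace the
    rest with identity mappings, so their compute is skipped entirely.

    `scores` stands in for a per-block importance estimate (e.g. the
    quality drop measured when a block is ablated); the paper's actual
    criterion is not reproduced here.
    """
    ranked = sorted(range(len(blocks)), key=lambda i: scores[i], reverse=True)
    kept = set(ranked[:keep])
    return nn.ModuleList(
        [blocks[i] if i in kept else nn.Identity() for i in range(len(blocks))]
    )

# Usage with toy single-argument blocks:
blocks = nn.ModuleList([nn.Linear(8, 8) for _ in range(4)])
pruned = prune_temporal_blocks(blocks, scores=[0.9, 0.1, 0.5, 0.2], keep=2)
# blocks 0 and 2 survive; blocks 1 and 3 become nn.Identity()
```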
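The effect of the adversarial finetuning on inference cost is easiest to see by contrasting a conventional multi-step sampler with the one-step generator it produces. The signatures below are hypothetical; the multi-step loop assumes a diffusers-style scheduler interface.

```python
import torch

@torch.no_grad()
def sample_multi_step(unet, scheduler, latents, cond, num_steps=25):
    """Standard diffusion inference: one UNet pass per denoising step,
    using a diffusers-style scheduler interface."""
    scheduler.set_timesteps(num_steps)
    for t in scheduler.timesteps:
        noise_pred = unet(latents, t, cond)
        latents = scheduler.step(noise_pred, t, latents).prev_sample
    return latents

@torch.no_grad()
def sample_one_step(unet, latents, cond, t_init):
    """After adversarial finetuning, the generator maps noise to a clean
    latent in a single forward pass: the UNet cost per clip shrinks by
    roughly the former step count."""
    return unet(latents, t_init, cond)
```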
Numerical Results and Efficiency Gains
The paper quantitatively demonstrates MobileVD's efficiency, reporting a 523× reduction in compute relative to the original SVD (from 1817.2 TFLOPs to 4.34 TFLOPs for generating one clip). The optimization comes at a slight quality cost, with FVD rising from 149 to 171 (lower is better), a trade-off the authors deem acceptable for on-device operation.
Implications and Future Work
The work paves the way for on-device video generation, which has numerous applications ranging from augmented reality to privacy-preserving video content creation. By enabling video diffusion models to run efficiently on mobile devices, this research contributes to a broader accessibility of advanced video generation technology beyond high-resource environments.
Looking forward, the paper suggests that future advancements may focus on further improving spatial and temporal compression to enhance video quality and extend the duration and resolution of generated videos. Moreover, integrating more advanced autoencoders could be a fruitful avenue for alleviating the limitations of low-resolution outputs, expanding the applicability of such models in diverse mobile applications.
Overall, the MobileVD model represents a significant step in making cutting-edge video diffusion models accessible on mobile platforms, offering a practical solution to the computational bottlenecks that have historically limited their deployment outside of powerful computing environments. This work has set a benchmark for efficiency in mobile AI applications, potentially catalyzing further research into lightweight generative models.