- The paper presents a training-free method that deploys diffusion-based text-to-video generation on mobile devices without re-training or heavy model compression.
- Key techniques, Linear Proportional Leap and Temporal Dimension Token Merging, reduce the number of denoising steps and the cost of attention computation, achieving speedups of up to 1.9x.
- The framework uses Concurrent Inference with Dynamic Loading to overcome memory constraints, enabling deployment of models with footprints exceeding 23 GB on devices like the iPhone 15 Pro.
"On-device Sora" (2503.23796) presents a framework designed to execute pre-trained, diffusion-based text-to-video generative models, such as Open-Sora, directly on resource-constrained mobile devices like the iPhone 15 Pro. The central contribution is achieving this deployment without necessitating model re-training, compression (e.g., distillation, extensive quantization, pruning), or fine-tuning, which are typically resource-intensive processes. The framework applies a suite of novel, training-free optimization techniques to address the inherent computational and memory challenges associated with running large-scale generative models on mobile hardware.
Challenges of On-Device Video Diffusion
Deploying state-of-the-art text-to-video diffusion models, particularly those utilizing Spatial-Temporal Diffusion Transformers (STDiT), on mobile platforms encounters three primary obstacles:
- High Number of Denoising Steps (C1): Diffusion models inherently require numerous iterative steps (often dozens to hundreds) to transform noise into a coherent video latent representation. Each step involves computationally expensive model inference, leading to prohibitive generation times on mobile GPUs.
- Intensive Token Processing in Attention (C2): The STDiT architecture relies heavily on attention mechanisms (both self-attention and cross-attention) to model spatial and temporal relationships. The computational complexity, particularly the quadratic complexity of self-attention with respect to the number of input tokens (representing spatial patches across time), results in significant latency per denoising step.
- Large Memory Footprint (C3): The combined size of the model components (typically a text encoder such as T5, the main diffusion model such as STDiT, and a VAE decoder) often exceeds the available RAM on mobile devices. For instance, the paper notes a combined footprint potentially exceeding 23 GB, far surpassing the ~3.3 GB of usable application memory on an iPhone 15 Pro (the back-of-the-envelope sketch after this list makes the scale concrete).
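To make the scale of C2 and C3 concrete, a quick back-of-the-envelope calculation; the VAE stride of 8 and patch size of 2 are illustrative assumptions rather than values quoted in the paper:

```python
import math

frames, height, width = 16, 256, 256
vae_stride, patch = 8, 2  # assumed latent downsampling and patch size

# C2: token counts seen by the factorized STDiT attention layers.
tokens_per_frame = (height // vae_stride // patch) ** 2
print("spatial tokens per frame:", tokens_per_frame)       # 256
print("spatial pairs per frame:", tokens_per_frame ** 2)   # 65536
print("temporal pairs per site:", frames ** 2)             # 256

# C3: ~23 GB of weights against ~3.3 GB of usable app memory means the
# model must execute in at least this many separately loaded pieces.
print("minimum partitions:", math.ceil(23 / 3.3))          # 7
```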
Methodology: Training-Free Optimizations
On-device Sora introduces three techniques to mitigate these challenges without altering model weights:
Linear Proportional Leap (LPL)
LPL aims to reduce the total number of denoising steps required (addressing C1). It leverages the trajectory properties of models trained with Rectified Flow, such as Open-Sora: the Rectified Flow objective theoretically promotes straighter paths between noise and data samples in the latent space.
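For intuition, recall the standard Rectified Flow setup (generic notation, not the paper's): the model learns the drift of a linear interpolation between a data sample $z_0$ and a noise sample $z_1$,

```latex
z_t = (1 - t)\,z_0 + t\,z_1,
\qquad
\frac{dz_t}{dt} = z_1 - z_0 \quad \text{(constant along a straight path)}.
```

Along a perfectly straight trajectory the drift never changes direction, so once consecutive drift estimates become nearly parallel, integrating the current drift over all remaining time reaches the endpoint in one step; this is exactly the condition LPL's cosine-similarity test detects.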
- Mechanism: LPL monitors the denoising process. After an initial set of standard diffusion steps (e.g., $n$ steps; the paper mentions a default minimum of 50), it assesses the linearity of the remaining trajectory by computing the cosine similarity between the drift estimates $v(P_n, t_n)$ and $v(P_{n+1}, t_{n+1})$ predicted by the STDiT model at consecutive steps $n$ and $n+1$. If the similarity exceeds a predefined threshold (indicating a near-linear trajectory), or the maximum number of standard steps has been reached, LPL activates: it performs an "early stop," taking the drift vector $v(P_{n+1}, t_{n+1})$ computed at step $n+1$ and scaling it by the remaining integration time $t_{n+1}$ to estimate the final latent state $z_k$ directly. The update becomes $z_k = z_{n+1} + v(P_{n+1}, t_{n+1}) \cdot t_{n+1}$.
- Implementation: This involves modifying the diffusion sampling loop to include the cosine similarity check and the conditional leap calculation (a minimal loop sketch follows this list). It avoids running the STDiT model for the skipped steps, directly reducing computational load.
- Impact: Experiments showed LPL reducing the effective number of STDiT inference calls significantly (e.g., from a baseline of 30 down to ~16-18), leading to speedups of approximately 1.5x-1.9x in the denoising phase. Its effectiveness was also demonstrated on Pyramidal Flow, another Rectified Flow-based model.
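A minimal sketch of such a loop, assuming a Rectified Flow sampler in which `model(z, t)` returns the drift and `timesteps` decreases toward 0 (so each $t$ is the remaining integration time); the threshold and step-budget values are illustrative, not the paper's defaults:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def lpl_sample(model, z, timesteps, sim_threshold=0.999,
               min_steps=2, max_steps=None):
    """Rectified Flow sampling with a Linear Proportional Leap (sketch)."""
    v_prev = None
    for i, t in enumerate(timesteps):
        v = model(z, t)  # one (expensive) STDiT inference call
        leap = False
        if v_prev is not None and i >= min_steps:
            # Near-parallel consecutive drifts imply a near-linear trajectory.
            sim = F.cosine_similarity(v.flatten(), v_prev.flatten(), dim=0)
            leap = bool(sim > sim_threshold)
        if max_steps is not None and i >= max_steps:
            leap = True  # step budget exhausted: leap unconditionally
        if leap:
            # Early stop: integrate the current drift over all remaining
            # time t in a single step, skipping the rest of the schedule.
            return z + v * t
        t_next = timesteps[i + 1] if i + 1 < len(timesteps) else 0.0
        z = z + v * (t - t_next)  # standard Euler update
        v_prev = v
    return z
```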
Temporal Dimension Token Merging (TDTM)
TDTM focuses on reducing the computational cost of the attention mechanism within STDiT (addressing C2) by decreasing the number of tokens processed along the temporal dimension.
- Mechanism: TDTM exploits the expected temporal redundancy between adjacent video frames. Before tokens enter the spatial-temporal attention blocks, it merges consecutive tokens along the temporal axis: each pair of adjacent temporal tokens $T_i$ and $T_{i+1}$ is replaced by its average, $T'_{i/2} = (T_i + T_{i+1})/2$, halving the number of tokens along the temporal dimension fed into the attention layers. After the attention computation (both self-attention and cross-attention), each merged output token $T''_{i/2}$ is "unmerged" by duplicating it into two identical tokens, restoring the original sequence length before subsequent operations (such as the MLP layers).
- Implementation: This modification occurs within the STDiT forward pass, specifically around the attention blocks: averaging operations before attention and duplication operations after it (see the sketch following this list). The paper suggests it can be applied selectively, perhaps only during initial denoising steps, to balance efficiency gains against potential minor quality degradation in highly dynamic scenes.
- Impact: By reducing the number of tokens, TDTM decreases the computational complexity of attention. Theoretically, halving the temporal tokens cuts self-attention (complexity $O(N^2)$ in the token count $N$) by up to 4x, and cross-attention ($O(N)$) by 2x. Experimental results attributed speedups of ~1.1x-1.7x to TDTM.
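A minimal sketch of the merge/unmerge wrapper, assuming tokens are laid out as `(batch, frames, spatial, dim)` with an even frame count; `attn_fn` stands in for the shape-preserving attention computation:

```python
import torch

def tdtm_attention(tokens: torch.Tensor, attn_fn) -> torch.Tensor:
    """Temporal Dimension Token Merging around an attention block (sketch)."""
    b, f, s, d = tokens.shape  # f assumed even
    # Merge: average each adjacent pair of temporal tokens, halving f.
    merged = tokens.reshape(b, f // 2, 2, s, d).mean(dim=2)
    # Attention (self- and cross-) now processes f/2 temporal tokens:
    # roughly 4x less work for O(N^2) self-attention along time,
    # 2x less for O(N) cross-attention.
    out = attn_fn(merged)
    # Unmerge: duplicate each merged token to restore the sequence length
    # expected by the subsequent MLP layers.
    return out.repeat_interleave(2, dim=1)
```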
Concurrent Inference with Dynamic Loading (CI-DL)
CI-DL tackles the memory limitation (C3) and associated latency overheads of loading model parts.
- Mechanism: CI-DL employs two coordinated strategies:
  - Model Partitioning and Concurrent Inference: Large model components (the T5 encoder, STDiT) are broken down into smaller sequential blocks that each fit within the available mobile device RAM. CI-DL orchestrates inference so that while the GPU executes computation for the current block ($i$), the CPU concurrently loads the next required block ($i+1$) from storage (e.g., NAND flash) into RAM. Overlapping I/O (loading) with computation minimizes GPU idle time spent waiting for model weights.
  - Dynamic Loading: At runtime, the system monitors available memory. If sufficient memory is free, CI-DL dynamically keeps frequently accessed model blocks (e.g., initial STDiT layers used in every denoising step) resident in RAM (a "retained" state) rather than unloading them after each use. This avoids the latency of repeatedly reloading those blocks from slower storage, which is particularly beneficial for the iterative diffusion process.
- Implementation: This requires careful management of memory buffers and asynchronous loading operations, leveraging platform-specific APIs (such as CoreML on iOS) for efficient CPU-GPU coordination and memory management. The partitioning strategy must respect layer dependencies and per-block memory constraints (the pipelining pattern is sketched after this list).
- Impact: CI-DL enables the execution of models significantly larger than the available RAM by fitting them piece by piece. The concurrency aspect directly reduces the latency overhead associated with loading model weights, making the overall inference pipeline significantly faster compared to naive sequential loading.
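The actual implementation targets Swift/CoreML, but the pipelining pattern itself is easy to illustrate in Python; `load_block`, `run_block`, and the `retained` cache below are hypothetical stand-ins for the platform-specific pieces:

```python
import threading
from queue import Queue

def run_pipelined(block_ids, x, load_block, run_block, retained=None):
    """Concurrent inference with dynamic loading (illustrative sketch).

    load_block(i)      -- slow I/O: read block i's weights from storage.
    run_block(i, w, x) -- execute block i on the accelerator.
    retained           -- dict of block_id -> weights pinned in RAM
                          (the "retained" state for hot blocks).
    """
    retained = retained if retained is not None else {}
    prefetched = Queue(maxsize=1)  # at most one block load in flight

    def prefetch(i):
        # Retained blocks skip the disk read entirely.
        w = retained[i] if i in retained else load_block(i)
        prefetched.put(w)

    threading.Thread(target=prefetch, args=(block_ids[0],)).start()
    for n, i in enumerate(block_ids):
        weights = prefetched.get()       # waits only if I/O lags compute
        if n + 1 < len(block_ids):       # overlap the next load ...
            threading.Thread(target=prefetch, args=(block_ids[n + 1],)).start()
        x = run_block(i, weights, x)     # ... with this block's compute
    return x
```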
Implementation and Evaluation
The proposed On-device Sora framework was implemented and evaluated on an iPhone 15 Pro using Apple's CoreML framework for hardware acceleration.
- Model: Open-Sora (specifically, version 1.0 based on STDiT) served as the primary text-to-video backbone.
- Quantization: To manage memory and potentially speed up inference, the T5 text encoder was quantized to 8-bit integers (int8), while the core STDiT model and the VAE decoder were kept in 32-bit floating point (fp32), since initial experiments indicated potential quality degradation from quantizing those components under the training-free constraint (a weight-quantization sketch follows this list).
- Evaluation: Performance was benchmarked using VBench, comparing generated video quality against the original Open-Sora model running on an NVIDIA A6000 GPU. Metrics covered aspects like image quality, temporal consistency, and motion dynamics. Latency measurements and ablation studies were conducted to isolate the contributions of LPL, TDTM, and CI-DL.
- Results: The evaluation demonstrated that On-device Sora could generate videos (e.g., 256x256 resolution) on the iPhone 15 Pro with quality largely comparable to the GPU baseline, according to VBench metrics. While some minor frame-level quality reductions were observed, the "dynamic degree" metric slightly improved. The combined application of LPL, TDTM, and CI-DL resulted in substantial latency reduction. For instance, the STDiT denoising latency for generating a 16-frame, 256x256 video was reduced from an estimated 1768 seconds (baseline projection) to approximately 454 seconds.
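The paper does not specify its quantization tooling; one plausible route is coremltools' post-training weight quantization, assuming the encoder has already been converted to an ML Program package (paths are illustrative):

```python
import coremltools as ct
import coremltools.optimize.coreml as cto

# Load the converted T5 encoder (hypothetical path).
t5 = ct.models.MLModel("T5TextEncoder.mlpackage")

# Post-training 8-bit linear weight quantization; per the paper, STDiT
# and the VAE decoder are left in fp32 to preserve output quality.
config = cto.OptimizationConfig(
    global_config=cto.OpLinearQuantizerConfig(mode="linear_symmetric")
)
t5_int8 = cto.linear_quantize_weights(t5, config)
t5_int8.save("T5TextEncoder_int8.mlpackage")
```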
Conclusion
"On-device Sora" demonstrates a viable pathway for deploying large diffusion-based text-to-video models on mobile devices without the conventional requirement of model-specific re-training or compression. By introducing the Linear Proportional Leap (LPL) to reduce denoising steps, Temporal Dimension Token Merging (TDTM) to lower attention computation costs, and Concurrent Inference with Dynamic Loading (CI-DL) to manage memory constraints and I/O latency, the framework achieves significant efficiency gains while preserving much of the original model's generative quality. This training-free approach significantly lowers the barrier for bringing advanced generative capabilities to edge devices.