This paper introduces DeepCache, a training-free method for accelerating inference in diffusion models such as Stable Diffusion, Latent Diffusion Models (LDM), and Denoising Diffusion Probabilistic Models (DDPM). The core problem it addresses is the significant computational cost of the sequential denoising process in these models. Unlike methods that require retraining or fine-tuning (e.g., distillation, pruning), DeepCache modifies the inference process dynamically at runtime.
Core Observation and Idea:
The authors observe that during the iterative denoising process, the high-level features computed by the deeper layers of the U-Net architecture exhibit significant temporal similarity between adjacent timesteps. This means that computing these features repeatedly in consecutive steps involves redundant calculations. DeepCache leverages this redundancy by caching and reusing these high-level features.
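This observation is easy to check empirically. The sketch below (not from the paper's code) hooks one of the deeper up-sampling blocks of a pre-trained Stable Diffusion U-Net and prints the cosine similarity of its output between adjacent denoising steps; the model id and the choice of `up_blocks[0]` as the "deep" feature are illustrative assumptions.

```python
import torch
import torch.nn.functional as F
from diffusers import StableDiffusionPipeline

# Sketch for eyeballing the core observation: deep U-Net features change very
# little between adjacent denoising steps. Model id and hooked block are
# assumptions for illustration, not prescribed by DeepCache.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

deep_feats = []  # one deep feature map per denoising step
hook = pipe.unet.up_blocks[0].register_forward_hook(
    lambda mod, args, out: deep_feats.append(out.detach().float().flatten(1).cpu())
)

pipe("a photo of an astronaut riding a horse", num_inference_steps=50)
hook.remove()

# Cosine similarity between the deep feature at consecutive steps
for i in range(len(deep_feats) - 1):
    sim = F.cosine_similarity(deep_feats[i], deep_feats[i + 1], dim=1).mean()
    print(f"step {i} -> {i + 1}: similarity {sim.item():.3f}")
```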
Methodology:
DeepCache exploits the inherent structure of the U-Net, specifically its encoder-decoder design with skip connections.
- Caching High-Level Features: At certain timesteps (cache update steps), the model performs a full forward pass through the U-Net. During this pass, the output features from an up-sampling block (which represent high-level, processed information from the deeper layers) are stored in a cache.
- Retrieving and Partial Inference: In the subsequent step(s) (retrieve steps), instead of running the full U-Net, DeepCache performs a partial inference:
- It computes only the low-level features from the shallow down-sampling blocks in the encoder path (D_1 through D_m) using the current noisy input x_t. This computation is relatively cheap because it involves only the shallower layers up to D_m.
- It retrieves the cached high-level feature produced at the most recent cache update step.
- It concatenates the newly computed low-level features with the retrieved high-level feature and feeds them into the up-sampling block U_m.
The rest of the up-sampling path (U_{m-1} down to U_1) is computed normally. This avoids recomputing the computationally expensive deeper parts of the U-Net (the down-sampling blocks below D_m, the middle block, and the up-sampling blocks above U_m).
Implementation Strategies:
1:N Inference (Uniform): The simplest strategy performs one full inference step (cache update) followed by N-1 partial inference steps (retrieve steps) that reuse the same cached features. Out of T total denoising steps, the full inference steps therefore fall at indices 0, N, 2N, and so on. Increasing N increases the speedup but can potentially degrade quality.
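As a concrete reading of this schedule, the hypothetical helper below lists which step indices trigger a cache update; everything in between is a retrieve step.

```python
def uniform_full_steps(T: int, N: int) -> list[int]:
    """Indices of denoising steps that run the full U-Net (cache updates).

    With T steps and interval N, every N-th step refreshes the cache and the
    N-1 steps in between reuse the cached deep feature.
    """
    return [i for i in range(T) if i % N == 0]

# e.g. uniform_full_steps(T=50, N=5) -> [0, 5, 10, ..., 45]
# i.e. 10 full passes and 40 cheap partial passes
```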
Non-uniform 1:N Inference: Acknowledging that feature similarity is not constant across timesteps (it often drops sharply around certain points in the denoising process), this strategy performs full updates more frequently around the timesteps where similarity is expected to be low. The full-inference timesteps are spaced according to a power function centered on a timestep c: steps are sampled densely near c and progressively more sparsely away from it.
Here, p (power) and c (center) are hyperparameters. This strategy aims to improve quality compared to uniform caching, especially for larger N.
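For intuition only, here is one plausible way to build such a power-spaced schedule. It mirrors the idea described above but is not the paper's exact formula (see the paper and its Appendix B for the precise definition); the function name and arguments are hypothetical.

```python
import numpy as np

def nonuniform_full_steps(T: int, num_full: int, center: int, power: float) -> list[int]:
    # Place full-inference steps densely near `center` and increasingly
    # sparsely away from it; a larger `power` strengthens the clustering.
    offsets = np.linspace(-1.0, 1.0, num_full)
    spread = max(center, T - 1 - center)
    steps = center + np.sign(offsets) * (np.abs(offsets) ** power) * spread
    steps = np.clip(np.round(steps).astype(int), 0, T - 1)
    # The very first denoising step must always be a full pass so that the
    # cache is populated before any retrieve step.
    return sorted(set([0] + steps.tolist()))

# e.g. nonuniform_full_steps(T=50, num_full=10, center=15, power=1.5)
```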
Pseudocode Overview (Simplified):
```python
def deepcache_step(x_t, t, cache, m):
    # --- Cache update step: full U-Net pass ---
    h = [x_t]
    for i in range(1, d + 1):              # down (encoder) path
        h.append(D[i](h[-1]))
    u_next = M(h[d])                       # middle block
    for i in range(d, 0, -1):              # up (decoder) path
        if i == m:
            cache['u_next'] = u_next       # cache the feature entering U_m
        u_next = U[i](concat(u_next, h[i]))
    epsilon_pred = u_next                  # final noise prediction
    x_prev = compute_x_prev(x_t, t, epsilon_pred)  # standard DDIM/PLMS update
    return x_prev, cache

def deepcache_retrieve_step(x_t, t, cache, m):
    # --- Retrieve step: partial U-Net pass (down path only up to D_m) ---
    h = [x_t]
    for i in range(1, m + 1):
        h.append(D[i](h[-1]))
    u_next = cache['u_next']               # retrieve the cached deep feature
    for i in range(m, 0, -1):              # remaining up path from U_m
        u_next = U[i](concat(u_next, h[i]))
    epsilon_pred = u_next
    x_prev = compute_x_prev(x_t, t, epsilon_pred)
    return x_prev

cache = {}
x = initial_noise
for t in range(T, 0, -1):
    if (T - t) % N == 0:                   # cache update step
        x, cache = deepcache_step(x, t, cache, m)
    else:                                  # retrieve step
        x = deepcache_retrieve_step(x, t, cache, m)
```
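In practice you rarely need to reimplement this loop: the authors publish a pip-installable helper (`pip install DeepCache`) that wraps a diffusers pipeline. The snippet below reflects the repository's documented usage at the time of writing, but the exact API may have changed, so treat it as a sketch and check the repo.

```python
import torch
from diffusers import StableDiffusionPipeline
from DeepCache import DeepCacheSDHelper  # pip install DeepCache

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# cache_interval corresponds to N, cache_branch_id to the skip branch m
helper = DeepCacheSDHelper(pipe=pipe)
helper.set_params(cache_interval=3, cache_branch_id=0)
helper.enable()

image = pipe("a photo of an astronaut riding a horse").images[0]
helper.disable()
```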
Experimental Results:
DeepCache demonstrated significant speedups: 2.3x for Stable Diffusion v1.5 (50 PLMS steps) with a minimal quality drop (0.05 in CLIP Score), and up to 7.0x-10.5x for LDM-4-G (250 DDIM steps) with a moderate quality drop (e.g., FID rising from 3.37 to 4.41 at 7.0x speedup with uniform N=10, or to 4.27 with non-uniform N=10).
It outperformed retraining-based compression methods like Diff-Pruning and BK-SDM variants in terms of quality at comparable or higher throughputs.
DeepCache is compatible with existing fast samplers like DDIM and PLMS. When compared to reducing sampler steps (e.g., using 25 PLMS steps vs. 50), DeepCache often achieved comparable or slightly better quality at similar throughputs.
Ablation studies confirmed the importance of reusing the cached features and showed that performing the shallow partial inference outperforms simply skipping those steps entirely.
The non-uniform strategy significantly improved results over the uniform one for larger caching intervals (e.g., N=10, N=20).
Practical Implementation Considerations:
- Training-Free: Easy to integrate into existing inference pipelines for pre-trained U-Net-based diffusion models, without any retraining.
- Hyperparameters:
  - m (skip branch index): Controls the trade-off between speedup and quality. Caching at a shallower branch (smaller m) gives more speedup but potentially lower quality, since more of the network is skipped. Figure 3 shows MACs per branch, which helps guide this choice.
  - N (caching interval): Controls the frequency of cache updates. Larger N means more speedup but potentially lower quality. The optimal N appears to be model- and dataset-dependent; values up to N=5 or N=10 are often effective.
  - c, p (non-uniform strategy): Require tuning for best results when the non-uniform strategy is used, especially for large N. Appendix B provides guidance.
- Computational Cost: Reduces MACs significantly by skipping the deeper layers during retrieve steps. The actual speedup depends on the chosen branch m and on how the U-Net's computation is distributed across branches (Figure 3).
- Memory: Requires storing the cached feature tensor, which adds a memory overhead compared to standard inference (see the rough estimate below).
- Limitations: Effectiveness depends on the U-Net structure; if the shallow skip branches still account for a large share of the computation, the achievable speedup is limited. Very large N values (e.g., N=20) can lead to noticeable quality degradation.
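The memory overhead amounts to one feature map per cached branch. A back-of-the-envelope helper (hypothetical, with an assumed fp16 tensor shape) gives a feel for the magnitude:

```python
def cache_overhead_mb(shape: tuple[int, ...], bytes_per_elem: int = 2) -> float:
    """Rough memory cost of one cached feature tensor (fp16 by default)."""
    n = 1
    for s in shape:
        n *= s
    return n * bytes_per_elem / (1024 ** 2)

# Assumed example shape (batch, channels, height, width) for a deep SD feature:
print(cache_overhead_mb((2, 1280, 16, 16)))  # ~1.25 MB
```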
In summary, DeepCache offers a practical, training-free method to accelerate diffusion model inference by exploiting temporal redundancy in U-Net features. It provides a tunable trade-off between speed and quality, is compatible with existing samplers, and often outperforms retraining-based compression methods at similar throughputs.