SynerDiff: Synergetic Continuous Batching for Fast and Parallel Diffusion Model Inference

Published 9 May 2026 in cs.AI | (2605.08835v1)

Abstract: The expansion of Artificial Intelligence-generated content services requires diffusion model serving to simultaneously achieve high throughput and low task end-to-end (E2E) latency. However, existing continuous batching methods suffer from severe resource contention during UNet-VAE concurrency, leading to latency spikes. Furthermore, concurrent multi-task scheduling entails a trade-off between UNet throughput and VAE latency across varying scheduling strategies. To address these issues, we propose SynerDiff, an efficient continuous batching system built on intra-inter level synergy. At the intra-concurrency level, SynerDiff alleviates resource contention by pruning component-specific resource bottlenecks via VAE Chunking and Adaptive Skip-CFG. At the inter-concurrency level, leveraging components' differential sensitivity to scheduling granularities, a threshold-aware scheduler plans concurrent sequences and tunes intra-concurrency decisions to minimize VAE latency while keeping UNet throughput within a high-throughput threshold. Additionally, a feedback controller dynamically adjusts this threshold based on queue loads to raise the system capacity ceiling. Experimental results show that SynerDiff improves throughput by 1.6× and decreases both average E2E and P99 tail latencies by up to 78.7% compared to baselines, while guaranteeing high image fidelity.


Authors (5)

Summary

The paper introduces SynerDiff, a continuous batching system that synergizes VAE chunking and adaptive Skip-CFG to achieve fast diffusion model inference.
The system employs a threshold-aware scheduler and a 3D dynamic programming algorithm to optimize GPU utilization across heterogeneous pipeline stages.
Quantitative results demonstrate a 1.6× throughput boost and up to 78.7% latency reduction while maintaining high image fidelity.

SynerDiff: A Synergetic Continuous Batching System for Efficient Diffusion Model Inference

Motivation and Problem Formulation

Diffusion models (DMs), particularly for AI-generated content, have become a core technology in high-fidelity image synthesis. Serving these models at production scale necessitates simultaneously achieving high throughput and low end-to-end (E2E) latency. Existing DM serving pipelines, characterized by a sequential execution of text encoding, iterative UNet denoising, and VAE decoding, exhibit asymmetric computational bottlenecks: UNet denoising is compute-bound and dominates latency, while VAE decoding is memory-bandwidth-bound and batch-insensitive.

Conventional approaches (e.g., static or dynamic batching, continuous batching as in InstGenIE) fail to reconcile this heterogeneity. Static batching underutilizes GPU parallelism when the workload varies, while continuous batching with naive parallel scheduling triggers severe resource contention during UNet–VAE concurrency, which shows up as latency spikes and throughput collapse under heavy load. The core scheduling challenge is to coordinate heterogeneous DM pipeline components for maximal resource utilization without inflating tail latency.

SynerDiff System Architecture

SynerDiff proposes an intra–inter level synergistic continuous batching system to address DM serving’s fundamental throughput–latency trade-off. The system architecture consists of three integrated modules:

Offline Profiler: Microbenchmarks determine the optimal VAE chunking granularity and UNet batch size, constructing lookup tables for concurrent latency and throughput under varying configurations.
Global Scheduler: Acts as the control plane, executing a threshold-aware sequence scheduler guided by workload statistics and offline profiling data. The global scheduler dynamically tunes the throughput threshold based on system load and coordinates task-level execution granularity.
Component Executor: Acts as the data plane, orchestrating fine-grained intra-batch scheduling. It applies VAE chunking and adaptive UNet Skip Classifier-Free Guidance (Skip-CFG) in step with scheduler directives.

This design paradigm enables SynerDiff to remain model-agnostic and applicable across the family of UNet-based DMs.

Intra-Concurrency Optimization

VAE Chunking

SynerDiff divides the VAE decoder into temporally equivalent sub-blocks at the granularity of ResNet blocks to temporally disperse peak bandwidth requirements. By scheduling sub-blocks interleaved with UNet steps, bandwidth contention is ameliorated, suppressing latency spikes without significant computational overhead. This chunk-level parallelism trades minimal extra VAE latency for a substantial reduction in overall E2E delay.
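The chunking idea can be illustrated with a minimal sketch. It assumes per-block decoder timings are already known from profiling; the timings, the greedy partition rule, and the function names `split_equal_time` and `interleave` are hypothetical and not taken from the paper. The sketch only produces a serialized ordering, whereas the real system co-runs the chunks with UNet steps on the GPU.

```python
from typing import List, Tuple

def split_equal_time(block_ms: List[float], num_chunks: int) -> List[List[int]]:
    """Greedily group consecutive block indices into chunks of similar total time."""
    assert 1 <= num_chunks <= len(block_ms)
    target = sum(block_ms) / num_chunks
    chunks: List[List[int]] = []
    current: List[int] = []
    acc = 0.0
    for i, cost in enumerate(block_ms):
        current.append(i)
        acc += cost
        blocks_left = len(block_ms) - i - 1
        chunks_left = num_chunks - len(chunks) - 1  # chunks still to open after this one
        must_close = blocks_left == chunks_left     # exactly one block per remaining chunk
        if chunks_left > 0 and (acc >= target or must_close):
            chunks.append(current)
            current, acc = [], 0.0
    chunks.append(current)  # the last chunk takes whatever remains
    return chunks

def interleave(unet_steps: int, vae_chunks: List[List[int]]) -> List[Tuple[str, object]]:
    """Place one VAE chunk after each UNet step until the chunks run out."""
    plan: List[Tuple[str, object]] = []
    pending = list(vae_chunks)
    for step in range(unet_steps):
        plan.append(("unet", step))
        if pending:
            plan.append(("vae", pending.pop(0)))
    plan.extend(("vae", chunk) for chunk in pending)  # leftovers, if any
    return plan

# Made-up per-block timings (ms) for a six-block decoder, split into two chunks.
chunks = split_equal_time([4, 4, 2, 2, 2, 2], num_chunks=2)
plan = interleave(unet_steps=3, vae_chunks=chunks)
```

With these made-up timings each chunk costs 8 ms, so the decoder's bandwidth demand is spread over two UNet steps instead of arriving as a single burst.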

Adaptive Skip-CFG

UNet denoising in DMs typically applies classifier-free guidance (CFG) to improve sample realism, which roughly doubles UNet compute per step because both a conditional and an unconditional branch are evaluated. SynerDiff exploits the observation that late-stage denoising steps are less sensitive to CFG ablation: the unconditional branch can be skipped in later steps with negligible perceptual impact. The system adaptively prunes unnecessary UNet compute by enforcing a configurable skip threshold, which frees GPU compute cycles for co-scheduled VAE decoding while preserving generation fidelity.
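A minimal sketch of step-thresholded CFG skipping follows. It uses the standard CFG combination eps = eps_uncond + s * (eps_cond - eps_uncond) and, as an assumption, falls back to the conditional prediction alone once the step index reaches a hypothetical `skip_start_step`. Scalars stand in for noise tensors, and the toy `cond_fn` and `uncond_fn` are placeholders for UNet forward passes. In SynerDiff the threshold is tuned by the scheduler; here it is a fixed argument.

```python
from typing import Callable, Tuple

NoiseFn = Callable[[float, int], float]  # (latent, step) -> predicted noise

def guided_eps(x: float, step: int, skip_start_step: int,
               cond_fn: NoiseFn, uncond_fn: NoiseFn,
               scale: float) -> Tuple[float, int]:
    """Return (noise estimate, number of UNet evaluations used)."""
    eps_cond = cond_fn(x, step)
    if step >= skip_start_step:
        # Late step: skip the unconditional branch entirely.
        return eps_cond, 1
    eps_uncond = uncond_fn(x, step)
    return eps_uncond + scale * (eps_cond - eps_uncond), 2

def total_unet_calls(total_steps: int, skip_start_step: int) -> int:
    """Count UNet evaluations over a full denoising run with toy predictors."""
    cond = lambda x, t: 1.0    # stand-in for the conditional branch
    uncond = lambda x, t: 0.0  # stand-in for the unconditional branch
    return sum(
        guided_eps(0.0, t, skip_start_step, cond, uncond, 7.5)[1]
        for t in range(total_steps)
    )

full_cfg_calls = total_unet_calls(total_steps=10, skip_start_step=10)  # no skipping
skip_cfg_calls = total_unet_calls(total_steps=10, skip_start_step=6)   # skip last 4 steps
```

In this toy run, skipping the unconditional branch in the last four of ten steps cuts UNet evaluations from 20 to 16.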

Inter-Concurrency Scheduling

SynerDiff employs a threshold-aware sequence scheduler to balance UNet throughput and VAE tail latency. Scheduling decisions leverage key empirical insights:

VAE decoding is highly sensitive to batching granularity due to its memory-bound profile.
UNet throughput demonstrates a non-linear plateau under concurrency; aggressive batch partitioning below this plateau does not penalize throughput.
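One way to read these two insights together: among profiled configurations, pick the one with the lowest VAE latency whose UNet throughput still clears a threshold. The sketch below assumes a lookup table keyed by (UNet batch size, VAE chunk count); the table layout, the numbers, and the `pick_config` helper are all hypothetical and only illustrate threshold-aware selection.

```python
from typing import Dict, Optional, Tuple

Config = Tuple[int, int]       # (unet_batch_size, vae_chunk_count)
Stats = Tuple[float, float]    # (unet_throughput_tasks_per_s, vae_latency_ms)

def pick_config(profile: Dict[Config, Stats],
                threshold_frac: float) -> Optional[Config]:
    """Lowest-VAE-latency config whose throughput is >= threshold_frac * peak."""
    peak = max(tput for tput, _ in profile.values())
    floor = peak * threshold_frac
    feasible = [(lat, cfg) for cfg, (tput, lat) in profile.items() if tput >= floor]
    if not feasible:
        return None
    return min(feasible)[1]

# Made-up profiling results, for illustration only.
profile: Dict[Config, Stats] = {
    (4, 1): (3.0, 900.0),
    (4, 4): (2.8, 400.0),
    (2, 8): (2.0, 250.0),
}

strict = pick_config(profile, threshold_frac=0.9)   # must stay near peak throughput
relaxed = pick_config(profile, threshold_frac=0.6)  # may trade throughput for latency
```

With a strict threshold the selection keeps the large UNet batch and moderate chunking; relaxing the threshold admits the finer-grained configuration with lower VAE latency.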

A 3D dynamic programming (DP) algorithm explores task sequence plans across UNet, VAE, and Skip-CFG-eligible denoising steps. The optimization minimizes mean VAE E2E latency, subject to throughput and quality constraints. A feedback controller operates in real time: on detecting queue buildup, it increases Skip-CFG application and the VAE chunking factor to raise the system throughput ceiling as needed.
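The feedback loop can be sketched as a simple watermark controller. The source does not specify the control law, so everything here is an assumption: the threshold is expressed as a percentage of peak UNet throughput, it is raised when the queue exceeds a high watermark and lowered when the queue drains, and the class name, step size, bounds, and watermarks are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class FeedbackController:
    threshold_pct: int = 80   # required share of peak UNet throughput (assumed unit)
    step_pct: int = 5         # adjustment per update
    min_pct: int = 60
    max_pct: int = 95
    high_watermark: int = 8   # queue length that signals buildup
    low_watermark: int = 2    # queue length that signals a drained queue

    def update(self, queue_len: int) -> int:
        """Adjust and return the threshold for the observed queue length."""
        if queue_len > self.high_watermark:
            # Queue is building up: favor throughput.
            self.threshold_pct = min(self.max_pct, self.threshold_pct + self.step_pct)
        elif queue_len < self.low_watermark:
            # Queue has drained: relax the threshold to favor VAE latency.
            self.threshold_pct = max(self.min_pct, self.threshold_pct - self.step_pct)
        return self.threshold_pct

controller = FeedbackController()
trace = [controller.update(q) for q in (10, 10, 10, 10, 5, 0)]
```

The gap between the two watermarks acts as a dead band, so the threshold holds steady for moderate queue lengths instead of oscillating on every update.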

Experimental Evaluation

Experiments were conducted on an NVIDIA RTX 5090 with SDv1.5 across diverse traffic profiles (Poisson, bursty), benchmarking SynerDiff against static batching (Diffusers), task-level dynamic batching, and InstGenIE.

Key quantitative results include:

Throughput Scaling: SynerDiff boosts system throughput by 1.6× over InstGenIE, peaking at 3.4 tasks/s in burst settings.
Tail Latency Suppression: SynerDiff reduces both average E2E and P99 tail latency by up to 78.7% relative to state-of-the-art baselines.
Component Efficiency: Unlike prior dynamic batching methods, SynerDiff maintains stable UNet and VAE performance concurrently under load, avoiding the exponential queuing delays observable in existing approaches.
Fidelity: Images maintain high DINO (structure) and CLIP (semantic alignment) scores (<13% drop in worst-case scenarios), validating that aggressive parallelization and compute pruning do not induce perceivable degradation.

Ablation studies confirm that both VAE chunking and adaptive Skip-CFG are critical; omission of either increases mean latency by up to 3× and reduces throughput by 20%. The threshold-aware scheduler and feedback controller contribute an additional 60% latency reduction, underscoring the importance of inter-level synergy.

Implications and Future Prospects

SynerDiff’s design offers a robust, scalable reference for serving DMs under practical, throughput-oriented constraints. By harmonizing resource-bound heterogeneity in DM architectures through intra–inter synergetic optimization, the work pushes the state-of-the-art in resource-efficient AIGC service. The modular system is compatible with multi-model and multi-GPU extensions. Potential future directions include distributed scheduler design for multi-accelerator contexts and integration with single-task acceleration strategies (e.g., kernel fusion, approximate caching).

Theoretically, SynerDiff’s framework demonstrates both the necessity and the tractability of balancing component-specific sensitivity, resource contention, and application-level quality thresholds in deep generative model systems. Practically, it enables DM-based services to sustain higher arrival rates without sacrificing visual fidelity, advancing the deployment horizon for high-throughput generative AI.

Conclusion

SynerDiff introduces a principled, efficiency-oriented continuous batching system for DM inference, characterized by fine-grained intra-concurrency scheduling, adaptive workload pruning, and globally optimal inter-task sequence planning. The system achieves substantial throughput and latency improvements while preserving output fidelity, representing a significant contribution to the engineering of scalable, deployable generative model serving stacks (2605.08835).
