- The paper introduces SynerDiff, a continuous batching system that synergizes VAE chunking and adaptive Skip-CFG to achieve fast diffusion model inference.
- The system employs a threshold-aware scheduler and a 3D dynamic programming algorithm to optimize GPU utilization across heterogeneous pipeline stages.
- Quantitative results demonstrate a 1.6× throughput boost and up to 78.7% latency reduction while maintaining high image fidelity.
SynerDiff: A Synergetic Continuous Batching System for Efficient Diffusion Model Inference
Diffusion models (DMs) have become a core technology for high-fidelity image synthesis in AI-generated content (AIGC). Serving them at production scale requires both high throughput and low end-to-end (E2E) latency. Existing DM serving pipelines run text encoding, iterative UNet denoising, and VAE decoding in sequence, and these stages have asymmetric bottlenecks: UNet denoising is compute-bound and dominates latency, while VAE decoding is memory-bandwidth-bound and batch-insensitive.
Conventional approaches (e.g., static or dynamic batching, continuous batching as in InstGenIE) fail to reconcile this heterogeneity. Static batching underutilizes GPU parallelism when the workload varies, while continuous batching with naive parallel scheduling causes severe resource contention when UNet and VAE run concurrently, which shows up as latency spikes and throughput collapse under heavy load. The core scheduling challenge is to coordinate the heterogeneous DM pipeline components for maximal resource utilization without inflating tail latency.
SynerDiff System Architecture
SynerDiff is a continuous batching system that combines intra-concurrency and inter-concurrency optimizations to address the fundamental throughput–latency trade-off in DM serving. The system architecture consists of three integrated modules:
- Offline Profiler: Microbenchmarks determine the optimal VAE chunking granularity and UNet batch size, constructing lookup tables for concurrent latency and throughput under varying configurations.
- Global Scheduler: Acts as the control plane, executing a threshold-aware sequence scheduler guided by workload statistics and offline profiling data. The global scheduler dynamically tunes the throughput threshold based on system load and coordinates task-level execution granularity.
- Component Executor: Acts as the data plane, orchestrating fine-grained intra-batch scheduling. It synchronously manages VAE chunking and adaptive UNet Skip Classifier-Free Guidance (Skip-CFG) in response to scheduler directives.
This design paradigm enables SynerDiff to remain model-agnostic and applicable across the family of UNet-based DMs.
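As a rough illustration of what an offline profiling table could look like, the sketch below times a stand-in workload at several configurations and records latency and throughput keyed by (UNet batch size, VAE chunk count). The function names, the table layout, and the fake workload are assumptions made for illustration; the source does not specify how the Offline Profiler stores its measurements.

```python
import time
from typing import Callable, Dict, Iterable, Tuple

# Hypothetical key layout: (unet_batch_size, vae_chunks)
Config = Tuple[int, int]


def profile(run: Callable[[int, int], None],
            batch_sizes: Iterable[int],
            chunk_counts: Iterable[int],
            repeats: int = 3) -> Dict[Config, dict]:
    """Time run(batch, chunks) and record mean latency and throughput."""
    chunk_counts = list(chunk_counts)
    table: Dict[Config, dict] = {}
    for b in batch_sizes:
        for c in chunk_counts:
            start = time.perf_counter()
            for _ in range(repeats):
                run(b, c)
            latency = (time.perf_counter() - start) / repeats
            table[(b, c)] = {
                "latency_s": latency,
                # Tasks completed per second at this configuration.
                "throughput": b / latency if latency > 0 else float("inf"),
            }
    return table


def fake_workload(batch: int, chunks: int) -> None:
    # Stand-in for a real concurrent UNet + VAE run; it only burns CPU time.
    sum(i * i for i in range(2000 * batch + 500 * chunks))


table = profile(fake_workload, batch_sizes=[1, 2, 4], chunk_counts=[1, 4])
```

A scheduler could then look up `table[(4, 4)]["latency_s"]` at run time instead of measuring on the critical path.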
Intra-Concurrency Optimization
VAE Chunking
SynerDiff divides the VAE decoder into sub-blocks of roughly equal duration, split at ResNet-block boundaries, so that peak memory-bandwidth demand is spread over time. Interleaving these sub-blocks with UNet steps reduces bandwidth contention and suppresses latency spikes without significant computational overhead. Chunking trades a small increase in VAE latency for a substantial reduction in overall E2E delay.
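The interleaving idea can be sketched as an issue order that spreads VAE sub-blocks evenly across UNet steps. This is a minimal sketch, not the system's scheduler: the function name and the even-spacing rule are assumptions, and the output is only the order in which work is issued, whereas the real system runs the two components concurrently on the GPU.

```python
from typing import List


def interleave(unet_steps: int, vae_chunks: int) -> List[str]:
    """Spread vae_chunks VAE sub-blocks as evenly as possible across UNet steps."""
    if unet_steps <= 0:
        # No UNet work to hide behind: issue all chunks back to back.
        return [f"vae_chunk[{i}]" for i in range(vae_chunks)]
    schedule: List[str] = []
    issued = 0
    for step in range(unet_steps):
        schedule.append(f"unet[{step}]")
        # Number of chunks that should have been issued once this step is done.
        target = (step + 1) * vae_chunks // unet_steps
        while issued < target:
            schedule.append(f"vae_chunk[{issued}]")
            issued += 1
    return schedule


plan = interleave(unet_steps=4, vae_chunks=2)
# plan == ["unet[0]", "unet[1]", "vae_chunk[0]", "unet[2]", "unet[3]", "vae_chunk[1]"]
```

Compared with issuing the whole decoder at once, no single UNet step has to share bandwidth with more than a fraction of the VAE work.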
Adaptive Skip-CFG
UNet denoising in DMs typically applies classifier-free guidance (CFG) to improve sample realism, at the cost of roughly doubling UNet compute, since each step evaluates both a conditional and an unconditional branch. SynerDiff exploits the observation that late-stage denoising steps are less sensitive to removing CFG: the unconditional branch evaluation can be skipped in later steps with negligible perceptual impact. The system adaptively prunes unnecessary UNet compute by enforcing a configurable skip threshold, freeing GPU compute cycles for co-scheduled VAE decoding while preserving generation fidelity.
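A minimal sketch of the skipping rule, assuming the threshold is expressed as a step index (`skip_start`) from which only the conditional branch is evaluated. The helper names and the stub branches are hypothetical; the combination formula is the standard CFG expression, not something specific to SynerDiff.

```python
from typing import Callable, Dict, List

Vector = List[float]


def cfg_combine(eps_cond: Vector, eps_uncond: Vector, scale: float) -> Vector:
    """Standard CFG: eps_uncond + scale * (eps_cond - eps_uncond)."""
    return [u + scale * (c - u) for c, u in zip(eps_cond, eps_uncond)]


def predict_noise(step: int, skip_start: int,
                  cond_branch: Callable[[int], Vector],
                  uncond_branch: Callable[[int], Vector],
                  scale: float) -> Vector:
    """Use full CFG before skip_start; afterwards use the conditional branch only."""
    eps_cond = cond_branch(step)
    if step >= skip_start:
        return eps_cond  # unconditional evaluation skipped
    return cfg_combine(eps_cond, uncond_branch(step), scale)


# Stub branches that count how often each UNet pass would run.
calls: Dict[str, int] = {"cond": 0, "uncond": 0}


def fake_cond(step: int) -> Vector:
    calls["cond"] += 1
    return [1.0, 2.0]


def fake_uncond(step: int) -> Vector:
    calls["uncond"] += 1
    return [0.0, 1.0]


total_steps, skip_start = 50, 30  # assumed values for illustration
for step in range(total_steps):
    predict_noise(step, skip_start, fake_cond, fake_uncond, scale=7.5)

unet_evals = calls["cond"] + calls["uncond"]  # 50 + 30 = 80 instead of 100
```

With these assumed numbers, skipping CFG in the last 20 of 50 steps removes 20 of 100 UNet evaluations; raising or lowering `skip_start` moves the trade-off between compute and guidance strength.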
Inter-Concurrency Scheduling
SynerDiff employs a threshold-aware sequence scheduler to balance UNet throughput and VAE tail latency. Scheduling decisions leverage key empirical insights:
- VAE decoding is highly sensitive to batching granularity due to its memory-bound profile.
- UNet throughput demonstrates a non-linear plateau under concurrency; aggressive batch partitioning below this plateau does not penalize throughput.
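One possible reading of the plateau observation is that the scheduler can pick the smallest UNet batch size whose profiled throughput stays within a tolerance of the peak. The sketch below shows that selection; the function name, the tolerance parameter, and the throughput numbers are made up for illustration and are not taken from the paper.

```python
from typing import Dict


def smallest_batch_on_plateau(throughput_by_batch: Dict[int, float],
                              tolerance: float = 0.05) -> int:
    """Return the smallest batch size within `tolerance` of peak throughput."""
    peak = max(throughput_by_batch.values())
    eligible = [b for b, t in throughput_by_batch.items()
                if t >= (1.0 - tolerance) * peak]
    return min(eligible)


# Illustrative profile (tasks/s per UNet batch size); not measured data.
profile_table = {1: 1.0, 2: 1.8, 4: 2.9, 8: 3.3, 16: 3.4}

chosen = smallest_batch_on_plateau(profile_table)           # 8
relaxed = smallest_batch_on_plateau(profile_table, 0.20)    # 4
```

A looser tolerance yields smaller batches, which leaves more room for co-scheduled VAE work at the price of some UNet throughput.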
A 3D dynamic programming (DP) algorithm is employed to explore optimal task sequence planning across UNet, VAE, and Skip-CFG-eligible denoising steps. The optimization aims to minimize mean VAE E2E latency, subject to throughput and quality constraints. A feedback controller operates in real time: when it detects queue buildup, it adaptively increases Skip-CFG application and the VAE chunking factor to raise the system throughput ceiling as needed.
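The feedback behavior described above can be sketched as a simple watermark controller. Everything concrete here is an assumption: the knob names, the watermarks, the step sizes, and the choice to relax the knobs again when the queue drains (the source only states that they are increased under queue buildup).

```python
from dataclasses import dataclass


@dataclass
class Knobs:
    skip_cfg_ratio: float = 0.0  # fraction of late steps that skip CFG (hypothetical knob)
    vae_chunks: int = 1          # VAE chunking factor (hypothetical knob)


def adjust(knobs: Knobs, queue_len: int,
           high_water: int = 8, low_water: int = 2,
           max_skip: float = 0.5, max_chunks: int = 8,
           skip_step: float = 0.1) -> Knobs:
    """Raise both knobs when the queue builds up; relax them when it drains."""
    if queue_len > high_water:
        return Knobs(
            skip_cfg_ratio=round(min(max_skip, knobs.skip_cfg_ratio + skip_step), 2),
            vae_chunks=min(max_chunks, knobs.vae_chunks * 2),
        )
    if queue_len < low_water:
        # Assumed relaxation path, not described in the source.
        return Knobs(
            skip_cfg_ratio=round(max(0.0, knobs.skip_cfg_ratio - skip_step), 2),
            vae_chunks=max(1, knobs.vae_chunks // 2),
        )
    return knobs  # queue within the band: leave settings unchanged


knobs = Knobs()
history = []
for queue_len in [1, 10, 12, 3, 0]:  # illustrative queue-length samples
    knobs = adjust(knobs, queue_len)
    history.append(knobs)
```

The caps (`max_skip`, `max_chunks`) stand in for the quality constraint: however long the queue grows, the controller never prunes more CFG steps than the configured limit.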
Experimental Evaluation
Experiments were conducted on an NVIDIA RTX 5090 with SDv1.5 across diverse traffic profiles (Poisson, bursty), benchmarking SynerDiff against static batching (Diffusers), task-level dynamic batching, and InstGenIE.
Key quantitative results include:
- Throughput Scaling: SynerDiff boosts system throughput by 1.6× over InstGenIE, peaking at 3.4 tasks/s in burst settings.
- Tail Latency Suppression: SynerDiff reduces both average E2E latency and P99 tail latency by up to 78.7% relative to state-of-the-art baselines.
- Component Efficiency: Unlike prior dynamic batching methods, SynerDiff maintains stable UNet and VAE performance concurrently under load, avoiding the exponential queuing delays observable in existing approaches.
- Fidelity: Images maintain high DINO (structure) and CLIP (semantic alignment) scores, with a drop of less than 13% in the worst case, indicating that aggressive parallelization and compute pruning do not induce perceptible degradation.
Ablation studies confirm that both VAE chunking and adaptive Skip-CFG are critical; omission of either increases mean latency by up to 3× and reduces throughput by 20%. The threshold-aware scheduler and feedback controller contribute an additional 60% latency reduction, underscoring the importance of inter-level synergy.
Implications and Future Prospects
SynerDiff offers a robust, scalable reference design for serving DMs under practical, throughput-oriented constraints. By reconciling the differing resource bottlenecks of DM components through combined intra- and inter-concurrency optimization, the work advances the state of the art in resource-efficient AIGC serving. The modular system is compatible with multi-model and multi-GPU extensions. Potential future directions include distributed scheduler design for multi-accelerator contexts and integration with single-task acceleration strategies (e.g., kernel fusion, approximate caching).
Theoretically, the framework demonstrates that balancing component-specific sensitivity, resource contention, and application-level quality thresholds in deep generative model systems is both necessary and tractable. Practically, it enables DM-based services to sustain higher arrival rates without sacrificing visual fidelity, advancing the deployment horizon for high-throughput generative AI.
Conclusion
SynerDiff introduces a principled, efficiency-oriented continuous batching system for DM inference, characterized by fine-grained intra-concurrency scheduling, adaptive workload pruning, and globally optimal inter-task sequence planning. The system achieves substantial throughput and latency improvements while preserving output fidelity, representing a significant contribution to the engineering of scalable, deployable generative model serving stacks (2605.08835).