GENSERVE: Efficient Co-Serving of Heterogeneous Diffusion Model Workloads

Published 6 Apr 2026 in cs.DC | (2604.04335v2)

Abstract: Diffusion models have emerged as the prevailing approach for text-to-image (T2I) and text-to-video (T2V) generation, yet production platforms must increasingly serve both modalities on shared GPU clusters while meeting stringent latency SLOs. Co-serving such heterogeneous workloads is challenging: T2I and T2V requests exhibit vastly different compute demands, parallelism characteristics, and latency requirements, leading to significant SLO violations in existing serving systems. We present GENSERVE, a co-serving system that leverages the inherent predictability of the diffusion process to optimize serving efficiency. A central insight is that diffusion inference proceeds in discrete, predictable steps and is naturally preemptible at step boundaries, opening a new design space for heterogeneity-aware resource management. GENSERVE introduces step-level resource adaptation through three coordinated mechanisms: intelligent video preemption, elastic sequence parallelism with dynamic batching, and an SLO-aware scheduler that jointly optimizes resource allocation across all concurrent requests. Experimental results show that GENSERVE improves the SLO attainment rate by up to 44% over the strongest baseline across diverse configurations.


Authors (13)

Summary

The paper presents a co-serving system that combines deadline-aware preemption with elastic resource allocation for text-to-image and text-to-video workloads.
It introduces a dynamic programming scheduler that re-plans resource allocation at diffusion step boundaries, sustaining high SLO attainment under mixed load.
Experiments report up to 44% higher SLO attainment than the strongest baseline, along with lower image latency from preemption and dynamic SP reconfiguration.


Introduction and Motivation

GENSERVE addresses the challenge of serving heterogeneous diffusion workloads, specifically text-to-image (T2I) and text-to-video (T2V) generation, on shared GPU clusters while maximizing service-level objective (SLO) attainment. T2I and T2V tasks differ sharply in compute demand, parallelism profile, and SLO stringency, so conventional policies such as FIFO, SJF, or static resource partitioning produce contention and deadline violations. The key insight is that modern diffusion models, particularly Diffusion Transformer (DiT) architectures, run as a sequence of denoising steps with highly predictable cost. This makes fine-grained, low-cost preemption at step boundaries possible and lets the system adapt resource allocation online, before deadlines are at risk.
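The following is a minimal sketch of why predictable per-step cost helps, not the paper's implementation. It assumes a hypothetical `DenoiseJob` record with a profiled per-step time; the names `DenoiseJob`, `run_until_paused`, `run_step`, and `should_pause` are all illustrative. Remaining time becomes a simple product, deadline slack follows by subtraction, and a job can stop cleanly between steps and resume later from the saved step index.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class DenoiseJob:
    """Hypothetical job record; field names are illustrative, not from the paper."""
    total_steps: int
    step_time_s: float   # profiled per-step latency at the current configuration
    deadline_s: float    # absolute deadline, on the same clock as `now_s`
    next_step: int = 0   # index of the next denoising step to run

    def remaining_time_s(self) -> float:
        # Predictable per-step cost makes the remaining work a simple product.
        return (self.total_steps - self.next_step) * self.step_time_s

    def slack_s(self, now_s: float) -> float:
        # Positive slack: the job can wait this long and still meet its deadline.
        return self.deadline_s - now_s - self.remaining_time_s()


def run_until_paused(
    job: DenoiseJob,
    run_step: Callable[[int], None],
    should_pause: Callable[[], bool],
) -> bool:
    """Run denoising steps, checking for a pause request only at step boundaries.

    Returns True if the job finished, False if it was paused. A paused job
    keeps `next_step`, so calling this function again resumes where it stopped.
    """
    while job.next_step < job.total_steps:
        if should_pause():
            return False
        run_step(job.next_step)
        job.next_step += 1
    return True
```

In a real system `run_step` would update the latent tensor in place and the pause check would come from the scheduler; here both are plain callables so the control flow can be exercised without a GPU.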

Figure 1: Serving 4 videos (V1--V4) and 3 images (I1--I3) on 4 GPUs. (a) FIFO leads to head-of-line blocking and deadline misses for images; (b) GENSERVE uses preemption, dynamic sequence parallelism (SP), and batching to satisfy all deadlines.

System Design

GENSERVE integrates three synergistic resource management and scheduling primitives:

Intelligent Video Preemption: Video requests, which are typically long-running, are paused at denoising step boundaries when SLO-critical image requests arrive and no GPU is immediately available. The system computes the deadline slack of each running video and preempts the highest-slack candidates first, choosing only videos that can still meet their deadlines after the pause. When GPUs free up, paused videos resume, possibly at a higher SP degree to speed completion if their slack has become tight or the image queue has drained.
Figure 3: GENSERVE’s scheduler leverages step-level predictability to preempt videos, prioritizing SLO compliance for urgent image requests by pausing and resuming video workloads adaptively.
Elastic Resource Allocation: T2I and T2V workloads possess divergent parallelism and batching optima. GENSERVE dynamically assigns SP degrees to video jobs based on resolution and load, and employs SLO-aware batching for image jobs, balancing GPU utilization with per-request latency constraints. Joint adaptation ensures that GPU resources are continually rebalanced as the workload mix and SLO pressure evolve.
Figure 5: The SLO-aware scheduler adaptively batches image jobs, varies video SP degrees, and decouples pipeline stages for maximal utilization and SLO attainment.
SLO-Aware Online Scheduling via Dynamic Programming: At each scheduling round (triggered at step boundaries or on job events), GENSERVE constructs and scores a compact set of scheduling candidates (batch sizes for images; concrete placements and SP configurations for videos) and solves a knapsack-style DP to maximize the number of requests that meet their SLOs. Because per-step diffusion inference cost is close to deterministic, the scheduler can predict whether each candidate meets its deadline and commit to a plan.
Figure 7: GENSERVE system overview: Scheduler, Profiler, Solver, Allocator, and Worker components form a closed optimization and execution loop.
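Two of the mechanisms above can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's algorithm: the function names, the `(gpus_needed, meets_slo)` option encoding, and the rule "preempt only if slack covers the expected pause" are all hypothetical simplifications. The first function picks preemption victims by highest slack; the second is a multiple-choice knapsack DP that picks at most one option per request within a GPU budget to maximize the number of requests meeting their SLO.

```python
from typing import List, Sequence, Tuple

# One way of running a request: (gpus_needed, meets_slo). Illustrative encoding.
Option = Tuple[int, bool]


def pick_preemption_victims(
    slacks_s: Sequence[float], pause_s: float, needed: int
) -> List[int]:
    """Return indices of up to `needed` running videos, highest slack first.

    Only videos whose slack covers the expected pause are eligible, so pausing
    them is not expected to cause a deadline miss.
    """
    safe = [i for i, slack in enumerate(slacks_s) if slack >= pause_s]
    safe.sort(key=lambda i: slacks_s[i], reverse=True)
    return safe[:needed]


def max_slo_satisfied(
    requests: Sequence[Sequence[Option]], total_gpus: int
) -> Tuple[int, List[int]]:
    """Knapsack-style DP over per-request options.

    Returns (number of SLO-meeting requests, chosen option index per request),
    where -1 means the request is deferred in this scheduling round.
    """
    # dp[g] = best (count, choices) using at most g GPUs for requests seen so far.
    dp: List[Tuple[int, List[int]]] = [(0, []) for _ in range(total_gpus + 1)]
    for options in requests:
        new_dp: List[Tuple[int, List[int]]] = []
        for g in range(total_gpus + 1):
            count, choices = dp[g]
            best = (count, choices + [-1])  # default: defer this request
            for idx, (gpus, meets_slo) in enumerate(options):
                if gpus <= g:
                    prev_count, prev_choices = dp[g - gpus]
                    candidate = prev_count + int(meets_slo)
                    if candidate > best[0]:
                        best = (candidate, prev_choices + [idx])
            new_dp.append(best)
        dp = new_dp
    return dp[total_gpus]
```

For example, with 4 GPUs, two images that each need 1 GPU, a video that meets its deadline only with 4 GPUs, and a second video that meets its deadline with 2, the DP schedules both images and the second video and defers the first. A production scheduler would also weigh placement and batching, which this sketch leaves out.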

Experimental Results

End-to-End SLO Attainment

GENSERVE yields up to 44% higher SLO attainment rates (SAR) compared to competitive baselines (SRTF, FCFS, SJF, RASP) across a variety of hardware, workload mix, and system load regimes. Under balanced mixes (50:50 T2I:T2V), GENSERVE maintains high overall SAR, approaching 90% under relaxed SLOs and retaining superior image and video deadline satisfaction even as load intensifies or tasks become more video-heavy.
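As a reading aid, SAR can be computed as below. The sketch assumes SAR is the fraction of requests whose end-to-end latency is within their SLO; this summary does not spell out the paper's exact definition, and the function name and input format are illustrative.

```python
from typing import Iterable, Tuple


def slo_attainment_rate(requests: Iterable[Tuple[float, float]]) -> float:
    """Fraction of requests that finished within their SLO.

    Each item is (observed_latency_s, slo_latency_s). Returns 0.0 for an
    empty input rather than dividing by zero.
    """
    pairs = list(requests)
    if not pairs:
        return 0.0
    met = sum(1 for latency, slo in pairs if latency <= slo)
    return met / len(pairs)
```

Computing the same function separately over image and video requests gives the per-modality rates discussed in this section.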

Figure 9: Number of SLO-satisfying requests versus SLO stringency; GENSERVE consistently leads, with the largest benefit at moderate to relaxed SLOs.

Figure 11: Overall SAR across workload mixes; GENSERVE maintains high SLO success as video proportion increases, outperforming all baselines.

Figure 13: SAR versus arrival rate. GENSERVE's advantage persists across system loads and is most pronounced at 18 to 36 req/min.

Latency and Queueing

GENSERVE eliminates the high-percentile queuing delay that image jobs suffer from head-of-line blocking under FIFO-style policies, reducing median and tail image latencies by 3x or more in many settings. For videos, dynamic SP lowers median completion latency, while tail latency grows: the scheduler deliberately delays videos that have slack so that more requests overall meet their SLOs.

Figure 6: CDF of request turnaround latency; GENSERVE sharply reduces image tails and accelerates video median latency.

Ablation Study

GENSERVE’s layered architecture was ablated stepwise, revealing:

Preemption is essential for image SLO attainment, but preemption alone, without coordinated scheduling, sharply reduces video SLO attainment.
The DP scheduler is indispensable for balanced global SLO maximization—it contributes SAR improvements on par with preemption itself.
Dynamic SP switching provides further incremental SLO and latency benefits, endowing the system with the ability to accelerate lagging requests instead of relying solely on preemption.
Figure 2: Ablation study: SLO attainment rises progressively as preemption, DP scheduling, and dynamic SP are introduced.

Architecture Sensitivity

Replicated co-serving avoids fragmentation endemic to static partitioning (dedicated image/video GPU pools), dynamically utilizing all available resources as the workload mix shifts.
Figure 4: Replicated co-serving consistently achieves higher SAR than static partitioning across all workload mixes.
Execution cost, preemption, and SP reconfiguration overheads are negligible (<0.25% of step time), confirming the practicality of millisecond-level rescheduling and context switching.

Theoretical and Practical Implications

GENSERVE advances large-scale generative model serving by exploiting two architectural properties of diffusion models: predictable per-step cost and the ability to save state at step boundaries. By unifying step-aware preemption, elastic parallelism, and SLO-driven scheduling with workload-aware cost models, GENSERVE navigates the joint resource allocation space more efficiently than existing static or greedy policies. The approach is robust to varying workload mixes, request size distributions, and bursty arrivals, and solver overhead does not become a bottleneck as cluster size or request volume grows.

Pragmatically, this enables serving stacks to support latency-critical multi-modal (image/video) generation at high utilization, without over-provisioning or deleterious SLO trade-offs. The methods are compatible and potentially synergistic with orthogonal advances in patch-level caching, approximate computation, and multi-model routing. Future work may extend these insights to cross-modal systems and other iterative generative models.

Conclusion

GENSERVE demonstrates that step-level preemptibility and precise cost modeling in diffusion models unlock new opportunities for SLO-driven co-serving of heterogeneous generative workloads. The coordinated use of deadline-aware preemption, elastic SP allocation, and dynamic batching, optimized through a lightweight DP scheduler, produces substantial improvements in SLO attainment and latency, with minimal runtime overhead or architectural complexity. These findings have immediate applicability for large-scale deployment of text-to-image and text-to-video services and broader implications for efficient management of future multi-modal, iterative generative model workloads.
