UnifiedServe: GPU Orchestration for MLLMs
- UnifiedServe is a GPU-internal orchestration framework that logically decouples the scheduling of vision encoding and LLM inference in multi-stage MLLM pipelines while physically sharing GPU resources.
- It employs asynchronous, fine-grained scheduling over shared GPU resources to deliver up to 4.4× higher throughput and significant latency reductions.
- Its integration with FlashCodec via asynchronous IPC and concurrent kernel launches yields a non-blocking, scalable serving architecture.
UnifiedServe is a GPU-internal orchestration framework designed for high-efficiency, low-latency serving of Multi-Modal LLMs (MLLMs), with a core focus on the vision encoding and LLM inference stages of multi-stage pipelines. It logically decouples task scheduling across stages while physically multiplexing shared GPU resources, yielding substantial throughput and latency improvements (up to 4.4× higher throughput) over conventional monolithic or split serving approaches (Zhao et al., 19 Dec 2025).
1. System Context and Motivation
UnifiedServe operates within the canonical three-stage MLLM inference pipeline: (1) multimodal preprocessing (e.g., video/image decompression), (2) vision encoding (e.g., ViT-based embedding of visual inputs), and (3) LLM inference (prefill and autoregressive decode). In typical deployments, these stages either run monolithically on the same GPU, which causes severe stalls from inter-stage resource contention, or are split across separate GPUs, which fragments memory and compute, increases time-to-first-token (TTFT), under-utilizes hardware, and lowers peak throughput. UnifiedServe aims to eliminate inter-stage blocking without GPU partitioning, meeting tight service-level objectives (SLOs) for both time-between-tokens (TBT) and TTFT while maximizing overall GPU utilization (Zhao et al., 19 Dec 2025).
2. Architectural Design
UnifiedServe implements a three-worker, asynchronous, logically decoupled scheduling abstraction:
- vision_process worker: Executes multimodal preprocessing (with joint integration of FlashCodec for NVDEC-powered video decode), outputting patch-token embeddings into a local IPC buffer.
- encode-prefill worker: Alternates between vision encoding (chunked over patch pages) and LLM prefill on the encoded visual inputs, with both sub-stages running under hybrid or tensor parallelism to modulate per-iteration GPU demand.
- decode worker: Runs autoregressive decoding for token generation, reading the shared IPC KV-cache and batching output tokens back to clients.
All three operate on the same physical GPU cluster, using NVIDIA Multi-Process Service (MPS) to submit kernels concurrently so that hardware parallelism is preserved at the Streaming Multiprocessor (SM), memory, and cache levels. Data exchange relies on multi-page shared memory buffers, scatter/gather via NCCL, and CUDA IPC handle transfer, enabling simultaneous producer-consumer coupling across the stages (Zhao et al., 19 Dec 2025). A minimal process-level sketch follows the component table below.
| Component | Role | Resource Scope |
|---|---|---|
| vision_process | Video decode, patch emit | All GPUs (NVDEC, SMs) |
| encode-prefill | Vision encode + LLM prefill | SM, memory (shared) |
| decode | Autoregressive decode | Prioritized SM, memory |
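The process structure can be pictured with the minimal sketch below. The worker names follow the paper, but the queues, toy tensor work, and loop bodies are illustrative assumptions standing in for UnifiedServe's actual preprocessing, encode, prefill, and decode kernels; NVIDIA MPS is assumed to be running so that kernels from the three processes can overlap on the same GPU.

```python
# Illustrative three-worker layout on one GPU. Assumes NVIDIA MPS is active
# (e.g. `nvidia-cuda-mps-control -d`) so kernels from the three processes can
# share SMs concurrently. Toy tensor math stands in for the real stages.
import torch
import torch.multiprocessing as mp


def vision_process(patch_q, done):
    # Stage 1: decode video and emit patch-token embeddings (toy stand-in).
    torch.cuda.set_device(0)
    sent = []                                    # cautious: keep local refs
    for _ in range(8):                           # while consumers may map them
        patches = torch.randn(1024, 1280, device="cuda")
        sent.append(patches)
        patch_q.put(patches)                     # only a CUDA IPC handle ships
    patch_q.put(None)                            # end-of-stream marker
    done.wait()                                  # IPC producer outlives readers


def encode_prefill(patch_q, kv_q, done):
    # Stage 2: chunked vision encode + LLM prefill on the same GPU.
    torch.cuda.set_device(0)
    sent = []
    while (patches := patch_q.get()) is not None:
        vis_emb = torch.tanh(patches @ patches.T)        # stand-in "encode"
        kv_cache = vis_emb.mean(dim=0, keepdim=True)     # stand-in "prefill"
        sent.append(kv_cache)
        kv_q.put(kv_cache)
    kv_q.put(None)
    done.wait()                                  # outlive the decode consumer


def decode(kv_q, done):
    # Stage 3: prioritized autoregressive decode over the shared KV-cache.
    torch.cuda.set_device(0)
    while (kv := kv_q.get()) is not None:
        for _ in range(4):                       # stand-in decode iterations
            kv = torch.relu(kv * 0.9)
        print("decoded one chunk, norm =", kv.norm().item())
    done.set()                                   # release the producers


if __name__ == "__main__":
    mp.set_start_method("spawn")
    patch_q, kv_q, done = mp.Queue(), mp.Queue(), mp.Event()
    procs = [mp.Process(target=vision_process, args=(patch_q, done)),
             mp.Process(target=encode_prefill, args=(patch_q, kv_q, done)),
             mp.Process(target=decode, args=(kv_q, done))]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```

The point of the sketch is structural: each stage is its own process with its own scheduling loop (logical decoupling), yet all three bind to the same device and exchange CUDA tensors through IPC-backed queues (physical sharing).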
3. Scheduling Algorithms and Resource Allocation
UnifiedServe employs distinct, fine-grained GPU schedulers:
A. Stall-Free Video Decoding (via FlashCodec):
- Inputs are partitioned into groups of pictures (GOPs) and distributed across NVDEC engines on all GPUs.
- Threadpool-based GOP dispatch eliminates decoder straggler effects (sketched below); frame buffers are dynamically allocated, sharply bounding decode-stage memory growth.
- Achieves up to 9× speedup for long videos on 4×A100 systems (Zhao et al., 19 Dec 2025).
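A minimal sketch of this stall-free dispatch pattern is shown below. `decode_gop` is a hypothetical stand-in for an NVDEC-backed decode call, and the GOP sizes and engine count are illustrative, not FlashCodec's actual parameters.

```python
# Sketch of stall-free GOP dispatch in the spirit of FlashCodec.
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def decode_gop(gop_bytes, engine_id):
    # Hypothetical stand-in: a real implementation would hand `gop_bytes` to
    # NVDEC engine `engine_id` via a hardware video-decode API.
    time.sleep(0.01 * (len(gop_bytes) % 7))      # emulate uneven per-GOP cost
    return [f"frame<{engine_id}>"] * 4           # pretend each GOP holds 4 frames


def decode_video(gops, num_engines):
    """Fan GOPs out across engines and yield (gop_index, frames) as each one
    finishes, so a single slow GOP (a straggler) never stalls the rest."""
    with ThreadPoolExecutor(max_workers=num_engines) as pool:
        futures = {pool.submit(decode_gop, g, i % num_engines): i
                   for i, g in enumerate(gops)}
        for fut in as_completed(futures):        # completion order, not submit order
            yield futures[fut], fut.result()


if __name__ == "__main__":
    fake_gops = [b"\x00" * (100 + 37 * n) for n in range(16)]
    for idx, frames in decode_video(fake_gops, num_engines=5):
        print(f"GOP {idx} ready ({len(frames)} frames)")  # downstream can start now
```

Because results are consumed as they complete rather than in submission order, downstream vision encoding can begin on early GOPs while slower ones are still decoding, which is the behavior that bounds decode-stage memory growth and removes straggler stalls.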
B. Encode-Prefill Orchestration:
- Let $B_{\text{pre}}$ denote the prefill token budget and $B_{\text{enc}}$ the encode token budget; both values are selected to satisfy the decode-stage TBT SLO.
- The algorithm maintains in-flight sets for encode and prefill, greedily accumulating chunks until the $B_{\text{enc}}$ and $B_{\text{pre}}$ budgets are filled (see the scheduling sketch below), and pipelines multi-microbatch execution for encode and single-batch execution for prefill.
- Resource allocation per scheduling epoch is governed by
$$c_s \, r_s \;\le\; f_s, \qquad \sum_{s \in \{\text{enc},\,\text{pre},\,\text{dec}\}} f_s \;\le\; 1,$$
where $c_s$ is the per-token GPU consumption of stage $s$ (GPU-seconds of compute per token, normalized to a single GPU's capacity), $r_s$ is the tokens served per second by stage $s$, and $f_s$ is the fractional resource allocation of stage $s$. By dynamic tuning of $f_s$, latency-critical decode is prioritized, while encode/prefill opportunistically borrow residual compute and memory headroom (Zhao et al., 19 Dec 2025).
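For illustration only, with hypothetical numbers rather than measured values: if decode must sustain $r_{\text{dec}} = 2000$ tokens/s at $c_{\text{dec}} = 2.5 \times 10^{-4}$ GPU-seconds per token, the constraint forces $f_{\text{dec}} \ge c_{\text{dec}}\, r_{\text{dec}} = 0.5$, leaving $f_{\text{enc}} + f_{\text{pre}} \le 0.5$ of the GPU for encode and prefill; the budgets $B_{\text{enc}}$ and $B_{\text{pre}}$ are then sized so that $c_{\text{enc}}\, r_{\text{enc}} \le f_{\text{enc}}$ and $c_{\text{pre}}\, r_{\text{pre}} \le f_{\text{pre}}$ hold within each epoch.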
Decode prioritization ensures that TBT SLOs are maintained under bursty, heterogeneous traffic, while otherwise idle SM cycles are aggressively exploited for vision encoding and prefill.
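The batch-formation step can be sketched as below. The `Request` fields, budget values, and greedy loop are illustrative assumptions rather than UnifiedServe's actual data structures; the point is how $B_{\text{enc}}$ and $B_{\text{pre}}$ cap the work admitted per iteration so the decode TBT SLO is protected.

```python
# Minimal sketch of budget-driven encode/prefill batch formation. b_enc and
# b_pre play the role of the encode/prefill token budgets; the Request fields
# and the greedy loop are illustrative stand-ins.
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    rid: int
    encode_tokens_left: int      # patch tokens still to be vision-encoded
    prefill_tokens_left: int     # prompt tokens still to be prefilled


def form_batches(pending: deque, b_enc: int, b_pre: int):
    """Greedily accumulate chunks until the encode and prefill budgets fill."""
    encode_batch, prefill_batch = [], []
    enc_left, pre_left = b_enc, b_pre
    for req in list(pending):
        if req.encode_tokens_left > 0:
            if enc_left == 0:
                continue                          # encode budget already spent
            chunk = min(req.encode_tokens_left, enc_left)
            encode_batch.append((req.rid, chunk))
            req.encode_tokens_left -= chunk
            enc_left -= chunk
        elif req.prefill_tokens_left > 0 and pre_left > 0:
            # Requests enter prefill only after their visual encode finishes.
            chunk = min(req.prefill_tokens_left, pre_left)
            prefill_batch.append((req.rid, chunk))
            req.prefill_tokens_left -= chunk
            pre_left -= chunk
        if enc_left == 0 and pre_left == 0:
            break                                 # both budgets filled this epoch
    return encode_batch, prefill_batch


if __name__ == "__main__":
    queue = deque([Request(0, 4096, 512), Request(1, 0, 2048), Request(2, 8192, 1024)])
    enc, pre = form_batches(queue, b_enc=2048, b_pre=1024)
    print("encode chunks:", enc)    # [(0, 2048)]
    print("prefill chunks:", pre)   # [(1, 1024)]
```

Running this loop once per scheduling iteration, with budgets tuned to the decode SLO as in the constraint above, yields the stall-free, budgeted interleaving of encode and prefill described in the paper.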
4. Integration with FlashCodec and End-to-End Pipeline
UnifiedServe is designed as the second component of a unified MLLM inference stack, working in concert with FlashCodec, the first-stage collaborative GPU video decode engine:
- FlashCodec apportions incoming video across all GPUs’ NVDEC engines using a fine-grained stall-free scheduler and emits patch-token sequences.
- Patch-token pages are handed to the vision encoder via a shared IPC buffer organized with collective write queues and page-table indirection (a sketch of this indirection appears at the end of this section).
- UnifiedServe orchestrates encode, prefill, and decode as detailed above, reading from/writing to distributed shared memory spaces, and avoiding bottlenecks by logical decoupling.
- All compute and memory resources on each GPU are accessible to any stage as needed, subject to scheduler-imposed constraints on SLO and batch formation (Zhao et al., 19 Dec 2025).
This integration yields a non-blocking pipeline from frame decode through token generation, with each process advancing independently yet maintaining full physical multiplexing across GPUs.
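The producer side of that handoff can be sketched as follows: the preprocessing worker scatters patch pages into a pooled buffer and publishes (request_id, page_index, valid_tokens) records on a write queue that the vision encoder drains. Pool layout, queue contents, and page size are assumptions; in the real system the pool lives in CUDA IPC shared memory rather than ordinary tensors.

```python
# Sketch of the producer side of the shared patch-token buffer.
import queue
import torch

PAGE_SIZE, HIDDEN = 256, 1280                      # assumed page geometry
page_pool = torch.zeros(1024, PAGE_SIZE, HIDDEN)   # stand-in for IPC memory
free_pages = list(range(1024))
write_queue = queue.Queue()                        # (rid, page, valid_tokens)


def publish_patches(rid: int, patches: torch.Tensor) -> None:
    """Producer: write patch tokens page by page and announce each page."""
    for start in range(0, patches.shape[0], PAGE_SIZE):
        chunk = patches[start:start + PAGE_SIZE]
        page = free_pages.pop()
        page_pool[page, :chunk.shape[0]] = chunk
        write_queue.put((rid, page, chunk.shape[0]))


def drain_pages(max_pages: int):
    """Consumer (encoder) side: pull whatever pages are ready, never blocking."""
    ready = []
    while len(ready) < max_pages and not write_queue.empty():
        ready.append(write_queue.get_nowait())
    return ready


publish_patches(rid=3, patches=torch.randn(600, HIDDEN))   # 600 tokens -> 3 pages
print(drain_pages(8))   # [(3, 1023, 256), (3, 1022, 256), (3, 1021, 88)]
```

Because the consumer drains only the pages that are already published, vision encoding can begin on a request's early pages while later pages are still being produced, which is what keeps the pipeline non-blocking end to end.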
5. Quantitative Performance and Comparative Analysis
Experimental deployments were conducted on Qwen2.5-VL-32B and InternVL3-38B MLLMs across the MLVU (long video), EgoSchema (short video), and VisionArena (image) datasets. The testbed comprised multi-A100 GPU clusters.
Key findings for UnifiedServe paired with FlashCodec (Zhao et al., 19 Dec 2025):
| Metric | UnifiedServe + FlashCodec | Monolithic | Split |
|---|---|---|---|
| Sustained request rate under SLO | higher than both baselines | Baseline | Baseline |
| Tightest SLO met | stricter than either baseline | Baseline | Baseline |
| Peak throughput (tokens/sec) | up to $4.4\times$ monolithic | Baseline | Baseline |
| Average TTFT (Qwen2.5-VL-32B, MLVU) | $1.2$ s | $1.1$ s | $6.0$ s |
| P99 TBT | $70$ ms | $450$ ms | $25$ ms |
Throughput and latency improvements are attributed to (1) minimization of inter-stage blocking and (2) elimination of device idleness via fine-grained intra-GPU resource sharing and scheduling. This suggests UnifiedServe's architectural strategy is near Pareto-optimal for both tail-latency and throughput across heterogeneous multimodal queries.
6. Implementation Details
UnifiedServe is implemented on top of Sarathi-Serve, with the following salient features:
- Shared Memory IPC: Patch tokens and KV-cache buffers are realized using CUDA IPC handles and virtual paging (arrays such as pv_indptr and pv_page_indices) to enable efficient, lazy materialization and transfer; a sketch of this paging scheme follows this list.
- Collective Communication: NCCL-based scatter/gather enables efficient movement of patches and embedding states across distributed processes.
- Kernel Concurrency: MPS permits overlapping kernel launches from vision_process, encode-prefill, and decode, preventing context-switch latency spikes.
- Full async video decode: Via FlashCodec’s threadpool/NVDEC engine multiplexing, the preprocessing step does not stall subsequent pipeline stages (Zhao et al., 19 Dec 2025).
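The virtual-paging metadata named above can be illustrated with the short sketch below. The exact semantics of pv_indptr and pv_page_indices in UnifiedServe are not spelled out here; the sketch assumes a CSR-style layout in which request $r$ owns the pages `pv_page_indices[pv_indptr[r] : pv_indptr[r + 1]]` of a shared page pool.

```python
# Sketch of CSR-style virtual paging over a shared page pool (assumed layout).
import torch

PAGE_SIZE, HIDDEN = 256, 1280
page_pool = torch.randn(64, PAGE_SIZE, HIDDEN)     # stand-in for IPC memory

# Two requests: request 0 owns pages [3, 7]; request 1 owns pages [1, 4, 9].
pv_indptr = torch.tensor([0, 2, 5])
pv_page_indices = torch.tensor([3, 7, 1, 4, 9])


def gather_request(rid: int) -> torch.Tensor:
    """Lazily materialize one request's patch tokens from the shared pool."""
    start, end = pv_indptr[rid].item(), pv_indptr[rid + 1].item()
    pages = pv_page_indices[start:end]
    return page_pool[pages].reshape(-1, HIDDEN)     # (num_pages * PAGE_SIZE, HIDDEN)


print(gather_request(0).shape)   # torch.Size([512, 1280])
print(gather_request(1).shape)   # torch.Size([768, 1280])
```

Indirection of this kind lets a consumer process materialize only the pages it actually needs, which is what makes lazy transfer across the CUDA IPC boundary cheap.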
UnifiedServe and FlashCodec together constitute a high-throughput, scalable, non-blocking architecture for multi-GPU, multi-stage MLLM inference, directly addressing the throughput/latency bottlenecks endemic to both monolithic and split designs.
7. Implications and Future Directions
UnifiedServe's design underscores a shift towards processor-centric, rather than stage-centric, MLLM inference scheduling. Physically shared, logically decoupled resource orchestration enables stricter SLO attainment and higher utilization, providing a blueprint for future MLLM and LLM inference frameworks in cloud and edge data centers.
A plausible implication is that similar logical/physical decoupling with aggressive resource sharing could be further generalized beyond vision–LLMs; for instance, multi-modal pipelines involving audio, graph-based inputs, or other non-text signals could benefit from this architecture. Additionally, continued innovations around IPC buffer management and token-level scheduling may yield further improvements in both resource utilization and tail latency under mixed-modality, high-variance traffic situations.
UnifiedServe represents a demonstrably effective methodology for jointly improving latency and throughput in high-concurrency, multi-GPU MLLM serving environments, with direct applicability to next-generation, tight-SLO, high-throughput inference stacks (Zhao et al., 19 Dec 2025).