
UnifiedServe: GPU Orchestration for MLLMs

Updated 26 December 2025
  • UnifiedServe is a GPU-internal orchestration framework that decouples multi-stage MLLM inference by logically scheduling vision encoding and LLM inference.
  • It employs asynchronous, fine-grained scheduling over shared GPU resources to deliver up to 4.4× higher throughput and significant latency reductions.
  • Its integration with FlashCodec via asynchronous IPC and concurrent kernel launches enables a non-blocking and scalable serving architecture.

UnifiedServe is a GPU-internal orchestration framework designed for high-efficiency, low-latency serving of Multi-Modal LLMs (MLLMs), with a core focus on the vision encoding and LLM inference stages in multi-stage pipelines. It achieves this by logically decoupling task scheduling across stages while physically multiplexing shared GPU resources, enabling substantial throughput and latency improvements over conventional monolithic or split serving approaches (Zhao et al., 19 Dec 2025).

1. System Context and Motivation

UnifiedServe operates within the canonical three-stage MLLM inference pipeline: (1) multimodal preprocessing (e.g., video/image decompression), (2) vision encoding (e.g., ViT-based embedding of visual inputs), and (3) LLM inference (prefill and autoregressive decode). In typical deployments, these stages either run monolithically on the same GPU, which causes severe stalls from cross-stage resource contention, or are split across separate GPUs, which fragments memory and compute, increases time-to-first-token (TTFT), under-utilizes hardware, and lowers peak throughput. UnifiedServe aims to eliminate inter-stage blocking without GPU partitioning, guaranteeing tight service-level objectives (SLOs) for both time-between-tokens (TBT) and TTFT while maximizing overall GPU utilization (Zhao et al., 19 Dec 2025).

2. Architectural Design

UnifiedServe implements a three-worker, asynchronous, logically decoupled scheduling abstraction:

  • vision_process worker: Executes multimodal preprocessing (with joint integration of FlashCodec for NVDEC-powered video decode), outputting patch-token embeddings into a local IPC buffer.
  • encode-prefill worker: Alternates between vision encoding (chunked over patch pages) and LLM prefill on the encoded visual inputs, with both sub-stages running under hybrid or tensor parallelism to modulate per-iteration GPU demand.
  • decode worker: Runs autoregressive decoding for token generation, interfacing with the shared IPC KV-cache and batching output responses back to clients (a process-level sketch follows the component table below).

All three operate on the same physical GPU cluster using NVIDIA Multi-Process Service (MPS) to submit kernels concurrently, ensuring hardware parallelism at the Streaming Multiprocessor (SM), memory, and cache levels. Data exchanges leverage multi-page shared memory buffers, scatter/gather via NCCL, and CUDA IPC handle transfer, enabling simultaneous producer-consumer coupling across the stages (Zhao et al., 19 Dec 2025).

Component | Role | Resource Scope
--- | --- | ---
vision_process | Video decode, patch emit | All GPUs (NVDEC, SMs)
encode-prefill | Vision encode + LLM prefill | SM, memory (shared)
decode | Autoregressive decode | Prioritized SM, memory
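
A minimal single-GPU sketch of this three-worker layout is given below, using torch.multiprocessing queues as a stand-in for the CUDA IPC buffers. It assumes NVIDIA MPS is already running so that kernels from all three processes can overlap, and the tensor operations are placeholders rather than the actual encode, prefill, and decode kernels.

```python
# Minimal single-GPU sketch (not the UnifiedServe implementation) of three
# worker processes exchanging data through CUDA IPC. torch.multiprocessing
# sends CUDA tensors between processes as IPC handles; with NVIDIA MPS running,
# kernels from all three processes can execute concurrently on the device.
import torch
import torch.multiprocessing as mp


def vision_process(patch_q):
    # Stand-in for preprocessing/decode: emit patch-token pages.
    for _ in range(4):
        patches = torch.randn(256, 1024, device="cuda")  # placeholder patches
        patch_q.put(patches)                             # sent as a CUDA IPC handle
    patch_q.put(None)                                    # end-of-stream marker


def encode_prefill(patch_q, kv_q):
    # Stand-in for chunked vision encode + LLM prefill.
    while (patches := patch_q.get()) is not None:
        kv = patches @ patches.T                         # placeholder kernels
        kv_q.put(kv)
    kv_q.put(None)


def decode(kv_q):
    # Stand-in for autoregressive decode reading the shared KV-cache pages.
    while (kv := kv_q.get()) is not None:
        _ = kv.sum()


if __name__ == "__main__":
    mp.set_start_method("spawn")  # required for CUDA tensors across processes
    patch_q, kv_q = mp.Queue(), mp.Queue()
    workers = [
        mp.Process(target=vision_process, args=(patch_q,)),
        mp.Process(target=encode_prefill, args=(patch_q, kv_q)),
        mp.Process(target=decode, args=(kv_q,)),
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
```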

3. Scheduling Algorithms and Resource Allocation

UnifiedServe employs distinct, fine-grained GPU schedulers:

A. Stall-Free Video Decoding (via FlashCodec):

  • Input videos are partitioned into GOPs (groups of pictures) and distributed across the NVDEC engines of all GPUs.
  • Threadpool-based GOP dispatch eliminates decoder straggler effects; frame buffers are dynamically allocated, sharply bounding decode-stage memory growth (a minimal dispatch sketch follows this list).
  • Achieves up to 9× speedup for long videos on 4×A100 systems (Zhao et al., 19 Dec 2025).
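
The stall-free GOP dispatch described above can be sketched as a threadpool that spreads independent GOPs over every NVDEC engine in the system. The decode_gop helper below is a placeholder with simulated latency standing in for a real NVDEC call (e.g., through a library such as PyNvVideoCodec); the round-robin GOP-to-engine assignment and thread counts are assumptions, not FlashCodec's actual scheduler.

```python
# Sketch of stall-free GOP dispatch in the spirit of FlashCodec.
import time
from concurrent.futures import ThreadPoolExecutor
from itertools import cycle


def decode_gop(gop, gpu, engine):
    # Placeholder: simulate variable per-GOP decode latency on one NVDEC engine.
    time.sleep(0.001 * len(gop))
    return [f"gpu{gpu}-eng{engine}-frame{i}" for i in gop]


def decode_video(gops, gpu_ids, engines_per_gpu=2):
    """Spread independent GOPs over every NVDEC engine of every GPU."""
    slots = [(g, e) for g in gpu_ids for e in range(engines_per_gpu)]
    with ThreadPoolExecutor(max_workers=len(slots)) as pool:
        futures = [
            pool.submit(decode_gop, gop, gpu, engine)
            for gop, (gpu, engine) in zip(gops, cycle(slots))
        ]
        # Each GOP completes independently, so one slow GOP never stalls the
        # frames already decoded on other engines (no straggler effect).
        decoded = [f.result() for f in futures]
    return [frame for gop_frames in decoded for frame in gop_frames]


frames = decode_video(gops=[range(30)] * 16, gpu_ids=[0, 1, 2, 3])
```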

B. Encode-Prefill Orchestration:

  • Let $\tau$ denote the prefill token budget and $\alpha$ the encode token budget; both are selected so that the decode-stage TBT SLO is satisfied.
  • The algorithm maintains in-flight sets for encode and prefill, greedily accumulating chunks to fill the batch budgets, and pipelines multi-microbatch execution for encode and single-batch execution for prefill (a minimal batch-formation sketch follows the formula below).
  • Mathematical resource allocation per epoch is governed by:

$$\max \sum_{i} R_i \qquad \text{s.t.} \quad \sum_{i} \alpha_i G_i \leq G_\text{total}, \quad \text{Latency}_i(\alpha_i) \leq \text{SLO}_i \quad \forall i$$

where $G_i$ is the per-token GPU consumption of stage $i \in \{\text{encode}, \text{prefill}, \text{decode}\}$, $R_i$ is tokens served per second, and $\alpha_i$ is the fractional resource allocation. By dynamically tuning $\alpha$, latency-critical decode is prioritized, while encode and prefill opportunistically borrow residual compute and memory headroom (Zhao et al., 19 Dec 2025).
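
The sketch below illustrates this greedy, budget-driven batch formation. The Chunk class, the queues, and form_batch are hypothetical structures, and the budgets tau and alpha are assumed to have been tuned offline so that one encode-plus-prefill iteration fits within the decode-stage TBT SLO.

```python
# Illustrative batch formation under per-iteration token budgets.
from collections import deque
from dataclasses import dataclass


@dataclass
class Chunk:
    request_id: int
    num_tokens: int


def form_batch(pending: deque, token_budget: int) -> list:
    """Greedily pull chunks until the per-iteration token budget is filled."""
    batch, used = [], 0
    while pending and used + pending[0].num_tokens <= token_budget:
        chunk = pending.popleft()
        batch.append(chunk)
        used += chunk.num_tokens
    return batch


def schedule_iteration(decode_reqs, encode_q, prefill_q, tau, alpha):
    # Decode is admitted unconditionally so its TBT SLO is never at risk;
    # encode and prefill only fill the remaining per-iteration budgets.
    decode_batch = list(decode_reqs)            # one new token per in-flight request
    encode_batch = form_batch(encode_q, alpha)  # chunked vision encoding
    prefill_batch = form_batch(prefill_q, tau)  # chunked LLM prefill
    return decode_batch, encode_batch, prefill_batch


# Toy usage with two in-flight decode requests and four pending chunks per queue.
encode_q = deque(Chunk(i, 512) for i in range(4))
prefill_q = deque(Chunk(i, 1024) for i in range(4))
batches = schedule_iteration(decode_reqs=[7, 8], encode_q=encode_q,
                             prefill_q=prefill_q, tau=2048, alpha=1024)
```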

Decode prioritization ensures that TBT SLOs are maintained under bursty, heterogeneous traffic, while otherwise idle SM cycles are aggressively exploited for vision encoding and prefill.

4. Integration with FlashCodec and End-to-End Pipeline

UnifiedServe is designed as the second component of a unified MLLM inference stack, working in concert with FlashCodec, the first-stage collaborative GPU video decode engine:

  • FlashCodec apportions incoming video across all GPUs’ NVDEC engines using a fine-grained stall-free scheduler and emits patch-token sequences.
  • Patch-token pages are handed to the vision encoder via a shared IPC buffer organized with collective write queues and page-table indirection.
  • UnifiedServe orchestrates encode, prefill, and decode as detailed above, reading from/writing to distributed shared memory spaces, and avoiding bottlenecks by logical decoupling.
  • All compute and memory resources on each GPU are accessible to any stage as needed, subject to scheduler-imposed constraints on SLO and batch formation (Zhao et al., 19 Dec 2025).

This integration yields a non-blocking pipeline from frame decode through token generation, with each process advancing independently yet maintaining full physical multiplexing across GPUs.
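
The patch-page handoff can be pictured as a pre-allocated page pool plus a per-request page table and write queue, as in the single-process sketch below. The page size, pool layout, and class names are illustrative assumptions; in the real system the pool lives in CUDA-IPC-shared memory visible to every worker process.

```python
# Single-process stand-in for the shared patch-page buffer: a fixed pool of
# GPU pages, a per-request page table, and a write queue the encoder drains.
import queue
import torch

PAGE_TOKENS, HIDDEN = 256, 1024


class PatchPagePool:
    def __init__(self, num_pages: int, device: str = "cuda"):
        self.pages = torch.empty(num_pages, PAGE_TOKENS, HIDDEN, device=device)
        self.free = list(range(num_pages))
        self.page_table = {}              # request_id -> [page indices]
        self.write_queue = queue.Queue()  # (request_id, page) events for the encoder

    def write_page(self, request_id: int, patch_tokens: torch.Tensor) -> None:
        page = self.free.pop()                         # grab a free physical page
        self.pages[page, : patch_tokens.shape[0]] = patch_tokens
        self.page_table.setdefault(request_id, []).append(page)
        self.write_queue.put((request_id, page))       # notify the consumer

    def read_request(self, request_id: int) -> torch.Tensor:
        # Page-table indirection: gather this request's pages in emission order.
        return self.pages[self.page_table[request_id]]
```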

5. Quantitative Performance and Comparative Analysis

Experimental deployments were conducted on Qwen2.5-VL-32B and InternVL3-38B MLLMs across the MLVU (long video), EgoSchema (short video), and VisionArena (image) datasets. The testbed comprised multi-A100 GPU clusters.

Key findings for UnifiedServe paired with FlashCodec (Zhao et al., 19 Dec 2025):

Metric | UnifiedServe + FlashCodec | Monolithic | Split
--- | --- | --- | ---
Sustained requests per unit time | up to 3.0× | 1× | 1×
Tightest SLO met | 1.5× stricter | Baseline | Baseline
Peak throughput (tokens/sec) | 4.4× monolithic | 1× | 1.9×
Average TTFT (Qwen2.5-VL-32B, MLVU) | 1.2 s | 1.1 s | 6.0 s
P99 TBT | 70 ms | 450 ms | 25 ms

Throughput and latency improvements are attributed to (1) minimization of inter-stage blocking and (2) elimination of device idleness via fine-grained intra-GPU resource sharing and scheduling. This suggests UnifiedServe's architectural strategy is near Pareto-optimal for both tail-latency and throughput across heterogeneous multimodal queries.

6. Implementation Details

UnifiedServe is implemented on top of Sarathi-Serve, with the following salient features:

  • Shared Memory IPC: Patch tokens and KV-Cache buffers are realized using CUDA IPC handles and virtual paging (arrays of pv_indptr, pv_page_indices, etc.) to enable efficient, lazy materialization and transfer.
  • Collective Communication: NCCL-based scatter/gather enables efficient movement of patches and embedding states across distributed processes.
  • Kernel Concurrency: MPS permits overlapping kernel launches from vision_process, encode-prefill, and decode, preventing context-switch latency spikes.
  • Full async video decode: Via FlashCodec’s threadpool/NVDEC engine multiplexing, the preprocessing step does not stall subsequent pipeline stages (Zhao et al., 19 Dec 2025).
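
A plausible reading of the pv_indptr and pv_page_indices arrays is a CSR-style layout in which request i owns the physical pages pv_page_indices[pv_indptr[i] : pv_indptr[i + 1]]. The exact semantics in UnifiedServe may differ, but the sketch below shows how such indirection allows lazy, copy-free materialization of one request's pages from the shared pool.

```python
# CSR-style page lookup assumed for pv_indptr / pv_page_indices; the exact
# UnifiedServe layout may differ. This shows the common paged-buffer pattern.
import torch


def gather_request_pages(pages, pv_indptr, pv_page_indices, request_idx):
    """Lazily materialize one request's patch/KV pages from the shared pool."""
    start = int(pv_indptr[request_idx])
    end = int(pv_indptr[request_idx + 1])
    return pages[pv_page_indices[start:end]]  # gather only the pages this request owns


# Toy usage: a pool of 8 pages, each 4 tokens x 2 hidden dims, and two requests.
pages = torch.arange(8 * 4 * 2, dtype=torch.float32).reshape(8, 4, 2)
pv_indptr = torch.tensor([0, 3, 5])              # request 0 -> 3 pages, request 1 -> 2 pages
pv_page_indices = torch.tensor([0, 2, 5, 1, 7])  # physical page ids, in emission order
req0_pages = gather_request_pages(pages, pv_indptr, pv_page_indices, 0)  # shape (3, 4, 2)
```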

UnifiedServe and FlashCodec together constitute a high-throughput, scalable, non-blocking architecture for multi-GPU, multi-stage MLLM inference, directly addressing the throughput and latency bottlenecks endemic to both monolithic and split designs.

7. Implications and Future Directions

UnifiedServe's design underscores a shift towards processor-centric, rather than stage-centric, MLLM inference scheduling. Physically shared, logically decoupled resource orchestration enables stricter SLO attainment and higher utilization, providing a blueprint for future MLLM and LLM inference frameworks in cloud and edge data centers.

A plausible implication is that similar logical/physical decoupling with aggressive resource sharing could be further generalized beyond vision–LLMs; for instance, multi-modal pipelines involving audio, graph-based inputs, or other non-text signals could benefit from this architecture. Additionally, continued innovations around IPC buffer management and token-level scheduling may yield further improvements in both resource utilization and tail latency under mixed-modality, high-variance traffic situations.

UnifiedServe represents a demonstrably effective methodology for jointly optimizing latency and throughput in high-concurrency, multi-GPU MLLM serving environments, with direct applicability to next-generation, tight-SLO, high-throughput inference stacks (Zhao et al., 19 Dec 2025).
