
MixServe: Distributed LLM Serving System

Updated 20 January 2026
  • MixServe is a distributed serving framework for multi-model, multi-precision LLM deployments that efficiently manages multi-GPU and multi-node clusters.
  • It employs a principled α–β network cost model and fused communication strategies to optimize tensor and expert parallelism under tight SLO constraints.
  • The system integrates offline profiling, precision-aware KV slab allocators, and adaptive scheduling to achieve high throughput and low latency in heterogeneous and bandwidth-constrained environments.

MixServe refers to a family of distributed serving systems and methodologies for multi-model, multi-precision, and multi-tenant deployments of LLMs and Mixture-of-Experts (MoE) architectures. MixServe systems target efficient utilization of multi-GPU and multi-node clusters in heterogeneous and bandwidth-constrained environments. They combine automatic parallelism configuration, resource-aware scheduling, and, in some variants, fine-grained memory and compute multiplexing to achieve low latency, high throughput, and tight Service-Level Objective (SLO) compliance under varying workloads and model mixes.

1. Background and Motivation

Rapid advances in LLMs, particularly Mixture-of-Experts models, have driven the deployment of trillion-parameter inference workloads across distributed clusters. Individual models typically cannot fit on a single device; deployment necessitates multi-GPU or multi-node sharding. The dominant parallelism strategies are Tensor Parallelism (TP), based on All-Reduce (AR) collectives for sharding dense layers, and Expert Parallelism (EP), based on All-to-All (A2A) collectives to route tokens among distributed experts. TP is restricted by inter-node bandwidth limitations, while EP suffers from load imbalance at high parallel degrees (Zhou et al., 13 Jan 2026).

Traditional systems lack principled cost models for optimal selection of parallel schemes in the presence of complex bandwidth hierarchies and dynamic workloads. Furthermore, the rise of quantized models, mixed-precision deployment, and the co-location of model serving and ongoing retraining create additional challenges for memory management, job scheduling, and statistical multiplexing (Bin et al., 8 Sep 2025, Li et al., 28 Jul 2025, Li et al., 2023).

2. Communication Cost Modeling and Hybrid Parallelism

MixServe's core innovation is the principled modeling of communication costs and hybrid parallelism. The analytical model employs the classic "α–β" network cost formulation, where α is the per-message latency and β is the per-byte transfer cost. For p ranks and tensor size s:

  • All-Reduce (AR) time:

T_{AR}(p,s) = 2\alpha\log_2 p + 2\beta\frac{p-1}{p}\,s

  • All-to-All (A2A) time:

T_{A2A}(p,s) = \alpha(p-1) + \beta\frac{p-1}{p}\,s

Intra-node α and β are much smaller than inter-node quantities, reflecting topology-aware performance (Zhou et al., 13 Jan 2026).
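The two cost formulas above can be sketched directly; the profiled constants below are purely illustrative assumptions (real values come from MixServe's offline profiling):

```python
import math

def t_allreduce(p: int, s: float, alpha: float, beta: float) -> float:
    """All-Reduce time: 2*alpha*log2(p) + 2*beta*(p-1)/p * s."""
    return 2 * alpha * math.log2(p) + 2 * beta * (p - 1) / p * s

def t_alltoall(p: int, s: float, alpha: float, beta: float) -> float:
    """All-to-All time: alpha*(p-1) + beta*(p-1)/p * s."""
    return alpha * (p - 1) + beta * (p - 1) / p * s

# Hypothetical profiled constants: intra-node links are far cheaper.
INTRA = dict(alpha=2e-6, beta=1e-11)   # e.g. NVLink-class interconnect
INTER = dict(alpha=2e-5, beta=5e-10)   # e.g. RDMA-class network

s = 64 * 1024 * 1024  # 64 MiB message
print(t_allreduce(8, s, **INTRA))   # intra-node TP all-reduce
print(t_alltoall(4, s, **INTER))    # inter-node EP all-to-all
```

Plugging intra-node constants into AR and inter-node constants into A2A mirrors the topology-aware placement the model is used for.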

MixServe automatically selects feasible (TP, EP, DP) degrees that satisfy memory constraints and minimize a composite metric (e.g., steady-state inter-token latency or time-to-first-token) based on predicted computation, communication, and queueing delays. The optimal configuration is determined by exhaustive search over possible allocations, using offline profiling of model hyperparameters, device specifications, and networking characteristics.
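The exhaustive search can be sketched as follows; the simplified memory model and toy latency callback are assumptions for illustration (the real system plugs in profiled computation, communication, and queueing predictions):

```python
from itertools import product

def search_config(num_gpus, mem_per_gpu, model_mem, predict_latency):
    """Enumerate (TP, EP, DP) triples whose product equals the GPU count,
    keep those that fit in memory, and return the lowest-latency one.
    `predict_latency` is a cost callback built from offline profiling."""
    best, best_cost = None, float("inf")
    degrees = [d for d in range(1, num_gpus + 1) if num_gpus % d == 0]
    for tp, ep in product(degrees, degrees):
        if num_gpus % (tp * ep) != 0:
            continue
        dp = num_gpus // (tp * ep)
        # Memory feasibility: weights shard across the tp*ep model ranks
        # (simplified; ignores KV cache and activation memory).
        if model_mem / (tp * ep) > mem_per_gpu:
            continue
        cost = predict_latency(tp, ep, dp)
        if cost < best_cost:
            best, best_cost = (tp, ep, dp), cost
    return best, best_cost

# Toy cost: DP improves queueing, EP adds A2A cost, TP is cheap intra-node.
toy = lambda tp, ep, dp: 1.0 / dp + 0.1 * ep + (0.05 * tp if tp <= 8 else tp)
cfg, cost = search_config(16, mem_per_gpu=80, model_mem=640, predict_latency=toy)
print(cfg)  # best feasible (TP, EP, DP)
```

The search space is small in practice (divisor triples of the GPU count), so exhaustive enumeration is cheap relative to profiling.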

Hybrid parallelism is instantiated by fusing TP's AR (intra-node) and EP's A2A (inter-node) collectives. This fusion enables overlapped execution, effectively hiding the slower global communications behind faster local reductions:

T_{\mathrm{fused}} = \max\{T_{AR}^{\mathrm{intra}}(p_{TP}, s_{AR}),\ T_{A2A}^{\mathrm{inter}}(p_{EP}, s_{A2A})\} + \delta

with \delta denoting negligible orchestration overhead.
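A small worked instantiation of the fused-time formula, with hypothetical per-layer timings, shows what overlapping buys relative to running the two collectives back to back:

```python
def fused_time(t_ar_intra: float, t_a2a_inter: float, delta: float = 0.0) -> float:
    """Fused AR/A2A step time: overlapped collectives cost the slower of
    the two, plus a small orchestration overhead delta."""
    return max(t_ar_intra, t_a2a_inter) + delta

# Hypothetical per-layer timings in seconds (not measured values).
t_ar, t_a2a, delta = 0.8e-3, 1.5e-3, 0.05e-3
sequential = t_ar + t_a2a                    # unfused: AR then A2A
fused = fused_time(t_ar, t_a2a, delta)       # fused: AR hidden behind A2A
print(f"sequential={sequential*1e3:.2f} ms, fused={fused*1e3:.2f} ms")
```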

3. System Architecture and Implementation

MixServe systems are composed of an offline analyzer and an online serving runtime:

  • Offline Analyzer: Profiles short prompts and hardware specifications, then constructs cost models to recommend optimal TP, EP, and DP degrees for the target deployment. It leverages the α–β modeling and empirical FLOPs/memory measurements.
  • Runtime Serving: Acts as a wrapper over general-purpose LLM inference frameworks (e.g., vLLM, Tutel). It handles the loading of parameter shards, initialization of group-based communicators, and insertion of fused collective calls at each forward pass. Main computation resides in the main process, while communication is advanced by preposted threads using collective libraries such as NCCL or MPI (Zhou et al., 13 Jan 2026).

Buffer management seeks to eliminate allocation overhead at runtime, with pre-allocation of Reduce-Scatter (RS) and A2A fusion buffers. Asynchronous collectives are handled via background progress threads for maximal overlap.
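The preposted-progress-thread pattern can be sketched generically; the queue-based structure and simulated collective below are illustrative assumptions, not MixServe's actual NCCL/MPI-backed implementation:

```python
import threading
import queue

class ProgressThread:
    """Background thread that drives preposted communication work while the
    main thread keeps computing (hypothetical minimal structure)."""
    def __init__(self):
        self._work = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        while True:
            fn, done = self._work.get()
            if fn is None:          # shutdown sentinel
                break
            fn()                    # e.g. progress an async collective
            done.set()

    def post(self, fn):
        """Prepost a communication op; returns an event to wait on."""
        done = threading.Event()
        self._work.put((fn, done))
        return done

    def shutdown(self):
        self._work.put((None, None))
        self._thread.join()

# Usage: overlap a (simulated) collective with local computation.
pt = ProgressThread()
result = []
handle = pt.post(lambda: result.append("a2a-done"))  # stand-in for an A2A
local = sum(i * i for i in range(10_000))            # overlapped compute
handle.wait()                                        # sync before using output
pt.shutdown()
```

The main process never blocks on communication until it actually needs the collective's result, which is the overlap the fused schedule relies on.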

For deployments requiring mixed-precision or heterogeneous quantization, additional architectural elements such as precision-aware slab allocation and token-block scheduling are introduced (Bin et al., 8 Sep 2025).

4. Scheduling Algorithms and Adaptive Strategy Selection

MixServe integrates multi-level scheduling and resource management, depending on the deployment context:

  • Auto-parallelization and Placement: Inspired by AlpaServe, a two-phase (auto-parallelization + greedy placement) algorithm computes per-model and per-group strategies minimizing communication and staging imbalance, using domain-specific dynamic programming/ILP or heuristic enumeration. A fast trace-driven simulator estimates SLO attainment given model mixes and arrival traces, driving the selection of candidate placements (Li et al., 2023).
  • Fused Communication Overlap: The fused AR–A2A algorithm decomposes MoE layers into combined intra-node and inter-node phases, allowing dispatch/combination steps to overlap and amortize the slowest operation.
  • Two-Level Scheduling (for Mixed Precision/Quantized Models): A global scheduler assigns models to GPUs or TP-groups based on memory and SLO constraints, while local per-model schedulers dynamically adjust batch size and admission using deadline-driven heuristics based on earliest deadline first (EDF) and the Moore–Hodgson algorithm (Bin et al., 8 Sep 2025).

A typical runtime workflow first drops infeasible requests (that cannot meet SLO even if run alone), then maximizes parallel batching under the tightest deadline in the batch, ensuring no TTFT (time to first token) violation. Additional queue-level reprioritization addresses cross-model back-pressure and resource contention.
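The Moore–Hodgson step underlying this workflow can be sketched as follows; the (processing_time, deadline) request representation is an assumption. The algorithm keeps the largest set of requests that all meet their deadlines in EDF order, evicting the longest-running request whenever the batch would miss one:

```python
import heapq

def moore_hodgson(jobs):
    """Moore-Hodgson: maximize on-time jobs on one server.
    jobs: list of (processing_time, deadline). Returns the on-time set
    in EDF order; everything else would be dropped or rescheduled."""
    scheduled = []      # max-heap of processing times (negated)
    elapsed = 0
    kept = []
    for proc, due in sorted(jobs, key=lambda j: j[1]):   # EDF order
        heapq.heappush(scheduled, -proc)
        kept.append((proc, due))
        elapsed += proc
        if elapsed > due:                    # this deadline would be missed:
            worst = -heapq.heappop(scheduled)  # evict the longest job so far
            elapsed -= worst
            for i, (p, _) in enumerate(kept):  # drop one job of that length
                if p == worst:
                    del kept[i]
                    break
    return kept

# Toy requests: (prefill time, TTFT deadline) in arbitrary time units.
reqs = [(2, 3), (3, 5), (2, 6), (4, 7)]
print(moore_hodgson(reqs))  # -> [(2, 3), (2, 6)]
```

Here no subset of three requests can all finish on time, so the scheduler admits two and would reject or requeue the rest before they consume GPU time.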

5. Memory Management and Precision-Aware Allocation

To address severe memory fragmentation under mixed-precision workloads, MixServe incorporates a precision-aware KV slab allocator (Bin et al., 8 Sep 2025). This design pre-allocates a large contiguous KV tensor for each device, divides it into fixed-size slabs keyed by unique block sizes, and supports O(1) allocation and release of per-model blocks without GPU-side remapping. The slab size, S, is set to a small multiple of the least common multiple of all block sizes assigned to co-located models:

S = L \cdot \mathrm{lcm}(B_1, B_2, \ldots, B_k), \qquad L \in \mathbb{Z}^+

Fragmentation (both internal and external) is driven to zero for each precision, substantially improving effective memory utilization at high token loads and tight SLOs.
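The LCM-based slab sizing and O(1) free-list allocation can be sketched as follows; the class name, byte-offset interface, and block sizes are illustrative assumptions. Because every block size divides the slab size exactly, each slab tiles perfectly for whichever precision claims it:

```python
from math import lcm

class KVSlabAllocator:
    """Sketch of a precision-aware slab allocator (assumed interface).
    One contiguous region is carved into slabs sized at a multiple of the
    LCM of all co-located models' KV block sizes."""
    def __init__(self, total_bytes, block_sizes, L=1):
        self.slab_size = L * lcm(*block_sizes)
        self.free_slabs = list(range(total_bytes // self.slab_size))
        self.free_blocks = {b: [] for b in block_sizes}  # per-size free lists

    def alloc_block(self, block_size):
        """O(1) amortized: pop a free block, carving a new slab on demand."""
        if not self.free_blocks[block_size]:
            slab = self.free_slabs.pop()          # dedicate a slab to this size
            base = slab * self.slab_size
            self.free_blocks[block_size].extend(
                base + off for off in range(0, self.slab_size, block_size))
        return self.free_blocks[block_size].pop() # byte offset of the block

    def free_block(self, block_size, offset):
        self.free_blocks[block_size].append(offset)  # O(1) release

# Two co-located precisions with hypothetical KV block sizes of 48 and 32 bytes.
alloc = KVSlabAllocator(total_bytes=4096, block_sizes=(48, 32), L=1)
print(alloc.slab_size)          # lcm(48, 32) = 96
a = alloc.alloc_block(48)
b = alloc.alloc_block(32)
```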

Local resource managers share weight tensors in read-only mode, enable head-wise allocation of KV-cache via global pools, and partition GPU DRAM for efficient sharing and isolation.

6. Experimental Results and Comparative Evaluation

Evaluations demonstrate that MixServe achieves substantial performance benefits:

| Cluster / Model        | TTFT Speedup      | ITL Speedup | Throughput / Resource Gain |
|------------------------|-------------------|-------------|----------------------------|
| DeepSeek-R1, Ascend    | 2.67× (vs TP+PP)  | 1.42×       | +22%                       |
| Qwen3, H20             | up to 1.23×       |             | +43.5%                     |
| AlpaServe-style        | up to 10× λ @ 99% |             | 2.3× fewer GPUs            |
| FineServe, Mixed Prec  |                   |             | 1.8×                       |

SLO attainment for tight TTFT goals (P95) exceeds 90% for MixServe, while rival systems fall below 50% at high request rates and model heterogeneity (Zhou et al., 13 Jan 2026, Bin et al., 8 Sep 2025, Li et al., 2023).

Ablation experiments show the dominant role of communication overlap and precision-aware placement, with global schedulers contributing up to 3× aggregate throughput improvement and adaptive local schedulers boosting SLO attainment by over 30% versus FCFS.

Systems such as LeMix demonstrate that integration of offline profiling, execution prediction, and dynamic adaptation for joint training and inference on shared clusters yields up to 3.53× throughput and 2.12× higher SLO attainment than strict separation (Li et al., 28 Jul 2025).

7. Design Guidelines, Best Practices, and Limitations

Best practices distilled from MixServe research include:

  • Combine α–β network cost modeling with empirical hardware and model profiling for placement decisions.
  • Use fused intra-node (AR) and inter-node (A2A) communication algorithms to minimize the overall communication bottleneck.
  • Employ precision-aware KV slab allocation to support co-location of diverse quantized models without fragmentation.
  • Tune (TP, EP, DP) allocation to cluster-specific bandwidth hierarchies, skewing parallel degrees optimally based on intra/inter-node ratios.
  • Deploy queue-level and batch-level SLO enforcement in schedulers, ensuring high utilization without SLO violations.

Current limitations include static model mixes, lack of support for online model hot-swapping or elastic slab resizing, and absence of decode-aware SLO enforcement—decode-phase optimization is noted as a direction for future work (Bin et al., 8 Sep 2025, Zhou et al., 13 Jan 2026).

MixServe is broadly applicable to future LLM serving and training infrastructure, offering architectural solutions for hierarchical bandwidth clusters, mixed-precision deployments, and environments requiring high responsiveness under tight resource budgets.
