This paper, "Taming the Titans: A Survey of Efficient LLM Inference Serving" (Zhen et al., 28 Apr 2025), provides a comprehensive overview of methods to optimize LLM inference, addressing challenges posed by their large parameter counts and the computational demands of the attention mechanism. The goal is to achieve low latency and high throughput for LLM services. The survey categorizes optimization techniques into instance-level, cluster-level, emerging scenarios, and miscellaneous areas.
Background
The paper first reviews LLM fundamentals.
- Transformer-based LLM: LLMs are primarily built on the Transformer decoder, featuring Multi-Head Self-Attention (MHA) with O(n^2) complexity (where n is the sequence length) and Feedforward Networks (FFN).
- Inference Process: This involves two phases:
- Prefill: Processing the entire input prompt in a compute-bound pass to generate the first token and cache Key/Value (KV) pairs.
- Decoding: Generating subsequent tokens sequentially. Using the KV cache reduces the per-token attention complexity to O(n) but increases memory overhead (a minimal sketch follows this list).
- Evaluation Metrics: Key metrics include Time To First Token (TTFT), Time Between Tokens (TBT), Throughput (tokens/second), Capacity, and percentile latencies (P50, P90, P99).
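To make the two phases concrete, here is a minimal sketch (not the paper's code) of prefill and decode with a single-head KV cache; the random projections stand in for the model's actual query/key/value computations.

```python
import numpy as np

def attend(q, K, V):
    """One query token attending over all cached keys/values (single head)."""
    scores = K @ q / np.sqrt(q.shape[-1])        # O(n) work per decode step
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

rng = np.random.default_rng(0)
d, prompt_len = 64, 128

# Prefill: the whole prompt is processed in one compute-bound pass,
# producing the first token and populating the KV cache.
K_cache = rng.standard_normal((prompt_len, d))   # stand-in for projected keys
V_cache = rng.standard_normal((prompt_len, d))   # stand-in for projected values

# Decode: tokens are generated one at a time; each step reuses the cache,
# so attention costs O(n) per token instead of recomputing O(n^2).
for _ in range(8):
    q = rng.standard_normal(d)                   # stand-in for the new token's query
    _context = attend(q, K_cache, V_cache)
    new_k, new_v = rng.standard_normal(d), rng.standard_normal(d)
    K_cache = np.vstack([K_cache, new_k])        # the cache grows with every token
    V_cache = np.vstack([V_cache, new_v])
```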
Instance-Level Optimization
These techniques focus on optimizing inference on a single instance or a tightly coupled set of devices.
- Model Placement:
- Model Parallelism: Distributes model parameters across devices.
- Pipeline Parallelism (e.g., GPipe, PipeDream): Different layers on different GPUs.
- Tensor Parallelism (e.g., Megatron-LM): Splits individual operations (like matrix multiplications) across GPUs.
- Supplementary Parallelism: Sequence parallelism (partitions activations along the sequence dimension for long sequences), Context parallelism (splits all layers along the sequence dimension), and Expert parallelism (for MoE models).
- Offloading: Moves parts of the model or computation to CPU/storage when GPU memory is insufficient.
- Examples: ZeRO-Offload, DeepSpeed-Inference, FlexGen.
- PowerInfer uses a GPU-CPU hybrid engine, placing "hot" (frequently used) neurons on GPU and "cold" neurons on CPU.
- TwinPilots integrates GPU and CPU engines with hierarchical memory.
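As a concrete toy illustration of tensor parallelism, the sketch below splits one weight matrix column-wise across two hypothetical "devices" and recombines the partial results; real systems such as Megatron-LM additionally split the following matmul row-wise to minimize communication.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 512))        # activations: (batch, hidden)
W = rng.standard_normal((512, 2048))     # one FFN up-projection weight

# Tensor parallelism: shard W column-wise across two "devices", compute the
# partial outputs locally, then gather (here simply a concatenation).
W_dev0, W_dev1 = np.split(W, 2, axis=1)
y_parallel = np.concatenate([x @ W_dev0, x @ W_dev1], axis=1)

assert np.allclose(y_parallel, x @ W)    # identical to the unsharded computation
```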
- Request Scheduling:
- Inter-Request Scheduling: Prioritizes request batches.
- FCFS (First-Come-First-Served) is the common default but can lead to head-of-line blocking.
- SJF (Shortest Job First) or approximations using predicted decoding lengths (e.g., FastServe's Skip-Join MLFQ, Prophet's SJF for prefill).
- SRTF (Shortest Remaining Time First) with dynamic prediction and preemption ratios.
- INFERMAX uses cost models for strategic preemption.
- Intra-Request Scheduling: Manages scheduling within concurrent request batches.
- Orca's iteration-level scheduling: dynamically adds/removes requests per iteration.
- Dynamic SplitFuse / chunked prefill: partitions the prefill into smaller chunks and merges them with decoding steps to reduce decoding delays.
- Slice-level scheduling (SCLS): divides generation into fixed-length slices processed sequentially.
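The following is a minimal sketch of the SJF-style policy described above, using predicted decode lengths as priorities; the predictor and batch size are assumptions, and real systems (FastServe, Orca) layer preemption and iteration-level batching on top.

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Request:
    predicted_decode_len: int                  # supplied by a length predictor
    arrival_order: int = field(compare=False)
    prompt: str = field(compare=False, default="")

def next_batch(queue: list[Request], max_batch: int = 4) -> list[Request]:
    """Pop the requests with the shortest predicted jobs first (SJF-style).
    FCFS would instead pop in arrival order and risk head-of-line blocking."""
    heapq.heapify(queue)
    return [heapq.heappop(queue) for _ in range(min(max_batch, len(queue)))]

queue = [
    Request(300, 0, "write a long report"),
    Request(20, 1, "short factual answer"),
    Request(120, 2, "medium summary"),
]
print([r.arrival_order for r in next_batch(queue, max_batch=2)])   # [1, 2]
```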
- Decoding Length Prediction: Crucial for scheduling.
- Exact Length Prediction: Predicts exact token counts, e.g., by feeding BERT embeddings to random forests or by using smaller LMs (e.g., OPT) as predictors.
- Range-Based Classification: Classifies requests into length bins (e.g., short/medium/long) using classifiers like DistilBERT or FFNs on BERT CLS tokens.
- Relative Ranking Prediction: Predicts relative order of requests within a batch, potentially more robust but may require recalculation if batches change.
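Below is a toy range-based classifier in the spirit of the second approach: hand-made prompt features and a small random forest stand in for the BERT/DistilBERT embeddings that real predictors use, and the bucket boundaries are arbitrary.

```python
from sklearn.ensemble import RandomForestClassifier

# Length buckets (in output tokens); the boundaries here are illustrative.
BUCKETS = {0: "short (<32)", 1: "medium (32-256)", 2: "long (>256)"}

def features(prompt: str) -> list[float]:
    """Crude stand-in for a learned prompt embedding."""
    return [len(prompt.split()), prompt.count("?"), float("summarize" in prompt.lower())]

# Tiny illustrative training set: (prompt, observed length bucket).
train = [("What is 2 + 2?", 0),
         ("Summarize this 10-page article about LLM serving.", 2),
         ("Explain tensor parallelism briefly.", 1)]
X = [features(p) for p, _ in train]
y = [b for _, b in train]

clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)
pred = int(clf.predict([features("Summarize the meeting notes, please.")])[0])
print("predicted bucket:", BUCKETS[pred])
```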
- KV Cache Optimization: Manages the memory-intensive KV cache.
- Memory Management:
- Lossless Storage: PagedAttention (vLLM) uses OS-inspired paging to reduce fragmentation. DistAttention for distributed KV cache. FastDecode offloads cache to CPU. LayerKV uses hierarchical allocation. KunServe frees space by removing model parameters and fetching from other instances.
- Approximation Methods: PQCache uses Product Quantization. InfiniGen uses dynamic cache management with intelligent prefetching.
- Reuse Strategies:
- Lossless Reuse: PagedAttention allows page-level sharing. Radix tree-based systems for global prefix sharing. CachedAttention for cross-turn dialogue cache reuse.
- Semantic-aware Reuse: GPTCache uses semantic similarity. SCALM clusters queries for semantic patterns. (Lossless is good for fixed patterns, semantic-aware for diverse inputs).
- Compression Techniques:
- Quantization-based: FlexGen uses 4-bit group-wise quantization. KIVI uses per-channel/per-token quantization. MiniCache exploits inter-layer similarity. AWQ protects salient weights while quantizing the rest. Atom uses mixed-precision quantization. QServe uses W4A8KV4 precision.
- Compact Encoding Architectures: CacheGen uses a custom tensor encoder to compress KV cache into bitstreams.
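To illustrate the paging idea behind lossless storage, here is a toy block-table allocator; it borrows PagedAttention's concept of fixed-size physical blocks but is not vLLM's implementation (there is no actual tensor storage, copy-on-write, or prefix sharing here).

```python
class PagedKVCache:
    """Toy block-based KV cache allocator: each sequence maps logical token
    positions to fixed-size physical blocks, so memory is allocated on demand
    and freed blocks are reusable, keeping fragmentation low."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_table = {}      # seq_id -> list of physical block ids
        self.lengths = {}          # seq_id -> number of cached tokens

    def append_token(self, seq_id: str) -> int:
        """Reserve cache space for one new token; return its physical block."""
        n = self.lengths.get(seq_id, 0)
        table = self.block_table.setdefault(seq_id, [])
        if n % self.block_size == 0:               # last block full -> allocate a new one
            if not self.free_blocks:
                raise MemoryError("cache full: preempt, evict, or offload a sequence")
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = n + 1
        return table[n // self.block_size]

    def free(self, seq_id: str) -> None:
        """Return all of a finished sequence's blocks to the free pool."""
        self.free_blocks.extend(self.block_table.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

cache = PagedKVCache(num_blocks=4, block_size=2)
for _ in range(3):
    cache.append_token("req-1")       # 3 tokens occupy 2 blocks
cache.free("req-1")                   # blocks return to the pool, no fragmentation
```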
- PD Disaggregation: Separates compute-bound prefill and memory-bound decoding into distinct environments.
- Systems like DistServe, Splitwise, DéjàVu, Mooncake, TetriInfer, and P/D-Serve implement this by allocating resources (e.g., different GPU types or CPU/GPU) optimally for each phase, managing KV cache transfer between them. Mooncake uses a KVCache-centric architecture, distributing cache across idle CPU/DRAM/SSD.
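The outline below shows the control flow of prefill/decode disaggregation at its simplest: a prefill worker produces the first token plus a transferable KV handle, and a decode worker admits it before generating further tokens. The class names, the KVHandle type, and the in-memory "transfer" are all illustrative; real systems move the cache over NVLink/RDMA or a distributed store.

```python
from dataclasses import dataclass

@dataclass
class KVHandle:
    """Stand-in for a serialized/transferable KV cache produced by prefill."""
    seq_id: str
    num_tokens: int
    payload: bytes

class PrefillWorker:
    """Runs on compute-optimized devices; does the compute-bound prompt pass."""
    def prefill(self, seq_id: str, prompt: str) -> tuple[str, KVHandle]:
        first_token = "<tok0>"                      # placeholder output
        kv = KVHandle(seq_id, len(prompt.split()), b"...")
        return first_token, kv

class DecodeWorker:
    """Runs on memory-bandwidth-optimized devices; consumes the transferred cache."""
    def __init__(self):
        self.caches = {}
    def admit(self, kv: KVHandle) -> None:
        self.caches[kv.seq_id] = kv                 # real systems: RDMA / NVLink / shared store
    def decode_step(self, seq_id: str) -> str:
        assert seq_id in self.caches, "KV cache must arrive before decoding"
        return "<tok>"                              # placeholder generated token

p, d = PrefillWorker(), DecodeWorker()
tok0, kv = p.prefill("req-7", "Summarize the survey on LLM serving")
d.admit(kv)
print(tok0, d.decode_step("req-7"))
```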
Cluster-Level Optimization
These strategies focus on deploying and managing LLMs across a cluster of machines.
- Cluster Optimization:
- Architecture and Optimization for Heterogeneous Resources: Using a mix of GPU types for cost-effectiveness.
- Sia: joint optimization for task allocation across GPU types.
- Helix: models serving on heterogeneous GPUs/networks as a max-flow problem.
- LLM-PQ: adaptive quantization and phase-aware partitioning for heterogeneous clusters.
- HexGen, Splitwise, DistServe, HexGen-2: optimize for heterogeneous disaggregated architectures.
- Service-Aware Scheduling:
- DynamoLLM: adjusts instances, parallelism, and GPU frequencies based on input/output lengths.
- Splitwise: cluster-level scheduling for separate prefill/decode devices.
- Load Balancing: Distributes requests to prevent node overload/underutilization.
- Traditional methods: Round Robin, Random.
- Heuristic Algorithms: SCLS uses a max-min algorithm based on estimated serving time. SAL considers queued prefill tokens and available memory.
- Dynamic Scheduling: Llumnix reschedules requests across instances at runtime using real-time migration.
- Intelligent Predictive Scheduling: Reinforcement learning-based routers model routing as a Markov Decision Process.
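A toy service-aware balancer in the spirit of the heuristics above: it scores each instance by queued prefill tokens plus a discounted count of active decode tokens, and skips instances without KV-cache headroom. The cost weights and the block-size divisor are assumptions, not any system's published formula.

```python
def pick_instance(instances, new_prefill_tokens):
    """Route a request to the instance with the smallest estimated backlog,
    preferring instances that still have KV-cache headroom."""
    def backlog(inst):
        return inst["queued_prefill_tokens"] + 0.5 * inst["active_decode_tokens"]
    candidates = [i for i in instances if i["free_kv_blocks"] > new_prefill_tokens // 16]
    return min(candidates or instances, key=backlog)

instances = [
    {"name": "gpu-0", "queued_prefill_tokens": 4000, "active_decode_tokens": 800, "free_kv_blocks": 900},
    {"name": "gpu-1", "queued_prefill_tokens": 1500, "active_decode_tokens": 2000, "free_kv_blocks": 200},
]
print(pick_instance(instances, new_prefill_tokens=512)["name"])   # -> gpu-1
```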
- Cloud-Based LLM Serving: Leverages cloud resources when local infrastructure is insufficient.
- Deployment and Computing Effectiveness:
- SpotServe: uses preemptible spot instances with dynamic reparallelization and stateful recovery.
- ServerlessLLM: addresses cold start latency in serverless environments.
- Mélange: optimizes GPU allocation based on request patterns.
- POLCA: power management for efficiency.
- Cooperation with Edge Devices: Addresses cloud latency by using edge computing.
- EdgeShard: collaboration between distributed edge devices and cloud.
- PerLLM: multi-armed bandit-based personalized scheduling across edge and cloud.
- Hybrid SLM/LLM approaches: small models on edge, larger on cloud.
Emerging Scenarios
Optimizations for specific advanced tasks, model architectures, or techniques.
- Long Context: Handles significantly longer input sequences.
- Parallel Processing: LoongServe uses elastic sequence parallelism.
- Attention Computation: RingAttention distributes long sequences with blockwise attention. Striped Attention extends it to fix the workload imbalance of causal attention. DistAttention subdivides attention across GPUs. InstInfer offloads attention to Computational Storage Drives.
- KV Cache Management: Infinite-LLM manages dynamic contexts at cluster level. InfiniGen optimizes cache in CPU memory. Marconi uses tailored admission/eviction policies for hybrid models.
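The core trick behind blockwise/ring-style attention is an online softmax that never materializes the full score matrix. The single-process sketch below shows that trick for one query; it is not a distributed RingAttention implementation (no devices, no communication), just the numerically equivalent chunked computation.

```python
import numpy as np

def blockwise_attention(q, K, V, block=256):
    """Attend one query over K/V in chunks using an online softmax, so the
    full n x n score matrix is never materialized."""
    m, l, acc = -np.inf, 0.0, np.zeros_like(q)
    for s0 in range(0, K.shape[0], block):
        s = K[s0:s0 + block] @ q / np.sqrt(q.shape[-1])
        m_new = max(m, s.max())
        scale = np.exp(m - m_new)                 # rescale previous partial sums
        p = np.exp(s - m_new)
        l = l * scale + p.sum()
        acc = acc * scale + p @ V[s0:s0 + block]
        m = m_new
    return acc / l

rng = np.random.default_rng(0)
n, d = 1024, 64
q = rng.standard_normal(d)
K, V = rng.standard_normal((n, d)), rng.standard_normal((n, d))

# Sanity check against the ordinary (full-matrix) softmax attention.
scores = K @ q / np.sqrt(d)
weights = np.exp(scores - scores.max())
reference = (weights / weights.sum()) @ V
assert np.allclose(blockwise_attention(q, K, V, block=128), reference)
```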
- Retrieval-Augmented Generation (RAG): LLMs retrieve external knowledge.
- Workflow Scheduling: PipeRAG uses pipeline parallelism and flexible retrieval. Teola models RAG as data flow nodes. RaLMSpec uses speculative retrieval. RAGServe adapts RAG configurations for quality/latency balance.
- Storage Optimization: RAGCache uses knowledge trees and speculative pipelining. SparseRAG uses prefilling and selective decoding. CacheBlend reuses cache and selectively recomputes KV. EPIC uses position-independent context caching.
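Here is a toy cache of precomputed chunk states in the spirit of the storage optimizations above: retrieved chunks seen before skip prefill, new chunks are computed and stored. The hash key and the compute_kv placeholder are assumptions; real systems (RAGCache, CacheBlend, EPIC) must also handle the fact that a chunk's KV states depend on position and preceding context, e.g., via position-independent caching or selective recomputation.

```python
import hashlib

class ChunkKVStore:
    """Cache precomputed states for retrieved document chunks, keyed by content hash."""
    def __init__(self):
        self.store, self.hits, self.misses = {}, 0, 0

    def kv_for_chunk(self, chunk_text: str, compute_kv):
        key = hashlib.sha256(chunk_text.encode()).hexdigest()
        if key in self.store:
            self.hits += 1                       # reuse: skip prefill for this chunk
        else:
            self.misses += 1
            self.store[key] = compute_kv(chunk_text)   # placeholder for a prefill pass
        return self.store[key]

store = ChunkKVStore()
fake_prefill = lambda text: f"<kv for {len(text)} chars>"
for chunk in ["LLM serving survey ...", "PagedAttention ...", "LLM serving survey ..."]:
    store.kv_for_chunk(chunk, fake_prefill)
print(store.hits, store.misses)                  # 1 hit, 2 misses
```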
- Mixture of Experts (MoE): Sparse models with multiple expert sub-networks.
- Expert Placement: Tutel uses switchable parallelism and dynamic pipelining. DeepSpeed-MoE combines expert parallelism and slicing.
- Expert Load Balancing: Expert Buffering allocates active experts to GPUs, others to CPUs. Brainstorm dynamically assigns GPU units. Lynx adaptively reduces active experts. ExpertChoice selects top-k tokens per expert. DeepSeek-V3 duplicates high-load experts.
- All-to-All Communication: Tutel uses a 2D hierarchical All-to-All. Aurora optimizes token transmission order. Lina prioritizes All-to-All over All-Reduce.
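The sketch below shows plain top-k token-to-expert routing with a hard per-expert capacity that drops overflow assignments; it is a deliberately crude stand-in for the placement, buffering, and duplication strategies listed above.

```python
import numpy as np

def topk_route(logits, k=2, capacity=None):
    """Assign each token to its k highest-scoring experts, subject to an
    optional per-expert capacity (overflow assignments are simply dropped)."""
    num_tokens, num_experts = logits.shape
    topk = np.argsort(-logits, axis=1)[:, :k]              # (tokens, k) expert ids
    load = np.zeros(num_experts, dtype=int)
    assignments = []                                        # (token, expert) pairs
    for t in range(num_tokens):
        for e in topk[t]:
            if capacity is None or load[e] < capacity:
                load[e] += 1
                assignments.append((t, int(e)))
    return assignments, load

rng = np.random.default_rng(0)
logits = rng.standard_normal((8, 4))                        # 8 tokens, 4 experts
assignments, load = topk_route(logits, k=2, capacity=5)
print(load)                                                 # per-expert token counts
```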
- Low-Rank Adaptation (LoRA): Adapts LLMs with small, trainable adapters.
- CaraServe: GPU-efficient, cold-start-free LoRA serving with model multiplexing and rank-aware scheduling.
- dLoRA: dynamically merges/unmerges adapters and migrates requests/adapters.
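A minimal sketch of why multi-adapter serving works: if adapters stay unmerged, the base matmul xW is shared and only the cheap low-rank term differs per request, so one batch can mix tenants (the batching itself is omitted here). The shapes, alpha/r scaling, and adapter dictionary are illustrative, not CaraServe's or dLoRA's actual design.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 512, 8                                   # hidden size, LoRA rank
W = rng.standard_normal((d, d))                 # frozen base weight (shared)
adapters = {                                    # per-tenant low-rank adapters
    "tenant-a": (rng.standard_normal((d, r)), rng.standard_normal((r, d))),
    "tenant-b": (rng.standard_normal((d, r)), rng.standard_normal((r, d))),
}

def lora_forward(x, adapter_id, alpha=16):
    """Unmerged LoRA: y = xW + (alpha/r) * (xA)B, so the low-rank term adds
    only O(d*r) extra work per token on top of the shared base matmul."""
    A, B = adapters[adapter_id]
    return x @ W + (alpha / r) * (x @ A) @ B

x = rng.standard_normal((2, d))
print(lora_forward(x, "tenant-a").shape)        # (2, 512)
```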
- Speculative Decoding: Uses smaller "draft" models to generate candidate tokens, verified in parallel by the larger target LLM.
- SpecInfer, for example, uses tree-based speculative inference and verification to accelerate both distributed serving and offloading-based single-GPU inference.
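The loop below captures the draft-then-verify idea in its greedy form: the draft model proposes k tokens, the target model scores every proposed position (in a real system this is one batched forward pass), and the longest agreeing prefix is kept plus one corrected token. Production systems such as SpecInfer use token trees and rejection sampling to preserve the target distribution; the toy "models" here are plain functions.

```python
def speculative_step(prefix, draft_model, target_model, k=4):
    """One greedy draft-then-verify round (argmax variant)."""
    draft, ctx = [], list(prefix)
    for _ in range(k):                         # cheap sequential proposals
        t = draft_model(ctx)
        draft.append(t)
        ctx.append(t)
    # Score every proposed position with the target model; in practice all k
    # positions come out of a single batched forward pass.
    target = [target_model(list(prefix) + draft[:i]) for i in range(k)]
    accepted = []
    for proposed, verified in zip(draft, target):
        if proposed == verified:
            accepted.append(proposed)          # agreement: keep the draft token
        else:
            accepted.append(verified)          # first mismatch: take the target's token
            break
    return prefix + accepted

# Toy "models": the next token is a deterministic function of context length.
draft_model = lambda ctx: len(ctx) % 5
target_model = lambda ctx: len(ctx) % 5 if len(ctx) % 7 else 0
print(speculative_step([1, 2, 3], draft_model, target_model, k=4))
```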
- Augmented LLMs: LLMs integrated with external tools (APIs, Agents).
- APISERVE: dynamically manages GPU resources for external API calls.
- LAMPS: predicts memory usage for augmented LLMs.
- Parrot: optimizes scheduling by identifying request dependencies in Agent scenarios using Semantic Variables.
- Test-Time Reasoning: Inference-time algorithms enhance reasoning but increase token generation.
- Dynasor: uses a "Certaindex" metric to track reasoning progress and adjust resources.
- Other methods learn to allocate computation based on task difficulty and marginal benefit.
Miscellaneous Areas
Niche but critical aspects.
- Hardware:
- Optimizations for specific hardware like Intel GPUs, mobile GPUs (Transformer-Lite), NPUs.
- Techniques such as mixed-precision computation and multi-level caching across HBM, DRAM, and SSDs.
- Benchmarking tools (LLM-Pilot) and analytical tools (GenZ).
- LLM serving systems for mobile devices with chunk-based KV cache compression.
- Privacy: Protecting user conversation content.
- Weight permutation to shuffle KV pairs.
- Quantifying privacy-utility trade-offs.
- MARILL: MPC-minimized secure inference by optimizing LLM architecture during fine-tuning.
- Simulator: For evaluating LLM deployment in virtual environments.
- Vidur: scalable, high-fidelity simulation framework.
- Helix system includes an event-based simulator for heterogeneous GPU clusters.
- Fairness: Ensuring fair resource allocation among clients.
- Defining fairness based on cost functions (input/output tokens).
- Scheduling algorithms like Virtual Token Counter (VTC).
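A toy counter-based fair scheduler in the spirit of VTC: each client accumulates a weighted token cost, and the next request is served for the backlogged client with the smallest counter. The specific weights (1 per input token, 2 per output token) are an illustrative cost function, not the paper's.

```python
class VirtualTokenCounter:
    """Serve the backlogged client with the smallest accumulated token cost."""

    def __init__(self, clients, w_in=1.0, w_out=2.0):
        self.counters = {c: 0.0 for c in clients}
        self.w_in, self.w_out = w_in, w_out

    def pick_client(self, pending):
        """pending: dict mapping client -> list of queued requests."""
        backlogged = [c for c, q in pending.items() if q]
        return min(backlogged, key=lambda c: self.counters[c])

    def account(self, client, in_tokens, out_tokens):
        self.counters[client] += self.w_in * in_tokens + self.w_out * out_tokens

vtc = VirtualTokenCounter(["alice", "bob"])
pending = {"alice": ["q1", "q2"], "bob": ["q3"]}
client = vtc.pick_client(pending)              # lowest counter wins (ties by dict order)
vtc.account(client, in_tokens=200, out_tokens=100)
print(client, vtc.counters)
```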
- Energy: Optimizing energy usage.
- Investigating carbon emissions from operational and embodied perspectives.
- Analyzing performance and energy consumption across model scales and batch sizes.
Future Works
- Scheduling with dependency constraints (for multi-LLM/agent systems).
- Large Multimodal Model (LMM) service optimization, addressing text/image input imbalances.
- Intelligent LLM inference service (using smaller LLMs to manage larger ones).
- Enhanced safety and privacy, especially in cloud environments.
Conclusion
The paper concludes that the primary challenges in LLM inference serving are memory and computational load. It offers a hierarchical review of solutions and suggests future research directions to advance this rapidly evolving field.