A Survey on Inference Engines for Large Language Models: Perspectives on Optimization and Efficiency (2505.01658v2)

Published 3 May 2025 in cs.CL

Abstract: LLMs are widely applied in chatbots, code generators, and search engines. Workloads such as chain-of-thought, complex reasoning, and agent services significantly increase the inference cost by invoking the model repeatedly. Optimization methods such as parallelism, compression, and caching have been adopted to reduce costs, but the diverse service requirements make it hard to select the right method. Recently, specialized LLM inference engines have emerged as a key component for integrating the optimization methods into service-oriented infrastructures. However, a systematic study on inference engines is still lacking. This paper provides a comprehensive evaluation of 25 open-source and commercial inference engines. We examine each inference engine in terms of ease-of-use, ease-of-deployment, general-purpose support, scalability, and suitability for throughput- and latency-aware computation. Furthermore, we explore the design goals of each inference engine by investigating the optimization techniques it supports. In addition, we assess the ecosystem maturity of open-source inference engines and examine the performance and cost policies of commercial solutions. We outline future research directions that include support for complex LLM-based services, support for various hardware, and enhanced security, offering practical guidance to researchers and developers in selecting and designing optimized LLM inference engines. We also provide a public repository to continually track developments in this fast-evolving field: https://github.com/sihyeong/Awesome-LLM-Inference-Engine

LLMs are increasingly integrated into applications like chatbots and code generators. However, complex tasks such as chain-of-thought reasoning and autonomous agent services significantly increase inference costs due to repeated model invocations. While optimization methods like parallelism, compression, and caching exist, the diversity of service requirements makes selecting the right approach challenging. Specialized LLM inference engines have emerged as crucial infrastructure components that consolidate these optimizations for practical deployment. This paper provides a comprehensive survey and evaluation of 25 open-source and commercial inference engines, focusing on their optimization techniques, hardware support, usability, and scalability.

The paper begins by outlining the background of LLM inference, primarily focusing on the decoder-only transformer architecture. It details the core components like Multi-Head Attention (MHA), Feed-Forward Networks (FFN), and attention variants such as Multi-Query Attention (MQA) and Grouped-Query Attention (GQA), which impact memory usage during inference. The inference process is broken down into two phases: prefill (processing the input prompt) and decode (autoregressively generating subsequent tokens). Key performance metrics relevant to inference, such as Time-to-First-Token (TTFT), Time-Between-Tokens (TBT), End-to-End Latency, and Throughput, are defined, highlighting how the prefill phase primarily affects TTFT and the decode phase impacts TBT. The paper emphasizes that efficient inference requires optimizing both phases, often through techniques like KV caching, batching, kernel fusion, and quantization. The overall LLM serving workflow, including model selection, prompt engineering, evaluation/fine-tuning, and deployment considerations (cloud vs. on-premise), is presented as a pipeline where the inference engine plays a central role.
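
To make the prefill/decode split and the associated metrics concrete, the following is a minimal sketch with a toy stand-in for the model (all names here are illustrative and do not correspond to any engine's API): prefill populates a KV cache from the whole prompt, each decode step appends one token and one cache entry, and timestamps yield TTFT, per-token TBT samples, and end-to-end latency.

```python
import time

def prefill(prompt_tokens, kv_cache):
    """Toy prefill: process the whole prompt in one pass and populate the KV cache."""
    kv_cache.extend(prompt_tokens)               # stand-in for storing per-token K/V
    return prompt_tokens[-1]                     # last prompt token seeds decoding

def decode_step(last_token, kv_cache):
    """Toy decode: emit the next token using only the cached context."""
    next_token = (last_token * 31 + len(kv_cache)) % 50_000   # placeholder "model"
    kv_cache.append(next_token)                               # cache grows by one entry
    return next_token

def generate(prompt_tokens, max_new_tokens=8):
    kv_cache, token_times = [], []
    t_request = time.perf_counter()

    last = prefill(prompt_tokens, kv_cache)      # prefill latency dominates TTFT
    for _ in range(max_new_tokens):
        last = decode_step(last, kv_cache)       # each step contributes one TBT sample
        token_times.append(time.perf_counter())

    ttft = token_times[0] - t_request                              # request -> first token
    tbt = [b - a for a, b in zip(token_times, token_times[1:])]    # inter-token gaps
    e2e = token_times[-1] - t_request                              # end-to-end latency
    return ttft, tbt, e2e

if __name__ == "__main__":
    ttft, tbt, e2e = generate(list(range(100)))
    print(f"TTFT={ttft:.6f}s  mean TBT={sum(tbt)/len(tbt):.6f}s  E2E={e2e:.6f}s")
```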

A significant part of the survey is dedicated to providing practical guidance for selecting an inference engine. The paper analyzes engines based on non-technical signals such as ecosystem maturity (organization type, licensing, user popularity via GitHub stars and growth rate) and sustainability (commit activity, documentation quality, user forum presence). It presents a comparative analysis of open-source engines based on these factors, indicating that projects like Ollama [ollama], llama.cpp [llamacpp], vLLM [kwon2023efficient], DeepSpeed-FastGen [holmes2024deepspeed], and Unsloth [unsloth] show high user interest and active development.

Hardware compatibility and platform support are detailed, categorizing engines by their support for different operating systems (Linux, Windows, macOS, Web/API), CPU architectures (x86-64, ARM/Apple Silicon), GPU vendors (NVIDIA, AMD, Intel), and AI accelerators (Google TPU, AMD Instinct, Intel Gaudi, Huawei Ascend, AWS Inferentia). The distinction between CPU-based, edge-focused, and server-side engines is made, noting that edge engines prioritize lightweight design and resource efficiency (e.g., Ollama, llama.cpp, MLC LLM [mlcLLM]), while server engines optimize for multi-GPU/multi-node performance (e.g., TensorRT-LLM [tensorrtLLM], vLLM, DeepSpeed-FastGen). The paper introduces a taxonomy matrix (Figure 3) classifying engines by scalability (single-node vs. multi-node) and device support (homogeneous vs. heterogeneous), providing a quick visual guide for selection based on infrastructure. Commercial inference engines (Friendli Inference [friendli], Fireworks AI [fireworks], GroqCloud [groqcloud], Together Inference [together]) are discussed, highlighting their managed services, broad model coverage, hardware variety (including specialized hardware like Groq LPU [abts2020think]), and pricing models (per token or per hour). Performance benchmarks comparing these commercial services across different models are presented (Figure 4).

The core of the paper lies in a detailed review of prominent inference engines and a comprehensive classification of the optimization techniques they support (Table 5).

Key Optimization Techniques Discussed:

  • Batch Optimization: Grouping requests for parallel processing. Discusses Static, Dynamic [crankshaw2017clipper], Continuous [yu2022orca], Nano-batching [zhu2024nanoflow], and Chunked-prefills [agrawal2023sarathi], explaining how they improve throughput and reduce latency spikes; a minimal continuous-batching sketch appears after this list.
  • Parallelism: Distributing model computation across multiple devices. Covers Data Parallelism [rajbhandari2020zero], Fully Sharded Data Parallelism (FSDP) [zhao2023pytorch], Tensor Parallelism [stojkovic2024towards], and Pipeline Parallelism [hu2021pipeline]. Discusses hybrid approaches and hardware-aware strategies; a tensor-parallel matrix-vector sketch is given after this list.
  • Compression: Reducing model size and computation.
    • Quantization: Converting models to lower precision (FP8, INT8, INT4). Discusses algorithms (GPTQ [frantar2022gptq], AWQ [lin2024awq], AQLM [egiazarian2024extreme], SmoothQuant [xiao2023smoothquant]), KV Cache Quantization [hooper2024kvquant], and hardware support for formats like MXFP8 [rouhani2023microscaling]. Engines like llama.cpp, vLLM, TensorRT-LLM, and TGI [tgi] support various quantization methods and data types (Table 6); a toy absmax INT8 sketch follows the list.
    • Pruning: Removing less important parameters (Structured, Unstructured, Contextual [valicenti2023mini], Post-Training [zhao2024pruning], Token Pruning [fu2024lazyLLM]). Supported by engines like DeepSpeed-FastGen and TensorRT-LLM.
    • Sparsity Optimization: Designing models with inherent sparsity (Structured - N:M, Block, Dynamic - MoE [cai2024survey], Kernel-level). Many engines support MoE, and some (vLLM, SGLang, TGI, TensorRT-LLM) support structured sparsity techniques.
  • Fine-tuning: Adapting pretrained models (PEFT methods like LoRA [hu2022lora] and QLoRA [dettmers2023qlora]). Many engines support LoRA/QLoRA and Multi-LoRA for serving multiple adapters concurrently; a LoRA forward-pass sketch follows the list.
  • Caching: Reusing computation for repeated input segments or generated tokens. Discusses Prompt Caching [gim2024prompt], Prefix Caching [liu2024optimizing], and KV Caching [pope2023efficiently].
  • Attention Optimization: Improving the efficiency of the attention mechanism. Covers KV Cache optimizations (PagedAttention, TokenAttention, ChunkedAttention), I/O optimizations like FlashAttention [dao2022flashattention], KV Cache Reuse (RadixAttention [zheng2024sglang] in SGLang), attention programming models like FlexAttention [dong2024flex], and custom kernels like FireAttention [fireworks]; a paged KV-cache bookkeeping sketch follows the list.
  • Sampling Optimization: Accelerating token generation beyond sequential decoding. Focuses on Speculative Decoding [leviathan23fast], which uses a draft model to propose tokens that are validated by the target model. Discusses draft model types (EAGLE [li2024eagle], Medusa [cai2024medusa]) and optimization techniques. Supported by engines like Ollama, llama.cpp, vLLM, SGLang, TensorRT-LLM, and several commercial engines; a draft-and-verify sketch follows the list.
  • Structured Outputs: Ensuring generated text adheres to predefined formats (JSON, code). Discusses Constrained Decoding using FSMs [willard2023efficient] or CFGs [geng2023grammar]. Highlights libraries and frameworks like Outlines [outlines], XGrammar [dong2024xgrammar], LM Format Enforcer [lm-format-enforcer], llguidance [llguidance], GBNF [gbnf], and OpenAI's API features. Many engines integrate support for these tools or provide native structured output capabilities; an FSM-constrained decoding sketch closes the examples after this list.
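
Several of the items above (batching, parallelism, quantization, LoRA fine-tuning, KV-cache management, speculative decoding, and structured outputs) are easiest to grasp from small sketches. The examples below are minimal, self-contained illustrations with toy stand-ins for the model; every function and variable name is invented here for illustration and does not correspond to any engine's API. First, continuous (iteration-level) batching in the spirit of Orca [yu2022orca]: sequences join and leave the running batch at every decode step instead of the whole batch waiting for its slowest member.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Request:
    rid: int            # request id
    max_new: int        # tokens to generate
    generated: int = 0  # tokens produced so far

def decode_one_token(req: Request) -> None:
    """Placeholder for one decode step of a single sequence."""
    req.generated += 1

def continuous_batching(requests, max_batch_size=4):
    """Iteration-level scheduling: admit and retire sequences every decode step."""
    waiting, running, completed = deque(requests), [], []
    while waiting or running:
        # Admit new requests whenever a batch slot is free.
        while waiting and len(running) < max_batch_size:
            running.append(waiting.popleft())
        # One decode iteration over the current batch
        # (a real engine fuses this into batched kernel launches).
        for req in running:
            decode_one_token(req)
        # Retire finished sequences immediately so their slots are reusable next step.
        still_running = []
        for req in running:
            (completed if req.generated >= req.max_new else still_running).append(req)
        running = still_running
    return completed

if __name__ == "__main__":
    reqs = [Request(rid=i, max_new=4 + 3 * i) for i in range(6)]
    print([(r.rid, r.generated) for r in continuous_batching(reqs)])
```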
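
Next, a tensor-parallel linear layer, assuming a Megatron-style split of the output dimension across devices, simulated here with plain Python lists rather than GPUs: each shard computes a slice of the output vector, and the slices are gathered at the end.

```python
def matvec(M, x):
    """Dense matrix-vector product y = M x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in M]

def shard_rows(W, num_devices):
    """Split W by output rows so each 'device' owns a slice of the output features."""
    per = len(W) // num_devices
    return [W[i * per:(i + 1) * per] for i in range(num_devices)]

def tensor_parallel_matvec(W, x, num_devices=2):
    shards = shard_rows(W, num_devices)
    partials = [matvec(shard, x) for shard in shards]   # would run on separate GPUs
    return [v for part in partials for v in part]       # gather the output slices

if __name__ == "__main__":
    W = [[1, 0], [0, 1], [2, 2], [3, -1]]   # d_out = 4, d_in = 2
    x = [1.0, 2.0]
    assert tensor_parallel_matvec(W, x) == matvec(W, x)
    print(tensor_parallel_matvec(W, x))
```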
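
For quantization, the sketch below shows only round-to-nearest absmax INT8 weight quantization; calibrated methods such as GPTQ and AWQ refine the scaling and rounding but follow the same quantize/dequantize pattern.

```python
def quantize_int8(weights):
    """Per-row symmetric quantization: w ~= scale * q with integer q in [-127, 127]."""
    q_rows, scales = [], []
    for row in weights:
        absmax = max(abs(w) for w in row) or 1.0
        scale = absmax / 127.0
        q_rows.append([round(w / scale) for w in row])
        scales.append(scale)
    return q_rows, scales

def dequantize(q_rows, scales):
    """Recover approximate FP weights (in practice fused into the matmul kernel)."""
    return [[q * s for q in row] for row, s in zip(q_rows, scales)]

if __name__ == "__main__":
    W = [[0.12, -0.50, 0.33], [1.80, -0.02, 0.41]]
    Q, S = quantize_int8(W)
    W_hat = dequantize(Q, S)
    err = max(abs(a - b) for r1, r2 in zip(W, W_hat) for a, b in zip(r1, r2))
    print("max abs quantization error:", err)
```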
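
For LoRA-style fine-tuning and multi-adapter serving, the forward pass adds a scaled low-rank update B(Ax) to the frozen base projection Wx; an engine can keep W shared across requests and swap only the small (A, B) pair per adapter.

```python
def matvec(M, x):
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def lora_forward(W, A, B, x, alpha=16, r=2):
    """y = W x + (alpha / r) * B (A x), with W frozen and (A, B) the trained adapter."""
    base = matvec(W, x)                       # shared, frozen base projection
    low_rank = matvec(B, matvec(A, x))        # A: r x d_in, B: d_out x r
    scale = alpha / r
    return [b + scale * l for b, l in zip(base, low_rank)]

if __name__ == "__main__":
    W = [[0.1, 0.2], [0.3, 0.4]]          # d_out = 2, d_in = 2 (frozen)
    A = [[0.01, 0.02], [0.03, 0.04]]      # r = 2 rows, d_in = 2 cols
    B = [[0.5, 0.0], [0.0, 0.5]]          # d_out = 2 rows, r = 2 cols
    print(lora_forward(W, A, B, [1.0, 2.0]))
```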
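
For PagedAttention-style KV-cache management, a per-sequence block table maps logical token positions to fixed-size physical blocks, so memory is claimed on demand rather than reserved for the maximum sequence length. This is a schematic of the bookkeeping only, not vLLM's actual data structures.

```python
class PagedKVCache:
    """Toy block-table bookkeeping for a paged KV cache."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))   # physical block ids
        self.block_tables = {}                       # seq_id -> [physical block ids]
        self.seq_lens = {}                           # seq_id -> tokens stored

    def append_token(self, seq_id: int):
        """Reserve a slot for one new token; returns (physical_block, offset)."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.seq_lens.get(seq_id, 0)
        if length % self.block_size == 0:            # current block full (or first token)
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted; an engine would preempt or swap here")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1
        return table[-1], length % self.block_size

    def free(self, seq_id: int) -> None:
        """Return a finished sequence's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

if __name__ == "__main__":
    cache = PagedKVCache(num_blocks=8, block_size=4)
    for _ in range(10):
        cache.append_token(seq_id=0)
    print(cache.block_tables[0])   # 10 tokens occupy three 4-token blocks
    cache.free(0)
```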
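
For speculative decoding, a cheap draft model proposes a short run of tokens and the target model keeps the longest accepted prefix; the acceptance test below is a random placeholder standing in for the target model's single verification pass over all drafts.

```python
import random

def draft_tokens(context, k):
    """Cheap draft model: proposes k candidate tokens (placeholder logic)."""
    return [(context[-1] + i + 1) % 100 for i in range(k)]

def target_accepts(context, token):
    """Placeholder for target-model verification of one drafted token."""
    return random.random() < 0.8

def speculative_decode(prompt, max_new_tokens=16, k=4):
    out = list(prompt)
    while len(out) - len(prompt) < max_new_tokens:
        accepted = 0
        for tok in draft_tokens(out, k):
            if target_accepts(out, tok):
                out.append(tok)                  # accepted drafts cost one target pass total
                accepted += 1
            else:
                break                            # first rejection discards the rest
        if accepted < k:
            out.append((out[-1] + 7) % 100)      # target model supplies the corrected token
    return out[len(prompt):len(prompt) + max_new_tokens]

if __name__ == "__main__":
    random.seed(0)
    print(speculative_decode([1, 2, 3]))
```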
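
Finally, for structured outputs, constrained decoding tracks a finite-state machine alongside generation and masks out every token the current state forbids. The toy vocabulary and FSM below are invented for illustration; libraries such as Outlines and XGrammar compile comparable tables from regexes or JSON schemas.

```python
import random

# Toy vocabulary and an FSM that forces outputs of the shape '{' digits '}'.
VOCAB = ["{", "}", "0", "1", "2", "<eos>"]
DIGITS = {"0", "1", "2"}

# state -> allowed tokens; a real engine compiles this from a regex or JSON schema.
FSM = {
    "start":  {"{"},
    "digits": DIGITS | {"}"},
    "done":   {"<eos>"},
}

def next_state(state, token):
    if state == "start" and token == "{":
        return "digits"
    if state == "digits" and token in DIGITS:
        return "digits"
    if state == "digits" and token == "}":
        return "done"
    return state

def constrained_pick(logits, state):
    """Mask out forbidden tokens, then greedily pick the best remaining one."""
    return max((tok for tok in VOCAB if tok in FSM[state]), key=lambda t: logits[t])

if __name__ == "__main__":
    random.seed(1)
    state, output = "start", []
    while state != "done":
        logits = {tok: random.random() for tok in VOCAB}   # placeholder model scores
        tok = constrained_pick(logits, state)
        output.append(tok)
        state = next_state(state, tok)
    print("".join(output))   # always a well-formed string like "{21}"
```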

Finally, the paper outlines future directions and open challenges in LLM inference, including the need to support multimodal LLMs, adopt alternative architectures beyond Transformers (e.g., Mamba [gu2023mamba], Jamba [lieber2024jamba]), develop hardware-aware kernel fusion and mixed-precision capabilities for diverse hardware, manage memory for increasingly long context windows (e.g., Llama 4 Scout's 10M tokens [llama4]), handle complex logical reasoning workloads, facilitate application-specific engine selection, enhance security during inference (prompt injection, data leaks), improve on-device inference on edge/mobile devices, broaden support for heterogeneous hardware, and integrate with cloud orchestration and serving platforms (Kubernetes [burns2016borg], Ray [moritz2018ray], Triton Inference Server [tritoninferenceserver]).

In conclusion, the survey provides a valuable resource for researchers and practitioners by mapping the landscape of LLM inference engines, systematically classifying their supported optimizations, and offering practical guidance for selecting and implementing engines based on specific application needs, hardware constraints, and performance objectives. It highlights that achieving efficient and scalable LLM inference requires a deep understanding of both model architecture and system-level optimizations, emphasizing the ongoing evolution of this field.

Authors (6)
  1. Sihyeong Park (6 papers)
  2. Sungryeol Jeon (1 paper)
  3. Chaelyn Lee (2 papers)
  4. Seokhun Jeon (2 papers)
  5. Byung-Soo Kim (2 papers)
  6. Jemin Lee (45 papers)