TurboMind Inference Engine
- TurboMind is a high-performance inference engine for LLMs that uses end-to-end mixed-precision strategies and hardware-aware optimizations.
- It integrates specialized modules like an offline packager, online engine, scheduler, and KV cache manager for efficient memory and computation management.
- Empirical evaluations demonstrate up to 156% higher throughput and 61% lower latency compared to other leading inference engines.
TurboMind is a high-performance inference engine designed for LLMs with end-to-end mixed-precision support, functioning as a drop-in backend within the LMDeploy framework. Its core utility lies in systematically reducing the memory and computational requirements of LLM inference through hybrid precision schemes applied to model weights, activations, and key-value (KV) caches. TurboMind implements novel hardware-aware execution pipelines and end-to-end optimizations that leverage tensor-core architectures and deep memory hierarchies to achieve state-of-the-art latency and throughput for mixed-precision workloads (Zhang et al., 21 Aug 2025).
1. System Architecture
TurboMind comprises several specialized modules organized for high-throughput, low-latency inference. At the highest level inside LMDeploy, the engine incorporates the following functional modules:
- Offline Packager: Receives quantized weight tensors (e.g., INT4, INT8, BF16), applies a hardware-aware packing strategy, and stores tensor-core–aligned fragments to persistent memory.
- Online Engine: Loads pre-packed weights and KV-caches at runtime, transfers them through global and shared memory to registers, applies dequantization, and performs matrix operations (GEMM or attention) to generate tokens.
- Scheduler: Manages the orchestration of GEMM and attention pipelines, overlapping asynchronous memory copy operations (cp.async), integer-to-float (I2F) dequantization, and tensor-core multiply-accumulate (MMA) instructions for throughput maximization.
- KV-Cache Manager: Maintains quantized K/V tensors and streams them efficiently to the attention pipeline.
- API Layer: Exposes configuration parameters for arbitrary mixed-precision formats (e.g., W4A16KV8), batch sizes, device selection, and tensor parallelism (a parsing sketch of such format strings follows the list).
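As an illustration of how a format string such as W4A16KV8 might map onto per-tensor bit widths, the sketch below parses identifiers of that shape. The `PrecisionConfig` container and `parse_precision` helper are hypothetical names introduced for exposition; they are not part of LMDeploy's API.

```python
import re
from dataclasses import dataclass

@dataclass
class PrecisionConfig:
    """Illustrative container for per-tensor bit widths (not LMDeploy's API)."""
    weight_bits: int
    act_bits: int
    kv_bits: int

def parse_precision(fmt: str) -> PrecisionConfig:
    """Parse a format string like 'W4A16KV8' into weight/activation/KV bit widths."""
    m = re.fullmatch(r"W(\d+)A(\d+)KV(\d+)", fmt)
    if m is None:
        raise ValueError(f"unrecognized precision format: {fmt}")
    return PrecisionConfig(*(int(g) for g in m.groups()))

if __name__ == "__main__":
    cfg = parse_precision("W4A16KV8")
    print(cfg)  # PrecisionConfig(weight_bits=4, act_bits=16, kv_bits=8)
```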
Data and Execution Flow
- Weight Preparation: Offline, quantized weights are packed into fragments coalesced for tensor-core execution.
- Runtime Execution: The Online Engine loads weight/KV fragments into memory, where the GEMM pipeline processes weights and activations, and the attention pipeline processes Q/K/V and the KV cache, all via tensor-core instructions and specialized memory management.
- Hierarchical Memory Utilization: Data flows from global memory, through shared memory, and into registers before being consumed by tensor cores for computation.
2. Mixed-Precision Strategies
TurboMind supports arbitrary combinations of precision formats as follows:
- Weights ($W$): INT4, INT8, BF16, FP16
- Activations ($A$): 8, 16, or 32 bits
- KV Cache ($KV$): 4, 8, or 16 bits
These combinations are denoted as, for example, W4A16KV8. Quantization operates either per-block or per-row: a weight $w$ with scale $s$ is quantized as $q = \mathrm{round}(w/s)$ and dequantized as $\hat{w} = s \cdot q$, with a per-element error bound $|w - \hat{w}| \le s/2$.

For a matrix tile of size $m \times n$ sharing a scale $s$, the accumulated quantization error is therefore bounded by $\|W - \hat{W}\|_F \le \tfrac{s}{2}\sqrt{mn}$. This strategy offers up to a 4x reduction in memory footprint (e.g., INT4 weights versus FP16) and potentially higher arithmetic intensity on INT4-compatible tensor cores compared to FP16.
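A minimal NumPy sketch of symmetric per-row quantization consistent with the per-element bound above. The scale convention $s = \max|w| / (2^{b-1}-1)$ is a common choice assumed here for illustration, not a detail taken from the paper.

```python
import numpy as np

def quantize_per_row(W: np.ndarray, bits: int = 4):
    """Symmetric per-row quantization: q = round(W / s), dequantized value = s * q."""
    qmax = 2 ** (bits - 1) - 1                       # e.g. 7 for INT4
    s = np.abs(W).max(axis=1, keepdims=True) / qmax  # one scale per row
    q = np.clip(np.round(W / s), -qmax - 1, qmax).astype(np.int8)
    return q, s

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.standard_normal((128, 256)).astype(np.float32)
    q, s = quantize_per_row(W, bits=4)
    err = np.abs(W - s * q)
    # Per-element error stays within s/2, matching |w - s*q| <= s/2.
    assert np.all(err <= s / 2 + 1e-6)
    print("worst-case |w - s*q| relative to s/2:", float((err / (s / 2)).max()))
```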
3. General Matrix Multiply (GEMM) Pipeline
Offline Weight Packing
TurboMind implements a hardware-optimized, bank-conflict–free packing algorithm. Each weight tensor in $b$-bit format undergoes the following steps (a conceptual sketch follows the list):
- Bit-extension to 16 bits.
- Partitioning into tiles, each tile loaded via asynchronous copy (cp.async) to shared memory.
- Loading to registers using ldmatrix for correct lane alignment.
- In-register repacking and permutation for direct tensor-core MMA consumption, eliminating runtime shuffles.
- Coalesced global memory storage of packed fragments in a flat, tensor-core–aligned layout.
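The NumPy sketch below conveys the packing idea only: bit-extend, partition into tiles, permute each tile into a fragment order, and store the result contiguously. The 16x16 tile size and the transpose-based permutation are illustrative placeholders; TurboMind's actual layout is dictated by ldmatrix/MMA fragment geometry.

```python
import numpy as np

TILE = 16  # illustrative tile edge; real fragment shapes depend on the MMA instruction

def pack_int4_weights(q: np.ndarray) -> np.ndarray:
    """Pack an INT4 weight matrix (values in [-8, 7], stored as int8) into
    flat, tile-contiguous fragments so runtime loads are fully coalesced."""
    rows, cols = q.shape
    assert rows % TILE == 0 and cols % TILE == 0
    # 1) bit-extension: keep values in 16-bit registers (simulated with int16)
    q16 = q.astype(np.int16)
    # 2) partition into TILE x TILE tiles, shape (row_tiles, col_tiles, TILE, TILE)
    tiles = q16.reshape(rows // TILE, TILE, cols // TILE, TILE).transpose(0, 2, 1, 3)
    # 3) illustrative in-tile permutation standing in for the ldmatrix/MMA lane order
    perm = np.arange(TILE * TILE).reshape(TILE, TILE).T.ravel()
    packed = tiles.reshape(-1, TILE * TILE)[:, perm]
    # 4) flat, tile-contiguous storage
    return np.ascontiguousarray(packed.ravel())

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    q = rng.integers(-8, 8, size=(64, 64), dtype=np.int8)
    flat = pack_int4_weights(q)
    print(flat.shape)  # (4096,): 16 tiles of 256 packed values each
```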
Online Acceleration
At inference:
- Warps issue cp.async and ldmatrix instructions to move fragments into registers.
- Quantized fragments are brought into floating point via fused I2F dequantization and MMA:
- For each register holding a quantized value $q$ with scale $s$, compute $\hat{w} = s \cdot \mathrm{I2F}(q)$ and accumulate $C \leftarrow C + \hat{w}\,a$ via mma.sync (see the sketch after this list).
- Final matrix multiplication executes near FP16 speed, even for 4-bit weights (INT4).
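Numerically, the fused dequantize-and-accumulate step reduces to the loop below (a NumPy sketch). On the GPU the I2F conversion and FMA run on ALUs while tensor cores consume the previous fragment; this sequential code does not model that overlap, only the arithmetic.

```python
import numpy as np

def dequant_gemm(q_w: np.ndarray, s: np.ndarray, a: np.ndarray,
                 tile: int = 16) -> np.ndarray:
    """C = dequant(q_w) @ A, accumulated one K-tile at a time.
    q_w: INT4 weights stored as int8, s: per-row scales, a: FP16 activations."""
    c = np.zeros((q_w.shape[0], a.shape[1]), dtype=np.float32)
    for k0 in range(0, q_w.shape[1], tile):
        frag_q = q_w[:, k0:k0 + tile]               # quantized fragment in "registers"
        frag_w = s * frag_q.astype(np.float32)      # fused I2F + scale (dequantization)
        c += frag_w @ a[k0:k0 + tile].astype(np.float32)  # MMA accumulate
    return c

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    q_w = rng.integers(-8, 8, size=(64, 128), dtype=np.int8)
    s = rng.uniform(0.01, 0.1, size=(64, 1)).astype(np.float32)
    a = rng.standard_normal((128, 32)).astype(np.float16)
    ref = (s * q_w.astype(np.float32)) @ a.astype(np.float32)
    print(np.allclose(dequant_gemm(q_w, s, a), ref, atol=1e-3))  # True: tiling preserves the result
```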
4. Attention Pipeline
Adaptive Head Alignment
TurboMind addresses lane misalignments that arise when the query ($Q$) is FP16 but the key ($K$) is INT8 or INT4:
- Copies $Q$ from registers to row-major shared memory.
- A custom shared-to-register “rearrangement” kernel redistributes $Q$ to match the fragment layout required by the INT-$k$ $K$ operand, enabling use of ldmatrix without bank conflicts.
- This head alignment ensures the mixed-precision $QK^{\top}$ product executes at tensor-core throughput (a toy illustration of the rearrangement follows).
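A toy NumPy illustration of the rearrangement: a row-major Q tile, as staged in shared memory, is permuted into fragment-major order so that it lines up with the layout the quantized K operand already uses. The 8x8 fragment shape is an assumed stand-in for the actual ldmatrix fragment geometry.

```python
import numpy as np

FRAG = 8  # illustrative fragment edge; real ldmatrix fragments are defined per lane group

def rearrange_q(q_tile: np.ndarray) -> np.ndarray:
    """Permute a row-major Q tile into fragment-major order so each FRAG x FRAG
    fragment is contiguous, mirroring the layout of the quantized K operand."""
    r, c = q_tile.shape
    frags = q_tile.reshape(r // FRAG, FRAG, c // FRAG, FRAG).transpose(0, 2, 1, 3)
    return np.ascontiguousarray(frags.reshape(-1, FRAG * FRAG))

if __name__ == "__main__":
    q_tile = np.arange(16 * 16, dtype=np.float16).reshape(16, 16)  # row-major "shared memory"
    frag_major = rearrange_q(q_tile)
    print(frag_major.shape)   # (4, 64): four 8x8 fragments, each contiguous
    print(frag_major[0, :8])  # first row of the top-left fragment
```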
Mixed Precision Q, K, V and Attention
For arbitrary Q, K, V precisions (a numerical sketch follows the list):
- $Q$: FP16.
- $K$, $V$: quantized (e.g., INT8 or INT4) with scales $s_K$, $s_V$.
- On-the-fly dequantization: $\hat{K} = s_K \cdot \mathrm{I2F}(K_q)$, $\hat{V} = s_V \cdot \mathrm{I2F}(V_q)$.
- Attention scores: $S = Q\hat{K}^{\top}/\sqrt{d_k}$ via FP16 MMA.
- Apply softmax: $P = \operatorname{softmax}(S)$, then output $O = P\hat{V}$.
- No additional error-compensation steps beyond quantization scaling; pipeline design and I2F procedures are sufficient to minimize quantization error.
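A NumPy sketch of the dequantize-then-attend sequence for a single head with FP16 Q and INT8 K/V. The per-channel scale convention and the tensor sizes are assumptions made for illustration.

```python
import numpy as np

def mixed_precision_attention(q, k_q, s_k, v_q, s_v):
    """Attention with an FP16 query and quantized K/V, dequantized on the fly."""
    k_hat = s_k * k_q.astype(np.float32)               # K_hat = s_K * I2F(K_q)
    v_hat = s_v * v_q.astype(np.float32)               # V_hat = s_V * I2F(V_q)
    scores = q.astype(np.float32) @ k_hat.T / np.sqrt(q.shape[-1])
    p = np.exp(scores - scores.max(axis=-1, keepdims=True))
    p /= p.sum(axis=-1, keepdims=True)                 # softmax over keys
    return p @ v_hat                                    # O = softmax(S) V_hat

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    d, n = 64, 128
    q = rng.standard_normal((1, d)).astype(np.float16)  # FP16 query
    k = rng.standard_normal((n, d)).astype(np.float32)
    v = rng.standard_normal((n, d)).astype(np.float32)
    s_k = np.abs(k).max(axis=0) / 127.0                  # per-channel INT8 scales
    s_v = np.abs(v).max(axis=0) / 127.0
    k_q = np.round(k / s_k).astype(np.int8)
    v_q = np.round(v / s_v).astype(np.int8)
    out = mixed_precision_attention(q, k_q, s_k, v_q, s_v)
    print(out.shape)  # (1, 64)
```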
5. Hardware-Oriented Execution and Memory Optimization
TurboMind exploits the following hardware-aware techniques:
- Instruction-Level Parallelism (ILP): The GEMM and attention loops overlap (1) tensor-core mma.sync on tile $i$, (2) ALU I2F + FMA dequantization on tile $i+1$, and (3) cp.async prefetch of tile $i+2$. Registers are double-buffered to feed outputs directly to subsequent mma.sync instructions without stalls.
- Memory Hierarchy Management:
- Global-to-shared via cp.async for large-burst copying.
- Shared-to-register transfer using ldmatrix or equivalent to mitigate bank conflicts.
- Register-to-tensor-core execution using mma.sync native tiling.
- KV Memory Loading Pipeline:
- Each KV tile is subdivided into micro-tiles (e.g., 16 values each).
- For each micro-tile: (A) execute the attention MMA on the current slice, (B) dequantize the next slice, (C) prefetch the next full KV tile.
- Triple-buffered shared memory ensures continuous data flow with no compute stalls (a conceptual sketch of this overlap follows).
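A Python sketch of the staged overlap. On the GPU, stages (A)–(C) run concurrently on tensor cores, ALUs, and the copy engine; here they are written sequentially per iteration only to make the staggering of tiles explicit. The tile and micro-tile sizes are illustrative, not TurboMind's actual parameters.

```python
import numpy as np

TILE, MICRO = 128, 16  # illustrative: KV tile of 128 values, micro-tiles of 16

def pipelined_kv_consume(kv_q: np.ndarray, s: float) -> float:
    """Walk a quantized KV stream tile by tile: (A) compute on the current
    micro-tile, (B) dequantize the next tile, (C) 'prefetch' the tile after it.
    Returns a checksum of all dequantized values."""
    n_tiles = kv_q.size // TILE
    acc = 0.0
    current = s * kv_q[:TILE].astype(np.float32)       # prologue: first tile dequantized
    for i in range(n_tiles):
        nxt = kv_q[(i + 1) * TILE:(i + 2) * TILE]      # (C) prefetch tile i+1
        for m0 in range(0, TILE, MICRO):
            acc += float(current[m0:m0 + MICRO].sum()) # (A) compute on tile i
        if nxt.size:                                   # (B) dequantize tile i+1
            current = s * nxt.astype(np.float32)
    return acc

if __name__ == "__main__":
    rng = np.random.default_rng(4)
    kv_q = rng.integers(-8, 8, size=4 * TILE, dtype=np.int8)
    s = 0.05
    ref = float(s * kv_q.astype(np.float32).sum())
    print(abs(pipelined_kv_consume(kv_q, s) - ref) < 1e-3)  # True: pipelining preserves the result
```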
6. Empirical Performance and Comparative Evaluation
TurboMind's performance evaluation encompasses 16 LLMs (7B–235B, both dense and MoE), two quantization schemes (AWQ, GPTQ), and four GPU architectures (RTX 4090, L40S, A100, H100). Baselines for comparison include vLLM + MARLIN, TensorRT-LLM, and OmniServe + QServe.
Key empirical findings for mixed-precision workloads:
| Metric | Baseline | TurboMind advantage |
|---|---|---|
| Serving latency | vLLM + MARLIN | up to 61% lower (avg 30%) |
| Throughput | vLLM + MARLIN | up to 156% higher (avg 58%) |
| Throughput | TensorRT-LLM (7/14B) | 118.9% speedup |
| TTFT reduction | TensorRT-LLM (7/14B) | 52.2% |
| Throughput | OmniServe + QServe (W4A8KV4) | 14.1% higher |
Selected kernel-level benchmarks (Qwen 8B AWQ, W4A16KV8, A100):
- GEMM (INT4 x FP16) vs. MARLIN: 19.2% average speedup (up to 25.5%).
- Attention prefill: 22.1% average latency reduction (max 48.7%).
- Attention decode: 7.6% latency reduction (max 29.9%).
- INT4 x FP16 vs. FP16 x FP16 GEMM: 134% speedup for small batches; parity at large batches.
- Memory bandwidth utilization: up to 93% (8-bit KV), up to 95% (16-bit KV).
These results indicate consistent performance advantages across models and hardware profiles, with particular efficacy at batch sizes and quantization regimes relevant to high-throughput inference.
7. Integration with LMDeploy and Usage
TurboMind is the default inference engine in LMDeploy, accessible via the --engine turbomind option at both CLI and Python API layers. Typical usage exposes control over weight, activation, and KV precisions, as well as parallelization and batch sizing:
```python
from lmdeploy import InferenceEngine

engine = InferenceEngine(
    model_path="qwen-8b-awq",
    weight_precision=4,
    act_precision=16,
    kv_precision=8,
    engine="turbomind",
    device="cuda:0",
    tensor_parallel_degree=2,
)

outputs = engine.generate(
    prompts=[…],
    max_new_tokens=512,
    temperature=0.8,
)
```
Key CLI flags include:
- `--weight-precision=W4|W8|FP16|BF16`
- `--activation-precision=A8|A16|FP32`
- `--kv-precision=KV4|KV8|KV16`
- `--batch-size=N`
- `--engine=turbomind`
The offline packager is invoked automatically during model preparation, caching packed fragments for all supported GPU architectures so that no per-request packing latency is incurred at inference time. The LMDeploy infrastructure dispatches TurboMind kernels without user intervention, providing deterministic performance characteristics for production and research use (Zhang et al., 21 Aug 2025).