TurboMind Inference Engine
- TurboMind is a high-performance inference engine for LLMs that uses end-to-end mixed-precision strategies and hardware-aware optimizations.
- It integrates specialized modules like an offline packager, online engine, scheduler, and KV cache manager for efficient memory and computation management.
- Empirical evaluations demonstrate up to 156% higher throughput and 61% lower latency compared to other leading inference engines.
TurboMind is a high-performance inference engine designed for LLMs with end-to-end mixed-precision support, functioning as a drop-in backend within the LMDeploy framework. Its core utility lies in systematically reducing the memory and computational requirements of LLM inference through hybrid precision schemes applied to model weights, activations, and key-value (KV) caches. TurboMind implements novel hardware-aware execution pipelines and end-to-end optimizations that leverage tensor-core architectures and deep memory hierarchies to achieve state-of-the-art latency and throughput for mixed-precision workloads (Zhang et al., 21 Aug 2025).
1. System Architecture
TurboMind comprises several specialized modules organized for high-throughput, low-latency inference. At the highest level inside LMDeploy, the engine incorporates the following functional modules:
- Offline Packager: Receives quantized weight tensors (e.g., INT4, INT8, BF16), applies a hardware-aware packing strategy, and stores tensor-core–aligned fragments to persistent memory.
- Online Engine: Loads pre-packed weights and KV-caches at runtime, transfers them through global and shared memory to registers, applies dequantization, and performs matrix operations (GEMM or attention) to generate tokens.
- Scheduler: Manages the orchestration of GEMM and attention pipelines, overlapping asynchronous memory copy operations (cp.async), integer-to-float (I2F) dequantization, and tensor-core multiply-accumulate (MMA) instructions for throughput maximization.
- KV-Cache Manager: Maintains quantized K/V tensors and streams them efficiently to the attention pipeline.
- API Layer: Exposes configuration parameters for arbitrary mixed-precision formats (e.g., W4A16KV8), batch sizes, device selection, and tensor parallelism (a parsing sketch of such format strings follows the list).
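As an illustration of how a format string such as W4A16KV8 might map onto per-tensor bit widths, the sketch below parses identifiers of that shape. The `PrecisionConfig` container and `parse_precision` helper are hypothetical names introduced for exposition; they are not part of LMDeploy's API.

```python
import re
from dataclasses import dataclass

@dataclass
class PrecisionConfig:
    """Illustrative container for per-tensor bit widths (not LMDeploy's API)."""
    weight_bits: int
    act_bits: int
    kv_bits: int

def parse_precision(fmt: str) -> PrecisionConfig:
    """Parse a format string like 'W4A16KV8' into weight/activation/KV bit widths."""
    m = re.fullmatch(r"W(\d+)A(\d+)KV(\d+)", fmt)
    if m is None:
        raise ValueError(f"unrecognized precision format: {fmt}")
    return PrecisionConfig(*(int(g) for g in m.groups()))

if __name__ == "__main__":
    cfg = parse_precision("W4A16KV8")
    print(cfg)  # PrecisionConfig(weight_bits=4, act_bits=16, kv_bits=8)
```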
Data and Execution Flow
- Weight Preparation: Offline, quantized weights are packed into fragments coalesced for tensor-core execution.
- Runtime Execution: The Online Engine loads weight/KV fragments into memory, where the GEMM pipeline processes weights and activations, and the attention pipeline processes Q/K/V and the KV cache, all via tensor-core instructions and specialized memory management.
- Hierarchical Memory Utilization: Data flows from global memory, through shared memory, and into registers before being consumed by tensor cores for computation.
2. Mixed-Precision Strategies
TurboMind supports arbitrary combinations of precision formats as follows:
- Weights ($W$): INT4, INT8, BF16, FP16
- Activations ($A$): 8, 16, or 32 bits
- KV Cache ($KV$): 4, 8, or 16 bits
These combinations are denoted as, for example, W4A16KV8. Quantization operates either per-block or per-row: a weight $w$ with scale $s$ is quantized as $q = \mathrm{round}(w/s)$ and dequantized as $\hat{w} = s \cdot q$, with a per-element error bound $|w - \hat{w}| \le s/2$.

For a matrix tile of size $m \times n$ sharing a scale $s$, the accumulated quantization error is therefore bounded by $\|W - \hat{W}\|_F \le \tfrac{s}{2}\sqrt{mn}$. This strategy offers up to a 4x reduction in memory footprint (e.g., INT4 weights versus FP16) and potentially higher arithmetic intensity on INT4-compatible tensor cores compared to FP16.
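A minimal NumPy sketch of symmetric per-row quantization consistent with the per-element bound above. The scale convention $s = \max|w| / (2^{b-1}-1)$ is a common choice assumed here for illustration, not a detail taken from the paper.

```python
import numpy as np

def quantize_per_row(W: np.ndarray, bits: int = 4):
    """Symmetric per-row quantization: q = round(W / s), dequantized value = s * q."""
    qmax = 2 ** (bits - 1) - 1                       # e.g. 7 for INT4
    s = np.abs(W).max(axis=1, keepdims=True) / qmax  # one scale per row
    q = np.clip(np.round(W / s), -qmax - 1, qmax).astype(np.int8)
    return q, s

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.standard_normal((128, 256)).astype(np.float32)
    q, s = quantize_per_row(W, bits=4)
    err = np.abs(W - s * q)
    # Per-element error stays within s/2, matching |w - s*q| <= s/2.
    assert np.all(err <= s / 2 + 1e-6)
    print("worst-case |w - s*q| relative to s/2:", float((err / (s / 2)).max()))
```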
3. General Matrix Multiply (GEMM) Pipeline
Offline Weight Packing
TurboMind implements a hardware-optimized, bank-conflict–free packing algorithm. Each weight tensor in $b$-bit format undergoes the following steps (a conceptual sketch follows the list):
- Bit-extension to 16 bits.
- Partitioning into tiles, each tile loaded via asynchronous copy (cp.async) to shared memory.
- Loading to registers using ldmatrix for correct lane alignment.
- In-register repacking and permutation for direct tensor-core MMA consumption, eliminating runtime shuffles.
- Coalesced global memory storage of packed fragments in a flat, tensor-core–aligned layout.
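The NumPy sketch below conveys the packing idea only: bit-extend, partition into tiles, permute each tile into a fragment order, and store the result contiguously. The 16x16 tile size and the transpose-based permutation are illustrative placeholders; TurboMind's actual layout is dictated by ldmatrix/MMA fragment geometry.

```python
import numpy as np

TILE = 16  # illustrative tile edge; real fragment shapes depend on the MMA instruction

def pack_int4_weights(q: np.ndarray) -> np.ndarray:
    """Pack an INT4 weight matrix (values in [-8, 7], stored as int8) into
    flat, tile-contiguous fragments so runtime loads are fully coalesced."""
    rows, cols = q.shape
    assert rows % TILE == 0 and cols % TILE == 0
    # 1) bit-extension: keep values in 16-bit registers (simulated with int16)
    q16 = q.astype(np.int16)
    # 2) partition into TILE x TILE tiles, shape (row_tiles, col_tiles, TILE, TILE)
    tiles = q16.reshape(rows // TILE, TILE, cols // TILE, TILE).transpose(0, 2, 1, 3)
    # 3) illustrative in-tile permutation standing in for the ldmatrix/MMA lane order
    perm = np.arange(TILE * TILE).reshape(TILE, TILE).T.ravel()
    packed = tiles.reshape(-1, TILE * TILE)[:, perm]
    # 4) flat, tile-contiguous storage
    return np.ascontiguousarray(packed.ravel())

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    q = rng.integers(-8, 8, size=(64, 64), dtype=np.int8)
    flat = pack_int4_weights(q)
    print(flat.shape)  # (4096,): 16 tiles of 256 packed values each
```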
Online Acceleration
At inference:
- Warps issue cp.async and ldmatrix instructions to move fragments into registers.
- Quantized fragments are brought into floating point via fused I2F dequantization and MMA:
- For each register holding a quantized value $q$ with scale $s$, compute $\hat{w} = s \cdot \mathrm{I2F}(q)$ and accumulate $C \leftarrow C + \hat{w}\,a$ via mma.sync (see the sketch after this list).
- Final matrix multiplication executes near FP16 speed, even for 4-bit weights (INT4).
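Numerically, the fused dequantize-and-accumulate step reduces to the loop below (a NumPy sketch). On the GPU the I2F conversion and FMA run on ALUs while tensor cores consume the previous fragment; this sequential code does not model that overlap, only the arithmetic.

```python
import numpy as np

def dequant_gemm(q_w: np.ndarray, s: np.ndarray, a: np.ndarray,
                 tile: int = 16) -> np.ndarray:
    """C = dequant(q_w) @ A, accumulated one K-tile at a time.
    q_w: INT4 weights stored as int8, s: per-row scales, a: FP16 activations."""
    c = np.zeros((q_w.shape[0], a.shape[1]), dtype=np.float32)
    for k0 in range(0, q_w.shape[1], tile):
        frag_q = q_w[:, k0:k0 + tile]               # quantized fragment in "registers"
        frag_w = s * frag_q.astype(np.float32)      # fused I2F + scale (dequantization)
        c += frag_w @ a[k0:k0 + tile].astype(np.float32)  # MMA accumulate
    return c

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    q_w = rng.integers(-8, 8, size=(64, 128), dtype=np.int8)
    s = rng.uniform(0.01, 0.1, size=(64, 1)).astype(np.float32)
    a = rng.standard_normal((128, 32)).astype(np.float16)
    ref = (s * q_w.astype(np.float32)) @ a.astype(np.float32)
    print(np.allclose(dequant_gemm(q_w, s, a), ref, atol=1e-3))  # True: tiling preserves the result
```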
4. Attention Pipeline
Adaptive Head Alignment
TurboMind addresses lane misalignments that arise when the query ($Q$) is FP16 but the key ($K$) is INT8 or INT4:
- Copies $Q$ from registers to row-major shared memory.
- A custom shared-to-register “rearrangement” kernel redistributes $Q$ to match the fragment layout required by the INT-$k$ $K$ operand, enabling use of ldmatrix without bank conflicts.
- This head alignment ensures the mixed-precision $QK^{\top}$ product executes at tensor-core throughput (a toy illustration of the rearrangement follows).
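A toy NumPy illustration of the rearrangement: a row-major Q tile, as staged in shared memory, is permuted into fragment-major order so that it lines up with the layout the quantized K operand already uses. The 8x8 fragment shape is an assumed stand-in for the actual ldmatrix fragment geometry.

```python
import numpy as np

FRAG = 8  # illustrative fragment edge; real ldmatrix fragments are defined per lane group

def rearrange_q(q_tile: np.ndarray) -> np.ndarray:
    """Permute a row-major Q tile into fragment-major order so each FRAG x FRAG
    fragment is contiguous, mirroring the layout of the quantized K operand."""
    r, c = q_tile.shape
    frags = q_tile.reshape(r // FRAG, FRAG, c // FRAG, FRAG).transpose(0, 2, 1, 3)
    return np.ascontiguousarray(frags.reshape(-1, FRAG * FRAG))

if __name__ == "__main__":
    q_tile = np.arange(16 * 16, dtype=np.float16).reshape(16, 16)  # row-major "shared memory"
    frag_major = rearrange_q(q_tile)
    print(frag_major.shape)   # (4, 64): four 8x8 fragments, each contiguous
    print(frag_major[0, :8])  # first row of the top-left fragment
```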
Mixed Precision Q, K, V and Attention
For arbitrary Q, K, V precisions (a numerical sketch follows the list):
- $Q$: FP16.
- $K$, $V$: quantized (e.g., INT8 or INT4) with scales $s_K$, $s_V$.
- On-the-fly dequantization: $\hat{K} = s_K \cdot \mathrm{I2F}(K_q)$, $\hat{V} = s_V \cdot \mathrm{I2F}(V_q)$.
- Attention scores: $S = Q\hat{K}^{\top}/\sqrt{d_k}$ via FP16 MMA.
- Apply softmax: $P = \operatorname{softmax}(S)$, then output $O = P\hat{V}$.
- No additional error-compensation steps beyond quantization scaling; pipeline design and I2F procedures are sufficient to minimize quantization error.
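A NumPy sketch of the dequantize-then-attend sequence for a single head with FP16 Q and INT8 K/V. The per-channel scale convention and the tensor sizes are assumptions made for illustration.

```python
import numpy as np

def mixed_precision_attention(q, k_q, s_k, v_q, s_v):
    """Attention with an FP16 query and quantized K/V, dequantized on the fly."""
    k_hat = s_k * k_q.astype(np.float32)               # K_hat = s_K * I2F(K_q)
    v_hat = s_v * v_q.astype(np.float32)               # V_hat = s_V * I2F(V_q)
    scores = q.astype(np.float32) @ k_hat.T / np.sqrt(q.shape[-1])
    p = np.exp(scores - scores.max(axis=-1, keepdims=True))
    p /= p.sum(axis=-1, keepdims=True)                 # softmax over keys
    return p @ v_hat                                    # O = softmax(S) V_hat

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    d, n = 64, 128
    q = rng.standard_normal((1, d)).astype(np.float16)  # FP16 query
    k = rng.standard_normal((n, d)).astype(np.float32)
    v = rng.standard_normal((n, d)).astype(np.float32)
    s_k = np.abs(k).max(axis=0) / 127.0                  # per-channel INT8 scales
    s_v = np.abs(v).max(axis=0) / 127.0
    k_q = np.round(k / s_k).astype(np.int8)
    v_q = np.round(v / s_v).astype(np.int8)
    out = mixed_precision_attention(q, k_q, s_k, v_q, s_v)
    print(out.shape)  # (1, 64)
```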
5. Hardware-Oriented Execution and Memory Optimization
TurboMind exploits the following hardware-aware techniques:
- Instruction-Level Parallelism (ILP): The GEMM and attention loops overlap (1) tensor-core mma.sync on tile $i$, (2) ALU I2F + FMA dequantization on tile $i+1$, and (3) cp.async prefetch of tile $i+2$. Registers are double-buffered to feed outputs directly to subsequent mma.sync instructions without stalls.
- Memory Hierarchy Management:
- Global-to-shared via cp.async for large-burst copying.
- Shared-to-register transfer using ldmatrix or equivalent to mitigate bank conflicts.
- Register-to-tensor-core execution using mma.sync native tiling.
- KV Memory Loading Pipeline:
- Each KV tile is subdivided into micro-tiles (e.g., 16 values each).
- For each micro-tile: (A) execute the attention MMA on the current slice, (B) dequantize the next slice, (C) prefetch the next full KV tile.
- Triple-buffered shared memory ensures continuous data flow with no compute stalls (a conceptual sketch of this overlap follows).
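A Python sketch of the staged overlap. On the GPU, stages (A)–(C) run concurrently on tensor cores, ALUs, and the copy engine; here they are written sequentially per iteration only to make the staggering of tiles explicit. The tile and micro-tile sizes are illustrative, not TurboMind's actual parameters.

```python
import numpy as np

TILE, MICRO = 128, 16  # illustrative: KV tile of 128 values, micro-tiles of 16

def pipelined_kv_consume(kv_q: np.ndarray, s: float) -> float:
    """Walk a quantized KV stream tile by tile: (A) compute on the current
    micro-tile, (B) dequantize the next tile, (C) 'prefetch' the tile after it.
    Returns a checksum of all dequantized values."""
    n_tiles = kv_q.size // TILE
    acc = 0.0
    current = s * kv_q[:TILE].astype(np.float32)       # prologue: first tile dequantized
    for i in range(n_tiles):
        nxt = kv_q[(i + 1) * TILE:(i + 2) * TILE]      # (C) prefetch tile i+1
        for m0 in range(0, TILE, MICRO):
            acc += float(current[m0:m0 + MICRO].sum()) # (A) compute on tile i
        if nxt.size:                                   # (B) dequantize tile i+1
            current = s * nxt.astype(np.float32)
    return acc

if __name__ == "__main__":
    rng = np.random.default_rng(4)
    kv_q = rng.integers(-8, 8, size=4 * TILE, dtype=np.int8)
    s = 0.05
    ref = float(s * kv_q.astype(np.float32).sum())
    print(abs(pipelined_kv_consume(kv_q, s) - ref) < 1e-3)  # True: pipelining preserves the result
```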
6. Empirical Performance and Comparative Evaluation
TurboMind's performance evaluation encompasses 16 LLMs (7B–235B, both dense and MoE), two quantization schemes (AWQ, GPTQ), and four GPU architectures (RTX 4090, L40S, A100, H100). Baselines for comparison include vLLM + MARLIN, TensorRT-LLM, and OmniServe + QServe.
Key empirical findings for mixed-precision workloads:
| Metric | Baseline | TurboMind advantage |
|---|---|---|
| Serving latency | vLLM + MARLIN | up to 61% lower (avg 30%) |
| Throughput | vLLM + MARLIN | up to 156% higher (avg 58%) |
| Throughput | TensorRT-LLM (7/14B) | 118.9% speedup |
| TTFT reduction | TensorRT-LLM (7/14B) | 52.2% |
| Throughput | OmniServe + QServe (W4A8KV4) | 14.1% higher |
Selected kernel-level benchmarks (Qwen 8B AWQ, W4A16KV8, A100):
- GEMM (INT4 x FP16) vs. MARLIN: 19.2% average speedup (up to 25.5%).
- Attention prefill: 22.1% average latency reduction (max 48.7%).
- Attention decode: 7.6% latency reduction (max 29.9%).
- INT4 x FP16 vs. FP16 x FP16 GEMM: 134% speedup for small batches; parity at large batches.
- Memory bandwidth utilization: up to 93% (8-bit KV), up to 95% (16-bit KV).
These results indicate consistent performance advantages across models and hardware profiles, with particular efficacy at batch sizes and quantization regimes relevant to high-throughput inference.
7. Integration with LMDeploy and Usage
TurboMind is the default inference engine in LMDeploy, accessible via the --engine turbomind option at both CLI and Python API layers. Typical usage exposes control over weight, activation, and KV precisions, as well as parallelization and batch sizing:
```python
from lmdeploy import InferenceEngine

engine = InferenceEngine(
    model_path="qwen-8b-awq",
    weight_precision=4,
    act_precision=16,
    kv_precision=8,
    engine="turbomind",
    device="cuda:0",
    tensor_parallel_degree=2,
)

outputs = engine.generate(
    prompts=[…],
    max_new_tokens=512,
    temperature=0.8,
)
```
Key CLI flags include:
- `--weight-precision=W4|W8|FP16|BF16`
- `--activation-precision=A8|A16|FP32`
- `--kv-precision=KV4|KV8|KV16`
- `--batch-size=N`
- `--engine=turbomind`
The offline packager is invoked automatically during model preparation, caching packed fragments for all supported GPU architectures so that no per-request packing latency is incurred at inference time. The LMDeploy infrastructure dispatches TurboMind kernels without user intervention, providing deterministic performance characteristics for production and research use (Zhang et al., 21 Aug 2025).