Blink: CPU-Free LLM Inference by Delegating the Serving Stack to GPU and SmartNIC

Published 8 Apr 2026 in cs.DC, cs.LG, cs.OS, cs.PF, and cs.SE | (2604.07609v1)

Abstract: LLM inference is rapidly becoming a core datacenter service, yet current serving stacks keep the host CPU on the critical path for orchestration and token-level control. This makes LLM performance sensitive to CPU interference, undermining application colocation and forcing operators to reserve CPU headroom, leaving substantial capacity unutilized. We introduce Blink, an end-to-end serving architecture that removes the host CPU from the steady-state inference path by redistributing responsibilities across a SmartNIC and a GPU. Blink offloads request handling to the SmartNIC, which delivers inputs directly into GPU memory via RDMA, and replaces host-driven scheduling with a persistent GPU kernel that performs batching, scheduling, and KV-cache management without CPU involvement. Evaluated against TensorRT-LLM, vLLM, and SGLang, Blink outperforms all baselines even in isolation, reducing pre-saturation P99 TTFT by up to 8.47$\times$ and P99 TPOT by up to 3.40$\times$, improving decode throughput by up to 2.1$\times$, and reducing energy per token by up to 48.6$\%$. Under CPU interference, Blink maintains stable performance, while existing systems degrade by up to two orders of magnitude.

Abstract PDF Upgrade to Chat

Authors (5)

Summary

The paper presents a CPU-free serving stack that offloads LLM orchestration to GPU and SmartNIC, eliminating host CPU involvement.
It employs device-side CUDA scheduling and DPU-based tokenization to achieve up to 8.47× lower tail latency and 2.1× higher throughput.
The approach ensures improved energy efficiency and performance isolation, enabling scalable and robust datacenter deployments.

CPU-Free LLM Inference: The Blink Serving Architecture

Motivation and Problem Analysis

LLM inference is a dominant datacenter service, yet the critical path in mainstream serving stacks remains host-CPU-centric. Existing systems (e.g., vLLM, SGLang, TensorRT-LLM) depend on the CPU for token-level orchestration: request admission, continuous batching, kernel dispatch, and KV-cache management. Such coupling renders inference highly sensitive to CPU resource contention, especially in multi-tenant deployments. Empirical analysis shows severe throughput degradation and tail-latency inflation under moderate CPU interference, with baseline systems retaining only 28–54% of their isolated throughput. Conventional datacenter mitigations (huge pages, core pinning, cache partitioning) are fundamentally ineffective in preventing interference-induced performance degradation due to irreducible host-side orchestration overhead.

Figure 1: Throughput comparison on Qwen-3, demonstrating that Blink is unaffected by colocation while baselines degrade; annotations show the ratio between colocated and isolated throughput.

Architecture: Eliminating the Host CPU from the Critical Path

Blink proposes a serving stack with the host CPU removed from the steady-state path. The serving responsibilities are restructured: the SmartNIC (DPU) takes over request ingress and handles protocol parsing, tokenization, and streaming, directly placing input tokens into GPU memory via RDMA. The GPU maintains a persistent CUDA scheduler kernel implementing batching, KV-cache management, and device-side graph launch, enabling continuous batching and per-token scheduling without host intervention.

Figure 2: Blink architecture, where after initialization, only DPU and GPU are involved in inference; circles trace a request’s lifecycle from network ingress to token generation.

SmartNICs (DPU) excel in network and memory bandwidth per core, making them apt for transport and request management. The GPU handles latency-critical scheduling through a persistent kernel that scans a ring buffer (shared via one-sided RDMA) for incoming requests and manages decode state transitions. Device-side CUDA graph launch (fire-and-forget mode) minimizes kernel dispatch latency and enables unbounded token generation with amortized launch-window recovery.

Figure 3: Normalized makespan for CPU- vs. GPU-resident scheduling; GPU scheduling eliminates per-token PCIe round-trip, with 1.16–1.70× reduction in makespan.

Performance Evaluation: Latency, Throughput, and Isolation

Blink is evaluated against TRT-LLM, vLLM, and SGLang across four models (Llama-3 8B, Phi-4 15B, Qwen-3 32B, Qwen-3 30B-A3B). Under isolation, Blink achieves up to 8.47× lower P99 time-to-first-token (TTFT), up to 3.40× lower P99 time-per-output-token (TPOT), and up to 2.1× higher decode throughput. For example, on Qwen-3 30B-A3B, baselines incur 3.45–8.47× higher TTFT and 1.85–3.40× higher TPOT. Device-side orchestration yields these gains by removing host scheduling and kernel dispatch from the per-token loop.

Blink’s performance remains stable under CPU interference, sustaining up to 6.46× higher decode throughput and 4.87× higher request throughput than baselines, whose throughput collapses (only retaining 28–48% of isolated capacity). Tail latency is virtually unchanged: TTFT and TPOT inflation under interference stays within 1.14× and 1.04×, respectively, for Blink, whereas baselines inflate up to 18.84× and 11.1×.

Figure 4: Energy per token under isolated and interference conditions; Blink achieves up to $48.6\%$ lower energy per token in isolation, and up to $70.7\%$ under interference.

Energy Efficiency and Tokenization Performance

Blink’s efficiency translates directly to lower energy consumption. It achieves up to 48.6% lower energy per token in isolation and up to 70.7% under interference compared to baseline systems, which is attributable to stable throughput as wall power remains comparable across systems.

Tokenization and prompt handling are offloaded to the DPU. Blink’s tokenizer, implemented on ARM A78 cores, outperforms HuggingFace’s and llama.cpp’s tokenizers on Xeon by 8–19.7× across input lengths, avoiding a bottleneck even at scale.

Figure 5: Tokenization latency comparison on BlueField-3 ARM vs. Intel Xeon; Blink delivers 8–19.7× speedup over HuggingFace.

System Extensibility and Practical Deployment

Blink is compatible with any model compiled for TensorRT and supports dense, MoE, and encoder-decoder architectures with no modifications. Scheduling policies (e.g., FCFS, continuous batching) and device-side optimizations (chunked prefill, prefix caching, speculative decoding, KV offloading) are orthogonal and integrable within Blink's GPU-resident scheduler. Multi-GPU extensions (tensor/pipeline parallelism) require replica schedulers and GPU-native communication primitives (NCCL, IBGDA) without violating CPU-free critical path constraints.

Blink’s OpenAI-compatible HTTP interface and SSE streaming enable drop-in deployment. The architecture cleanly decouples provisioning/init (CPU involvement) from steady-state inference (DPU+GPU only), facilitating server consolidation and resource sharing.

Prior SmartNIC-based inference systems (Lynx, SplitRPC, Conspirator, OS2G) operate on one-shot or job-level granularity and retain host CPU on the critical path, lacking support for autoregressive LLM features like token-level continuous batching and KV-cache management. Typical optimizations focus on improving host-driven orchestration (continuous batching, memory efficiency) whereas Blink eliminates host orchestration entirely, establishing persistent accelerator-native control.

Conclusion

Blink demonstrates that CPU-free LLM inference serving is not only feasible but highly advantageous in throughput, latency, isolation, and energy efficiency. By restructuring the serving stack to delegate control and data plane functions to the GPU and SmartNIC DPU, Blink achieves strong numerical improvements—up to 8.47× lower tail latency, 2.1× higher throughput, and up to 70.7% energy reduction under interference. Its architectural decoupling guarantees performance isolation and unlocks server consolidation as datacenter scale increases, providing a robust foundation for future AI deployment frameworks (2604.07609).

Markdown Report Issue