Kernel Looping: Eliminating Synchronization Boundaries for Peak Inference Performance
Abstract: Token generation speed is critical to powering the next wave of AI inference applications. GPUs significantly underperform during token generation due to synchronization overheads at kernel boundaries, utilizing only 21% of their peak memory bandwidth. While recent dataflow architectures mitigate these overheads by enabling aggressive fusion of decoder layers into a single kernel, they too leave performance on the table due to synchronization penalties at layer boundaries. This paper presents kernel looping, a specialized global optimization technique that exploits the combination of two properties: the unique layer-level fusion possible in modern dataflow architectures, and the repeated layer structure found in LLMs. Kernel looping eliminates synchronization costs between consecutive calls to the same kernel by transforming these calls into a single call to a modified kernel containing a pipelined outer loop. We evaluate kernel looping on the SambaNova SN40L Reconfigurable Dataflow Unit (RDU), a commercial dataflow accelerator for AI. Experiments demonstrate that kernel looping speeds up the decode phase of a wide array of powerful open-source models by up to 2.2$\times$ on SN40L. Kernel looping also allows decode performance to scale over multiple SN40L sockets, achieving speedups of up to 2.5$\times$. Finally, kernel looping enables SN40L to achieve over 90% of peak performance on 8 and 16 sockets and a speedup of up to 3.7$\times$ over DGX H100. Kernel looping, as well as the models evaluated in this paper, is deployed in production in a commercial AI inference cloud.
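The core transformation described above can be illustrated with a minimal sketch. All names here are hypothetical and the kernels are stand-ins; this is not the SN40L compiler API, only a model of how N per-layer kernel calls (each ending in a synchronization barrier) become one call containing an outer loop over layers, with a single barrier at exit.

```python
# Hypothetical sketch of the kernel-looping transformation.
# decoder_layer stands in for one fused decoder-layer kernel.

def decoder_layer(x, weights):
    # Toy "kernel": elementwise multiply by this layer's weights.
    return [xi * w for xi, w in zip(x, weights)]

def run_unfused(x, layer_weights):
    # Before kernel looping: one kernel call per decoder layer,
    # with a synchronization barrier at every kernel boundary.
    syncs = 0
    for w in layer_weights:
        x = decoder_layer(x, w)
        syncs += 1  # barrier after each call
    return x, syncs

def run_kernel_looped(x, layer_weights):
    # After kernel looping: a single call to a modified kernel whose
    # body contains a (conceptually pipelined) outer loop over layers.
    def looped_kernel(x):
        for w in layer_weights:  # outer loop moved inside the kernel
            x = decoder_layer(x, w)
        return x
    return looped_kernel(x), 1  # one synchronization instead of N
```

Both versions compute the same result; the looped form replaces N kernel-boundary synchronizations with one, which is the source of the decode-phase speedups the paper reports.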