
Fused Triton Kernels in LLM Optimization

Updated 7 January 2026
  • Fused Triton kernels are single-launch GPU implementations that merge multiple tensor operations to reduce launch overhead and memory traffic.
  • They integrate operations such as normalization, activation, embedding, and collective communication to achieve significant speedups and memory savings in LLM workloads.
  • Benchmark studies report up to 8× speed improvements and over 50% memory reduction, enabling efficient transformer block and distributed training optimizations.

Fused Triton kernels are single-launch GPU kernel implementations developed in the Triton language, designed to combine multiple tensor operations that would otherwise be performed as separate kernels. By fusing computation steps such as linear, normalization, activation, embedding, and collective communication operations, these kernels achieve substantial reductions in launch overhead, global memory traffic, and latency, leading to higher throughput and efficiency for LLM and foundation model training and inference workloads. Fused Triton kernels are extensively developed and benchmarked in open-source libraries such as Liger-Kernel, and their scope now encompasses transformer blocks, collective communication operators, quantized matrix multiplication, and advanced attention mechanisms (Hsu et al., 2024, Punniyamurthy et al., 2023, Hoque et al., 2024, Roy et al., 3 Jul 2025, Ringlein et al., 7 Oct 2025).

1. Principles of Operation Fusion in Triton

Operation fusion refers to combining what would otherwise be multiple GPU kernels—each handling a specific tensor operation—into a single, just-in-time (JIT) compiled kernel that performs all steps in one pass over the data. In Triton, operation fusion is achieved by writing a kernel that:

  • Launches once for a given input, avoiding repeated launch latencies.
  • Loads each input element exactly once from high-bandwidth memory (HBM) into on-chip storage, performs multiple arithmetic transformations, then writes outputs once.
  • Minimizes the need to materialize and store large intermediate tensors between operations.

This strategy increases arithmetic intensity and amortizes HBM↔SRAM/register data movement across several computations. For LLM training, where a forward and backward pass can trigger hundreds of small tensor operations (each incurring kernel launches and separate loads/stores in naïve frameworks), such fusion directly reduces per-operation overhead and memory pressure (Hsu et al., 2024). In distributed contexts, fusion can also overlap communication collectives with computation by issuing network transactions directly from GPU workgroups, reducing end-to-end latency for workloads such as embedding All-to-All or GEMV + AllReduce (Punniyamurthy et al., 2023).
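To make the pattern concrete, the following minimal sketch fuses a bias addition and a SiLU activation, two steps that an unfused framework would dispatch as separate kernels, into a single launch with one load and one store per element. The names, block size, and contiguous row-major layout are assumptions for illustration, not code from the cited libraries.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def fused_bias_silu_kernel(x_ptr, bias_ptr, out_ptr, n_elements, n_cols,
                           BLOCK_SIZE: tl.constexpr):
    # Illustrative sketch (names assumed): each program handles one contiguous
    # block of elements -- load once, compute bias add + SiLU in registers, store once.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements                      # boundary mask for ragged tails
    x = tl.load(x_ptr + offsets, mask=mask, other=0.0)
    b = tl.load(bias_ptr + (offsets % n_cols), mask=mask, other=0.0)
    z = x + b                                        # bias add (kernel #1 if unfused)
    y = z * tl.sigmoid(z)                            # SiLU     (kernel #2 if unfused)
    tl.store(out_ptr + offsets, y, mask=mask)

def fused_bias_silu(x: torch.Tensor, bias: torch.Tensor) -> torch.Tensor:
    # assumes x is contiguous with last dimension equal to bias.numel()
    out = torch.empty_like(x)
    n_elements = x.numel()
    grid = (triton.cdiv(n_elements, 1024),)
    fused_bias_silu_kernel[grid](x, bias, out, n_elements, x.shape[-1], BLOCK_SIZE=1024)
    return out
```

The same structure generalizes: any chain of elementwise operations can be folded into the region between the tl.load and tl.store calls without adding memory traffic or kernel launches.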

2. Fused Kernel Construction and Covered Operations

Liger-Kernel provides fused implementations for nearly all core transformer and MLP components:

  • RMSNorm and LayerNorm: Forward kernels compute normalization and gain/shift in a single loop; backward kernels cache expensive reductions (inverse standard deviation) and implement gradient aggregation in two stages to avoid atomics (Hsu et al., 2024). A minimal forward-pass sketch appears after this list.
  • RoPE (Rotary Positional Embedding): Block-diagonal structure in the rotations is leveraged to compute all operations in one pass, eliminating materialization of intermediate states.
  • Fused MLP Non-linearities (SwiGLU, GeGLU): Kernels combine the two linear projections (matmuls), two bias additions, a non-linear activation (SiLU or GELU), and the elementwise output product. Forward pseudocode from Liger-Kernel illustrates how loads, matmuls, bias additions, activation, and fused output writes are interleaved within tiled loops to maximize data reuse and limit register pressure.
  • CrossEntropy/Softmax: A single fused kernel computes online softmax, replaces input logits with gradients in place, and elides both intermediate softmax matrix storage and multiple loads/stores.
  • FusedLinearCrossEntropy (FLCE): Integrates final projection, logits chunking, and cross-entropy in one pass using chunked matmuls and local accumulation, critical for extreme-vocabulary models.
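As a concrete illustration of the normalization pattern above, here is a minimal fused RMSNorm forward sketch: one program per row, the squared-mean reduction and scaling performed in registers, and a single load and store per element. This is a simplified sketch (one block per row, forward only), not the Liger-Kernel implementation; names, block sizing, and the row-major layout are assumptions.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def rmsnorm_fwd_kernel(x_ptr, gamma_ptr, out_ptr, n_cols, eps,
                       BLOCK_SIZE: tl.constexpr):
    # Simplified illustration (not the library kernel): one program normalizes
    # one row -- load the row once, reduce, scale, store once.
    row = tl.program_id(0)
    cols = tl.arange(0, BLOCK_SIZE)
    mask = cols < n_cols
    x = tl.load(x_ptr + row * n_cols + cols, mask=mask, other=0.0).to(tl.float32)
    # mean of squares + epsilon, then reciprocal sqrt
    # (this is the reduction a backward kernel would cache)
    mean_sq = tl.sum(x * x, axis=0) / n_cols
    rstd = 1.0 / tl.sqrt(mean_sq + eps)
    gamma = tl.load(gamma_ptr + cols, mask=mask, other=0.0).to(tl.float32)
    y = x * rstd * gamma
    tl.store(out_ptr + row * n_cols + cols,
             y.to(out_ptr.dtype.element_ty), mask=mask)

def rmsnorm_fwd(x: torch.Tensor, gamma: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    n_rows, n_cols = x.shape
    out = torch.empty_like(x)
    BLOCK_SIZE = triton.next_power_of_2(n_cols)   # whole row in one block for simplicity
    rmsnorm_fwd_kernel[(n_rows,)](x, gamma, out, n_cols, eps, BLOCK_SIZE=BLOCK_SIZE)
    return out
```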

Other research extends the Triton fusion principle to quantized GEMMs with dequantization, paged-attention (QKᵀ→softmax→V) for LLM inference, 2-simplicial (trilinear) attention, and compute-collective fusion for distributed learning (Hoque et al., 2024, Roy et al., 3 Jul 2025, Ringlein et al., 7 Oct 2025, Punniyamurthy et al., 2023).

| Kernel Type | Fused Operations | Impact (A100 80GB) |
|---|---|---|
| RMSNorm | Norm, scale | 7× faster, 3× less memory |
| GeGLU/SwiGLU | Dual matmul, bias, activation, output combine | ≈ same speed, 1.6× less memory |
| CrossEntropy | Softmax, CE gradient | 3× faster, 5× less memory |
| RoPE | Full embedding rotation | 8× faster, 3× less memory |
| FLCE | Last-layer matmul + CE, chunked accumulation | >2× less memory |

3. Mathematical and Algorithmic Patterns

Fused kernels explicitly implement multi-step mathematical expressions with a focus on minimizing intermediate storage and maximizing in-register or shared-memory computation. Examples include:

  • Fused normalization:

\hat{x} = \frac{x}{\sqrt{\frac{1}{n}\sum_{i=1}^{n} x_i^2 + \epsilon}}, \qquad y = \hat{x} \odot \gamma

  • SwiGLU fused forward:

x_1 = W x + b, \qquad x_2 = V x + c, \qquad y = \mathrm{SiLU}(x_1) \odot x_2

  • Fused CrossEntropy:

y = \mathrm{softmax}(x), \qquad \nabla_x \mathcal{L} = y - t

  • 2-simplicial (trilinear) attention:

A_{b,h,i,j,k} = \frac{1}{\sqrt{d}} \sum_{\ell} Q_{b,h,i,\ell}\, K_{b,h,j,\ell}\, K'_{b,h,k,\ell}

S_{b,h,i,j,k} = \frac{\exp(A_{b,h,i,j,k})}{\sum_{j',k'} \exp(A_{b,h,i,j',k'})}, \qquad O_{b,h,i,\ell} = \sum_{j,k} S_{b,h,i,j,k}\, V_{b,h,j,\ell}\, V'_{b,h,k,\ell}

These computations are implemented as tightly tiled loops with explicit handling of register limits, online reductions for numerical stability (online max/logsumexp), and output accumulation. The kernels use Triton primitives such as tl.dot and tl.load/tl.store (with masking), sometimes with custom handling for quantization or collective communication.
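A minimal sketch of the fused cross-entropy pattern follows: one program per row computes a numerically stable logsumexp, emits the per-row loss, and overwrites the logits with the gradient softmax(x) − t in place, so the softmax matrix is never materialized. This is a simplified single-block-per-row illustration with assumed names; the FLCE kernel described above additionally chunks the vocabulary and fuses the final projection, which is omitted here.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def fused_ce_kernel(logits_ptr, targets_ptr, loss_ptr, n_cols,
                    BLOCK_SIZE: tl.constexpr):
    # Simplified illustration (not a library kernel): one program per row computes
    # a stable logsumexp, writes the per-row loss, and replaces the logits with
    # the gradient softmax(x) - t in place.
    row = tl.program_id(0)
    cols = tl.arange(0, BLOCK_SIZE)
    mask = cols < n_cols
    x = tl.load(logits_ptr + row * n_cols + cols, mask=mask,
                other=-float("inf")).to(tl.float32)
    row_max = tl.max(x, axis=0)
    exp_x = tl.exp(x - row_max)
    sum_exp = tl.sum(exp_x, axis=0)
    logsumexp = row_max + tl.log(sum_exp)
    target = tl.load(targets_ptr + row)
    target_logit = tl.sum(tl.where(cols == target, x, 0.0), axis=0)
    tl.store(loss_ptr + row, logsumexp - target_logit)
    # gradient: softmax(x) minus one-hot target, stored back over the logits
    grad = exp_x / sum_exp - tl.where(cols == target, 1.0, 0.0)
    tl.store(logits_ptr + row * n_cols + cols,
             grad.to(logits_ptr.dtype.element_ty), mask=mask)

def fused_cross_entropy(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    n_rows, n_cols = logits.shape
    loss = torch.empty(n_rows, device=logits.device, dtype=torch.float32)
    BLOCK_SIZE = triton.next_power_of_2(n_cols)
    fused_ce_kernel[(n_rows,)](logits, targets, loss, n_cols, BLOCK_SIZE=BLOCK_SIZE)
    # logits now hold per-row dL/dlogits (not yet divided by the batch size)
    return loss.mean()
```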

4. Empirical Performance and Benchmarks

Systematic benchmarks across major transformer and LLM models highlight the quantitative gains of fused Triton kernels over unfused baselines:

  • RMSNorm (hidden=16,384): 7× speedup, 3× memory reduction per operation (Hsu et al., 2024).
  • CrossEntropy (vocab=163,840): 3× faster, 5× lower memory.
  • RoPE: 8× faster, 3× less memory.
  • FusedLinearCrossEntropy: >2× less memory, prevents out-of-memory (OOM) for multi-head scenarios in Medusa.
  • End-to-end LLM training, 4×A100:
    • LLaMA-3-8B, batch=64: +42.8% throughput, −54.8% GPU memory.
    • Qwen2, batch=48: +25.5% throughput, −56.8% memory.
    • Gemma7B, batch=48: +11.9% throughput, −51.8% memory.
    • Mistral7B, batch=128: +27% throughput, −21% memory.

Benchmarks for distributed and communication-fused kernels show:

  • Embedding + All-to-All: up to 1.45× speedup (inter-node), 1.25× (intra-node).
  • GEMV + AllReduce: up to 1.21× speedup.
  • 128-node DLRM forward+backward: ~21% lower end-to-end latency (Punniyamurthy et al., 2023).

For quantized inference, fused dequant + GEMM using SplitK achieves 64–124% average speedup (with up to 295% peak) on H100, especially for skinny matrix shapes (Hoque et al., 2024). Fused paged-attention kernels achieve >90% of peak TFLOPS and >80% peak bandwidth (Ringlein et al., 7 Oct 2025).

5. Notable Implementation Challenges and Remedies

Several domain- and hardware-specific challenges arise in developing robust, performant fused kernels:

  • Numerical Stability: Use of an explicit small ε in normalization, online softmax logsumexp across tiles, and “safe log” in fused loss (Hsu et al., 2024, Roy et al., 3 Jul 2025).
  • Register and Shared-Memory Pressure: Careful block sizing (BLOCK, BLOCK_Q, BLOCK_KV, etc.) and in-register reduction caches for backward passes.
  • Gradient Aggregation: Two-stage reduction (as in LayerNorm) to avoid expensive global atomics (Hsu et al., 2024).
  • Collective/Communication Fusion: Integration of ROC_SHMEM/NVSHMEM primitives as Triton builtins for direct GPU-initiated RDMA-puts and in-kernel flags/fences (Punniyamurthy et al., 2023).
  • SplitK Decomposition: For quantized GEMM, partial sums are atomically accumulated across multiple K-splits, with the split_k factor tuned to balance added parallelism against atomic-update overhead (Hoque et al., 2024); a minimal sketch follows this list.
  • Masking and Variable Dimensions: All loads/stores protected with boundary masks for sizes not divisible by tile/block parameters (Hsu et al., 2024, Roy et al., 3 Jul 2025).
  • Auto-tuning: Offline micro-benchmarking and heuristic grid mapping to choose block sizes and hardware knobs (num_warps, num_stages, etc.), plus static grid selection for inference graphs (Ringlein et al., 7 Oct 2025).
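As an illustration of the SplitK pattern, the sketch below splits the K dimension of a GEMM across a second grid axis and atomically accumulates partial tiles into the output with tl.atomic_add. Block sizes, names, and row-major layouts are assumptions, and the dequantization step of the actual W4A16 kernels is omitted.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def splitk_matmul_kernel(a_ptr, b_ptr, c_ptr, M, N, K,
                         BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr,
                         BLOCK_K: tl.constexpr, SPLIT_K: tl.constexpr):
    # Illustrative SplitK sketch (dequantization omitted): grid axis 0 tiles (M, N),
    # grid axis 1 splits K. Each program computes a partial C tile over its K slice
    # and atomically adds it into global memory.
    pid = tl.program_id(0)
    pid_k = tl.program_id(1)
    num_pid_n = tl.cdiv(N, BLOCK_N)
    pid_m = pid // num_pid_n
    pid_n = pid % num_pid_n
    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    offs_k = pid_k * BLOCK_K + tl.arange(0, BLOCK_K)
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for _ in range(0, tl.cdiv(K, BLOCK_K * SPLIT_K)):
        a = tl.load(a_ptr + offs_m[:, None] * K + offs_k[None, :],
                    mask=(offs_m[:, None] < M) & (offs_k[None, :] < K), other=0.0)
        b = tl.load(b_ptr + offs_k[:, None] * N + offs_n[None, :],
                    mask=(offs_k[:, None] < K) & (offs_n[None, :] < N), other=0.0)
        acc += tl.dot(a, b)
        offs_k += BLOCK_K * SPLIT_K          # each split strides over its own K slices
    c_ptrs = c_ptr + offs_m[:, None] * N + offs_n[None, :]
    c_mask = (offs_m[:, None] < M) & (offs_n[None, :] < N)
    tl.atomic_add(c_ptrs, acc, mask=c_mask)  # partial sums from all K-splits land here

def splitk_matmul(a: torch.Tensor, b: torch.Tensor, split_k: int = 4) -> torch.Tensor:
    M, K = a.shape
    _, N = b.shape
    c = torch.zeros((M, N), device=a.device, dtype=torch.float32)  # must start at zero
    grid = (triton.cdiv(M, 64) * triton.cdiv(N, 64), split_k)
    splitk_matmul_kernel[grid](a, b, c, M, N, K,
                               BLOCK_M=64, BLOCK_N=64, BLOCK_K=32, SPLIT_K=split_k)
    return c
```

Larger split_k values help skinny shapes with few output tiles keep more SMs busy, at the cost of more atomic traffic on the output.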

6. Research Extensions and Ecosystem Integration

Fused Triton kernels have catalyzed method development and production deployment in multiple domains:

  • Distributed Model Parallelism: Custom collectives fused with computation are exposed as first-class PyTorch operators (e.g., torch.embedding_all2all, torch.gemv_allreduce) and as Triton-extendable primitives (e.g., tl.contrib.rocm.put) (Punniyamurthy et al., 2023).
  • Quantized Model Inference: Fused W4A16 kernels with SplitK are critical for efficient deployment of LLMs with very wide and skinny layers (Hoque et al., 2024).
  • Advanced Attention Mechanisms: Paged attention kernels and novel architectures like 2-simplicial Transformers are implemented as single-pass Triton kernels to maximize cache and register re-use, enabling performance matching or exceeding FlashAttention-class methods on both NVIDIA and AMD hardware (Roy et al., 3 Jul 2025, Ringlein et al., 7 Oct 2025).
  • Cross-vendor and Production Integration: These techniques form the foundation of open-source inference backends such as vLLM Triton (github.com/vllm-project/vllm-triton-lib), ensuring cross-architecture portability and eliminating the need for vendor-specific or hand-tuned CUDA (Ringlein et al., 7 Oct 2025).
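To illustrate how such fused kernels surface as ordinary framework operators, the sketch below wraps a small fused SwiGLU-style elementwise kernel in a torch.autograd.Function. This is a generic integration pattern under stated assumptions, not any library's actual API; a production library would pair the forward with a fused backward kernel rather than the eager recomputation used here for brevity.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def silu_mul_kernel(a_ptr, b_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Illustrative fused SwiGLU-style elementwise stage: out = SiLU(a) * b in one launch.
    pid = tl.program_id(0)
    offs = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offs < n_elements
    a = tl.load(a_ptr + offs, mask=mask, other=0.0)
    b = tl.load(b_ptr + offs, mask=mask, other=0.0)
    tl.store(out_ptr + offs, a * tl.sigmoid(a) * b, mask=mask)

class SiLUMul(torch.autograd.Function):
    """Exposes the fused kernel so it composes with PyTorch autograd like any
    built-in operator. The backward is recomputed in eager PyTorch purely to
    keep the sketch short; a kernel library would ship a fused backward too."""

    @staticmethod
    def forward(ctx, a, b):
        out = torch.empty_like(a)
        n = a.numel()
        silu_mul_kernel[(triton.cdiv(n, 1024),)](a, b, out, n, BLOCK_SIZE=1024)
        ctx.save_for_backward(a, b)
        return out

    @staticmethod
    def backward(ctx, grad_out):
        a, b = ctx.saved_tensors
        s = torch.sigmoid(a)
        grad_a = grad_out * b * (s + a * s * (1 - s))  # d SiLU(a) / da
        grad_b = grad_out * (a * s)
        return grad_a, grad_b

# usage: behaves like a normal differentiable op inside a model's MLP block
# y = SiLUMul.apply(x @ W.t(), x @ V.t())
```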

Fused Triton kernels are driving the next generation of high-efficiency, low-latency large-scale machine learning infrastructure, providing blueprints for research and production-grade deployment. Current active work continues to generalize fusion patterns for new architectural primitives, distributed operations, and quantized model types, making fused Triton kernels a cornerstone technique in LLM and foundation model engineering.
