
Liger-Kernel: Efficient Triton Kernels for LLMs

Updated 20 November 2025
  • Liger-Kernel is an open-source suite of Triton GPU kernels that significantly improves LLM training throughput and reduces GPU memory usage.
  • It employs fused operations and input chunking to minimize memory allocations and accelerate compute-intensive tasks while ensuring numerical accuracy.
  • Its modular design enables easy integration via auto and model-specific patchers, supporting workflows in PyTorch FSDP and DeepSpeed.

Liger-Kernel is an open-source suite of GPU kernels implemented in Triton's Python-like DSL, specifically engineered for efficient training of LLMs. By replacing standard PyTorch operations with highly optimized Triton kernels, Liger-Kernel provides substantial improvements in end-to-end training throughput and GPU memory utilization for popular LLMs. Its design prioritizes modularity, extensibility, and integration into standard LLM workflows while ensuring numerical correctness and convergence parity with conventional PyTorch training protocols. Liger-Kernel is distributed under a permissive license and is compatible with major distributed training frameworks including PyTorch FSDP and DeepSpeed ZeRO (Hsu et al., 14 Oct 2024).

1. Architectural Design and Triton Integration

Liger-Kernel comprises a collection of Triton GPU kernels, each intended as a drop-in substitute for memory- and compute-critical PyTorch operations in LLM architectures. Each kernel accepts flattened tensors (for example, input shapes $(B, T, H)$ are reshaped to $(B \cdot T, H)$) and parallelizes computation over rows of length $H$. Fused operations, such as matrix multiplication combined with activation functions, are performed entirely in on-chip SRAM, minimizing high-bandwidth DRAM access and avoiding unnecessary intermediate tensors.
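
As a minimal illustration of this flattening convention (a sketch with arbitrary shapes, not Liger code), a batched activation tensor is viewed as a 2-D matrix whose rows can be processed independently by a row-parallel kernel:

import torch

# Hypothetical shapes: a (B, T, H) activation is flattened to (B*T, H) so a
# row-parallel kernel can assign one program instance per row of length H.
B, T, H = 4, 2048, 4096
x = torch.randn(B, T, H)
rows = x.view(B * T, H).contiguous()  # shape (B*T, H); kernels iterate over rows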

Integration is provided via:

  • AutoLigerKernelForCausalLM, supporting zero-code patching for any compatible causal LM.
  • Model-specific patchers (e.g., apply_liger_kernel_to_llama()), allowing targeted integration.
  • Liger primitives (e.g., LigerLayerNorm, LigerCrossEntropyLoss), composable for custom module definitions.

Out-of-the-box compatibility is maintained for PyTorch FSDP and DeepSpeed ZeRO/ZeRO++ pipelines. Only PyTorch and Triton are required, streamlining adoption in existing workflows.
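
For example, the following sketch (assuming torch.distributed has already been initialized for the current process and "my-model-path" points to a supported checkpoint) wraps a Liger-patched model in FSDP exactly as one would an unpatched model:

from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from liger_kernel.transformers import AutoLigerKernelForCausalLM

# Load a supported causal LM with Liger kernels patched in, then shard it with FSDP.
model = AutoLigerKernelForCausalLM.from_pretrained("my-model-path")
model = FSDP(model.cuda())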

2. Kernel-Level Optimization Strategies

A. Operation Fusion

Operation fusion targets the elimination of intermediate memory allocations and global-memory traffic by combining sequences of pointwise and linear computations. For example, RMSNorm's forward and backward passes are each implemented as a single fused Triton kernel:

  • Forward:

$$y = \frac{x}{\operatorname{RMS}(x)} \odot \gamma, \qquad \operatorname{RMS}(x) = \sqrt{\frac{1}{n} \sum_i x_i^2 + \epsilon}$$

  • Backward:

$$\nabla_x \mathcal{L} = \frac{1}{\operatorname{RMS}(x)} \left[\nabla_y \mathcal{L} \odot \gamma - \frac{\hat{x}^\top (\nabla_y \mathcal{L} \odot \gamma)}{n}\, \hat{x}\right], \qquad \nabla_\gamma \mathcal{L} = \hat{x} \odot \nabla_y \mathcal{L}$$

where $\hat{x} = x / \operatorname{RMS}(x)$.
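
A pure-PyTorch transcription of these formulas (an illustrative reference, not the Triton kernel itself; the per-row $\nabla_\gamma \mathcal{L}$ contributions are summed over rows to form the weight gradient) looks as follows:

import torch

def rmsnorm_forward_backward_reference(x, gamma, grad_y, eps=1e-6):
    # x, grad_y: (rows, n); gamma: (n,). Returns y, dL/dx, dL/dgamma.
    rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)
    x_hat = x / rms                                          # normalized input
    y = x_hat * gamma                                        # forward pass

    g = grad_y * gamma                                       # dL/dy ⊙ gamma
    grad_x = (g - x_hat * (x_hat * g).mean(dim=-1, keepdim=True)) / rms
    grad_gamma = (x_hat * grad_y).sum(dim=0)                 # sum per-row contributions
    return y, grad_x, grad_gamma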

Fused kernels are similarly implemented for SwiGLU/GeGLU activations and for rotary positional embeddings (RoPE), combining query/key rotations per token position.
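For reference, these are the unfused PyTorch equivalents of the gating patterns that the fused activation kernels compute in a single pass (here `a` and `b` denote the already-projected gate and up branches; this is a sketch, not Liger's kernel code):

import torch.nn.functional as F

def swiglu_reference(a, b):
    # SwiGLU gating: SiLU(a) ⊙ b, computed unfused for clarity.
    return F.silu(a) * b

def geglu_reference(a, b):
    # GeGLU gating: GELU(a) ⊙ b (tanh approximation shown).
    return F.gelu(a, approximate="tanh") * b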

B. Input Chunking and FLCE

For operations where the output dimension scales with vocabulary size ($O(B \cdot T \cdot V)$), Liger-Kernel applies input chunking: the hidden-state matrix $H \in \mathbb{R}^{(B \cdot T) \times H}$ is partitioned along the row dimension, and computation occurs over chunks. The Fused Linear + CrossEntropy kernel (FLCE) projects each chunk, computes the cross-entropy loss in place, and accumulates gradients without instantiating the full logits tensor. Heuristics are provided for chunk size selection:

$$\text{chunk\_size} = 2^{\left\lceil \log_2 \left( \left\lceil (B \cdot T) / \lceil V/H \rceil \right\rceil \right) \right\rceil}$$
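
The sketch below illustrates the chunking idea and the heuristic above in plain PyTorch (an approximation of the approach, not the FLCE kernel itself; the real kernel also computes gradients in place per chunk rather than relying on autograd):

import math
import torch
import torch.nn.functional as F

def chunked_linear_cross_entropy(hidden, lm_head_weight, labels):
    # hidden: (B*T, H); lm_head_weight: (V, H); labels: (B*T,)
    BT, H = hidden.shape
    V = lm_head_weight.shape[0]
    chunk_size = 2 ** math.ceil(math.log2(math.ceil(BT / math.ceil(V / H))))
    loss_sum = hidden.new_zeros(())
    for start in range(0, BT, chunk_size):
        h_chunk = hidden[start:start + chunk_size]        # (chunk, H)
        logits = h_chunk @ lm_head_weight.t()             # (chunk, V); full (B*T, V) never materialized
        loss_sum = loss_sum + F.cross_entropy(
            logits, labels[start:start + chunk_size], reduction="sum"
        )
    return loss_sum / BT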

C. Kernel Engineering Practices

Performance and correctness are ensured via the following practices (the last two are illustrated in the kernel sketch after this list):

  • Testing both power-of-two and irregular tensor shapes.
  • Strict numerical tolerances (FP32: $\text{atol}=10^{-7}$, $\text{rtol}=10^{-5}$; BF16: $\text{atol}=10^{-3}$, $\text{rtol}=10^{-2}$).
  • Enforcing tensor contiguity before kernel launch.
  • Promoting program_id to int64 for large row counts.
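
A minimal Triton kernel sketch (an illustrative example, not a Liger kernel) showing contiguity enforcement on the host side and int64 promotion of the program id inside the kernel:

import torch
import triton
import triton.language as tl

@triton.jit
def _scale_rows_kernel(x_ptr, out_ptr, n_cols, BLOCK_SIZE: tl.constexpr):
    # One program instance per row; promote the program id to int64 so the
    # row offset cannot overflow int32 for very large row counts.
    row = tl.program_id(0).to(tl.int64)
    offs = tl.arange(0, BLOCK_SIZE)
    mask = offs < n_cols
    x = tl.load(x_ptr + row * n_cols + offs, mask=mask)
    tl.store(out_ptr + row * n_cols + offs, x * 2.0, mask=mask)

def scale_rows(x: torch.Tensor) -> torch.Tensor:
    x = x.contiguous()                        # enforce contiguity before launch
    out = torch.empty_like(x)
    BLOCK_SIZE = triton.next_power_of_2(x.shape[1])
    _scale_rows_kernel[(x.shape[0],)](x, out, x.shape[1], BLOCK_SIZE=BLOCK_SIZE)
    return out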

3. Empirical Evaluation and Benchmarks

A. Microbenchmarks

All microbenchmarks were conducted on A100-80GB GPUs; reported figures are the median of ten runs:

  • CrossEntropy kernel (vocab = 163,840): $\sim 3\times$ faster and $\sim 5\times$ lower peak memory than torch.nn.CrossEntropyLoss.
  • GeGLU/SwiGLU (sequence length = 16384): comparable speed to baseline, $\sim 1.6\times$ lower peak memory.
  • RMSNorm ($H = 16384$): $\sim 7\times$ faster, $\sim 3\times$ lower peak memory.
  • LayerNorm: $\sim 30\%$ faster, negligible change in memory use.
  • RoPE ($H = 16384$): $\sim 8\times$ faster, $\sim 3\times$ lower peak memory.

B. LLM Fine-Tuning Throughput and Memory

End-to-end fine-tuning (4xA100, BFloat16, Alpaca dataset, sequence length 512) demonstrates substantial throughput and memory improvements:

| Model | Batch Size | Throughput Increase | GPU Memory Reduction |
|---|---|---|---|
| LLaMA-3-8B | 64 | +42.8% | −54.8% |
| Qwen2 | 48 | +25.5% | −56.8% |
| Gemma-7B | 48 | +11.9% | −51.8% |
| Mistral-7B | 128 | +27.0% | −21.0% |
| Phi-3 | 128 | +17.0% | −13.0% |

C. Medusa Multi-Token Heads

Without Liger-Kernel FLCE, large multi-head logits in Medusa cause out-of-memory (OOM) errors in both new-head initialization and full-model fine-tuning. Liger-FLCE reduces peak memory by approximately $2\times$ and increases throughput by 15–30%, depending on head count (Hsu et al., 14 Oct 2024).

4. Modularity and Extensibility

Liger-Kernel offers three defined usage tiers:

  1. Auto patching: Via AutoLigerKernelForCausalLM.from_pretrained(), all supported modules in a pretrained model are automatically replaced.
  2. Model-specific patching functions: For example, apply_liger_kernel_to_llama() injects Liger kernels into LLaMA model architectures.
  3. Primitive composition: Researchers can construct custom modules by composing Liger primitives (e.g., LigerLayerNorm, LigerCrossEntropyLoss, LigerSwiGLU).

Each primitive maintains parameter parity with its torch counterpart (e.g., $\gamma$, $\beta$, weights, biases). New kernels can be contributed by following the suite template: provide a pure-PyTorch reference, random-shape and tolerance tests, and a mini convergence run to verify learning dynamics.
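
A hypothetical contribution-style test following that template (placeholder op and names, not from Liger's test suite; shapes include a non-power-of-two case, and tolerances follow the values listed in Section 2):

import pytest
import torch

def reference_softcap(x, cap=30.0):
    # Pure-PyTorch reference for a placeholder "soft-capping" op.
    return cap * torch.tanh(x / cap)

def candidate_softcap(x, cap=30.0):
    # Stand-in for the new (e.g., Triton-backed) implementation under test.
    return cap * torch.tanh(x / cap)

@pytest.mark.parametrize("rows,cols", [(128, 4096), (131, 4093)])
@pytest.mark.parametrize("dtype,atol,rtol", [
    (torch.float32, 1e-7, 1e-5),
    (torch.bfloat16, 1e-3, 1e-2),
])
def test_softcap_matches_reference(rows, cols, dtype, atol, rtol):
    x = torch.randn(rows, cols, dtype=dtype)
    torch.testing.assert_close(candidate_softcap(x), reference_softcap(x),
                               atol=atol, rtol=rtol)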

5. Workflow Integration and Code Examples

Liger-Kernel facilitates minimal-friction integration into LLM pipelines:

  • Zero-code patching:

from liger_kernel.transformers import AutoLigerKernelForCausalLM

model = AutoLigerKernelForCausalLM.from_pretrained("my-model-path")

  • Model-specific patching:

from transformers import AutoModelForSequenceClassification
from liger_kernel.transformers import apply_liger_kernel_to_llama

# Patch the LLaMA modeling code before instantiating the model.
apply_liger_kernel_to_llama()
model = AutoModelForSequenceClassification.from_pretrained("llama-checkpoint")

  • Custom modules:

import torch
from liger_kernel.transformers import LigerLayerNorm, LigerCrossEntropyLoss


class MyLigerBlock(torch.nn.Module):
    def __init__(self, hidden_dim, vocab_size):
        super().__init__()
        self.layer_norm = LigerLayerNorm(hidden_dim)
        # Vocabulary projection producing the logits consumed by the loss.
        self.lm_head = torch.nn.Linear(hidden_dim, vocab_size, bias=False)
        self.loss_fn = LigerCrossEntropyLoss()

    def forward(self, x, labels=None):
        # Normalize hidden states, then project to vocabulary logits.
        x = self.layer_norm(x)
        logits = self.lm_head(x)
        if labels is not None:
            # Cross-entropy expects (N, V) logits and (N,) integer labels.
            loss = self.loss_fn(logits.view(-1, logits.size(-1)), labels.view(-1))
            return loss
        return logits

  • HuggingFace-TRL Trainer integration:

from trl import SFTConfig, SFTTrainer

# `dataset` is a prepared training dataset; use_liger=True enables Liger kernels.
trainer = SFTTrainer(
    "meta-llama/Meta-Llama-3-8B",
    train_dataset=dataset,
    args=SFTConfig(..., use_liger=True)
)
trainer.train()

6. Validation: Convergence and Correctness

Kernel correctness is verified through unit tests against pure-PyTorch reference implementations across a diverse set of tensor shapes and data types (FP32, BF16), with strict $(\text{atol}, \text{rtol})$ tolerances. Tests are executed for both regular and irregular shapes, with checks for tensor layout errors and stability under large-scale training. Miniature end-to-end runs (e.g., "tiny-Shakespeare") confirm matching convergence trajectories, and extended 4xA100 training cycles on LLaMA and Qwen2 exhibit no divergence or accuracy loss relative to baseline.

7. Broader Significance

Liger-Kernel addresses "last-mile" GPU efficiency bottlenecks in LLM training by fusing and chunking memory- and compute-intensive operations into a compact set of highly optimized Triton kernels. The design approach yields 20–40% higher throughput and 20–60% lower peak GPU memory consumption, while retaining functional and numerical compatibility with mainstream PyTorch LLM infrastructures. Its compositional API paradigm and extensible kernel suite offer a foundation for incremental adaptation and scaling to future architectures (Hsu et al., 14 Oct 2024).

References

  1. Hsu et al. (2024). "Liger Kernel: Efficient Triton Kernels for LLM Training." arXiv preprint, 14 October 2024.