TransNormerLLM: Linear-Efficient LLM

Updated 10 April 2026

The paper introduces a novel architecture replacing softmax attention with Lightning Attention and blockwise linear processing, reducing complexity to linear while maintaining competitive accuracy.
TransNormerLLM is a large language model that employs gated linear attention, LRPE-d positional encoding, and normalization techniques to enhance training and inference efficiency.
By integrating system optimizations like FSDP and mixed precision, TNL achieves 20–35% throughput improvements and scalable deployment up to 175B parameters, despite limitations in retrieval tasks.

TransNormerLLM (TNL) is an LLM architecture that combines blockwise linear attention, specialized positional encoding, and system-level optimizations to achieve accuracy and throughput competitive with conventional softmax-based Transformers, while reducing training and inference time and memory complexity to linear in sequence length. TNL's innovations lie both in its theoretical design, centered on LRPE-d positional encoding and the Lightning Attention module, and in its practical instantiation, which enables deployment of large-capacity models on current GPU and distributed computing infrastructure (Qin et al., 2024, Qin et al., 2023, Shen et al., 2024).

1. Architectural Overview and Distinctive Components

TNL builds on the standard Transformer framework but, in each of the $L$ stacked blocks, replaces multi-head softmax attention and the feed-forward network with two submodules:

Gated Linear Attention (GLA): A linearized token mixer that incorporates a Swish-activated gating mechanism.
Simple Gated Linear Unit (SGLU): A channel mixer utilizing a gate-only nonlinearity without extra activation.

Both GLA and SGLU employ SimpleRMSNorm (SRMSNorm), a normalization operation defined as $\mathrm{SRMSNorm}(x) = x / (\|x\|_2/\sqrt{d})$, which improves computational efficiency by dropping the learnable scale parameters used in RMSNorm and LayerNorm.
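The SRMSNorm definition above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' Triton kernel; the function name `srmsnorm` and the small `eps` safeguard against division by zero are assumptions added here and are not part of the stated formula.

```python
import numpy as np

def srmsnorm(x: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Compute x / (||x||_2 / sqrt(d)) along the last axis.

    No learnable scale or bias is applied. `eps` is an added
    safeguard (assumption), not part of the formula in the text.
    """
    d = x.shape[-1]
    rms = np.linalg.norm(x, axis=-1, keepdims=True) / np.sqrt(d)
    return x / (rms + eps)

x = np.array([[3.0, 4.0], [1.0, 1.0]])
y = srmsnorm(x)

# After normalization, every row has root-mean-square close to 1.
row_rms = np.sqrt((y ** 2).mean(axis=-1))
```

Because the operation has no parameters, it reduces to one norm and one division per row, which is the property the text credits for its efficiency.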

Key architectural differences with standard LLMs include:

NormAttention: Replaces softmax attention with a normalized bilinear form, $O = \mathrm{Norm}((Q K^\top) V)$, or its kernel-trick-based recurrent (right-product) variant, $O = \mathrm{Norm}(Q (K^\top V))$; the latter avoids materializing the $n \times n$ attention matrix and its $O(n^2 d)$ computation.
LRPE-d Positional Encoding: A linearized relative positional embedding with exponential decay, $\lambda^{t-s} e^{i\theta(t-s)}$ , applied head- and layer-wise, ensuring preservation of global context while preventing attention dilution.
Lightning Attention: A novel, IO-aware, exact implementation of causal linear attention that keeps per-token speed constant across sequence lengths under a fixed memory budget, relying on a blockwise split into intra- and inter-block computation.
Robust Inference Algorithm: Numerically stable, constant-time sequence generation using a decay-absorbing recurrence, avoiding overflow/underflow of intermediate weights and maintaining efficiency at arbitrary sequence lengths (Qin et al., 2023, Qin et al., 2024).
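The two NormAttention forms listed above are algebraically identical when no causal mask is applied, because matrix multiplication is associative. The following sketch checks this numerically. The sizes, random inputs, and the `srmsnorm` helper standing in for $\mathrm{Norm}$ are illustrative assumptions.

```python
import numpy as np

def srmsnorm(x, eps=1e-8):
    # Stand-in for Norm(.) in the text: x / (||x||_2 / sqrt(d)).
    d = x.shape[-1]
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) / np.sqrt(d) + eps)

rng = np.random.default_rng(0)
n, d = 16, 4  # illustrative sequence length and head dimension
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))

# Left product: materializes an n x n matrix, O(n^2 d) work.
left = srmsnorm((Q @ K.T) @ V)

# Right product: materializes only a d x d matrix, O(n d^2) work.
right = srmsnorm(Q @ (K.T @ V))

same = np.allclose(left, right)
```

With a causal mask the reassociation no longer applies directly, since each position may only see its own prefix of keys and values. That is the case the recurrent and blockwise formulations are designed to handle.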

2. Lightning Attention: Mechanism and Blockwise Implementation

The Lightning Attention mechanism decomposes the causal attention computation over sequences of length $n$ using block partitioning into $T = n/B$ blocks of size $B$ :

Intra-Block (Quadratic within block): Standard left-product attention is applied to each block, costing $O(B^2 d)$ per block, maintaining strict causality.
Inter-Block (Linear across blocks): A running $d \times d$ summary matrix $\mathrm{KV}$ encodes the past, computed via linear recurrence ($O(B d^2)$ per block), eliminating the need for memory- and compute-intensive global cumulative sums.

For each block $t$, with $M$ the $B \times B$ causal (lower-triangular) mask:

$O_t = \left[(Q_t K_t^\top) \odot M\right] V_t + Q_t\, \mathrm{KV}_{t-1}, \qquad \mathrm{KV}_t = \mathrm{KV}_{t-1} + K_t^\top V_t$

This design results in $O(nBd + nd^2)$ total cost, achieving near-ideal linear scaling for practical $B \ll n$, while fully utilizing SRAM via GPU tiling schemes and avoiding global prefix-sum bottlenecks (Qin et al., 2024).
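The intra-/inter-block split can be illustrated with a small NumPy sketch that compares a blockwise computation against a naive masked left product. This is a simplified reading of the scheme: it omits the decay factor, normalization, and all GPU tiling, and the function names are hypothetical.

```python
import numpy as np

def causal_linear_attention_naive(Q, K, V):
    """Reference: masked left product over the full n x n matrix."""
    n = Q.shape[0]
    mask = np.tril(np.ones((n, n)))
    return ((Q @ K.T) * mask) @ V

def causal_linear_attention_blockwise(Q, K, V, B):
    """Blockwise sketch: quadratic inside a block, linear across blocks."""
    n, d = Q.shape
    assert n % B == 0, "sketch assumes n is a multiple of the block size"
    mask = np.tril(np.ones((B, B)))
    kv = np.zeros((d, V.shape[1]))       # running summary of earlier blocks
    out = np.empty_like(V)
    for start in range(0, n, B):
        q = Q[start:start + B]
        k = K[start:start + B]
        v = V[start:start + B]
        intra = ((q @ k.T) * mask) @ v   # O(B^2 d), causal within the block
        inter = q @ kv                   # O(B d^2), contribution of the past
        out[start:start + B] = intra + inter
        kv = kv + k.T @ v                # update the summary for later blocks
    return out

rng = np.random.default_rng(0)
n, d, B = 32, 8, 4
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
ref = causal_linear_attention_naive(Q, K, V)
blk = causal_linear_attention_blockwise(Q, K, V, B)
```

The two outputs agree up to floating-point error, while the blockwise version never builds anything larger than a $B \times B$ or $d \times d$ matrix.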

3. Positional Encoding and Gating Strategies

LRPE-d: Linearized Relative Positional Encoding with Exponential Decay

For each attention head $h$ (of $H$) and layer $l$ (of $L$):

$a_{ts} = q_t^\top k_s\, \lambda^{t-s} e^{i\theta(t-s)}$

$\lambda = \exp\left(-\frac{8h}{H}\left(1 - \frac{l}{L}\right)\right)$

This encoding eliminates attention dilution over long contexts while preserving global token interactions, and is implemented in a manner compatible with linear kernel-trick approaches (Qin et al., 2023).
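The exponential-decay part of this encoding is what makes constant-cost recurrent decoding possible: the factor $\lambda^{t-s}$ can be absorbed into a running state instead of being recomputed per token pair. The sketch below checks that a decayed recurrence reproduces the masked, decay-weighted attention. It is an illustration in the spirit of the decay-absorbing inference described earlier, not the authors' code. It leaves out the rotation term $e^{i\theta(t-s)}$ and normalization, and the value `lam = 0.9` is an arbitrary choice.

```python
import numpy as np

def decayed_attention_parallel(Q, K, V, lam):
    """Reference: weight each (t, s) pair with s <= t by lam ** (t - s)."""
    n = Q.shape[0]
    idx = np.arange(n)
    diff = idx[:, None] - idx[None, :]                 # t - s
    decay = np.where(diff >= 0, lam ** np.maximum(diff, 0), 0.0)
    return ((Q @ K.T) * decay) @ V

def decayed_attention_recurrent(Q, K, V, lam):
    """Token-by-token form with a fixed-size d x d state."""
    n, d = Q.shape
    kv = np.zeros((d, V.shape[1]))
    out = np.empty_like(V)
    for t in range(n):
        # Decay the old state, then add the current key/value outer product.
        kv = lam * kv + np.outer(K[t], V[t])
        out[t] = Q[t] @ kv
    return out

rng = np.random.default_rng(0)
n, d = 24, 4
lam = 0.9  # illustrative decay rate (assumption)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
par = decayed_attention_parallel(Q, K, V, lam)
rec = decayed_attention_recurrent(Q, K, V, lam)
```

Since the state only ever gets multiplied by $\lambda \le 1$, no $\lambda^{-t}$ terms appear, and the per-token cost and memory stay independent of how many tokens have already been generated.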

Gating and Normalization

GLA: Swish-activated gate increases training stability and improves perplexity by 1–2%, as confirmed by ablation.
SGLU: Gate-only nonlinearity avoids redundant activations, saving compute without degrading performance.
SRMSNorm: Empirically matches other normalization strategies but offers a ~20% speedup when implemented with Triton, especially for large hidden sizes.
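As a rough illustration of the "gate-only" idea behind SGLU, the sketch below multiplies two linear projections of the input elementwise and applies an output projection, with no activation function on either branch. The weight names `W_v`, `W_u`, and `W_o` and the dimensions are hypothetical, chosen only for this example.

```python
import numpy as np

def sglu(X, W_v, W_u, W_o):
    """Gate-only channel mixer: (X W_v) * (X W_u), then project with W_o.

    Unlike activation-based gated units, neither branch passes
    through a nonlinearity; the elementwise product is the only
    nonlinear operation.
    """
    return ((X @ W_v) * (X @ W_u)) @ W_o

rng = np.random.default_rng(0)
n, d, d_hidden = 5, 8, 16  # illustrative sizes
X = rng.standard_normal((n, d))
W_v = rng.standard_normal((d, d_hidden))
W_u = rng.standard_normal((d, d_hidden))
W_o = rng.standard_normal((d_hidden, d))
Y = sglu(X, W_v, W_u, W_o)

# With identity weights the unit simply squares each input channel.
I = np.eye(2)
squared = sglu(np.array([[1.0, 2.0]]), I, I, I)
```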

4. Training Efficiency, Parallelism, and System Implementation

TNL leverages a robust system stack for large-scale distributed training:

FSDP (Fully Sharded Data Parallel): Parameter, gradient, and optimizer state sharding across GPUs.
Activation Checkpointing: Trades additional compute for reduced activation memory; essential for multi-billion parameter models.
Automatic Mixed Precision (bfloat16): Reduces memory and enables larger context windows.
Model Parallelism:
- SGLU and GLA parallelism: Projections and output matrices are split across model-parallel devices, requiring a single all-reduce per step.
- Scalability: Empirical evidence of efficient scaling up to 175B parameters with context lengths up to 12K on available hardware (Qin et al., 2023).
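To give a sense of why sharding matters at these sizes, the following back-of-the-envelope sketch estimates the per-GPU memory for model state under full sharding. It assumes 16 bytes per parameter: 2 for bf16 weights, 2 for bf16 gradients, and 12 for fp32 master weights plus two Adam moment buffers. That is a common accounting for mixed-precision Adam, but it is an assumption here and not a figure from the TNL papers. Activations and temporary buffers are ignored.

```python
def per_gpu_state_gib(n_params: float, n_gpus: int,
                      bytes_per_param: int = 16) -> float:
    """Estimate sharded model-state memory per GPU in GiB.

    bytes_per_param = 16 is an assumed mixed-precision Adam budget;
    activation memory is not included.
    """
    return n_params * bytes_per_param / n_gpus / 2**30

# Hypothetical configurations for comparison.
unsharded_7b = per_gpu_state_gib(7e9, n_gpus=1)
sharded_7b = per_gpu_state_gib(7e9, n_gpus=8)
sharded_175b = per_gpu_state_gib(175e9, n_gpus=512)
```

Under these assumptions the unsharded 7B state alone exceeds 100 GiB, which is why parameter, gradient, and optimizer-state sharding is combined with activation checkpointing and bfloat16.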

Throughput and maximum trainable context length exceed those of analogous softmax-attention Transformers by 20–35% at the 7B-parameter scale. Empirical memory footprint and wall-clock time both display linear growth with sequence length, in contrast to the quadratic scaling of standard architectures (Qin et al., 2023, Qin et al., 2024).

5. Empirical Benchmarks and Scaling Laws

Training Corpora and Task Coverage

TNL models are pretrained on high-quality corpora, filtered to 2T tokens (6 TB on disk), with explicit support for multilingual tokenization (including enhanced coverage of Chinese scripts). Tokenization is based on BPE with a character-level fallback for out-of-vocabulary (OOV) input.

Accuracy and Efficiency

Perplexity: TNL yields lower or matched PPL compared to GPT/Transformer baselines at and beyond 385M parameters. For 7B, WIKITEXT-2 PPL reaches 14.1 (vs. 15.2 for LLaMA) and LAMBADA PPL is 5.5 (vs. 5.9 for LLaMA) (Shen et al., 2024).
Commonsense Reasoning: TNL at 7B achieves 58.63% average CSR accuracy, surpassing LLaMA (56.70%) and matching HGRN2 (58.78%).
Retrieval and Long-Document Generation: TNL lags conventional softmax-attention models and architectures like HGRN2/cosFormer2 in NIAH and SCROLLS benchmarks, which depend on explicit “re-reading” capabilities absent from strictly linear recurrent attention.
Constant Throughput: Tokens processed per second remain essentially invariant for sequence lengths from 1K to 128K, whereas softmax-attention LLMs, including those accelerated with FlashAttention-2, show throughput collapse at long lengths (Qin et al., 2024).
Scaling Laws: The loss-vs-compute exponent for TNL is marginally flatter than that of LLaMA, indicating slightly slower improvement per added compute, but the gap closes at larger model sizes (Shen et al., 2024).

Model	Params	WIKI PPL ↓	LAMBADA PPL ↓	CSR ↑	NIAH w.a. ↑	SCROLLS avg ↑
LLaMA	7 B	15.2	5.9	56.70	59.7	14.57
TNL	7 B	14.1	5.5	58.63	20.5	10.74
HGRN2	7 B	13.8	5.2	58.78	30.8	13.46
cosFormer2	7 B	13.5	6.4	57.92	23.6	15.15

6. Design Rationales, Limitations, and Future Directions

TNL exemplifies a family of linear-complexity models demonstrating that, with sufficient architectural mitigation (gating, decay-structured positional encodings, IO-efficient linear attention), it is possible to match or exceed the language-modeling accuracy and scaling behavior of traditional softmax-based Transformers for most tasks. However, the inability to “re-read” or compound upon previous tokens constrains retrieval and some generative tasks, as evidenced by performance on NIAH and SCROLLS benchmarks (Shen et al., 2024).

Optimizing the balance between model size and dataset size under compute constraints reveals that TNL assigns a slightly higher value to increased training data compared to model size, relative to LLaMA. This offers practical advantages for training efficiency and generalization in data-rich scenarios.

Further directions include hybridization with state-space or convolutional layers, adoption of learnable or data-dependent decay rates, additional IO optimizations, and extension of context length with sparse or low-rank attention mechanisms.

7. Availability and Reproducibility

Source code for TNL and all associated variants is available at https://github.com/OpenNLPLab/TransnormerLLM. The stack is implemented in PyTorch (with Fairseq/MetaSeq), aided by custom Triton kernels for Lightning Attention and FSDP for large-scale sharding. Mixed-precision and activation checkpointing are standard, and block size $B$ is typically set to the attention head dimension. Hyperparameters and system splits are documented for reproducibility and extension.


References

Qin et al. (2024). Various Lengths, Constant Speed: Efficient Language Modeling with Lightning Attention.

Qin et al. (2023). TransNormerLLM: A Faster and Better Large Language Model with Improved TransNormer.

Shen et al. (2024). Scaling Laws for Linear Complexity Language Models.
