
TWEO: Transformers Without Extreme Outliers

Updated 23 December 2025
  • TWEO is a collection of techniques that suppress extreme activation and parameter outliers in transformers using theoretical insights, normalization strategies, and regularization.
  • It employs loss-based regularization, tailored normalization, and architectural modifications such as Outlier-Protected blocks to control high-magnitude activations.
  • These methods enhance hardware compatibility and training stability, enabling robust low-bit quantization and improved performance across diverse transformer models.

Transformers Without Extreme Outliers (TWEO) encompasses a set of theoretical analyses, architectural modifications, regularization schemata, normalization strategies, and quantization techniques to suppress extreme activation outliers in transformers. These outliers—typically highly anisotropic, high-magnitude components in hidden representations or network parameters—impair hardware efficiency, destabilize quantized inference/training, and sharply concentrate representational power. TWEO methods span theoretical, algorithmic, and empirical domains, ranging from regularization via loss design to explicit normalization/rotation operations and parameter reparameterizations, providing robust recipes for both pre-training and post-training settings.

1. Characterization and Mechanisms of Extreme Outliers

Extreme activation and parameter outliers in transformers are identified by their statistical deviation and functional roles. Formally, a coordinate $i$ of the LayerNorm scale $w_\ell \in \mathbb{R}^d$ (layer $\ell$) is an outlier if $|w_{\ell,i} - \mu_\ell| > \alpha \sigma_\ell$ (with $\alpha = 3$ for BERT and $\alpha = 2$ for RoBERTa), and an "outlier dimension" holds this property across all layers. Empirically, in BERT-base/RoBERTa-base, disabling fewer than 0.0001% of parameters (e.g., two outlier LayerNorm dimensions) results in a >25% drop on GLUE/MNLI, whereas ablating random dimensions has no effect (Puccetti et al., 2022; Kovaleva et al., 2021).
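
The criterion above can be checked directly on a pre-trained checkpoint. The following is a minimal sketch, assuming a HuggingFace BERT checkpoint and the $\alpha = 3$ threshold quoted above; the model name and the "flagged in every LayerNorm" heuristic are illustrative choices, not the exact procedure of the cited papers.

```python
import torch
from collections import Counter
from transformers import AutoModel

# Sketch: flag LayerNorm scale coordinates with |w_i - mu| > alpha * sigma.
model = AutoModel.from_pretrained("bert-base-uncased")
alpha = 3.0  # BERT setting from the text; 2.0 for RoBERTa

per_layer_outliers = []
for name, module in model.named_modules():
    if isinstance(module, torch.nn.LayerNorm):
        w = module.weight.detach()
        mu, sigma = w.mean(), w.std()
        dims = torch.nonzero((w - mu).abs() > alpha * sigma).flatten().tolist()
        per_layer_outliers.append((name, dims))

# A dimension flagged in every LayerNorm is a candidate "outlier dimension".
counts = Counter(d for _, dims in per_layer_outliers for d in dims)
n_norms = len(per_layer_outliers)
print([d for d, c in counts.items() if c == n_norms])
```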

Hidden activations concentrate variance along a small number of directions ("rogue" or outlier axes). This makes the embedding space highly anisotropic. Self-attention heads, especially those associated with special tokens or punctuation, focus their global context through these axes, which manifests as "vertical stripes" in attention maps. Outlier dimensions are also strongly correlated with token frequency during pre-training: absolute hidden magnitudes $|h_\ell(t)_i|$ align with $\log$(token frequency), with Pearson coefficients often peaking in intermediate layers, then vanishing at the input/output layers (Puccetti et al., 2022).
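
The frequency correlation is straightforward to measure. Below is a minimal sketch, assuming per-token hidden states from one layer and a pre-training token-frequency table are already available; `frequency_correlation` and its arguments are hypothetical names for illustration.

```python
import torch

def frequency_correlation(hidden, token_ids, freq_table, dim):
    """Pearson correlation between |h(t)_dim| and log(token frequency).

    hidden:     (num_tokens, d) hidden states from one layer
    token_ids:  (num_tokens,) vocabulary ids of the same tokens
    freq_table: (vocab_size,) pre-training token counts (assumed available)
    dim:        index of the candidate outlier dimension
    """
    mags = hidden[:, dim].abs()
    log_freq = torch.log(freq_table[token_ids].float() + 1.0)
    x = mags - mags.mean()
    y = log_freq - log_freq.mean()
    return (x * y).sum() / (x.norm() * y.norm() + 1e-8)
```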

Recent analysis demonstrates that such outliers are not solely data-driven. SVD of MLP weight matrices reveals that dominant singular vectors align collinearly with the residual branch, mechanically amplifying activations; this structural alignment, rather than the dataset, is the essential root cause. Even with random Gaussian input, pre-trained transformers produce extreme outliers, confirming the data-independent, structural nature of the effect (Liang et al., 28 Nov 2025).
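
The data-independence claim is easy to probe empirically. The sketch below feeds random Gaussian "embeddings" through a pre-trained transformer and records the peak post-MLP activation per block; GPT-2 is used only as a convenient example checkpoint and is not necessarily the model studied in the cited work.

```python
import torch
from transformers import GPT2Model

# Random noise in, per-block peak post-MLP activation out: large values here
# indicate structurally induced (data-independent) outliers.
model = GPT2Model.from_pretrained("gpt2").eval()
peaks = {}

def make_hook(idx):
    def hook(_module, _inputs, output):
        peaks[idx] = output.abs().max().item()
    return hook

for i, block in enumerate(model.h):
    block.mlp.register_forward_hook(make_hook(i))

with torch.no_grad():
    noise = torch.randn(1, 64, model.config.n_embd)  # no real tokens involved
    model(inputs_embeds=noise)

print({i: round(v, 1) for i, v in peaks.items()})
```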

2. Regularization and Loss-Based Suppression

Loss-based regularization forms the cornerstone of recent TWEO implementations. The central idea is to penalize activations that exceed a "safe" threshold. This can be instantiated as a soft constraint in the objective:

$$\mathcal{L}_{\rm total} = \mathcal{L}_{\rm task} + \lambda \left[ \frac{1}{L} \sum_{l=1}^{L} \mathbb{E}\left( \left| \frac{A^{(l)}}{\tau+\epsilon} \right|^p \right) \right]$$

Here, $A^{(l)}$ is the post-MLP activation at block $l$, $\tau$ is a magnitude threshold (e.g., 3), $p$ is a penalty exponent (e.g., 4), and $\lambda$ is an annealed or fixed regularization weight. For $|A| \gg \tau$, the penalty grows rapidly, suppressing the formation of heavy tails (Liang et al., 28 Nov 2025). This approach smooths activation distributions, lowering peak outlier counts from $O(10^4)$ to fewer than 20 per layer, and enables robust full-model FP8 training and aggressive quantization (W8A8 even for the residual stream).
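
A minimal sketch of the penalty term above, written directly from the formula; the function name and the way activations are collected from the training loop are assumptions for illustration, not the reference implementation.

```python
import torch

def outlier_penalty(post_mlp_activations, tau=3.0, p=4, eps=1e-6):
    """(1/L) * sum_l E[ |A^(l) / (tau + eps)|^p ] over a list of per-block
    post-MLP activation tensors. Values with |A| >> tau are penalized steeply,
    while in-range activations contribute almost nothing."""
    terms = [((a / (tau + eps)).abs() ** p).mean() for a in post_mlp_activations]
    return torch.stack(terms).mean()

# In the training loop (task_loss, activations, and lam supplied externally):
# loss = task_loss + lam * outlier_penalty(post_mlp_activations)
```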

Other approaches include explicit LayerNorm parameter clamping: penalizing deviations in scale or bias beyond a $k\sigma$ threshold, added as a regularization term to the standard MLM or permutation-LM loss (Kovaleva et al., 2021).
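
One plausible form of such a clamping term is sketched below, assuming the $k\sigma$ threshold is applied to each LayerNorm's own scale/bias distribution; the exact functional form in the cited work may differ.

```python
import torch

def layernorm_clamp_penalty(model, k=3.0):
    """Soft penalty on LayerNorm scale/bias entries that stray more than
    k standard deviations from their layer's mean; added to the LM loss."""
    penalty = 0.0
    for module in model.modules():
        if isinstance(module, torch.nn.LayerNorm):
            for p in (module.weight, module.bias):
                if p is None:
                    continue
                excess = (p - p.mean()).abs() - k * p.std()
                penalty = penalty + torch.relu(excess).pow(2).sum()
    return penalty
```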

3. Architectural and Normalization Strategies

Several TWEO recipes rely on modifying the transformer block's normalization or residual structure. The Outlier-Protected (OP) block (He et al., 2024) eschews width-wise LayerNorm/RMSNorm entirely, thus eliminating outlier-prone normalization layers. Instead, two essential stabilizers are incorporated:

  • Residual-downscaling: Scale the residual branch by $\beta = O(1/\sqrt{\text{depth}})$ (e.g., 0.1 for a 24-layer model).
  • Entropy-control via Query-Key normalization: Normalize Q/K vectors to a fixed $\ell_2$ norm ($\sqrt{d}$), preserving high entropy and preventing collapse.

This yields training stability and convergence indistinguishable from Pre-Norm but dramatically lowers activation kurtosis by 2–4 orders of magnitude, as measured by batchwise RMS kurtosis and max-median ratio.
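A minimal sketch of the two stabilizers, not the full OP block: QK $\ell_2$-normalization to norm $\sqrt{d}$ and residual downscaling by $\beta = 1/\sqrt{\text{depth}}$. Class and function names here are illustrative.

```python
import math
import torch
import torch.nn.functional as F

def qk_normalize(q, k, head_dim):
    """Entropy control: rescale query/key vectors to a fixed L2 norm of sqrt(d)."""
    q = F.normalize(q, dim=-1) * math.sqrt(head_dim)
    k = F.normalize(k, dim=-1) * math.sqrt(head_dim)
    return q, k

class ScaledResidual(torch.nn.Module):
    """Residual downscaling: x + beta * f(x), with beta = O(1/sqrt(depth))."""
    def __init__(self, sublayer, depth):
        super().__init__()
        self.sublayer = sublayer
        self.beta = 1.0 / math.sqrt(depth)  # ~0.2 for 24 layers; 0.1 also used

    def forward(self, x):
        return x + self.beta * self.sublayer(x)
```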

Unified Normalization (UN) (Yang et al., 2022) replaces LayerNorm with a fused, inference-friendly formulation that eliminates runtime stat updates and division/sqrt ops. UN detects extreme outliers by comparing arithmetic and geometric means of per-channel variances over a moving window, adaptively replacing the current channel's variance estimate with the geometric mean if an outlier is detected. This preserves gradient fidelity and statistically bounds the effect of outliers. During inference, all normalization statistics are merged into adjacent linear operations.
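
The outlier test at the heart of UN can be sketched as follows, assuming a moving window of recent per-channel variance estimates; the window length, the ratio threshold, and the fallback rule are illustrative rather than the published hyperparameters.

```python
import torch

def adaptive_variance(var_history, threshold=10.0):
    """UN-style outlier test on a moving window of per-channel variances.

    var_history: (window, channels) recent variance estimates, newest last.
    A large gap between the arithmetic and geometric means signals a
    heavy-tailed spike; in that case fall back to the geometric mean
    instead of the (outlier-contaminated) current estimate.
    """
    am = var_history.mean(dim=0)
    gm = torch.exp(torch.log(var_history + 1e-8).mean(dim=0))
    current = var_history[-1]
    return torch.where(am > threshold * gm, gm, current)
```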

4. Attention Mechanisms and Softmax Clipping

Several TWEO methods address the root cause of outliers in the attention mechanism. In standard transformers, driving attention probabilities toward exactly zero or one (which "no-update" heads require) demands ever-larger logit gaps, and these large logits are realized through extreme-magnitude activations in the preceding FFN outputs.

Clipped Softmax and Normalized Clipped Softmax (Bondarenko et al., 2023, Liao et al., 2024) cure this by post-processing the softmax vector:

$$\text{NCS}(x; \zeta, \beta)_i = \mathrm{clip}\left((\zeta-\gamma)\,\mathrm{softmax}(x)_i + \gamma,\ 0,\ 1\right)$$

Here, $\gamma = (\beta-\zeta)/(T-1)$ ensures that $\sum_i \text{NCS}(x)_i = \beta$ regardless of the sequence length $T$. Because the clipping lets attention probabilities reach exactly zero or one with finite logits, the incentive for unbounded logit differences disappears. The layer remains robust across varying sequence lengths, a crucial property for generalization between pre-training and fine-tuning (Liao et al., 2024).
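
The formula translates directly into code. A minimal sketch follows; the default values of `zeta` and `beta` are illustrative, not the settings used in the cited papers.

```python
import torch

def normalized_clipped_softmax(logits, zeta=1.01, beta=1.0, dim=-1):
    """Normalized Clipped Softmax along `dim` (sequence axis of length T).

    gamma = (beta - zeta) / (T - 1) keeps the pre-clipping row sum equal to
    beta for any sequence length; clipping then allows exact 0/1 probabilities
    without requiring unbounded logit gaps.
    """
    T = logits.size(dim)
    gamma = (beta - zeta) / (T - 1)
    probs = torch.softmax(logits, dim=dim)
    return torch.clamp((zeta - gamma) * probs + gamma, 0.0, 1.0)
```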

Gated attention is an alternative that decouples head activation from attention weights entirely by introducing a learned sigmoid gate per head and token, sidestepping large logit gaps altogether (Bondarenko et al., 2023).
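
A minimal sketch of per-head, per-token sigmoid gating applied to attention-head outputs; the module name, projection, and tensor layout are assumptions for illustration, not the exact published design.

```python
import torch

class GatedHeadOutput(torch.nn.Module):
    """Sigmoid gate per head and token applied to attention-head outputs,
    letting a head switch itself off without extreme attention logits."""
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.gate_proj = torch.nn.Linear(d_model, n_heads)

    def forward(self, x, head_outputs):
        # x: (batch, seq, d_model); head_outputs: (batch, seq, n_heads, d_head)
        gates = torch.sigmoid(self.gate_proj(x))        # (batch, seq, n_heads)
        return head_outputs * gates.unsqueeze(-1)
```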

5. Quantization-Oriented Outlier Suppression and Plug-and-Play Techniques

Outlier suppression techniques are critical for low-bit quantization. Gamma Migration (Wei et al., 2022) analytically moves the problematic LayerNorm scale $\gamma$ into subsequent linear operations, transforming the network algebraically but dramatically shrinking the post-LN activation dynamic range. Post-migration, the node after LN has its range reduced by up to $\max_j |\gamma_j|$-fold. Token-Wise Clipping further shrinks outlier effects by adaptively selecting per-token clipping bounds (coarse quantile search, followed by local refinement via gradient descent) to minimize quantization loss. These are applied post-training and maintain operator and data-path equivalence (Wei et al., 2022).
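
A minimal sketch of the gamma-migration rewrite (token-wise clipping omitted), assuming the LayerNorm output feeds only the given linear layer; the function name and in-place folding strategy are illustrative, not the reference implementation.

```python
import torch

def migrate_gamma(layernorm: torch.nn.LayerNorm, linear: torch.nn.Linear):
    """Fold the LayerNorm scale gamma into the following linear layer.

    Original: Linear(gamma * x_hat + beta). After migration the LN emits
    x_hat + beta/gamma (unit scale, much smaller dynamic range at the
    quantizer), and the linear weight becomes W' = W @ diag(gamma), so the
    overall mapping is algebraically unchanged.
    """
    gamma = layernorm.weight.detach().clone()
    with torch.no_grad():
        linear.weight.mul_(gamma)    # scale each input column by gamma_j
        layernorm.bias.div_(gamma)   # bias becomes beta / gamma
        layernorm.weight.fill_(1.0)  # LN scale absorbed into the linear layer
```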

Group-wise Hadamard Rotation (ConvRot) (Huang et al., 3 Dec 2025) achieves plug-and-play TWEO by orthogonally redistributing outlier mass across activation channels, both rows and columns, reducing the $O(K^2)$ rotation complexity to $O(K)$ via groups with group size $N_0$. Once rotated, weights and activations are uniformly quantized, GEMM'ed in INT4, then dequantized. This process preserves the exact linear mapping pre-quantization, and suppresses outliers without retraining or calibration.
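
The core idea can be sketched as a block-diagonal Hadamard rotation applied to the input channels of a linear layer and folded into its weights, so the linear map is preserved exactly while outlier mass is spread within each group. The group size and function names below are illustrative; this is a sketch of the principle, not the published INT4 kernel.

```python
import torch

def hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix (n a power of two)."""
    H = torch.ones(1, 1)
    while H.shape[0] < n:
        H = torch.cat([torch.cat([H, H], dim=1),
                       torch.cat([H, -H], dim=1)], dim=0)
    return H

def groupwise_rotate(x, weight, group_size=64):
    """Rotate input channels group-wise: x -> x R, W -> W R with R orthogonal
    and block-diagonal, so (x R) @ (W R).T == x @ W.T. Outlier mass is spread
    within each group of `group_size` channels before uniform quantization.
    group_size must be a power of two and divide the channel dimension."""
    d = x.shape[-1]
    assert d % group_size == 0
    R = hadamard(group_size) / group_size ** 0.5          # orthogonal block
    xg = x.reshape(*x.shape[:-1], d // group_size, group_size) @ R
    wg = weight.reshape(weight.shape[0], d // group_size, group_size) @ R
    return xg.reshape_as(x), wg.reshape_as(weight)
```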

6. Empirical Impact: Performance, Quantization, and Hardware

TWEO recipes consistently yield dramatic improvements in hardware compatibility and quantization stability. Key results include:

| Method | Model | Max Outlier Magnitude | INT8/FP8 Performance | Notes |
|---|---|---|---|---|
| Baseline | GPT-2/ViT-B | 10,000+ | collapse | All FP8 runs diverge/collapse |
| Loss-based TWEO reg. | GPT-2/ViT-B | ≤20 | ~BF16 parity | Enables full-model FP8 up to 7B |
| Clipped/NCS Softmax | BERT/ViT/OPT | 20–80 | INT8 matches FP | No INT8 loss; robust to sequence length |
| OP block | OPT (1.2B) | 10–30 | FP32 match | Kurtosis lowered by 2–4 orders of magnitude; stable convergence |
| Gamma Migration + Clipping | BERT/RoBERTa | ≤10–20 | 6-bit/8-bit match | Algebraically equivalent; PTQ stable |
| ConvRot (group Hadamard) | Diffusion Transformers | O(1) post-rotation | W4A4 match | No retraining; >2× speedup; <1% loss |

FP8-trained GPT-2 (1.6B) with the TWEO regularizer attains 13.84 perplexity, on par with BF16 training, whereas baseline FP8 training collapses immediately. Per-tensor static W8A8 quantization, previously unusable for LLMs, matches or exceeds SmoothQuant and other state-of-the-art methods after TWEO. Hardware measurements show 36% higher throughput for FP8 training at identical accuracy.

7. Limitations, Open Questions, and Future Directions

TWEO techniques generalize across language, vision, and diffusion transformers. However, scaling to larger LLMs (e.g., OPT-350M and above for some NCS settings) may require hyperparameter tuning. Not all methods address pre-training biases (e.g., token-frequency-induced anisotropy) (Puccetti et al., 2022), and the robustness of certain plug-and-play schemes (ConvRot, Gamma Migration) remains to be fully explored for non-standard architectures or extreme depths.

The interaction of TWEO recipes with model distillation, pruning, and mixed-precision quantization remains an open avenue. A plausible implication is that future regularization strategies will integrate frequency-aware, isotropy-promoting objectives with normalization/rotation modules to produce architectures that, by construction, lack statistically and geometrically dominant outlier axes, thereby realizing the TWEO paradigm across modalities and hardware (Liang et al., 28 Nov 2025, Puccetti et al., 2022, He et al., 2024, Huang et al., 3 Dec 2025).
