
Activation Outliers in Deep Neural Networks

Updated 15 April 2026
  • Activation outliers are highly atypical, large-magnitude elements in deep neural networks characterized by heavy-tailed statistics and significant impact on quantization.
  • They manifest as massive or channel-wise anomalies, often resulting from normalization scaling, attention softmax effects, and weight collinearity.
  • Mitigation strategies such as per-channel quantization, rotation-based smoothing, and gradient-based regularization effectively reduce quantization errors in low-bit inference.

Activation outliers are highly atypical, large-magnitude elements within the activation tensors of deep neural networks—predominantly transformer-based architectures such as LLMs. They are defined quantitatively as individual neuron or channel responses lying far in the upper or lower tails of the empirical activation distribution, typically exceeding several standard deviations beyond the mean, and are characterized by heavy-tailed statistics (high kurtosis, non-negligible probability mass beyond 3σ). Activation outliers have substantial implications for model quantization, compression, training stability, and deployment, especially in low-bit post-training quantization (PTQ) scenarios. Their presence necessitates specialized quantization schemes and motivates extensive algorithmic, theoretical, and empirical research.

1. Mathematical Definitions and Statistical Properties

Activation outliers manifest as hidden-unit or channel activations far in the tail of the distribution of a layer's outputs. Several definitions have been formalized, including:

  • Standard deviation-based: An activation x is considered an outlier if |x − μ| > kσ for k ≥ 3–6, where μ and σ are the mean and standard deviation of the layer activations, respectively (Heo et al., 2023, Kaliaperumal, 4 Mar 2026, Bondarenko et al., 2023).
  • Empirical thresholding: A scalar threshold τ (e.g., τ = 1000) times the mean absolute activation is used: elements where |h_{i,j}| > τ·μ_h are activation outliers (An et al., 10 Feb 2025).
  • Percentile-based: Outliers may be operationally defined as belonging to the top or bottom 1% of activation magnitudes in a layer (Paglieri et al., 2024).
  • Kurtosis: The distribution of activations in LLMs is heavy-tailed, with kurtosis far above the Gaussian value of 3, reaching into the hundreds in deep transformer layers (Heo et al., 2023, Kaliaperumal, 4 Mar 2026).

In empirical studies of BERT and LLaMA models, up to 55% of activation energy is concentrated in the top 1% of channels in deep layers, and the probability mass in the distribution tails far exceeds Gaussian expectations (Heo et al., 2023, Kaliaperumal, 4 Mar 2026).
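
The definitions above can be checked numerically. The following sketch (synthetic activations with planted outlier channels; the tensor shape, outlier count, and threshold k are illustrative assumptions, not values from the cited papers) computes the kσ criterion, the kurtosis, and the top-1% channel energy share:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic "activations": mostly Gaussian, with a few channels scaled up
# to mimic heavy-tailed transformer statistics.
acts = rng.normal(size=(512, 768))          # (tokens, hidden_dim)
acts[:, :8] *= 50.0                         # plant 8 outlier channels

mu, sigma = acts.mean(), acts.std()

# Standard-deviation-based definition: |x - mu| > k * sigma
k = 6
outlier_mask = np.abs(acts - mu) > k * sigma
print("outlier fraction:", outlier_mask.mean())

# Pearson kurtosis: E[(x - mu)^4] / sigma^4 (a Gaussian gives ~3)
kurtosis = np.mean((acts - mu) ** 4) / sigma ** 4
print("kurtosis:", kurtosis)

# Energy concentration: share of total squared magnitude in the top-1% channels
channel_energy = (acts ** 2).sum(axis=0)
top = int(np.ceil(0.01 * channel_energy.size))
top_share = np.sort(channel_energy)[-top:].sum() / channel_energy.sum()
print("top-1% channel energy share:", top_share)
```

Even this toy setup reproduces the qualitative picture: kurtosis far above 3 and the bulk of activation energy concentrated in a handful of channels.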

2. Taxonomy: Types and Origins of Activation Outliers

Two primary types are universally reported:

  • Massive activations: rare, extremely large values confined to a handful of tokens, typically emerging in early FFN layers and propagated through residual connections.
  • Channel-wise outliers: persistently large magnitudes concentrated in a small, stable set of feature channels across most tokens.

Additional subclasses include:

  • Spike outliers: Large single-coordinate activations within a token vector, often localized to a single feature dimension but occurring only in specific tokens (Yi et al., 2024).
  • Token-wise outliers: Tokens whose entire activation vector is anomalously high; these can be isolated to the beginning of the sequence or special structural tokens (e.g., [BOS]) (Chen et al., 2024, Son et al., 2024).

Mechanistically, root causes include:

  • Attention softmax zero-update effect: Attention heads, tasked with "doing nothing" on semantically null or delimiter tokens, drive their corresponding softmax logits to extreme ranges, generating runaway activation outliers in preceding MLP layers (An et al., 10 Feb 2025, Bondarenko et al., 2023).
  • LayerNorm/RMSNorm rescaling: Large learned scaling factors (the γ parameters) in normalization layers boost channel means globally, producing persistent channel-wise outliers (Raman et al., 27 May 2025).
  • Weight structure and collinearity: Extreme outliers can be produced by collinearity between matrix singular vectors and particular weight rows or inputs, making the phenomenon data-independent and mechanistic rather than attributable to rare input tokens (Liang et al., 28 Nov 2025).

These structured, semantically meaningful outliers are propagated and sometimes even amplified by residual connections, challenging standard quantization workflows (Kaliaperumal, 4 Mar 2026, Raman et al., 27 May 2025).

3. Impact on Quantization and Compression

Activation outliers are the principal barrier to accurate low-bit quantization and efficient model compression:

  • Uniform quantization collapse: Even a single outlier in a quantization group forces the shared dynamic range to expand drastically, making the quantization step size Δ very coarse for the vast majority of non-outlier activations. This causes bulk quantization error to increase quadratically with the outlier's amplitude, severely degrading performance in W4A4, W8A8, or even W16A16 regimes (Heo et al., 2023, Yi et al., 2024, Raman et al., 27 May 2025, Lee et al., 2024).
  • Catastrophic accuracy drop: In BERT-base, overall accuracy under W8A8 can drop by more than 35 points (e.g., QNLI: 89.66% → 54.33%), with nearly all error attributable to a few dominant channels (Kaliaperumal, 4 Mar 2026).
  • Compression error amplification: In low-rank decompositions, perturbations on weight columns corresponding to outlier-activated channels massively amplify output errors: the output error ‖(W − Ŵ)X‖ can blow up if the activation matrix X has outlier rows, even when the weight error ‖W − Ŵ‖ is small (Yuan et al., 2023).
  • Ineffectiveness of percentile clipping: Even aggressive percentile-based calibration (e.g., 99.99%) only marginally alleviates error, and can remove semantically critical signal rather than noise (Kaliaperumal, 4 Mar 2026, Paglieri et al., 2024).
  • Persistence in modern LLMs: While newer model families (Llama-2/3, Mistral) have tamed outlier magnitudes and prevalence, activation outliers still occur and matter for quantization performance, particularly in older models (e.g., OPT), narrow calibration settings, or aggressive compression (Paglieri et al., 2024).
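
The range-inflation effect described above can be reproduced in a few lines. This is a generic illustration of symmetric uniform quantization with one shared scale, not any specific paper's scheme:

```python
import numpy as np

def quantize_dequantize(x, bits=8):
    """Symmetric uniform quantization with a single shared scale."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax              # step size tracks the max value
    return np.clip(np.round(x / scale), -qmax, qmax) * scale

rng = np.random.default_rng(0)
x = rng.normal(size=4096)
err_clean = np.mean((x - quantize_dequantize(x)) ** 2)

x_out = x.copy()
x_out[0] = 100.0                                # inject a single massive outlier
err_outlier = np.mean((x_out - quantize_dequantize(x_out)) ** 2)

print(err_outlier / err_clean)                  # bulk error blows up
```

One outlier inflates the shared scale for all 4096 elements, so the mean squared error on the inliers grows by orders of magnitude even though only a single value changed.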

4. Algorithmic Strategies for Identification and Mitigation

A broad spectrum of algorithms has been developed:

Strategy | Approach | Example Papers
Channel-wise/group-wise scaling | Per-input-channel quantization, PEGs, IC grouping | (Heo et al., 2023, Kaliaperumal, 4 Mar 2026)
Rotation-based smoothing | Orthogonal, Hadamard, dual transforms | (Xiang et al., 2024, Lin et al., 2024, Yi et al., 2024)
Token-wise outlier isolation | Prefixing KV cache with attention sinks | (Chen et al., 2024, Son et al., 2024)
SVD/activation-aware decompositions | SVD/PCA-based outlier-direction separation | (Hu et al., 25 Mar 2025, Yuan et al., 2023, Lu et al., 21 Mar 2025)
Mixed-precision quantization | Selective high precision for outlier groups | (Kaliaperumal, 4 Mar 2026)
Gradient-based regularization | Loss terms penalizing extreme activations | (Liang et al., 28 Nov 2025)
Normalization smoothing | Rescale/replace large scale factors | (Raman et al., 27 May 2025)
Channel voting/power-of-two scaling | Robust selection & rescaling per channel | (Lee et al., 17 Jul 2025)
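
To make the first row concrete, here is a minimal sketch (synthetic data, illustrative shapes) contrasting per-tensor and per-channel quantization error when a single channel carries outliers:

```python
import numpy as np

def quant_mse(x, scale, bits=8):
    """MSE of symmetric uniform quantization; `scale` may be a scalar
    (per-tensor) or a per-channel row vector (via broadcasting)."""
    qmax = 2 ** (bits - 1) - 1
    q = np.clip(np.round(x / scale), -qmax, qmax) * scale
    return np.mean((x - q) ** 2)

rng = np.random.default_rng(0)
acts = rng.normal(size=(256, 64))               # (tokens, channels)
acts[:, 0] *= 100.0                             # one outlier channel

qmax = 127
# Per-tensor: one scale shared by all channels, inflated by the outlier channel
mse_tensor = quant_mse(acts, np.abs(acts).max() / qmax)
# Per-channel: each channel gets its own scale, isolating the outlier
mse_channel = quant_mse(acts, np.abs(acts).max(axis=0, keepdims=True) / qmax)

print(mse_tensor, mse_channel)
```

Because the outlier channel no longer dictates the range for the other 63 channels, the per-channel variant reduces the overall error by more than an order of magnitude in this toy setting.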

Notable approaches:

  • Per-input-channel (per-IC) quantization avoids range inflation by aligning quantization scales with input channels rather than outputs, thus isolating the few outlier-affected channels and optimally allocating bits (Heo et al., 2023).
  • Block-wise and dual rotations (e.g., Hadamard/orthogonal transforms or combinations with zigzag permutations) distribute both persistent (channel-wise) and rare (massive) outliers across groups, reducing per-group variance and peak dynamic range (Xiang et al., 2024, Lin et al., 2024).
  • Prefixing attention sinks confines token-wise outliers to known "padding" or dummy slots in the key-value cache, achieving uniform activation magnitudes for downstream tokens and enabling efficient per-tensor static quantization (Chen et al., 2024, Son et al., 2024).
  • Activation-aware SVD and nested activation decomposition first "whiten" weights using the empirical activation covariance or principal directions, then perform separate low-rank factorizations for outlier-dominated and regular subspaces to maximize compression fidelity (Yuan et al., 2023, Lu et al., 21 Mar 2025).
  • FP format innovations such as asymmetric microscaling (AMXFP4) use distinct positive/negative group scales, suppressing bias and enhancing precision without calibration overhead (Lee et al., 2024).
  • Gradient-based suppression (e.g., TWEO loss) adds a regularizer on high-order moments or the soft-thresholded magnitude of post-residual activations during training, sharply reducing the frequency of extreme tail events per layer (Liang et al., 28 Nov 2025).
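
The rotation-based smoothing idea can be sketched as follows: multiplying activations by an orthonormal Hadamard matrix (Sylvester construction, used here purely for illustration) spreads a single outlier's energy across all coordinates while remaining exactly invertible, so the inverse rotation can in principle be folded into the adjacent weight matrix:

```python
import numpy as np

def hadamard(n):
    """Sylvester construction of an orthonormal n x n Hadamard matrix
    (n must be a power of 2)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)                       # orthonormal: H @ H.T = I

rng = np.random.default_rng(0)
x = rng.normal(size=128)
x[3] = 80.0                                     # one channel-wise outlier

H = hadamard(128)
x_rot = x @ H                                   # rotate the activation vector

# The peak magnitude (and hence the shared quantization range) shrinks,
# while the rotation is exactly invertible: (x @ H) @ H.T recovers x.
print(np.abs(x).max(), np.abs(x_rot).max())
```

The outlier of magnitude 80 is diluted to roughly 80/√128 per coordinate, so the dynamic range the quantizer must cover drops sharply without any loss of information.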

Mitigation effectiveness is highly nonlinear in grouping structure, outlier type, and model scale. A small misallocation (e.g., too few "embedding groups" in PEG quantization) fails catastrophically, while sufficient isolation enables almost lossless quantization (Kaliaperumal, 4 Mar 2026).
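
A toy version of mixed-precision outlier isolation can illustrate why sufficient isolation is nearly lossless. The `keep_frac` parameter below is a hypothetical knob for this sketch, not taken from the cited works: the largest-magnitude entries stay in full precision, and the uniform quantizer's scale is set by the inliers only:

```python
import numpy as np

def uniform_quant(x, bits=8):
    """Plain symmetric uniform quantization with one shared scale."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax
    return np.clip(np.round(x / scale), -qmax, qmax) * scale

def mixed_precision_quant(x, bits=8, keep_frac=0.01):
    """Keep the top `keep_frac` largest-magnitude entries in full precision
    and set the quantization scale from the remaining inliers only."""
    qmax = 2 ** (bits - 1) - 1
    k = max(1, int(np.ceil(keep_frac * x.size)))
    keep = np.argsort(np.abs(x))[-k:]           # indices kept in full precision
    mask = np.ones(x.size, dtype=bool)
    mask[keep] = False
    scale = np.abs(x[mask]).max() / qmax        # inlier-only scale
    out = x.copy()
    out[mask] = np.clip(np.round(x[mask] / scale), -qmax, qmax) * scale
    return out

rng = np.random.default_rng(0)
x = rng.normal(size=4096)
x[:4] *= 200.0                                  # plant a few massive outliers

mse_naive = np.mean((x - uniform_quant(x)) ** 2)
mse_mixed = np.mean((x - mixed_precision_quant(x)) ** 2)
print(mse_naive, mse_mixed)
```

Isolating well under 1% of the entries recovers near-lossless quantization for the bulk, mirroring the PEG-style observation that adequate isolation, not aggressive clipping, is what closes the accuracy gap.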

5. Empirical Characterization, Layerwise Distribution, and Function

Activation outliers exhibit non-uniform patterns across layers, token positions, and features:

  • Layerwise patterns: Massive activations often originate in early FFN layers and propagate as "true" massive activations (MAs) through residuals; channel-wise outliers can persist or reappear in deep layers due to normalization scaling or attention re-triggering (Raman et al., 27 May 2025, An et al., 10 Feb 2025).
  • Token and channel localization: Outliers disproportionately occur at sequence start (position 0), weakly semantic tokens (punctuation, underscores), and are restricted to a stable set of feature/channel indices, e.g., channels 1415 and 2533 in LLaMA2-7B (An et al., 10 Feb 2025, Chen et al., 2024).
  • Structural signal vs. noise: Ablation and percentile-clipping analyses show that outlier activations encode structured, semantically meaningful signal (such as the "do-nothing" operation for delimiter tokens), not random noise (Kaliaperumal, 4 Mar 2026, Bondarenko et al., 2023).
  • Functional role as scaling gates: Outlier activations function as implicit, context-aware scaling factors—dynamically gating attention or suppressing updates for uninformative tokens. Explicitly parameterizing these gates in the model can eliminate outliers, accelerate convergence, and improve quantization resilience (An et al., 10 Feb 2025, Bondarenko et al., 2023).

The following table illustrates extreme outlier statistics and quantized accuracy outcomes (Kaliaperumal, 4 Mar 2026; BERT-base, QNLI):

Layer | Kurtosis | Top-1% Energy (%) | W8A8 Accuracy (%)
11 | 271 | 55 | 54.33
1 | 14 | 21 | —
— | — | — | 89.66 (FP32)

6. The Diminishing Effect of Outliers in Modern Models

Recent work demonstrates a shift: modern LLMs (Llama-3, Mistral) trained with improved optimization regimes (bfloat16, higher weight decay/clipping) exhibit far fewer and less extreme outliers, to the point that one-shot W8A8 PTQ achieves FP16-equivalent performance without dedicated outlier handling (Paglieri et al., 2024). The "diminishing effect" of outliers in these models implies that future emphasis may shift from bespoke outlier-aware schemes to end-to-end low-bit integer inference pipelines optimized for throughput and hardware efficiency.

However, in contexts with high compression ratios, older model families, or small calibration sets, robust outlier mitigation remains indispensable. Furthermore, as outliers encode crucial model behaviors (attention gating), their indiscriminate elimination risks functional impairment and collapse.

7. Open Problems and Future Directions

Despite considerable progress, several unresolved issues persist:

  • Optimal grouping and bit allocation trade-offs: Determining, for a fixed bit-budget, the ideal balance between group count, isolation granularity, and runtime overhead remains an open question (Heo et al., 2023).
  • Calibration-free and generalizable methods: Engineering solutions that do not rely on calibration data or are robust to dataset/model shifts—especially for edge deployment or privacy-sensitive scenarios—are actively pursued (Lee et al., 2024, Lin et al., 2024).
  • Interpretability and emergent outlier structure: Deeper understanding of the representational roles, mechanistic origins, and functional consequences of activation outliers (beyond their quantitative metrics) is ongoing, with implications for both model design and interpretability (Liang et al., 28 Nov 2025, An et al., 10 Feb 2025).
  • Universal strategies across architectures: Effective outlier mitigation for encoder-decoder transformers, diffusion models, and non-attention-based deep nets remains less explored, despite initial successful generalizations (Lee et al., 17 Jul 2025).

Further integration of outlier-handling research with architectural advances, hardware optimization, and theoretical learning dynamics is necessary to fully resolve the tension between quantization efficiency and the preservation of model function.
