Spike-Aware Mixed-Precision Quantization (SAMPQ)
- The paper introduces SAMPQ, a novel quantization approach that selectively protects rare, high-magnitude activations to achieve near–floating point accuracy with reduced resource consumption.
- It employs adaptive statistical thresholds and per-layer analysis to assign higher bitwidths only to spiky layers, minimizing quantization errors in large language models.
- Experimental evaluations on models like LLaMA demonstrate that SAMPQ maintains competitive perplexity and zero-shot accuracy while offering significant energy, compute, and memory savings.
Spike-Aware Mixed-Precision Quantization (SAMPQ) refers to a class of quantization techniques for LLMs that allocate higher precision only to network elements exhibiting rare, high-magnitude activation “spikes.” Originating in research targeting LLaMA and derivatives, and subsequently extended to spiking neural architectures for energy-efficient inference, SAMPQ exploits the empirical localization of activation outliers to a small subset of model components. By doing so, it achieves near–floating point accuracy with much lower memory, compute, and energy footprints, particularly when compared to uniform low-bit quantization strategies (Maisonnave et al., 30 Apr 2025, Wang et al., 22 Oct 2025).
1. Underlying Motivation and Conceptual Rationale
LLMs frequently exhibit outlier or “salient” activations. If all layers are quantized uniformly, these outliers stretch the quantization range, collapsing typical activations into a few coarse bins and severely degrading accuracy. SAMPQ detects and protects only the locations where such spikes occur, often a handful of linear projections in transformer architectures. The rationale rests on the observation that activation spikes are sparse and sharply localized, which makes per-layer or per-group mixed precision both effective and efficient (Maisonnave et al., 30 Apr 2025).
In spike-driven SNN extensions, SAMPQ (termed “SpikeQuant” in (Wang et al., 22 Oct 2025)) maps salient activations to higher-bit precision and re-encodes activations into time-to-first-spike (TTFS) codes, leveraging the brain-inspired integrate-and-fire (IF) paradigm to achieve both mixed-precision storage and explicit dequantization elimination. This enables further energy reductions while maintaining high model accuracy.
2. Activation Spike Detection and Layer Selection
The core of SAMPQ is a profiling phase that reliably identifies “spiky” locations in the model. For conventional transformer architectures:
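- Let $X$ denote the activation matrix of a linear layer.
- Compute the mean ($\mu$) and standard deviation ($\sigma$) of $X$.
- Define a threshold $\tau = \mu + k\sigma$, with a fixed multiplier $k$.
- Activation spikes: entries $X_{ij}$ where $|X_{ij}| > \tau$.
- Compute the spike ratio $r$, the fraction of entries of $X$ that are spikes.
- Track the maximum absolute activation $M = \max_{i,j} |X_{ij}|$.
- Mark a layer as spiky if $r > \epsilon$ (with $\epsilon$ small) or $M > M_{\max}$ for a magnitude limit $M_{\max}$.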
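The profiling steps above can be sketched in NumPy as follows. This is a minimal sketch, not the authors' pseudocode: the function name `profile_layer`, the threshold form mean + k·std, and the default values of `k`, `eps`, and `max_abs_limit` are assumptions chosen only for illustration.

```python
import numpy as np

def profile_layer(acts, k=6.0, eps=1e-3, max_abs_limit=50.0):
    """Compute spike statistics for one layer's activation matrix.

    k, eps and max_abs_limit are illustrative placeholders, not values
    taken from the SAMPQ paper.
    """
    acts = np.asarray(acts, dtype=np.float64)
    mu = acts.mean()
    sigma = acts.std()
    tau = mu + k * sigma                     # assumed threshold form
    spike_mask = np.abs(acts) > tau          # entries counted as spikes
    spike_ratio = float(spike_mask.mean())   # fraction of spiky entries
    max_abs = float(np.abs(acts).max())      # largest magnitude seen
    is_spiky = spike_ratio > eps or max_abs > max_abs_limit
    return {
        "tau": float(tau),
        "spike_ratio": spike_ratio,
        "max_abs": max_abs,
        "is_spiky": bool(is_spiky),
    }

# Usage: a well-behaved layer versus the same layer with one large outlier.
calm = np.tile(np.linspace(-1.0, 1.0, 128), (64, 1))
spiky = calm.copy()
spiky[0, 0] = 200.0
print(profile_layer(calm)["is_spiky"], profile_layer(spiky)["is_spiky"])
```

In this toy input the single outlier is too rare to move the spike ratio past `eps`, so the layer is flagged by the maximum-magnitude test instead, which is why both criteria are combined with a logical OR.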
The SAMPQ pseudocode runs a small calibration batch through the pretrained model and assigns a high or low bit-width to each layer based on these statistics (Maisonnave et al., 30 Apr 2025). In the SNN context, saliency is detected online using the Median Absolute Deviation (MAD) method: activations are standardized and marked as “salient” if their z-score exceeds a fixed cutoff. Offline calibration computes the “salient bar” (mean threshold) per layer for group-wise quantization (Wang et al., 22 Oct 2025).
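MAD-based saliency detection can be illustrated with the standard modified z-score. The sketch below is a generic version of that statistic, not the SpikeQuant implementation: the function name `mad_salient_mask` and the cutoff `z_thresh=3.5` are assumptions (3.5 is a common textbook default, not a value confirmed by the source), and 0.6745 is the usual constant that makes the MAD comparable to a standard deviation for normally distributed data.

```python
import numpy as np

def mad_salient_mask(x, z_thresh=3.5):
    """Flag salient activations using the MAD-based modified z-score.

    z_thresh is a hypothetical default, not taken from the paper.
    """
    x = np.asarray(x, dtype=np.float64)
    med = np.median(x)
    mad = np.median(np.abs(x - med))       # median absolute deviation
    if mad == 0.0:
        # Degenerate case: at least half the values equal the median,
        # so the score is undefined; report nothing as salient.
        return np.zeros(x.shape, dtype=bool)
    z = 0.6745 * (x - med) / mad           # modified z-score
    return np.abs(z) > z_thresh

# Usage: only the last activation is far from the bulk of the values.
acts = np.array([0.1, -0.2, 0.05, 0.3, -0.1, 12.0])
print(mad_salient_mask(acts))
```

Because the median and MAD are barely affected by a few extreme values, the cutoff stays stable even when the outliers themselves are very large, which is the usual reason for preferring MAD over mean and standard deviation in this role.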
3. Mixed-Precision Bitwidth Assignment and Quantization Scheme
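For each layer $\ell$, the assigned bitwidth $b_\ell$ is determined by the spike criterion:
- If the spike ratio $r_\ell$ exceeds its tolerance $\epsilon$ or the maximum absolute activation $M_\ell$ exceeds its limit $M_{\max}$, allocate $b_\ell = b_{\text{high}}$ (FP16 or FP8).
- Else, allocate $b_\ell = b_{\text{low}}$ (8 or 6 bits).
- Per-layer quantization uses uniform symmetric scaling:
$$s_\ell = \frac{\max_{i,j} |X_{\ell,ij}|}{2^{b_\ell-1}-1}, \qquad \hat{X}_\ell = s_\ell \cdot \mathrm{clip}\!\left(\mathrm{round}\!\left(X_\ell / s_\ell\right),\; -(2^{b_\ell-1}-1),\; 2^{b_\ell-1}-1\right)$$
- Worst-case quantization error in non-spiky layers: $s_\ell / 2$ per entry.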
- For SNN implementations, activations are quantized group-wise: 4 bits for normal values and 5 bits for salient values. Weights are usually quantized to 4 bits.
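A minimal sketch of the per-layer assignment and of uniform symmetric quantization, assuming a single per-tensor scale and round-to-nearest. The helper names `quantize_symmetric` and `assign_and_quantize`, the dictionary layout, and the choice of FP16 for spiky layers are illustrative assumptions rather than the paper's code. With round-to-nearest and the scale set from the largest magnitude, the reconstruction error of any entry is at most half a quantization step.

```python
import numpy as np

def quantize_symmetric(x, bits):
    """Uniform symmetric quantization with one per-tensor scale."""
    x = np.asarray(x, dtype=np.float64)
    qmax = 2 ** (bits - 1) - 1               # e.g. 127 for 8 bits
    max_abs = np.abs(x).max()
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int32)
    return q, scale

def assign_and_quantize(layers, spiky, low_bits=8):
    """Keep spiky layers in FP16; quantize all others to low_bits.

    layers: dict name -> array; spiky: dict name -> bool (hypothetical layout).
    """
    out = {}
    for name, x in layers.items():
        if spiky[name]:
            out[name] = {"kind": "fp16", "data": np.asarray(x, dtype=np.float16)}
        else:
            q, scale = quantize_symmetric(x, low_bits)
            out[name] = {"kind": f"int{low_bits}", "data": q, "scale": scale}
    return out

# Usage: check the half-step error bound on a non-spiky tensor.
x = np.linspace(-1.0, 1.0, 101)
q, scale = quantize_symmetric(x, bits=8)
max_err = np.abs(q * scale - x).max()
print(max_err <= scale / 2 + 1e-12)
```

The same tensor with one large outlier would get a much larger `scale`, and therefore a much larger half-step error on every ordinary entry, which is the motivation for excluding spiky layers from low-bit quantization.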
The SNN extension proceeds further: each quantized activation produces exactly one spike (TTFS coding) whose latency encodes the value, removing the need for storing the original bit-precise activations on chip.
4. Dequantization-Free Computation With IF Neurons
In SNN-based SAMPQ, the quantized activations are TTFS-encoded as binary spike trains: at each time step $t$, the spike variable of an input is 1 if the firing condition set by its quantized value holds, and 0 otherwise. The linear projection is computed by initializing the membrane potential of an IF neuron and updating it at every step with the quantized weights of the spiking inputs plus a bias, where the bias is computed from the quantization zero-points. The firing threshold is set from the two group quantization scales, which folds all dequantization scaling into the IF neuron threshold.
After the final time step, the count of output spike events yields the whole-threshold part of the result; any residual potential produces a fractional output, ensuring that the entire low-bit dequantized dot product is physically realized directly by the SNN dynamics, eliminating explicit MAC or scaling operations (Wang et al., 22 Oct 2025).
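The idea of folding the scales into the firing threshold can be illustrated with a toy integrate-and-fire simulation. This is a sketch under several assumptions that the text above does not confirm: each input spikes once at step `n_steps - x_q[i]` and then keeps injecting its weight at every remaining step, zero-points are 0 (so the bias term vanishes), the neuron uses a soft reset, and the threshold is `1 / (s_w * s_x)`. The function name `if_neuron_dot` and all parameter names are hypothetical.

```python
import numpy as np

def if_neuron_dot(w_q, x_q, s_w, s_x, n_steps):
    """Toy IF-neuron evaluation of s_w * s_x * dot(w_q, x_q).

    Assumes 0 <= x_q[i] <= n_steps. An input with value 0 would spike
    at step n_steps, outside the window, and so contributes nothing.
    """
    w_q = np.asarray(w_q, dtype=np.int64)
    x_q = np.asarray(x_q, dtype=np.int64)
    spike_time = n_steps - x_q           # larger value -> earlier spike
    theta = 1.0 / (s_w * s_x)            # both scales folded into the threshold
    v = 0.0                              # membrane potential (bias omitted)
    out_spikes = 0
    for t in range(n_steps):
        active = spike_time <= t         # inputs that have already spiked
        v += float(w_q[active].sum())    # accumulate only, no multiplication
        while v >= theta:                # fire, then soft-reset by theta
            v -= theta
            out_spikes += 1
    # Residual is expressed in units of theta; it can be negative when
    # negative weights arrive late in the window.
    return out_spikes, v / theta

# Usage: compare against the explicitly dequantized dot product.
w_q, x_q = [3, -2, 5], [4, 1, 7]
spikes, frac = if_neuron_dot(w_q, x_q, s_w=0.5, s_x=0.25, n_steps=15)
print(spikes + frac, 0.5 * 0.25 * np.dot(w_q, x_q))
```

Under these assumptions, input `i` is active for exactly `x_q[i]` steps, so the potential integrates `dot(w_q, x_q)` using additions only, and the spike count plus the residual reproduces the dequantized result without an explicit rescaling step.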
5. Experimental Results and Performance Metrics
Empirical results established that SAMPQ achieves near–FP16 perplexity and zero-shot accuracy on a range of LLMs (LLaMA2/3, Mistral, OPT), outperforming uniform per-tensor quantization (Maisonnave et al., 30 Apr 2025, Wang et al., 22 Oct 2025).
Table 1: Perplexity and Zero-Shot Accuracy (lower perplexity, higher accuracy are better; SAMPQ values per original studies)
| Model | FP16 Perplexity | 8-bit SAMPQ Perplexity | FP16 Accuracy (%) | 8-bit SAMPQ Accuracy (%) |
|---|---|---|---|---|
| LLaMA3-8B | 6.14 | 8.24 (per-tensor) | 67.9 | 65.5 (per-tensor) |
| LLaMA2-7B | 5.47 | 6.27 | 64.9 | 62.9 |
| LLaMA2-13B | 4.88 | 8.38 | 67.6 | 60.3 |
| Mistral-7B | 5.25 | 10.14 | 68.2 | 59.4 |
For SNN-based SAMPQ on Llama2-7B (W4A4/5), perplexities are 5.79 (WikiText2) and 7.33 (C4) versus 5.68 and 7.08 for FP16, with less than 0.1% zero-shot accuracy drop. Energy consumption is also reduced relative to baseline methods.
Throughput on conventional hardware for 8-bit SAMPQ is improved relative to FP16, since typically only 2–3 linear layers retain high precision, and these constitute less than 5% of FLOPs (Maisonnave et al., 30 Apr 2025).
6. Theoretical Analysis and Implementation Implications
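Theoretical impact analysis shows that the total quantization error $E_{\text{total}}$ is bounded above by the sum of layerwise maximum errors over the non-spiky layers, with spiky layers contributing negligible error: $E_{\text{total}} \le \sum_{\ell \notin \mathcal{S}} \max_{i,j} |\hat{X}_{\ell,ij} - X_{\ell,ij}|$, where $\mathcal{S}$ is the set of spiky layers and $\hat{X}_\ell$ is the quantized activation of layer $\ell$. Empirically, “protecting” only a few layers keeps the quality drop small.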
In SNN variants, analytic and hardware-synthesized energy models show that, relative to W4A4 MAC baselines, spike-based accumulate energy is lower per activation, driven both by reduced on-chip data movement (1 bit per spike vs. 4–5 bits per activation) and by the elimination of high-precision MACs.
7. Limitations, Assumptions, and Extensions
Assumptions: Spike locations are stable across input distributions within LLaMA-derived families, with 1–2 profiling batches sufficient. Only linear projections are quantized; normalization and softmax remain FP16/FP32. Static thresholds generalize within these architectures.
Limitations: SAMPQ is architecture-specific; for unrelated model families (e.g., GPT-NeoX), new profiling is required. 6-bit precision can be unstable without per-token/group scaling or minor QAT, and full support for FP8 remains nascent in deployed hardware/software.
Potential Extensions: Hyperparameter tuning (for the spike-detection thresholds) via grid search, fusion with outlier clustering or SmoothQuant approaches, integration into full QAT workflows, and application to broader model classes (including encoder–decoder transformers) represent active directions (Maisonnave et al., 30 Apr 2025).
By precisely targeting precision allocations to the minority of components responsible for extreme activation values, SAMPQ achieves high compression and performance gains with minimal adaptation requirements, enabling scalable, efficient, and accurate deployment of LLMs in both conventional and neuromorphic scenarios.