Spike-Aware Mixed-Precision (SAMPQ)

Updated 17 April 2026
  • The paper introduces SAMPQ, a novel quantization approach that selectively protects rare, high-magnitude activations to achieve near–floating point accuracy with reduced resource consumption.
  • It employs adaptive statistical thresholds and per-layer analysis to assign higher bitwidths only to spiky layers, minimizing quantization errors in large language models.
  • Experimental evaluations on models like LLaMA demonstrate that SAMPQ maintains competitive perplexity and zero-shot accuracy while offering significant energy, compute, and memory savings.

Spike-Aware Mixed-Precision Quantization (SAMPQ) refers to a class of quantization techniques for LLMs that allocate higher precision only to network elements exhibiting rare, high-magnitude activation “spikes.” Originating in research targeting LLaMA and derivatives, and subsequently extended to spiking neural architectures for energy-efficient inference, SAMPQ exploits the empirical localization of activation outliers to a small subset of model components. By doing so, it achieves near–floating point accuracy with much lower memory, compute, and energy footprints, particularly when compared to uniform low-bit quantization strategies (Maisonnave et al., 30 Apr 2025, Wang et al., 22 Oct 2025).

1. Underlying Motivation and Conceptual Rationale

LLMs frequently exhibit outlier or “salient” activations that, if quantized uniformly across all layers, cause the quantization range to expand, collapsing typical activations into coarse bins and leading to severe accuracy degradation. SAMPQ selectively detects and protects those locations—often a handful of linear projections in transformer architectures—where such spikes occur. The rationale is formally established by observing that activation spikes are sparse and sharply localized, making per-layer or per-group mixed-precision not only effective but also highly efficient (Maisonnave et al., 30 Apr 2025).

In spike-driven SNN extensions, SAMPQ (termed “SpikeQuant” by Wang et al., 22 Oct 2025) maps salient activations to higher-bit precision and re-encodes activations into time-to-first-spike (TTFS) codes. It relies on the brain-inspired integrate-and-fire (IF) paradigm to provide mixed-precision storage while eliminating explicit dequantization. This enables further energy reductions while maintaining high model accuracy.

2. Activation Spike Detection and Layer Selection

The core of SAMPQ is a profiling phase that reliably identifies “spiky” locations in the model. For conventional transformer architectures:

  • Let $X \in \mathbb{R}^{N \times d}$ denote the activation matrix of a linear layer.
  • Compute the mean ($\mu$) and standard deviation ($\sigma$) of $X$.
  • Define a threshold $\tau = \mu + \alpha \cdot \sigma$, typically with $\alpha = 6$.
  • Activation spikes: entries where $|X_{i,j}| > \tau$.
  • Compute the spike ratio $S(X) = \frac{1}{N \cdot d} \sum_{i=1}^{N} \sum_{j=1}^{d} \mathbf{1}\left[|X_{i,j}| > \tau\right]$.
  • Track the maximum absolute activation $M(X) = \max_{i,j} |X_{i,j}|$.
  • Mark a layer as spiky if $S(X) > \rho$ for a small ratio threshold $\rho$, or if $M(X)$ exceeds a fixed magnitude threshold.
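
The profiling steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names, the list-of-rows matrix layout, and the default values for `rho` and `max_abs_limit` are assumptions made for this example.

```python
import statistics


def profile_layer(activations, alpha=6.0):
    """Return (spike_ratio, max_abs) for an N x d activation matrix (list of rows)."""
    flat = [x for row in activations for x in row]
    mu = statistics.fmean(flat)
    sigma = statistics.pstdev(flat)      # population std over all N*d entries
    tau = mu + alpha * sigma             # threshold tau = mu + alpha * sigma
    n_spikes = sum(1 for x in flat if abs(x) > tau)
    return n_spikes / len(flat), max(abs(x) for x in flat)


def is_spiky(activations, alpha=6.0, rho=1e-4, max_abs_limit=50.0):
    # rho and max_abs_limit are hypothetical placeholder values, not from the source.
    spike_ratio, max_abs = profile_layer(activations, alpha)
    return spike_ratio > rho or max_abs > max_abs_limit


# Toy calibration data: 20 x 50 small activations, then one injected spike.
calm = [[0.1 * ((7 * i + 3 * j) % 11 - 5) for j in range(50)] for i in range(20)]
spiked = [row[:] for row in calm]
spiked[3][7] = 100.0

print(is_spiky(calm), is_spiky(spiked))
```

On this toy data only the matrix with the injected value is flagged, because a single entry far above $\tau$ is enough to push the spike ratio past the (assumed) `rho`.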

The SAMPQ pseudocode profiles a small calibration batch through the pretrained model and assigns either a high or a low bit-width to each layer depending on these statistics (Maisonnave et al., 30 Apr 2025). In the SNN context, saliency is detected online using the Median Absolute Deviation (MAD) method: activations are standardized and marked as “salient” if the resulting z-score exceeds a fixed threshold. Offline calibration computes the “salient bar” (mean threshold) per layer for group-wise quantization (Wang et al., 22 Oct 2025).
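
A MAD-based saliency test of the kind described for the SNN setting might look like the sketch below. The 0.6745 scaling constant and the default threshold of 3.5 follow the common modified z-score convention; they are assumptions for illustration, not values taken from the cited work, and `mad_salient_mask` is a hypothetical name.

```python
import statistics


def mad_salient_mask(values, z_threshold=3.5):
    """Flag entries whose MAD-based modified z-score exceeds z_threshold."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        return [False] * len(values)   # degenerate case: no spread to compare against
    # 0.6745 rescales MAD so the score is comparable to a Gaussian z-score.
    return [0.6745 * abs(v - med) / mad > z_threshold for v in values]


acts = [0.1, -0.2, 0.05, 0.3, -0.1, 0.0, 12.0, 0.15]
mask = mad_salient_mask(acts)
print([a for a, m in zip(acts, mask) if m])
```

Because the median and MAD are barely affected by the single large entry, the outlier stands out clearly, which is the usual reason to prefer MAD over mean and standard deviation for online detection.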

3. Mixed-Precision Bitwidth Assignment and Quantization Scheme

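For each layer $\ell$ with activation matrix $X_\ell$, the assigned bitwidth $b_\ell$ is determined by the spike criterion:

  • If $S(X_\ell) > \rho$ or $M(X_\ell)$ exceeds a fixed magnitude threshold, allocate the high-precision format (FP16 or FP8).
  • Else, allocate the low bitwidth (8 or 6 bits).
  • Per-layer quantization uses uniform symmetric scaling, which in its standard form is:

$$s_\ell = \frac{M(X_\ell)}{2^{b_\ell - 1} - 1}, \qquad \hat{X}_\ell = s_\ell \cdot \operatorname{clip}\!\left(\operatorname{round}\!\left(\frac{X_\ell}{s_\ell}\right),\; -(2^{b_\ell - 1} - 1),\; 2^{b_\ell - 1} - 1\right)$$

  • Worst-case quantization error in non-spiky layers: at most half a quantization step, $s_\ell / 2$, under round-to-nearest.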
  • For SNN implementations, activations are quantized group-wise: 4 bits for normal values and 5 bits for salient values. Weights are usually quantized to 4 bits.
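
As a rough illustration of the assignment rule and of uniform symmetric scaling, the sketch below quantizes a small vector with and without a spike. It assumes round-to-nearest fake quantization with the scale taken from the maximum absolute value; the function names and the layer names in the usage note are hypothetical.

```python
def quantize_symmetric(values, bits):
    """Uniform symmetric fake-quantization; returns (dequantized values, scale)."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return [c * scale for c in codes], scale


def assign_bitwidths(spiky_by_layer, high="fp16", low=8):
    """Map each layer name to the high format if flagged spiky, else the low bitwidth."""
    return {name: (high if spiky else low) for name, spiky in spiky_by_layer.items()}


typical = [-1.0, -0.3, 0.0, 0.26, 0.77, 1.0]
deq, scale = quantize_symmetric(typical, bits=8)
# Round-to-nearest keeps every in-range value within half a step of its original.
assert all(abs(a - b) <= scale / 2 + 1e-12 for a, b in zip(typical, deq))

# A single spike stretches the range so small values collapse onto the zero bin.
deq_spiked, _ = quantize_symmetric(typical + [100.0], bits=8)
print(deq_spiked[1], deq_spiked[3])
```

With the spike present, the entries -0.3 and 0.26 both quantize to zero, which is the range-expansion effect that motivates keeping spiky layers at higher precision. A call such as `assign_bitwidths({"q_proj": False, "down_proj": True})` would then keep only the flagged layer in the high-precision format.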

The SNN extension proceeds further: each quantized activation produces exactly one spike (TTFS coding) whose latency encodes the value, removing the need for storing the original bit-precise activations on chip.
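
One simple way to picture TTFS coding is the sketch below, in which an unsigned quantized code is turned into a spike train with a single spike whose time index equals the code. The direct mapping from code to latency, the window length of $2^b$ steps, and the function names are assumptions for illustration; the cited work may use a different mapping.

```python
def ttfs_encode(code, bits):
    """Encode an unsigned integer code as a spike train with exactly one spike."""
    steps = 2 ** bits
    if not 0 <= code < steps:
        raise ValueError("code out of range for the given bitwidth")
    return [1 if t == code else 0 for t in range(steps)]


def ttfs_decode(train):
    """Recover the code from the latency (index) of the first spike."""
    return train.index(1)


train = ttfs_encode(5, bits=4)
print(sum(train), ttfs_decode(train))   # one spike, latency 5
```

Only one bit is active per activation regardless of the bitwidth, which is why the value never needs to be stored or moved in its bit-precise form.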

4. Dequantization-Free Computation With IF Neurons

In SNN-based SAMPQ, the quantized activations are TTFS-encoded: each activation emits a spike at the single time step determined by its quantized value and stays silent at all other steps. The linear projection is computed by initializing the membrane potential of each IF neuron and updating it at every time step from the incoming spikes and the layer weights, together with a bias term computed from the quantization zero-points. The firing threshold is derived from the group quantization scales, which folds all dequantization scaling into the IF neuron threshold.

After the final time step, the sum of output spike events yields the projection result; any residual potential produces a fractional output. The entire low-bit dequantized dot product is therefore realized directly by the SNN dynamics, eliminating explicit MAC or scaling operations (Wang et al., 22 Oct 2025).

5. Experimental Results and Performance Metrics

Reported results indicate that SAMPQ achieves near–FP16 perplexity and zero-shot accuracy on a range of LLMs (LLaMA2/3, Mistral, OPT), outperforming uniform per-tensor quantization (Maisonnave et al., 30 Apr 2025, Wang et al., 22 Oct 2025).

Table 1: Perplexity and Zero-Shot Accuracy (lower perplexity, higher accuracy are better; SAMPQ values per original studies)

| Model | FP16 Perplexity | 8-bit SAMPQ Perplexity | FP16 Accuracy (%) | 8-bit SAMPQ Accuracy (%) |
|---|---|---|---|---|
| LLaMA3-8B | 6.14 | 8.24 (per-tensor) | 67.9 | 65.5 (per-tensor) |
| LLaMA2-7B | 5.47 | 6.27 | 64.9 | 62.9 |
| LLaMA2-13B | 4.88 | 8.38 | 67.6 | 60.3 |
| Mistral-7B | 5.25 | 10.14 | 68.2 | 59.4 |

For SNN-based SAMPQ on LLaMA2-7B (W4A4/5), perplexities are 5.79 (WikiText2) and 7.33 (C4) vs. 5.68 and 7.08 for FP16, with less than 0.1% zero-shot accuracy drop. Energy consumption is also reported to be lower than that of baseline methods.

Throughput on conventional hardware for 8-bit SAMPQ is reported to be higher than that of FP16, since typically only 2–3 linear layers retain high precision and these constitute less than 5% of FLOPs (Maisonnave et al., 30 Apr 2025).

6. Theoretical Analysis and Implementation Implications

Theoretical impact analysis shows that the total quantization error $E_{\text{total}}$ is bounded above by the sum of layerwise maximum errors in non-spiky layers, with spiky layers contributing negligible error. Thus, $E_{\text{total}} \le \sum_{\ell \notin \mathcal{S}} \max_{i,j} |X_{\ell,i,j} - \hat{X}_{\ell,i,j}|$, where $\mathcal{S}$ is the set of spiky layers and $\hat{X}_\ell$ is the quantized activation of layer $\ell$. Empirically, “protecting” only a few layers keeps the quality drop small.

In SNN variants, analytic and hardware-synthesized energy models show that, relative to W4A4 MAC baselines, spike-based accumulate energy is lower per activation, driven both by reduced on-chip data movement (1 bit per spike vs. 4–5 bits per activation) and by the elimination of high-precision MACs.

7. Limitations, Assumptions, and Extensions

Assumptions: Spike locations are stable across input distributions within LLaMA-derived families, with 1–2 profiling batches sufficient. Only linear projections are quantized; normalization and softmax remain FP16/FP32. Static thresholds generalize within these architectures.

Limitations: SAMPQ is architecture-specific; for unrelated model families (e.g., GPT-NeoX), new profiling is required. 6-bit precision can be unstable without per-token/group scaling or minor QAT, and full support for FP8 remains nascent in deployed hardware/software.

Potential Extensions: Hyperparameter tuning (for the spike-detection thresholds) via grid search, fusion with outlier clustering or SmoothQuant approaches, integration into full QAT workflows, and application to broader model classes (including encoder–decoder transformers) represent active directions (Maisonnave et al., 30 Apr 2025).

By precisely targeting precision allocations to the minority of components responsible for extreme activation values, SAMPQ achieves high compression and performance gains with minimal adaptation requirements, enabling scalable, efficient, and accurate deployment of LLMs in both conventional and neuromorphic scenarios.
