
Pertoken SwiGLU in Transformers

Updated 11 March 2026
  • Pertoken SwiGLU is a per-token activation function in Transformers that uses dual linear projections to compute a swish-gated output, enhancing model expressivity.
  • A quadrant-based sign decomposition of its gate and in preactivations reveals distinct token-level operating regimes, helping to interpret and refine network behavior.
  • Optimizations such as DeepFusionKernel and dynamic sparsification reduce computational overhead, improving throughput and facilitating specialized MoE architectures.

Pertoken SwiGLU refers to the dynamics, analysis, and optimization of Swish-Gated Linear Unit (SwiGLU) activations and their associated computation on a per-token basis within Transformer architectures. This topic encompasses the mathematical and empirical properties of SwiGLU blocks, their computational implications for tokenwise inference, their role in model expressivity and specialization, and recent advances in hardware, sparsification, and interpretability methodologies tailored to per-token behavior.

1. Mathematical Foundation and Per-Token Structure

The SwiGLU activation, as instantiated in modern Transformers, is defined as follows. For an input $x \in \mathbb{R}^{d_\mathrm{model}}$ to an MLP or expert sublayer, two independent linear projections yield the “gate” and “in” preactivations:

$$\mathrm{gate}_j = \langle w_{g,j}, x \rangle + b_{g,j}, \qquad \mathrm{in}_j = \langle w_{i,j}, x \rangle + b_{i,j}.$$

The activation for neuron $j$ is computed as

$$\mathrm{out}_j = \mathrm{Swish}(\mathrm{gate}_j) \cdot \mathrm{in}_j,$$

where $\mathrm{Swish}(u) = u \cdot \sigma(u)$ and $\sigma(u) = 1/(1+e^{-u})$. In vectorized form,

$$\mathrm{SwiGLU}(x) = \mathrm{Swish}(W_g x + b_g) \odot (W_i x + b_i).$$

This construct modulates each "in" pathway by learned, context-dependent gating, yielding richer per-token nonlinearities than functions such as GELU or ReLU (Gerstner et al., 27 Feb 2026, Tanase et al., 21 Jul 2025).

For multi-token sequences, these computations are carried out independently for each token in the sequence, leading to tokenwise dynamics critical for both downstream task performance and analysis of network behavior.
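As a concrete reference, the following is a minimal PyTorch sketch of the block defined above. The module name, dimensions, and inclusion of biases are illustrative rather than tied to any particular model, and the usual down-projection back to $d_\mathrm{model}$ is omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """Minimal per-token SwiGLU: out = Swish(W_g x + b_g) ⊙ (W_i x + b_i)."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.gate_proj = nn.Linear(d_model, d_ff)  # W_g, b_g
        self.in_proj = nn.Linear(d_model, d_ff)    # W_i, b_i

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); each token is transformed independently.
        gate = self.gate_proj(x)     # "gate" preactivation
        inner = self.in_proj(x)      # "in" preactivation
        return F.silu(gate) * inner  # Swish(u) = u·σ(u), i.e. SiLU

# Usage on a toy batch of 2 sequences with 5 tokens each (hypothetical sizes).
x = torch.randn(2, 5, 512)
out = SwiGLU(d_model=512, d_ff=2048)(x)
print(out.shape)  # torch.Size([2, 5, 2048])
```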

2. Sign Quadrants and Tokenwise Functional Modes

A distinctive analytic methodology for per-token SwiGLU arises from considering the signs of the gate and in preactivations. Each token activation falls into one of four regimes:

  • $(++)$: $\mathrm{gate} > 0$, $\mathrm{in} > 0$
  • $(+-)$: $\mathrm{gate} > 0$, $\mathrm{in} < 0$
  • $(-+)$: $\mathrm{gate} < 0$, $\mathrm{in} > 0$
  • $(--)$: $\mathrm{gate} < 0$, $\mathrm{in} < 0$

Empirical analysis with GLUScope on OLMo-7B-0424 (Gerstner et al., 27 Feb 2026) demonstrates that tokens fall predominantly in the $(+-)$ region (67.7%), while $(++)$, strong simultaneous activation, is rare (2.7%). Distribution across these quadrants correlates with distinct functional motifs:

  • $(++)$: Rare; yields strong positive outputs without a consistent semantic motif.
  • $(+-)$: Dominant; large negative deflections, associated with adverbial connectives and discourse shifts.
  • $(-+)$: Negative/small outputs; no coherent pattern, acts as noise/background.
  • $(--)$: Moderate positive outputs; highly co-occurrent with specific constructions, e.g., “once ... again”.

Partitioning by sign unveils multiplexed neuron roles, illustrating that per-token SwiGLU circuitry enables simultaneous encoding of multiple functional subspaces inaccessible to strictly positive-activated units.
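A minimal sketch of how such quadrant occupancy statistics could be computed from captured gate and in preactivations (e.g., collected via hooks on the two projections); this illustrates the decomposition itself and is not the GLUScope implementation.

```python
import torch

def quadrant_fractions(gate: torch.Tensor, inner: torch.Tensor) -> dict:
    """Fraction of (token, neuron) activations falling in each sign quadrant.

    gate, inner: preactivation tensors of identical shape, e.g. (num_tokens, d_ff).
    """
    total = gate.numel()
    return {
        "++": ((gate > 0) & (inner > 0)).sum().item() / total,
        "+-": ((gate > 0) & (inner < 0)).sum().item() / total,
        "-+": ((gate < 0) & (inner > 0)).sum().item() / total,
        "--": ((gate < 0) & (inner < 0)).sum().item() / total,
    }

# Toy usage with random preactivations; a real analysis would use activations
# captured from a model while it processes a corpus.
print(quadrant_fractions(torch.randn(1000, 2048), torch.randn(1000, 2048)))
```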

3. Computational Efficiency and Hardware Concerns

SwiGLU blocks incur a per-token computational and memory penalty relative to single-projection FFNs due to their use of two full-rank projection matrices. For a hidden size $h$ and FFN dimension $d$ (e.g., $h = 2048$, $d = 8192$ in Llama-scale models), per-token memory bandwidth for the “up” stage is doubled:

  • SwiGLU: $2 \times 16 \cdot h \cdot d$ bits $\approx 67$ MB per token per layer (FP16)
  • GELU/linear: $16 \cdot h \cdot d$ bits (half as much) (Tajima et al., 29 Jun 2025)
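These figures follow from simple arithmetic; a quick check, assuming FP16 weights (16 bits each) and the quoted Llama-scale dimensions:

```python
# Per-token, per-layer weight read for the "up" stage, in megabytes.
h, d, bits = 2048, 8192, 16      # hidden size, FFN dimension, FP16

swiglu_bits = 2 * bits * h * d   # two projection matrices (gate and in)
gelu_bits = bits * h * d         # single projection matrix

print(swiglu_bits / 8 / 1e6)     # ≈ 67.1 MB
print(gelu_bits / 8 / 1e6)       # ≈ 33.6 MB
```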

This memory bottleneck is a key target for hardware optimization. Techniques include:

  • Sharing weights across gate and value projections with learned binary masks (SwiMGLU/FlashMGLU), reducing memory traffic by 47% and latency by a factor of 6.25–19.7×, with no loss in accuracy (Tajima et al., 29 Jun 2025).
  • Deep kernel fusion (DeepFusionKernel) streams all matrix multiplications, nonlinearities, and gating into one register-resident kernel, reducing high-bandwidth memory (HBM) traffic per token by $\sim 237$ KB (for $d_\mathrm{model} = 4096$) and achieving sustained end-to-end speedups (9–13% depending on hardware and sequence length) (Zhang et al., 12 Feb 2026).
  • Pruning and sparsification methods, such as Dynamic Input Pruning (DIP), enforce per-token activation sparsity, cutting weight loads by up to 46% with a $<0.2$ PPL penalty (Federici et al., 2024).

A comparative table (selected entries from Tajima et al., 29 Jun 2025):

| Method | Memory Read (Llama-scale) | Latency (RTX 5090, bs=1) |
|---|---|---|
| SwiGLU (baseline) | ~67 MB | 0.521 ms |
| SwiMGLU (nₘ=1, Flash) | ~35.5 MB | 0.0834 ms |
| SwiMGLU (nₘ=8, Flash) | ~49.5 MB | 0.0834 ms |

4. Pertoken SwiGLU in Model Specialization and MoE Architectures

SwiGLU per-token activations have been leveraged for explicit expert specialization and routing in Mixture-of-Experts (MoE) and modular architectures:

  • In vision and language MoE designs, each expert implements a SwiGLU MLP (Tan, 2024); per-token gating routes each token to a subset of experts, and parameter sharing across experts can further reduce memory and compute.
  • Regularization losses computed on per-token SwiGLU activations foster expert-role diversification (intra-layer: penalizing similarity among experts specializing in the same tokens) and path-consistency (cross-layer: maximizing joint routing probability across depth) (Hu et al., 15 Feb 2026). These regularizers, applied alongside standard load-balancing losses, sharpen expert specialization (e.g., a perplexity improvement from 12.50 to 12.27), improve routing stability, and yield 5–7% inference throughput gains.

The explicit use of per-token SwiGLU activations in objective construction enables direct manipulation of routing entropy, diversity, and specialization, all at the token level and with minimal architectural intrusion.
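As an illustration of the structural pattern (not any specific published architecture), the sketch below routes each token to its top-k SwiGLU experts and weights their outputs by the router scores. The dense loop over experts is kept for readability; practical implementations batch tokens by expert.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUExpert(nn.Module):
    """A standard SwiGLU MLP expert with gate, in, and down projections."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.gate_proj = nn.Linear(d_model, d_ff, bias=False)
        self.in_proj = nn.Linear(d_model, d_ff, bias=False)
        self.down_proj = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x):
        return self.down_proj(F.silu(self.gate_proj(x)) * self.in_proj(x))

class TokenRoutedMoE(nn.Module):
    """Per-token top-k routing over SwiGLU experts."""
    def __init__(self, d_model: int, d_ff: int, n_experts: int, k: int = 2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(SwiGLUExpert(d_model, d_ff) for _ in range(n_experts))
        self.k = k

    def forward(self, x):                                  # x: (num_tokens, d_model)
        probs = F.softmax(self.router(x), dim=-1)          # routing probabilities
        weights, idx = probs.topk(self.k, dim=-1)          # per-token expert choice
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e                   # tokens sent to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

# Toy usage: 6 tokens, 4 experts, top-2 routing (all sizes hypothetical).
moe = TokenRoutedMoE(d_model=64, d_ff=256, n_experts=4, k=2)
print(moe(torch.randn(6, 64)).shape)  # torch.Size([6, 64])
```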

5. Efficiency, FLOPs, and Per-Token Pruning Strategies

Pertoken SwiGLU computations display higher FLOPs/token than single-projection nonlinearities but admit efficiency gains through structural choices:

  • Baseline SwiGLU LLMs: $24 d^2$ FLOPs/token for $d_\mathrm{model} = d$ and $d_\mathrm{ff} = 4d$ (Zhang et al., 12 Feb 2026).
  • Lightweight models in resource-constrained domains (e.g., L-SwiGLU for NLOS localization) leverage smaller expansion factors ($h_\mathrm{swi}$), achieving 10–25% FLOP reductions per token while retaining the gating nonlinearity (Masrur et al., 14 Jan 2025).
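The $24 d^2$ figure can be recovered by counting multiply-accumulates (two FLOPs each) in the three projections of the standard SwiGLU FFN (gate, in, and down), ignoring the comparatively negligible elementwise Swish and gating operations:

$$\mathrm{FLOPs/token} = \underbrace{2\, d\, d_\mathrm{ff}}_{W_g} + \underbrace{2\, d\, d_\mathrm{ff}}_{W_i} + \underbrace{2\, d_\mathrm{ff}\, d}_{W_\mathrm{down}} = 6\, d\, d_\mathrm{ff} = 24\, d^2 \quad \text{for } d_\mathrm{ff} = 4d.$$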

Sparsification methods such as DIP select, per token and per layer, a fixed number $k$ of active units, masking weight columns to induce sparsity. When combined with lightweight LoRA adapters, DIP prunes up to 50% of the MLP columns with only minor loss ($\Delta$PPL $< 0.2$), as demonstrated on Phi-3-Medium, with a corresponding 55% increase in token throughput due to reduced DRAM load (Federici et al., 2024). Unlike the activations of ReLU-based LLMs, SwiGLU activations lack natural zeros, rendering predictor-based or thresholded static sparsity methods suboptimal for per-token regimes.
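The per-token selection at the heart of such methods can be sketched as follows. This is a simplified magnitude-based criterion for illustration only, not the actual DIP selection rule, and it computes the dense activation first purely to show the masking semantics; a deployed system selects units before loading the corresponding weights.

```python
import torch
import torch.nn.functional as F

def topk_token_sparsify(gate: torch.Tensor, inner: torch.Tensor, k: int):
    """Keep, per token, only the k FFN units with the largest |Swish(gate)·in|.

    gate, inner: (num_tokens, d_ff) preactivations.
    Returns the sparsified activations and the per-token boolean mask that would,
    in practice, determine which weight columns/rows need to be loaded.
    """
    act = F.silu(gate) * inner                          # dense SwiGLU activation
    _, idx = act.abs().topk(k, dim=-1)                  # per-token unit selection
    mask = torch.zeros_like(act, dtype=torch.bool).scatter_(-1, idx, True)
    return act * mask, mask

acts, mask = topk_token_sparsify(torch.randn(4, 1024), torch.randn(4, 1024), k=512)
print(mask.sum(dim=-1))  # every token keeps exactly k = 512 active units
```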

6. Interpretability and Functional Dissection of SwiGLU Neurons

Recent tools such as GLUScope have enabled direct, quadrant-based dissection of per-token SwiGLU neuron function (Gerstner et al., 27 Feb 2026). By partitioning a neuron's activation space according to the $(\mathrm{sign}(\mathrm{gate}), \mathrm{sign}(\mathrm{in}))$ quadrants, interpretability researchers can:

  • Quantify the empirical occupation frequency of each activation regime.
  • Display context-rich textual examples representative of each regime.
  • Surface specialized behaviors, such as a tight association with a linguistic construction (“once ... again”) that appears only in a specific sign quadrant and would be missed by considering only the top-k overall activation values.

This quadrant approach reveals that single SwiGLU units multiplex at least four distinct operational modes, providing empirical grounding for the hypothesis that gating structures—especially SwiGLU—permit fine-grained, context-dependent specialization at the token level within Transformer depth.
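A sketch of how per-quadrant exemplars might be retrieved from cached per-token activations of a single neuron; the function below is illustrative and does not reproduce GLUScope's interface or ranking criteria.

```python
import torch
import torch.nn.functional as F

def top_tokens_per_quadrant(gate, inner, token_strings, k=5):
    """For one neuron, return the k tokens with the largest |output| in each sign quadrant.

    gate, inner: 1-D tensors of this neuron's preactivations over a token corpus.
    token_strings: list of the corresponding token texts.
    """
    out = F.silu(gate) * inner
    quadrants = {
        "++": (gate > 0) & (inner > 0),
        "+-": (gate > 0) & (inner < 0),
        "-+": (gate < 0) & (inner > 0),
        "--": (gate < 0) & (inner < 0),
    }
    exemplars = {}
    for name, in_quadrant in quadrants.items():
        idx = in_quadrant.nonzero(as_tuple=True)[0]
        if idx.numel() == 0:
            exemplars[name] = []
            continue
        top = idx[out[idx].abs().topk(min(k, idx.numel())).indices]
        exemplars[name] = [(token_strings[i], round(out[i].item(), 3)) for i in top.tolist()]
    return exemplars
```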

7. Architectural Variants and Broader Impact

SwiGLU's per-token dynamics have motivated a variety of architectural choices:

  • Tokenization innovation, as in the Supernova model, ties SwiGLU activation structure to efficient representation with reduced parameter count, demonstrating high performance at a lower computational budget (Tanase et al., 21 Jul 2025).
  • Efficient tokenization and gating approaches in domain-adapted Transformers (e.g., sensor-wise tokenization for indoor localization in L-SwiGLU Transformers) capitalize on per-token SwiGLU dynamics to achieve both higher task accuracy and greater FLOP efficiency (Masrur et al., 14 Jan 2025).
  • In vision, MoE-Transformer architectures using SwiGLU blocks, combined with grouped query attention and depthwise scaling, allow for drastic network size reduction without sacrificing competitiveness (Tan, 2024).

The per-token focus in SwiGLU analysis, optimization, and specialization underpins ongoing progress in both the scaling and the deployment efficiency of state-of-the-art neural architectures for language, vision, and domain-specific tasks.

