Spark FFN: Sparse Transformer Efficiency
- Spark FFN is a variant of feed-forward networks that employs top-$k$ masking, statistical thresholding, and parameter-efficient gating to enforce activation sparsity in Transformers.
- The design systematically reduces per-token FLOPs and wall-time by leveraging the lazy neuron phenomenon without compromising model quality or standard training dynamics.
- It integrates a linear-time, almost-everywhere-differentiable top-$k$ approximation and a predictor-value split to achieve hardware-friendly sparse computation in both training and inference.
Spark FFN is a feed-forward network (FFN) variant for Transformers that explicitly exploits activation sparsity by means of top-$k$ masking, statistical thresholding, and parameter-efficient gating. Developed in the context of the Spark Transformer architecture, Spark FFN systematically reduces computational cost in both training and inference, achieving significant FLOPs and wall-time reductions while preserving model quality and training dynamics. The core innovation lies in scalable, hardware-friendly sparsification of FFN activations and a low-cost, differentiable predictor for selecting active neurons, thereby reactivating “lazy neuron” sparsity in modern Transformer models (You et al., 7 Jun 2025).
1. Background: Standard Transformer FFN and the Lazy Neuron Phenomenon
In the canonical Transformer layer, the FFN processes each token embedding using a two-layer MLP:
$$y = W_2\,\sigma(W_1 x),$$
with $W_1 \in \mathbb{R}^{d_{\text{ff}} \times d_{\text{model}}}$, $W_2 \in \mathbb{R}^{d_{\text{model}} \times d_{\text{ff}}}$, and $\sigma$ a nonlinearity such as GELU or ReLU. This requires approximately $4\,d_{\text{model}}\,d_{\text{ff}}$ FLOPs per token (two dense matrix-vector products, each costing about $2\,d_{\text{model}}\,d_{\text{ff}}$ multiply-adds).
Li et al. (2022) identified the “lazy neuron” phenomenon: when the nonlinearity is ReLU, only a small subset of the hidden units have nonzero activations per token. This intrinsic sparsity allows, in principle, for skipping the second-layer multiplications on inactive neurons, but not the initial product $W_1 x$, which must still be computed in full to discover which neurons are active. The challenge is to efficiently and explicitly harness this per-token sparsity without degrading model quality or increasing parameter count (You et al., 7 Jun 2025).
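To make the baseline concrete, here is a minimal NumPy sketch of the dense FFN. Dimensions and random initialization are illustrative; a trained model exhibits far higher ReLU sparsity than random weights do, which is precisely the lazy-neuron effect.

```python
import numpy as np

# Illustrative sizes and random init, not the paper's configuration.
rng = np.random.default_rng(0)
d_model, d_ff = 64, 256
W1 = rng.standard_normal((d_ff, d_model)) / np.sqrt(d_model)
W2 = rng.standard_normal((d_model, d_ff)) / np.sqrt(d_ff)
x = rng.standard_normal(d_model)

z = W1 @ x                 # first dense product: ~2*d_ff*d_model FLOPs
h = np.maximum(z, 0)       # ReLU: inactive units are exactly zero
y = W2 @ h                 # second dense product: ~2*d_model*d_ff FLOPs

# Both products are dense, so the cost is ~4*d_model*d_ff FLOPs per token
# regardless of how many entries of h are zero.
inactive_frac = float(np.mean(h == 0))
```

Even at random initialization roughly half of the hidden units are inactive, yet the dense matmuls pay for all of them; exploiting the zeros requires the explicit machinery of the following sections.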
2. Explicit Sparsification via Top-$k$ Masking
Spark FFN enforces sparsity by selecting the top-$k$ activations from the pre-nonlinearity score vector. The masking process is:
- Compute $z = W_1 x$, where $W_1 \in \mathbb{R}^{d_{\text{ff}} \times d_{\text{model}}}$,
- Compute $m = \mathrm{Top}_k(z)$,
- Propagate via $y = W_2\,\sigma(m)$.
Here, $\mathrm{Top}_k$ retains only the $k$ largest elements (per token), annihilating the rest. If $k \ll d_{\text{ff}}$, the second-layer multiplication is reduced to $2\,d_{\text{model}}\,k$ FLOPs from $2\,d_{\text{model}}\,d_{\text{ff}}$. However, selection by sorting is $O(d_{\text{ff}} \log d_{\text{ff}})$ and non-differentiable, necessitating an efficient relaxation (You et al., 7 Jun 2025).
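The hard mask can be sketched in a few lines of NumPy (illustrative; `argpartition` gives average linear-time selection here, while the complexity bound in the text refers to the generic sorting route):

```python
import numpy as np

def topk_mask(z, k):
    """Hard Top-k: keep the k largest entries of z, zero the rest.
    The resulting mask is non-differentiable in k and in the
    selection, which is what motivates the statistical relaxation."""
    m = np.zeros_like(z)
    idx = np.argpartition(z, -k)[-k:]   # indices of the k largest entries
    m[idx] = z[idx]
    return m

z = np.array([0.1, -2.0, 3.5, 0.7, 1.2])
m = topk_mask(z, k=2)   # only the two largest entries (3.5 and 1.2) survive
```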
3. Statistical Top-$k$: Linear-Time Differentiable Masking
To overcome the inefficiency of exact Top-$k$, Spark FFN introduces the “statistical Top-$k$” operator, which approximates Top-$k$ selection in linear time and is differentiable almost everywhere. For a vector $z \in \mathbb{R}^{d_{\text{ff}}}$ and target $k$:
- Compute the threshold $\theta = \mu + \sigma_z\,Q(1 - k/d_{\text{ff}})$, where $\mu$ and $\sigma_z$ are the mean and standard deviation of the entries of $z$, and $Q$ is the standard Gaussian quantile function,
- Apply soft-thresholding: $m_i = \max(z_i - \theta,\, 0)$.
This procedure sets elements below the threshold $\theta$ to zero and subtracts $\theta$ from the others. Since the mean and standard deviation are computable in $O(d_{\text{ff}})$ and soft-thresholding is elementwise, the total computation is $O(d_{\text{ff}})$. The approach yields approximately $k$ surviving entries under a Gaussian fit assumption, which is supported for FFN pre-activations in practice (You et al., 7 Jun 2025).
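The operator above can be sketched directly in NumPy, using the standard library's `NormalDist` for the Gaussian quantile $Q$ (a minimal sketch with illustrative dimensions, not the paper's kernel):

```python
import numpy as np
from statistics import NormalDist

def statistical_topk(z, k):
    """Statistical Top-k: O(d_ff) soft-thresholding at the Gaussian
    quantile implied by z's empirical mean and standard deviation."""
    mu, sigma = z.mean(), z.std()
    theta = mu + sigma * NormalDist().inv_cdf(1.0 - k / z.size)
    return np.maximum(z - theta, 0.0)

rng = np.random.default_rng(1)
z = rng.standard_normal(4096)          # pre-activations, roughly Gaussian
m = statistical_topk(z, k=256)
survivors = int(np.count_nonzero(m))   # close to k under the Gaussian fit
```

When the entries of `z` really are near-Gaussian, the number of survivors concentrates around the target `k`, with no sorting and no learned parameters.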
4. Predictor-Value Decomposition and Efficient Sparse Computation
Spark FFN partitions the input and first FFN layer for further efficiency. The weight matrix $W_1$ and input $x \in \mathbb{R}^{d_{\text{model}}}$ are split:
- Predictor block: $W_1^{\text{pred}} \in \mathbb{R}^{d_{\text{ff}} \times r}$, operating on the first $r$ coordinates $x^{\text{pred}}$,
- Value block: $W_1^{\text{rest}} \in \mathbb{R}^{d_{\text{ff}} \times (d_{\text{model}} - r)}$, operating on the remaining coordinates $x^{\text{rest}}$.
The mechanism is:
- Compute predictor scores: $s = W_1^{\text{pred}}\,x^{\text{pred}}$,
- Build mask: $m = \mathrm{StatTop}_k(s)$,
- Compute values: $v = W_1^{\text{rest}}\,x^{\text{rest}}$, restricted to the rows where $m \neq 0$,
- Select and activate: $h = \sigma(m) \odot v$,
- Output: $y = W_2\,h$.
Because $m$ is (approximately) $k$-sparse, the expensive value and output matrix multiplications can be performed as sparse vector–matrix products. The predictor block's cost is $2\,r\,d_{\text{ff}}$ FLOPs; the value block and output cost $2\,(d_{\text{model}} - r)\,k$ and $2\,d_{\text{model}}\,k$ FLOPs, respectively. Optimal empirical performance is reported for an intermediate predictor width $r$ (You et al., 7 Jun 2025).
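A back-of-envelope FLOP count makes the payoff of the split concrete. All dimensions below are hypothetical round numbers, not the paper's configuration:

```python
# Hypothetical round-number dimensions; a multiply+add pair = 2 FLOPs.
d_model, d_ff = 2048, 8192
r = d_model // 2          # assumed predictor width
k = 656                   # assumed sparsity target, ~8% of d_ff

dense = 2 * 2 * d_model * d_ff          # two dense matmuls (baseline FFN)
predictor = 2 * r * d_ff                # s = W1_pred @ x_pred
value = 2 * (d_model - r) * k           # only rows with a nonzero mask
output = 2 * d_model * k                # sparse second-layer product
spark = predictor + value + output
reduction = 1.0 - spark / dense         # fraction of FFN FLOPs saved
```

Under these assumed numbers the split saves roughly two thirds of the FFN FLOPs, with the dense predictor product dominating the remaining cost; this is why the predictor width $r$ trades off prediction quality against overhead.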
5. Measured Efficiency, Sparsity, and Model Quality
On a 2B-parameter Transformer pretrained according to the Gemma-2 recipe, Spark FFN achieves:
- Activation sparsity: high, with only the top-$k$ of $d_{\text{ff}}$ hidden units active per token,
- End-to-end per-token FLOPs reduction (72% in the FFN, 75% in attention dot-products),
- Wall-time speedup: gains in both prefill and decode on a 4-core CPU, with further speedups on an NVIDIA T4 GPU,
- No change to the optimizer, learning rate schedule, or pretraining curriculum.
Empirically, this procedure yields near-zero impact on pretraining loss or downstream quality, and enables hardware-efficient implementations using specialized sparse kernels (You et al., 7 Jun 2025).
6. Architectures, Hyper-Parameters, and Implementation
Key configurable aspects include:
- $k$: Active neurons per token (a small fraction of $d_{\text{ff}}$ in the Gemma-2-based models),
- $r$: Predictor width, chosen as a multiple of $256$ per the Gemma-2 constraint,
- Thresholding: Statistical Top-$k$ is parameter-free, almost-everywhere differentiable, and runs in $O(d_{\text{ff}})$,
- Training: No modifications to standard Transformer training pipeline.
In attention, a similar predictor-value split is employed with a halved predictor width, and per-token attention is restricted to the top $256$ keys.
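As a rough illustration of restricting each query to its top-scoring keys, here is a hard top-$k$ attention sketch for a single query (hypothetical; Spark's actual selection is predictor-based via statistical Top-$k$, not an exact sort):

```python
import numpy as np

def topk_attention(q, K, V, k):
    """Attend only to the k highest-scoring keys for this query;
    all other logits are masked to -inf before the softmax.
    (Hard top-k for illustration only.)"""
    logits = K @ q / np.sqrt(q.size)           # (n_keys,) scaled dot-products
    keep = np.argpartition(logits, -k)[-k:]    # surviving key indices
    masked = np.full_like(logits, -np.inf)
    masked[keep] = logits[keep]
    w = np.exp(masked - masked[keep].max())    # exp(-inf) = 0 for masked keys
    w /= w.sum()                               # softmax over the k survivors
    return w @ V

rng = np.random.default_rng(2)
n_keys, d = 512, 64
q = rng.standard_normal(d)
K = rng.standard_normal((n_keys, d))
V = rng.standard_normal((n_keys, d))
out = topk_attention(q, K, V, k=256)
```

Only the $k$ surviving keys contribute to the value aggregation, so the $w \cdot V$ product can likewise be computed as a sparse reduction.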
7. Pseudocode and Integration in Transformer Layers
The Spark FFN forward pass for a single token involves:
```python
x_pred, x_rest = x[:r], x[r:]            # partition input
# W1_pred, W1_rest, W2: parameter matrices; k: sparsity target
scores = x_pred @ W1_pred                # predictor scores (shape: d_ff)
mu, sigma = mean(scores), std(scores)
theta = mu + sigma * Q(1 - k / d_ff)     # Q = standard Gaussian quantile
mask = maximum(scores - theta, 0)        # soft thresholding, ~k nonzeros
vals = sparse_vector_matmul(mask > 0, x_rest, W1_rest)  # only selected rows
hidden = GELU(mask) * vals
y = sparse_vector_matmul(hidden, W2)     # sparse matrix multiplication
```
Gradients flow through the statistical Top-$k$ everywhere except at the threshold's zero crossings. Inference reuses the same forward pass, with per-token FFN computation dropping from roughly $4\,d_{\text{model}}\,d_{\text{ff}}$ FLOPs to $2\,r\,d_{\text{ff}} + 2\,(d_{\text{model}} - r)\,k + 2\,d_{\text{model}}\,k$. Practical implementations employ specialized SIMD/tiling/CUDA kernels for memory and compute efficiency (You et al., 7 Jun 2025).
Spark FFN reactivates latent activation sparsity in Transformer FFNs by combining explicit top-$k$ masking, scalable statistical thresholding, and parameter reuse via the predictor-value split, resulting in substantial computational savings and wall-time improvements without compromising model quality or standard training dynamics.