BlockFFN: Efficient MoE for LLMs

Updated 11 June 2026

BlockFFN is a mixture-of-experts feed-forward network that maximizes hardware efficiency and activation sparsity using fully differentiable, flexible routing.
It combines ReLU- and RMSNorm-based routing with sparsity-aware training objectives to reach over 80% token-level and over 70% chunk-level sparsity.
Custom CUDA kernels with speculative decoding enable end-to-end inference speedups up to 3.67×, making it practical for single-device and edge deployments.

BlockFFN is a mixture-of-experts (MoE) feed-forward network architecture designed to maximize hardware efficiency and activation sparsity in LLMs, specifically targeting practical end-side (single-device) deployment. It introduces fully differentiable, flexible routing, jointly optimizes token-level and chunk-level activation sparsity, and ships acceleration kernels that combine speculative decoding with sparse computation. BlockFFN achieves over 80% token-level sparsity (TLS), over 70% chunk-level sparsity (CLS) for 8-token chunks, and end-to-end inference speedups of up to 3.67× over dense baselines, while matching or exceeding the perplexity and downstream metrics of prior MoE approaches (Song et al., 11 Jul 2025).

1. Motivation and Core Principles

BlockFFN addresses two central bottlenecks in MoE-based LLMs: the non-differentiable/inflexible routing typical of vanilla MoE and the incompatibility of conventional sparsity patterns with high-performance single-device inference. Existing MoE designs often result in low CLS—meaning that, although each token uses few experts, the union of experts across a chunk of tokens remains large, precluding efficient chunk-wise computation. Furthermore, non-differentiable routing hinders model optimization, and activation patterns unaligned across tokens undermine chunk-based acceleration (e.g., via speculative decoding). The design paradigm of BlockFFN is to satisfy three criteria simultaneously: (1) fully differentiable, flexible routing, (2) maximization of both TLS and CLS, and (3) hardware-friendly kernels fusing activation sparsity with speculative decoding (Song et al., 11 Jul 2025).

2. Architecture: Routing and Expert Design

BlockFFN layers are built from sparse MoE blocks of the form

$\mathrm{FFN}(x) = \sum_{i=1}^{N_e} A_i(x) E_i(x),$

where $N_e$ is the number of experts, $E_i(x)$ is the output of expert $i$, and $A_i(x)$ is its routing weight. The routing of tokens is made fully differentiable: given $x \in \mathbb{R}^{d_h}$,

$A^0(x) = W_{\mathrm{router}}^T x,\quad A^1(x) = \mathrm{ReLU}(A^0(x)),\quad A(x) = \mathrm{RMSNorm}(A^1(x)),$

where $W_{\mathrm{router}} \in \mathbb{R}^{d_h \times N_e}$ are learnable parameters. ReLU enforces token-specific, data-dependent sparsity by zeroing out unused experts, while RMSNorm rescales active weights, preventing attenuation of non-zero routing signals even under strong regularization (Song et al., 11 Jul 2025).
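The router equations above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function name `route`, the `eps` constant, and the choice of an RMSNorm without a learnable gain are assumptions made for the example.

```python
import math

def route(x, w_router, eps=1e-6):
    """Return A(x) = RMSNorm(ReLU(W_router^T x)) for one token.

    x:        list of d_h floats (the token's hidden state)
    w_router: d_h x N_e nested list (rows indexed by hidden dimension)
    """
    n_experts = len(w_router[0])
    # A^0 = W_router^T x: one logit per expert
    a0 = [sum(w_router[d][i] * x[d] for d in range(len(x)))
          for i in range(n_experts)]
    # A^1 = ReLU(A^0): experts with non-positive logits get exactly zero
    a1 = [max(0.0, v) for v in a0]
    # RMSNorm without a learnable gain (an assumption of this sketch)
    rms = math.sqrt(sum(v * v for v in a1) / n_experts + eps)
    return [v / rms for v in a1]

x = [2.0, -1.0]
w_router = [[1.0, 0.5, -1.0, 0.0],
            [0.5, 1.0, 1.0, -1.0]]
weights = route(x, w_router)
active = [i for i, a in enumerate(weights) if a > 0.0]
print(active)  # indices of the experts this token actually uses
```

Because ReLU produces exact zeros, the set of active experts varies per token instead of being fixed to a top-k count, and the normalization only rescales the surviving weights.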

Each expert is implemented as a compact two-layer MLP (“block”) with Swish activation:

$E_i(x) = W_{\mathrm{down}}^{(i)T} \sigma(W^{(i)T}_{\mathrm{up}} x),\quad \sigma = \mathrm{Swish},$

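where $W^{(i)}_{\mathrm{up}} \in \mathbb{R}^{d_h \times d_e}$ and $W^{(i)}_{\mathrm{down}} \in \mathbb{R}^{d_e \times d_h}$, with $d_e$ the intermediate width of each expert.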
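To make the expert and mixture formulas concrete, the following sketch evaluates $\mathrm{FFN}(x) = \sum_i A_i(x) E_i(x)$ for a single token and skips experts whose routing weight is zero. The helper names (`swish`, `expert`, `block_ffn`) and the toy weights are hypothetical; the routing weights are passed in directly rather than computed.

```python
import math

def swish(v):
    # Swish (SiLU): v * sigmoid(v)
    return v / (1.0 + math.exp(-v))

def expert(x, w_up, w_down):
    """E_i(x) = W_down^T Swish(W_up^T x); w_up is d_h x d_e, w_down is d_e x d_h."""
    d_h, d_e = len(w_up), len(w_up[0])
    hidden = [swish(sum(w_up[d][j] * x[d] for d in range(d_h)))
              for j in range(d_e)]
    return [sum(w_down[j][d] * hidden[j] for j in range(d_e))
            for d in range(d_h)]

def block_ffn(x, routing, experts):
    """Sum A_i(x) * E_i(x), evaluating only experts with non-zero weight."""
    out = [0.0] * len(x)
    evaluated = 0
    for a_i, (w_up, w_down) in zip(routing, experts):
        if a_i == 0.0:
            continue  # inactive expert: no computation at all
        evaluated += 1
        e = expert(x, w_up, w_down)
        out = [o + a_i * v for o, v in zip(out, e)]
    return out, evaluated

# Toy setup: d_h = 2, d_e = 1, three experts, only expert 1 is active.
experts = [
    ([[1.0], [0.0]], [[1.0, 0.0]]),
    ([[0.0], [1.0]], [[0.0, 2.0]]),
    ([[1.0], [1.0]], [[1.0, 1.0]]),
]
out, evaluated = block_ffn([1.0, 2.0], [0.0, 1.5, 0.0], experts)
print(out, evaluated)
```

The count of evaluated experts is what token-level sparsity reduces: two of the three experts here cost nothing.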

3. Activation Sparsity: TLS, CLS, and Training Objectives

BlockFFN optimizes for both token-level and chunk-level activation sparsity, crucial for practical speedups in chunked and speculative inference:

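Token-Level Sparsity (TLS) quantifies the fraction of experts not used by an individual token $x_t$. Defining $S_t = \{\, i : A_i(x_t) > 0 \,\}$,

$\mathrm{TLS} = 1 - \frac{1}{T} \sum_{t=1}^{T} \frac{|S_t|}{N_e},$

where $T$ is the number of tokens.

Chunk-Level Sparsity (CLS) measures the fraction of experts unused across a chunk of $L$ tokens. For chunk $c$, define $U_c = \bigcup_{t \in c} S_t$,

$\mathrm{CLS}_L = 1 - \frac{1}{C} \sum_{c=1}^{C} \frac{|U_c|}{N_e},$

where $C$ is the number of chunks.

BlockFFN achieves $\mathrm{TLS} > 80\%$ and $\mathrm{CLS}_8 > 70\%$, outperforming prior MoEs, whose chunk-level sparsity is typically far lower.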
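The difference between the two metrics is easiest to see on a toy example. The sketch below computes both from per-token sets of active experts; the function names and the choice of consecutive, non-overlapping chunks are assumptions of this illustration.

```python
def tls(active_sets, n_experts):
    """Token-level sparsity: mean fraction of experts a single token leaves unused."""
    used = sum(len(s) for s in active_sets) / len(active_sets)
    return 1.0 - used / n_experts

def cls(active_sets, n_experts, chunk_len):
    """Chunk-level sparsity: mean fraction of experts unused by every token in a chunk."""
    chunks = [active_sets[k:k + chunk_len]
              for k in range(0, len(active_sets), chunk_len)]
    # An expert counts as used if any token in the chunk activates it
    used = sum(len(set().union(*c)) for c in chunks) / len(chunks)
    return 1.0 - used / n_experts

# Two 4-token sequences over 8 experts, each token using 2 experts.
aligned = [{0, 1}, {0, 1}, {1, 2}, {0, 2}]    # neighbors share experts
scattered = [{0, 1}, {2, 3}, {4, 5}, {6, 7}]  # neighbors share nothing

print(tls(aligned, 8), cls(aligned, 8, 4))
print(tls(scattered, 8), cls(scattered, 8, 4))
```

Both sequences have the same TLS of 0.75, but the scattered one touches every expert within the chunk, so its CLS drops to 0 while the aligned one keeps 0.625. This is the gap between per-token and per-chunk sparsity that the section describes.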

To induce these properties, BlockFFN augments the standard language modeling loss with two auxiliary terms:

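Training Objective	Definition	Purpose
$\mathcal{L}_{\mathrm{al}}$	Activation locality loss, computed between the routing activation patterns of neighboring tokens	Encourages activation locality among adjacent tokens
$\mathcal{L}_{\mathrm{cs}}$	Chunk sparsification loss, computed over the experts activated within each $L$-token chunk	Penalizes the union of active experts across an $L$-token chunk

The total objective is

$\mathcal{L} = \mathcal{L}_{\mathrm{LM}} + \lambda_{\mathrm{al}} \mathcal{L}_{\mathrm{al}} + \lambda_{\mathrm{cs}} \mathcal{L}_{\mathrm{cs}},$

where $\mathcal{L}_{\mathrm{LM}}$ is the language modeling loss, with an adaptive scheduler on $\lambda_{\mathrm{cs}}$ to steadily tighten chunk-level sparsity without destabilizing training (Song et al., 11 Jul 2025).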
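As a rough illustration of how a weighted objective with a scheduled coefficient might be wired up, consider the sketch below. The update rule (raise the chunk-sparsity weight while the language modeling loss is not degrading, lower it otherwise) and all constants are hypothetical placeholders; the source does not spell out the scheduler, so this should not be read as the rule used by Song et al.

```python
def total_loss(lm_loss, al_loss, cs_loss, lam_al, lam_cs):
    # Weighted sum of the language modeling loss and two auxiliary terms
    return lm_loss + lam_al * al_loss + lam_cs * cs_loss

def update_lam_cs(lam_cs, prev_lm_loss, lm_loss, up=1.2, down=0.8,
                  lam_min=1e-4, lam_max=1.0, tol=0.01):
    """Hypothetical scheduler: tighten the chunk penalty while the LM loss
    has not worsened by more than `tol`, relax it otherwise."""
    if lm_loss <= prev_lm_loss + tol:
        lam_cs *= up
    else:
        lam_cs *= down
    # Keep the coefficient inside a fixed range
    return min(lam_max, max(lam_min, lam_cs))

lam_cs = 0.01
lm_history = [2.00, 1.95, 1.93, 2.10, 2.05]  # made-up loss values
for prev, cur in zip(lm_history, lm_history[1:]):
    lam_cs = update_lam_cs(lam_cs, prev, cur)
print(lam_cs)
```

The point of any such schedule is the trade-off named in the text: the sparsity pressure grows only as fast as the language modeling loss tolerates.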

4. Inference Kernels and End-Side Acceleration

BlockFFN provides custom CUDA (CUTLASS) kernels to exploit activation sparsity and speculative decoding for multi-token inference. The key steps are:

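Compute routing activations $A(x_t)$ for the $L$ tokens in the chunk.
Form the union of active experts $U = \bigcup_{t=1}^{L} \{\, i : A_i(x_t) > 0 \,\}$; high CLS keeps $|U|$ well below $N_e$.
Perform up-projection for all tokens and experts in $U$ with one GEMM.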
Apply masking to zero unused experts per token.
Run Swish and corresponding down-projection; mask and reduce.
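The steps above can be mirrored by a plain-Python reference that processes one chunk at a time. It is a CPU sketch meant to show the data flow only: the nested loops stand in for the single GEMM of a real kernel, and the function and variable names are hypothetical.

```python
import math

def swish(v):
    return v / (1.0 + math.exp(-v))

def chunk_ffn(xs, routing, experts):
    """Reference (non-GPU) version of the chunked sparse FFN steps.

    xs:      L token vectors, each of length d_h
    routing: L lists of N_e routing weights (zeros mark inactive experts)
    experts: N_e pairs (w_up: d_h x d_e, w_down: d_e x d_h)
    """
    d_h = len(xs[0])
    # Union of experts active for at least one token in the chunk
    union = sorted({i for r in routing for i, a in enumerate(r) if a > 0.0})
    outs = [[0.0] * d_h for _ in xs]
    for i in union:                 # experts outside the union are never touched
        w_up, w_down = experts[i]
        d_e = len(w_up[0])
        for t, x in enumerate(xs):  # up-projection for every token in the chunk
            up = [sum(w_up[d][j] * x[d] for d in range(d_h)) for j in range(d_e)]
            a = routing[t][i]       # a zero weight masks this (token, expert) pair
            hidden = [a * swish(u) for u in up]
            for d in range(d_h):    # down-projection, accumulated into the output
                outs[t][d] += sum(w_down[j][d] * hidden[j] for j in range(d_e))
    return outs, union

experts = [
    ([[1.0], [0.0]], [[1.0, 0.0]]),
    ([[0.0], [1.0]], [[0.0, 1.0]]),
    ([[1.0], [1.0]], [[1.0, 1.0]]),
    ([[1.0], [-1.0]], [[1.0, -1.0]]),
]
xs = [[1.0, 0.0], [0.0, 2.0]]
routing = [[2.0, 0.0, 0.0, 0.0], [1.0, 0.5, 0.0, 0.0]]
outs, union = chunk_ffn(xs, routing, experts)
print(union, outs)
```

Scaling the hidden activations by the routing weight before the down-projection is equivalent to scaling the expert output, since the down-projection is linear. Experts 2 and 3 are never loaded here, which is the saving that high CLS buys.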

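For batch size $L$, the block is dense over the union $U$ of experts active in the chunk, leveraging GPU tensor core efficiency. The expected FFN speedup is

$\text{speedup} \approx \frac{N_e}{|U|} = \frac{1}{1 - \mathrm{CLS}_L},$

or about $3.3\times$ for $\mathrm{CLS}_8 = 70\%$, and empirical results approach these upper bounds (Song et al., 11 Jul 2025).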
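For a feel of the numbers, the sketch below evaluates an idealized bound under the assumption that FFN cost scales linearly with the number of experts touched per chunk, so that skipping a fraction CLS of the experts gives at most a 1 / (1 - CLS) speedup. Both that cost model and the function name are assumptions of this example; routing cost and memory traffic are ignored.

```python
def ffn_speedup_bound(cls_value):
    """Idealized FFN speedup if cost scales with the experts touched per chunk."""
    if not 0.0 <= cls_value < 1.0:
        raise ValueError("CLS must lie in [0, 1)")
    return 1.0 / (1.0 - cls_value)

for c in (0.5, 0.7, 0.8):
    print(f"CLS = {c:.0%}: at most {ffn_speedup_bound(c):.2f}x")
```

Under this model the bound grows quickly as CLS approaches 1, which is why raising chunk-level sparsity matters more for throughput than raising token-level sparsity alone.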

5. Empirical Results and Hardware Impact

Benchmarking on NVIDIA Jetson Orin NX (16 GB), BlockFFN (2.8B parameters) matches or exceeds previous MoE baselines (Switch/Mixtral TopK, GRIN, ReMoE, DeepSeek-MoE) on both sparsity and language modeling quality.

TLS: consistently over 80%
CLS (8-token chunks): consistently over 70%
Inference speed: up to 47.17 tokens/sec (3.67× dense baseline with speculative decoding)
Perplexity: For a 1.2B model, BlockFFN achieves PPL = 8.69 (dense PPL = 8.49, TopK MoE = 8.87, ReMoE = 8.78)
Reuse ratio: over 85% of active experts across token chunks, facilitating SRAM reuse

In practical deployment, this allows experts to be loaded into SRAM a single time and reused across tokens, reducing memory access inefficiency and further increasing throughput.

6. Relationship to Existing MoE Architectures

BlockFFN contrasts with prior MoE schemes in several respects. Vanilla MoE and TopK-based routing deliver high TLS but low CLS, leading to inefficient hardware utilization for chunked or speculative workloads. Heuristic or non-differentiable routers typical of earlier MoE models are replaced by a fully trainable ReLU+RMSNorm router, which jointly optimizes for local and chunked activation sparsity. Previous efforts frequently face performance or training stability trade-offs under strong sparsity constraints; BlockFFN employs auxiliary objectives with adaptive scheduling to mitigate these challenges (Song et al., 11 Jul 2025).

7. Implications and Future Directions

BlockFFN establishes a new standard for end-side, inference-efficient MoE design, making LLMs viable on single-device or memory-constrained hardware. The integration of chunk-level sparsity into both architectural and training design enables direct compatibility with mainstream acceleration techniques such as speculative decoding and chunked GEMM. A plausible implication is broader application of high-CLS MoE blocks in domains requiring fast, memory-efficient inference, including edge AI and real-time NLP systems. Future work may explore scaling BlockFFN to larger model sizes, tighter integration with more sophisticated speculative strategies, and adaptation to non-NLP modalities (Song et al., 11 Jul 2025).

References

Song et al. (11 Jul 2025). BlockFFN: Towards End-Side Acceleration-Friendly Mixture-of-Experts with Chunk-Level Activation Sparsity.
