BlockFFN: Efficient MoE for LLMs

Updated 11 June 2026
  • BlockFFN is a mixture-of-experts feed-forward network that maximizes hardware efficiency and activation sparsity using fully differentiable, flexible routing.
  • It employs ReLU and RMSNorm-based routing to achieve over 80% token-level and 70% chunk-level sparsity, optimizing both token and chunk activations.
  • Custom CUDA kernels with speculative decoding enable end-to-end inference speedups up to 3.67×, making it practical for single-device and edge deployments.

BlockFFN is a mixture-of-experts (MoE) feed-forward network architecture designed to maximize hardware efficiency and activation sparsity in large language models (LLMs), specifically targeting practical end-side (single-device) deployment. It introduces fully differentiable, flexible routing and jointly optimizes for both token-level and chunk-level activation sparsity, and it pairs these with acceleration kernels that integrate speculative decoding and sparse computation. BlockFFN achieves over 80% token-level sparsity (TLS), over 70% chunk-level sparsity (CLS) for 8-token chunks, and end-to-end inference speedups up to 3.67× over dense baselines, while matching or exceeding the perplexity and downstream metrics of prior MoE approaches (Song et al., 11 Jul 2025).

1. Motivation and Core Principles

BlockFFN addresses two central bottlenecks in MoE-based LLMs: the non-differentiable/inflexible routing typical of vanilla MoE and the incompatibility of conventional sparsity patterns with high-performance single-device inference. Existing MoE designs often result in low CLS—meaning that, although each token uses few experts, the union of experts across a chunk of tokens remains large, precluding efficient chunk-wise computation. Furthermore, non-differentiable routing hinders model optimization, and activation patterns unaligned across tokens undermine chunk-based acceleration (e.g., via speculative decoding). The design paradigm of BlockFFN is to satisfy three criteria simultaneously: (1) fully differentiable, flexible routing, (2) maximization of both TLS and CLS, and (3) hardware-friendly kernels fusing activation sparsity with speculative decoding (Song et al., 11 Jul 2025).
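
To see why high TLS alone does not give high CLS, consider a rough illustration under an assumption that is not from the source: each token activates each expert independently with probability $p = 0.2$ (a TLS of 80%). The expected fraction of experts left unused by all 8 tokens of a chunk is then

$$\mathbb{E}[\mathrm{CLS}_8] = (1 - p)^8 = 0.8^8 \approx 0.168,$$

so roughly 83% of the experts would still have to be computed for the chunk. Real routers are not independent across tokens, but the calculation suggests why activations need to be aligned across neighboring tokens before chunk-wise kernels pay off.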

2. Architecture: Routing and Expert Design

BlockFFN layers are built from sparse MoE blocks of the form

$$\mathrm{FFN}(x) = \sum_{i=1}^{N_e} A_i(x)\, E_i(x),$$

where $N_e$ is the number of experts, $E_i(x)$ is the output of expert $i$, and $A_i(x)$ is its routing weight. The routing of tokens is made fully differentiable: given $x \in \mathbb{R}^{d_h}$,

$$A^0(x) = W_{\mathrm{router}}^T x,\quad A^1(x) = \mathrm{ReLU}(A^0(x)),\quad A(x) = \mathrm{RMSNorm}(A^1(x)),$$

where $W_{\mathrm{router}} \in \mathbb{R}^{d_h \times N_e}$ are learnable parameters. ReLU enforces token-specific, data-dependent sparsity by zeroing out unused experts, while RMSNorm rescales active weights, preventing attenuation of non-zero routing signals even under strong regularization (Song et al., 11 Jul 2025).
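
The routing rule can be sketched in a few lines of plain Python. This is an illustrative sketch, not the authors' implementation: the function name `route`, the `eps` term, the omission of a learnable RMSNorm gain, and the toy weights are all assumptions made for the example.

```python
import math

def route(x, w_router, eps=1e-6):
    """ReLU + RMSNorm routing for one token.

    x: list of d_h floats; w_router: d_h x N_e nested list.
    Returns N_e routing weights; exact zeros mark inactive experts.
    """
    n_experts = len(w_router[0])
    # A^0 = W_router^T x
    a0 = [sum(w_router[j][i] * x[j] for j in range(len(x)))
          for i in range(n_experts)]
    # A^1 = ReLU(A^0): negative scores become exact zeros
    a1 = [max(0.0, v) for v in a0]
    # A = RMSNorm(A^1), here without a learnable gain (assumption of this sketch)
    rms = math.sqrt(sum(v * v for v in a1) / n_experts + eps)
    return [v / rms for v in a1]

# Toy example with d_h = 2 and N_e = 4 (hypothetical values)
W = [[1.0, -1.0, 0.5, -0.5],
     [0.0, 2.0, -1.0, -1.0]]
weights = route([1.0, 1.0], W)
active = [i for i, w in enumerate(weights) if w > 0]
print(active)  # [0, 1]
```

Because ReLU produces exact zeros, the set of active experts can be read directly off the routing weights, with no top-k selection step; the normalization then rescales only the surviving entries.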

Each expert is implemented as a compact two-layer MLP (“block”) with Swish activation:

$$E_i(x) = W_{\mathrm{down}}^{(i)T} \sigma(W^{(i)T}_{\mathrm{up}} x),\quad \sigma = \mathrm{Swish},$$

where $W^{(i)}_{\mathrm{up}} \in \mathbb{R}^{d_h \times d_e}$ and $W^{(i)}_{\mathrm{down}} \in \mathbb{R}^{d_e \times d_h}$.
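
A minimal sketch of the expert and of the weighted sum over experts follows. It assumes Swish with unit slope, $z \cdot \mathrm{sigmoid}(z)$, and a down-projection of shape $d_e \times d_h$ (inferred from the expert formula); the helper names and toy weights are hypothetical.

```python
import math

def swish(z):
    # Swish with unit slope: z * sigmoid(z)
    return z / (1.0 + math.exp(-z))

def matvec_t(w, v):
    """Compute W^T v for W stored as len(v) rows."""
    return [sum(w[j][k] * v[j] for j in range(len(v)))
            for k in range(len(w[0]))]

def expert(x, w_up, w_down):
    """E_i(x) = W_down^T Swish(W_up^T x); w_up is d_h x d_e, w_down is d_e x d_h."""
    hidden = [swish(h) for h in matvec_t(w_up, x)]
    return matvec_t(w_down, hidden)

def block_ffn(x, routing, experts):
    """Sum A_i(x) * E_i(x), skipping experts whose routing weight is zero."""
    out = [0.0] * len(x)
    for a_i, (w_up, w_down) in zip(routing, experts):
        if a_i == 0.0:
            continue  # inactive expert: no computation needed
        e = expert(x, w_up, w_down)
        out = [o + a_i * v for o, v in zip(out, e)]
    return out

# Toy example: d_h = 2, d_e = 1, two experts, only the first one active
experts = [([[1.0], [0.0]], [[1.0, 2.0]]),
           ([[0.0], [1.0]], [[3.0, 4.0]])]
out = block_ffn([1.0, 0.0], routing=[2.0, 0.0], experts=experts)
```

The early `continue` is where activation sparsity turns into saved work: an expert with zero routing weight contributes nothing to the sum, so its two projections never need to run.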

3. Activation Sparsity: TLS, CLS, and Training Objectives

BlockFFN optimizes for both token-level and chunk-level activation sparsity, crucial for practical speedups in chunked and speculative inference:

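  • Token-Level Sparsity (TLS) quantifies the fraction of experts not used by an individual token $x_t$. Defining the activation indicator $a_{t,i} = \mathbb{1}[A_i(x_t) > 0]$,

$$\mathrm{TLS} = 1 - \frac{1}{T N_e} \sum_{t=1}^{T} \sum_{i=1}^{N_e} a_{t,i},$$

where $T$ is the number of tokens.

  • Chunk-Level Sparsity (CLS) measures the fraction of experts unused across a chunk of $L$ tokens. For chunk $c$, define $u_{c,i} = \max_{t \in c} a_{t,i}$, which equals 1 if any token in the chunk activates expert $i$,

$$\mathrm{CLS}_L = 1 - \frac{1}{C N_e} \sum_{c=1}^{C} \sum_{i=1}^{N_e} u_{c,i},$$

where $C$ is the number of chunks.

BlockFFN achieves $\mathrm{TLS} > 80\%$ and $\mathrm{CLS}_8 > 70\%$, outperforming prior MoEs, whose CLS is typically far lower.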
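
The two metrics can be computed from a binary activation mask, as in the sketch below. It assumes consecutive, non-overlapping chunks and drops a trailing partial chunk; the function names and the toy masks are illustrative and not taken from the paper.

```python
def token_level_sparsity(mask):
    """Fraction of (token, expert) pairs that are inactive.

    mask[t][i] is 1 if token t activates expert i, else 0.
    """
    total = sum(len(row) for row in mask)
    active = sum(sum(row) for row in mask)
    return 1.0 - active / total

def chunk_level_sparsity(mask, chunk_len):
    """Fraction of experts unused by every token of a chunk, averaged over
    consecutive non-overlapping chunks (a trailing partial chunk is dropped)."""
    n_experts = len(mask[0])
    n_chunks = len(mask) // chunk_len
    unused = 0
    for c in range(n_chunks):
        chunk = mask[c * chunk_len:(c + 1) * chunk_len]
        union = [any(row[i] for row in chunk) for i in range(n_experts)]
        unused += n_experts - sum(union)
    return unused / (n_chunks * n_experts)

# Both masks use one expert per token, so TLS is 75% in each case.
aligned = [[1, 0, 0, 0]] * 4
scattered = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
print(token_level_sparsity(aligned), chunk_level_sparsity(aligned, 4))      # 0.75 0.75
print(token_level_sparsity(scattered), chunk_level_sparsity(scattered, 4))  # 0.75 0.0
```

The two toy masks have identical TLS but opposite CLS, which is the gap the chunk-aware training objectives are meant to close.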

To induce these properties, BlockFFN augments the standard language modeling loss with two auxiliary terms:

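Training objective | Definition | Purpose
Activation locality loss $\mathcal{L}_{\mathrm{al}}$ | […] | Encourages activation locality among adjacent tokens
Chunk sparsification loss $\mathcal{L}_{\mathrm{cs}}$ | […] | Penalizes the union of active experts across an $L$-token chunk

The total objective is

$$\mathcal{L} = \mathcal{L}_{\mathrm{LM}} + \lambda_{\mathrm{al}}\, \mathcal{L}_{\mathrm{al}} + \lambda_{\mathrm{cs}}\, \mathcal{L}_{\mathrm{cs}},$$

where $\mathcal{L}_{\mathrm{LM}}$ is the language modeling loss and $\lambda_{\mathrm{al}}$, $\lambda_{\mathrm{cs}}$ weight the auxiliary terms, with an adaptive scheduler on $\lambda_{\mathrm{cs}}$ to steadily tighten chunk-level sparsity without destabilizing training (Song et al., 11 Jul 2025).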

4. Inference Kernels and End-Side Acceleration

BlockFFN provides custom CUDA (CUTLASS) kernels to exploit activation sparsity and speculative decoding for multi-token inference. The key steps are:

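  1. Compute routing activations $A(x_t)$ for the $L$ tokens in the chunk.
  2. Form the union of active experts $\mathcal{U} = \bigcup_t \{i : A_i(x_t) > 0\}$; high CLS keeps $|\mathcal{U}|$ well below $N_e$.
  3. Perform up-projection for all tokens and experts in $\mathcal{U}$ with one GEMM.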
  4. Apply masking to zero unused experts per token.
  5. Run Swish and corresponding down-projection; mask and reduce.
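
The steps above can be emulated on the CPU to show the data flow. This is a plain-Python sketch under stated assumptions, not the CUDA/CUTLASS kernel: routing weights are passed in precomputed (step 1), the loops stand in for the batched GEMMs, masked (token, expert) pairs are skipped rather than computed and then zeroed, and all names and toy weights are hypothetical.

```python
import math

def swish(z):
    return z / (1.0 + math.exp(-z))

def expert(x, w_up, w_down):
    """E_i(x) = W_down^T Swish(W_up^T x) for one token, with nested-list weights."""
    hidden = [swish(sum(w_up[j][k] * x[j] for j in range(len(x))))
              for k in range(len(w_up[0]))]
    return [sum(w_down[k][j] * hidden[k] for k in range(len(hidden)))
            for j in range(len(w_down[0]))]

def chunk_ffn(xs, routing, experts):
    """FFN over a chunk of tokens, touching only experts in the chunk-level union.

    xs: token vectors of the chunk; routing[t][i]: routing weight A_i(x_t);
    experts: list of (w_up, w_down) pairs.
    """
    # Step 2: union of experts active for at least one token in the chunk
    union = [i for i in range(len(experts)) if any(r[i] > 0 for r in routing)]
    outs = [[0.0] * len(xs[0]) for _ in xs]
    for i in union:  # each expert in the union is visited once per chunk
        w_up, w_down = experts[i]
        for t, x in enumerate(xs):  # Step 3: all tokens share the same expert
            a = routing[t][i]
            if a == 0.0:
                continue  # Step 4: this token does not use expert i
            e = expert(x, w_up, w_down)  # Step 5: Swish and down-projection
            outs[t] = [o + a * v for o, v in zip(outs[t], e)]
    return outs, union

# Toy chunk: 2 tokens, 3 experts, d_h = d_e = 1; both tokens use only expert 0
experts = [([[1.0]], [[1.0]]), ([[2.0]], [[1.0]]), ([[3.0]], [[1.0]])]
outs, union = chunk_ffn([[1.0], [2.0]],
                        routing=[[1.0, 0.0, 0.0], [0.5, 0.0, 0.0]],
                        experts=experts)
print(union)  # [0]
```

Iterating over the union in the outer loop mirrors the motivation for chunk-level sparsity: the work per chunk scales with the number of experts in the union, not with $N_e$.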

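For a chunk of $L$ tokens, the computation is dense over the union of experts active in the chunk, leveraging GPU tensor core efficiency. Since only a $1 - \mathrm{CLS}_L$ fraction of experts is computed, the expected FFN speedup is

$$\text{speedup} \approx \frac{1}{1 - \mathrm{CLS}_L},$$

or about $3.3\times$ for $\mathrm{CLS}_8 = 70\%$, and empirical results approach these upper bounds (Song et al., 11 Jul 2025).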
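
As a quick sanity check on such a bound, the snippet below tabulates the ideal FFN speedup for a few sparsity levels. It rests on a simplifying assumption that is not stated in the source: FFN cost is proportional to the fraction of experts touched per chunk, with no overhead for routing, masking, or memory traffic. The function name is hypothetical.

```python
def ffn_speedup_bound(chunk_sparsity):
    """Ideal FFN speedup if cost scales with the fraction of experts touched."""
    if not 0.0 <= chunk_sparsity < 1.0:
        raise ValueError("chunk_sparsity must be in [0, 1)")
    return 1.0 / (1.0 - chunk_sparsity)

for value in (0.5, 0.7, 0.8):
    print(f"CLS = {value:.0%}: at most {ffn_speedup_bound(value):.2f}x")
# CLS = 50%: at most 2.00x
# CLS = 70%: at most 3.33x
# CLS = 80%: at most 5.00x
```

This covers the FFN only; end-to-end gains also depend on attention cost and on how much speculative decoding contributes, so the reported end-to-end figure is not directly comparable to this bound.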

5. Empirical Results and Hardware Impact

Benchmarking on NVIDIA Jetson Orin NX (16 GB), BlockFFN (2.8B parameters) matches or exceeds previous MoE baselines (Switch/Mixtral TopK, GRIN, ReMoE, DeepSeek-MoE) on both sparsity and language modeling quality.

  • TLS: consistently over 80%
  • $\mathrm{CLS}_8$: consistently over 70%
  • Inference speed: up to 47.17 tokens/sec (3.67× dense baseline with speculative decoding)
  • Perplexity: For a 1.2B model, BlockFFN achieves PPL = 8.69 (dense PPL = 8.49, TopK MoE = 8.87, ReMoE = 8.78)
  • Reuse ratio: over 85% of active experts across token chunks, facilitating SRAM reuse

In practical deployment, this allows experts to be loaded into SRAM a single time and reused across tokens, reducing memory access inefficiency and further increasing throughput.

6. Relationship to Existing MoE Architectures

BlockFFN contrasts with prior MoE schemes in several respects. Vanilla MoE and TopK-based routing deliver high TLS but low CLS, leading to inefficient hardware utilization for chunked or speculative workloads. Heuristic or non-differentiable routers typical of earlier MoE models are replaced by a fully trainable ReLU+RMSNorm router, which jointly optimizes for local and chunked activation sparsity. Previous efforts frequently face performance or training stability trade-offs under strong sparsity constraints; BlockFFN employs auxiliary objectives with adaptive scheduling to mitigate these challenges (Song et al., 11 Jul 2025).

7. Implications and Future Directions

BlockFFN establishes a new standard for end-side, inference-efficient MoE design, making large language models viable on single-device or memory-constrained hardware. The integration of chunk-level sparsity into both architectural and training design enables direct compatibility with mainstream acceleration techniques such as speculative decoding and chunked GEMM. A plausible implication is broader application of high-CLS MoE blocks in domains requiring fast, memory-efficient inference, including edge AI and real-time NLP systems. Future work may explore scaling BlockFFN to larger model sizes, tighter integration with more sophisticated speculative strategies, and adaptation to non-NLP modalities (Song et al., 11 Jul 2025).

References (1)
