
Non-Embedding FLOPs per Token

Updated 5 July 2025
  • Non-Embedding FLOPs per Token is a metric that quantifies the computational cost of self-attention and feed-forward operations, excluding the minor embedding computations.
  • Research on reducing it encompasses techniques like learned token pruning, soft token merging, and dynamic compute allocation, all aimed at reducing the quadratic scaling of attention FLOPs.
  • Recent strategies have achieved up to 6.7× FLOPs reduction with minimal accuracy loss, making these methods essential for efficient deployment in NLP, vision, and generative tasks.

Non-Embedding FLOPs per Token refers to the floating-point operations (FLOPs) incurred by transformer models, specifically excluding those attributed to the initial embedding stage, with a primary focus on the major computational components—self-attention, feed-forward networks, and related non-embedding mechanisms. This metric is particularly salient in transformer acceleration research, as the quadratic or higher-order relationship between input sequence length and compute in these non-embedding modules often becomes the efficiency bottleneck. Over the last several years, a diverse set of algorithmic strategies has emerged to minimize these FLOPs per token, enabling transformers to operate efficiently at scale and under resource constraints, with applications ranging from NLP and source code modeling to computer vision and generative diffusion.

1. Principles of Non-Embedding FLOPs in Transformer Architectures

The main computational burden in standard transformer models arises from the self-attention mechanism, whose complexity is $O(L^2 \cdot d)$, and the position-wise feed-forward network (FFN), which is $O(L \cdot d^2)$, where $L$ is the number of tokens and $d$ the model's hidden dimension. Since embeddings are typically computed only once and constitute a minor cost, most efficiency research targets non-embedding FLOPs.

The cost per transformer layer can generally be expressed as:

$$\text{Cost} \approx \alpha L^2 d + \beta L d^2$$

where $\alpha$ and $\beta$ are constants reflecting the operation counts for self-attention and FFN, respectively (2107.00910).

Any reduction in the effective sequence length $L$—per layer, per block, or across the entire model—directly translates into nonlinear savings in non-embedding FLOPs per token, especially for self-attention due to its quadratic scaling.
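
To make the cost model concrete, the sketch below evaluates this expression for a given model shape. It is a minimal sketch: the constants $\alpha$ and $\beta$, the helper names, and the example dimensions are illustrative assumptions rather than values taken from the cited work.

```python
def layer_cost(L: int, d: int, alpha: float = 4.0, beta: float = 8.0) -> float:
    """Approximate non-embedding FLOPs for one transformer layer.

    Implements Cost ≈ alpha * L^2 * d + beta * L * d^2, where the alpha
    term covers attention score and weighted-sum products and the beta
    term covers Q/K/V/output projections plus the FFN. The default
    constants are illustrative, not taken from any specific paper.
    """
    return alpha * L**2 * d + beta * L * d**2


def flops_per_token(L: int, d: int, n_layers: int) -> float:
    """Non-embedding FLOPs per token = total layer cost / sequence length."""
    return n_layers * layer_cost(L, d) / L


if __name__ == "__main__":
    # Example shape: a 12-layer, d = 768 encoder processing L = 512 tokens.
    print(f"{flops_per_token(512, 768, 12):.3e} non-embedding FLOPs per token")
```

Because the $\alpha L^2 d$ term dominates at long sequence lengths, halving the number of retained tokens cuts that term by roughly a factor of four, which is why token-reduction methods concentrate on shrinking the effective $L$.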

2. Token Reduction Strategies: Pruning, Merging, and Matrix Transformations

A range of methods has emerged to reduce token count dynamically during inference:

  • Learned Token Pruning (LTP): LTP employs a learned, layer-specific threshold on token importance scores (often derived from attention statistics) to adaptively prune uninformative tokens as a sequence progresses through the model. A token is pruned at layer $l$ if $\text{Score}(\text{token}) < \tau_l$, where $\tau_l$ is learned. This adaptivity allows the surviving token count to vary across layers and inputs, reducing non-embedding attention FLOPs by a factor of roughly $r^2$, where $r = L'/L$ is the fraction of tokens retained (2107.00910); a minimal pruning sketch follows this list.
  • Constraint-aware and Ranking-distilled Pruning (ToP): ToP improves upon baseline pruning by using ranking distillation from teacher (unpruned) models, learning gate masks (which layers prune) and ranking masks (which tokens are pruned), optimized via $L_0$-regularization with a hard-concrete distribution. This enables fine-grained compute budgeting and better retention of critical tokens, with up to $6.7\times$ reduction in FLOPs for BERT (2306.14393).
  • Token Merging and Matrix-based Transformations: Modern frameworks generalize token reduction to many-to-many matrix transformations, where a transformation matrix $W \in \mathbb{R}^{M \times N}$ maps $N$ input tokens to $M < N$ output tokens, with

$$Y = W X$$

rather than relying on hard pruning or exclusive merging. The Token Transforming framework constructs $W$ via attention-map informativeness and similarity-based soft assignments, preserving more information and achieving up to $40\%$ FLOPs reduction with marginal accuracy loss (2506.05709). Spectrum-preserving merging (PiToMe) incorporates energy scores to identify redundant tokens for merging while protecting unique tokens, with strong theoretical guarantees on spectral property preservation (2405.16148). Decoupled Embedding approaches (DTEM) introduce dedicated modules to compute merging features, further reducing computational overhead (2412.10569).
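
As referenced in the LTP item above, the sketch below applies a threshold-based pruning step using column-averaged attention probabilities as the importance score. It is a minimal sketch: the exact score definition, the soft-masking strategy, and the threshold value are illustrative assumptions rather than the precise LTP formulation.

```python
import torch

def prune_by_threshold(hidden: torch.Tensor,
                       attn_probs: torch.Tensor,
                       tau: float) -> tuple[torch.Tensor, torch.Tensor]:
    """Mask out tokens whose attention-derived importance is below tau.

    hidden:     (batch, L, d)        token representations entering the layer
    attn_probs: (batch, heads, L, L) softmax attention probabilities
    tau:        float                layer-specific threshold (learned in LTP)
    """
    # Importance of token j = average attention it receives over heads
    # and query positions (a common attention-statistic score).
    importance = attn_probs.mean(dim=1).mean(dim=1)   # (batch, L)
    keep = importance >= tau                          # boolean keep mask
    # Guard against pruning every token in a sequence.
    empty = keep.sum(dim=-1) == 0
    keep[empty, 0] = True
    # Soft masking keeps tensor shapes static; at inference the kept
    # tokens would instead be gathered to realize the FLOPs savings.
    return hidden * keep.unsqueeze(-1), keep
```

Under a retention ratio $r$, gathering only the kept tokens shrinks the attention term of the per-layer cost roughly as $r^2$, matching the savings quoted above.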

3. Adaptive, Dynamic, and Budget-aware Compute Allocation

Cutting-edge work moves beyond static token selection to dynamic compute allocation:

  • Mixture-of-Depths (MoD): MoD uses a learned top-$k$ routing mechanism per layer to select tokens for full computation, while others are skipped or only receive residual updates. The static computation graph (as $k$ is fixed a priori) maintains tensor shape regularity. The result is a non-uniform per-token FLOPs allocation, enabling strict compute budgeting and up to $50\%$ reduction in sample-wise stepping time, with minimal loss in accuracy (2404.02258); a minimal routing sketch follows this list.
  • HAMburger: In hierarchical inference, HAMburger generates multiple tokens per forward pass by “smashing” them into one KV cache entry. Only occasional macro-steps involve full model computation, with lightweight micro-step decoders generating intermediate tokens. This approach reduces the growth of non-embedding FLOPs and KV cache from linear to sub-linear relative to output length, resulting in up to $2\times$ more tokens per second (2505.20438).
  • FlowHN: In hybrid parallel architectures combining transformers and state-space models, FlowHN dynamically splits tokens between high- and low-FLOP branches in proportion to their compute requirements using

$$\text{block\_size} = \frac{L}{F_s/F_a + 1}$$

ensuring balanced computational loads, minimizing idle time, and maximizing throughput (2505.19472).
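
As referenced in the MoD item above, the sketch below shows the top-$k$ routing pattern with a fixed capacity so that tensor shapes stay static. It is a simplified sketch: the module structure, the capacity fraction, and the omission of router-weighted scaling of the block output (used in the published method so the discrete selection receives gradient) are illustrative simplifications.

```python
import torch
import torch.nn as nn

class TopKRoutedBlock(nn.Module):
    """Route only the top-k scoring tokens through an expensive block;
    the remaining tokens skip it via the residual path (MoD-style)."""

    def __init__(self, block: nn.Module, d_model: int, capacity: float = 0.5):
        super().__init__()
        self.block = block                      # e.g. an attention + FFN sub-block
        self.router = nn.Linear(d_model, 1)     # learned per-token routing score
        self.capacity = capacity                # fraction of tokens given full compute

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (batch, L, d)
        B, L, d = x.shape
        k = max(1, int(self.capacity * L))      # k fixed a priori -> static graph
        scores = self.router(x).squeeze(-1)     # (batch, L)
        top = scores.topk(k, dim=-1).indices    # tokens selected for full compute
        idx = top.unsqueeze(-1).expand(-1, -1, d)
        routed = torch.gather(x, 1, idx)        # (batch, k, d)
        processed = self.block(routed)          # expensive path on k tokens only
        # Skipped tokens keep their input value (residual pass-through).
        return x.scatter(1, idx, processed)
```

Because $k$ is fixed per layer, the per-token FLOPs budget of the whole stack can be set exactly at configuration time, regardless of which tokens the router happens to select.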

4. Specialized Approaches for Vision and Diffusion Models

Vision and generative diffusion models require tailored approaches:

  • Spectrum-Preserving Token Merging: PiToMe merges tokens based on energy scores, such that large clusters of similar embeddings are merged, and unique ones protected. This is achieved by computing, per token,

$$E_i(v_i, W[i,:]) = \frac{1}{N} \sum_{j \in \mathcal{N}(i)} f_m(\cos(v_i, v_j))$$

and using bipartite soft matching restricted by the energy metric (2405.16148).
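
The sketch below computes an energy score of this general form from a sequence of token embeddings. It is a minimal sketch: the specific choice of $f_m$ (a margin-thresholded ReLU of cosine similarity) and the use of all other tokens as the neighborhood $\mathcal{N}(i)$ are illustrative assumptions rather than the exact PiToMe definitions.

```python
import torch
import torch.nn.functional as F

def energy_scores(tokens: torch.Tensor, margin: float = 0.9) -> torch.Tensor:
    """Per-token energy: average f_m(cos(v_i, v_j)) over the other tokens.

    tokens: (N, d) embeddings for one sequence.
    With f_m(x) = relu(x - margin), tokens sitting in dense clusters of
    highly similar embeddings receive high energy (good merge candidates),
    while isolated, unique tokens score near zero and are protected.
    """
    v = F.normalize(tokens, dim=-1)
    cos = v @ v.t()                              # (N, N) cosine similarities
    cos.fill_diagonal_(0.0)                      # exclude self-similarity
    N = tokens.size(0)
    return F.relu(cos - margin).sum(dim=-1) / N  # (N,) energy per token
```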

  • Dynamic Token Density (FlexDiT): FlexDiT dynamically modulates token density both spatially (across layers) and temporally (across denoising steps in diffusion). Early layers may replace full self-attention with pooling operations, middle layers use Sparse-Dense Token Modules with a pruning rate

$$r = 1 - \frac{M}{N}$$

and temporal adaptation further schedules token counts during generation, offering up to $55\%$ FLOPs reduction in DiT-XL on ImageNet with minimal FID increase (2412.06028).

  • Unified Token Transforming: The Token Transforming framework introduces a training-free, many-to-many token compression via matrix multiplication, avoiding exclusive selection or merging and reliably preserving information, which accelerates inference by $1.5\times$ on DeiT-S with only a $0.1\%$ drop in accuracy (2506.05709); a compression sketch follows below.
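
As referenced in the Token Transforming item above, the sketch below performs a many-to-many compression $Y = W X$ with a similarity-based soft assignment matrix. It is a minimal sketch: using the first $M$ tokens as anchors and a softmax temperature are illustrative stand-ins for the framework's attention-informativeness construction of $W$.

```python
import torch
import torch.nn.functional as F

def transform_tokens(x: torch.Tensor, m: int, temperature: float = 0.1) -> torch.Tensor:
    """Compress N tokens to M < N output tokens via Y = W X.

    x: (N, d) tokens for one sequence; m: number of output tokens.
    Each row of W is a soft assignment over all N inputs, so every output
    token is a weighted combination of many inputs rather than the result
    of a hard selection or an exclusive merge.
    """
    v = F.normalize(x, dim=-1)
    anchors = v[:m]                               # (M, d) anchor tokens (illustrative choice)
    sim = anchors @ v.t()                         # (M, N) cosine similarities
    W = torch.softmax(sim / temperature, dim=-1)  # (M, N), rows sum to 1
    return W @ x                                  # (M, d) compressed token set
```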

5. FLOPs Reduction versus Model Performance

A central concern is the balance between compute savings and model accuracy. Non-embedding FLOPs reductions of up to $2.1\times$ (LTP), $6.7\times$ (ToP), $55\%$ (FlexDiT), or even $40$–$60\%$ (PiToMe, Token Transforming) are observed, typically with less than $1\%$ absolute accuracy drop or even slight gains due to improved regularization (2107.00910, 2306.14393, 2405.16148, 2412.06028, 2506.05709).

Computation-efficient token selection is thus predominantly guided by metrics such as FLOPs/token or total inference cost, cross-referenced against downstream evaluation metrics (GLUE, ImageNet, FID, CIDEr, mIoU). Notably, designs like ToP or Token Transforming demonstrate that accurate token importance ranking and flexible, soft token assignments provide improved robustness and performance retention compared to rigid, early-layer pruning.

6. Theoretical and Empirical Capacity Limits

Recent investigations highlight the substantial gap between theoretically possible FLOPs reductions and those currently attained in practice. By optimizing "memory" input vectors per sample, it is possible to compress up to 1568 tokens into a single embedding, suggesting up to $1500\times$ reduction in sequential token computation, far exceeding the $10\times$ achievable by previous encoder-based methods (2502.13063). The practical compression ratio is, however, limited by the model's cross-entropy loss rather than by input length or vector size: the information bottleneck lies in the amount of uncertainty that must be reduced, not in the raw number of tokens.

7. Application Domains and Practical Implications

Non-embedding FLOPs per token has become a central metric for transformer deployment in several domains:

  • NLP and LLMs: Efficient token pruning and merging have been validated on GLUE, SQuAD, and long-sequence code tasks, enhancing feasibility for inference on CPUs/GPUs and enabling longer-context handling (e.g., in SparseCoder as used for code analysis) (2310.07109).
  • Vision: Vision transformers benefit from advanced merging (DTEM, PiToMe) and matrix-based transforming, particularly in classification, segmentation, and dense prediction, providing strong trade-offs between GFLOPs and mIoU/CIDEr (2412.10569, 2405.16148, 2506.05709).
  • Generation/Multimodal: In large-scale generative models (diffusion, LLaVA-like multimodal), fine-grained token density control and many-to-many transformations yield substantial inference acceleration with negligible loss in output quality (2412.06028, 2506.05709).

Careful FLOPs budgeting, adaptive token processing, and information-preserving reduction methods are critical for sustained progress in efficient transformer systems, facilitating practical deployment at scale and under resource constraints.