
Non-Embedding FLOPs per Token

Updated 5 July 2025
  • Non-Embedding FLOPs per Token is a metric that quantifies the computational cost of self-attention and feed-forward operations, excluding the minor embedding computations.
  • Research on reducing it encompasses techniques like learned token pruning, soft token merging, and dynamic compute allocation, all aimed at reducing the quadratic scaling of attention FLOPs.
  • Recent strategies have achieved up to 6.7× FLOPs reduction with minimal accuracy loss, making these methods essential for efficient deployment in NLP, vision, and generative tasks.

Non-Embedding FLOPs per Token refers to the floating-point operations (FLOPs) incurred by transformer models, specifically excluding those attributed to the initial embedding stage, with a primary focus on the major computational components—self-attention, feed-forward networks, and related non-embedding mechanisms. This metric is particularly salient in transformer acceleration research, as the quadratic or higher-order relationship between input sequence length and compute in these non-embedding modules often becomes the efficiency bottleneck. Over the last several years, a diverse set of algorithmic strategies has emerged to minimize these FLOPs per token, enabling transformers to operate efficiently at scale and under resource constraints, with applications ranging from NLP and source code modeling to computer vision and generative diffusion.

1. Principles of Non-Embedding FLOPs in Transformer Architectures

The main computational burden in standard transformer models arises from the self-attention mechanism, whose complexity is $O(L^2 \cdot d)$, and the position-wise feed-forward network (FFN), which is $O(L \cdot d^2)$, where $L$ is the number of tokens and $d$ the model's hidden dimension. Since embeddings are typically computed only once and constitute a minor cost, most efficiency research targets non-embedding FLOPs.

The cost per transformer layer can generally be expressed as:

$$\text{Cost} \approx \alpha L^2 d + \beta L d^2$$

where $\alpha$ and $\beta$ are constants reflecting the operation counts for self-attention and FFN, respectively (2107.00910).

Any reduction in the effective sequence length $L$—per layer, per block, or across the entire model—directly translates into nonlinear savings in non-embedding FLOPs per token, especially for self-attention due to its quadratic scaling.
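
To make the cost model concrete, the sketch below evaluates this expression for a given model shape. It is a minimal sketch: the constants $\alpha$ and $\beta$, the helper names, and the example dimensions are illustrative assumptions rather than values taken from the cited work.

```python
def layer_cost(L: int, d: int, alpha: float = 4.0, beta: float = 8.0) -> float:
    """Approximate non-embedding FLOPs for one transformer layer.

    Implements Cost ≈ alpha * L^2 * d + beta * L * d^2, where the alpha
    term covers attention score and weighted-sum products and the beta
    term covers Q/K/V/output projections plus the FFN. The default
    constants are illustrative, not taken from any specific paper.
    """
    return alpha * L**2 * d + beta * L * d**2


def flops_per_token(L: int, d: int, n_layers: int) -> float:
    """Non-embedding FLOPs per token = total layer cost / sequence length."""
    return n_layers * layer_cost(L, d) / L


if __name__ == "__main__":
    # Example shape: a 12-layer, d = 768 encoder processing L = 512 tokens.
    print(f"{flops_per_token(512, 768, 12):.3e} non-embedding FLOPs per token")
```

Because the $\alpha L^2 d$ term dominates at long sequence lengths, halving the number of retained tokens cuts that term by roughly a factor of four, which is why token-reduction methods concentrate on shrinking the effective $L$.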

2. Token Reduction Strategies: Pruning, Merging, and Matrix Transformations

A range of methods has emerged to reduce token count dynamically during inference:

  • Learned Token Pruning (LTP): LTP employs a learned, layer-specific threshold on token importance scores (often derived from attention statistics) to adaptively prune uninformative tokens as a sequence progresses through the model. A token is pruned at layer $l$ if $\text{Score}(\text{token}) < \tau_l$, where $\tau_l$ is learned. This adaptivity allows the surviving token count to vary across layers and inputs, reducing non-embedding attention FLOPs by a factor of roughly $r^2$, where $r = L'/L$ is the fraction of tokens retained (2107.00910); a minimal pruning sketch follows this list.
  • Constraint-aware and Ranking-distilled Pruning (ToP): ToP improves upon baseline pruning by using ranking distillation from teacher (unpruned) models, learning gate masks (which layers prune) and ranking masks (which tokens are pruned), optimized via $L_0$-regularization with a hard-concrete distribution. This enables fine-grained compute budgeting and better retention of critical tokens, with up to $6.7\times$ reduction in FLOPs for BERT (2306.14393).
  • Token Merging and Matrix-based Transformations: Modern frameworks generalize token reduction to many-to-many matrix transformations, where a transformation matrix $W \in \mathbb{R}^{M \times N}$ maps $N$ input tokens to $M < N$ output tokens, with

$$Y = W X$$

rather than relying on hard pruning or exclusive merging. The Token Transforming framework constructs $W$ via attention-map informativeness and similarity-based soft assignments, preserving more information and achieving up to $40\%$ FLOPs reduction with marginal accuracy loss (2506.05709). Spectrum-preserving merging (PiToMe) incorporates energy scores to identify redundant tokens for merging while protecting unique tokens, with strong theoretical guarantees on spectral property preservation (2405.16148). Decoupled Embedding approaches (DTEM) introduce dedicated modules to compute merging features, further reducing computational overhead (2412.10569).
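
As referenced in the LTP item above, the sketch below applies a threshold-based pruning step using column-averaged attention probabilities as the importance score. It is a minimal sketch: the exact score definition, the soft-masking strategy, and the threshold value are illustrative assumptions rather than the precise LTP formulation.

```python
import torch

def prune_by_threshold(hidden: torch.Tensor,
                       attn_probs: torch.Tensor,
                       tau: float) -> tuple[torch.Tensor, torch.Tensor]:
    """Mask out tokens whose attention-derived importance is below tau.

    hidden:     (batch, L, d)        token representations entering the layer
    attn_probs: (batch, heads, L, L) softmax attention probabilities
    tau:        float                layer-specific threshold (learned in LTP)
    """
    # Importance of token j = average attention it receives over heads
    # and query positions (a common attention-statistic score).
    importance = attn_probs.mean(dim=1).mean(dim=1)   # (batch, L)
    keep = importance >= tau                          # boolean keep mask
    # Guard against pruning every token in a sequence.
    empty = keep.sum(dim=-1) == 0
    keep[empty, 0] = True
    # Soft masking keeps tensor shapes static; at inference the kept
    # tokens would instead be gathered to realize the FLOPs savings.
    return hidden * keep.unsqueeze(-1), keep
```

Under a retention ratio $r$, gathering only the kept tokens shrinks the attention term of the per-layer cost roughly as $r^2$, matching the savings quoted above.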

3. Adaptive, Dynamic, and Budget-aware Compute Allocation

Cutting-edge work moves beyond static token selection to dynamic compute allocation:

  • Mixture-of-Depths (MoD): MoD uses a learned top-$k$ routing mechanism per layer to select tokens for full computation, while others are skipped or only receive residual updates. The static computation graph (as $k$ is fixed a priori) maintains tensor shape regularity. The result is a non-uniform per-token FLOPs allocation, enabling strict compute budgeting and up to $50\%$ reduction in sample-wise stepping time, with minimal loss in accuracy (2404.02258); a minimal routing sketch follows this list.
  • HAMburger: In hierarchical inference, HAMburger generates multiple tokens per forward pass by “smashing” them into one KV cache entry. Only occasional macro-steps involve full model computation, with lightweight micro-step decoders generating intermediate tokens. This approach reduces the growth of non-embedding FLOPs and KV cache from linear to sub-linear relative to output length, resulting in up to $2\times$ more tokens per second (2505.20438).
  • FlowHN: In hybrid parallel architectures combining transformers and state-space models, FlowHN dynamically splits tokens between high- and low-FLOP branches in proportion to their compute requirements using

$$\text{block\_size} = \frac{L}{F_s/F_a + 1}$$

ensuring balanced computational loads, minimizing idle time, and maximizing throughput (2505.19472).
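
As referenced in the MoD item above, the sketch below shows the top-$k$ routing pattern with a fixed capacity so that tensor shapes stay static. It is a simplified sketch: the module structure, the capacity fraction, and the omission of router-weighted scaling of the block output (used in the published method so the discrete selection receives gradient) are illustrative simplifications.

```python
import torch
import torch.nn as nn

class TopKRoutedBlock(nn.Module):
    """Route only the top-k scoring tokens through an expensive block;
    the remaining tokens skip it via the residual path (MoD-style)."""

    def __init__(self, block: nn.Module, d_model: int, capacity: float = 0.5):
        super().__init__()
        self.block = block                      # e.g. an attention + FFN sub-block
        self.router = nn.Linear(d_model, 1)     # learned per-token routing score
        self.capacity = capacity                # fraction of tokens given full compute

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (batch, L, d)
        B, L, d = x.shape
        k = max(1, int(self.capacity * L))      # k fixed a priori -> static graph
        scores = self.router(x).squeeze(-1)     # (batch, L)
        top = scores.topk(k, dim=-1).indices    # tokens selected for full compute
        idx = top.unsqueeze(-1).expand(-1, -1, d)
        routed = torch.gather(x, 1, idx)        # (batch, k, d)
        processed = self.block(routed)          # expensive path on k tokens only
        # Skipped tokens keep their input value (residual pass-through).
        return x.scatter(1, idx, processed)
```

Because $k$ is fixed per layer, the per-token FLOPs budget of the whole stack can be set exactly at configuration time, regardless of which tokens the router happens to select.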

4. Specialized Approaches for Vision and Diffusion Models

Vision and generative diffusion models require tailored approaches:

  • Spectrum-Preserving Token Merging: PiToMe merges tokens based on energy scores, such that large clusters of similar embeddings are merged, and unique ones protected. This is achieved by computing, per token,

$$E_i(v_i, W[i,:]) = \frac{1}{N} \sum_{j \in \mathcal{N}(i)} f_m(\cos(v_i, v_j))$$

and using bipartite soft matching restricted by the energy metric (2405.16148).
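
The sketch below computes an energy score of this general form from a sequence of token embeddings. It is a minimal sketch: the specific choice of $f_m$ (a margin-thresholded ReLU of cosine similarity) and the use of all other tokens as the neighborhood $\mathcal{N}(i)$ are illustrative assumptions rather than the exact PiToMe definitions.

```python
import torch
import torch.nn.functional as F

def energy_scores(tokens: torch.Tensor, margin: float = 0.9) -> torch.Tensor:
    """Per-token energy: average f_m(cos(v_i, v_j)) over the other tokens.

    tokens: (N, d) embeddings for one sequence.
    With f_m(x) = relu(x - margin), tokens sitting in dense clusters of
    highly similar embeddings receive high energy (good merge candidates),
    while isolated, unique tokens score near zero and are protected.
    """
    v = F.normalize(tokens, dim=-1)
    cos = v @ v.t()                              # (N, N) cosine similarities
    cos.fill_diagonal_(0.0)                      # exclude self-similarity
    N = tokens.size(0)
    return F.relu(cos - margin).sum(dim=-1) / N  # (N,) energy per token
```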

  • Dynamic Token Density (FlexDiT): FlexDiT dynamically modulates token density both spatially (across layers) and temporally (across denoising steps in diffusion). Early layers may replace full self-attention with pooling operations, middle layers use Sparse-Dense Token Modules with a pruning rate

$$r = 1 - \frac{M}{N}$$

and temporal adaptation further schedules token counts during generation, offering up to $55\%$ FLOPs reduction in DiT-XL on ImageNet with minimal FID increase (2412.06028).

  • Unified Token Transforming: The Token Transforming framework introduces a training-free, many-to-many token compression via matrix multiplication, avoiding exclusive selection or merging and reliably preserving information, which accelerates inference by $1.5\times$ on DeiT-S with only a $0.1\%$ drop in accuracy (2506.05709); a compression sketch follows below.
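
As referenced in the Token Transforming item above, the sketch below performs a many-to-many compression $Y = W X$ with a similarity-based soft assignment matrix. It is a minimal sketch: using the first $M$ tokens as anchors and a softmax temperature are illustrative stand-ins for the framework's attention-informativeness construction of $W$.

```python
import torch
import torch.nn.functional as F

def transform_tokens(x: torch.Tensor, m: int, temperature: float = 0.1) -> torch.Tensor:
    """Compress N tokens to M < N output tokens via Y = W X.

    x: (N, d) tokens for one sequence; m: number of output tokens.
    Each row of W is a soft assignment over all N inputs, so every output
    token is a weighted combination of many inputs rather than the result
    of a hard selection or an exclusive merge.
    """
    v = F.normalize(x, dim=-1)
    anchors = v[:m]                               # (M, d) anchor tokens (illustrative choice)
    sim = anchors @ v.t()                         # (M, N) cosine similarities
    W = torch.softmax(sim / temperature, dim=-1)  # (M, N), rows sum to 1
    return W @ x                                  # (M, d) compressed token set
```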

5. FLOPs Reduction versus Model Performance

A central concern is the balance between compute savings and model accuracy. Non-embedding FLOPs reductions of up to $2.1\times$ (LTP), $6.7\times$ (ToP), $55\%$ (FlexDiT), or even $40$–$60\%$ (PiToMe, Token Transforming) are observed, typically with less than $1\%$ absolute accuracy drop or even slight gains due to improved regularization (2107.00910, 2306.14393, 2405.16148, 2412.06028, 2506.05709).

Computation-efficient token selection is thus predominantly guided by metrics such as FLOPs/token or total inference cost, cross-referenced against downstream evaluation metrics (GLUE, ImageNet, FID, CIDEr, mIoU). Notably, designs like ToP or Token Transforming demonstrate that accurate token importance ranking and flexible, soft token assignments provide improved robustness and performance retention compared to rigid, early-layer pruning.

6. Theoretical and Empirical Capacity Limits

Recent investigations highlight the substantial gap between theoretically possible FLOPs reductions and those currently attained in practice. By optimizing "memory" input vectors per sample, it is possible to compress up to 1568 tokens into a single embedding, suggesting up to $1500\times$ reduction in sequential token computation, far exceeding the $10\times$ achievable by previous encoder-based methods (2502.13063). The practical compression ratio is, however, limited by the model's cross-entropy loss rather than by input length or vector size: the information bottleneck lies in the amount of uncertainty that must be reduced, not in the raw number of tokens.

7. Application Domains and Practical Implications

Non-embedding FLOPs per token has become a central metric for transformer deployment in several domains:

  • NLP and LLMs: Efficient token pruning and merging have been validated on GLUE, SQuAD, and long-sequence code tasks, enhancing feasibility for inference on CPUs/GPUs and enabling longer-context handling (e.g., in SparseCoder as used for code analysis) (2310.07109).
  • Vision: Vision transformers benefit from advanced merging (DTEM, PiToMe) and matrix-based transforming, particularly in classification, segmentation, and dense prediction, providing strong trade-offs between GFLOPs and mIoU/CIDEr (2412.10569, 2405.16148, 2506.05709).
  • Generation/Multimodal: In large-scale generative models (diffusion, LLaVA-like multimodal), fine-grained token density control and many-to-many transformations yield substantial inference acceleration with negligible loss in output quality (2412.06028, 2506.05709).

Careful FLOPs budgeting, adaptive token processing, and information-preserving reduction methods are critical for sustained progress in efficient transformer systems, facilitating practical deployment at scale and under resource constraints.