Dynamic Token Selection (OrthoRank)

Updated 31 May 2026
  • Dynamic Token Selection (OrthoRank) is a method that uses token-to-sink cosine similarity to select salient tokens, reducing unnecessary computations in neural networks.
  • It leverages geometric properties and top-K selection to prune tokens effectively, addressing quadratic attention costs in large-scale models like LLMs and recommendation systems.
  • Empirical evaluations report significant speedups and memory savings, including up to 23.8× faster attention computation, alongside maintained or improved performance in language modeling and generative tasks.

Dynamic Token Selection, commonly referred to as OrthoRank in recent literature, encompasses a family of methods for pruning, selecting, or prioritizing tokens at runtime in large neural architectures. The primary objective is to overcome the computational bottleneck imposed by quadratic attention and excessive memory usage, especially in high-cardinality, sparse, or long-context settings typical of modern recommendation systems, LLMs, and generative architectures. OrthoRank methods leverage token-wise geometric or semantic properties—most prominently the orthogonality of token feature representations to global “sink” or reference directions—to adaptively determine which tokens merit full computation in each layer.

1. Mathematical Foundations of OrthoRank-Based Token Selection

OrthoRank bases token selection on token-to-sink geometries in representation space. Consider a hidden state matrix $H^{(\ell)}\in\mathbb{R}^{B\times N\times d}$ at layer $\ell$, with batch size $B$, token count $N$, and feature dimension $d$. For each sequence, the sink token (often the first special or delimiter token) has normalized state $\bar h_0^{(\ell)}$. All token states are normalized as

$$\bar h_i^{(\ell)} = \frac{h_i^{(\ell)}}{\|h_i^{(\ell)}\|}$$

The cosine similarity with the sink is

$$\cos(\bar h_0^{(\ell)}, \bar h_i^{(\ell)}) = \frac{\langle h_0^{(\ell)}, h_i^{(\ell)} \rangle}{\|h_0^{(\ell)}\|\,\|h_i^{(\ell)}\|}$$

The OrthoRank importance scores select tokens with maximal orthogonality to the sink; i.e., tokens for which $|\cos(\bar h_0^{(\ell)},\bar h_i^{(\ell)})|$ is minimized. A formalization:

$$\text{orthogonality}_i^{(\ell)} = 1 - |\bar h_0^{(\ell)\top}\bar h_i^{(\ell)}|$$

The top-$K$ tokens by this measure are retained for full update; others bypass attention/FFN updates but may still contribute key/value representations for global attention (Shin et al., 5 Jul 2025).
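As an illustration of the score defined above, the following is a minimal sketch in plain Python for a single sequence. The function names `orthogonality_scores` and `select_top_k`, the choice of position 0 as the sink, and the toy vectors are assumptions made for the example, not taken from the cited paper. It also assumes every token state has nonzero norm.

```python
import math

def orthogonality_scores(hidden, sink_index=0):
    """Return 1 - |cos(sink, token)| for every token in one sequence.

    hidden: list of N token vectors (each a list of d floats).
    sink_index: position of the sink token (assumed to be 0 here).
    """
    def normalize(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]

    unit = [normalize(v) for v in hidden]
    sink = unit[sink_index]
    return [1.0 - abs(sum(a * b for a, b in zip(sink, u))) for u in unit]

def select_top_k(scores, k):
    """Indices of the k tokens with the highest orthogonality score."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

# Toy 2-dimensional states (illustrative values only).
hidden = [
    [1.0, 0.0],  # sink token
    [2.0, 0.0],  # parallel to the sink: score 0
    [0.0, 3.0],  # orthogonal to the sink: score 1
    [1.0, 1.0],  # 45 degrees from the sink: score 1 - 1/sqrt(2)
]
scores = orthogonality_scores(hidden)
selected = select_top_k(scores, k=2)
```

Because the score uses the absolute cosine, a token pointing exactly opposite to the sink scores 0, just like a token aligned with it.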

2. Algorithmic Implementation and Variants

A generic OrthoRank layer operates as follows:

  1. Compute the normalized hidden states for all tokens.
  2. Evaluate the cosine similarity of each token to the sink token.
  3. Select the $K$ tokens with the highest orthogonality scores.
  4. Apply attention and feed-forward operations to these tokens. Non-selected tokens provide only key/value vectors for context but do not undergo further transformation in that layer.

Pseudocode for this procedure appears in the original paper (Shin et al., 5 Jul 2025).
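The four steps listed above can be sketched roughly as follows. This is not the authors' pseudocode: `orthorank_layer` is a hypothetical name, `update_fn` is a stand-in for the real attention and feed-forward block, and `mean_mix` is a toy update chosen only so the example runs without dependencies.

```python
import math

def orthorank_layer(hidden, k, update_fn, sink_index=0):
    """One selective layer: update only the k most sink-orthogonal tokens.

    hidden: list of N nonzero token vectors for one sequence.
    update_fn(token_vec, context): hypothetical stand-in for attention + FFN;
        it receives one selected token and ALL token states as context.
    """
    def unit(v):
        n = math.sqrt(sum(x * x for x in v))
        return [x / n for x in v]

    sink = unit(hidden[sink_index])
    # Steps 1-2: normalize states and measure |cosine| to the sink.
    ortho = [1.0 - abs(sum(a * b for a, b in zip(sink, unit(h)))) for h in hidden]
    # Step 3: indices of the k highest orthogonality scores.
    order = sorted(range(len(hidden)), key=lambda i: ortho[i], reverse=True)
    keep = set(order[:k])
    # Step 4: selected tokens are transformed; the others pass through
    # unchanged but still appear in the context seen by update_fn.
    out = [update_fn(h, hidden) if i in keep else list(h)
           for i, h in enumerate(hidden)]
    return out, sorted(keep)

def mean_mix(token_vec, context):
    """Toy update: residual addition of the mean of all context vectors."""
    d = len(token_vec)
    mean = [sum(c[j] for c in context) / len(context) for j in range(d)]
    return [t + m for t, m in zip(token_vec, mean)]

hidden = [[1.0, 0.0], [2.0, 0.0], [0.0, 3.0], [1.0, 1.0]]
out, kept = orthorank_layer(hidden, k=2, update_fn=mean_mix)
```

In this toy run the sink and the token parallel to it are left untouched, while the two more orthogonal tokens receive the update.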

STORE (Xu et al., 24 Nov 2025) employs a variant of dynamic token selection within its attention mechanism. For each query, attention scores against the keys are computed and only the top-$k$ softmaxed scores are kept, so that only $k$ key/value pairs influence the attention update for that query. A threshold gating variant is also supported, in which a threshold is adjusted so that, on average, a target number of key/value pairs is retained.
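A minimal sketch of per-query top-k attention, in the spirit of the description above, might look like this. It is not STORE's implementation: the function name `topk_attention`, the scaled dot-product scoring, and the renormalization of the kept weights are all assumptions made for illustration.

```python
import math

def topk_attention(query, keys, values, k):
    """Attend from one query to only the k highest-weighted key/value pairs."""
    d = len(query)
    # Scaled dot-product logits against every key.
    logits = [sum(q * x for q, x in zip(query, key)) / math.sqrt(d) for key in keys]
    # Numerically stable softmax over all keys.
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Keep the k largest softmax weights.
    top = sorted(range(len(weights)), key=lambda j: weights[j], reverse=True)[:k]
    # Assumption: renormalize the kept weights so they sum to 1.
    kept_mass = sum(weights[j] for j in top)
    out = [0.0] * len(values[0])
    for j in top:
        w = weights[j] / kept_mass
        for t, v in enumerate(values[j]):
            out[t] += w * v
    return out, sorted(top)

query = [1.0, 0.0]
keys = [[4.0, 0.0], [0.0, 4.0], [3.0, 0.0], [-4.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [5.0, 5.0]]
out, top = topk_attention(query, keys, values, k=2)
```

Here the two keys most aligned with the query are kept, and the value attached to the least aligned key never reaches the output.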

Other methods (e.g., OptiPrune (Lu, 1 Jul 2025), TokenSelect (Wu et al., 2024), DynTS (Guo et al., 26 Jan 2026)) implement dynamic selection based on similarity to representative tokens, head-wise criticality scoring, or learned importance predictors, but fundamentally share the OrthoRank philosophy of isolating and prioritizing salient tokens per-layer or per-step.

3. Theoretical and Computational Properties

OrthoRank-based selection yields substantial computational and memory benefits by reducing the quadratic cost of standard attention. For a sequence of $N$ tokens with feature dimension $d$, full attention costs $O(N^2 d)$ per layer. OrthoRank-style sparse updates restrict attention and feed-forward computation to the retained fraction of tokens, at the price of a small additional cost for computing and sorting importance scores. This selection overhead is minor relative to the FLOPs saved in self-attention and feed-forward layers (Shin et al., 5 Jul 2025, Xu et al., 24 Nov 2025).
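To make the scaling concrete, here is a rough back-of-the-envelope cost model. It rests on a simplifying assumption that is not a figure from the cited papers: only retained tokens issue queries, while all tokens still serve as keys, so the query-key score matrix shrinks in proportion to the retention fraction. The function name and the example sizes are hypothetical.

```python
def attention_score_cost(n_tokens, d, retention=1.0):
    """Rough multiply-accumulate count for the query-key score matrix.

    Assumes retained tokens act as queries and all tokens act as keys.
    """
    n_queries = int(round(retention * n_tokens))
    return n_queries * n_tokens * d

# Illustrative sizes: 4096 tokens, 128-dimensional heads.
full = attention_score_cost(4096, 128)
sparse = attention_score_cost(4096, 128, retention=0.25)
ratio = full / sparse
```

Under this model, keeping a quarter of the tokens as queries cuts the score computation by a factor of four. Real savings also depend on feed-forward cost and on the selection overhead.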

In hybrid systems such as STORE, additional transformations (e.g., orthogonal rotation blocks applied to low-cardinality features) add a negligible per-instance cost that depends only on the number of rotation matrices and the feature dimension, ensuring scalability is not compromised (Xu et al., 24 Nov 2025).

4. Empirical Performance and Ablation Studies

Dynamic token selection via OrthoRank achieves strong empirical results across model architectures and benchmarks:

  • STORE: AUC increased from 0.6774 to 0.6804 (+1.195%), with online CTR uplift of 2.71% and 1.84× training throughput. Removing OrthoRank drops AUC to 0.6780, while replacing sparse attention with dense attention leads to higher FLOP usage with negligible accuracy gain (Xu et al., 24 Nov 2025).
  • OrthoRank for LLMs: On Llama-2-13B, OrthoRank combined with standard layer pruning reduces perplexity from 9.42 to 8.74 at 20% sparsity and yields absolute gains of 3–6 percentage points in zero-shot accuracy over baseline pruning. Throughput gains scale linearly with sparsity, and the method outperforms attention-score-based baselines in both accuracy and efficiency (Shin et al., 5 Jul 2025).
  • Other domains: In diffusion models (e.g., OptiPrune), OrthoRank-style similarity pruning achieves 30–40% speedup for attention layers with no loss in prompt-image alignment (Lu, 1 Jul 2025). In LLMs for long context or reasoning, dynamic selection maintains or improves accuracy while delivering up to 23.8× speedup in attention computation and up to 5× KV-cache compression (Wu et al., 2024, Guo et al., 26 Jan 2026).

Ablations consistently show that geometric or sink-based selection strategies outperform heuristic or naïve attention-magnitude approaches when severely pruning tokens.

5. Related Approaches

Several related streams of research complement OrthoRank principles:

  • Reinforcement-Learned Selection: In TR-BERT (Ye et al., 2021), a learned policy network assigns Select/Skip actions to tokens at each layer, optimizing for end-task accuracy and computational cost. Unlike OrthoRank, which is training-free and geometric, this approach explicitly learns the pruning policy through RL.
  • Similarity-Driven Pruning: OptiPrune (Lu, 1 Jul 2025) computes pairwise (cosine) similarities, selects spatially-central tokens by patch-region, injects controlled randomness to avoid bias, and recovers pruned tokens by copying features from their closest retained counterparts. This preserves both efficiency and alignment in generative pipelines.
  • Criticality Scoring: TokenSelect (Wu et al., 2024) introduces per-head, per-token scoring by Q–K dot-products, aggregating across heads to select globally most-influential tokens. This enables extreme context length extrapolation (over 1M tokens) without retraining the underlying model.
  • Importance Prediction: DynTS (Guo et al., 26 Jan 2026) attaches an MLP-based importance predictor to each token during decoding, learning to retain only those tokens most critical for the reasoning trace.
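The recovery step described for similarity-driven pruning (copying features from the closest retained token) can be sketched as follows. This is an illustrative reading of that one step, not OptiPrune's code: the names `cosine` and `recover_pruned`, the dictionary layout, and the toy features are assumptions.

```python
import math

def cosine(a, b):
    """Cosine similarity between two nonzero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def recover_pruned(original, processed):
    """Rebuild a full-length sequence after a pruned layer.

    original: pre-pruning features for all N tokens (used to measure similarity).
    processed: dict {token index: output features} for retained tokens only.
    Pruned positions copy the output of their most similar retained token.
    """
    full = []
    for i, feat in enumerate(original):
        if i in processed:
            full.append(list(processed[i]))
        else:
            nearest = max(processed, key=lambda j: cosine(feat, original[j]))
            full.append(list(processed[nearest]))
    return full

original = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
processed = {0: [10.0, 0.0], 2: [0.0, 10.0]}  # tokens 1 and 3 were pruned
restored = recover_pruned(original, processed)
```

Each pruned token inherits the output of the retained token it most resembles, so downstream layers still see a sequence of the original length.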

The prevalence of geometric, similarity, and learned importance metrics in contemporary literature highlights a convergence on token adaptivity as a central tool for efficient and scalable neural computation.

6. Practical Considerations, Limits, and Tuning

Optimal use of OrthoRank methods requires careful tuning of the retention ratio, pruning frequency, and the specific selection metric. For example, retaining 10–30% of tokens per layer balances throughput and accuracy in LLMs (Shin et al., 5 Jul 2025). In STORE (Xu et al., 24 Nov 2025), the top-$k$ selection per query adapts to feature heterogeneity and dynamically evolving token importance.

Potential limitations include cases where the sink token is a poor global semantic anchor, edge tasks that rely heavily on rare or singleton tokens, and domains where attention weights poorly reflect true token criticality (Guo et al., 26 Jan 2026). Empirical evidence suggests robust generalization when the sink direction is stable and the selection metric aligns with task semantics.

For model-agnostic integration, methods such as TokenSelect require only minor kernel augmentations to existing Triton- or CUDA-based serving stacks, and no retraining of the backbone (Wu et al., 2024). OrthoRank and its variants are thus practical both for research benchmarking and real-world deployment.


References: STORE (Xu et al., 24 Nov 2025); OrthoRank (Shin et al., 5 Jul 2025); TR-BERT (Ye et al., 2021); TokenSelect (Wu et al., 2024); OptiPrune (Lu, 1 Jul 2025); DynTS (Guo et al., 26 Jan 2026).
