
Adaptive Token Scheduler in Transformers

Updated 10 January 2026
  • Adaptive Token Scheduler (ATS) is a dynamic mechanism for selecting, pruning, or merging tokens based on instantaneous importance, optimizing compute and energy efficiency.
  • ATS implementations range from importance sampling in Vision Transformers to buffer-aware scheduling in LLMs, achieving notable GFLOP reductions and energy savings.
  • The framework enables plug-and-play integration and adaptive computation across various transformer architectures, balancing accuracy and latency in real-world applications.

An Adaptive Token Scheduler (ATS) is a system module or algorithmic framework that dynamically selects, prunes, or merges tokens in sequence models such as transformers, based on their estimated importance for the current input or serving scenario. The ATS aims to optimize model throughput, energy or compute efficiency, and user-centric quality by adaptively allocating token-level resources either per-layer within a model or across competing model-serving requests. Several architectures implement the ATS principle in different modalities—vision and language—and on various hardware substrates, including both artificial neural network (ANN) and spiking neural network (SNN) transformer variants (Fayyaz et al., 2021, Chen et al., 3 Oct 2025, Chen et al., 2024, Kang et al., 2024).

1. Core Algorithmic Principles

At the foundational level, an ATS monitors activations, attention scores, output buffer utilization, or structurally learned priorities to determine, in real time and per instance/batch/request, which tokens merit additional computation and which can be pruned, merged, or deprioritized. Multiple instantiations of ATS exist:

  • Token Importance Sampling: ATS computes importance scores (often attention-based), producing a ranked or probabilistic subset of tokens for continued processing (Fayyaz et al., 2021).
  • Merging Mechanisms: Redundant or similar tokens are fused via similarity-based weighted averaging, further reducing sequence length (Chen et al., 2024, Kang et al., 2024).
  • Buffer-Aware Prioritization: In serving LLMs, ATS ranks requests for GPU scheduling by the demand for new tokens, as inferred from buffer occupancy and consumption rates (Chen et al., 3 Oct 2025).
  • Halting or Early-Exit Gating: ATS applies adaptive computation time (ACT) ideas, where tokens accumulate halting scores and are pruned once their usefulness is deemed saturated (Kang et al., 2024).

The principle unifying these approaches is adaptivity: the number and identity of tokens transmitted to subsequent processing stages is not fixed a priori, but is fluid, content-driven, and dictated by instantaneous task demands or system load.
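
This shared pattern can be made concrete with a short sketch. The following PyTorch fragment is a hypothetical, minimal illustration of one such step (score tokens, keep a content-driven subset, fuse the rest into a summary token), not any single paper's method; the function name, the `keep_ratio` budget signal, and the score-weighted merge are assumptions.

```python
import torch

def adaptive_token_step(tokens: torch.Tensor, scores: torch.Tensor,
                        keep_ratio: float) -> torch.Tensor:
    """Hypothetical generic ATS step: keep high-scoring tokens, merge the rest.

    tokens: (N, d) token embeddings; scores: (N,) non-negative importance
    scores. `keep_ratio` stands in for whatever budget signal (layer config,
    system load) a concrete ATS would use.
    """
    n = tokens.size(0)
    k = max(1, int(n * keep_ratio))
    keep_idx = scores.topk(k).indices            # content-driven selection
    pruned = torch.ones(n, dtype=torch.bool)
    pruned[keep_idx] = False
    kept = tokens[keep_idx]
    if pruned.any():
        # Compress pruned tokens into one score-weighted summary token
        # instead of discarding their information outright.
        w = scores[pruned].clamp_min(1e-8)
        summary = (w[:, None] * tokens[pruned]).sum(0) / w.sum()
        kept = torch.cat([kept, summary[None]], dim=0)
    return kept
```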

2. Vision Transformer ATS: Adaptive Token Sampler and Extensions

The canonical Adaptive Token Sampler (ATS) (Fayyaz et al., 2021) provides a differentiable, parameter-free module compatible with standard Vision Transformer (ViT) blocks. Insertion occurs after attention matrix calculation but prior to its application to value vectors. The workflow is:

  • Compute class-token attention to all patches.
  • Score each patch token as $S_j = \frac{A_{1,j}\,\|V_j\|_2}{\sum_{i=2}^{N+1} A_{1,i}\,\|V_i\|_2}$, where $A$ is the softmaxed attention matrix and $V_j$ the value vector for patch $j$.
  • Form a cumulative distribution function (CDF) over the scores, and sample up to $K$ tokens via deterministic quantile slots.
  • Retain the class token and selected patch tokens; prune attention rows and value matrix accordingly.

This mechanism enables dynamic per-layer and per-image token scheduling. Empirically, integrating ATS into DeiT-S, CvT-13, and PS-ViT-B/14 achieves 29–51% reductions in GFLOPs with <0.2% top-1 accuracy degradation across benchmarks such as ImageNet and Kinetics-400/600. The ATS requires no new parameters and may be deployed as a plug-and-play module on pre-trained ViTs, supporting both inference-only insertion and end-to-end fine-tuning. The scheduler can be configured at inference time to trade off accuracy and latency without retraining (Fayyaz et al., 2021).
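
As a concrete reading of the workflow above, a minimal single-image, head-averaged sketch follows; the real module is batched, multi-head, and differentiable, so the tensor shapes and the duplicate-slot handling here are simplifying assumptions.

```python
import torch

def ats_select(attn: torch.Tensor, value: torch.Tensor, K: int) -> torch.Tensor:
    """Token selection per the ATS workflow (simplified sketch).

    attn : (N+1, N+1) softmaxed attention, row/column 0 = class token
    value: (N+1, d)   value vectors
    Returns indices of retained tokens (class token plus at most K patches).
    """
    cls_attn = attn[0, 1:]                      # class-token attention to patches
    v_norm = value[1:].norm(dim=-1)             # ||V_j||_2 per patch
    s = cls_attn * v_norm
    s = s / s.sum()                             # importance scores S_j
    cdf = torch.cumsum(s, dim=0)
    # Deterministic quantile slots: K evenly spaced points on the CDF.
    q = (torch.arange(K, dtype=cdf.dtype) + 0.5) / K
    picks = torch.searchsorted(cdf, q)          # inverse-CDF sampling
    picks = picks.clamp(max=s.numel() - 1).unique()   # duplicate slots collapse
    return torch.cat([torch.tensor([0]), picks + 1])  # offset past class token
```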

3. Adaptive Token Scheduling in Transformer Serving Systems

In adaptive transformer serving contexts, an ATS acts as a dynamic resource orchestrator to maximize system utility. OTAS (Online Token Adaptation Serving) introduces a dual-mode adaptation policy comprising:

  • Prompting: Dynamically prepending learnable tokens to boost accuracy when needed ($\gamma > 0$).
  • Merging: Aggressively fusing up to $|\gamma|$ least-informative tokens per batch/layer to reduce computation ($\gamma < 0$).

A discrete set $\Gamma$ encodes the allowable $\gamma$ values. OTAS clusters incoming queries into micro-batches aligned by arrival time, deadline, and utility, and for each batch solves:

$$\max_{\{\gamma_b\}} \sum_{b=1}^{N_B} U_b(\gamma_b), \qquad U_b(\gamma) = \Big(\sum_{r\in B_b} u_r\Big)\, A_b(\gamma)$$

with the accuracy $A_b(\gamma)$ and latency $L_b(\gamma)$ estimated from profiling, under constraints on latency, sequential ordering, and GPU memory. An efficient dynamic programming algorithm (or a heuristic when load is low) selects the per-batch $\gamma_b$ that maximizes expected utility. This approach yields utility gains of 18.2–90.1%, type-1 query success rates of 75.6–85.5%, and system throughput up to 1,000 Req/s on an RTX 4080 (Chen et al., 2024).
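
A minimal sketch of such a dynamic program over discretized cumulative latency is shown below; the `profile(gamma, batch)` callback, the dictionary field names, and the time grid are illustrative assumptions, and the paper's memory constraint and low-load heuristic path are omitted.

```python
def schedule_gammas(batches, gammas, profile, dt=0.001, t_max=1.0):
    """Pick a token-adaptation level gamma_b per micro-batch via DP (sketch).

    batches: processed in arrival order; each a dict with 'utility_sum'
             (sum of u_r over its requests) and 'deadline' (seconds).
    gammas:  the discrete set Gamma of allowable adaptation levels.
    profile: assumed callback, profile(gamma, batch) -> (A_b, L_b), i.e.
             profiled accuracy and latency for running `batch` at `gamma`.
    """
    n_t = int(t_max / dt) + 1
    NEG = float("-inf")
    best = [NEG] * n_t        # best[t]: max utility with cumulative latency t*dt
    best[0] = 0.0
    choices = []              # per-batch argmax tables, kept for backtracking
    for b in batches:
        nxt, arg = [NEG] * n_t, [None] * n_t
        for t in range(n_t):
            if best[t] == NEG:
                continue
            for g in gammas:
                acc, lat = profile(g, b)
                t2 = t + int(round(lat / dt))
                # Sequential ordering: batch b completes at t2*dt, which
                # must respect both its deadline and the time grid.
                if t2 >= n_t or t2 * dt > b["deadline"]:
                    continue
                u = best[t] + b["utility_sum"] * acc   # U_b = (sum u_r) * A_b
                if u > nxt[t2]:
                    nxt[t2], arg[t2] = u, g
        best = nxt
        choices.append(arg)
    t_star = max(range(n_t), key=lambda t: best[t])
    return best[t_star], choices   # backtrack through `choices` for the gammas
```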

4. Buffer-Aware ATS and Preemptive Scheduling in LLM Serving

In text streaming LLM serving, TokenFlow implements ATS as a preemptive, buffer-aware request scheduler (Chen et al., 3 Oct 2025). Key operational metrics monitored for each request $i$ are:

  • $b_i$: buffer occupancy (number of unread tokens)
  • $r_i$: requested token consumption rate (tokens/sec)

The scheduler computes a per-request priority:

$$\text{Priority}_i = \alpha\, e^{-b_i} + \beta\, (v_i\, t'_i)$$

where $e^{-b_i}$ raises the urgency of requests with low buffer occupancy, and $v_i t'_i$ reflects the expected useful token generation within the next slice. A greedy algorithm slices time into intervals, ranking active requests and making admit/eject decisions subject to GPU memory constraints; evicted requests’ KV caches are proactively offloaded or streamed according to real-time I/O versus recompute costs. Such preemptive scheduling, paired with I/O overlap in cache management, delivers up to 82.5% higher user-effective throughput and shrinks P99 time-to-first-token by up to 80.2%, with no reduction in raw token throughput (Chen et al., 3 Oct 2025).
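
A minimal sketch of the priority rule and a greedy admit pass is given below; the request field names (`buffer`, `value`, `horizon`) and the `kv_bytes` footprint callback are assumptions introduced for illustration.

```python
import math

def rank_by_priority(requests, alpha: float, beta: float):
    """Order requests by the buffer-aware priority rule (sketch).

    Each request is a dict with assumed fields:
      'buffer'  -> b_i, unread tokens in the client's buffer
      'value'   -> v_i, per-token utility weight
      'horizon' -> t'_i, expected useful generation time in the next slice
    """
    def priority(r):
        # alpha * e^{-b_i} boosts low-buffer (urgent) requests;
        # beta * v_i * t'_i rewards expected useful token generation.
        return alpha * math.exp(-r["buffer"]) + beta * r["value"] * r["horizon"]
    return sorted(requests, key=priority, reverse=True)

def greedy_admit(ranked, kv_bytes, mem_budget: int):
    """Admit requests down the ranking until GPU memory runs out (sketch);
    anything not admitted is preempted, its KV cache offloaded or recomputed
    depending on measured I/O versus recompute cost."""
    admitted, used = [], 0
    for r in ranked:
        need = kv_bytes(r)          # assumed callback: KV-cache footprint
        if used + need <= mem_budget:
            admitted.append(r)
            used += need
    return admitted
```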

5. ATS for Spiking Neural Network Vision Transformers

AT-SNN extends ATS methodology to spiking neural network (SNN) ViTs, introducing a 2D halting mechanism and content-aware token merging (Kang et al., 2024). At each layer and timestep, each token $k$ calculates a halting score $h_k^{(\ell, t)}$ via a sigmoid gate:

$$h_k^{(\ell, t)} = \sigma\!\left( \alpha\, \frac{T_{k,1}^{(\ell, t)}}{N\, T_{k}^{(\ell, t)}} + \beta \right)$$

Accrued halting scores signal when a token can be safely masked (pruned) for subsequent blocks. Simultaneously, a token similarity metric (cosine similarity) underpins aggressive merging in a temporally consistent manner. Experiments on CIFAR-10/100 and TinyImageNet indicate up to 42.4% token-count reduction and up to 60% energy savings, sometimes with increased accuracy compared to non-adaptive baselines (e.g., 77.42%→78.14% on CIFAR-100) (Kang et al., 2024).
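
A minimal sketch of the per-token accumulation this rule implies follows, assuming the spiking firing statistic has already been gathered per token; the accumulation threshold `1 - eps` follows the usual adaptive-computation-time convention and is an assumption here.

```python
import torch

def update_halting(h_acc: torch.Tensor, spike_stat: torch.Tensor,
                   alpha: float, beta: float, eps: float = 0.01):
    """One layer/timestep of 2D halting for SNN ViT tokens (sketch).

    h_acc     : (N,) halting scores accumulated so far per token
    spike_stat: (N,) the T_{k,1}/(N T_k) firing ratio per token; how this
                statistic is collected from the spiking block is elided.
    Returns the updated accumulator and a mask of still-active tokens;
    tokens whose accumulated score crosses 1 - eps are masked (pruned)
    for all subsequent blocks.
    """
    h = torch.sigmoid(alpha * spike_stat + beta)   # h_k^{(l, t)}
    h_acc = h_acc + h
    active = h_acc < (1.0 - eps)                   # still-computing tokens
    return h_acc, active
```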

6. Comparative Properties and Application Contexts

| ATS Variant | Domain | Adaptation Method | Outcomes/Trade-offs |
|---|---|---|---|
| ATS in ViT (Fayyaz et al., 2021) | Image/video | Score-based token sampling | 29–51% GFLOP cut, <0.2% accuracy loss |
| OTAS (Chen et al., 2024) | Cloud serving | Prompting/merging | 18–90% utility gain |
| TokenFlow (Chen et al., 3 Oct 2025) | LLM serving | Buffer-aware scheduling | Up to 82.5% effective-throughput gain |
| AT-SNN (Kang et al., 2024) | SNN ViT | 2D halting + merging | ~42% token cut, 40–60% energy savings |

Applications range from accelerating vision and language transformer inference on resource-constrained hardware, to maximizing utility in cloud-scale serving under fluctuating load, to reducing compute cost in power-sensitive neuromorphic processors. Each instantiation tailors token scheduling and adaptation to its domain constraints, leveraging plug-and-play deployment, differentiable optimization, or online combinatorial scheduling as appropriate for scenario-specific guarantees.

7. Quantitative Performance and Ablation Analyses

Empirical studies across ATS incarnations demonstrate tight control over cost–accuracy–utility tradeoffs:

  • ViT-based ATS reduces FLOPs between linearly and quadratically in the number of pruned tokens (quadratically for attention, linearly for MLPs), with minimal top-1 accuracy drop (e.g., DeiT-S: 4.6→2.9 GFLOPs at 79.7% accuracy) (Fayyaz et al., 2021).
  • OTAS yields up to 90.1% utility boost on a 120h Azure Functions trace, with sub-millisecond scheduler run-times and adaptive batch-wise control (Chen et al., 2024).
  • TokenFlow reduces tail response latency by 4× and nearly doubles “timely” token throughput under bursty arrivals (Chen et al., 3 Oct 2025).
  • In SNN ViT, AT-SNN achieves similar or better accuracy at significantly reduced token counts and hardware energy costs, benefiting from 2D halting and temporally consistent merges (Kang et al., 2024).

Ablation analyses highlight the importance of dynamic, context-driven adaptation (versus fixed prompting/merging), temporally consistent merges (in SNNs), and buffer-driven prioritization (in LLM streaming).


In aggregate, the Adaptive Token Scheduler paradigm subsumes a family of domain-adapted, content- and demand-aware token-level resource allocation methods that enable elastic and efficient deployment of transformer-based (and SNN-augmented) models across diverse inference and serving environments (Fayyaz et al., 2021, Chen et al., 3 Oct 2025, Chen et al., 2024, Kang et al., 2024).
