Adaptive Token Scheduler in Transformers
- Adaptive Token Scheduler (ATS) is a dynamic mechanism for selecting, pruning, or merging tokens based on instantaneous importance, optimizing compute and energy efficiency.
- ATS implementations range from importance sampling in Vision Transformers to buffer-aware scheduling in LLMs, achieving notable GFLOP reductions and energy savings.
- The framework enables plug-and-play integration and adaptive computation across various transformer architectures, balancing accuracy and latency in real-world applications.
An Adaptive Token Scheduler (ATS) is a system module or algorithmic framework that dynamically selects, prunes, or merges tokens in sequence models such as transformers, based on their estimated importance for the current input or serving scenario. The ATS aims to optimize model throughput, energy or compute efficiency, and user-centric quality, by adaptively allocating token-level resources either per-layer within a model or across competing model-serving requests. Several architectures implement the ATS principle in different modalities—vision and language—and on various hardware substrates, including both artificial neural network (ANN) and spiking neural network (SNN) transformer variants (Fayyaz et al., 2021, Chen et al., 3 Oct 2025, Chen et al., 2024, Kang et al., 2024).
1. Core Algorithmic Principles
At the foundational level, an ATS monitors activations, attention scores, output buffer utilization, or structurally learned priorities to determine, in real time and per instance/batch/request, which tokens merit additional computation and which can be pruned, merged, or deprioritized. Multiple instantiations of ATS exist:
- Token Importance Sampling: ATS computes importance scores (often attention-based), producing a ranked or probabilistic subset of tokens for continued processing (Fayyaz et al., 2021).
- Merging Mechanisms: Redundant or similar tokens are fused via similarity-based weighted averaging, further reducing sequence length (Chen et al., 2024, Kang et al., 2024).
- Buffer-Aware Prioritization: In serving LLMs, ATS ranks requests for GPU scheduling by the demand for new tokens, as inferred from buffer occupancy and consumption rates (Chen et al., 3 Oct 2025).
- Halting or Early-Exit Gating: ATS applies adaptive computation time (ACT) ideas, where tokens accumulate halting scores and are pruned once their usefulness is deemed saturated (Kang et al., 2024).
The principle unifying these approaches is adaptivity: the number and identity of tokens transmitted to subsequent processing stages is not fixed a priori, but is fluid, content-driven, and dictated by instantaneous task demands or system load.
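As a concrete illustration of the merging mechanism listed above, the following minimal sketch fuses the single most similar token pair per image by plain averaging. It is a heavily simplified stand-in for the similarity-weighted merging used in the cited systems; the function name and the 0.5/0.5 averaging are illustrative assumptions, not any paper's reference implementation.

```python
import torch
import torch.nn.functional as F

def merge_most_similar_pair(x):
    """Illustrative sketch: fuse the single most similar token pair per image
    by averaging, shrinking the sequence by one token.

    x: (batch, tokens, dim) token features
    Returns a (batch, tokens-1, dim) tensor.
    """
    xn = F.normalize(x, dim=-1)
    sim = xn @ xn.transpose(1, 2)                        # pairwise cosine similarity
    sim.diagonal(dim1=1, dim2=2).fill_(float("-inf"))    # ignore self-pairs

    out = []
    for b in range(x.shape[0]):
        # Index of the most similar (i, j) pair for this image.
        flat = sim[b].argmax()
        i, j = divmod(flat.item(), sim.shape[-1])
        keep = [t for t in range(x.shape[1]) if t != j]
        tokens = x[b, keep]
        # Fuse token j into token i's slot by simple averaging.
        tokens[keep.index(i)] = 0.5 * (x[b, i] + x[b, j])
        out.append(tokens)
    return torch.stack(out)
```

Real implementations merge many pairs at once (e.g., via bipartite matching) and weight the average by token importance or size, but the basic operation is the same similarity-driven fusion.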
2. Vision Transformer ATS: Adaptive Token Sampler and Extensions
The canonical Adaptive Token Sampler (ATS) (Fayyaz et al., 2021) provides a differentiable, parameter-free module compatible with standard Vision Transformer (ViT) blocks. Insertion occurs after attention matrix calculation but prior to its application to value vectors. The workflow is:
- Compute class-token attention to all patches.
- Score each patch token $j$ using $S_j = \frac{A_{1,j}\,\lVert V_j\rVert}{\sum_{k} A_{1,k}\,\lVert V_k\rVert}$, where $A_{1,j}$ is the softmaxed class-token attention to patch $j$ and $V_j$ the value vector for patch $j$.
- Form a cumulative distribution function (CDF) over scores, and sample up to a budget of $K$ tokens via deterministic quantile slots (inverse-transform sampling).
- Retain the class token and selected patch tokens; prune attention rows and value matrix accordingly.
This mechanism enables dynamic per-layer and per-image token scheduling. Empirically, integrating ATS into DeiT-S, CvT-13, and PS-ViT-B/14 achieves 29–51% reductions in GFLOPs with <0.2% top-1 accuracy degradation across benchmarks such as ImageNet and Kinetics-400/600. The ATS requires no new parameters and may be deployed as a plug-and-play module on pre-trained ViTs, supporting both inference-only insertion and end-to-end fine-tuning. The scheduler can be configured at inference-time to trade off accuracy and latency without retraining (Fayyaz et al., 2021).
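The scoring and sampling workflow above can be sketched as follows. This is a minimal, single-head simplification; the helper name `ats_sample` and the handling of the class token and of duplicate slots are illustrative rather than the reference implementation of Fayyaz et al. (2021).

```python
import torch

def ats_sample(attn_cls, values, n_slots):
    """Sketch of ATS-style significance scoring and CDF-based token sampling.

    attn_cls: (batch, patches) softmaxed class-token attention over patch tokens
    values:   (batch, patches, dim) value vectors for the same patch tokens
    n_slots:  maximum number of patch tokens to keep (the budget K)
    """
    # Significance score: attention weight times value-vector norm, normalized.
    sig = attn_cls * values.norm(dim=-1)
    sig = sig / sig.sum(dim=-1, keepdim=True)

    # Cumulative distribution over scores, with evenly spaced quantile slots.
    cdf = sig.cumsum(dim=-1)                                    # (batch, patches)
    quantiles = torch.linspace(0.0, 1.0, n_slots, device=cdf.device)
    slots = quantiles.unsqueeze(0).expand(cdf.shape[0], -1).contiguous()

    # For each slot, take the first patch whose CDF value reaches the quantile.
    idx = torch.searchsorted(cdf.contiguous(), slots).clamp(max=cdf.shape[1] - 1)

    # Duplicate hits collapse, so the kept-token count adapts per image.
    return [torch.unique(row) for row in idx]   # per-image patch indices to keep
```

Because several quantile slots can land on the same high-scoring patch, the number of retained tokens varies per image, which is exactly what makes the sampler adaptive rather than a fixed top-$K$ selection.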
3. Adaptive Token Scheduling in Transformer Serving Systems
In adaptive transformer serving contexts, an ATS acts as a dynamic resource orchestrator to maximize system utility. OTAS (Online Token Adaptation Serving) introduces a dual-mode adaptation policy comprising:
- Prompting: dynamically prepending learnable tokens to boost accuracy when needed.
- Merging: aggressively fusing the least-informative tokens per batch/layer to reduce computation.
A discrete candidate set encodes the allowable adaptation levels. OTAS clusters incoming queries into micro-batches aligned by arrival time, deadline, and utility, and for each batch solves a utility-maximization problem, with per-level accuracy and latency estimated from offline profiling, under constraints of latency, sequential ordering, and GPU memory. An efficient dynamic programming algorithm (or a heuristic when load is low) selects the per-batch adaptation level to maximize expected utility. This approach yields up to 18.2–90.1% utility gain, type-1 query success rates of 75.6–85.5%, and system throughput up to 1,000 req/s on an RTX 4080 (Chen et al., 2024).
4. Buffer-Aware ATS and Preemptive Scheduling in LLM Serving
In text streaming LLM serving, TokenFlow implements ATS as a preemptive, buffer-aware request scheduler (Chen et al., 3 Oct 2025). Key operational metrics monitored for each request are:
- Buffer occupancy: the number of generated but not-yet-read tokens held for the request
- Consumption rate: the rate (tokens/sec) at which the client reads tokens from that buffer
The scheduler computes a per-request priority that combines two terms: one that raises urgency for requests whose buffers are nearly empty relative to their consumption rate, and one that reflects the expected number of useful tokens the request will generate within the next scheduling slice. A greedy algorithm divides time into slices, ranks active requests, and makes admit/evict decisions subject to GPU memory constraints; evicted requests' KV caches are proactively offloaded or streamed according to real-time I/O versus recompute costs. Such preemptive scheduling, paired with I/O overlap in cache management, delivers up to 82.5% higher user-effective throughput and shrinks P99 time-to-first-token by up to 80.2% with no reduction in raw token throughput (Chen et al., 3 Oct 2025).
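A schematic of the buffer-aware admit/evict loop is sketched below. The field names, the priority heuristic, and the per-slice generation rate are illustrative assumptions, not TokenFlow's exact formula; the sketch only shows how buffer state and consumption rate can drive preemptive GPU admission.

```python
from dataclasses import dataclass

@dataclass
class Request:
    req_id: str
    buffer_tokens: int        # generated but not yet read by the client
    consume_rate: float       # tokens/sec the client reads
    kv_cache_mb: float        # GPU memory held by this request's KV cache

def schedule_slice(requests, gpu_budget_mb, slice_sec=0.1, gen_rate=30.0):
    """Illustrative greedy admit/evict decision for one scheduling slice."""
    def priority(r):
        # Urgency rises as the client's remaining playback time shrinks,
        # scaled by the tokens that would actually be consumed next slice.
        playback_time = r.buffer_tokens / max(r.consume_rate, 1e-6)
        useful_next_slice = min(gen_rate * slice_sec, r.consume_rate * slice_sec)
        return useful_next_slice / (playback_time + slice_sec)

    admitted, evicted, used = [], [], 0.0
    for r in sorted(requests, key=priority, reverse=True):
        if used + r.kv_cache_mb <= gpu_budget_mb:
            admitted.append(r.req_id)
            used += r.kv_cache_mb
        else:
            evicted.append(r.req_id)   # KV cache offloaded or recomputed later
    return admitted, evicted
```

In the real system, the eviction branch further chooses between offloading and recomputation based on measured I/O versus recompute cost, overlapping transfers with generation.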
5. ATS for Spiking Neural Network Vision Transformers
AT-SNN extends ATS methodology to spiking neural network (SNN) ViTs, introducing a 2D (layer × timestep) halting mechanism and content-aware token merging (Kang et al., 2024). At each layer and timestep, each token computes a halting score via a sigmoid gate in the style of adaptive computation time, $h = \sigma(w^{\top} x + b)$. Accrued halting scores signal when a token can be safely masked (pruned) for subsequent blocks. Simultaneously, a token similarity metric (cosine similarity) underpins aggressive merging in a temporally consistent manner. Experiments on CIFAR-10/100 and TinyImageNet indicate up to 42.4% token-count reduction and up to 60% energy savings, sometimes with increased accuracy compared to non-adaptive baselines (e.g., 77.42%→78.14% on CIFAR-100) (Kang et al., 2024).
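The halting-score accumulation can be sketched as below, written against dense PyTorch tensors rather than an actual spiking backend; the gate parameters, the threshold value, and the function name are illustrative assumptions following the generic ACT formulation rather than AT-SNN's exact implementation.

```python
import torch

def update_halting_mask(x, halting_sum, w, b, threshold=0.99):
    """Sketch of ACT-style halting for tokens at one (layer, timestep) step.

    x:           (batch, tokens, dim) token features at this layer/timestep
    halting_sum: (batch, tokens) halting scores accumulated so far
    w, b:        parameters of the per-token sigmoid gate, w of shape (dim, 1)
    """
    # Per-token halting score from a sigmoid gate over the token features.
    h = torch.sigmoid(x @ w + b).squeeze(-1)        # (batch, tokens)

    # Only tokens that have not yet halted keep accumulating.
    active = halting_sum < threshold
    halting_sum = halting_sum + h * active

    # Tokens whose accumulated score crosses the threshold are masked out
    # (pruned) for subsequent blocks and timesteps.
    still_active = halting_sum < threshold
    return halting_sum, still_active
```

Because the accumulation runs over both layers and timesteps, a token halted at one timestep stays masked for the remaining depth, which is what makes the mechanism two-dimensional.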
6. Comparative Properties and Application Contexts
| ATS Variant | Domain | Adaptation Method | Outcomes/Trade-offs |
|---|---|---|---|
| ATS in ViT (Fayyaz et al., 2021) | Image/video | Score-based token sampling | 29–51% GFLOP cut, <0.2% loss |
| OTAS (Chen et al., 2024) | Cloud serving | Prompting/merging | 18–90% utility gain |
| TokenFlow (Chen et al., 3 Oct 2025) | LLM serving | Buffer-aware scheduling | 82.5% eff. throughput gain |
| AT-SNN (Kang et al., 2024) | SNN ViT | 2D halting + merging | 42% token & 40–60% energy cut |
Applications range from accelerating vision and language transformer inference on resource-constrained hardware, to maximizing utility in cloud-scale serving under fluctuating load, to reducing compute cost in power-sensitive neuromorphic processors. Each instantiation tailors token scheduling and adaptation to its domain constraints, leveraging plug-and-play deployment, differentiable optimization, or online combinatorial scheduling as appropriate for scenario-specific guarantees.
7. Quantitative Performance and Ablation Analyses
Empirical studies across ATS incarnations demonstrate tight control over cost–accuracy–utility tradeoffs:
- ViT-based ATS demonstrates linear to quadratic reduction in attention FLOPs with minimal top-1 accuracy drop (e.g., DeiT-S: 4.6→2.9 GFLOPs at 79.7% accuracy) (Fayyaz et al., 2021).
- OTAS yields up to 90.1% utility boost on a 120h Azure Functions trace, with sub-millisecond scheduler run-times and adaptive batch-wise control (Chen et al., 2024).
- TokenFlow reduces tail response latency 4× and nearly doubles “timely” token throughput under burst arrivals (Chen et al., 3 Oct 2025).
- In SNN ViT, AT-SNN achieves similar or better accuracy at significantly reduced token counts and hardware energy costs, benefiting from 2D halting and temporally consistent merges (Kang et al., 2024).
Ablation analyses highlight the importance of dynamic, context-driven adaptation (versus fixed prompting/merging), temporally consistent merges (in SNNs), and buffer-driven prioritization (in LLM streaming).
In aggregate, the Adaptive Token Scheduler paradigm subsumes a family of domain-adapted, content- and demand-aware token-level resource allocation methods that enable elastic and efficient deployment of transformer-based (and SNN-augmented) models across diverse inference and serving environments (Fayyaz et al., 2021, Chen et al., 3 Oct 2025, Chen et al., 2024, Kang et al., 2024).