Adaptive Token Sampler (ATS)
- Adaptive Token Sampler is a dynamic mechanism that adjusts token selection based on data-dependent significance, reducing computational cost in token-based models.
- It employs techniques such as quantile-based sampling, reinforcement learning, and adaptive computation time, achieving significant GFLOPs reductions with minimal accuracy loss.
- ATS is versatile and plug-and-play, enabling integration with pre-trained models across diverse domains including vision, diffusion, spiking neural networks, and video modeling.
Adaptive Token Sampler (ATS) is a class of modules and algorithms designed to selectively sample, merge, or prune tokens in token-based architectures—including vision transformers (ViTs), masked diffusion models, spiking neural networks (SNNs), and masked video modeling—so as to reduce computational complexity without degrading accuracy. ATS dynamically adjusts the computational or sampling budget per input, layer, or iteration, based on data-dependent token significance metrics, hybrid exploration-exploitation scheduling, or motion priors, and is implemented via differentiable or algorithmic mechanisms that can be deployed without retraining or as part of end-to-end optimizations.
1. Core Principles and Motivations
The Adaptive Token Sampler paradigm addresses the inefficiency of static token processing. In models such as ViT, masked diffusion, and SNN-based transformers, computation or generation cost is proportional to the number of tokens. Not all tokens contribute equally to downstream tasks: many encode redundant, background, or low-information content. ATS aims to allocate compute or sampling to the most critical tokens on a per-instance basis, enabling:
- Dynamic reduction of computational cost (e.g., GFLOPs, energy)
- Maintenance or improvement of accuracy and diversity
- Flexibility across domains—vision, text, diffusion, event-based sensing
- Compatibility with pre-trained models (parameter-free insertion or minimal fine-tuning)
Underlying methods for ATS include deterministic token scoring and sampling (Fayyaz et al., 2021), probabilistically adaptive masking/sampling (Hayakawa et al., 6 Oct 2025, Rai et al., 13 May 2025), and reinforcement learning-based masking (Rai et al., 13 May 2025).
2. ATS in Vision Transformers
The prototypical ATS for ViT is a parameter-free, differentiable module inserted post-attention but pre-FFN within the transformer block. Its operation is as follows (Fayyaz et al., 2021):
- Significance Scoring: For each non-class token $j$, compute an importance score $S_j = \frac{A_{1,j}\,\lVert V_j\rVert}{\sum_i A_{1,i}\,\lVert V_i\rVert}$, where $A_{1,j}$ is taken from the class-token row of the attention matrix and $V_j$ is the token's value vector.
- Quantile-based Inverse-Transform Sampling: Construct the CDF of the scores $S$ and select up to $K$ tokens at evenly spaced quantiles; map each quantile back to a token index via CDF inversion, always retaining the class token.
- Downsampling: Prune the token set to the at most $K$ selected tokens and forward only this subset through add&norm, the FFN, and subsequent layers.
The ATS mechanism is strictly parameter-free, plug-and-play, and differentiable, permitting both post-hoc acceleration and optional end-to-end fine-tuning; gradients flow through the soft token selection and through the attention and value terms that define the scores (Fayyaz et al., 2021). Its main hyperparameter, the token budget $K$, enforces an upper bound on retained tokens per layer and can be set to match a particular compute budget (see the sketch below).
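A minimal sketch of the scoring-and-sampling step, assuming a single attention head and NumPy inputs; the function name, normalization, and quantile spacing are illustrative choices rather than the reference implementation.

```python
import numpy as np

def ats_select_tokens(cls_attention, value_norms, max_tokens):
    """Sketch of ATS significance scoring + inverse-transform sampling
    (after Fayyaz et al., 2021). cls_attention: attention weights from the
    class token to the N non-class tokens; value_norms: L2 norms of their
    value vectors; max_tokens: the per-layer budget K."""
    # Significance score: class-token attention weighted by value-vector norm.
    scores = cls_attention * value_norms
    scores = scores / scores.sum()

    # Cumulative distribution over token significance.
    cdf = np.cumsum(scores)

    # Evenly spaced quantiles in (0, 1); invert the CDF to get token indices.
    quantiles = (np.arange(1, max_tokens + 1) - 0.5) / max_tokens
    picked = np.searchsorted(cdf, quantiles)
    picked = np.minimum(picked, scores.shape[0] - 1)

    # Collisions collapse to a single index, so at most K unique tokens survive;
    # the class token is always kept separately by the caller.
    return np.unique(picked)
```

Because selections are spread along the CDF rather than concentrated on the top scorers, low- but non-zero-significance regions still receive occasional coverage, which is the behaviour the ablations above credit for quantile sampling outperforming Top-K.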
Experimental results on DeiT-S and CvT-13 on ImageNet-1K, and TimeSformer/X-ViT on video benchmarks, demonstrate reductions of 30–50% in GFLOPs with negligible top-1 accuracy loss. Ablations indicate that quantile-based sampling outperforms Top-K selection, and multi-stage ATS further improves performance under a fixed compute budget.
| Model | Params | GFLOPs | Top-1 Acc. (%) |
|---|---|---|---|
| DeiT-S (baseline) | 22M | 4.6 | 79.8 |
| DeiT-S + ATS | 22M | 2.9 | 79.7 |
Strengths and limitations: ATS is most effective in settings with spatially concentrated class-token attention and can underperform if token significance is diffuse or the budget $K$ is set excessively low, risking starvation of the FFN or feature collapse.
3. ATS in Masked Diffusion and Generative Models
In the context of masked diffusion, ATS generalizes token unmasking schedules by formalizing and extending mechanisms from MaskGIT. The process is governed by choose-then-sample (CTS) methodology and augmented by temperature scaling and hybrid exploration–exploitation policies (Hayakawa et al., 6 Oct 2025):
- Moment Sampler: At each diffusion step, the positions to unmask are chosen as those with the largest Gumbel-perturbed per-position scores, and tokens at those positions are then sampled from the model's per-position marginals; a temperature parameter controls sharpening, with a particular setting recovering MaskGIT's implicit temperature (a minimal sketch of one such step follows below).
- CTS Unbiasedness: When sampling is performed without temperature sharpening and positions are unmasked one at a time, the output distribution matches the data distribution, providing a tool to control sampling bias.
- Hybrid Order Scheduling: At each step, positions to unmask are chosen by merging exploitation (descending per-position certainty) and exploration (spatial or low-discrepancy dispersion); this ensures both model confidence and sampling diversity.
Partial caching strategies amortize transformer key/value computation to accelerate forward passes during token unmasking.
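A minimal sketch of one choose-then-sample unmasking step under these ideas; the confidence score (maximum marginal probability), the exploration rule (a random dispersed ordering), and the mixing fraction are illustrative assumptions, not the exact forms from Hayakawa et al.

```python
import numpy as np

def cts_unmask_step(marginals, masked_idx, n_unmask, explore_frac=0.25, rng=None):
    """Illustrative choose-then-sample (CTS) step for a masked generative model.
    marginals: (N, V) per-position token probabilities from the model;
    masked_idx: array of still-masked positions; n_unmask: positions to reveal."""
    rng = rng or np.random.default_rng()
    probs = marginals[masked_idx]                                   # (M, V)

    # Exploitation: Gumbel-perturbed per-position confidence (here: max marginal).
    confidence = np.log(probs.max(axis=1) + 1e-12)
    gumbel = -np.log(-np.log(rng.uniform(1e-12, 1.0, size=confidence.shape)))
    exploit_order = np.argsort(-(confidence + gumbel))

    # Exploration: a dispersed ordering; a random permutation stands in for the
    # spatial / low-discrepancy schedules described above.
    explore_order = rng.permutation(len(masked_idx))

    # Hybrid schedule: most positions chosen by confidence, a few for diversity.
    n_explore = int(explore_frac * n_unmask)
    merged = list(exploit_order[: n_unmask - n_explore]) + list(explore_order[:n_explore])
    chosen = list(dict.fromkeys(merged))[:n_unmask]

    # Sample each revealed token from its (renormalized) marginal distribution.
    tokens = [rng.choice(probs.shape[1], p=probs[c] / probs[c].sum()) for c in chosen]
    return masked_idx[np.array(chosen)], np.array(tokens)
```

Repeating this step until no masked positions remain yields a full sample; setting explore_frac to zero degenerates to a purely confidence-ordered (MaskGIT-style) schedule.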
Empirical results show that ATS under Moment+Cache+Hybrid orderings matches MaskGIT's FID with fewer steps (8–16 on ImageNet), and that hybrid or unbiased CTS policies achieve Pareto-optimal trade-offs between diversity and quality in both images and text. ATS is domain-agnostic and retraining-free; all that is required is access to the model's output marginals.
| Sampler | Image FID | Text Perplexity | Speedup |
|---|---|---|---|
| MaskGIT | 4–5 | 36 | Baseline |
| ATS (Hybrid/Cache) | 4–5 | 35 | 1.2–2 |
Limitations: Temperature-biased CTS can reduce diversity; caching-induced approximations may affect quality when caching spans many unmasking steps. The balance of exploration versus exploitation is tunable but application-dependent.
4. ATS for Efficient Event-based and Spiking Neural Architectures
AT-SNN adapts ATS for SNN-based ViT architectures, targeting minimization of energy by dynamically reducing token count in space, time, and depth (Kang et al., 22 Aug 2024). The method integrates:
- 2D Adaptive Computation Time (ACT): Per-token, per-block, per-timestep halting, in which a halting score is computed for each token (derived from the token's first dimension) and the token is halted early once its score, accumulated over blocks and timesteps, exceeds a threshold $1-\epsilon$ (sketched after this list).
- Token-Merge Mechanism: At each block, pairs of tokens with maximum cosine similarity are merged, reducing the token count while maintaining SNN sparsity.
- Energy-Constrained Loss: A ponder loss term regularizes the expected sum of tokens across blocks/timesteps.
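A rough sketch of the first two mechanisms, under stated assumptions: halting scores are taken as already computed per token, the $1-\epsilon$ threshold and joint block/timestep accumulation order are illustrative, and the greedy cosine-similarity pairing stands in for the paper's merge rule.

```python
import numpy as np

def act_active_mask(halting_scores, eps=0.01):
    """Illustrative 2D-ACT halting: halting_scores has shape (T, L, N) for
    timesteps, blocks, and tokens; a token stops receiving compute once its
    score, accumulated over both timesteps and blocks, exceeds 1 - eps."""
    T, L, N = halting_scores.shape
    cum = np.cumsum(halting_scores.reshape(T * L, N), axis=0)
    return (cum < 1.0 - eps).reshape(T, L, N)   # True where a token is still active

def merge_most_similar(tokens, n_pairs):
    """Illustrative token merge: greedily average the n_pairs most
    cosine-similar token pairs, leaving the remaining tokens untouched."""
    x = tokens / (np.linalg.norm(tokens, axis=1, keepdims=True) + 1e-8)
    sim = x @ x.T
    np.fill_diagonal(sim, -np.inf)
    order = zip(*np.unravel_index(np.argsort(-sim, axis=None), sim.shape))
    merged, used = [], set()
    for i, j in order:
        if len(merged) >= n_pairs:
            break
        if i in used or j in used:
            continue
        used.update((int(i), int(j)))
        merged.append((tokens[i] + tokens[j]) / 2.0)
    kept = [tokens[k] for k in range(len(tokens)) if k not in used]
    return np.vstack(kept + merged)
```

In a full AT-SNN pipeline the active mask would gate which tokens enter each block at each timestep, and the expected number of active tokens would be regularized by the ponder-style loss described above.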
AT-SNN empirically yields up to 42.4% token reduction and 25–38% energy savings across CIFAR-10, CIFAR-100, and TinyImageNet, sometimes improving or matching classification accuracy. Accumulating halting over both blocks and timesteps improves performance beyond block-only approaches, and temporally-aware merging provides further robustness.
| Dataset | Baseline Acc. (%) | AT-SNN Acc. (%) | Token fraction/block | Energy Saved (%) |
|---|---|---|---|---|
| CIFAR-10 | 94.88 | 95.06 | 0.28 | 38 |
| CIFAR-100 | 77.42 | 78.14 | 0.75 | 25 |
5. Adaptive Token Sampling for Masked Video Modeling
In masked video modeling, the Trajectory-Aware Adaptive Token Sampler (TATS) employs reinforcement learning to select visible tokens based on estimated token-wise motion saliency (Rai et al., 13 May 2025):
- Motion-centric State Encoding: Video is patched, projected into tokens, and local trajectories are encoded via specialized Trajectory Attention (TA), producing motion-aware token embeddings.
- Policy and Value Networks: An actor–critic architecture outputs a categorical sampling policy over tokens and value estimates, trained via Proximal Policy Optimization (PPO). The sampling probability distribution is used to stochastically select a sparse token subset per video (a minimal selection sketch follows this list).
- Unified Optimization: Training alternates between standard MAE reconstruction minimization (updating encoder/decoder) and policy improvement (updating the sampling distribution based on reconstruction loss advantages).
- Aggressive Masking: TATS enables operation at mask ratios up to 95%, focusing compute on high-motion trajectories while reducing memory footprint and maintaining or improving downstream action recognition performance.
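A minimal sketch of the stochastic visible-token selection step, assuming the policy network has already produced one logit per video token; the Gumbel-top-k subset draw and the independent-draw log-probability are simplifying assumptions for illustration, not the paper's exact PPO machinery.

```python
import numpy as np

def sample_visible_tokens(policy_logits, num_visible, rng=None):
    """Illustrative TATS-style visible-token selection: draw a sparse subset of
    video tokens from a categorical policy (e.g., 5% of tokens at a 95% mask
    ratio) and return an approximate log-probability for the policy update."""
    rng = rng or np.random.default_rng()
    logits = policy_logits - policy_logits.max()
    log_probs = logits - np.log(np.exp(logits).sum())

    # Gumbel-top-k: perturb log-probs and keep the k largest, which samples a
    # subset without replacement from the categorical policy.
    gumbel = -np.log(-np.log(rng.uniform(1e-12, 1.0, size=log_probs.shape)))
    visible_idx = np.argsort(-(log_probs + gumbel))[:num_visible]

    # Approximate subset log-probability (treats selections as independent);
    # in an actor-critic setup this term would enter the PPO objective.
    subset_log_prob = log_probs[visible_idx].sum()
    return visible_idx, subset_log_prob
```

The selected indices determine which tokens the MAE encoder sees; the reconstruction loss on the masked remainder then supplies the reward signal that drives the policy toward motion-salient trajectories.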
Empirically, TATS consistently outperforms random masking and conventional content-aware masking at high mask ratios across classification and transfer benchmarks.
| Dataset | VideoMAE Top-1 | AdaMAE Top-1 | TATS Top-1 |
|---|---|---|---|
| UCF101 | 65.86 | 80.83 | 81.75 |
| HMDB51 | 33.98 | 37.70 | 38.67 |
| Kinetics | 39.73 | 39.42 | 41.70 |
6. Comparative Summary and Extensions
ATS has evolved across domains with several commonalities and domain-specific adaptations. In all scenarios, token importance heuristics (attention-derived, entropy/variance, or motion) drive adaptive selection, with varying degrees of differentiability and autonomy.
| Domain | Scoring/Selection | Fine-tuning | Acceleration | Notable Limitation |
|---|---|---|---|---|
| Vision Transformers (Fayyaz et al., 2021) | Attention × Value norm, quantile CDF | Optional | Plug-in | Over-pruning at low $K$ |
| Masked Diffusion (Hayakawa et al., 6 Oct 2025) | CTS hybrid (moment + dispersion) | N/A | Post-hoc | Temperature-induced sampling bias |
| SNN-ViT (Kang et al., 22 Aug 2024) | 2D-ACT, first-dim. of token, merging | Joint | Learned | Halting granularity |
| Masked Video Modeling (Rai et al., 13 May 2025) | RL with trajectory-based attention | Joint (MAE) | RL-driven | RL stability, detachment artifacts |
Possible future extensions include per-ROI ATS for object detection, temporal masking for event-based data, and fully automated exploration-exploitation balancing. In all cases, ATS provides a general, effective mechanism for compute–efficiency trade-offs in token-based architectures.