
Adaptive Token Sampler (ATS)

Updated 22 November 2025
  • Adaptive Token Sampler is a dynamic mechanism that adjusts token selection based on data-dependent significance, reducing computational cost in token-based models.
  • It employs techniques such as quantile-based sampling, reinforcement learning, and adaptive computation time, achieving significant GFLOPs reductions with minimal accuracy loss.
  • ATS is versatile and plug-and-play, enabling integration with pre-trained models across diverse domains including vision, diffusion, spiking neural networks, and video modeling.

Adaptive Token Sampler (ATS) is a class of modules and algorithms designed to selectively sample, merge, or prune tokens in token-based architectures—including vision transformers (ViTs), masked diffusion models, spiking neural networks (SNNs), and masked video modeling—so as to reduce computational complexity without degrading accuracy. ATS dynamically adjusts the computational or sampling budget per input, layer, or iteration, based on data-dependent token significance metrics, hybrid exploration-exploitation scheduling, or motion priors, and is implemented via differentiable or algorithmic mechanisms that can be deployed without retraining or as part of end-to-end optimizations.

1. Core Principles and Motivations

The Adaptive Token Sampler paradigm addresses the inefficiency of static token processing. In models such as ViT, masked diffusion, and SNN-based transformers, computation or generation cost is proportional to the number of tokens. Not all tokens contribute equally to downstream tasks: many encode redundant, background, or low-information content. ATS aims to allocate compute or sampling to the most critical tokens on a per-instance basis, enabling:

  • Dynamic reduction of computational cost (e.g., GFLOPs, energy)
  • Maintenance or improvement of accuracy and diversity
  • Flexibility across domains—vision, text, diffusion, event-based sensing
  • Compatibility with pre-trained models (parameter-free insertion or minimal fine-tuning)

Underlying methods for ATS include deterministic token scoring and sampling (Fayyaz et al., 2021), probabilistically adaptive masking/sampling (Hayakawa et al., 6 Oct 2025, Rai et al., 13 May 2025), and reinforcement learning-based masking (Rai et al., 13 May 2025).

2. ATS in Vision Transformers

The prototypical ATS for ViT is a parameter-free, differentiable module inserted post-attention but pre-FFN within the transformer block. Its operation is as follows (Fayyaz et al., 2021):

  • Significance Scoring: For each non-class token $j$, compute the importance $S_j = \frac{A_{1,j}\,\|\mathbf{V}_j\|_2}{\sum_{i=2}^{N+1} A_{1,i}\,\|\mathbf{V}_i\|_2}$, where $A_{1,j}$ is taken from the class-token row of the attention matrix.
  • Quantile-based Inverse-Transform Sampling: Construct the CDF of $\{S_j\}$ and select up to $K$ tokens at the evenly spaced quantiles $u_k = (2k-1)/(2K)$; map each $u_k$ to a token index via CDF inversion, always retaining the class token.
  • Downsampling: Reduce the full attention output $O_{\text{full}} = A V$ to $O_{\text{reduced}} = A^{s} V$, where $A^{s}$ keeps only the rows of $A$ corresponding to sampled tokens, and forward only this subset through add & norm, the FFN, and subsequent layers.

The ATS mechanism is strictly parameter-free, plug-and-play, and differentiable, permitting both post-hoc acceleration and optional end-to-end fine-tuning; gradients propagate through the soft assignment or via $A_{1,j}$ and $\|\mathbf{V}_j\|_2$ (Fayyaz et al., 2021). Its main hyperparameter, $K$, enforces an upper bound on the number of retained tokens per layer and can be set to match a particular compute budget.
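
The following PyTorch sketch illustrates the scoring and inverse-transform sampling steps above. It is a minimal sketch, not the reference implementation: the function name `ats_token_selection`, the tensor layout, and the assumption of a single head-averaged attention matrix are illustrative choices.

```python
import torch

def ats_token_selection(attn, v, K):
    """Sketch of ATS significance scoring + inverse-transform sampling.

    attn: (B, N+1, N+1) post-softmax attention, row/column 0 = class token
    v:    (B, N+1, D) value vectors
    K:    upper bound on retained non-class tokens
    Returns (B, K+1) token indices (possibly with duplicates), class token first.
    """
    # Significance: S_j proportional to A_{1,j} * ||V_j||_2, normalized over non-class tokens
    a_cls = attn[:, 0, 1:]                                    # (B, N)
    v_norm = v[:, 1:, :].norm(dim=-1)                         # (B, N)
    scores = a_cls * v_norm
    scores = scores / scores.sum(dim=-1, keepdim=True)

    # Inverse-transform sampling at evenly spaced quantiles u_k = (2k-1)/(2K)
    cdf = scores.cumsum(dim=-1)                               # (B, N)
    u = (2 * torch.arange(K, device=cdf.device, dtype=cdf.dtype) + 1) / (2 * K)
    idx = torch.searchsorted(cdf, u.expand(cdf.size(0), K).contiguous())
    idx = idx.clamp(max=cdf.size(-1) - 1) + 1                 # shift past the class token

    # Always retain the class token at position 0; duplicate indices shrink the kept set
    cls = torch.zeros_like(idx[:, :1])
    return torch.cat([cls, idx], dim=1)
```

In an actual block, the returned indices would be used to gather the corresponding rows of the attention matrix and the tokens that are forwarded through add & norm and the FFN.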

Experimental results on DeiT-S and CvT-13 on ImageNet-1K, and TimeSformer/X-ViT on video benchmarks, demonstrate reductions of 30–50% in GFLOPs at less than 0.3% top-1 accuracy loss. Ablations indicate that quantile-based sampling outperforms Top-K, and multi-stage ATS further improves performance under fixed compute.

Model | Params | GFLOPs | Top-1 Acc. (%)
DeiT-S (baseline) | 22M | 4.6 | 79.8
DeiT-S + ATS | 22M | 2.9 | 79.7

Strengths and limitations: ATS is most effective in settings with spatially concentrated class-token attention and can underperform if token significance is diffuse or $K$ is set excessively low, risking starvation of the FFN or feature collapse.

3. ATS in Masked Diffusion and Generative Models

In the context of masked diffusion, ATS generalizes token unmasking schedules by formalizing and extending mechanisms from MaskGIT. The process is governed by a choose-then-sample (CTS) methodology, augmented by temperature scaling and hybrid exploration–exploitation policies (Hayakawa et al., 6 Oct 2025):

  • Moment Sampler: At each diffusion step, positions to unmask are selected by the largest $\log \|\mathbf{p}_i\|_{\beta}^{\beta} + \eta_i$ (with Gumbel noise $\eta_i$), then tokens are sampled from $p_i^{\gamma}/\|\mathbf{p}_i\|_{\gamma}^{\gamma}$, with $\gamma = \beta = 1 + 1/\alpha$, where $\alpha$ is MaskGIT's implicit temperature (a code sketch follows this list).
  • CTS Unbiasedness: When $\gamma = 1$ and one-by-one unmasking is used, the output distribution matches the data distribution, providing a tool to control sampling bias.
  • Hybrid Order Scheduling: At each step, positions to unmask are chosen by merging exploitation (descending per-position certainty) and exploration (spatial or low-discrepancy dispersion); this ensures both model confidence and sampling diversity.
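
A minimal PyTorch sketch of one Moment-Sampler step under the choose-then-sample scheme is given below; the function name `moment_sampler_step`, the tensor shapes, and the numerical clamping are assumptions made for illustration, and hybrid ordering and caching are omitted.

```python
import torch

def moment_sampler_step(probs, masked, n_unmask, alpha=1.0):
    """One choose-then-sample (CTS) Moment-Sampler step (sketch).

    probs:    (L, V) per-position output marginals of the masked diffusion model
    masked:   (L,) bool, True where a position is still masked
    n_unmask: number of positions to unmask at this step
    alpha:    MaskGIT-style implicit temperature; gamma = beta = 1 + 1/alpha
    """
    beta = gamma = 1.0 + 1.0 / alpha

    # Choose: rank masked positions by log ||p_i||_beta^beta plus Gumbel noise
    moment = (probs.clamp_min(1e-12) ** beta).sum(dim=-1).log()
    gumbel = -torch.log(-torch.log(torch.rand_like(moment).clamp_min(1e-12)))
    score = torch.where(masked, moment + gumbel, torch.full_like(moment, float("-inf")))
    chosen = score.topk(n_unmask).indices

    # Sample: draw tokens from the sharpened distribution p_i^gamma / ||p_i||_gamma^gamma
    sharpened = probs[chosen] ** gamma
    sharpened = sharpened / sharpened.sum(dim=-1, keepdim=True)
    tokens = torch.multinomial(sharpened, num_samples=1).squeeze(-1)
    return chosen, tokens
```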

Partial caching strategies amortize transformer key/value computation to accelerate forward passes during token unmasking.

Empirical results show that ATS under Moment+Cache+Hybrid orderings matches MaskGIT's FID ($\approx$4–5) with fewer steps (8–16 on ImageNet), and that hybrid or unbiased CTS policies achieve Pareto-optimal trade-offs between diversity and quality in both images and text. ATS is domain-agnostic and retraining-free; all that is required is access to the model's output marginals.

Sampler | Image FID | Text Perplexity | Speedup
MaskGIT | ≈4–5 | ≈36 | Baseline
ATS (Hybrid/Cache) | ≈4–5 | ≈35 | 1.2–2×

Limitations: Temperature-biased CTS can reduce diversity; caching-induced approximations may affect quality for large $|B_n|$. The balance of exploration versus exploitation is tunable but application-dependent.

4. ATS for Efficient Event-based and Spiking Neural Architectures

AT-SNN adapts ATS for SNN-based ViT architectures, targeting minimization of energy by dynamically reducing token count in space, time, and depth (Kang et al., 22 Aug 2024). The method integrates:

  • 2D Adaptive Computation Time (ACT): Per-token, per-block, per-timestep halting, with halting score $h^{l,t}_k = \sigma(\alpha\,(T^{l,t}_{k,1}/N T^{l,t}_k) + \beta)$ and a cumulative sum that triggers early halting once $H_k(l,t) \geq 1-\epsilon$ (see the sketch after this list).
  • Token-Merge Mechanism: At each block, the $\gamma$ pairs of tokens with maximum cosine similarity are merged, reducing the token count while maintaining SNN sparsity.
  • Energy-Constrained Loss: A ponder loss term regularizes the expected sum of tokens across blocks/timesteps.
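
A rough PyTorch sketch of the cumulative halting update is shown below. The function name `update_halting`, the use of the first feature channel over the token sum as the halting signal, and the handling of the active mask are illustrative assumptions rather than the exact AT-SNN implementation; the ponder loss would additionally penalize the expected number of surviving tokens across blocks and timesteps.

```python
import torch

def update_halting(H, token_feats, active, alpha=1.0, beta=0.0, eps=0.01):
    """2D-ACT-style halting update applied per block and per timestep (sketch).

    H:           (B, N) cumulative halting score carried across blocks/timesteps
    token_feats: (B, N, D) current (spiking) token features
    active:      (B, N) bool mask of tokens that have not yet halted
    Returns the updated score and the mask of tokens still active afterwards.
    """
    # Halting signal: first feature channel normalized by the token sum (assumed form)
    signal = token_feats[..., 0] / token_feats.sum(dim=-1).clamp_min(1e-6)
    h = torch.sigmoid(alpha * signal + beta)      # h_k^{l,t}
    H = H + h * active                            # only still-active tokens accumulate
    halted = H >= 1.0 - eps                       # halted tokens are skipped downstream
    return H, active & ~halted
```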

AT-SNN empirically yields up to 42.4% token reduction and approximately 25–38% energy savings across CIFAR-10, CIFAR-100, and TinyImageNet, sometimes improving or matching classification accuracy. Accumulating halting over both blocks and timesteps improves performance beyond block-only approaches, and temporally aware merging provides further robustness.

Dataset | Baseline Acc. (%) | AT-SNN Acc. (%) | Tokens/block | Energy Saved (%)
CIFAR-10 | 94.88 | 95.06 | 0.28× | 38
CIFAR-100 | 77.42 | 78.14 | 0.75× | 25

5. Adaptive Token Sampling for Masked Video Modeling

In masked video modeling, the Trajectory-Aware Adaptive Token Sampler (TATS) employs reinforcement learning to select visible tokens based on estimated token-wise motion saliency (Rai et al., 13 May 2025):

  • Motion-centric State Encoding: Video is patched, projected into tokens, and local trajectories are encoded via specialized Trajectory Attention (TA), producing motion-aware token embeddings.
  • Policy and Value Networks: An actor–critic architecture outputs a categorical sampling policy over tokens and value estimates, trained via Proximal Policy Optimization (PPO). The sampling distribution is used to stochastically select a sparse token subset per video (a minimal sketch follows this list).
  • Unified Optimization: Training alternates between standard MAE reconstruction minimization (updating encoder/decoder) and policy improvement (updating the sampling distribution based on reconstruction loss advantages).
  • Aggressive Masking: TATS enables operation at mask ratios up to 95%, focusing compute on high-motion trajectories while reducing memory footprint and maintaining or improving downstream action recognition performance.
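
The stochastic selection step could look like the following PyTorch fragment; the function name `sample_visible_tokens`, the per-token logits, and the without-replacement multinomial draw are illustrative assumptions, and the PPO update and trajectory-attention encoder are omitted.

```python
import torch

def sample_visible_tokens(policy_logits, n_visible):
    """Stochastic selection of visible tokens from an actor policy (sketch).

    policy_logits: (B, N) per-token logits from the actor network
    n_visible:     tokens kept visible, e.g. ~5% of N for a 95% mask ratio
    Returns visible-token indices and the summed log-probability for the PPO objective.
    """
    probs = torch.softmax(policy_logits, dim=-1)
    idx = torch.multinomial(probs, n_visible, replacement=False)        # (B, n_visible)
    log_probs = torch.log_softmax(policy_logits, dim=-1).gather(1, idx)
    return idx, log_probs.sum(dim=-1)
```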

Empirically, TATS consistently outperforms random masking and conventional content-aware masking at high mask ratios across classification and transfer benchmarks.

Dataset | VideoMAE Top-1 (%) | AdaMAE Top-1 (%) | TATS Top-1 (%)
UCF101 | 65.86 | 80.83 | 81.75
HMDB51 | 33.98 | 37.70 | 38.67
Kinetics | 39.73 | 39.42 | 41.70

6. Comparative Summary and Extensions

ATS has evolved across domains with several commonalities and domain-specific adaptations. In all scenarios, token importance heuristics (attention-derived, entropy/variance, or motion) drive adaptive selection, with varying degrees of differentiability and autonomy.

Domain | Scoring/Selection | Fine-tuning | Acceleration | Notable Limitation
Vision Transformers (Fayyaz et al., 2021) | Attention × value norm, quantile CDF | Optional | Plug-in | Over-pruning at low $K$
Masked Diffusion (Hayakawa et al., 6 Oct 2025) | CTS hybrid (moment + dispersion) | N/A | Post-hoc | Temperature bias for $\gamma > 1$
SNN-ViT (Kang et al., 22 Aug 2024) | 2D-ACT, first token dimension, merging | Joint | Learned | Halting granularity
Masked Video Modeling (Rai et al., 13 May 2025) | RL with trajectory-based attention | Joint (MAE) | RL-driven | RL stability, detachment artifacts

Possible future extensions include per-ROI ATS for object detection, temporal masking for event-based data, and fully automated exploration-exploitation balancing. In all cases, ATS provides a general, effective mechanism for compute–efficiency trade-offs in token-based architectures.
