Adaptive Token Sampler (ATS)
- Adaptive Token Sampler is a dynamic mechanism that adjusts token selection based on data-dependent significance, reducing computational cost in token-based models.
- It employs techniques such as quantile-based sampling, reinforcement learning, and adaptive computation time, achieving significant GFLOPs reductions with minimal accuracy loss.
- ATS is versatile and plug-and-play, enabling integration with pre-trained models across diverse domains including vision, diffusion, spiking neural networks, and video modeling.
Adaptive Token Sampler (ATS) is a class of modules and algorithms designed to selectively sample, merge, or prune tokens in token-based architectures—including vision transformers (ViTs), masked diffusion models, spiking neural networks (SNNs), and masked video modeling—so as to reduce computational complexity without degrading accuracy. ATS dynamically adjusts the computational or sampling budget per input, layer, or iteration, based on data-dependent token significance metrics, hybrid exploration-exploitation scheduling, or motion priors, and is implemented via differentiable or algorithmic mechanisms that can be deployed without retraining or as part of end-to-end optimizations.
1. Core Principles and Motivations
The Adaptive Token Sampler paradigm addresses the inefficiency of static token processing. In models such as ViT, masked diffusion, and SNN-based transformers, computation or generation cost is proportional to the number of tokens. Not all tokens contribute equally to downstream tasks: many encode redundant, background, or low-information content. ATS aims to allocate compute or sampling to the most critical tokens on a per-instance basis, enabling:
- Dynamic reduction of computational cost (e.g., GFLOPs, energy)
- Maintenance or improvement of accuracy and diversity
- Flexibility across domains—vision, text, diffusion, event-based sensing
- Compatibility with pre-trained models (parameter-free insertion or minimal fine-tuning)
Underlying methods for ATS include deterministic token scoring and sampling (Fayyaz et al., 2021), probabilistically adaptive masking/sampling (Hayakawa et al., 6 Oct 2025, Rai et al., 13 May 2025), and reinforcement learning-based masking (Rai et al., 13 May 2025).
2. ATS in Vision Transformers
The prototypical ATS for ViT is a parameter-free, differentiable module inserted post-attention but pre-FFN within the transformer block. Its operation is as follows (Fayyaz et al., 2021):
- Significance Scoring: For each non-class token $j$, compute an importance score $S_j = \frac{A_{1,j}\,\lVert V_j\rVert}{\sum_i A_{1,i}\,\lVert V_i\rVert}$, where $A_{1,j}$ is taken from the class-token row of the attention matrix and $V_j$ is the token's value vector.
- Quantile-based Inverse-Transform Sampling: Construct the CDF of the scores $S$ and select up to $K$ tokens at evenly spaced quantiles; map each quantile back to a token index via CDF inversion, always retaining the class token.
- Downsampling: Prune the token set to the at most $K$ selected tokens and forward only this subset through add&norm, the FFN, and subsequent layers.
The ATS mechanism is strictly parameter-free, plug-and-play, and differentiable, permitting both post-hoc acceleration and optional end-to-end fine-tuning; gradients flow through the soft token selection and through the attention and value terms that define the scores (Fayyaz et al., 2021). Its main hyperparameter, the token budget $K$, enforces an upper bound on retained tokens per layer and can be set to match a particular compute budget (see the sketch below).
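A minimal sketch of the scoring-and-sampling step, assuming a single attention head and NumPy inputs; the function name, normalization, and quantile spacing are illustrative choices rather than the reference implementation.

```python
import numpy as np

def ats_select_tokens(cls_attention, value_norms, max_tokens):
    """Sketch of ATS significance scoring + inverse-transform sampling
    (after Fayyaz et al., 2021). cls_attention: attention weights from the
    class token to the N non-class tokens; value_norms: L2 norms of their
    value vectors; max_tokens: the per-layer budget K."""
    # Significance score: class-token attention weighted by value-vector norm.
    scores = cls_attention * value_norms
    scores = scores / scores.sum()

    # Cumulative distribution over token significance.
    cdf = np.cumsum(scores)

    # Evenly spaced quantiles in (0, 1); invert the CDF to get token indices.
    quantiles = (np.arange(1, max_tokens + 1) - 0.5) / max_tokens
    picked = np.searchsorted(cdf, quantiles)
    picked = np.minimum(picked, scores.shape[0] - 1)

    # Collisions collapse to a single index, so at most K unique tokens survive;
    # the class token is always kept separately by the caller.
    return np.unique(picked)
```

Because selections are spread along the CDF rather than concentrated on the top scorers, low- but non-zero-significance regions still receive occasional coverage, which is the behaviour the ablations above credit for quantile sampling outperforming Top-K.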
Experimental results on DeiT-S and CvT-13 on ImageNet-1K, and TimeSformer/X-ViT on video benchmarks, demonstrate reductions of 30–50% in GFLOPs with negligible top-1 accuracy loss. Ablations indicate that quantile-based sampling outperforms Top-K selection, and multi-stage ATS further improves performance under a fixed compute budget.
| Model | Params | GFLOPs | Top-1 Acc. (%) |
|---|---|---|---|
| DeiT-S (baseline) | 22M | 4.6 | 79.8 |
| DeiT-S + ATS | 22M | 2.9 | 79.7 |
Strengths and limitations: ATS is most effective in settings with spatially concentrated class-token attention and can underperform if token significance is diffuse or the budget $K$ is set excessively low, risking starvation of the FFN or feature collapse.
3. ATS in Masked Diffusion and Generative Models
In the context of masked diffusion, ATS generalizes token unmasking schedules by formalizing and extending mechanisms from MaskGIT. The process is governed by choose-then-sample (CTS) methodology and augmented by temperature scaling and hybrid exploration–exploitation policies (Hayakawa et al., 6 Oct 2025):
- Moment Sampler: At each diffusion step, the positions to unmask are chosen as those with the largest Gumbel-perturbed per-position scores, and tokens at those positions are then sampled from the model's per-position marginals; a temperature parameter controls sharpening, with a particular setting recovering MaskGIT's implicit temperature (a minimal sketch of one such step follows below).
- CTS Unbiasedness: When sampling is performed without temperature sharpening and positions are unmasked one at a time, the output distribution matches the data distribution, providing a tool to control sampling bias.
- Hybrid Order Scheduling: At each step, positions to unmask are chosen by merging exploitation (descending per-position certainty) and exploration (spatial or low-discrepancy dispersion); this ensures both model confidence and sampling diversity.
Partial caching strategies amortize transformer key/value computation to accelerate forward passes during token unmasking.
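A minimal sketch of one choose-then-sample unmasking step under these ideas; the confidence score (maximum marginal probability), the exploration rule (a random dispersed ordering), and the mixing fraction are illustrative assumptions, not the exact forms from Hayakawa et al.

```python
import numpy as np

def cts_unmask_step(marginals, masked_idx, n_unmask, explore_frac=0.25, rng=None):
    """Illustrative choose-then-sample (CTS) step for a masked generative model.
    marginals: (N, V) per-position token probabilities from the model;
    masked_idx: array of still-masked positions; n_unmask: positions to reveal."""
    rng = rng or np.random.default_rng()
    probs = marginals[masked_idx]                                   # (M, V)

    # Exploitation: Gumbel-perturbed per-position confidence (here: max marginal).
    confidence = np.log(probs.max(axis=1) + 1e-12)
    gumbel = -np.log(-np.log(rng.uniform(1e-12, 1.0, size=confidence.shape)))
    exploit_order = np.argsort(-(confidence + gumbel))

    # Exploration: a dispersed ordering; a random permutation stands in for the
    # spatial / low-discrepancy schedules described above.
    explore_order = rng.permutation(len(masked_idx))

    # Hybrid schedule: most positions chosen by confidence, a few for diversity.
    n_explore = int(explore_frac * n_unmask)
    merged = list(exploit_order[: n_unmask - n_explore]) + list(explore_order[:n_explore])
    chosen = list(dict.fromkeys(merged))[:n_unmask]

    # Sample each revealed token from its (renormalized) marginal distribution.
    tokens = [rng.choice(probs.shape[1], p=probs[c] / probs[c].sum()) for c in chosen]
    return masked_idx[np.array(chosen)], np.array(tokens)
```

Repeating this step until no masked positions remain yields a full sample; setting explore_frac to zero degenerates to a purely confidence-ordered (MaskGIT-style) schedule.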
Empirical results show that ATS under Moment+Cache+Hybrid orderings matches MaskGIT's FID with fewer steps (8–16 on ImageNet), and that hybrid or unbiased CTS policies achieve Pareto-optimal trade-offs between diversity and quality in both images and text. ATS is domain-agnostic and retraining-free; all that is required is access to the model's output marginals.
| Sampler | Image FID | Text Perplexity | Speedup |
|---|---|---|---|
| MaskGIT | 4–5 | 36 | Baseline |
| ATS (Hybrid/Cache) | 4–5 | 35 | 1.2–2 |
Limitations: Temperature-biased CTS can reduce diversity; caching-induced approximations may affect quality when caching spans many unmasking steps. The balance of exploration versus exploitation is tunable but application-dependent.
4. ATS for Efficient Event-based and Spiking Neural Architectures
AT-SNN adapts ATS for SNN-based ViT architectures, targeting minimization of energy by dynamically reducing token count in space, time, and depth (Kang et al., 22 Aug 2024). The method integrates:
- 2D Adaptive Computation Time (ACT): Per-token, per-block, per-timestep halting, in which a halting score is computed for each token (derived from the token's first dimension) and the token is halted early once its score, accumulated over blocks and timesteps, exceeds a threshold $1-\epsilon$ (sketched after this list).
- Token-Merge Mechanism: At each block, pairs of tokens with maximum cosine similarity are merged, reducing the token count while maintaining SNN sparsity.
- Energy-Constrained Loss: A ponder loss term regularizes the expected sum of tokens across blocks/timesteps.
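A rough sketch of the first two mechanisms, under stated assumptions: halting scores are taken as already computed per token, the $1-\epsilon$ threshold and joint block/timestep accumulation order are illustrative, and the greedy cosine-similarity pairing stands in for the paper's merge rule.

```python
import numpy as np

def act_active_mask(halting_scores, eps=0.01):
    """Illustrative 2D-ACT halting: halting_scores has shape (T, L, N) for
    timesteps, blocks, and tokens; a token stops receiving compute once its
    score, accumulated over both timesteps and blocks, exceeds 1 - eps."""
    T, L, N = halting_scores.shape
    cum = np.cumsum(halting_scores.reshape(T * L, N), axis=0)
    return (cum < 1.0 - eps).reshape(T, L, N)   # True where a token is still active

def merge_most_similar(tokens, n_pairs):
    """Illustrative token merge: greedily average the n_pairs most
    cosine-similar token pairs, leaving the remaining tokens untouched."""
    x = tokens / (np.linalg.norm(tokens, axis=1, keepdims=True) + 1e-8)
    sim = x @ x.T
    np.fill_diagonal(sim, -np.inf)
    order = zip(*np.unravel_index(np.argsort(-sim, axis=None), sim.shape))
    merged, used = [], set()
    for i, j in order:
        if len(merged) >= n_pairs:
            break
        if i in used or j in used:
            continue
        used.update((int(i), int(j)))
        merged.append((tokens[i] + tokens[j]) / 2.0)
    kept = [tokens[k] for k in range(len(tokens)) if k not in used]
    return np.vstack(kept + merged)
```

In a full AT-SNN pipeline the active mask would gate which tokens enter each block at each timestep, and the expected number of active tokens would be regularized by the ponder-style loss described above.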
AT-SNN empirically yields up to 42.4% token reduction and 25–38% energy savings across CIFAR-10, CIFAR-100, and TinyImageNet, sometimes improving or matching classification accuracy. Accumulating halting over both blocks and timesteps improves performance beyond block-only approaches, and temporally-aware merging provides further robustness.
| Dataset | Baseline Acc. (%) | AT-SNN Acc. (%) | Token fraction/block | Energy Saved (%) |
|---|---|---|---|---|
| CIFAR-10 | 94.88 | 95.06 | 0.28 | 38 |
| CIFAR-100 | 77.42 | 78.14 | 0.75 | 25 |
5. Adaptive Token Sampling for Masked Video Modeling
In masked video modeling, the Trajectory-Aware Adaptive Token Sampler (TATS) employs reinforcement learning to select visible tokens based on estimated token-wise motion saliency (Rai et al., 13 May 2025):
- Motion-centric State Encoding: Video is patched, projected into tokens, and local trajectories are encoded via specialized Trajectory Attention (TA), producing motion-aware token embeddings.
- Policy and Value Networks: An actor–critic architecture outputs a categorical sampling policy over tokens and value estimates, trained via Proximal Policy Optimization (PPO). The sampling probability distribution is used to stochastically select a sparse token subset per video (a minimal selection sketch follows this list).
- Unified Optimization: Training alternates between standard MAE reconstruction minimization (updating encoder/decoder) and policy improvement (updating the sampling distribution based on reconstruction loss advantages).
- Aggressive Masking: TATS enables operation at mask ratios up to 95%, focusing compute on high-motion trajectories while reducing memory footprint and maintaining or improving downstream action recognition performance.
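A minimal sketch of the stochastic visible-token selection step, assuming the policy network has already produced one logit per video token; the Gumbel-top-k subset draw and the independent-draw log-probability are simplifying assumptions for illustration, not the paper's exact PPO machinery.

```python
import numpy as np

def sample_visible_tokens(policy_logits, num_visible, rng=None):
    """Illustrative TATS-style visible-token selection: draw a sparse subset of
    video tokens from a categorical policy (e.g., 5% of tokens at a 95% mask
    ratio) and return an approximate log-probability for the policy update."""
    rng = rng or np.random.default_rng()
    logits = policy_logits - policy_logits.max()
    log_probs = logits - np.log(np.exp(logits).sum())

    # Gumbel-top-k: perturb log-probs and keep the k largest, which samples a
    # subset without replacement from the categorical policy.
    gumbel = -np.log(-np.log(rng.uniform(1e-12, 1.0, size=log_probs.shape)))
    visible_idx = np.argsort(-(log_probs + gumbel))[:num_visible]

    # Approximate subset log-probability (treats selections as independent);
    # in an actor-critic setup this term would enter the PPO objective.
    subset_log_prob = log_probs[visible_idx].sum()
    return visible_idx, subset_log_prob
```

The selected indices determine which tokens the MAE encoder sees; the reconstruction loss on the masked remainder then supplies the reward signal that drives the policy toward motion-salient trajectories.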
Empirically, TATS consistently outperforms random masking and conventional content-aware masking at high mask ratios across classification and transfer benchmarks.
| Dataset | VideoMAE Top-1 | AdaMAE Top-1 | TATS Top-1 |
|---|---|---|---|
| UCF101 | 65.86 | 80.83 | 81.75 |
| HMDB51 | 33.98 | 37.70 | 38.67 |
| Kinetics | 39.73 | 39.42 | 41.70 |
6. Comparative Summary and Extensions
ATS has evolved across domains with several commonalities and domain-specific adaptations. In all scenarios, token importance heuristics (attention-derived, entropy/variance, or motion) drive adaptive selection, with varying degrees of differentiability and autonomy.
| Domain | Scoring/Selection | Fine-tuning | Acceleration | Notable Limitation |
|---|---|---|---|---|
| Vision Transformers (Fayyaz et al., 2021) | Attention × Value norm, quantile CDF | Optional | Plug-in | Over-pruning at low $K$ |
| Masked Diffusion (Hayakawa et al., 6 Oct 2025) | CTS hybrid (moment + dispersion) | N/A | Post-hoc | Temperature-induced sampling bias |
| SNN-ViT (Kang et al., 22 Aug 2024) | 2D-ACT, first-dim. of token, merging | Joint | Learned | Halting granularity |
| Masked Video Modeling (Rai et al., 13 May 2025) | RL with trajectory-based attention | Joint (MAE) | RL-driven | RL stability, detachment artifacts |
Possible future extensions include per-ROI ATS for object detection, temporal masking for event-based data, and fully automated exploration-exploitation balancing. In all cases, ATS provides a general, effective mechanism for compute–efficiency trade-offs in token-based architectures.