Dynamic Token Mixers
- Dynamic token mixers are architectural modules that adaptively aggregate tokens based on input features, providing an efficient alternative to full self-attention.
- They employ methods like adaptive local aggregation, dynamic frequency filtering, and content-aware MLP mixing to improve scalability and performance.
- Empirical studies show these mixers achieve state-of-the-art results in tasks such as image segmentation, time series forecasting, and dynamic graph processing.
Dynamic token mixers are architectural modules that perform input-dependent, adaptive aggregation or transformation of sets of tokens (such as embeddings in sequences, graphs, or spatial grids), in place of, or as an alternative to, the quadratic-complexity attention mechanisms characteristic of Transformers. Unlike static mixers, which use fixed parameterizations independent of input content, dynamic token mixers condition their mixing (aggregation, filtering, or routing) on features of the current input, allowing per-instance adaptivity at sub-quadratic computational cost. Dynamic token mixers have been developed across modalities including graphs, images, and time series, and have yielded advances in both representational power and scalability.
1. Theoretical Foundations and Definitions
The prototypical token mixer aggregates information from a collection of tokens $X = (x_1, \dots, x_N) \in \mathbb{R}^{N \times d}$, with $N$ tokens of $d$ dimensions. Classic self-attention computes a quadratic affinity matrix conditioned on learned queries/keys, yielding per-query, per-instance mixing. Dynamic token mixers generalize this adaptivity principle but often replace or constrain the fully-pairwise attention pattern to improve computational or statistical efficiency.
Dynamic token mixers differ from static mixers (e.g., fixed convolutions, uniform pooling) by introducing one or more forms of adaptive selection or weighting, which may be based on token order, content, temporal intervals, spatial neighborhood, or frequency features. Formally, for token $i$, the output is

$$y_i = \sum_{j \in \mathcal{N}(i)} w_{ij}\, f(x_j),$$

where $w_{ij}$ and $f$ are functions of the input (and potentially position, time, or context), and $\mathcal{N}(i)$ denotes a (possibly input-dependent) aggregation set.
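The weighted aggregation over an aggregation set described above can be sketched in a few lines of NumPy. This is a minimal illustration, not any particular published mixer: the function name `dynamic_mix`, the dot-product scoring function, and the three-token neighbourhood are all assumptions chosen for the example. Passing a constant scoring function recovers uniform pooling, which is the static special case.

```python
import numpy as np

def dynamic_mix(x, neighbors, score_fn):
    """Mix tokens with weights computed from the input itself.

    x         : (N, d) array of tokens
    neighbors : neighbors[i] lists the indices in the aggregation set of token i
    score_fn  : hypothetical scoring function (x_i, x_j) -> float
    """
    out = np.zeros_like(x)
    for i, idx in enumerate(neighbors):
        scores = np.array([score_fn(x[i], x[j]) for j in idx])
        w = np.exp(scores - scores.max())
        w /= w.sum()                 # input-dependent weights that sum to 1
        out[i] = w @ x[idx]          # weighted sum over the aggregation set
    return out

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 4))
# Assumed aggregation set: each token plus its immediate left/right neighbours.
neighbors = [[j for j in (i - 1, i, i + 1) if 0 <= j < 6] for i in range(6)]

y_dynamic = dynamic_mix(x, neighbors, lambda a, b: float(a @ b))  # content-based weights
y_static = dynamic_mix(x, neighbors, lambda a, b: 0.0)            # uniform pooling
```

With the constant scorer every weight inside a neighbourhood is equal, so the output no longer depends on token content beyond the average itself.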
In recent work, dynamic token mixing has materialized via context-aware local aggregations (Zou et al., 16 Nov 2025), content-adaptive frequency-domain filters (Guibas et al., 2021, Tatsunami et al., 2023, Huang et al., 2023), learned dynamic spatial convolutions (Lou et al., 2023), dynamic routing or masking (Wu et al., 2 Oct 2025), and content-dependent MLP-mixing (Wang et al., 2022).
2. Principal Dynamic Token Mixer Designs
Several influential architectures illustrate distinct realizations of dynamic token mixing:
a. Adaptive Local Aggregation (GLFormer)
GLFormer replaces self-attention for dynamic graphs with an adaptive token mixer that combines learnable order-dependent weights and softmax-normalized, time-interval-based scores in a sliding window (Zou et al., 16 Nov 2025). For each position in a neighbor's sequence, the output is a weighted sum of the tokens inside the local window. The weights combine learned per-offset weights with recency scores that are computed from the time intervals and softmax-normalized; the formulation also includes a learned scalar. The mixer computes this weighted sum within a local, hierarchical neighborhood.
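A rough sketch of this kind of local, time-aware aggregation is given below. It is not the GLFormer formulation itself: the additive combination of offset weights and recency scores, the exponential-decay form of the recency logits, the role of `theta` as a decay rate, and the name `local_adaptive_mix` are all assumptions made for illustration.

```python
import numpy as np

def local_adaptive_mix(h, t, offset_w, theta):
    """h: (L, d) neighbour tokens in chronological order; t: (L,) timestamps;
    offset_w: (K,) per-offset weights; theta: scalar (ASSUMED to act as a decay rate)."""
    L, _ = h.shape
    K = len(offset_w)
    out = np.zeros_like(h)
    for i in range(L):
        ks = np.arange(min(K, i + 1))     # offsets 0..K-1 that stay inside the sequence
        dt = t[i] - t[i - ks]             # time intervals to the earlier tokens (>= 0)
        logits = -theta * dt              # more recent tokens get larger logits
        rec = np.exp(logits - logits.max())
        rec /= rec.sum()                  # softmax-normalised recency scores
        weights = offset_w[ks] + rec      # ASSUMED additive combination
        out[i] = weights @ h[i - ks]      # weighted sum inside the window
    return out

rng = np.random.default_rng(0)
h = rng.normal(size=(5, 3))
t = np.array([0.0, 1.0, 1.5, 4.0, 4.2])
out = local_adaptive_mix(h, t, offset_w=np.array([0.5, 0.3, 0.2]), theta=1.0)
```

The cost per position depends only on the window length `K`, not on the full sequence length, which is the source of the efficiency gain over pairwise attention.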
b. Frequency-Domain Dynamic Filtering
Dynamic token mixing can occur globally and efficiently in the frequency domain. Adaptive Fourier Neural Operators (AFNO) (Guibas et al., 2021) and FFT-based Dynamic Filters (Tatsunami et al., 2023, Huang et al., 2023) parameterize content-adaptive multipliers in the Fourier domain. For AFNO, a small MLP operates on frequency coefficients and applies soft-thresholding for mode selection, yielding input-adaptive, channel-specific gating. In FFT-based Dynamic Filters, per-input, per-channel mixture coefficients are derived via a small MLP and used to combine learnable filter bases, with filtering performed by FFT→multiplication→IFFT.
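The FFT→multiplication→IFFT pipeline with input-dependent mixture coefficients can be sketched as follows. This is a simplified 1-D illustration under stated assumptions, not the DFFormer or AFNO implementation: the filter bases are real-valued, the descriptor is a plain mean over tokens, and the two-layer MLP weights `w1` and `w2` are hypothetical parameters.

```python
import numpy as np

def dynamic_fft_filter(x, bases, w1, w2):
    """x: (N, C) tokens; bases: (B, N//2+1, C) learnable frequency filters;
    w1: (C, H) and w2: (H, B*C) form a small, hypothetical two-layer MLP."""
    N, C = x.shape
    B = bases.shape[0]
    pooled = x.mean(axis=0)                          # (C,) global descriptor of this input
    hidden = np.maximum(pooled @ w1, 0.0)            # ReLU hidden layer
    logits = (hidden @ w2).reshape(B, C)
    logits = logits - logits.max(axis=0, keepdims=True)
    coeff = np.exp(logits)
    coeff /= coeff.sum(axis=0, keepdims=True)        # per-channel softmax over the B bases
    filt = np.einsum("bc,bfc->fc", coeff, bases)     # input-dependent filter, (N//2+1, C)
    spectrum = np.fft.rfft(x, axis=0)                # FFT along the token axis
    return np.fft.irfft(spectrum * filt, n=N, axis=0)

rng = np.random.default_rng(0)
N, C, B, H = 8, 3, 4, 5
x = rng.normal(size=(N, C))
w1, w2 = rng.normal(size=(C, H)), rng.normal(size=(H, B * C))
bases = rng.normal(size=(B, N // 2 + 1, C))
y = dynamic_fft_filter(x, bases, w1, w2)
```

Because the mixture coefficients sum to one per channel, setting every basis to an all-pass filter of ones returns the input unchanged, which is a convenient sanity check.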
c. Dynamic Masking and Routing (Pure-Pass)
Pure-Pass introduces fine-grained, pixel-wise masking for super-resolution (Wu et al., 2 Oct 2025). Each pixel is classified (via fixed color centers in RGB) as "pure" or "hard"; the dynamic mask controls routing, ensuring expensive mixing is only performed on complex pixels. Compensation ensures consistency when bypassing certain mixing modules.
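The routing idea can be illustrated with a boolean mask that decides which pixels reach the expensive mixer. The sketch below is an assumption-laden stand-in rather than the Pure-Pass procedure: the nearest-centre distance rule, the tolerance `tol`, the two colour centres, and the functions `pure_mask` and `routed_mix` are hypothetical, and the identity `cheap_fn` merely takes the place of whatever compensation the bypass path applies.

```python
import numpy as np

def pure_mask(img, centers, tol):
    """True where a pixel lies within `tol` of its nearest colour centre (assumed rule)."""
    diff = img[:, :, None, :] - centers[None, None, :, :]     # (H, W, K, 3)
    return np.linalg.norm(diff, axis=-1).min(axis=-1) < tol   # (H, W)

def routed_mix(feat, mask, expensive_fn, cheap_fn):
    """Send 'pure' pixels through cheap_fn and the remaining 'hard' pixels through expensive_fn."""
    out = np.empty_like(feat)
    out[mask] = cheap_fn(feat[mask])
    out[~mask] = expensive_fn(feat[~mask])
    return out

rng = np.random.default_rng(0)
img = rng.uniform(0.2, 0.8, size=(4, 4, 3))    # textured region
img[:, :2] = 1.0                                # flat white region
centers = np.array([[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]])
mask = pure_mask(img, centers, tol=0.05)

mix_w = rng.normal(size=(3, 3))                 # stand-in for a costly token mixer
out = routed_mix(img, mask, expensive_fn=lambda z: z @ mix_w, cheap_fn=lambda z: z)
print(f"expensive path used on {(~mask).mean():.0%} of pixels")
```

The saving scales with the fraction of pixels classified as pure, so the benefit is largest on images with large flat regions.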
d. Dynamic Convolution and Dual Mixers (TransXNet)
TransXNet’s Dual Dynamic Token Mixer (D-Mixer) combines input-dependent global attention with input-dependent depthwise convolution (Lou et al., 2023). The attention branch operates after an overlapping spatial reduction; the convolution branch pools features and synthesizes per-input, per-channel convolutional kernels via grouped MLPs and softmax attention over parameter groups.
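The convolution branch's kernel synthesis can be sketched as attention over a bank of kernel groups. The code below is a simplified illustration and not the TransXNet implementation: it uses global average pooling and a single linear projection `proj` where the source describes grouped MLPs, and the names `dynamic_depthwise_conv` and `kernel_bank` are assumptions.

```python
import numpy as np

def dynamic_depthwise_conv(x, kernel_bank, proj):
    """x: (C, H, W) feature map; kernel_bank: (G, C, k, k) parameter groups;
    proj: (C, G*C) hypothetical projection that produces the attention logits."""
    C, H, W = x.shape
    G, _, k, _ = kernel_bank.shape
    pooled = x.mean(axis=(1, 2))                              # (C,) input descriptor
    logits = (pooled @ proj).reshape(G, C)
    logits = logits - logits.max(axis=0, keepdims=True)
    attn = np.exp(logits)
    attn /= attn.sum(axis=0, keepdims=True)                   # softmax over the G groups
    kernels = np.einsum("gc,gcij->cij", attn, kernel_bank)    # per-input, per-channel kernels
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))          # zero padding keeps H x W
    out = np.zeros_like(x)
    for i in range(k):                                        # depthwise cross-correlation
        for j in range(k):
            out += kernels[:, i, j][:, None, None] * xp[:, i:i + H, j:j + W]
    return out

rng = np.random.default_rng(0)
C, H, W, G, k = 4, 6, 6, 3, 3
x = rng.normal(size=(C, H, W))
bank = rng.normal(size=(G, C, k, k))
proj = rng.normal(size=(C, G * C))
y = dynamic_depthwise_conv(x, bank, proj)
```

Only the small bank and projection are learned; the kernel actually applied is recomputed for every input, which is what makes the convolution dynamic.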
e. Content-Aware MLP Mixing
DynaMixer dynamically generates mixing matrices from the flattened token features, using a dimensionality-reduced projection followed by multi-segment fusion, and applies row- and column-wise content-dependent mixing to reduce computational cost (Wang et al., 2022).
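A minimal sketch of content-dependent matrix generation along one axis is shown below. It omits multi-segment fusion and the separate row and column passes, and the names `dynamixer_op`, `w_reduce`, and `w_gen` are assumptions rather than identifiers from the DynaMixer code.

```python
import numpy as np

def dynamixer_op(x, w_reduce, w_gen):
    """x: (N, D) tokens; w_reduce: (D, r) with r << D; w_gen: (N*r, N*N).
    Returns the mixed tokens and the generated (N, N) mixing matrix."""
    N, _ = x.shape
    reduced = x @ w_reduce                       # (N, r) dimensionality reduction
    flat = reduced.reshape(-1)                   # flatten all tokens into one vector
    logits = (flat @ w_gen).reshape(N, N)        # one logit per (output, input) token pair
    logits = logits - logits.max(axis=1, keepdims=True)
    P = np.exp(logits)
    P /= P.sum(axis=1, keepdims=True)            # row-wise softmax
    return P @ x, P

rng = np.random.default_rng(0)
N, D, r = 5, 8, 2
x = rng.normal(size=(N, D))
w_reduce = rng.normal(size=(D, r))
w_gen = rng.normal(size=(N * r, N * N))
y, P = dynamixer_op(x, w_reduce, w_gen)
```

The reduction to `r` dimensions is what keeps the generator small: its weight matrix has `N*r*N*N` entries instead of `N*D*N*N`.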
f. Active Token Selection (ATMNet)
ATMNet predicts, for each query token and channel, a spatial offset from which to sample and then fuses these signals using content-dependent, channel-wise gating, enabling global receptive field mixing with per-channel adaptivity (Wei et al., 2022).
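The per-channel offset sampling can be illustrated in one dimension. This sketch departs from the published design in several labeled ways: it works on a 1-D token sequence rather than along the two image axes, it rounds offsets to integers, and it fuses the sampled signal with the original token through a sigmoid gate. The names `atm_1d`, `w_off`, `w_gate`, and `max_off` are assumptions.

```python
import numpy as np

def atm_1d(x, w_off, w_gate, max_off):
    """x: (N, C) tokens; w_off, w_gate: (C, C) hypothetical projections;
    max_off: largest offset magnitude, in tokens."""
    N, C = x.shape
    off = np.rint(max_off * np.tanh(x @ w_off)).astype(int)   # (N, C) offset per token and channel
    src = np.clip(np.arange(N)[:, None] + off, 0, N - 1)      # source position for each channel
    sampled = x[src, np.arange(C)[None, :]]                   # gather channel c from token src[i, c]
    gate = 1.0 / (1.0 + np.exp(-((x + sampled) @ w_gate)))    # content-dependent channel gate
    return gate * sampled + (1.0 - gate) * x

rng = np.random.default_rng(0)
N, C = 10, 4
x = rng.normal(size=(N, C))
w_off = rng.normal(size=(C, C))
w_gate = rng.normal(size=(C, C))
y = atm_1d(x, w_off, w_gate, max_off=3)
```

Each channel of a token may read from a different position, so the effective receptive field is set by `max_off` rather than by a fixed kernel size.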
g. Factorized Mixing for Time Series
MTS-Mixers employ factorized dynamic mixing modules along temporal and channel axes, with learned low-rank MLPs enabling global, content-aware fusion without per-sequence quadratic cost (Li et al., 2023).
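The cost argument for factorized mixing can be made concrete with low-rank factors along each axis. The sketch below is an assumption-based illustration, not the MTS-Mixers architecture: it uses plain linear rank-`r` and rank-`m` factors with residual connections, and the names `factorized_mix`, `u_t`, `v_t`, `a_c`, and `b_c` are hypothetical.

```python
import numpy as np

def factorized_mix(x, u_t, v_t, a_c, b_c):
    """x: (T, C) series. u_t: (T, r), v_t: (r, T) factor a temporal mixing matrix;
    a_c: (C, m), b_c: (m, C) factor a channel mixing matrix."""
    x = x + u_t @ (v_t @ x)      # temporal mixing in O(T*r*C) instead of O(T*T*C)
    x = x + (x @ a_c) @ b_c      # channel mixing in O(T*C*m) instead of O(T*C*C)
    return x

rng = np.random.default_rng(0)
T, C, r, m = 96, 7, 4, 2
x = rng.normal(size=(T, C))
u_t, v_t = rng.normal(size=(T, r)), rng.normal(size=(r, T))
a_c, b_c = rng.normal(size=(C, m)), rng.normal(size=(m, C))
y = factorized_mix(x, u_t, v_t, a_c, b_c)

full_temporal_params = T * T          # parameters of a dense T x T mixing matrix
low_rank_temporal_params = 2 * T * r  # parameters of the two rank-r factors
```

The factors are never multiplied out into a full `T x T` matrix, so both parameter count and compute grow linearly in `T` for a fixed rank.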
3. Computational Efficiency and Scalability
A central impetus for dynamic token mixers is to circumvent the $O(N^2)$ cost of full self-attention, especially when the token count $N$ is large (e.g., high-resolution images, long time series, large neighborhoods in graphs). The prevailing dynamic mixers achieve this via:
- Locality restrictions (GLFormer): Reducing aggregation to $O(N \cdot D)$ for $N$ tokens by constraining the context window to a receptive field of size $D$.
- Frequency-domain operations (AFNO, FFT-based): Using FFT/IFFT to effect $O(N \log N)$ global mixing.
- Dynamic masking/routing (Pure-Pass): Selectively applying expensive computation via fine-grained, data-dependent gating.
- Channel/axis factorization (DynaMixer, MTS-Mixers): Decomposing global mixing into axial or low-rank sub-problems.
- Per-input kernel synthesis (TransXNet): Producing depthwise dynamic kernels and fusing branches without attention matrices.
The table below summarizes reported complexities:
| Architecture | Dominant Mixer Complexity | Memory Scaling |
|---|---|---|
| Self-attention | $O(N^2)$ | $O(N^2)$ |
| GLFormer | $O(N \cdot D)$ ($D$: receptive field) | |
| AFNO | $O(N \log N)$ | |
| FFT-based Dynamic | $O(N \log N)$ | |
| DynaMixer | | |
| ATM | | |
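To give a sense of scale for the regimes discussed in this section, the following back-of-the-envelope script compares leading-order interaction counts for full attention, a fixed local window, and an FFT. Constant factors and channel width are ignored, and the window size of 16 is an arbitrary assumption, so the numbers indicate trends rather than measured FLOPs.

```python
import math

def mixing_cost(n_tokens, window):
    """Leading-order interaction counts per layer; constants and channel width are ignored."""
    return {
        "full_attention": n_tokens * n_tokens,
        "local_window": n_tokens * window,
        "fft": n_tokens * math.log2(n_tokens),
    }

for n in (1_024, 16_384):
    cost = mixing_cost(n, window=16)
    ratio = cost["full_attention"] / cost["fft"]
    print(f"N={n}: attention/FFT ratio ~ {ratio:.0f}x")
```

The gap widens with sequence length: the attention-to-FFT ratio grows as $N / \log_2 N$, while the attention-to-window ratio grows as $N / D$.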
4. Empirical Trends and Quantitative Outcomes
Dynamic token mixers have demonstrated strong empirical results across modalities:
- GLFormer established state-of-the-art (SOTA) on continuous-time link-prediction benchmarks (Wikipedia, Reddit, MOOC, LastFM, SocialEvo, Enron), matching or surpassing Transformer backbones with 2–10× faster inference (Zou et al., 16 Nov 2025).
- AFNO matched or exceeded standard self-attention on ImageNet-1K inpainting (PSNR = 27.05 dB), few-shot segmentation, and high-resolution semantic segmentation, with 30–65% lower FLOPs and robust quasi-linear scaling in token count (Guibas et al., 2021).
- FFT-based Dynamic Filters in DFFormer/CDFFormer yielded higher throughput and equivalent or better top-1 accuracy on ImageNet-1K and ADE20K versus PoolFormer and ConvFormer; empirical ablations confirmed dynamic mixing outperformed static filter designs (Tatsunami et al., 2023).
- Pure-Pass saved on average 9% (up to 21%) of total FLOPs with negligible parameters (<1K) in ATD-light, while maintaining SOTA super-resolution on Urban100 (~33.26 dB PSNR) and outperforming both static and coarse-grained dynamic gating (Wu et al., 2 Oct 2025).
- TransXNet delivered superior accuracy and efficiency: TransXNet-T achieved 81.6% top-1 on ImageNet-1K with only 1.8G FLOPs and 12.8M params, outperforming Swin-T by 0.3% while requiring less than half the computational cost; ablation revealed the necessity of input adaptivity in both global and local branches (Lou et al., 2023).
- DynaMixer consistently outperformed other MLP-based models in top-1 ImageNet accuracy at matched parameter cost, confirming the benefit of content-aware dynamic mixing, while its dimensionality reduction lowers the theoretical cost of generating the mixing matrices (Wang et al., 2022).
- On multivariate time series, MTS-Mixers reduced MSE by 24% on ECL (96→96) compared to FEDformer, and trained 3–5× faster per epoch (Li et al., 2023).
5. Comparative Properties and Design Trade-offs
Dynamic token mixers exhibit a range of behaviors and trade-offs compared to self-attention and static mixers:
- Adaptivity: Dynamic mixers modulate aggregation in real time, often depending on token content, position, or neighborhood structure.
- Sparsity: Mechanisms such as frequency soft-thresholding (AFNO) or pixel masking (Pure-Pass) introduce instance-specific sparsity, focusing compute on relevant tokens/frequencies.
- Local vs. Global: Approaches vary in which axes are mixed (spatial, channel, temporal, frequency), and in the spatial extent of aggregation, controlled via windowing, dilation, or global transforms.
- Parameter Efficiency: Techniques like block-diagonalization (AFNO), groupwise convolutions (AFF), and matrix factorization (MTS-Mixers) reduce parameter/compute overhead while maintaining expressive power.
- Routing Flexibility: Masking and gating strategies allow per-pixel or per-patch routing to high- or low-complexity mixers, balancing performance against hardware or resource constraints.
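The instance-specific sparsity induced by frequency soft-thresholding can be shown on a toy signal. The sketch assumes that shrinkage is applied separately to the real and imaginary parts of each coefficient; the function name `soft_threshold` and the threshold value are illustrative choices.

```python
import numpy as np

def soft_threshold(z, lam):
    """Soft-shrinkage applied separately to the real and imaginary parts of z."""
    def shrink(a):
        return np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)
    return shrink(z.real) + 1j * shrink(z.imag)

n = np.arange(64)
signal = np.sin(2 * np.pi * 3 * n / 64)      # a single tone in frequency bin 3
spectrum = np.fft.rfft(signal)
sparse = soft_threshold(spectrum, lam=1.0)
kept = np.count_nonzero(sparse)              # number of surviving frequency modes
```

Which modes survive depends on the spectrum of each input, so the sparsity pattern is selected per instance rather than fixed in advance.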
Key ablation studies highlight that the removal of adaptivity (e.g., replacing affine weights by static values, omitting interval/time/ordering cues) causes significant degradation, underscoring the criticality of dynamic mixing mechanisms (Zou et al., 16 Nov 2025, Guibas et al., 2021, Wu et al., 2 Oct 2025).
6. Modalities and Application Domains
Dynamic token mixers have been applied extensively to:
- Dynamic graphs: GLFormer demonstrates that context-aware, order- and time-adaptive mixers can perform scalable, accurate dynamic link prediction (Zou et al., 16 Nov 2025).
- Vision: AFNO, DFFormer, CDFFormer, Pure-Pass, TransXNet, AFFNet, ATMNet, and DynaMixer compete with or outperform attention- and convolution-based baselines in classification, segmentation, and super-resolution by leveraging instance-adaptive mixing (Guibas et al., 2021, Tatsunami et al., 2023, Wu et al., 2 Oct 2025, Lou et al., 2023, Huang et al., 2023, Wei et al., 2022, Wang et al., 2022).
- Time series forecasting: MTS-Mixers exploit decoupled, factorized token mixing to supersede attention-based models in efficiency and accuracy (Li et al., 2023).
A plausible implication is that dynamic token mixing can serve as a unified principle for feature fusion across modalities, provided that adaptivity and scaling are preserved.
7. Future Directions and Open Questions
Ongoing research seeks to further unify, optimize, and understand dynamic token mixers across modalities. Open problems include:
- Characterizing the limits of adaptive mixing in terms of expressivity and inductive bias, relative to full self-attention or classic convolutional structures.
- Exploring hybridization of dynamic token mixing with learned sparsity patterns, instance-dependent routing, and hierarchical processing.
- Extending dynamic mixing principles to other data types, e.g., non-Euclidean domains, spatiotemporal manifolds, or combinatorial structures.
- Developing theoretical generalizations that capture the observed empirical success in replacing self-attention with lower-complexity, content-adaptive mixers.
Empirical reports suggest that context-aware, highly dynamic token mixing, combined with careful computational and architectural design, is sufficient to reach or exceed SOTA accuracy across varied tasks at reduced cost. Dynamic token mixers thus present a promising avenue in the search for scalable, efficient, and general feature fusion mechanisms in deep learning models.