Frequency-Aware Token Reduction
- Frequency-aware token reduction strategies are methods that selectively retain high-frequency tokens and aggregate low-frequency tokens to optimize transformer efficiency and mitigate rank collapse and over-smoothing.
- They identify high-frequency content either by decomposing the self-attention matrix into low-pass (DC) and high-pass components, or via proxies such as k-means clustering on Q–K similarities, in order to distinguish and preserve critical high-frequency details.
- Empirical evaluations show these approaches achieve 30–50% MAC savings and maintain or improve accuracy in tasks like ImageNet1K classification and hyperspectral pansharpening.
Frequency-aware token reduction strategies are mechanisms for reducing the computational complexity of transformer architectures—particularly Vision Transformers (ViTs)—by selectively retaining or aggregating tokens based on their frequency characteristics. These strategies aim to address fundamental issues in self-attention such as rank collapse and over-smoothing, while also improving efficiency and preserving salient information for tasks like hyperspectral pansharpening and large-scale image classification (Jin et al., 11 Aug 2025, Lee et al., 26 Nov 2025).
1. Theoretical Foundations: Frequency Properties in Self-Attention
Self-attention functions as a low-pass filter in the frequency domain, suppressing high-frequency components (edge, texture, detail) and retaining dominant low-frequency information (mean, broad structure). This behavior manifests as two intertwined phenomena:
- Rank Collapse: Repeated application of self-attention exponentially attenuates high-frequency signals, driving token representations toward a rank-1 structure. Given a token matrix $X \in \mathbb{R}^{n \times d}$, its high-frequency component is $\mathrm{HC}[X] = X - \tfrac{1}{n}\mathbf{1}\mathbf{1}^{\top}X$; self-attention yields $\lVert \mathrm{HC}[\mathrm{SA}(X)] \rVert \le \gamma\,\lVert \mathrm{HC}[X] \rVert$, with $\gamma < 1$ (Lee et al., 26 Nov 2025).
- Over-Smoothing: Attention weights tend to average tokens, reducing inter-token diversity, particularly across deeper layers. Decomposing the attention matrix $A$ into $A_{LP} = \tfrac{1}{n}\mathbf{1}\mathbf{1}^{\top}$ (DC filter) and $A_{HP} = A - A_{LP}$ (high-pass) exposes the tendency to preserve only zero-frequency (DC) content, suppressing higher frequencies.
These effects degrade model expressivity, especially when vanilla token reduction methods (e.g., uniform merging or pruning) are employed without frequency awareness.
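As a quick numerical illustration of this low-pass behaviour, the sketch below repeatedly applies a randomly initialized single-head softmax attention to random tokens and tracks the norm of the mean-removed (high-frequency) component; all sizes and initializations are arbitrary toy choices, not the setup of the cited papers.

```python
import numpy as np
from scipy.special import softmax

rng = np.random.default_rng(0)
n, d = 16, 32                                          # toy sizes, chosen arbitrarily
X = rng.standard_normal((n, d))
W_Q = rng.standard_normal((d, d)) / np.sqrt(d)         # scaled random projections
W_K = rng.standard_normal((d, d)) / np.sqrt(d)

HC = lambda Z: Z - Z.mean(axis=0, keepdims=True)       # high-frequency (mean-removed) part

for layer in range(4):                                 # repeated self-attention
    A = softmax((X @ W_Q) @ (X @ W_K).T / np.sqrt(d), axis=-1)
    X = A @ X
    print(f"layer {layer + 1}: ||HC[X]|| = {np.linalg.norm(HC(X)):.3f}")
# The high-frequency norm shrinks from layer to layer, illustrating the low-pass behaviour.
```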
2. High-Frequency Token Identification and Partitioning
Frequency-aware reduction strategies target the retention of tokens that contribute disproportionately to the feature's high-frequency content. Formally, for tokens $X \in \mathbb{R}^{n \times d}$, the attention matrix $A^{(h)}$ of each head is decomposed into $A_{LP} = \tfrac{1}{n}\mathbf{1}\mathbf{1}^{\top}$ and $A_{HP}^{(h)} = A^{(h)} - A_{LP}$. The average high-frequency contribution of token $j$ is

$$s_j = \frac{1}{Hn}\sum_{h=1}^{H}\sum_{i=1}^{n}\big[A_{HP}^{(h)}\big]_{ij},$$

where $H$ is the number of heads. For a target retain-rate $\tau$, the top-$r$ ($r = \lfloor n\tau \rfloor$) tokens by $s_j$ form the high-frequency set $\mathcal{N}_{HF}$; the remainder $\mathcal{N}_{LF}$ are considered low-frequency (Lee et al., 26 Nov 2025). In certain approaches, as in THAT, token importance is derived implicitly via k-means clustering over token similarity matrices, with clusters corresponding to pivotal (presumed high-frequency) and non-pivotal tokens (Jin et al., 11 Aug 2025).
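To make the scoring concrete, the toy example below (a single head over four tokens, with made-up attention values) computes the per-token high-frequency scores and the top-$r$ selection for $\tau = 0.5$; with multiple heads, the scores would additionally be averaged over heads.

```python
import numpy as np

# Toy single-head attention matrix over n = 4 tokens (values for illustration only).
A = np.array([[0.40, 0.30, 0.20, 0.10],
              [0.30, 0.40, 0.20, 0.10],
              [0.25, 0.25, 0.40, 0.10],
              [0.10, 0.10, 0.10, 0.70]])
n = A.shape[0]
A_HP = A - np.ones((n, n)) / n        # subtract the DC (uniform low-pass) filter
s = A_HP.mean(axis=0)                 # average high-frequency contribution per token
tau = 0.5
r = int(np.floor(n * tau))
N_HF = np.argsort(-s)[:r]             # indices of retained high-frequency tokens
print("scores:", np.round(s, 3), "-> retained token indices:", N_HF)
```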
3. Aggregation and Handling of Low-Frequency Tokens
Rather than discarding low-frequency (LF) tokens entirely, these methods aggregate them into a compact representation that preserves the low-frequency (DC) context. Two variants are established:
- Global DC Token: a single aggregate $x_{DC} = \frac{1}{|\mathcal{N}_{LF}|}\sum_{j \in \mathcal{N}_{LF}} x_j$.
- Local DC Tokens: Partition the patch tokens into spatial windows $W_1, \dots, W_w$ and aggregate within each: $x_{DC}^{(k)} = \frac{1}{|W_k \cap \mathcal{N}_{LF}|}\sum_{j \in W_k \cap \mathcal{N}_{LF}} x_j$ for $k = 1, \dots, w$.
This process ensures retention of essential low-frequency information while achieving complexity reduction (Lee et al., 26 Nov 2025).
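The two variants differ only in how the LF set is pooled. Below is a minimal sketch with a 4x4 grid of toy tokens and an assumed, purely illustrative LF index set; the window layout is likewise an assumption for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((16, 8))                    # 16 toy tokens of dim 8 (a 4x4 patch grid)
N_LF = np.array([1, 2, 5, 6, 9, 10, 13, 14])        # assumed low-frequency token indices

# Global DC token: one mean over all LF tokens.
x_DC_global = X[N_LF].mean(axis=0)

# Local DC tokens: one mean per 2x2 spatial window of the 4x4 grid.
grid = np.arange(16).reshape(4, 4)
windows = [grid[i:i + 2, j:j + 2].ravel() for i in (0, 2) for j in (0, 2)]
DC_local = [X[np.intersect1d(w, N_LF)].mean(axis=0) for w in windows]
print(len(DC_local), "local DC tokens vs. 1 global DC token")
```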
4. Architectural Instantiations in Vision Transformers
Several recent frameworks implement frequency-aware reduction:
- Token-wise High-frequency Augmentation Transformer (THAT) (Jin et al., 11 Aug 2025): Pivotal Token Selective Attention (PTSA) harnesses k-means clustering on the Q–K similarity matrix to select informative (likely high-frequency) tokens and suppress redundancy. Multi-level Variance-aware Feed-forward Network (MVFN) uses multi-scale convolutions and local variance estimation, enhancing high-frequency detail. Outputs from PTSA are processed by window-based self-attention and MVFN before reconstruction. A minimal sketch of this clustering-based selection appears after this list.
- Frequency-Aware Token Reduction for Efficient Vision Transformer (Lee et al., 26 Nov 2025): Implements formal partitioning of high- and low-frequency tokens as previously described, aggregates LF tokens into DC tokens, and passes the reduced token set through subsequent attention and MLP blocks.
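The clustering-based selection used by PTSA can be sketched as follows. This is only an illustrative reconstruction: it assumes tokens are clustered by k-means over their rows of the Q–K similarity matrix and that the cluster deviating most from the uniform (DC) attention profile is treated as pivotal; the actual criterion, cluster count, and subsequent handling in THAT may differ (Jin et al., 11 Aug 2025).

```python
import numpy as np
from scipy.special import softmax
from sklearn.cluster import KMeans

def select_pivotal_tokens(X, W_Q, W_K, n_clusters=2, seed=0):
    """Cluster tokens by their Q-K similarity profiles and return the indices of the
    cluster treated as pivotal (illustrative criterion: largest deviation from uniform)."""
    n, d = X.shape
    Q, K = X @ W_Q, X @ W_K
    S = softmax(Q @ K.T / np.sqrt(K.shape[1]), axis=-1)       # (n, n) similarity/attention matrix
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(S)
    deviation = np.abs(S - 1.0 / n).mean(axis=1)              # distance of each row from the DC profile
    pivotal = max(range(n_clusters), key=lambda c: deviation[labels == c].mean())
    return np.where(labels == pivotal)[0]
```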
A representative algorithm for token reduction in one transformer layer, following (Lee et al., 26 Nov 2025), is given below:
```python
import numpy as np
from scipy.special import softmax

def frequency_aware_reduce(X, W_Q, W_K, tau, num_heads, windows=None):
    """Keep the top-r high-frequency tokens of X (n, d) and aggregate the rest into DC token(s)."""
    n, d = X.shape
    d_k = d // num_heads
    # Per-head attention maps, shape (H, n, n)
    Q = (X @ W_Q).reshape(n, num_heads, d_k).transpose(1, 0, 2)
    K = (X @ W_K).reshape(n, num_heads, d_k).transpose(1, 0, 2)
    A = softmax(Q @ K.transpose(0, 2, 1) / np.sqrt(d_k), axis=-1)
    A_LP = (1.0 / n) * np.ones((n, n))          # DC (low-pass) filter
    A_HP = A - A_LP                             # high-pass residual
    A_score = A_HP.mean(axis=(0, 1))            # per-token HF score, averaged over heads and queries
    r = int(np.floor(n * tau))                  # number of retained high-frequency tokens
    N_HF = np.argsort(-A_score)[:r]
    N_LF = np.setdiff1d(np.arange(n), N_HF)
    if windows is None:                         # global DC token (w == 1)
        DC_set = [X[N_LF].mean(axis=0)]
    else:                                       # local DC tokens, one per spatial window
        DC_set = [X[np.intersect1d(win, N_LF)].mean(axis=0) for win in windows]
    return np.concatenate([X[N_HF], np.stack(DC_set)], axis=0)
```
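For concreteness, the function above can be exercised on random inputs; the shapes below merely mimic DeiT-S and are not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, H = 197, 384, 6                          # DeiT-S-like shapes, for illustration
X = rng.standard_normal((n, d))
W_Q = rng.standard_normal((d, d)) / np.sqrt(d)
W_K = rng.standard_normal((d, d)) / np.sqrt(d)
X_red = frequency_aware_reduce(X, W_Q, W_K, tau=0.5, num_heads=H)
print(X.shape, "->", X_red.shape)              # (197, 384) -> (99, 384)
```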
5. Computational Complexity Implications
Let $n$ denote the original token count, $d$ the token dimension, and $r$ the number of high-frequency tokens retained; with $w$ DC tokens appended, the reduced sequence length is $r + w$.
| Block | Complexity (Original) | Complexity (Reduced) |
|---|---|---|
| Q–K product | $O(n^2 d)$ | $O((r+w)^2 d)$ |
| Attn–V product | $O(n^2 d)$ | $O((r+w)^2 d)$ |
| FFN | $O(n d^2)$ | $O((r+w) d^2)$ |
In practice, this reduction of the token count leads to 30–50% multiply–accumulate (MAC) savings with no performance drop for suitably chosen retain-rates $\tau$ (Lee et al., 26 Nov 2025).
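A back-of-the-envelope check of these savings, assuming DeiT-S-like dimensions ($n = 197$, $d = 384$, the usual 4x FFN expansion) and, as a simplification, that the reduction is applied identically in every layer; real schedules may apply it progressively, which is why the numbers below only roughly bracket the quoted 30–50% range.

```python
# Rough per-layer MAC estimate for DeiT-S-like dimensions; a sketch only, under the
# assumption that the reduced token count holds in every layer.
def layer_macs(n_tokens, d=384):
    attn = 4 * n_tokens * d**2 + 2 * n_tokens**2 * d   # QKV/output projections + score and value products
    ffn = 8 * n_tokens * d**2                          # two linear layers with 4x hidden expansion
    return attn + ffn

n, w = 197, 1                                          # DeiT-S token count, one global DC token
for tau in (0.7, 0.5):
    reduced = int(n * tau) + w
    print(f"tau = {tau}: ~{1 - layer_macs(reduced) / layer_macs(n):.0%} MAC saving per layer")
```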
6. Empirical Performance and Mitigation of Rank Collapse
Frequency-aware token reduction strategies consistently match or outperform baseline models:
- On ImageNet1K with DeiT-S, accuracy rises slightly from 79.8% (baseline) to 79.9% at 30–40% reduced MACs (Lee et al., 26 Nov 2025).
- THAT achieves state-of-the-art spectral fidelity and spatial sharpness on hyperspectral pansharpening benchmarks, outperforming eleven baselines on PSNR, SSIM, SAM, ERGAS, and SCC (Jin et al., 11 Aug 2025).
These methods demonstrably slow down over-smoothing and rank collapse, as measured by CKA similarity and frequency-spectrum amplitude tracking. White-noise ablation experiments reveal that high-frequency tokens are substantially more critical for downstream performance (Lee et al., 26 Nov 2025). The table below summarizes ImageNet1K results with DeiT-S at comparable MAC budgets:
| Method | Params (M) | MACs (G) | Acc. (%) |
|---|---|---|---|
| Baseline | 22.1 | 4.6 | 79.8 |
| EViT | 22.1 | 3.0 | 79.5 |
| ToMe | 22.1 | 2.9 | 79.5 |
| DiffRate | 22.1 | 2.9 | 79.6 |
| Freq-Aware | 22.1 | 3.0 | 79.9 |
7. Connections, Limitations, and Outlook
Frequency-aware token reduction strategies directly confront weaknesses of existing pruning and merging techniques, which can erode high-frequency components and accelerate rank collapse. By always retaining the top-$r$ HF tokens and aggregating LF information into DC tokens, these methods achieve a new tradeoff: substantially lower computational cost, slowed over-smoothing, and robust performance preservation. Notably, in frameworks like THAT, explicit frequency scores per token are not computed; token selection relies on the Q–K similarity matrix as an approximate proxy for informativeness, which strongly aligns with high-frequency regions (Jin et al., 11 Aug 2025).
A plausible implication is that frequency-aware token reduction may generalize across modalities (images, hyperspectral, video) wherever local detail is critical, provided suitable proxies for token informativeness are identified. Ongoing research is evaluating dynamic tuning of retain-rates, alternate clustering schemes, and joint frequency–spatial priors for further efficiency–performance improvements.