Token-level SEP in Vision Transformers
- Token-level SEP is a diagnostic tool that measures how individual tokens in Vision Transformers allocate their representational energy across frequency modes.
- It reveals an encoding mismatch in which globally low-rank representations coexist with broadly spread per-token energy, with direct consequences for feature distillation.
- Empirical analyses in models like CaiT-S24 and DeiT-Tiny show that despite low-rank global features, tokens require nearly all frequency bins to recover their energy.
The token-level Spectral Energy Pattern (SEP) is a quantitative diagnostic of how individual tokens in a Vision Transformer allocate their representational energy across channel (feature) modes, as measured in a frequency-domain basis. SEP provides insight into channel utilization at the per-token level, revealing surprising differences between local encoding and global representational statistics. This distinction has direct implications for feature distillation and knowledge transfer in Vision Transformers (ViTs), notably identifying encoding mismatches that can undermine distillation methods that otherwise appear feasible given global matrix properties (Tian et al., 19 Nov 2025).
1. Mathematical Definition and Computation
Let $X \in \mathbb{R}^{n \times D}$ denote a feature map from a Transformer layer, where $n$ is the number of tokens (including [CLS], patch tokens, etc.) and $D$ is the channel (embedding) dimension. For a given token $t$, let $x_t \in \mathbb{R}^{D}$ be its channel vector.
Token-level SEP is constructed as follows:
- Frequency Decomposition: Compute the 1D Discrete Fourier Transform (DFT) of $x_t$ along the channel axis, resulting in $\hat{x}_t = F x_t \in \mathbb{C}^{D}$, where $F$ is the $D \times D$ DFT matrix.
- Spectral Energy: For each DFT frequency bin $d$, compute the energy $E_t(d) = |\hat{x}_t(d)|^2$.
- Cumulative Energy: Since the DFT of a real signal is conjugate-symmetric, consider only the first $D' = \lfloor D/2 \rfloor + 1$ unique bins.
- SEP Curve: The cumulative energy ratio up to bin $d$ is
$$\mathrm{SEP}_t(d) = \frac{\sum_{i=1}^{d} E_t(i)}{\sum_{i=1}^{D'} E_t(i)}, \qquad d = 1, \dots, D'.$$
- Bandwidth Quantile: For any target cumulative ratio $\rho \in (0,1)$, define the normalized spectral bandwidth
$$b_t(\rho) = \frac{1}{D'} \min\{\, d : \mathrm{SEP}_t(d) \geq \rho \,\}.$$
A large $b_t(\rho)$ indicates broadly spread per-token energy utilization.
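A minimal NumPy sketch of these definitions for a single token vector (function names here are illustrative, not from the paper):

```python
import numpy as np

def sep_curve(x_t: np.ndarray) -> np.ndarray:
    """Cumulative spectral energy profile SEP_t(d) of one token vector in R^D."""
    z = np.fft.rfft(x_t)                # first D' = D//2 + 1 unique DFT bins
    energy = np.abs(z) ** 2             # E_t(d) = |x_hat_t(d)|^2
    return np.cumsum(energy) / np.sum(energy)

def bandwidth(x_t: np.ndarray, rho: float = 0.9) -> float:
    """Normalized spectral bandwidth b_t(rho)."""
    sep = sep_curve(x_t)
    d = np.searchsorted(sep, rho) + 1   # smallest 1-based d with SEP_t(d) >= rho
    return d / sep.size
```

Averaging `bandwidth(x_t, rho)` over all rows of a feature map gives the mean bandwidths reported in Section 5.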
2. Relationship to SVD and Per-Token Representational Structure
A global view of feature-matrix structure is provided by the singular value decomposition (SVD): $X = U \Sigma V^{\top}$, with singular values $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_r$, rank $r$, and top-$k$ energy fraction $\eta(k) = \sum_{i=1}^{k} \sigma_i^2 / \sum_{i=1}^{r} \sigma_i^2$. In ViTs, it is empirically observed that the last-layer feature map is globally low-rank, e.g., in CaiT-S24, only 121 of 384 channels account for 99% of the population energy.
A per-token SVD analysis, writing each $x_t$ in the right-singular basis $V$, allows for analogous token-level spectra in principal-component coordinates. However, the standard computation of SEP uses the DFT, focusing on frequency modes rather than principal-axis alignment. Both analyses capture local energy allocation: SEP provides direct frequency-mode utilization, while SVD-based spectra reflect overlap with the principal components.
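A sketch contrasting the two bases, assuming a feature map `X` of shape `(n, D)` (the helper name is illustrative):

```python
import numpy as np

def token_spectra(X: np.ndarray):
    """Per-token energy spectra in two bases for a feature map X in R^{n x D}."""
    # Frequency view: energy per DFT bin for each token (row of X)
    dft_energy = np.abs(np.fft.rfft(X, axis=1)) ** 2      # shape (n, D//2 + 1)

    # Principal-axis view: token coordinates in the global right-singular basis
    _, S, Vt = np.linalg.svd(X, full_matrices=False)
    pca_energy = (X @ Vt.T) ** 2                          # shape (n, min(n, D))

    # Population view: cumulative top-k energy fraction eta(k)
    eta = np.cumsum(S ** 2) / np.sum(S ** 2)
    return dft_energy, pca_energy, eta
```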
3. Algorithmic Procedure for SEP Calculation
The computation of SEP for a given feature-map, as analyzed in (Tian et al., 19 Nov 2025), proceeds as follows:
```
X_hat_S = X_S @ P                  # optional: lift student features to R^{n x D_T}
for t in range(n):
    x_t = X[t, :]                  # token vector in R^D
    z_hat_t = F @ x_t              # DFT coefficients in C^{D'}
    F_t = abs(z_hat_t) ** 2        # per-bin spectral energies in R^{D'}
    cumF = cumulative_sum(F_t)     # cumF[d] = sum_{i=1}^d F_t(i)
    totE = cumF[Dp]                # total energy over the D' unique bins
    for d in range(1, Dp + 1):
        SEP_t[d] = cumF[d] / totE  # cumulative energy ratio SEP_t(d)
```
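An equivalent vectorized NumPy sketch (variable names chosen to mirror the pseudocode above; `X` is the $n \times D$ feature map):

```python
import numpy as np

def sep_all_tokens(X: np.ndarray, rho: float = 0.9):
    """SEP curves and bandwidths b_t(rho) for every token (row) of X in R^{n x D}."""
    Z = np.fft.rfft(X, axis=1)            # (n, D') with D' = D//2 + 1
    E = np.abs(Z) ** 2                    # per-bin energies F_t(d)
    cum = np.cumsum(E, axis=1)            # running sums along the bin axis
    SEP = cum / cum[:, -1:]               # SEP_t(d) = cumF[d] / totE
    Dp = SEP.shape[1]
    # smallest 1-based bin index reaching ratio rho, normalized by D'
    b = (np.argmax(SEP >= rho, axis=1) + 1) / Dp
    return SEP, b
```

On a real last-layer ViT feature map, `b.mean()` at $\rho = 0.9$ should roughly reproduce the mean bandwidths near 0.90 reported below for CaiT-S24 and DeiT-Tiny.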
The main outputs are the per-token SEP curves $\mathrm{SEP}_t(d)$ and normalized bandwidths $b_t(\rho)$, which quantify how much of the frequency spectrum a token requires to amass a given fraction of its total channel energy.
4. SEP Analysis and Encoding Mismatch
SEP diagnostics reveal a fundamental "encoding mismatch" in ViTs between global subspace structure and local token utilization:
- Global SVD view: The teacher's feature matrix in the final layer is low-rank—with only a fraction of channels required to account for nearly all representational energy.
- Token-level SEP view: Individual token vectors distribute their energy broadly, utilizing nearly all frequency bins. For example, the mean bandwidth $\bar{b}(0.9) \approx 0.90$ for both CaiT-S24 ($D = 384$) and DeiT-Tiny ($D = 192$), indicating each token spreads its energy over ~90% of frequency bins to recover 90% of its own energy.
The SEP curve is nearly diagonal:
- 50% of token energy requires ~50% of frequency bins,
- 70% energy requires ~70% of bins,
- 90% energy requires ~90% of bins.
This high per-token bandwidth, inside a globally low-rank space, prevents narrow (low-$D$) students from matching the teacher's output at a fine-grained, token-local level, regardless of global subspace alignment or linear projections. This is the primary obstacle to effective feature-map distillation in ViTs via conventional feature-MSE losses.
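The mismatch is easy to reproduce synthetically. The sketch below (all names and constants illustrative) builds a globally low-rank feature map whose individual rows nonetheless occupy nearly the full DFT spectrum, using the `sep_all_tokens` helper defined above:

```python
import numpy as np

rng = np.random.default_rng(0)
n, D, r = 1000, 384, 32                        # many tokens, low global rank

# Low-rank feature map: every token lives in an r-dimensional subspace...
basis = rng.standard_normal((r, D))            # dense, spectrally broadband basis
X = rng.standard_normal((n, r)) @ basis        # rank(X) <= r << D

# ...yet each token's DFT energy is spread across nearly all bins.
_, S, _ = np.linalg.svd(X, full_matrices=False)
rank99 = np.searchsorted(np.cumsum(S**2) / np.sum(S**2), 0.99) + 1
SEP, b = sep_all_tokens(X, rho=0.9)
print(f"modes for 99% of global energy: {rank99} / {D}")    # close to r
print(f"mean per-token bandwidth b(0.9): {b.mean():.2f}")   # close to 0.9
```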
5. Empirical Characterization in CaiT-S24 and DeiT-Tiny
Mean token-level SEP bandwidths for prominent ViT models (the corresponding global SVD statistics are discussed below) are:

| Model | $D$ | mean $b_t(0.8)$ | mean $b_t(0.9)$ |
|---|---|---|---|
| CaiT-S24 | 384 | 0.805 | 0.901 |
| DeiT-Tiny | 192 | 0.797 | 0.901 |
For the global SVD of CaiT-S24's final layer ($D = 384$), the number of modes $k(\rho)$ required to capture a fraction $\rho$ of total energy rises with $\rho$, reaching $k(0.99) = 121$. Notably, 99% of the energy is contained in only $121/384 \approx 31.5\%$ of the global modes, yet any given token requires ~90% of its frequency bins to recover 90% of its own energy.
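A sketch of the corresponding computation, assuming the singular values `S` of the final-layer feature map are available (the function name is illustrative):

```python
import numpy as np

def k_for_energy(S: np.ndarray, rho: float) -> int:
    """Smallest number k of global SVD modes whose energy fraction reaches rho."""
    frac = np.cumsum(S ** 2) / np.sum(S ** 2)
    return int(np.searchsorted(frac, rho)) + 1
```

For CaiT-S24's reported final-layer spectrum, `k_for_energy(S, 0.99)` would return 121.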
6. Implications for Feature Distillation in Vision Transformers
The diagnosis of encoding mismatch through SEP leads to concrete remedies:
- Conventional feature-map MSE fails: Due to high per-token bandwidth, compact students ($D_S < D_T$) cannot realize the teacher's local encoding, making standard feature-map distillation ineffective or only marginally beneficial.
- Remedy 1: Post-hoc feature lifting: Insert and retain a lightweight linear projector after the student's last block, expanding its width back to $D_T$. This restores the necessary per-token channel capacity, making simple MSE-based distillation effective (a minimal sketch appears at the end of this section).
- Remedy 2: Native width alignment: Replace only the last Transformer block of the student with a block of channel width $D_T$, preserving model compactness while meeting the local bandwidth requirement.
Both strategies robustly recover the effectiveness of feature-map distillation (MSE/SpectralKD), raising top-1 accuracy by up to 3.3 percentage points on ImageNet-1K (e.g., DeiT-Tiny from 74.86% to up to 78.23%) and even providing modest improvements when no teacher is used. This confirms encoding mismatch, as quantified by token-level SEP, as a true architectural limiting factor (Tian et al., 19 Nov 2025).
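A minimal PyTorch sketch of the Remedy 1 projector (module and function names here are illustrative; the paper's exact training recipe may differ):

```python
import torch
import torch.nn as nn

class LiftedStudentHead(nn.Module):
    """Lightweight linear projector lifting student tokens from D_S to D_T,
    restoring per-token channel capacity for feature-map MSE distillation."""

    def __init__(self, d_student: int, d_teacher: int):
        super().__init__()
        self.proj = nn.Linear(d_student, d_teacher)  # retained after training

    def forward(self, x_student: torch.Tensor) -> torch.Tensor:
        return self.proj(x_student)                  # (B, n, D_S) -> (B, n, D_T)

def feature_distill_loss(student_feats: torch.Tensor,
                         teacher_feats: torch.Tensor,
                         head: LiftedStudentHead) -> torch.Tensor:
    """Token-wise MSE between lifted student features and teacher features."""
    return nn.functional.mse_loss(head(student_feats), teacher_feats)
```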
7. Broader Significance and Diagnostic Utility
SEP, together with global SVD, constitutes a two-view analytic framework for understanding representational bottlenecks in ViTs. Whereas SVD captures population-level, global structure, SEP directly measures local, per-token channel utilization, revealing when representational mismatch, rather than mere dimensionality reduction, becomes the limiting factor for distillation or student-teacher alignment. These diagnostics thus not only explain prior negative results in ViT feature distillation but also directly inform minimal modifications to ViT student architectures to close the gap to state-of-the-art distillation performance (Tian et al., 19 Nov 2025).