
Token-level SEP in Vision Transformers

Updated 26 November 2025
  • Token-level SEP is a diagnostic tool that measures individual token frequency-based energy allocation in Vision Transformers.
  • It reveals an encoding mismatch where global low-rank representations contrast with broad per-token energy distribution, impacting feature distillation.
  • Empirical analyses in models like CaiT-S24 and DeiT-Tiny show that despite low-rank global features, tokens require nearly all frequency bins to recover their energy.

The token-level Spectral Energy Pattern (SEP) is a quantitative diagnostic of how individual tokens in a Vision Transformer allocate their representational energy across channel (feature) modes, as measured in a frequency-domain basis. SEP provides insight into channel utilization at the per-token level, revealing surprising differences between local encoding and global representational statistics. This distinction has direct implications for feature distillation and knowledge transfer in Vision Transformers (ViTs), notably identifying encoding mismatches that can undermine distillation methods that otherwise appear feasible given global matrix properties (Tian et al., 19 Nov 2025).

1. Mathematical Definition and Computation

Let $X \in \mathbb{R}^{n \times D}$ denote a feature map from a Transformer layer, where $n$ is the number of tokens (including [CLS], patch tokens, etc.) and $D$ is the channel (embedding) dimension. For a given token $t$, let $x_t \in \mathbb{R}^D$ be its channel vector.

Token-level SEP is constructed as follows:

  1. Frequency Decomposition: Compute the 1D Discrete Fourier Transform (DFT) of $x_t$ along the channel axis, resulting in $\widehat{x}_t = \mathrm{DFT}(x_t) \in \mathbb{C}^D$.
  2. Spectral Energy: For each DFT frequency bin $i$, compute the energy $\mathcal{F}_t(i) = |\widehat{x}_t(i)|^2$.
  3. Cumulative Energy: Since the DFT of a real signal is conjugate-symmetric, consider only the first $D' = \lceil (D+1)/2 \rceil$ unique bins.
  4. SEP Curve: The cumulative energy fraction up to bin $d$ is

$$\mathrm{SEP}_t(d) = \frac{\sum_{i=1}^{d} \mathcal{F}_t(i)}{\sum_{j=1}^{D'} \mathcal{F}_t(j)} \times 100\% \,, \qquad d = 1, \ldots, D'$$

  5. Bandwidth Quantile: For any target cumulative ratio $\alpha \in (0, 1)$, define the normalized spectral bandwidth

$$b_{t,\alpha} = \min \left\{ \frac{d}{D'} \;\middle|\; \mathrm{SEP}_t(d) \geq \alpha \right\}$$

A large $b_{t,\alpha}$ indicates broadly spread per-token energy utilization.
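The five steps above fit in a few lines of NumPy. This is a minimal sketch on a random token vector (not actual ViT activations); note that `np.fft.rfft` returns exactly the $D'$ unique bins of a real signal:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 384                                  # channel dimension, e.g. CaiT-S24
x_t = rng.standard_normal(D)             # toy stand-in for one token's channel vector

X_hat = np.fft.rfft(x_t)                 # D' = D//2 + 1 unique DFT bins
F_t = np.abs(X_hat) ** 2                 # spectral energy per bin, F_t(i)
sep = np.cumsum(F_t) / F_t.sum()         # SEP_t(d) as a fraction in (0, 1]

alpha = 0.9
Dp = len(F_t)                            # D'
b = (np.argmax(sep >= alpha) + 1) / Dp   # normalized bandwidth b_{t, alpha}
print(f"b_(t,0.9) = {b:.3f}")
```

For white-noise input the energy is spread almost uniformly across bins, so `b` lands near 0.9, matching the "nearly diagonal" SEP curves reported for real ViT tokens.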

2. Relationship to SVD and Per-Token Representational Structure

A global view of feature-matrix structure is provided by the singular value decomposition (SVD): $X = U \Sigma V^\top$, with rank $r$ and top-$d$ energy fraction $E_{\mathrm{svd}}(d) = \sum_{i=1}^{d} \sigma_i^2 / \sum_{j=1}^{r} \sigma_j^2$. In ViTs, it is empirically observed that the last-layer feature map is globally low-rank; e.g., in CaiT-S24, only 121 of 384 channels account for 99% of the population energy.
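The global quantities $E_{\mathrm{svd}}(d)$ and the minimal rank capturing a fraction $\alpha$ of energy can be sketched as follows, on a synthetic low-rank-plus-noise matrix (the shapes and `r_true` are illustrative assumptions, not the paper's features):

```python
import numpy as np

rng = np.random.default_rng(0)
n, D, r_true = 197, 384, 30                  # tokens x channels; synthetic low rank
X = rng.standard_normal((n, r_true)) @ rng.standard_normal((r_true, D))
X += 0.01 * rng.standard_normal((n, D))      # small noise -> approximately low rank

s = np.linalg.svd(X, compute_uv=False)       # singular values, descending
E_svd = np.cumsum(s**2) / np.sum(s**2)       # top-d energy fraction E_svd(d)

def r_alpha(alpha):
    """Smallest d with E_svd(d) >= alpha."""
    return int(np.argmax(E_svd >= alpha)) + 1

print(r_alpha(0.99))                         # close to r_true for this synthetic X
```

Running the same computation on a real last-layer feature map would reproduce the $r_{0.99} = 121$ figure quoted above for CaiT-S24.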

A per-token SVD analysis, writing each $x_t = V c_t$ in the SVD basis, allows for analogous token-level spectra in principal-component coordinates. However, the standard computation of SEP uses the DFT, focusing on frequency rather than principal-axis alignment. Both analyses capture local energy allocation: SEP provides direct frequency-mode utilization, while SVD-based spectra reflect principal-component overlap.

3. Algorithmic Procedure for SEP Calculation

The computation of SEP for a given feature-map, as analyzed in (Tian et al., 19 Nov 2025), proceeds as follows:

import numpy as np

def token_sep(X):
    """Per-token SEP curves for a real feature map X of shape (n, D)."""
    n, D = X.shape
    Dp = D // 2 + 1                        # D' = ceil((D+1)/2) unique bins
    X_hat = np.fft.rfft(X, axis=1)         # (n, Dp) complex spectra
    F = np.abs(X_hat) ** 2                 # per-bin spectral energies F_t(i)
    cumF = np.cumsum(F, axis=1)            # cumF[t, d-1] = sum_{i<=d} F_t(i)
    return cumF / cumF[:, -1:]             # SEP_t(d) as a fraction of total energy

def bandwidth(sep, alpha):
    """Normalized bandwidth b_{t,alpha}: fraction of bins needed to reach alpha."""
    Dp = sep.shape[1]
    return (np.argmax(sep >= alpha, axis=1) + 1) / Dp

The main outputs are the per-token SEP curves, SEPt(d)\mathrm{SEP}_t(d), and normalized bandwidths bt,αb_{t,\alpha}, which quantify how much of the frequency spectrum is required for a token to amass a given fraction of its total channel energy.

4. SEP Analysis and Encoding Mismatch

SEP diagnostics reveal a fundamental "encoding mismatch" in ViTs between global subspace structure and local token utilization:

  • Global SVD view: The teacher's feature matrix $X_T$ in the final layer is low-rank, with only a fraction of channels required to account for nearly all representational energy.
  • Token-level SEP view: Individual token vectors $x_t$ distribute their energy broadly, utilizing nearly all frequency bins. For example, mean $b_{t,0.9} \approx 0.9$ for both CaiT-S24 ($D = 384$) and DeiT-Tiny ($D = 192$), indicating each token spreads its energy over roughly 90% of frequency bins to recover 90% of its own energy.

The SEP curve is nearly diagonal:

  • 50% of token energy requires ~50% of frequency bins,
  • 70% energy requires ~70% of bins,
  • 90% energy requires ~90% of bins.

This high per-token bandwidth, in a globally low-rank space, prevents narrow (low $D_S$) students from matching the teacher's output at a fine-grained, token-local level, regardless of global subspace alignments or linear projections. This is the primary obstacle to effective feature-map distillation in ViTs via conventional $L_2$ or feature-MSE loss.
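The coexistence of a low-rank global subspace with broadband tokens is easy to reproduce on synthetic data. A hedged sketch (random features with an assumed rank of 16, not the paper's CaiT-S24 activations):

```python
import numpy as np

rng = np.random.default_rng(0)
n, D, r = 197, 384, 16
X = rng.standard_normal((n, r)) @ rng.standard_normal((r, D))  # exactly rank r

# Global view: a handful of singular directions hold essentially all the energy.
s = np.linalg.svd(X, compute_uv=False)
global_r99 = int(np.argmax(np.cumsum(s**2) / np.sum(s**2) >= 0.99)) + 1

# Local view: each token still spreads its energy over most DFT bins.
F = np.abs(np.fft.rfft(X, axis=1)) ** 2
sep = np.cumsum(F, axis=1) / F.sum(axis=1, keepdims=True)
b90 = (np.argmax(sep >= 0.9, axis=1) + 1) / F.shape[1]

print(global_r99, b90.mean())   # small global rank, yet mean bandwidth near 0.9
```

The point of the sketch: low global rank constrains which directions tokens occupy, not how each token's energy distributes over frequency modes, which is exactly the mismatch SEP detects.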

5. Empirical Characterization in CaiT-S24 and DeiT-Tiny

The SEP and SVD-based energy bandwidths for prominent ViT models are:

| Model | $D$ | $b_{80\%}$ | $b_{90\%}$ |
|-----------|-----|-------|-------|
| CaiT-S24 | 384 | 0.805 | 0.901 |
| DeiT-Tiny | 192 | 0.797 | 0.901 |

For the global SVD of CaiT-S24's final layer ($D_T = 384$), the dimensionality required to capture a given fraction of total energy is:

  • 80%: $r_{0.8} = 14$
  • 90%: $r_{0.9} = 34$
  • 95%: $r_{0.95} = 61$
  • 99%: $r_{0.99} = 121$

Notably, 99% of the energy is contained in only $121/384 \approx 31\%$ of the global modes, yet any given token requires ${\sim}90\%$ of the frequency bins for 90% of its energy.

6. Implications for Feature Distillation in Vision Transformers

The diagnosis of encoding mismatch through SEP leads to concrete remedies:

  • Conventional feature-map MSE fails: Due to high per-token bandwidth, compact students ($D_S \ll D_T$) cannot realize the teacher's local encoding, making standard feature-map distillation ineffective or only marginally beneficial.
  • Remedy 1: Post-hoc feature lifting: Insert and retain a lightweight linear projector $P \in \mathbb{R}^{D_S \times D_T}$ after the student's last block, expanding its width back to $D_T$. This restores the necessary per-token channel capacity, making simple MSE-based distillation effective.
  • Remedy 2: Native width alignment: Replace only the last Transformer block of the student with a block of channel width $D_T$, preserving model compactness while meeting the local bandwidth requirement.
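The lifting in Remedy 1 can be illustrated with a closed-form least-squares projector as a stand-in for the projector learned during distillation training (all shapes and the random features here are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n, D_S, D_T = 197, 192, 384
X_S = rng.standard_normal((n, D_S))      # stand-in for student last-block features
X_T = rng.standard_normal((n, D_T))      # stand-in for teacher last-block features

# Remedy-1 sketch: fit the linear lifting P (D_S x D_T) by least squares;
# in actual distillation P would be trained jointly under the MSE objective.
P, *_ = np.linalg.lstsq(X_S, X_T, rcond=None)
X_lift = X_S @ P                         # student features lifted to width D_T

mse = np.mean((X_lift - X_T) ** 2)       # token-wise feature-MSE distillation loss
print(P.shape, f"{mse:.3f}")
```

The projector only makes the widths comparable; the paper's finding is that retaining it (rather than discarding it after training) is what restores per-token channel capacity.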

Both strategies robustly recover the effectiveness of feature-map distillation (MSE/SpectralKD), raising top-1 accuracy by up to 3.3 percentage points on ImageNet-1K (e.g., DeiT-Tiny from 74.86% to as high as 78.23%) and even providing modest improvements when no teacher is used. This confirms encoding mismatch, as quantified by token-level SEP, as a true architectural limiting factor (Tian et al., 19 Nov 2025).

7. Broader Significance and Diagnostic Utility

SEP, together with global SVD, constitutes a two-view analytic framework for understanding representational bottlenecks in ViTs. Whereas SVD captures population-level, global structure, SEP directly measures local, per-token channel utilization, revealing when representational mismatch, rather than mere dimensionality reduction, becomes the limiting factor for distillation or student-teacher alignment. These diagnostics thus not only explain prior negative results in ViT feature distillation but also directly inform minimal modifications to ViT student architectures to close the gap to state-of-the-art distillation performance (Tian et al., 19 Nov 2025).
