Spectral Vision Transformer for Efficient Tokenization with Limited Data

Published 12 May 2026 in cs.CV, cs.AI, and eess.SP | (2605.12026v1)

Abstract: We propose a novel spectral vision transformer architecture for efficient tokenization in limited data, with an emphasis on medical imaging. We outline convenient theoretical properties arising from the choice of basis including spatial invariance and optimal signal-to-noise ratio. We show reduced complexity arising from the spectral projection compared to spatial vision transformers. We show equitable or superior performance with a reduced number of parameters as compared to a variety of models including compact and standard vision transformers, convolutional neural networks with attention, shifted window transformers, multi-layer perceptrons, and logistic regression. We include simulated, public, and clinical data in our analysis and release our code at: github.com/agr78/spectralViT.

Authors (13)

Summary

The paper introduces a spectral vision transformer that uses PCA tokenization to reduce parameter count and achieve robust classification in low-data scenarios.
It replaces spatial patch-based tokenization with spectral projections, leveraging variance hierarchy for enhanced signal compaction and noise reduction.
Empirical results show superior performance in medical imaging tasks, requiring far fewer samples than traditional spatial Vision Transformers.

Architectural Innovations

The paper "Spectral Vision Transformer for Efficient Tokenization with Limited Data" (2605.12026) introduces a vision transformer (ViT) architecture that leverages spectral tokenization via principal component analysis (PCA) and related eigenspace decompositions for efficient learning on small datasets. The proposed spectral ViT replaces conventional spatial patch-based tokenization with projections onto a fixed spectral basis, reducing parameter count and enabling robust performance in data-limited regimes, particularly relevant to medical imaging.

Spectral tokenization utilizes projections of an input image onto PCA eigenvectors, Fourier modes, or Laplacian harmonics, resulting in tokens that encode global inductive biases and invariant or hierarchical relationships. Unlike spatial ViTs, where patch positions are typically encoded based on grid location, spectral ViTs embed principal component indices and spectral properties as token positions and leverage inherent variance ranks for hierarchical encoding.

Figure 1: The spectral ViT differs from spatial ViTs by parameterizing the linear projection $E$ via spectral decomposition, yielding tokens $\mathbf{s}$ with hierarchical ranks.
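A minimal sketch of PCA-based spectral tokenization may help make the idea concrete. It assumes each token is the scalar projection of a mean-centered, flattened image onto one principal direction, and that the variance rank doubles as the token's position index. The function names `fit_pca_basis` and `spectral_tokens`, the toy data, and the token count are hypothetical and are not taken from the paper's released code.

```python
import numpy as np

def fit_pca_basis(images, n_tokens):
    """Return (mean, basis); basis has shape (n_tokens, m) with orthonormal rows."""
    X = images.reshape(len(images), -1).astype(np.float64)
    mean = X.mean(axis=0)
    # Rows of vt are principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_tokens]

def spectral_tokens(image, mean, basis):
    """Project one image onto the fixed basis: one scalar token per component."""
    return basis @ (image.reshape(-1) - mean)

rng = np.random.default_rng(0)
train = rng.normal(size=(50, 16, 16))      # 50 toy "images", m = 256 pixels
mean, basis = fit_pca_basis(train, n_tokens=8)

s = spectral_tokens(train[0], mean, basis)  # tokens for a single image
ranks = np.arange(len(s))                   # variance rank, usable as a position index
print(s.shape, ranks)
```

In this sketch the basis is computed once from the training set and then held fixed, so the tokenizer itself adds no trainable parameters; a real model would still need an embedding of each token and its rank before the transformer layers.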

Theoretical Properties and Signal Representation

Spectral tokenization admits several theoretical advantages over spatial patch-based methods:

Optimal Reconstruction: For a fixed number of components $n$, the PCA basis minimizes the least-squares reconstruction error, giving the best rank-$n$ linear approximation in the Frobenius norm. This is critical in scarce data settings where redundancy and noise can undermine spatial embeddings.
Energy Compaction and SNR Maximization: Spectral tokenization concentrates signal variance in the leading tokens. Because PCA retains the components with the largest eigenvalues, the SNR of the truncated spectral representation is maximized, which is crucial for encoding clinically relevant information in medical imaging.
Figure 2: Spectral tokenization and reconstruction using different bases highlight variance and SNR compaction.
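The optimal-reconstruction property can be checked numerically. The sketch below uses toy data (not the paper's) to compare the rank-$n$ PCA reconstruction error with the value predicted by the Eckart-Young theorem, and with the error from projecting onto a random orthonormal basis of the same size. All variable names and sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: 40 samples, 100 features, with per-feature scale decaying from 3.0 to 0.1.
X = rng.normal(size=(40, 100)) * np.linspace(3.0, 0.1, 100)
X -= X.mean(axis=0)

n = 5
_, sv, vt = np.linalg.svd(X, full_matrices=False)

# Rank-n reconstruction from the top-n principal directions.
X_pca = (X @ vt[:n].T) @ vt[:n]
err_pca = np.linalg.norm(X - X_pca)   # Frobenius norm is the default for 2-D arrays

# Eckart-Young: the error equals the energy in the discarded singular values.
err_theory = np.sqrt(np.sum(sv[n:] ** 2))

# Comparison: projection onto a random n-dimensional orthonormal basis.
q, _ = np.linalg.qr(rng.normal(size=(100, n)))
err_rand = np.linalg.norm(X - (X @ q) @ q.T)

print(err_pca, err_theory, err_rand)
```

The first two printed values should agree to numerical precision, and the random-basis error cannot be smaller than the PCA error, since no rank-$n$ linear approximation beats the truncated SVD in the Frobenius norm.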

The architecture is extensible to other bases such as Fourier (enabling strict spatial invariance) and Laplacian (preserving topology for non-Euclidean data); tokens can be ordered and weighted by variance rank or frequency, with hierarchical embeddings determined by spectral properties.
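One way a Fourier basis can give spatial invariance is through the magnitude spectrum, which is unchanged by circular translation. The sketch below illustrates this with NumPy on a random toy image; whether the paper tokenizes magnitudes, complex coefficients, or something else is not stated here, so treat the use of `np.abs` as an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.normal(size=(32, 32))
# Circular translation by 5 rows and -3 columns.
shifted = np.roll(image, shift=(5, -3), axis=(0, 1))

# A shift only changes the phase of each Fourier coefficient, not its magnitude.
mag = np.abs(np.fft.fft2(image))
mag_shifted = np.abs(np.fft.fft2(shifted))

print(np.allclose(mag, mag_shifted))  # compares the magnitude spectra
print(np.allclose(image, shifted))    # compares the raw pixel arrays
```

The invariance is exact only for circular shifts; translations that move content across the image boundary, or that are combined with cropping, would hold only approximately.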

Computational Complexity

Spectral ViT decouples image resolution from sequence length, as spectral tokenization compresses an image of $m$ voxels into $n$ spectral tokens ($n \ll m$). The complexity analysis demonstrates:

Spectral Tokenization: $\mathcal{O}(nm)$ for PCA basis; $\mathcal{O}(m \log m)$ for Fourier.
Transformer Layers: $\mathcal{O}(n^2)$ per layer, compared to $\mathcal{O}(m^2)$ for spatial patch-ViT where sequence length scales with resolution.

Because attention cost grows quadratically with sequence length, shrinking the sequence from a resolution-dependent length to $n$ tokens reduces the dominant cost, with the largest savings for volumetric medical imaging.
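A back-of-the-envelope count illustrates the scaling. The volume size, patch size, and token count below are hypothetical, chosen only to make the arithmetic easy to follow; the sketch also assumes a fixed patch size, so the spatial sequence length is $m$ divided by the voxels per patch rather than $m$ itself.

```python
def attention_pairs(seq_len: int) -> int:
    """Number of query-key pairs scored by one self-attention head."""
    return seq_len * seq_len

m = 64 ** 3      # voxels in a hypothetical 64 x 64 x 64 volume
patch = 8 ** 3   # voxels per hypothetical 8 x 8 x 8 patch
n = 32           # hypothetical number of spectral tokens

spatial_len = m // patch                   # number of patch tokens
spatial_cost = attention_pairs(spatial_len)
spectral_cost = attention_pairs(n)
projection_cost = n * m                    # multiply-adds for a dense PCA projection, O(nm)

print(spatial_len, spatial_cost, spectral_cost, projection_cost)
```

Doubling the resolution along each axis multiplies `spatial_len` by eight and `spatial_cost` by sixty-four, while `spectral_cost` stays fixed and only the one-time projection cost grows linearly in $m$.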

Empirical Results

The proposed spectral ViT is evaluated across simulated, public, and clinical datasets:

Pattern Classification: In low-sample, high-noise regimes, spectral ViTs substantially outperform spatial ViTs; spectral ViTs achieve competitive AUC with fewer than 1,000 samples, while the number of samples spatial ViTs need for parity grows as SNR falls.
Figure 3: Binary classification differentiates pure noise from patterns with increasing SNR, illustrating superior spectral ViT discrimination.

Figure 4: The crossover sample size required for spatial ViT to match spectral ViT performance increases as SNR decreases.
Object Classification under Distribution Shift: With a spatially invariant Fourier basis, spectral ViT achieves perfect separation in cases where spatial ViT fails due to spurious positional correlations.
Figure 5: Under spurious spatial-label correlations, spatial ViT fails while spectral ViT (Fourier basis) provides perfect class separation.
IXI Sex Classification: On a public dataset with 566 MRI volumes, spectral ViT surpasses spatial ViT and matches U-net with attention while maintaining higher parameter efficiency.
Neurostimulation Candidate Prediction: Using clinical data, spectral ViT outperforms spatial and clinical transformer models in balanced accuracy and AUC. Saliency maps localize the superior cerebellar peduncle (SCP), a tract relevant for deep brain stimulation (DBS) outcomes.
Figure 6: QSM and saliency maps show SCP localization by spectral ViT, whereas spatial methods highlight less clinically relevant regions.

Limitations

The spectral ViT’s reliance on a fixed linear basis (such as PCA) introduces bias and spatial variance unless alternative bases like Fourier are used for invariance. Registration is required for PCA-based tokenization. Degenerate spectral modes and task-irrelevant components may arise but are mitigated by attention’s feature selection capability. In abundant data settings, learned embeddings may outperform spectral approaches. The architecture inherits artifacts characteristic of the underlying spectral transform, including noise sensitivity and aliasing.

Implications and Future Directions

Spectral ViT presents a viable path to practical transformer deployment in medical imaging and other domains with limited data, improving efficiency and interpretability by leveraging global basis representations. Theoretical advantages in signal compaction and optimal low-rank approximation position spectral tokenization as useful for rapid prototyping, data-scarce tasks, and interpretable modeling.

The architecture can be generalized to alternate spectral bases, providing flexibility for diverse topological or invariance requirements. Basis construction and hybrid approaches with learned embeddings are promising areas for future development. Integration with transfer learning and domain adaptation approaches is crucial for clinical translation. Further research should investigate spectral ViT performance in large-scale settings and its robustness to real-world dataset shifts.

Conclusion

The spectral ViT reduces parameter burden and data requirements by encoding input images through spectral decomposition, leveraging variance hierarchies and spectral positional embeddings for efficient tokenization. Empirical results demonstrate substantially improved performance in low-data and distribution-shifted regimes compared to spatial ViTs and other compact models. The approach offers a theoretically principled and computationally efficient paradigm, with practical implications for clinical imaging and future foundation model architectures.
