
DCT-Former: Efficiency via Discrete Cosine Transform

Updated 9 February 2026
  • The paper demonstrates that integrating DCT in transformer architectures significantly reduces computational complexity by compressing attention with low-frequency projections.
  • DCT-Former extends to vision and medical imaging, achieving state-of-the-art accuracy (e.g., 96% in breast cancer classification) with substantially lower computational costs.
  • Alternative instantiations, including Dynamic Clone Transformer and DEFormer, highlight the method’s flexibility in reducing FLOPs and improving image enhancement across tasks.

The term DCT-Former encompasses a set of neural network architectures that incorporate the Discrete Cosine Transform (DCT) into the transformer or convolutional neural network (CNN) paradigm. These models leverage DCT’s ability to compact energy and reduce dimensionality, targeting the high computational costs endemic to fully-attentive transformers, particularly for long sequences or high-resolution images. Across the literature, DCT-Former models are instantiated in domains including natural language processing, computer vision, low-light enhancement, and histopathology, with architectural variants reflecting different strategies for DCT integration (Scribano et al., 2022, Ye, 2021, Ranjbar et al., 2024, Yin et al., 2023).

1. DCT-Former in Transformer Architectures

The initial DCT-Former implementation provides an approximation of standard self-attention by projecting input sequences into a lower-dimensional frequency domain using the Type-II DCT. This method exploits the DCT basis to concentrate most of the signal variance into a small number of low-frequency coefficients. The process involves three key stages:

  1. Compression: Given input $X \in \mathbb{R}^{n \times d}$, compute the DCT basis $D \in \mathbb{R}^{n \times n}$ and retain only the first $\bar{n} \ll n$ rows ($\bar{D}$), yielding a low-frequency compressed representation $\bar{X} = \bar{D} X$.
  2. Attention in Compressed Domain: Compute queries, keys, and values in the compressed space, forming an attention matrix $\bar{A} = \mathrm{softmax}(\bar{Q} \bar{K}^T / \sqrt{d})$ of size $\bar{n} \times \bar{n}$. Apply this to $\bar{V}$ to obtain $\bar{Y}$.
  3. Reconstruction: Map back to the original sequence length via the transpose DCT basis: $Y_{\mathrm{DCT}} = \bar{D}^T \bar{Y}$.

By avoiding explicit construction and storage of the full $n \times n$ attention matrix, this pipeline reduces time and memory cost from $O(n^2)$ to roughly $O(n \log n + \bar{n}^2)$ (using FFT-based DCT), with empirical evidence for 65–80% reductions in inference memory and latency for long sequences (Scribano et al., 2022).
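
The three-stage pipeline above can be sketched in a few lines of NumPy/SciPy. This is an illustrative single-head sketch, not the authors' implementation; the weight matrices `Wq`, `Wk`, `Wv` and the toy dimensions are assumptions for demonstration.

```python
import numpy as np
from scipy.fft import dct, idct

def dct_former_attention(X, Wq, Wk, Wv, n_bar):
    """Approximate self-attention in a truncated DCT domain (sketch).

    X: (n, d) input sequence; n_bar << n low-frequency rows are kept.
    """
    n, d = X.shape
    # 1. Compression: type-II DCT along the sequence axis, keep n_bar rows.
    X_bar = dct(X, type=2, axis=0, norm="ortho")[:n_bar]       # (n_bar, d)
    # 2. Attention entirely in the compressed domain.
    Q, K, V = X_bar @ Wq, X_bar @ Wk, X_bar @ Wv
    A = np.exp(Q @ K.T / np.sqrt(d))
    A /= A.sum(axis=1, keepdims=True)                          # (n_bar, n_bar)
    Y_bar = A @ V                                              # (n_bar, d)
    # 3. Reconstruction: since D is orthonormal, applying D^T is the
    # inverse DCT of the zero-padded compressed output.
    Y_pad = np.zeros((n, d))
    Y_pad[:n_bar] = Y_bar
    return idct(Y_pad, type=2, axis=0, norm="ortho")           # (n, d)

rng = np.random.default_rng(0)
n, d, n_bar = 128, 16, 16
X = rng.standard_normal((n, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
Y = dct_former_attention(X, Wq, Wk, Wv, n_bar)
print(Y.shape)  # (128, 16)
```

Note that the attention matrix here is only $\bar{n} \times \bar{n}$, so the full $n \times n$ matrix is never materialized.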

2. DCT-Former in Lightweight Vision and Medical Transformers

DCT-HistoTransformer (also referred to as DCT-Former in (Ranjbar et al., 2024)) extends this principle to vision domains, particularly high-resolution histopathological image analysis. Each block comprises two parallel branches:

  • DCT-Attention Branch: Performs per-channel 2D-DCT, low-pass filtering (retaining only an $M/r \times N/r$ block for $M \times N$ inputs), followed by multi-head self-attention in the low-frequency domain. The output is re-mapped via 2D inverse DCT and (if necessary) upsampled.
  • MobileConv Branch: Applies pointwise and depthwise convolutions to operate in the spatial domain, complementing the global modeling of the DCT-Attention branch.

Summation or concatenation fuses the two streams, providing both local and global context with a drastically lower attention token count and associated computational cost. Evaluation on BreaKHis yields SOTA-comparable accuracy: 96.00 ± 0.48 % (binary) and 87.85 ± 0.93 % (multiclass) breast cancer classification, with computational burden reduced relative to vanilla ViTs or ResNet baselines (Ranjbar et al., 2024). The aggressive frequency-domain truncation exploits DCT's energy compaction, preserving the essential structure for classification while discarding noise and high-frequency details.

Model                   Accuracy (%)
VGG16                   90.00
ResNet-50               92.70
Vanilla ViT             84.10
Swin-Transformer        84.30
DCT-HistoTransformer    96.00
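
The token reduction in the DCT-Attention branch can be sketched as follows. This is a minimal illustration of the 2D-DCT low-pass truncation and its inverse, under assumed shapes; the attention step between the two functions is omitted.

```python
import numpy as np
from scipy.fft import dctn, idctn

def dct_lowpass_tokens(feat, r):
    """Per-channel 2D-DCT low-pass: keep the top-left (M//r, N//r) block.

    feat: (C, M, N) feature map; r: spatial reduction ratio.
    Returns the low-frequency block and its flattened token view.
    """
    C, M, N = feat.shape
    m, n = M // r, N // r
    F = dctn(feat, type=2, axes=(1, 2), norm="ortho")
    low = F[:, :m, :n]                    # (C, m, n) low-frequency block
    tokens = low.reshape(C, -1).T         # (m*n, C) tokens for attention
    return low, tokens

def dct_lowpass_reconstruct(low, M, N):
    """Map the (attended) low-frequency block back to (C, M, N)."""
    C, m, n = low.shape
    F = np.zeros((C, M, N))
    F[:, :m, :n] = low                    # zero-pad high frequencies
    return idctn(F, type=2, axes=(1, 2), norm="ortho")

feat = np.random.default_rng(1).standard_normal((8, 32, 32))
low, tokens = dct_lowpass_tokens(feat, r=4)
print(tokens.shape)   # (64, 8): r=4 gives 16x fewer spatial tokens
recon = dct_lowpass_reconstruct(low, 32, 32)
print(recon.shape)    # (8, 32, 32)
```

With $r = 4$, self-attention operates over 16× fewer tokens, which is the source of the quadratic savings reported above.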

3. Alternative "DCT-Former" Instantiations: Dynamic Clone Transformer

Not all models titled "DCT-Former" (or similar) directly leverage the Discrete Cosine Transform. The Dynamic Clone Transformer (DCT) instead references a "dual-branch" unit for inexpensive channel expansion within CNNs (Ye, 2021). In this context:

  • Clone Generation Branch: Replicates a low-dimensional feature map pp times (no parameters).
  • Difference Vector Branch: Employs a Squeeze-and-Excitation-style recalibrator on the global pooled feature to provide additive channel-wise modulation.

Fusion of these branches allows replacement of expensive pointwise convolutions (PWC) in bottleneck blocks, permitting $>3\times$ FLOPs reduction for channel expansion, with minimal parameter or compute overhead. While effective for resource-constrained or mobile vision backbones, this approach is structurally distinct from frequency-domain DCT-based self-attention and should not be conflated with it (Ye, 2021).
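
The dual-branch expansion can be sketched as below. This is a schematic of the idea, not the paper's implementation; the squeeze-excitation weight shapes (`W1`, `W2`) and the `tanh` activation are illustrative assumptions.

```python
import numpy as np

def dynamic_clone_expand(x, p, W1, W2):
    """Dual-branch channel expansion: clone branch + SE-style difference branch.

    x: (C, H, W) feature map; p: clone factor (output has p*C channels).
    W1: (C, C//4) and W2: (C//4, p*C) are assumed excitation weights.
    """
    C, H, W = x.shape
    # Clone branch: parameter-free replication of the input channels.
    clones = np.tile(x, (p, 1, 1))                  # (p*C, H, W)
    # Difference branch: global average pool (squeeze), then a small
    # two-layer excitation producing a channel-wise modulation vector.
    s = x.mean(axis=(1, 2))                         # (C,)
    d = np.tanh(np.maximum(s @ W1, 0) @ W2)         # (p*C,)
    # Fusion: additive channel-wise modulation of the clones.
    return clones + d[:, None, None]

rng = np.random.default_rng(2)
C, H, W, p = 16, 8, 8, 4
x = rng.standard_normal((C, H, W))
W1 = rng.standard_normal((C, C // 4)) * 0.1
W2 = rng.standard_normal((C // 4, p * C)) * 0.1
y = dynamic_clone_expand(x, p, W1, W2)
print(y.shape)  # (64, 8, 8)
```

Because the clone branch has no parameters and the excitation operates on a pooled vector, the cost is far below a pointwise convolution producing the same $p \times C$ output channels.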

4. DCT-Frequency Modeling in Low-Light and Enhancement Transformers

DEFormer (described as a DCT-driven enhancement transformer in (Yin et al., 2023)) integrates the DCT domain into transformers for enhancement tasks through an explicit frequency branch:

  • Learnable Frequency Branch (LFB): Images are patched and transformed via 2D-DCT. Features are split by frequency band, and a curvature-based enhancement further increases discriminative capacity.
  • Cross-Domain Fusion (CDF): Combines frequency and RGB domain features via channel-wise gating and spatial attention.
  • Transformer Blocks: Feed both RGB and frequency features through transformers for context integration.
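
The patch-wise DCT and frequency-band split in the LFB can be sketched as follows. The patch size and the diagonal band-split rule (`i + j < patch`) are illustrative assumptions; the curvature-based enhancement and fusion stages are omitted.

```python
import numpy as np
from scipy.fft import dctn

def patch_dct_bands(img, patch=8):
    """Patch-wise 2D-DCT with a low/high frequency band split (LFB sketch).

    img: (H, W) image with H and W divisible by `patch`.
    Returns low- and high-frequency coefficient maps per patch.
    """
    H, W = img.shape
    ph, pw = H // patch, W // patch
    # Tile into (ph, pw, patch, patch) patches and DCT each patch.
    patches = img.reshape(ph, patch, pw, patch).transpose(0, 2, 1, 3)
    coefs = dctn(patches, type=2, axes=(2, 3), norm="ortho")
    # Band split: coefficients on the low-index anti-diagonal are "low".
    i, j = np.meshgrid(np.arange(patch), np.arange(patch), indexing="ij")
    low_mask = (i + j) < patch
    low = coefs * low_mask
    high = coefs * ~low_mask
    return low, high

img = np.random.default_rng(3).standard_normal((64, 64))
low, high = patch_dct_bands(img)
print(low.shape)  # (8, 8, 8, 8): per-patch low-frequency coefficients
```

Splitting the coefficients by band lets the downstream blocks treat smooth illumination structure and fine texture separately, which is the intuition behind the LFB.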

Empirical evidence demonstrates substantial gains in perceptual metrics (PSNR/SSIM) and improved downstream detection performance, illustrating the efficacy of "RGB + DCT" hybrid pipelines for challenging, information-sparse settings (Yin et al., 2023).

5. Efficiency Trade-offs, Empirical Results, and Limitations

The principal empirical findings for standard DCT-Former (in NLP/vision) include:

  • Efficiency: Up to 80% reduction in GPU memory and over 65% speedup in attention block inference (e.g., 4096-token batch).
  • Accuracy vs. Efficiency: Slight decrease in task performance vs. full-attention models (3% drop in F1 for IMDb sentiment analysis), but superior to Linformer and on par with Nyströmformer with comparable memory profiles (Scribano et al., 2022).
  • Domain Adaptation: Extensible to 2D/3D-DCT for video or medical images (Scribano et al., 2022, Ranjbar et al., 2024).

Model           F1 Score   Memory (MB)   Latency (ms)
Vanilla Attn    0.90       1250          45.6
DCT-Former      0.87       326           15.8
Linformer       0.82       —             —
Nyströmformer   0.88       —             —

Limitations include relaxation error due to the non-commutativity of softmax and DCT, the choice of compression rank $\bar{n}$, and absence of ablation between spatial and frequency branches in some vision models. Adaptive or learnable selection of DCT coefficients may further optimize performance.

6. Theoretical and Practical Implications

DCT-Former models validate that frequency-domain compression is an effective, interpretable strategy for addressing quadratic self-attention bottlenecks. The technique's parameter-free, globally linear structure is amenable to hardware optimizations (e.g., FFT-based DCT), and is particularly suitable for edge deployment, long-sequence modeling, and image domains requiring efficient global context aggregation. Its modularity also enables integration with other transformer efficiency paradigms, including locality-sensitive attention and hybrid CNN-ViT architectures (Scribano et al., 2022, Ranjbar et al., 2024).

A plausible implication is that DCT-based token reduction may serve as a generic pre-attention compression step in many transformer and vision architectures, especially where critical information is predominantly low-frequency.

7. Future Directions

Suggested avenues for further research include:

  • Softmax-DCT Commutativity: Designing mechanisms to further reduce relaxation error by better aligning attention and frequency projections.
  • Extension to Multi-Modality: Applying 2D/3D DCT-Former blocks in video, multimodal, or volumetric imagery.
  • Adaptive Compression: Learning which DCT coefficients to retain during training for an optimal trade-off between speed, memory, and task performance.
  • Fusion with Local Attention: Integrating DCT-global and local (windowed/neighbor) attention for very long or structured sequences.

These directions aim to combine the efficiency of DCT-based compression with robustness and expressivity for diverse domains.


References:

  • Scribano et al., 2022. DCT-Former: Efficient Self-Attention with Discrete Cosine Transform.
  • Ranjbar et al., 2024. DCT-HistoTransformer: Efficient Lightweight Vision Transformer with DCT Integration for Histopathological Image Analysis.
  • Ye, 2021. Dynamic Clone Transformer for Efficient Convolutional Neural Networks.
  • Yin et al., 2023. DEFormer: DCT-driven Enhancement Transformer for Low-light Image and Dark Vision.
