DCT-Based Decorrelated Attention for Vision Transformers (2405.13901v2)

Published 22 May 2024 in cs.CV, cs.LG, and eess.SP

Abstract: Central to the effectiveness of Transformer architectures is the self-attention mechanism, a function that maps queries, keys, and values into a high-dimensional vector space. However, training the attention weights of queries, keys, and values from random initialization is non-trivial. In this paper, we propose two methods. (i) We first address the initialization problem of Vision Transformers by introducing a simple yet innovative initialization approach utilizing Discrete Cosine Transform (DCT) coefficients. The proposed DCT-based attention initialization yields a significant gain over traditional initialization strategies, offering a robust foundation for the attention mechanism. Our experiments show that DCT-based initialization improves the classification accuracy of Vision Transformers. (ii) We further observe that, since the DCT effectively decorrelates image information in the frequency domain, this decorrelation is useful for compression: it allows the quantization step to discard many of the higher-frequency components. Based on this observation, we propose a novel DCT-based compression technique for the attention function of Vision Transformers. Because high-frequency DCT coefficients usually correspond to noise, we truncate the high-frequency DCT components of the input patches. This compression reduces the size of the weight matrices for queries, keys, and values. While maintaining the same level of accuracy, our DCT-compressed Swin Transformers achieve a considerable reduction in computational overhead.
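For concreteness, the sketch below illustrates the two ideas from the abstract in PyTorch: an orthonormal DCT-II basis is built, its low-frequency columns seed the query/key/value projections in place of random initialization, and high-frequency DCT coefficients of the token embeddings are truncated before projection. This is a minimal sketch under stated assumptions, not the authors' implementation: the module name DCTAttentionSketch, the keep_ratio parameter, and the choice to apply the DCT along the channel dimension are all hypothetical.

    # Minimal sketch of DCT-based attention initialization and compression.
    # Hypothetical names/parameters; not the paper's released code.
    import math
    import torch
    import torch.nn as nn

    def dct_matrix(n: int) -> torch.Tensor:
        """Orthonormal DCT-II basis; row k is the k-th cosine basis vector."""
        k = torch.arange(n, dtype=torch.float32).unsqueeze(1)   # frequency index
        i = torch.arange(n, dtype=torch.float32).unsqueeze(0)   # sample index
        basis = torch.cos(math.pi * (2 * i + 1) * k / (2 * n)) * math.sqrt(2.0 / n)
        basis[0] /= math.sqrt(2.0)            # alpha(0) normalization
        return basis                          # (n, n), basis @ basis.T ~= I

    class DCTAttentionSketch(nn.Module):
        """(i) DCT-based init of Q/K/V; (ii) truncation of high-frequency DCT
        coefficients of the tokens before projection (illustrative layout)."""
        def __init__(self, dim: int, keep_ratio: float = 0.75):
            super().__init__()
            self.keep = max(1, int(dim * keep_ratio))   # low-frequency coeffs kept
            basis = dct_matrix(dim)
            self.register_buffer("dct", basis)          # analysis transform
            # Smaller projections act on the truncated coefficient vector.
            self.q = nn.Linear(self.keep, dim, bias=False)
            self.k = nn.Linear(self.keep, dim, bias=False)
            self.v = nn.Linear(self.keep, dim, bias=False)
            with torch.no_grad():
                # (i) seed the projections with DCT basis columns, not random values
                for proj in (self.q, self.k, self.v):
                    proj.weight.copy_(basis[:, : self.keep])

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (batch, tokens, dim) patch embeddings
            coeffs = x @ self.dct.T                # DCT along the channel dimension
            coeffs = coeffs[..., : self.keep]      # (ii) discard high frequencies
            q, k, v = self.q(coeffs), self.k(coeffs), self.v(coeffs)
            attn = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1]), dim=-1)
            return attn @ v                        # (batch, tokens, dim)

As a usage example, DCTAttentionSketch(dim=96)(torch.randn(2, 196, 96)) returns a (2, 196, 96) tensor; with keep_ratio=0.75 each of the three projection matrices shrinks from 96x96 to 96x72, the kind of weight-matrix reduction the abstract describes for the compressed Swin Transformers.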

Authors (6)
  1. Hongyi Pan (32 papers)
  2. Emadeldeen Hamdan (4 papers)
  3. Xin Zhu (38 papers)
  4. Koushik Biswas (31 papers)
  5. Ulas Bagci (154 papers)
  6. Ahmet Enis Cetin (33 papers)
