RGB no more: Minimally-decoded JPEG Vision Transformers (2211.16421v2)

Published 29 Nov 2022 in cs.CV and eess.IV

Abstract: Most neural networks for computer vision are designed to infer using RGB images. However, these RGB images are commonly encoded in JPEG before saving to disk; decoding them imposes an unavoidable overhead for RGB networks. Instead, our work focuses on training Vision Transformers (ViT) directly from the encoded features of JPEG. This way, we can avoid most of the decoding overhead, accelerating data loading. Existing works have studied this aspect, but they focus on CNNs. Due to how these encoded features are structured, CNNs require heavy modification to their architecture to accept such data. Here, we show that this is not the case for ViTs. In addition, we tackle data augmentation directly on these encoded features, which, to our knowledge, has not been explored in depth for training in this setting. With these two improvements -- ViT and data augmentation -- we show that our ViT-Ti model achieves up to 39.2% faster training and 17.9% faster inference with no accuracy loss compared to the RGB counterpart.
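To make the core idea concrete, below is a minimal sketch (not the authors' implementation) of how JPEG DCT coefficients could be turned into ViT patch tokens without ever reconstructing RGB pixels. It assumes the 8x8 DCT blocks have already been extracted from the JPEG bitstream (e.g., via a libjpeg wrapper, which is not shown), assumes 4:2:0 chroma subsampling, and uses illustrative names, shapes, and an embedding width of 192 (the ViT-Ti token dimension). The grouping of one 16x16-pixel patch into 2x2 luma blocks plus one Cb and one Cr block is an assumption for the sketch, not a claim about the paper's exact layout.

```python
# Minimal sketch: ViT patch embedding computed directly from JPEG DCT blocks.
# All shapes, names, and the 4:2:0 block grouping are illustrative assumptions.
import torch
import torch.nn as nn


class DCTPatchEmbed(nn.Module):
    """Map per-block DCT coefficients to a sequence of ViT token embeddings.

    Under 4:2:0 subsampling, one 16x16-pixel patch covers a 2x2 group of luma
    blocks plus one Cb and one Cr block: (4 + 1 + 1) * 64 = 384 coefficients.
    """

    def __init__(self, embed_dim: int = 192):
        super().__init__()
        self.proj = nn.Linear(6 * 64, embed_dim)

    def forward(self, y: torch.Tensor, cb: torch.Tensor, cr: torch.Tensor) -> torch.Tensor:
        # y:  (B, H/8,  W/8,  8, 8) luma DCT blocks
        # cb: (B, H/16, W/16, 8, 8) subsampled Cb DCT blocks
        # cr: (B, H/16, W/16, 8, 8) subsampled Cr DCT blocks
        B, hy, wy = y.shape[:3]
        hp, wp = hy // 2, wy // 2  # each patch spans 2x2 luma blocks

        # Group 2x2 luma blocks per patch and flatten their 64 coefficients each.
        y = y.reshape(B, hp, 2, wp, 2, 64).permute(0, 1, 3, 2, 4, 5)
        y = y.reshape(B, hp * wp, 4 * 64)
        cb = cb.reshape(B, hp * wp, 64)
        cr = cr.reshape(B, hp * wp, 64)

        tokens = torch.cat([y, cb, cr], dim=-1)  # (B, N, 384)
        return self.proj(tokens)                 # (B, N, embed_dim)


if __name__ == "__main__":
    # A 224x224 image: 28x28 luma blocks, 14x14 chroma blocks -> 196 tokens.
    y = torch.randn(1, 28, 28, 8, 8)
    cb = torch.randn(1, 14, 14, 8, 8)
    cr = torch.randn(1, 14, 14, 8, 8)
    print(DCTPatchEmbed(192)(y, cb, cr).shape)  # torch.Size([1, 196, 192])
```

Because the linear projection consumes the DCT coefficients directly, the inverse DCT, upsampling, and YCbCr-to-RGB conversion steps of full JPEG decoding can be skipped, which is the source of the data-loading speedup the abstract describes.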
