
Wavelet-Based Image Tokenizer for Vision Transformers (2405.18616v1)

Published 28 May 2024 in cs.CV

Abstract: Non-overlapping patch-wise convolution is the default image tokenizer for all state-of-the-art vision Transformer (ViT) models. Even though many ViT variants have been proposed to improve its efficiency and accuracy, little research on improving the image tokenizer itself has been reported in the literature. In this paper, we propose a new image tokenizer based on wavelet transformation. We show that ViT models with the new tokenizer achieve both higher training throughput and better top-1 precision for the ImageNet validation set. We present a theoretical analysis on why the proposed tokenizer improves the training throughput without any change to ViT model architecture. Our analysis suggests that the new tokenizer can effectively handle high-resolution images and is naturally resistant to adversarial attack. Furthermore, the proposed image tokenizer offers a fresh perspective on important new research directions for ViT-based model design, such as image tokens on a non-uniform grid for image understanding.
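To make the contrast with patch-wise tokenization concrete, the sketch below shows the general idea of a wavelet-based tokenizer: a one-level 2D Haar transform splits an image into LL, LH, HL, and HH subbands, and co-located coefficients are grouped into token vectors in place of the usual non-overlapping patch embedding. This is an illustrative minimal sketch, not the paper's actual design; the function names, the unnormalized averaging/differencing Haar variant, and the single-level, single-channel setup are all assumptions for clarity.

```python
# Illustrative sketch of a wavelet-based image tokenizer (assumed design,
# not the paper's exact method). Uses an unnormalized one-level 2D Haar
# transform on a single-channel image with even height and width.

def haar_rows(img):
    """One Haar step along rows: returns (lowpass, highpass) halves."""
    out_lo, out_hi = [], []
    for row in img:
        lo = [(row[2 * i] + row[2 * i + 1]) / 2 for i in range(len(row) // 2)]
        hi = [(row[2 * i] - row[2 * i + 1]) / 2 for i in range(len(row) // 2)]
        out_lo.append(lo)
        out_hi.append(hi)
    return out_lo, out_hi

def transpose(m):
    return [list(c) for c in zip(*m)]

def haar2d(img):
    """One-level 2D Haar transform -> (LL, LH, HL, HH) subbands."""
    lo, hi = haar_rows(img)                  # filter along rows
    ll, lh = haar_rows(transpose(lo))        # then along columns
    hl, hh = haar_rows(transpose(hi))
    return transpose(ll), transpose(lh), transpose(hl), transpose(hh)

def tokenize(img):
    """Group co-located coefficients from each subband into token vectors."""
    ll, lh, hl, hh = haar2d(img)
    h, w = len(ll), len(ll[0])
    return [[ll[i][j], lh[i][j], hl[i][j], hh[i][j]]
            for i in range(h) for j in range(w)]

# A 4x4 image yields 4 tokens, each a 4-dim coefficient vector; flat
# regions concentrate energy in the LL entry and leave detail entries ~0,
# which is the sparsity a wavelet tokenizer can exploit.
img = [[1, 1, 2, 2],
       [1, 1, 2, 2],
       [3, 3, 4, 4],
       [3, 3, 4, 4]]
tokens = tokenize(img)
```

In a real model these coefficient vectors would be projected to the Transformer width by a learned linear layer, just as patch embeddings are; the abstract's throughput claim rests on properties of the wavelet representation, not on this toy decomposition.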

