
Multi-Dimensional Hyena for Spatial Inductive Bias (2309.13600v1)

Published 24 Sep 2023 in cs.CV and cs.LG

Abstract: In recent years, Vision Transformers have attracted increasing interest from computer vision researchers. However, their advantage over CNNs is fully manifested only when they are trained on large datasets, mainly because the self-attention mechanism has a reduced inductive bias towards spatial locality. In this work, we present a data-efficient vision transformer that does not rely on self-attention. Instead, it employs a novel multi-axis generalization of the recent Hyena layer. We propose several alternative approaches for obtaining this generalization and examine their distinct trade-offs from both empirical and theoretical perspectives. Our empirical findings indicate that the proposed Hyena N-D layer boosts the performance of various Vision Transformer architectures, such as ViT, Swin, and DeiT, across multiple datasets. Furthermore, in the small-dataset regime, our Hyena-based ViT compares favorably with ViT variants from the recent literature that were specifically designed for the same challenge, i.e., working with small datasets or incorporating image-specific inductive biases into the self-attention mechanism. Finally, we show that a hybrid approach, with Hyena N-D in the first layers of a ViT followed by layers that incorporate conventional attention, consistently boosts the performance of various vision transformer architectures.
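The core idea is an attention-free token mixer: a Hyena-style recurrence of element-wise gating and global convolutions, applied over both spatial axes of the patch grid rather than over a flattened 1D sequence. Below is a minimal PyTorch sketch of one plausible 2D instance, assuming explicit full-size kernels realized via 2D FFTs; the paper's implicit filter parameterization and its alternative multi-axis variants are omitted, and all names (Hyena2D, in_proj, and so on) are illustrative, not the authors' code.

```python
# Minimal sketch of a 2D gated long-convolution block in the spirit of
# Hyena N-D. Assumptions: explicit learned global kernels (the paper uses
# implicit filters), circular convolution via the FFT convolution theorem,
# and a fixed patch-grid size chosen at construction time.
import torch
import torch.nn as nn
import torch.fft


class Hyena2D(nn.Module):
    def __init__(self, dim: int, height: int, width: int, order: int = 2):
        super().__init__()
        self.order = order
        # One projection yields the value branch plus `order` gating branches.
        self.in_proj = nn.Linear(dim, dim * (order + 1))
        self.out_proj = nn.Linear(dim, dim)
        # Simplification: one explicit global kernel per recurrence step.
        self.filters = nn.Parameter(torch.randn(order, dim, height, width) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, height, width, dim) -- a 2D grid of patch embeddings.
        b, h, w, d = x.shape
        branches = self.in_proj(x).chunk(self.order + 1, dim=-1)
        v, gates = branches[0], branches[1:]
        z = v.permute(0, 3, 1, 2)  # (b, d, h, w) so FFTs act on spatial dims
        for i, g in enumerate(gates):
            # Global (circular) 2D convolution via pointwise multiplication
            # in the frequency domain, independently per channel.
            zf = torch.fft.rfft2(z, s=(h, w))
            kf = torch.fft.rfft2(self.filters[i], s=(h, w))
            z = torch.fft.irfft2(zf * kf, s=(h, w))
            # Element-wise gating, as in the Hyena recurrence.
            z = z * g.permute(0, 3, 1, 2)
        return self.out_proj(z.permute(0, 2, 3, 1))


# Example: a drop-in replacement for attention over a 14x14 patch grid.
layer = Hyena2D(dim=192, height=14, width=14, order=2)
tokens = torch.randn(8, 14, 14, 192)
out = layer(tokens)  # (8, 14, 14, 192)
```

In the hybrid configuration the abstract describes, blocks like this would replace the attention sub-layers in the first few transformer blocks, with standard multi-head attention retained at the remaining depth.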

Authors (2)
  1. Itamar Zimerman (17 papers)
  2. Lior Wolf (217 papers)
Citations (3)