Dimension Mixer: Group Mixing of Input Dimensions for Efficient Function Approximation (2311.18735v3)

Published 30 Nov 2023 in cs.LG

Abstract: The recent success of multiple neural architectures such as CNNs, Transformers, and MLP-Mixers motivated us to look for similarities and differences between them. We found that these architectures can be interpreted through the lens of a general concept of dimension mixing. Research on coupling flows and the butterfly transform shows that partial and hierarchical signal mixing schemes are sufficient for efficient and expressive function approximation. In this work, we study group-wise sparse, non-linear, multi-layered, and learnable mixing schemes of inputs and find that they are complementary to many standard neural architectures. Following our observations and drawing inspiration from the Fast Fourier Transform, we generalize the Butterfly Structure to use a non-linear mixer function, allowing an MLP to serve as the mixing function; we call this Butterfly MLP. We also mix sparsely along the sequence dimension of Transformer-based architectures, which we call Butterfly Attention. Experiments on the CIFAR and LRA datasets demonstrate that the proposed Non-Linear Butterfly Mixers are efficient and scale well when the host architectures are used as the mixing function. Additionally, we propose the Patch-Only MLP-Mixer for processing spatial 2D signals, demonstrating a different dimension-mixing strategy.
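
The butterfly mixing idea in the abstract is concrete enough to sketch. Below is a minimal, hypothetical PyTorch layer in the spirit of Butterfly MLP: at each stage, feature dimensions spaced a power of the radix apart are grouped and mixed by a small MLP, so that after log_radix(D) stages every output can depend on every input. The class name ButterflyMLP, the radix/hidden parameters, the per-stage shared MLP, and the residual connection are illustrative assumptions, not the paper's implementation; applying the same grouping along the sequence axis, with attention as the group mixer, corresponds to what the abstract calls Butterfly Attention.

```python
import math

import torch
import torch.nn as nn


class ButterflyMLP(nn.Module):
    """Sketch of non-linear butterfly mixing over the feature dimension.

    Stage s groups the D dimensions into blocks of `radix` entries spaced
    radix**s apart and mixes each block with a small MLP. After
    log_radix(D) stages every output can depend on every input, mirroring
    FFT butterfly connectivity with learnable, non-linear mixers.
    Names and hyperparameters here are illustrative, not the paper's API.
    """

    def __init__(self, dim: int, radix: int = 2, hidden: int = 8):
        super().__init__()
        n_stages = round(math.log(dim, radix))
        assert radix ** n_stages == dim, "dim must be a power of radix in this sketch"
        self.dim, self.radix, self.n_stages = dim, radix, n_stages
        # one small MLP per stage, shared by all groups within that stage
        self.mixers = nn.ModuleList([
            nn.Sequential(nn.Linear(radix, hidden), nn.GELU(), nn.Linear(hidden, radix))
            for _ in range(n_stages)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, dim)
        b = x.shape[0]
        for s, mlp in enumerate(self.mixers):
            stride = self.radix ** s
            # view as (batch, blocks, radix, stride): entries along the radix
            # axis are exactly `stride` positions apart in the flat layout
            x = x.reshape(b, self.dim // (self.radix * stride), self.radix, stride)
            x = x.transpose(2, 3)          # move the radix axis last for the MLP
            x = x + mlp(x)                 # residual, non-linear group mixing
            x = x.transpose(2, 3).reshape(b, self.dim)
        return x


# toy usage: 16 features with a radix-2 butterfly gives 4 mixing stages
layer = ButterflyMLP(dim=16, radix=2)
out = layer(torch.randn(8, 16))
print(out.shape)  # torch.Size([8, 16])
```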

Authors (2)
  1. Suman Sapkota (8 papers)
  2. Binod Bhattarai (60 papers)