NiNformer: A Network in Network Transformer with Token Mixing as a Gating Function Generator (2403.02411v5)

Published 4 Mar 2024 in cs.CV and cs.LG

Abstract: The attention mechanism is the main component of the transformer architecture, and since its introduction it has driven significant advances in deep learning across many domains and tasks. In computer vision, the attention mechanism was adopted in the Vision Transformer (ViT), and its use has expanded to many vision tasks, such as classification, segmentation, object detection, and image generation. While this mechanism is very expressive and capable, it has the drawbacks of being computationally expensive and requiring datasets of considerable size for effective optimization. To address these shortcomings, many designs have been proposed in the literature to reduce the computational burden and alleviate the data-size requirements; examples in the vision domain include the MLP-Mixer, the Conv-Mixer, and the Perceiver-IO. This paper introduces a new computational block as an alternative to the standard ViT block that reduces the computational burden by replacing the standard attention layers with a Network-in-Network structure, enhancing the static approach of the MLP-Mixer with a dynamic system that learns an element-wise gating function through a token-mixing process. Extensive experimentation shows that the proposed design outperforms the baseline architectures on multiple datasets in the image classification task.
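The core idea the abstract describes — a token-mixing operation whose output serves as an element-wise gate on the input, replacing attention — can be sketched roughly as follows. This is a minimal NumPy illustration of the general mechanism only, not the authors' exact architecture: the shapes, weight names, sigmoid gate, and ReLU channel MLP are all assumptions made for brevity.

```python
import numpy as np

def nin_gated_block(x, W_mix, W1, W2):
    """Rough sketch of a gated token-mixing block (hypothetical shapes/names).

    x:     (tokens, channels) input sequence
    W_mix: (tokens, tokens)   token-mixing weights across positions
    W1:    (channels, hidden) channel-MLP weights
    W2:    (hidden, channels)
    """
    # Token mixing: combine information across token positions,
    # as in the MLP-Mixer, but use the result as a gate rather
    # than as the output itself.
    mixed = W_mix @ x                         # (tokens, channels)
    gate = 1.0 / (1.0 + np.exp(-mixed))       # sigmoid, element-wise

    # Dynamic element-wise gating of the input (the attention replacement).
    gated = x * gate                          # (tokens, channels)

    # Per-token channel MLP, as in a standard transformer block
    # (ReLU here for simplicity; GELU is common in practice).
    h = np.maximum(x @ W1, 0.0)               # (tokens, hidden)
    return gated + h @ W2                     # residual-style combination
```

Because the gate is computed from the tokens themselves, the mixing is input-dependent (dynamic), in contrast to the MLP-Mixer's fixed mixing weights, while still avoiding the quadratic cost of full self-attention over long sequences when the mixing operator is kept cheap.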

References (41)
  1. Brown, Tom B., Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, T. J. Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeff Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever and Dario Amodei. “Language Models are Few-Shot Learners.” (2020).
  2. Radford, Alec and Karthik Narasimhan. “Improving Language Understanding by Generative Pre-Training.” (2018).
  3. Touvron, Hugo, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave and Guillaume Lample. “LLaMA: Open and Efficient Foundation Language Models.” (2023).
  4. Penedo, Guilherme, Quentin Malartic, Daniel Hesslow, Ruxandra-Aimée Cojocaru, Alessandro Cappelli, Hamza Alobeidli, Baptiste Pannier, Ebtesam Almazrouei and Julien Launay. “The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only.” (2023).
  5. Jiang, Albert Qiaochu, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de Las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix and William El Sayed. “Mistral 7B.” (2023).
  6. Dosovitskiy, Alexey, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit and Neil Houlsby. “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale.” (2020).
  7. Tolstikhin, Ilya O., Neil Houlsby, Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Thomas Unterthiner, Jessica Yung, Daniel Keysers, Jakob Uszkoreit, Mario Lucic and Alexey Dosovitskiy. “MLP-Mixer: An all-MLP Architecture for Vision.” Neural Information Processing Systems (2021).
  8. Trockman, Asher and J. Zico Kolter. “Patches Are All You Need?” Trans. Mach. Learn. Res. 2023 (2022).
  9. Liu, Ze, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin and Baining Guo. “Swin Transformer: Hierarchical Vision Transformer using Shifted Windows.” 2021 IEEE/CVF International Conference on Computer Vision (ICCV) (2021).
  10. Carion, Nicolas, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov and Sergey Zagoruyko. “End-to-End Object Detection with Transformers.” (2020).
  11. Jaegle, Andrew, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch, Catalin Ionescu, David Ding, Skanda Koppula, Andrew Brock, Evan Shelhamer, Olivier J. Hénaff, Matthew M. Botvinick, Andrew Zisserman, Oriol Vinyals and João Carreira. “Perceiver IO: A General Architecture for Structured Inputs & Outputs.” (2021).
  12. Lu, Jiasen, Christopher Clark, Rowan Zellers, Roozbeh Mottaghi and Aniruddha Kembhavi. “Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks.” (2022).
  13. Zhang, Hao, Feng Li, Shilong Liu, Lei Zhang, Hang Su, Jun-Juan Zhu, Lionel Ming-shuan Ni and Heung-yeung Shum. “DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection.” (2022).
  14. Kirillov, Alexander, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár and Ross B. Girshick. “Segment Anything.” 2023 IEEE/CVF International Conference on Computer Vision (ICCV) (2023).
  15. Wang, Sinong, Belinda Z. Li, Madian Khabsa, Han Fang and Hao Ma. “Linformer: Self-Attention with Linear Complexity.” (2020).
  16. Lee-Thorp, James, Joshua Ainslie, Ilya Eckstein and Santiago Ontañón. “FNet: Mixing Tokens with Fourier Transforms.” (2021).
  17. Li, Yawei, K. Zhang, Jie Cao, Radu Timofte and Luc Van Gool. “LocalViT: Bringing Locality to Vision Transformers.” (2021).
  18. Tu, Zhengzhong, Hossein Talebi, Han Zhang, Feng Yang, Peyman Milanfar, Alan Conrad Bovik and Yinxiao Li. “MaxViT: Multi-Axis Vision Transformer.” European Conference on Computer Vision (2022).
  19. Xiong, Yunyang, Zhanpeng Zeng, Rudrasis Chakraborty, Mingxing Tan, Glenn Moo Fung, Yin Li and Vikas Singh. “Nyströmformer: A Nyström-Based Algorithm for Approximating Self-Attention.” Proceedings of the AAAI Conference on Artificial Intelligence 35 (16) (2021).
  20. Keles, Feyza Duman, Pruthuvi Maheshakya Wijewardena and Chinmay Hegde. “On The Computational Complexity of Self-Attention.” International Conference on Algorithmic Learning Theory (2022).
  21. Lin, Tianyang, Yuxin Wang, Xiangyang Liu and Xipeng Qiu. “A Survey of Transformers.” AI Open 3 (2021).
  22. Tay, Yi, Mostafa Dehghani, Dara Bahri and Donald Metzler. “Efficient Transformers: A Survey.” ACM Computing Surveys 55 (2020).
  23. Fournier, Quentin, Gaétan Marceau Caron and Daniel Aloise. “A Practical Survey on Faster and Lighter Transformers.” ACM Computing Surveys 55 (2021).
  24. Khan, Salman Hameed, Muzammal Naseer, Munawar Hayat, Syed Waqas Zamir, Fahad Shahbaz Khan and Mubarak Shah. “Transformers in Vision: A Survey.” ACM Computing Surveys (CSUR) 54 (2021).
  25. Guo, Qipeng, Xipeng Qiu, Pengfei Liu, Yunfan Shao, X. Xue and Zheng Zhang. “Star-Transformer.” (2019).
  26. Beltagy, Iz, Matthew E. Peters and Arman Cohan. “Longformer: The Long-Document Transformer.” (2020).
  27. Kitaev, Nikita, Lukasz Kaiser and Anselm Levskaya. “Reformer: The Efficient Transformer.” (2020).
  28. Zaheer, Manzil, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontañón, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang and Amr Ahmed. “Big Bird: Transformers for Longer Sequences.” (2020).
  29. Katharopoulos, Angelos, Apoorv Vyas, Nikolaos Pappas and François Fleuret. “Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention.” International Conference on Machine Learning (2020).
  30. Choromanski, Krzysztof, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamás Sarlós, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, David Belanger, Lucy J. Colwell and Adrian Weller. “Rethinking Attention with Performers.” (2020).
  31. Li, Shiyang, Xiaoyong Jin, Yao Xuan, Xiyou Zhou, Wenhu Chen, Yu-Xiang Wang and Xifeng Yan. “Enhancing the Locality and Breaking the Memory Bottleneck of Transformer on Time Series Forecasting.” (2019).
  32. Qiu, Jiezhong, Hao Ma, Omer Levy, Scott Yih, Sinong Wang and Jie Tang. “Blockwise Self-Attention for Long Document Understanding.” (2019).
  33. Tay, Yi, Dara Bahri, Liu Yang, Donald Metzler and Da-Cheng Juan. “Sparse Sinkhorn Attention.” International Conference on Machine Learning (2020).
  34. Dai, Zihang, Zhilin Yang, Yiming Yang, Jaime G. Carbonell, Quoc V. Le and Ruslan Salakhutdinov. “Transformer-XL: Attentive Language Models beyond a Fixed-Length Context.” (2019).
  35. Vyas, Apoorv, Angelos Katharopoulos and François Fleuret. “Fast Transformers with Clustered Attention.” (2020).
  36. Zhang, Hang, Yeyun Gong, Yelong Shen, Weisheng Li, Jiancheng Lv, Nan Duan and Weizhu Chen. “Poolingformer: Long Document Modeling with Pooling Attention.” International Conference on Machine Learning (2021).
  37. Liu, Peter J., Mohammad Saleh, Etienne Pot, Ben Goodrich, Ryan Sepassi, Lukasz Kaiser and Noam M. Shazeer. “Generating Wikipedia by Summarizing Long Sequences.” (2018).
  38. Dai, Zihang, Guokun Lai, Yiming Yang and Quoc V. Le. “Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing.” (2020).
  39. Tay, Yi, Dara Bahri, Donald Metzler, Da-Cheng Juan, Zhe Zhao and Che Zheng. “Synthesizer: Rethinking Self-Attention for Transformer Models.” International Conference on Machine Learning (2020).
  40. Dauphin, Yann, Angela Fan, Michael Auli and David Grangier. “Language Modeling with Gated Convolutional Networks.” International Conference on Machine Learning (2016).
  41. LeCun, Yann, Léon Bottou, Yoshua Bengio and Patrick Haffner. “Gradient-based learning applied to document recognition.” Proc. IEEE 86 (1998).
Authors (2)
  1. Abdullah Nazhat Abdullah (3 papers)
  2. Tarkan Aydin (3 papers)