Activator: GLU Activation Function as the Core Component of a Vision Transformer (2405.15953v2)
Abstract: The transformer architecture is currently the main driver behind many successes across tasks addressed by deep learning, most notably the recent advances in NLP that culminated in large language models (LLMs). The architecture has also attracted widespread interest from computer vision (CV) researchers and practitioners, enabling many advances in vision-related tasks and opening the door to multi-task and multi-modal deep learning architectures that share the same principle of operation. One drawback of these architectures is their reliance on scaled dot-product attention with the softmax activation function, which is computationally expensive and demands large compute capabilities for both training and inference. This paper investigates replacing the attention mechanism typically adopted in transformers with a structure incorporating the gated linear unit (GLU) activation within a multi-layer perceptron (MLP), used in conjunction with the default MLP of the traditional transformer design. As a further step, the paper eliminates the second, non-gated MLP to reduce the computational cost. The experimental assessments conducted in this work show that both the proposed modification and the reduction achieve competitive performance relative to baseline architectures, supporting the aim of establishing a more efficient yet capable alternative to the traditional attention mechanism as the core component of transformer architectures.
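As a concrete illustration of the idea described in the abstract, below is a minimal PyTorch sketch of a transformer-style block in which the attention sublayer is replaced by a GLU-gated MLP that mixes information across tokens. The class names (`GLUMixer`, `ActivatorBlock`), the sigmoid gate, the hidden width, and the pre-norm residual layout are illustrative assumptions for this sketch, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class GLUMixer(nn.Module):
    """Token mixing via a GLU: a linear path elementwise-gated by a sigmoid path,
    following the GLU form of Dauphin et al. (value * sigmoid(gate))."""
    def __init__(self, num_tokens: int, hidden: int):
        super().__init__()
        self.value = nn.Linear(num_tokens, hidden)
        self.gate = nn.Linear(num_tokens, hidden)
        self.proj = nn.Linear(hidden, num_tokens)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, channels) -> mix along the token axis
        x = x.transpose(1, 2)  # (batch, channels, tokens)
        x = self.proj(self.value(x) * torch.sigmoid(self.gate(x)))  # GLU gating
        return x.transpose(1, 2)  # back to (batch, tokens, channels)

class ActivatorBlock(nn.Module):
    """Transformer-style block with the attention sublayer replaced by GLUMixer.
    The second, non-gated channel MLP is omitted, mirroring the reduced variant."""
    def __init__(self, num_tokens: int, dim: int, hidden: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.mixer = GLUMixer(num_tokens, hidden)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Pre-norm residual connection, as in standard ViT blocks
        return x + self.mixer(self.norm(x))

# Usage example: a ViT-style patch sequence of 196 tokens with 384 channels
block = ActivatorBlock(num_tokens=196, dim=384, hidden=512)
out = block(torch.randn(2, 196, 384))  # shape preserved: (2, 196, 384)
```

Because the mixer's cost is linear in the number of tokens (two linear maps over the token axis) rather than quadratic as in scaled dot-product attention, this layout captures the efficiency motivation stated in the abstract.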
Authors: Abdullah Nazhat Abdullah, Tarkan Aydin