Conditional computation in neural networks: principles and research trends (2403.07965v2)
Abstract: This article summarizes principles and ideas from the emerging area of applying *conditional computation* methods to the design of neural networks. In particular, we focus on neural networks that can dynamically activate or deactivate parts of their computational graph conditioned on their input. Examples include the dynamic selection of input tokens, layers (or sets of layers), and sub-modules inside each layer (e.g., channels in a convolutional filter). We first provide a general formalism to describe these techniques in a uniform way. Then, we introduce three notable implementations of these principles: mixture-of-experts (MoE) networks, token selection mechanisms, and early-exit neural networks. The paper aims to provide a tutorial-like introduction to this growing field. To this end, we analyze the benefits of these modular designs in terms of efficiency, explainability, and transfer learning, with a focus on emerging application areas ranging from automated scientific discovery to semantic communication.
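To make the idea of input-dependent activation concrete, the sketch below implements a toy top-k mixture-of-experts layer in PyTorch: a small gating network scores the experts for each token, and only the top-k experts are evaluated, so most of the layer's computational graph stays inactive for any given input. This is a minimal illustration of the general principle under assumed names (`ToyMoELayer`, `num_experts`, `k`), not the formalism or any specific architecture from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoELayer(nn.Module):
    """Toy mixture-of-experts layer: a gating network scores all experts,
    but only the top-k experts per token are actually evaluated, so the
    active part of the computational graph depends on the input."""

    def __init__(self, dim: int, num_experts: int = 4, k: int = 2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(dim, num_experts)  # router (gating network)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (tokens, dim)
        scores = self.gate(x)                                # (tokens, num_experts)
        topk_scores, topk_idx = scores.topk(self.k, dim=-1)  # per-token expert choice
        weights = F.softmax(topk_scores, dim=-1)             # mix only the selected experts
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = topk_idx[:, slot] == e                # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

# Usage: 8 tokens of dimension 16, 4 experts, 2 active experts per token.
layer = ToyMoELayer(dim=16)
print(layer(torch.randn(8, 16)).shape)  # torch.Size([8, 16])
```

The same input-conditioned gating pattern underlies the token selection and early-exit mechanisms discussed in the paper: a lightweight decision module determines, per input, which parts of the network are executed.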