
Multilinear Mixture of Experts: Scalable Expert Specialization through Factorization (2402.12550v4)

Published 19 Feb 2024 in cs.CV and cs.LG

Abstract: The Mixture of Experts (MoE) paradigm provides a powerful way to decompose dense layers into smaller, modular computations often more amenable to human interpretation, debugging, and editability. However, a major challenge lies in the computational cost of scaling the number of experts high enough to achieve fine-grained specialization. In this paper, we propose the Multilinear Mixture of Experts ($\mu$MoE) layer to address this, focusing on vision models. $\mu$MoE layers enable scalable expert specialization by performing an implicit computation on prohibitively large weight tensors entirely in factorized form. Consequently, $\mu$MoEs (1) avoid the restrictively high inference-time costs of dense MoEs, yet (2) do not inherit the training issues of the popular sparse MoEs' discrete (non-differentiable) expert routing. We present both qualitative and quantitative evidence that scaling $\mu$MoE layers when fine-tuning foundation models for vision tasks leads to more specialized experts at the class-level, further enabling manual bias correction in CelebA attribute classification. Finally, we show qualitative results demonstrating the expert specialism achieved when pre-training large GPT2 and MLP-Mixer models with parameter-matched $\mu$MoE blocks at every layer, maintaining comparable accuracy. Our code is available at: https://github.com/james-oldfield/muMoE.

Enhanced Specialization and Interpretability in Vision Models with Multilinear Mixture of Experts

Introduction

The Mixture of Experts (MoE) architecture has been instrumental in advancing machine learning models by routing inputs to specialized sub-networks, or "experts", enabling more expressive and efficient computation. Despite the success of MoEs, scaling the number of experts to increase capacity and specialization remains difficult: dense MoEs incur high inference-time costs, while sparse MoEs suffer training instability from their discrete, non-differentiable routing, and both factors have limited how many experts can be used in practice. Addressing these challenges, this paper presents the Multilinear Mixture of Experts ($\mu$MoE) layer, engineered for scalable expert specialization in vision models through factorization of the expert weight tensor.
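For context, the sketch below shows a generic dense (soft-routed) MoE layer of the kind the paper contrasts against. It is an illustrative minimal example rather than any specific implementation; the module name `DenseMoE`, the softmax gate, and the use of plain linear experts are assumptions made for clarity. Note that every expert is evaluated for every input, so compute and parameter count grow linearly with the number of experts.

```python
# Minimal dense (soft-routed) MoE sketch: a gate assigns per-input weights to
# N expert linear maps, and the output is their weighted combination.
import torch
import torch.nn as nn


class DenseMoE(nn.Module):
    def __init__(self, d_model: int, n_experts: int):
        super().__init__()
        self.gate = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            [nn.Linear(d_model, d_model) for _ in range(n_experts)]
        )

    def forward(self, x):                                    # x: (batch, d_model)
        weights = torch.softmax(self.gate(x), dim=-1)        # (batch, n_experts)
        outputs = torch.stack([e(x) for e in self.experts], dim=1)  # (batch, N, d_model)
        return (weights.unsqueeze(-1) * outputs).sum(dim=1)  # weighted combination
```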

$\mu$MoE: A Path to Scalable Expert Specialization

$\mu$MoE layers leverage factorized weight tensors, enabling the implicit computation of very large numbers of experts without materializing dense weight matrices or resorting to non-differentiable routing. This design not only mitigates the computational expense associated with traditional MoE models but also fosters expert specialization, allowing tens of thousands of experts to operate within a tractable computational budget. Hierarchical $\mu$MoE variants additionally impose multi-level structure on the experts, making the layer well suited to complex, hierarchically organized data.
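To make the factorized computation concrete, here is a minimal sketch of a soft-routed MoE whose expert weight tensor is kept in a CP-style factorized form and never materialized. The class name `FactorizedMoE`, the softmax gate, the rank-$R$ parameterization, and the initialization are illustrative assumptions rather than the authors' implementation (see the linked muMoE repository for that).

```python
# Sketch of a factorized MoE forward pass. The implicit expert weight tensor
# W in R^{N x d_in x d_out}, with W_n = sum_r U[n, r] * outer(V[:, r], Z[:, r]),
# is represented only by its factors U (experts), V (input), Z (output).
import torch
import torch.nn as nn


class FactorizedMoE(nn.Module):
    def __init__(self, d_in: int, d_out: int, n_experts: int, rank: int):
        super().__init__()
        self.gate = nn.Linear(d_in, n_experts)               # soft expert coefficients
        self.U = nn.Parameter(0.02 * torch.randn(n_experts, rank))
        self.V = nn.Parameter(0.02 * torch.randn(d_in, rank))
        self.Z = nn.Parameter(0.02 * torch.randn(d_out, rank))

    def forward(self, x):                                    # x: (batch, d_in)
        a = torch.softmax(self.gate(x), dim=-1)              # (batch, n_experts), differentiable
        # y_b = sum_n a_{bn} W_n x_b collapses to
        # sum_r (a_b . U[:, r]) (x_b . V[:, r]) Z[:, r],
        # so the N x d_in x d_out tensor is never formed.
        ar = a @ self.U                                      # (batch, rank)
        xr = x @ self.V                                      # (batch, rank)
        return (ar * xr) @ self.Z.T                          # (batch, d_out)


layer = FactorizedMoE(d_in=64, d_out=32, n_experts=10_000, rank=16)
y = layer(torch.randn(8, 64))   # expert-weight cost scales with the rank, not with N * d_in * d_out
```

Because the expert coefficients enter through an ordinary softmax rather than a discrete top-k selection, the whole layer remains end-to-end differentiable, which is what lets the expert count grow without inheriting sparse MoE training issues.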

Empirical Validation

Through extensive experimentation, the $\mu$MoE architecture demonstrates clear gains in task modularity and expert specialization. Using qualitative visualizations alongside quantitative counterfactual interventions, the paper provides evidence that increasing the number of $\mu$MoE experts yields progressively more class-specialized experts on vision tasks. Specifically, $\mu$MoE-enhanced foundation models achieve competitive performance while offering a greater degree of interpretability and editability than conventional approaches.
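As one way to picture such a counterfactual intervention, the sketch below (reusing the hypothetical `FactorizedMoE` above, with an assumed `eval_fn` helper returning per-class accuracies) temporarily zeroes a single expert's factor row and measures the per-class effect. A drop concentrated on one class indicates class-level specialization; applying such an edit permanently is the same mechanism behind the manual bias corrections discussed next.

```python
# Hedged sketch of a counterfactual expert-ablation probe (not the paper's code).
import torch


def expert_ablation_effect(layer, expert_idx, eval_fn):
    """Return the per-class accuracy change caused by silencing one expert.

    `eval_fn(layer)` is an assumed helper returning a per-class accuracy tensor.
    """
    baseline = eval_fn(layer)
    saved = layer.U[expert_idx].detach().clone()
    with torch.no_grad():
        layer.U[expert_idx].zero_()          # this expert now contributes nothing
        ablated = eval_fn(layer)
        layer.U[expert_idx].copy_(saved)     # restore the original weights
    return baseline - ablated                # a large drop for one class suggests specialization
```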

Practical Implications and Future Applications

In practice, the $\mu$MoE model’s ability to decompose complex computations into understandable subtasks significantly aids in debugging, editing, and understanding model behavior. This is especially valuable for mitigating demographic biases, as demonstrated through manual expert-level corrections in CelebA attribute classification. Looking forward, the paper suggests that $\mu$MoE layers could serve as a foundational component for highly modular, interpretable, and efficient models across a broad spectrum of machine learning applications, extending beyond vision to domains such as natural language processing and multimodal learning.

Conclusion

The Multilinear Mixture of Experts layer addresses critical challenges in scaling MoE architectures, offering a path to finer-grained expert specialization without the inference-time cost of dense MoEs or the training difficulties of sparse, discretely routed ones. By demonstrating that $\mu$MoE layers promote interpretability, editability, and reduced demographic bias, this work contributes to the broader effort to build more comprehensible and controllable AI systems. As the field evolves, the $\mu$MoE framework is well placed to serve in settings where transparency and efficiency are both required.

Authors (8)
  1. James Oldfield (10 papers)
  2. Markos Georgopoulos (19 papers)
  3. Grigorios G. Chrysos (38 papers)
  4. Christos Tzelepis (24 papers)
  5. Yannis Panagakis (53 papers)
  6. Mihalis A. Nicolaou (17 papers)
  7. Jiankang Deng (96 papers)
  8. Ioannis Patras (73 papers)