Patch-level Routing in Mixture-of-Experts is Provably Sample-efficient for Convolutional Neural Networks (2306.04073v1)
Abstract: In deep learning, mixture-of-experts (MoE) activates one or a few experts (sub-networks) on a per-sample or per-token basis, resulting in significant computation reduction. The recently proposed patch-level routing in MoE (pMoE) divides each input into $n$ patches (or tokens) and sends $l$ patches ($l\ll n$) to each expert through prioritized routing. pMoE has demonstrated great empirical success in reducing training and inference costs while maintaining test accuracy. However, a theoretical explanation of pMoE, and of MoE in general, remains elusive. Focusing on a supervised classification task using a mixture of two-layer convolutional neural networks (CNNs), we show for the first time that pMoE provably reduces the number of training samples required to achieve desirable generalization (referred to as the sample complexity) by a factor polynomial in $n/l$, and that it outperforms its single-expert counterpart of the same or even larger capacity. The advantage results from the discriminative routing property, which we justify in both theory and practice: pMoE routers can filter out label-irrelevant patches and route similar class-discriminative patches to the same expert. Our experimental results on MNIST, CIFAR-10, and CelebA support our theoretical findings on pMoE's generalization and show that pMoE can avoid learning spurious correlations.
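To make the routing mechanism concrete, below is a minimal PyTorch-style sketch of patch-level prioritized routing over a mixture of two-layer experts, where each expert receives only its top-$l$ of the $n$ patches. The class and parameter names (`PatchMoE`, `patches_per_expert`), the patch-wise linear form of the experts, and the softmax normalization of the router scores are illustrative assumptions, not the paper's exact architecture or code.

```python
# Minimal sketch of patch-level routing (pMoE): each expert is a small
# two-layer network applied patch-wise, and receives only the l << n
# patches for which the router assigns it the highest scores.
import torch
import torch.nn as nn


class PatchMoE(nn.Module):
    def __init__(self, patch_dim, num_experts, patches_per_expert, hidden=16):
        super().__init__()
        self.l = patches_per_expert
        # Router: one score per (patch, expert) pair.
        self.router = nn.Linear(patch_dim, num_experts, bias=False)
        # Each expert: a two-layer network applied to individual patches
        # (a simplification of the paper's two-layer CNN experts).
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(patch_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))
            for _ in range(num_experts)
        )

    def forward(self, patches):
        # patches: (batch, n, patch_dim)
        scores = self.router(patches)           # (batch, n, num_experts)
        gates = torch.softmax(scores, dim=1)    # normalize over patches (an assumption)
        out = 0.0
        for j, expert in enumerate(self.experts):
            # Prioritized routing: expert j sees only its top-l patches.
            top_g, idx = gates[..., j].topk(self.l, dim=1)          # (batch, l)
            chosen = torch.gather(
                patches, 1, idx.unsqueeze(-1).expand(-1, -1, patches.size(-1))
            )                                                        # (batch, l, patch_dim)
            # Gate-weighted sum of the expert's patch-wise outputs.
            out = out + (top_g.unsqueeze(-1) * expert(chosen)).sum(dim=1)
        return out.squeeze(-1)                                       # (batch,)


# Toy usage: 16 patches per input, 4 experts, 2 patches routed to each expert.
x = torch.randn(8, 16, 32)   # (batch, n patches, patch dimension)
model = PatchMoE(patch_dim=32, num_experts=4, patches_per_expert=2)
print(model(x).shape)        # torch.Size([8])
```

The key point the sketch captures is the computation reduction: each expert processes $l$ rather than $n$ patches, while the router's top-$l$ selection is what the paper's discriminative-routing analysis reasons about.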