
Approximation Rates and VC-Dimension Bounds for (P)ReLU MLP Mixture of Experts (2402.03460v2)

Published 5 Feb 2024 in stat.ML, cs.LG, cs.NA, cs.NE, math.CO, and math.NA

Abstract: Mixture-of-Experts (MoEs) can scale up beyond traditional deep learning models by employing a routing strategy in which each input is processed by a single "expert" deep learning model. This strategy allows us to scale up the number of parameters defining the MoE while maintaining sparse activation, i.e., MoEs only load a small number of their total parameters into GPU VRAM for the forward pass, depending on the input. In this paper, we provide an approximation- and learning-theoretic analysis of mixtures of expert MLPs with (P)ReLU activation functions. We first prove that for every error level $\varepsilon>0$ and every Lipschitz function $f:[0,1]^n\to \mathbb{R}$, one can construct a MoMLP model (a Mixture-of-Experts comprising (P)ReLU MLPs) which uniformly approximates $f$ to $\varepsilon$ accuracy over $[0,1]^n$, while only requiring networks of $\mathcal{O}(\varepsilon^{-1})$ parameters to be loaded in memory. Additionally, we show that MoMLPs can generalize, since the entire MoMLP model has a (finite) VC dimension of $\tilde{O}(L\max\{nL, JW\})$, if there are $L$ experts and each expert has a depth and width of $J$ and $W$, respectively.
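
To make the architecture the abstract describes concrete, here is a minimal PyTorch sketch of a hard-routed mixture of (P)ReLU MLP experts: each input is sent to exactly one expert, so only that expert's parameters are needed for its forward pass. The class names (`MoMLP`, `MLPExpert`), the simple learned linear router, and the argmax routing rule are illustrative assumptions, not the construction proved about in the paper.

```python
import torch
import torch.nn as nn

class MLPExpert(nn.Module):
    """A PReLU MLP expert with (assumed) width W and depth J hidden layers."""
    def __init__(self, n_in: int, width: int, depth: int):
        super().__init__()
        layers = [nn.Linear(n_in, width), nn.PReLU()]
        for _ in range(depth - 1):
            layers += [nn.Linear(width, width), nn.PReLU()]
        layers += [nn.Linear(width, 1)]
        self.net = nn.Sequential(*layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

class MoMLP(nn.Module):
    """Hard-routed mixture: each input activates exactly one expert (sparse activation)."""
    def __init__(self, n_in: int, n_experts: int, width: int, depth: int):
        super().__init__()
        self.router = nn.Linear(n_in, n_experts)  # simple learned router (assumption)
        self.experts = nn.ModuleList(
            MLPExpert(n_in, width, depth) for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # argmax routing: pick a single expert per input
        idx = self.router(x).argmax(dim=-1)
        out = torch.empty(x.shape[0], 1, device=x.device)
        for e, expert in enumerate(self.experts):
            mask = idx == e
            if mask.any():
                out[mask] = expert(x[mask])
        return out

# Usage: approximate a Lipschitz function on [0,1]^n with n = 4
model = MoMLP(n_in=4, n_experts=8, width=32, depth=3)
x = torch.rand(16, 4)          # inputs in [0,1]^4
y = model(x)                   # shape (16, 1)
```

In this sketch, only the expert selected by the router is evaluated for a given input, which mirrors the sparse-activation property discussed in the abstract; at inference time one could load just that expert's weights into memory.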

