Distributed Learning of Mixtures of Experts (2312.09877v1)
Abstract: Modern machine learning problems often involve datasets that are either distributed by nature or too large to process centrally, so distributing the computations is a standard way to proceed, since centralized algorithms are generally ineffective at that scale. We propose a distributed learning approach for mixture-of-experts (MoE) models with an aggregation strategy that constructs a reduction estimator from local estimators fitted in parallel to distributed subsets of the data. The aggregation is based on an optimal minimization of an expected transportation divergence between the large MoE composed of the local estimators and the unknown target MoE model. We show that the resulting reduction estimator is consistent as soon as the local estimators to be aggregated are consistent, and that its construction can be carried out by a proposed majorization-minimization (MM) algorithm that is computationally efficient. We study the statistical and numerical properties of the proposed reduction estimator in experiments that demonstrate its performance, in particular compared to the global estimator constructed in a centralized way from the full dataset. In some situations the computation time is more than ten times faster, for comparable performance. Our source code is publicly available on GitHub.
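The abstract describes a split-fit-aggregate pipeline: partition the data, fit a local MoE on each subset in parallel, pool the local components into one large MoE, and then reduce that large MoE by MM minimization of an expected transportation divergence. The sketch below is a minimal illustration of the split, fit, and pool steps only, not the authors' implementation: it uses a simplified Gaussian-gated MoE with univariate linear experts fitted by plain EM, and it omits the transportation-divergence/MM reduction. All function and variable names (`fit_local_moe`, `pool_local_moes`, etc.) are hypothetical.

```python
# Minimal sketch of the split-fit-pool part of a distributed MoE pipeline.
# NOT the paper's implementation: experts are univariate linear regressions
# with Gaussian noise and the gate is a Gaussian (generative) gating network,
# fitted by plain EM. The paper's reduction step (MM minimization of an
# expected transportation divergence) is omitted; the pooled model below is
# only the "large MoE of local estimators" such a reduction would take as input.
import numpy as np

def fit_local_moe(x, y, K=3, n_iter=50, seed=0):
    """EM for a Gaussian-gated MoE with K linear experts on 1-D inputs."""
    rng = np.random.default_rng(seed)
    n = x.shape[0]
    r = rng.dirichlet(np.ones(K), size=n)  # random initial responsibilities
    for _ in range(n_iter):
        # ---- M-step ----
        nk = r.sum(axis=0) + 1e-12
        alpha = nk / n                                  # gate mixing weights
        mu = (r * x[:, None]).sum(axis=0) / nk          # gate means
        s2 = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-6
        w, b, sig2 = np.empty(K), np.empty(K), np.empty(K)
        X = np.column_stack([x, np.ones(n)])
        for k in range(K):
            W = r[:, k]
            # Weighted least squares for the k-th linear expert.
            A = X.T @ (W[:, None] * X) + 1e-6 * np.eye(2)
            w[k], b[k] = np.linalg.solve(A, X.T @ (W * y))
            resid = y - (w[k] * x + b[k])
            sig2[k] = (W * resid ** 2).sum() / nk[k] + 1e-6
        # ---- E-step ----
        log_r = (np.log(alpha)
                 - 0.5 * np.log(2 * np.pi * s2)
                 - 0.5 * (x[:, None] - mu) ** 2 / s2
                 - 0.5 * np.log(2 * np.pi * sig2)
                 - 0.5 * (y[:, None] - (x[:, None] * w + b)) ** 2 / sig2)
        log_r -= log_r.max(axis=1, keepdims=True)
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)
    return {"alpha": alpha, "mu": mu, "s2": s2, "w": w, "b": b, "sig2": sig2}

def pool_local_moes(local_fits):
    """Concatenate local components into one large MoE, reweighting the gate."""
    M = len(local_fits)
    return {key: np.concatenate([f[key] / M if key == "alpha" else f[key]
                                 for f in local_fits])
            for key in local_fits[0]}

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    # Synthetic data from two linear regimes.
    x = rng.uniform(-3, 3, size=4000)
    y = np.where(x < 0, 2 * x + 1, -x + 0.5) + rng.normal(0, 0.3, size=x.size)
    # Split into M subsets; in practice each is fitted on its own machine.
    M = 4
    subsets = np.array_split(rng.permutation(x.size), M)
    local_fits = [fit_local_moe(x[idx], y[idx], K=2, seed=m)
                  for m, idx in enumerate(subsets)]
    large_moe = pool_local_moes(local_fits)
    print("pooled gate weights:", np.round(large_moe["alpha"], 3))
    # A reduction step would now compress this M*K-component MoE back to K components.
```

Under these assumptions, the pooled model has M*K components whose gate weights still sum to one; the paper's contribution concerns how to compress that pooled MoE back to a small one consistently, which this sketch deliberately leaves out.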