Federated Learning Framework for Scalable AI in Heterogeneous HPC and Cloud Environments (2511.19479v1)

Published 22 Nov 2025 in cs.DC and cs.LG

Abstract: As the demand grows for scalable and privacy-aware AI systems, Federated Learning (FL) has emerged as a promising solution, allowing decentralized model training without moving raw data. At the same time, the combination of high- performance computing (HPC) and cloud infrastructure offers vast computing power but introduces new complexities, especially when dealing with heteroge- neous hardware, communication limits, and non-uniform data. In this work, we present a federated learning framework built to run efficiently across mixed HPC and cloud environments. Our system addresses key challenges such as system het- erogeneity, communication overhead, and resource scheduling, while maintaining model accuracy and data privacy. Through experiments on a hybrid testbed, we demonstrate strong performance in terms of scalability, fault tolerance, and convergence, even under non-Independent and Identically Distributed (non-IID) data distributions and varied hardware. These results highlight the potential of federated learning as a practical approach to building scalable AI systems in modern, distributed computing settings.