LIFL: A Lightweight, Event-driven Serverless Platform for Federated Learning (2405.10968v1)
Abstract: Federated Learning (FL) typically involves a large-scale, distributed system with individual user devices/servers training models locally and then aggregating their model updates on a trusted central server. Existing systems for FL often use an always-on server for model aggregation, which can be inefficient in terms of resource utilization. They may also be inelastic in their resource management. These problems are exacerbated when aggregating model updates at scale in a highly dynamic environment with varying numbers of heterogeneous user devices/servers. We present LIFL, a lightweight and elastic serverless cloud platform with fine-grained resource management for efficient FL aggregation at scale. LIFL is enhanced by a streamlined, event-driven serverless design that eliminates the individual heavyweight message broker and replaces inefficient container-based sidecars with lightweight eBPF-based proxies. We leverage shared memory processing to achieve high-performance communication for hierarchical aggregation, which is commonly adopted to speed up FL aggregation at scale. We further introduce locality-aware placement in LIFL to maximize the benefits of shared memory processing. LIFL precisely scales and carefully reuses the resources for hierarchical aggregation to achieve the highest degree of parallelism while minimizing the aggregation time and resource consumption. Our experimental results show that LIFL significantly improves resource efficiency and aggregation speed when supporting FL at scale, compared to existing serverful and serverless FL systems.
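To make the idea of hierarchical aggregation concrete, the following is a minimal sketch of the arithmetic only, assuming sample-weighted averaging in the style of FedAvg. It is not LIFL's implementation: the shared-memory transport, eBPF-based proxies, and placement logic are not modeled, and the names `aggregate`, `hierarchical_aggregate`, and `fan_in` are hypothetical, chosen for illustration.

```python
from typing import List, Tuple

# Assumption: an update is (model weights, number of local training samples).
Update = Tuple[List[float], int]


def aggregate(updates: List[Update]) -> Update:
    """Sample-weighted average of a group of updates (FedAvg-style)."""
    if not updates:
        raise ValueError("need at least one update")
    total = sum(n for _, n in updates)
    dim = len(updates[0][0])
    avg = [sum(w[i] * n for w, n in updates) / total for i in range(dim)]
    # Carry the combined sample count upward so the next level weights correctly.
    return avg, total


def hierarchical_aggregate(updates: List[Update], fan_in: int) -> Update:
    """Aggregate in a tree: each aggregator handles at most `fan_in` inputs."""
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    if not updates:
        raise ValueError("need at least one update")
    level = updates
    while len(level) > 1:
        # Each slice stands in for one aggregator; slices are independent,
        # so a real system could process them in parallel.
        level = [aggregate(level[i:i + fan_in])
                 for i in range(0, len(level), fan_in)]
    return level[0]


if __name__ == "__main__":
    clients = [([1.0, 2.0], 10), ([3.0, 4.0], 30),
               ([5.0, 6.0], 20), ([7.0, 8.0], 40)]
    print(aggregate(clients))
    print(hierarchical_aggregate(clients, fan_in=2))
```

Because each intermediate result carries its combined sample count, the tree produces the same weighted average as a single flat aggregation (up to floating-point rounding), while the work at each level is split across independent aggregators.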