Federated Fine-Tuning of LLMs on the Very Edge: The Good, the Bad, the Ugly (2310.03150v2)
Abstract: Large language models (LLMs) and foundation models are popular because they offer new opportunities for individuals and businesses to improve natural language processing, interact with data, and retrieve information faster. However, training or fine-tuning LLMs requires a vast amount of data, which can be challenging to access due to legal or technical restrictions, and may require private computing resources. Federated learning (FL) is a solution designed to overcome these challenges and expand data access for deep learning applications. This paper takes a hardware-centric approach to explore how LLMs can be brought to modern edge computing systems. Our study fine-tunes the FLAN-T5 model family, ranging from 80M to 3B parameters, with FL for a text summarization task. We provide a micro-level hardware benchmark, compare the model FLOP utilization (MFU) to that of a state-of-the-art data-center GPU, and study network utilization under realistic conditions. Our contribution is twofold: first, we evaluate the current capabilities of edge computing systems and their potential for LLM FL workloads; second, by comparing these systems with a data-center GPU, we demonstrate the potential for improvement and the next steps toward greater computational efficiency at the edge.
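The abstract's central metric is model FLOP utilization (MFU), i.e., the fraction of a device's peak FLOP/s that a training run actually achieves. The sketch below illustrates how MFU is commonly estimated; it is not the paper's benchmark code, and the 6N FLOPs-per-token approximation, the throughput figure, and the peak-FLOP/s number are illustrative assumptions.

```python
# Minimal sketch: estimating Model FLOP Utilization (MFU) for a fine-tuning run.
# Assumption: a dense transformer needs roughly 6 * N FLOPs per token for a
# combined forward and backward pass (N = parameter count).

def mfu(params: float, tokens_per_second: float, peak_flops_per_second: float) -> float:
    """MFU = achieved training FLOP/s divided by the hardware's peak FLOP/s."""
    achieved_flops_per_second = 6.0 * params * tokens_per_second
    return achieved_flops_per_second / peak_flops_per_second

# Illustrative numbers (assumed, not from the paper): FLAN-T5-Base (~250M parameters)
# on an edge accelerator with a 10 TFLOP/s FP16 peak, processing 2,000 tokens/s.
print(f"MFU: {mfu(250e6, 2_000, 10e12):.1%}")  # -> 30.0%
```

Comparing this ratio across an edge device and a data-center GPU normalizes away raw throughput differences and highlights how efficiently each platform uses its available compute.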
- Herbert Woisetschläger
- Alexander Isenko
- Shiqiang Wang
- Ruben Mayer
- Hans-Arno Jacobsen