Federated Full-Parameter Tuning of Billion-Sized Language Models with Communication Cost under 18 Kilobytes (2312.06353v5)

Published 11 Dec 2023 in cs.LG and cs.DC

Abstract: Pre-trained LLMs need fine-tuning to improve their responsiveness to natural language instructions. Federated learning offers a way to fine-tune LLMs using the abundant data on end devices without compromising data privacy. Most existing federated fine-tuning methods for LLMs rely on parameter-efficient fine-tuning techniques, which may not reach the performance height possible with full-parameter tuning. However, federated full-parameter tuning of LLMs is a non-trivial problem due to the immense communication cost. This work introduces FedKSeed that employs zeroth-order optimization with a finite set of random seeds. It significantly reduces transmission requirements between the server and clients to just a few random seeds and scalar gradients, amounting to only a few thousand bytes, making federated full-parameter tuning of billion-sized LLMs possible on devices. Building on it, we develop a strategy enabling probability-differentiated seed sampling, prioritizing perturbations with greater impact on model accuracy. Experiments across six scenarios with various LLMs, datasets and data partitions demonstrate that our approach outperforms existing federated LLM fine-tuning methods in both communication efficiency and new task generalization.

Federated Full-Parameter Tuning of Billion-Sized LLMs with Communication Cost under 18 Kilobytes

This paper addresses a significant challenge in the federated learning (FL) of LLMs: enabling full-parameter tuning of billion-sized LLMs on decentralized devices without incurring prohibitive communication costs. The authors introduce a novel method, FedKSeed, which strategically minimizes the communication overhead during the FL process to less than 18 kilobytes per round.
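
As a rough back-of-envelope illustration (the assumed numbers are mine, not the paper's exact accounting): if the server tracks one accumulated scalar gradient per candidate seed, then with K = 4096 seeds and 32-bit scalars the dominant per-round payload is

$K \times 4\,\text{B} = 4096 \times 4\,\text{B} \approx 16\,\text{KB},$

which leaves room for seed indices and bookkeeping within the stated 18-kilobyte budget.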

Problem Context

LLMs require fine-tuning to perform well on specific tasks. Conventional federated full-parameter tuning of LLMs, however, is both computationally expensive and communication-heavy: exchanging the full set of model parameters typically amounts to gigabytes per round for billion-parameter models, which is infeasible for devices with limited bandwidth and storage. Most existing work therefore relies on parameter-efficient fine-tuning (PEFT) methods, which reduce this overhead but often cannot match the performance of full-parameter tuning.

Research Contributions

The core of the proposal, FedKSeed, uses zeroth-order optimization (ZOO) with a predefined, finite set of random seeds, eliminating the need to transmit full model parameters. Each local update step can be communicated as just a random seed and a scalar gradient, which can be encoded very compactly. Unlike traditional methods, whose communication scales with model size, this payload is essentially constant.
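
To make this concrete, here is a minimal sketch (not the authors' implementation; the function and variable names are invented for this example) of a client-side zeroth-order step: the perturbation is regenerated from a shared seed, so only the seed index and one scalar ever leave the device.

```python
import torch

def client_zoo_step(params, loss_fn, seed_pool, seed_idx, eps=1e-3, lr=3e-4):
    """One zeroth-order (forward-only) step; returns the tiny payload.

    Sketch assumptions: `params` is a flat parameter tensor, `loss_fn` evaluates the
    model on a local batch, and `seed_pool` is the fixed list of K seeds shared by all
    parties. A memory-frugal implementation would regenerate the perturbation in chunks
    rather than materializing it, but the logic is the same.
    """
    gen = torch.Generator().manual_seed(seed_pool[seed_idx])
    z = torch.randn(params.numel(), generator=gen)       # perturbation reproducible from the seed

    loss_plus = loss_fn(params + eps * z)                 # forward pass at w + eps*z
    loss_minus = loss_fn(params - eps * z)                # forward pass at w - eps*z
    scalar_grad = (loss_plus - loss_minus) / (2 * eps)    # projected gradient along z (a scalar)

    params -= lr * scalar_grad * z                        # local full-parameter update
    return seed_idx, float(scalar_grad)                   # a few bytes, independent of model size
```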

A notable design choice is that every perturbation applied during model updates is drawn from a fixed pool of K candidate random seeds. Restricting the perturbation space in this way is what lets full-parameter tuning in FL remain communication-efficient without giving up the benefits of directly updating all parameters: any participant can reproduce every perturbation locally from its seed.
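
Because every perturbation can be regenerated from one of the shared seeds, a device can bring its copy of the model up to date by replaying the logged (seed index, scalar gradient) pairs instead of downloading weights. A minimal sketch, under the same illustrative assumptions as above:

```python
import torch

def replay_updates(params, seed_pool, update_log, lr=3e-4):
    """Synchronize to the latest model from scalar-gradient history.

    `update_log` is the accumulated list of (seed_idx, scalar_grad) pairs; each entry
    is only a few bytes, so synchronization cost does not scale with model size.
    """
    for seed_idx, scalar_grad in update_log:
        gen = torch.Generator().manual_seed(seed_pool[seed_idx])
        z = torch.randn(params.numel(), generator=gen)   # identical perturbation to the sender's
        params -= lr * scalar_grad * z                    # re-apply the update locally
    return params
```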

The paper further introduces FedKSeed-Pro, a variant that assigns non-uniform sampling probabilities to the K seeds based on the estimated importance of their perturbations, with the aim of improving both computational efficiency and model accuracy.
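
The sampling itself can be as simple as a softmax over per-seed importance scores; the sketch below treats the caller-supplied scores (for example, running means of absolute scalar gradients per seed) as given, and does not reproduce the paper's specific weighting rule.

```python
import numpy as np

def sample_seed(importance_scores, temperature=1.0, rng=None):
    """Draw a seed index with probability increasing in its estimated importance.

    `importance_scores` is a length-K array; higher-impact perturbations are
    sampled more often, while every seed keeps a nonzero probability.
    """
    rng = rng if rng is not None else np.random.default_rng()
    logits = np.asarray(importance_scores, dtype=np.float64) / temperature
    probs = np.exp(logits - logits.max())                 # numerically stable softmax
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))
```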

Experimental Setup and Results

The authors conduct experiments across six scenarios combining models such as DataJuicer-1.3B and LLaMA-3B with different datasets and data partitions. FedKSeed not only outperforms other federated fine-tuning approaches in communication efficiency but also achieves better zero-shot generalization to unseen tasks; for instance, FedKSeed-Pro yields an average 7.26% improvement in ROUGE-L over the best competing method.

Table 1 in the paper provides a qualitative comparison indicating that FedKSeed outperforms other methods in terms of both memory and communication costs, establishing its practicality for federated LLM tuning on devices.

Theoretical Insights

The paper builds on existing convergence theories of ZOO, adapting them to the federated context. By effectively managing noise in gradient estimation through a fixed set of seeds, it maintains convergence properties while drastically reducing communication demands.
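
For reference, the two-point zeroth-order estimator that this line of analysis typically builds on (notation mine, not quoted from the paper) is

$\hat{\nabla} f(w) \;=\; \frac{f(w + \epsilon z) - f(w - \epsilon z)}{2\epsilon}\, z, \qquad z \sim \mathcal{N}(0, I),$

where, once z is reproducible from a shared seed, only the scalar coefficient in front of z carries new information.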

The authors present two principles guiding the choice of K: it should not be so small that the model is insufficiently trained, nor so large that computation grows and the updates accumulated per seed become diluted. This balances computational efficiency against model fidelity.

Implications and Future Work

FedKSeed and its enhanced version represent significant advancements in making federated full-parameter tuning of LLMs feasible on edge devices. This work opens the door to more equitable and extensive use of large models in decentralized settings, potentially democratizing access to advanced LLM capabilities.

Looking forward, the approach may encourage future research into decentralized FL architectures that can leverage the reduced communication requirements, as well as further explorations of ZOO-based methods in other model learning paradigms. Additionally, optimizing seed selection strategies and exploring other ZOO variants could extend these results to an even broader range of applications and model types.

Authors (6)
  1. Zhen Qin
  2. Daoyuan Chen
  3. Bingchen Qian
  4. Bolin Ding
  5. Yaliang Li
  6. Shuiguang Deng