
Mutual Enhancement of Large and Small Language Models with Cross-Silo Knowledge Transfer (2312.05842v1)

Published 10 Dec 2023 in cs.AI and cs.CL

Abstract: While LLMs are empowered with broad knowledge, their task-specific performance is often suboptimal. It necessitates fine-tuning LLMs with task-specific data, but such data may be inaccessible due to privacy concerns. In this paper, we propose a novel approach to enhance LLMs with smaller language models (SLMs) that are trained on clients using their private task-specific data. To enable mutual enhancement between LLMs and SLMs, we propose CrossLM, where the SLMs promote the LLM to generate task-specific high-quality data, and both the LLM and SLMs are enhanced with the generated data. We evaluate CrossLM using publicly accessible LLMs across a range of benchmark tasks. The results demonstrate that CrossLM significantly enhances the task-specific performance of SLMs on clients and the LLM on the cloud server simultaneously while preserving the LLM's generalization capability.

Overview of CrossLM Framework

The CrossLM framework enables mutual enhancement of LLMs and small language models (SLMs) without any direct sharing of training data. This is particularly important in scenarios where privacy concerns and data governance regulations restrict the use of domain-specific data for model training.

Addressing Privacy and Resource Constraints

The novelty of CrossLM lies in extending federated learning to LLMs without imposing the heavy resource burdens that typically accompany such models. Previous methods have relied either on updating a subset of LLM parameters on clients or on splitting model training across client and server, approaches that bring their own challenges, including significant resource demands and potential privacy issues.

CrossLM's Collaborative Training

CrossLM distinguishes itself with a client-server collaborative training framework in which each client's SLM is tailored to that client's resource capabilities and privacy needs. The technical crux of CrossLM is data-free knowledge transfer: the LLM's generative capability is used to synthesize task-specific data, and the SLMs provide feedback on these synthetic samples to steer the LLM toward higher-quality generations. This mutualistic loop, sketched below, lets the LLM and SLMs guide each other toward improved task-specific performance.
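To make the loop concrete, here is a minimal sketch of one round of mutual enhancement, assuming Hugging Face-style model and tokenizer interfaces. The function name (`crosslm_round`), the feedback signal (the SLMs' average classification confidence), and the reward-weighted language-modeling loss are illustrative assumptions, not the authors' exact formulation.

```python
import torch
import torch.nn.functional as F

def crosslm_round(llm, slms, tokenizer, label_prompts, optimizer, n_samples=8):
    """One illustrative round of data-free mutual enhancement (a sketch,
    not the paper's implementation).

    1. The LLM synthesizes task-specific samples from label-conditioned prompts.
    2. Each client SLM scores the synthetic samples (here: its classification
       confidence), and the scores are averaged into a feedback signal.
    3. The LLM is updated with a feedback-weighted language-modeling loss,
       and high-scoring samples are returned for client-side SLM training.
    """
    # Step 1: generate candidate samples (assumes tokenizer.pad_token is set).
    prompts = tokenizer(label_prompts * n_samples, return_tensors="pt", padding=True)
    generated = llm.generate(**prompts, max_new_tokens=64, do_sample=True)
    texts = tokenizer.batch_decode(generated, skip_special_tokens=True)

    # Step 2: SLM feedback = average confidence of each SLM's predicted label.
    # (A simplification of the paper's quality/label-fidelity feedback.)
    with torch.no_grad():
        scores = []
        for slm in slms:
            enc = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
            probs = F.softmax(slm(**enc).logits, dim=-1)
            scores.append(probs.max(dim=-1).values)
        feedback = torch.stack(scores).mean(dim=0)

    # Step 3: per-sample LM loss on the synthetic texts, weighted by feedback.
    inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    out = llm(**inputs)
    shift_logits = out.logits[:, :-1, :]
    shift_labels = inputs["input_ids"][:, 1:]
    token_loss = F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        reduction="none",
    ).view(shift_labels.size())
    mask = inputs["attention_mask"][:, 1:].float()
    sample_loss = (token_loss * mask).sum(dim=1) / mask.sum(dim=1)
    loss = (feedback * sample_loss).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Keep the better half of the synthetic data for the clients' SLMs.
    keep = feedback > feedback.median()
    return [t for t, k in zip(texts, keep) if k]
```

In this sketch the feedback weighting plays the role described in the paper: SLMs trained on private data judge the LLM's synthetic samples, the LLM is nudged toward samples the SLMs endorse, and the filtered samples flow back to the clients, so no private data ever leaves them.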

Experimental Validation

Empirical evaluations show that CrossLM improves the task-specific performance of client SLMs by an average of 5.8% to 7.8%, a considerable margin over standalone training. Compared with data-free knowledge distillation (KD), CrossLM achieves a further accuracy improvement of 2% to 2.7%. The LLM's natural language understanding (NLU) and generation (NLG) abilities are also markedly strengthened after CrossLM training, with accuracy gains of 18.3% for GPT2-Large and 13.6% for Llama-7B.

Preserving Generalization Capabilities

A critical aspect of CrossLM is the retention of the LLM's generalization capabilities after task-specific enhancement. The empirical findings suggest only marginal performance regressions on unrelated benchmark tasks, signaling that the LLM's broad applicability remains intact.
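One way to make this retention check concrete is to compare the LLM's accuracy on benchmarks unrelated to the clients' tasks before and after CrossLM training. The helper below is a hypothetical sketch; `eval_fn` and the held-out task list are placeholders, not part of the paper's released code.

```python
def generalization_regression(llm_before, llm_after, eval_fn, heldout_tasks):
    """Per-task accuracy drop on benchmarks unrelated to the clients' tasks.

    `eval_fn(model, task)` is assumed to return accuracy on a single held-out
    benchmark. Small positive values correspond to the marginal regressions
    reported in the paper; large values would indicate that task-specific
    training has eroded the LLM's general capabilities.
    """
    return {
        task: eval_fn(llm_before, task) - eval_fn(llm_after, task)
        for task in heldout_tasks
    }
```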

Concluding Thoughts

CrossLM emerges as an elegant solution that strikes a balance between enhancing task-specific performance and preserving generalization without compromising client data privacy. Its approach not only addresses resource limitations but also adapts to heterogeneous model structures, offering a versatile tool in the practitioner's kit for federated LLM training. The framework's synchronous and one-shot learning characteristics add to its practical appeal, marking a step forward in the evolution of collaborative AI model training while safeguarding data privacy.

Authors (5)
  1. Yongheng Deng
  2. Ziqing Qiao
  3. Ju Ren
  4. Yang Liu
  5. Yaoxue Zhang