
FedEval-LLM: Federated Evaluation of Large Language Models on Downstream Tasks with Collective Wisdom (2404.12273v1)

Published 18 Apr 2024 in cs.AI, cs.CL, and cs.LG

Abstract: Federated Learning (FL) has emerged as a promising solution for collaborative training of LLMs. However, the integration of LLMs into FL introduces new challenges, particularly concerning the evaluation of LLMs. Traditional evaluation methods that rely on labeled test sets and similarity-based metrics cover only a subset of the acceptable answers, thereby failing to accurately reflect the performance of LLMs on generative tasks. Meanwhile, although automatic evaluation methods that leverage advanced LLMs show potential, they face critical risks of data leakage due to the need to transmit data to external servers, and suboptimal performance on downstream tasks due to the lack of domain knowledge. To address these issues, we propose a Federated Evaluation framework of LLMs, named FedEval-LLM, that provides reliable performance measurements of LLMs on downstream tasks without relying on labeled test sets and external tools, thus ensuring strong privacy-preserving capability. FedEval-LLM leverages a consortium of personalized LLMs from participants as referees to provide domain knowledge and collective evaluation capability, thus aligning with the respective downstream tasks and mitigating the uncertainties and biases associated with a single referee. Experimental results demonstrate a significant improvement in the evaluation capability of personalized evaluation models on downstream tasks. When applied to FL, these evaluation models exhibit strong agreement with human preference and the Rouge-L score on meticulously curated test sets. FedEval-LLM effectively overcomes the limitations of traditional metrics and the reliance on external services, making it a promising framework for the evaluation of LLMs within collaborative training scenarios.
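The core aggregation idea in the abstract, combining the judgments of several participant-personalized referee models so that no single referee's bias dominates, can be sketched as follows. The referee callables and the simple averaging rule below are hypothetical stand-ins for illustration; the paper's actual referees are fine-tuned LLMs and its aggregation scheme may differ.

```python
# Sketch of "collective wisdom" evaluation: each FL participant contributes a
# personalized referee model, and their per-answer judgments are aggregated.
# Referees here are mock callables returning a score in [0, 1]; averaging is
# one possible aggregation rule, assumed for illustration only.
from statistics import mean


def collective_score(referees, question, answer):
    """Aggregate all referees' scores for one (question, answer) pair."""
    return mean(referee(question, answer) for referee in referees)


# Toy usage: three mock referees with differing judgments of the same answer.
referees = [
    lambda q, a: 0.9,  # e.g. a referee tuned on participant 1's domain data
    lambda q, a: 0.7,  # participant 2's referee is more critical
    lambda q, a: 0.8,  # participant 3's referee
]
score = collective_score(referees, "Summarize the post.", "A short summary.")
print(score)  # → 0.8
```

Averaging is the simplest choice; pairwise-comparison voting or discarding outlier judgments would fit the same interface and further reduce the influence of any one biased referee.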

Authors (4)
  1. Yuanqin He (9 papers)
  2. Yan Kang (49 papers)
  3. Lixin Fan (77 papers)
  4. Qiang Yang (202 papers)
Citations (1)