
Federated Instruction Tuning of LLMs with Domain Coverage Augmentation

Published 30 Sep 2024 in cs.LG, cs.CL, and cs.DC | (2409.20135v5)

Abstract: Federated Domain-specific Instruction Tuning (FedDIT) utilizes limited cross-client private data together with various strategies of instruction augmentation, ultimately boosting model performance within specific domains. To date, the factors affecting FedDIT remain unclear, and existing instruction augmentation methods primarily focus on the centralized setting without considering distributed environments. Our experiments reveal that the cross-client domain coverage, rather than data heterogeneity, drives model performance in FedDIT. In response, we propose FedDCA, which optimizes domain coverage through greedy client center selection and retrieval-based augmentation. At its core, the greedy selection procedure iteratively picks client centers that maximize the diversity and coverage of the instruction space while avoiding redundancy with previously selected centers. This ensures broad yet efficient coverage of the domain distribution across clients. For client-side computational efficiency and system scalability, FedDCA$*$, the variant of FedDCA, utilizes heterogeneous encoders with server-side feature alignment. Extensive experiments across code, medical, financial, and mathematical domains substantiate the effectiveness of both methods, as well as plug-and-play capability. We further analyze privacy preservation against memory extraction attacks, showing that while privacy leakage risk is independent of augmented public data ratio, it decreases or converges as training progresses.

Summary

  • The paper demonstrates that enhancing cross-client domain coverage in federated settings boosts LLM instruction-tuning performance, via the proposed FedDCA and FedDCA* methods.
  • It combines greedy client center selection with retrieval-based augmentation to optimize domain-specific tuning across the code, medical, financial, and mathematical domains.
  • The paper analyzes privacy preservation against memory extraction attacks, showing that additional fine-tuning rounds reduce or stabilize privacy leakage risk.

The paper "Federated Instruction Tuning of LLMs with Domain Coverage Augmentation" studies Federated Domain-specific Instruction Tuning (FedDIT), a setting in which limited cross-client private data is combined with server-side public data for instruction augmentation, with the goal of improving model performance within specific domains. The study investigates which factors drive FedDIT performance in distributed environments, where prior instruction augmentation methods have largely assumed a centralized setting.

Key Contributions:

  1. Domain Coverage Augmentation:
    • The authors identify cross-client domain coverage as a crucial factor driving model performance in FedDIT, as opposed to data heterogeneity.
    • The proposed method, FedDCA, optimizes domain coverage, which is particularly relevant for federated learning environments where data is decentralized and clients may hold diverse datasets.
  2. FedDCA and FedDCA*:
    • Two methods are proposed: FedDCA and its variant FedDCA*.
    • FedDCA: employs greedy client center selection together with retrieval-based augmentation to enhance domain coverage.
    • FedDCA*: designed for client-side computational efficiency, it uses heterogeneous encoders on the client side with server-side feature alignment to ensure system scalability.
  3. Experimental Validation:
    • The authors conduct extensive experiments across four distinct domains: code, medical, financial, and mathematical.
    • Results demonstrate the effectiveness of both FedDCA methods, substantiating their approach to augmenting domain-specific instruction tuning.
  4. Privacy Preservation:
    • The study examines privacy preservation against memory extraction attacks using varying amounts of public data.
    • Findings indicate no significant correlation between the volume of public data and privacy-preserving capability.
    • Interestingly, as fine-tuning rounds increase, the risk of privacy leakage tends to decrease or converge, highlighting the robustness of the proposed method concerning privacy concerns.
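The greedy client center selection described above iteratively picks centers that add coverage while avoiding redundancy with previously chosen ones. This behavior resembles a farthest-point (k-center) greedy heuristic; the sketch below illustrates that idea under assumptions not taken from the paper (the function name, the Euclidean metric, and starting from the first point are all illustrative):

```python
import math

def greedy_center_selection(embeddings, k):
    """Illustrative k-center-style greedy selection: repeatedly pick the
    candidate farthest from every already-chosen center, so each new
    center covers a region the previous ones do not."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    # Seed with the first point; a real system might seed differently.
    centers = [0]
    while len(centers) < k:
        best_idx, best_d = None, -1.0
        for i in range(len(embeddings)):
            if i in centers:
                continue
            # Distance from candidate i to its nearest chosen center.
            d = min(dist(embeddings[i], embeddings[c]) for c in centers)
            if d > best_d:
                best_idx, best_d = i, d
        centers.append(best_idx)
    return centers
```

On a toy set of 2-D instruction embeddings, the heuristic spreads the selected centers across distinct clusters rather than picking near-duplicates, which is the coverage-without-redundancy property the paper attributes to its selection procedure.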

This research contributes to the understanding of federated learning environments, particularly focusing on how domain coverage rather than data diversity can be optimized to enhance model performance in LLM instruction tuning. The investigation into privacy aspects further enriches the discourse, providing insights into securing federated systems against potential attacks.
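For the retrieval-based augmentation component, the natural reading is that client-side instruction embeddings are used to retrieve similar public instructions from the server-side pool. A minimal sketch of that retrieval step, assuming cosine similarity and a plain list-based pool (the function names and data layout are hypothetical, not the paper's API):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def retrieve_augmentations(center, public_pool, top_k=2):
    """Illustrative retrieval-based augmentation: rank server-side public
    instructions by similarity to a client center embedding and return
    the names of the top-k matches."""
    ranked = sorted(public_pool,
                    key=lambda item: cosine(center, item[1]),
                    reverse=True)
    return [name for name, _ in ranked[:top_k]]
```

In practice such retrieval would run over an indexed embedding store rather than a sorted list, but the ranking principle is the same: public instructions closest to each selected client center are pulled in to widen domain coverage.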
