Assortment of Attention Heads: Accelerating Federated PEFT with Head Pruning and Strategic Client Selection
This paper addresses the challenges of Parameter-Efficient Fine-Tuning (PEFT) of large language models (LLMs) in the context of Federated Learning (FL). It focuses on two obstacles to deploying PEFT in privacy-preserving distributed settings: resource-constrained client devices and heterogeneous data distributions across clients. The authors propose a method that improves PEFT performance in FL through head pruning, weighted head-specific aggregation, and strategic client selection.
Multi-Head Attention (MHA) is a core component of transformer architectures, letting the model attend to different aspects of the input in parallel. Because many attention heads are redundant, a substantial fraction can be pruned without hurting accuracy. The authors exploit this redundancy by pruning heads according to importance scores derived from attention confidence. This substantially reduces per-client training complexity while keeping the accuracy drop under 2% on the MultiNLI benchmark.
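As a rough illustration of how confidence-based head importance and pruning could be computed, consider the sketch below. The confidence definition (the mean of each query's maximum attention probability) and the top-k selection rule are illustrative assumptions, not the authors' exact formulation.

```python
import torch

def head_importance_from_attention(attn_probs: torch.Tensor) -> torch.Tensor:
    """Score each head by its mean attention 'confidence'.

    attn_probs: (batch, num_heads, seq_len, seq_len) softmax attention weights.
    Returns a (num_heads,) tensor; higher means the head attends more decisively.
    This confidence definition is an assumption for illustration only.
    """
    # Max over the key dimension gives each query's most-attended probability.
    confidence = attn_probs.max(dim=-1).values          # (batch, heads, seq_len)
    return confidence.mean(dim=(0, 2))                  # average over batch and queries

def select_heads_to_keep(importance: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Return indices of heads to keep when pruning a `sparsity` fraction of heads."""
    num_heads = importance.numel()
    num_keep = max(1, int(round(num_heads * (1.0 - sparsity))))
    return torch.topk(importance, num_keep).indices

# Example: 12 heads at 90% sparsity keeps only the most confident head of this layer.
scores = head_importance_from_attention(torch.rand(8, 12, 128, 128).softmax(dim=-1))
kept = select_heads_to_keep(scores, sparsity=0.9)
```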
The head pruning strategy reaches sparsity levels of up to 90%, which translates into up to 1.8x lower communication cost and a 3.9x reduction in training operation complexity compared with training fully dense models under standard FedAvg.
Another key contribution is the approach to model aggregation in FL. Using a weighted, head-specific aggregation mechanism, the server emphasizes updates with high importance scores computed from the clients' diverse data distributions. In addition, a client selection strategy driven by the loss difference between the global and local models prioritizes clients whose updates are expected to have the greatest impact, further streamlining training.
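A minimal sketch of how the server side could combine these two ideas is shown below; the normalization of importance weights and the loss-gap selection rule are simplifying assumptions, not the paper's exact equations.

```python
import numpy as np

def aggregate_head_updates(client_updates, client_head_scores):
    """Weighted, head-specific aggregation of client PEFT updates.

    client_updates: list of dicts mapping head_id -> update array.
    client_head_scores: list of dicts mapping head_id -> importance score.
    Heads a client pruned simply do not appear in its dict.
    The normalization scheme here is an illustrative assumption.
    """
    aggregated = {}
    for head_id in {h for upd in client_updates for h in upd}:
        updates, weights = [], []
        for upd, scores in zip(client_updates, client_head_scores):
            if head_id in upd:
                updates.append(upd[head_id])
                weights.append(scores[head_id])
        weights = np.asarray(weights, dtype=np.float64)
        weights /= weights.sum()
        aggregated[head_id] = sum(w * u for w, u in zip(weights, updates))
    return aggregated

def select_clients(global_losses, local_losses, num_selected):
    """Pick clients with the largest gap between global-model and local-model loss,
    i.e. those whose local data the global model currently fits worst."""
    gaps = np.asarray(global_losses) - np.asarray(local_losses)
    return np.argsort(gaps)[::-1][:num_selected].tolist()
```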
The paper demonstrates the robustness of the method across multiple datasets, including 20 Newsgroups, XL-Sum, and E2E NLG. The authors use Low-Rank Adaptation (LoRA) as the PEFT strategy, but the approach is agnostic to the particular PEFT method as long as its trainable parameters can be associated with specific MHA heads.
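To illustrate what associating trainable parameters with specific MHA heads can mean in practice, the sketch below groups LoRA factors per attention head, so pruning a head removes its adapter parameters from both local training and communication. The class name and parameterization are hypothetical and not taken from the paper.

```python
import torch
import torch.nn as nn

class PerHeadLoRA(nn.Module):
    """LoRA adapters for a multi-head projection, grouped by attention head.

    One (A, B) pair per retained head lets a client drop the adapters of pruned
    heads entirely and transmit only the rest. This grouping is a simplified
    sketch, not the paper's exact parameterization.
    """

    def __init__(self, d_model: int, num_heads: int, rank: int, kept_heads):
        super().__init__()
        self.head_dim = d_model // num_heads
        self.kept_heads = list(kept_heads)
        # Independent low-rank factors for each retained head.
        self.A = nn.ParameterDict({
            str(h): nn.Parameter(torch.randn(rank, d_model) * 0.01) for h in self.kept_heads
        })
        self.B = nn.ParameterDict({
            str(h): nn.Parameter(torch.zeros(self.head_dim, rank)) for h in self.kept_heads
        })

    def forward(self, x: torch.Tensor, base_out: torch.Tensor) -> torch.Tensor:
        # base_out: output of the frozen base projection, shape (..., d_model).
        out = base_out.clone()
        for h in self.kept_heads:
            lo, hi = h * self.head_dim, (h + 1) * self.head_dim
            # Add the head-specific low-rank correction to that head's slice.
            out[..., lo:hi] += x @ self.A[str(h)].T @ self.B[str(h)].T
        return out
```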
The work offers insight into tuning LLMs efficiently in federated settings and a foundation for further research on resource-efficient model training. Possible future directions include extending the pruning methodology to other transformer components and improving efficiency in FL settings beyond LLMs, for example with vision transformers.
In summary, the paper presents a practical strategy for scaling PEFT within FL frameworks despite their inherent resource and communication constraints, and makes a meaningful contribution to work on optimizing distributed learning systems.